dedoc.metadata_extractors
- class dedoc.metadata_extractors.MetadataExtractorComposition(extractors: List[AbstractMetadataExtractor])[source]
This class extracts metadata from any document using the available list of metadata extractors. The list of metadata extractors is set via the class constructor. The first suitable extractor is used (the one whose method can_extract() returns True), so the order of extractors is important.
- __init__(extractors: List[AbstractMetadataExtractor]) None[source]
- Parameters:
extractors – the list of extractors with methods can_extract() and extract_metadata() to extract metadata from file
- extract_metadata(directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) dict[source]
Extract metadata using the first suitable extractor, if one was found. See the documentation of the extract_metadata() method of the AbstractMetadataExtractor class for information about the method’s parameters.
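The first-match behaviour described above can be illustrated with a minimal standalone sketch. It does not import dedoc: FirstMatchComposition, PdfOnlyExtractor and CatchAllExtractor are hypothetical stand-ins, and the returned keys are made up for the example. The documentation above does not say what happens when no extractor matches, so raising ValueError here is an assumption.

```python
from typing import List, Optional


class PdfOnlyExtractor:
    """Hypothetical extractor that accepts only files ending in .pdf."""

    def can_extract(self, directory: str, filename: str, converted_filename: str,
                    original_filename: str, parameters: Optional[dict] = None,
                    other_fields: Optional[dict] = None) -> bool:
        return converted_filename.lower().endswith(".pdf")

    def extract_metadata(self, directory: str, filename: str, converted_filename: str,
                         original_filename: str, parameters: Optional[dict] = None,
                         other_fields: Optional[dict] = None) -> dict:
        return {"file_name": original_filename, "extracted_by": "pdf"}


class CatchAllExtractor:
    """Hypothetical extractor that accepts every file."""

    def can_extract(self, directory: str, filename: str, converted_filename: str,
                    original_filename: str, parameters: Optional[dict] = None,
                    other_fields: Optional[dict] = None) -> bool:
        return True

    def extract_metadata(self, directory: str, filename: str, converted_filename: str,
                         original_filename: str, parameters: Optional[dict] = None,
                         other_fields: Optional[dict] = None) -> dict:
        return {"file_name": original_filename, "extracted_by": "catch_all"}


class FirstMatchComposition:
    """Stand-in for the composition: delegates to the first suitable extractor."""

    def __init__(self, extractors: List) -> None:
        self.extractors = extractors

    def extract_metadata(self, directory: str, filename: str, converted_filename: str,
                         original_filename: str, parameters: Optional[dict] = None,
                         other_fields: Optional[dict] = None) -> dict:
        args = (directory, filename, converted_filename, original_filename,
                parameters, other_fields)
        for extractor in self.extractors:
            if extractor.can_extract(*args):
                return extractor.extract_metadata(*args)
        # Assumption: the behaviour without a match is not documented above.
        raise ValueError(f"no suitable extractor for {original_filename}")


# The same file, handled by two compositions that differ only in order.
file_args = ("/tmp", "23141.pdf", "23141.pdf", "report.pdf")
specific_first = FirstMatchComposition([PdfOnlyExtractor(), CatchAllExtractor()])
generic_first = FirstMatchComposition([CatchAllExtractor(), PdfOnlyExtractor()])

result_specific = specific_first.extract_metadata(*file_args)
result_generic = generic_first.extract_metadata(*file_args)
```

With the catch-all extractor listed first, the PDF-specific extractor is never reached, which is why more specific extractors would normally go before more general ones.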
- class dedoc.metadata_extractors.AbstractMetadataExtractor[source]
This class is responsible for extracting metadata from the documents of different formats.
- abstract can_extract(directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) bool[source]
Check if this extractor can handle the given file. Return True if the extractor can handle it and False otherwise. See the extract_metadata() documentation for information about the parameters.
- abstract extract_metadata(directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) dict[source]
Extract metadata from the file if possible, i.e. if the method can_extract() returned True.
- Parameters:
directory – path to the directory where the original and converted files are located
filename – name of the file after renaming (for example 23141.doc). The file gets a new name during processing by the dedoc manager (if used)
converted_filename – name of the file after renaming and conversion (for example 23141.docx)
original_filename – name of the file before renaming
parameters – additional parameters for document parsing
other_fields – other fields that should be added to the document’s metadata
- Returns:
dict with metadata information about the document
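A custom extractor would implement both abstract methods with the signatures listed above. The sketch below is a minimal illustration under stated assumptions: ExtractorInterface is a local stand-in for AbstractMetadataExtractor (so the example runs without dedoc), TxtLineCountExtractor and its output keys are hypothetical, and merging other_fields into the result with dict.update is an assumption about how those fields are added.

```python
import os
import tempfile
from abc import ABC, abstractmethod
from typing import Optional


class ExtractorInterface(ABC):
    """Local stand-in mirroring the two abstract methods documented above."""

    @abstractmethod
    def can_extract(self, directory: str, filename: str, converted_filename: str,
                    original_filename: str, parameters: Optional[dict] = None,
                    other_fields: Optional[dict] = None) -> bool:
        ...

    @abstractmethod
    def extract_metadata(self, directory: str, filename: str, converted_filename: str,
                         original_filename: str, parameters: Optional[dict] = None,
                         other_fields: Optional[dict] = None) -> dict:
        ...


class TxtLineCountExtractor(ExtractorInterface):
    """Hypothetical extractor that reports the number of lines in a .txt file."""

    def can_extract(self, directory: str, filename: str, converted_filename: str,
                    original_filename: str, parameters: Optional[dict] = None,
                    other_fields: Optional[dict] = None) -> bool:
        return converted_filename.lower().endswith(".txt")

    def extract_metadata(self, directory: str, filename: str, converted_filename: str,
                         original_filename: str, parameters: Optional[dict] = None,
                         other_fields: Optional[dict] = None) -> dict:
        # The file lives in `directory` under its renamed (and converted) name.
        path = os.path.join(directory, converted_filename)
        with open(path, encoding="utf-8") as file:
            line_count = sum(1 for _ in file)
        metadata = {"file_name": original_filename, "line_count": line_count}
        # Assumption: extra fields are merged into the resulting dict.
        metadata.update(other_fields or {})
        return metadata


metadata = {}
with tempfile.TemporaryDirectory() as tmp_dir:
    with open(os.path.join(tmp_dir, "23141.txt"), "w", encoding="utf-8") as file:
        file.write("first\nsecond\n")
    extractor = TxtLineCountExtractor()
    file_args = (tmp_dir, "23141.txt", "23141.txt", "notes.txt")
    # Call extract_metadata() only after can_extract() has returned True.
    if extractor.can_extract(*file_args):
        metadata = extractor.extract_metadata(*file_args, other_fields={"source": "demo"})
```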
- class dedoc.metadata_extractors.BaseMetadataExtractor[source]
Bases: AbstractMetadataExtractor
This metadata extractor extracts metadata from documents of any format.
- It returns the following information about the given file:
file name;
file name during parsing (unique);
file type (MIME);
file size in bytes;
time when the file was last accessed;
time when the file was created;
time when the file was last modified.
- can_extract(directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) bool[source]
This extractor can handle any file, so the method always returns True. See the can_extract() documentation for information about the parameters.
- extract_metadata(directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) dict[source]
Get the basic meta-information about the file. See the extract_metadata() documentation for information about the parameters.
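The kind of information listed for this extractor can be gathered with the Python standard library alone. The following is a rough sketch, not the library's implementation: the function name basic_file_metadata and all dictionary keys are hypothetical, and using os.stat together with mimetypes.guess_type is an assumption about one reasonable way to obtain these values.

```python
import mimetypes
import os
import tempfile


def basic_file_metadata(directory: str, filename: str, original_filename: str) -> dict:
    """Collect name, MIME type, size and timestamps for a file (illustrative keys)."""
    path = os.path.join(directory, filename)
    stat = os.stat(path)
    # guess_type() infers the MIME type from the extension; it may return None.
    mime_type, _ = mimetypes.guess_type(path)
    return {
        "file_name": original_filename,        # name before renaming
        "temporary_file_name": filename,       # unique name used during parsing
        "file_type": mime_type,
        "size": stat.st_size,                  # bytes
        "access_time": int(stat.st_atime),
        # Note: on Unix st_ctime is the last metadata change, not true creation time.
        "created_time": int(stat.st_ctime),
        "modified_time": int(stat.st_mtime),
    }


with tempfile.TemporaryDirectory() as tmp_dir:
    with open(os.path.join(tmp_dir, "23141.txt"), "wb") as file:
        file.write(b"hello")
    info = basic_file_metadata(tmp_dir, "23141.txt", "notes.txt")
```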
- class dedoc.metadata_extractors.DocxMetadataExtractor[source]
Bases: BaseMetadataExtractor
This class is used to extract metadata from docx documents. It extends the metadata retrieved by BaseMetadataExtractor.
- In addition, the following fields can be added to the other fields of the metadata:
document subject;
keywords;
category;
comments;
author;
author who last modified the file;
created, modified and last printed date.
- can_extract(directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) bool[source]
Check if the document has the .docx extension. See the can_extract() documentation for information about the parameters.
- extract_metadata(directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) dict[source]
Add the predefined list of metadata for docx documents. See the extract_metadata() documentation for information about the parameters.
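The fields listed for docx documents correspond to the core properties that a .docx package stores in docProps/core.xml. The sketch below shows one way to read them with the standard library only; it is not necessarily how DocxMetadataExtractor works internally. The function name read_docx_core_properties and the output keys are hypothetical, and the demo archive is a minimal stand-in rather than a complete docx file.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# XML namespaces used by the core properties part of an Office Open XML package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Illustrative output key -> element in docProps/core.xml.
FIELDS = {
    "document_subject": "dc:subject",
    "keywords": "cp:keywords",
    "category": "cp:category",
    "comments": "dc:description",
    "author": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created_date": "dcterms:created",
    "modified_date": "dcterms:modified",
    "last_printed_date": "cp:lastPrinted",
}


def read_docx_core_properties(docx_file) -> dict:
    """Return the core properties that are present; accepts a path or file object."""
    with zipfile.ZipFile(docx_file) as archive:
        if "docProps/core.xml" not in archive.namelist():
            return {}
        root = ET.fromstring(archive.read("docProps/core.xml"))
    properties = {}
    for key, tag in FIELDS.items():
        element = root.find(tag, NS)
        if element is not None and element.text:
            properties[key] = element.text
    return properties


# Demo: an in-memory archive containing only a small core.xml part.
core_xml = (
    '<cp:coreProperties xmlns:cp="' + NS["cp"] + '" xmlns:dc="' + NS["dc"] + '">'
    "<dc:creator>Alice</dc:creator><cp:keywords>report, draft</cp:keywords>"
    "</cp:coreProperties>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("docProps/core.xml", core_xml)
buffer.seek(0)
props = read_docx_core_properties(buffer)
```

Only properties that are actually present are returned, which matches the wording above that these fields "can be added" rather than always being present.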
- class dedoc.metadata_extractors.ImageMetadataExtractor(*, config: dict)[source]
Bases: BaseMetadataExtractor
This class is used to extract metadata from images. It extends the metadata retrieved by BaseMetadataExtractor.
- In addition, the following fields can be added to the other fields of the metadata:
date time, date time digitized, date time original;
digital zoom ratio;
exif image height, image width and version;
light source;
make;
model;
orientation;
resolution unit;
software;
subject distance range;
user comment.
- __init__(*, config: dict) None[source]
- Parameters:
config – configuration of the extractor, e.g. logger for logging
- can_extract(directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) bool[source]
Check if the document has an image-like extension (“.png”, “.jpg”, “.jpeg”). See the can_extract() documentation for information about the parameters.
- extract_metadata(directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) dict[source]
Add the predefined list of metadata for images. See the extract_metadata() documentation for information about the parameters.
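An extension check like the one described for can_extract() can be sketched in a few lines. The helper name looks_like_image is hypothetical, and two details are assumptions because the text above does not state them: which of the filename arguments the real method inspects, and whether the comparison is case-insensitive.

```python
import os

# Extensions named in the can_extract() description above.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def looks_like_image(filename: str) -> bool:
    """Return True if the file name ends with one of the image extensions."""
    extension = os.path.splitext(filename)[1]
    # Assumption: lowercasing makes "SCAN.JPG" match as well.
    return extension.lower() in IMAGE_EXTENSIONS
```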
- class dedoc.metadata_extractors.NoteMetadataExtractor[source]
Bases: BaseMetadataExtractor
This class is used to extract metadata from documents with the extension .note.pickle. It extends the metadata retrieved by BaseMetadataExtractor. In addition, the author field can be added to the other fields of the metadata.
- can_extract(directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) bool[source]
Check if the document has the .note.pickle extension. See the can_extract() documentation for information about the parameters.
- extract_metadata(directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) dict[source]
Add the predefined list of metadata for .note.pickle documents. See the extract_metadata() documentation for information about the parameters.
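A compound extension such as .note.pickle cannot be detected with os.path.splitext, which only splits off the last suffix. A minimal sketch of a suffix-based check follows; the helper name is_note_file is hypothetical, and the case-insensitive comparison is an assumption.

```python
import os


def is_note_file(filename: str) -> bool:
    """Return True if the file name ends with the compound .note.pickle suffix."""
    return filename.lower().endswith(".note.pickle")


# splitext() sees only the final suffix, so it cannot be used for this check.
last_suffix = os.path.splitext("23141.note.pickle")[1]
```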
- class dedoc.metadata_extractors.PdfMetadataExtractor(*, config: dict)[source]
Bases: BaseMetadataExtractor
This class is used to extract metadata from pdf documents. It extends the metadata retrieved by BaseMetadataExtractor.
- In addition, the following fields can be added to the other fields of the metadata:
producer;
creator;
author;
title;
subject;
keywords;
creation date;
modification date.
- __init__(*, config: dict) None[source]
- Parameters:
config – configuration of the extractor, e.g. logger for logging
- can_extract(directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) bool[source]
Check if the document has the .pdf extension. See the can_extract() documentation for information about the parameters.
- extract_metadata(directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) dict[source]
Add the predefined list of metadata for pdf documents. See the extract_metadata() documentation for information about the parameters.