dedoc.metadata_extractors

class dedoc.metadata_extractors.MetadataExtractorComposition(extractors: List[AbstractMetadataExtractor])

This class extracts metadata from any document using the list of available metadata extractors, which is set via the class constructor. The first suitable extractor is used (the first one whose can_extract() method returns True), so the order of the extractors matters.

__init__(extractors: List[AbstractMetadataExtractor]) → None
Parameters:

extractors – the list of extractors, each providing the can_extract() and extract_metadata() methods, used to extract metadata from a file

extract_metadata(directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) → dict

Extract metadata using the first suitable extractor, if one is found. See the extract_metadata() method of AbstractMetadataExtractor for a description of the parameters.
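The first-match behaviour described for MetadataExtractorComposition can be illustrated with a minimal sketch. The classes below (FallbackExtractor, PdfOnlyExtractor, FirstMatchComposition) are hypothetical stand-ins written for this example, not the dedoc classes, and raising ValueError when no extractor matches is an assumption, since the documentation does not state what happens in that case.

```python
from typing import List, Optional


class FallbackExtractor:
    """Hypothetical extractor that accepts every file."""

    def can_extract(self, directory: str, filename: str, converted_filename: str,
                    original_filename: str, parameters: Optional[dict] = None,
                    other_fields: Optional[dict] = None) -> bool:
        return True

    def extract_metadata(self, directory: str, filename: str, converted_filename: str,
                         original_filename: str, parameters: Optional[dict] = None,
                         other_fields: Optional[dict] = None) -> dict:
        return {"file_name": original_filename, "handled_by": "fallback"}


class PdfOnlyExtractor(FallbackExtractor):
    """Hypothetical extractor that accepts only files ending in .pdf."""

    def can_extract(self, directory, filename, converted_filename, original_filename,
                    parameters=None, other_fields=None) -> bool:
        return converted_filename.lower().endswith(".pdf")

    def extract_metadata(self, directory, filename, converted_filename, original_filename,
                         parameters=None, other_fields=None) -> dict:
        return {"file_name": original_filename, "handled_by": "pdf"}


class FirstMatchComposition:
    """Sketch of first-match dispatch; not the dedoc implementation."""

    def __init__(self, extractors: List[FallbackExtractor]) -> None:
        self.extractors = extractors

    def extract_metadata(self, directory, filename, converted_filename, original_filename,
                         parameters=None, other_fields=None) -> dict:
        args = (directory, filename, converted_filename, original_filename,
                parameters, other_fields)
        # Extractors are tried in list order; the first one that accepts the file wins.
        for extractor in self.extractors:
            if extractor.can_extract(*args):
                return extractor.extract_metadata(*args)
        # Assumption: the behaviour when nothing matches is not documented.
        raise ValueError(f"no suitable extractor for {original_filename}")


# The same two extractors in different orders give different results for a pdf file.
specific_first = FirstMatchComposition([PdfOnlyExtractor(), FallbackExtractor()])
generic_first = FirstMatchComposition([FallbackExtractor(), PdfOnlyExtractor()])
```

With the catch-all extractor placed first, the pdf-specific one is never reached, which is why format-specific extractors would normally be listed before a generic one.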

class dedoc.metadata_extractors.AbstractMetadataExtractor

This class is responsible for extracting metadata from documents of different formats.

abstract can_extract(directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) → bool

Check if this extractor can handle the given file. Return True if the extractor can handle it and False otherwise. See the extract_metadata() documentation for a description of the parameters.

abstract extract_metadata(directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) → dict

Extract metadata from the file if possible, i.e. if the can_extract() method returned True.

Parameters:
  • directory – path to the directory where the original and converted files are located

  • filename – name of the file after renaming (for example 23141.doc). The file gets a new name during processing by the dedoc manager (if used)

  • converted_filename – name of the file after renaming and conversion (for example 23141.docx)

  • original_filename – name of the file before renaming

  • parameters – additional parameters for document parsing

  • other_fields – other fields that should be added to the document’s metadata

Returns:

dict with metadata information about the document
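To show how the parameters above fit together, here is a minimal sketch of a custom extractor. ExtractorInterfaceSketch is a hypothetical stand-in that only mirrors the two abstract methods documented here (a real implementation would subclass AbstractMetadataExtractor instead), and TxtLineCountExtractor together with the keys it returns are invented for illustration.

```python
import os
from abc import ABC, abstractmethod
from typing import Optional


class ExtractorInterfaceSketch(ABC):
    """Hypothetical stand-in mirroring the two documented abstract methods."""

    @abstractmethod
    def can_extract(self, directory: str, filename: str, converted_filename: str,
                    original_filename: str, parameters: Optional[dict] = None,
                    other_fields: Optional[dict] = None) -> bool:
        ...

    @abstractmethod
    def extract_metadata(self, directory: str, filename: str, converted_filename: str,
                         original_filename: str, parameters: Optional[dict] = None,
                         other_fields: Optional[dict] = None) -> dict:
        ...


class TxtLineCountExtractor(ExtractorInterfaceSketch):
    """Illustrative extractor: reports the number of lines in a .txt file."""

    def can_extract(self, directory, filename, converted_filename, original_filename,
                    parameters=None, other_fields=None) -> bool:
        return converted_filename.lower().endswith(".txt")

    def extract_metadata(self, directory, filename, converted_filename, original_filename,
                         parameters=None, other_fields=None) -> dict:
        # The file on disk is addressed by directory + converted_filename.
        path = os.path.join(directory, converted_filename)
        with open(path, encoding="utf-8") as file:
            line_count = sum(1 for _ in file)
        metadata = {
            "file_name": original_filename,    # name before renaming
            "temporary_file_name": filename,   # name after renaming
            "line_count": line_count,
        }
        # other_fields are merged into the result (key layout is an assumption).
        metadata.update(other_fields or {})
        return metadata
```

A caller would check can_extract() first and call extract_metadata() with the same arguments only when it returns True.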

class dedoc.metadata_extractors.BaseMetadataExtractor

Bases: AbstractMetadataExtractor

This metadata extractor extracts metadata from documents of any format.

It returns the following information about the given file:
  • file name;

  • file name during parsing (unique);

  • file type (MIME);

  • file size in bytes;

  • time when the file was last accessed;

  • time when the file was created;

  • time when the file was last modified.
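The fields listed above can all be obtained from the file system. The following is a minimal sketch using only the Python standard library; the function name basic_file_metadata and the dictionary keys are assumptions made for this example, and it is not presented as the way BaseMetadataExtractor itself is implemented.

```python
import mimetypes
import os


def basic_file_metadata(directory: str, filename: str, original_filename: str) -> dict:
    """Collect basic file-system metadata (illustrative key names)."""
    path = os.path.join(directory, filename)
    stat = os.stat(path)
    # Guess the MIME type from the file extension; None if it is unknown.
    mime_type, _ = mimetypes.guess_type(original_filename)
    return {
        "file_name": original_filename,      # file name
        "temporary_file_name": filename,     # unique file name during parsing
        "file_type": mime_type,              # file type (MIME)
        "size": stat.st_size,                # file size in bytes
        "access_time": stat.st_atime,        # last access, seconds since the epoch
        # On Unix st_ctime is the last metadata change, not the creation time.
        "created_time": stat.st_ctime,
        "modified_time": stat.st_mtime,      # last modification
    }
```

Guessing the MIME type from the extension is a simplification; inspecting the file content would be more reliable for files with missing or misleading extensions.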

can_extract(directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) → bool

This extractor can handle any file, so the method always returns True. See the can_extract() documentation for a description of the parameters.

extract_metadata(directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) → dict

Get the basic meta-information about the file. See the extract_metadata() documentation for a description of the parameters.

class dedoc.metadata_extractors.DocxMetadataExtractor

Bases: BaseMetadataExtractor

This class is used to extract metadata from docx documents. It extends the metadata retrieved by BaseMetadataExtractor.

In addition to the base metadata, the following fields can be added to the metadata's other fields:
  • document subject;

  • keywords;

  • category;

  • comments;

  • author;

  • author who last modified the file;

  • created, modified and last printed date.
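For orientation, a docx file is a zip archive, and fields like the ones listed above are normally stored in its docProps/core.xml part. The sketch below reads them with the standard library only. The function name read_docx_core_properties and the result keys are assumptions for this example, and the code is not claimed to match how DocxMetadataExtractor obtains the values.

```python
import xml.etree.ElementTree as ET
import zipfile

# XML namespaces used by the core properties part of an Office Open XML package.
NAMESPACES = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Illustrative result keys mapped to the elements that hold the values.
FIELDS = {
    "document_subject": "dc:subject",
    "keywords": "cp:keywords",
    "category": "cp:category",
    "comments": "dc:description",
    "author": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created_date": "dcterms:created",
    "modified_date": "dcterms:modified",
    "last_printed_date": "cp:lastPrinted",
}


def read_docx_core_properties(path: str) -> dict:
    """Return the core properties that are present and non-empty."""
    with zipfile.ZipFile(path) as archive:
        try:
            xml_bytes = archive.read("docProps/core.xml")
        except KeyError:
            # The archive has no core properties part.
            return {}
    root = ET.fromstring(xml_bytes)
    result = {}
    for key, tag in FIELDS.items():
        element = root.find(tag, NAMESPACES)
        if element is not None and element.text:
            result[key] = element.text
    return result
```

Fields that are missing from the document are simply left out of the result, which matches the wording "can be added" above.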

can_extract(directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) → bool

Check if the document has the .docx extension. See the can_extract() documentation for a description of the parameters.

extract_metadata(directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) → dict

Add the predefined list of metadata for docx documents. See the extract_metadata() documentation for a description of the parameters.

class dedoc.metadata_extractors.ImageMetadataExtractor(*, config: dict)

Bases: BaseMetadataExtractor

This class is used to extract metadata from images. It extends the metadata retrieved by BaseMetadataExtractor.

In addition to the base metadata, the following fields can be added to the metadata's other fields:
  • date time, date time digitized, date time original;

  • digital zoom ratio;

  • exif image height, image width and version;

  • light source;

  • make;

  • model;

  • orientation;

  • resolution unit;

  • software;

  • subject distance range;

  • user comment.

__init__(*, config: dict) → None
Parameters:

config – configuration of the extractor, e.g. logger for logging

can_extract(directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) → bool

Check if the document has an image extension (“.png”, “.jpg”, “.jpeg”). See the can_extract() documentation for a description of the parameters.

extract_metadata(directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) → dict

Add the predefined list of metadata for images. See the extract_metadata() documentation for a description of the parameters.

class dedoc.metadata_extractors.NoteMetadataExtractor

Bases: BaseMetadataExtractor

This class is used to extract metadata from documents with the .note.pickle extension. It extends the metadata retrieved by BaseMetadataExtractor.

In addition to the base metadata, the author field can be added to the metadata's other fields.

__init__() → None

can_extract(directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) → bool

Check if the document has the .note.pickle extension. See the can_extract() documentation for a description of the parameters.
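The extension checks described for the format-specific extractors can be sketched as a suffix test. The helper has_extension and the constant names are hypothetical, and matching case-insensitively is an assumption, since the documentation does not say whether case matters or which of the file name arguments is inspected. A suffix test is used because .note.pickle is a compound extension that os.path.splitext() does not return as a whole.

```python
import os
from typing import Tuple

# Extensions named in the documentation of the format-specific extractors.
DOCX_EXTENSIONS = (".docx",)
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg")
NOTE_EXTENSIONS = (".note.pickle",)
PDF_EXTENSIONS = (".pdf",)


def has_extension(file_name: str, extensions: Tuple[str, ...]) -> bool:
    """Return True if file_name ends with one of the extensions (case-insensitive)."""
    return file_name.lower().endswith(extensions)


# splitext only returns the last component, so it misses the ".note" part.
last_component = os.path.splitext("23141.note.pickle")[1]
```

Here last_component is ".pickle", so comparing it against ".note.pickle" would never succeed, whereas the suffix test handles single and compound extensions in the same way.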

extract_metadata(directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) → dict

Add the predefined list of metadata for .note.pickle documents. See the extract_metadata() documentation for a description of the parameters.

class dedoc.metadata_extractors.PdfMetadataExtractor(*, config: dict)

Bases: BaseMetadataExtractor

This class is used to extract metadata from pdf documents. It extends the metadata retrieved by BaseMetadataExtractor.

In addition to the base metadata, the following fields can be added to the metadata's other fields:
  • producer;

  • creator;

  • author;

  • title;

  • subject;

  • keywords;

  • creation date;

  • modification date.

__init__(*, config: dict) → None
Parameters:

config – configuration of the extractor, e.g. logger for logging

can_extract(directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) → bool

Check if the document has the .pdf extension. See the can_extract() documentation for a description of the parameters.

extract_metadata(directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) → dict

Add the predefined list of metadata for pdf documents. See the extract_metadata() documentation for a description of the parameters.