dedoc.metadata_extractors

class dedoc.metadata_extractors.MetadataExtractorComposition(extractors: List[AbstractMetadataExtractor])

This class extracts metadata from any document using a list of metadata extractors, which is passed to the class constructor. The first suitable extractor is used (the first one whose can_extract() method returns True), so the order of the extractors matters.

__init__(extractors: List[AbstractMetadataExtractor]) -> None
Parameters:

extractors – the list of extractors, each providing the methods can_extract() and add_metadata(), used to extract metadata from a file

add_metadata(document: UnstructuredDocument, directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) -> UnstructuredDocument

Add metadata to the document using the first suitable extractor, if one is found. See the documentation of the add_metadata() method of the AbstractMetadataExtractor class for a description of the parameters.
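The first-match rule described above can be illustrated with a minimal sketch. StubExtractor and add_metadata_first_match are hypothetical stand-ins, not part of dedoc, and their signatures are shortened for brevity. Returning the document unchanged when no extractor matches is an assumption of this sketch, not documented behaviour.

```python
from typing import List


class StubExtractor:
    """Hypothetical stand-in for an AbstractMetadataExtractor subclass."""

    def __init__(self, name: str, suffix: str) -> None:
        self.name = name
        self.suffix = suffix

    def can_extract(self, filename: str) -> bool:
        # An empty suffix matches every file name, like a catch-all extractor.
        return filename.lower().endswith(self.suffix)

    def add_metadata(self, document: dict, filename: str) -> dict:
        document["metadata"] = {"extracted_by": self.name, "file_name": filename}
        return document


def add_metadata_first_match(extractors: List[StubExtractor], document: dict, filename: str) -> dict:
    """Apply the first extractor whose can_extract() returns True."""
    for extractor in extractors:
        if extractor.can_extract(filename):
            return extractor.add_metadata(document, filename)
    # Assumption of this sketch: leave the document untouched if nothing matches.
    return document


specific_first = [StubExtractor("docx", ".docx"), StubExtractor("base", "")]
catch_all_first = [StubExtractor("base", ""), StubExtractor("docx", ".docx")]

by_specific = add_metadata_first_match(specific_first, {}, "report.docx")
by_catch_all = add_metadata_first_match(catch_all_first, {}, "report.docx")
```

With the catch-all extractor placed first, the more specific extractor is never reached, which is why specific extractors would normally come before general ones in the list.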

class dedoc.metadata_extractors.AbstractMetadataExtractor

This class is responsible for extracting metadata from documents of different formats.

abstract add_metadata(document: UnstructuredDocument, directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) -> UnstructuredDocument

Add metadata to the document if possible, i.e. if the can_extract() method returned True.

Returns:

the document with an added metadata attribute (a dict with information about the document)

abstract can_extract(document: UnstructuredDocument, directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) -> bool

Check whether this extractor can handle the given file. Return True if it can and False otherwise. See the add_metadata() documentation for a description of the parameters.
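The contract formed by the two abstract methods can be sketched as follows. StubDocument, ExtractorInterface and TxtExtractor are hypothetical names used only for illustration. A real extractor would subclass AbstractMetadataExtractor and work with UnstructuredDocument instead. Checking converted_filename in can_extract() is an assumption of this sketch.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StubDocument:
    """Hypothetical stand-in for UnstructuredDocument: only a metadata dict."""
    metadata: dict = field(default_factory=dict)


class ExtractorInterface(ABC):
    """Hypothetical stand-in with the two abstract methods documented above."""

    @abstractmethod
    def can_extract(self, document: StubDocument, directory: str, filename: str,
                    converted_filename: str, original_filename: str,
                    parameters: Optional[dict] = None,
                    other_fields: Optional[dict] = None) -> bool:
        ...

    @abstractmethod
    def add_metadata(self, document: StubDocument, directory: str, filename: str,
                     converted_filename: str, original_filename: str,
                     parameters: Optional[dict] = None,
                     other_fields: Optional[dict] = None) -> StubDocument:
        ...


class TxtExtractor(ExtractorInterface):
    """Illustrative extractor that only accepts .txt files."""

    def can_extract(self, document, directory, filename, converted_filename,
                    original_filename, parameters=None, other_fields=None) -> bool:
        # Assumption: the decision is based on the converted file name.
        return converted_filename.lower().endswith(".txt")

    def add_metadata(self, document, directory, filename, converted_filename,
                     original_filename, parameters=None, other_fields=None) -> StubDocument:
        document.metadata = {"file_name": original_filename,
                             "other_fields": dict(other_fields or {})}
        return document


extractor = TxtExtractor()
args = (StubDocument(), "/tmp", "a1b2c3.txt", "a1b2c3.txt", "notes.txt")
if extractor.can_extract(*args):
    document = extractor.add_metadata(*args)
```

Because both methods are abstract, the base class itself cannot be instantiated, and a subclass must implement both before it can be used.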

class dedoc.metadata_extractors.BaseMetadataExtractor

Bases: AbstractMetadataExtractor

This metadata extractor extracts metadata from documents of any format.

It returns the following information about the given file:
  • file name;

  • file name during parsing (unique);

  • file type (MIME);

  • file size in bytes;

  • time when the file was last accessed;

  • time when the file was created;

  • time when the file was last modified.

add_metadata(document: UnstructuredDocument, directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) -> UnstructuredDocument

Get the basic meta-information about the file. See the add_metadata() documentation for a description of the parameters.

can_extract(document: UnstructuredDocument, directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) -> bool

This extractor can handle any file, so the method always returns True. See the can_extract() documentation for a description of the parameters.
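The basic file information listed above can be gathered with the Python standard library alone. The following is a rough sketch, not dedoc's implementation: the function name and the dictionary keys are hypothetical, and the MIME type is guessed from the file extension only.

```python
import mimetypes
import os
import tempfile
from typing import Optional


def basic_file_metadata(directory: str, filename: str,
                        original_filename: Optional[str] = None) -> dict:
    """Collect basic file information (all key names are hypothetical)."""
    path = os.path.join(directory, filename)
    stat = os.stat(path)
    mime_type, _ = mimetypes.guess_type(path)  # guessed from the extension
    return {
        "file_name": original_filename or filename,
        "temporary_file_name": filename,  # the name used during parsing
        "file_type": mime_type,
        "size": stat.st_size,  # size in bytes
        "access_time": int(stat.st_atime),
        # On Unix st_ctime is the last metadata change, not the creation time.
        "created_time": int(stat.st_ctime),
        "modified_time": int(stat.st_mtime),
    }


with tempfile.TemporaryDirectory() as tmp_dir:
    with open(os.path.join(tmp_dir, "a1b2c3.txt"), "w", encoding="utf-8") as handle:
        handle.write("hello")
    info = basic_file_metadata(tmp_dir, "a1b2c3.txt", original_filename="notes.txt")
```

Keeping the original name separate from the name used during parsing lets the result refer to the file as the user knows it, even if the file was stored under a unique temporary name.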

class dedoc.metadata_extractors.DocxMetadataExtractor

Bases: BaseMetadataExtractor

This class is used to extract metadata from docx documents. It extends the metadata retrieved by BaseMetadataExtractor.

In addition, the following fields can be added to the other fields of the metadata:
  • document subject;

  • keywords;

  • category;

  • comments;

  • author;

  • author who last modified the file;

  • creation, modification and last printed dates.

add_metadata(document: UnstructuredDocument, directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) -> UnstructuredDocument

Add the predefined list of metadata for docx documents. See the add_metadata() documentation for a description of the parameters.

can_extract(document: UnstructuredDocument, directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) -> bool

Check whether the document has the .docx extension. See the can_extract() documentation for a description of the parameters.
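In a docx file, properties such as the ones listed above are stored in the docProps/core.xml part of the archive. The sketch below reads them with the standard library. It is an illustration only and not necessarily how DocxMetadataExtractor works: the function name and the result keys are hypothetical, and the sample archive is synthetic.

```python
import io
import xml.etree.ElementTree as ET
import zipfile
from typing import Dict

NAMESPACES = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Hypothetical result keys mapped to the core-properties elements.
FIELDS = {
    "document_subject": "dc:subject",
    "keywords": "cp:keywords",
    "category": "cp:category",
    "comments": "dc:description",
    "author": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created_date": "dcterms:created",
    "modified_date": "dcterms:modified",
    "last_printed_date": "cp:lastPrinted",
}


def read_docx_core_properties(source) -> Dict[str, str]:
    """Return the core properties present in a docx file (path or file object)."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("docProps/core.xml"))
    result = {}
    for key, tag in FIELDS.items():
        element = root.find(tag, NAMESPACES)
        if element is not None and element.text:
            result[key] = element.text  # absent or empty fields are skipped
    return result


# Build a tiny synthetic archive that contains only the core-properties part.
core_xml = (
    '<cp:coreProperties xmlns:cp="' + NAMESPACES["cp"] + '" '
    'xmlns:dc="' + NAMESPACES["dc"] + '">'
    "<dc:creator>Alice</dc:creator><cp:keywords>report, draft</cp:keywords>"
    "</cp:coreProperties>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("docProps/core.xml", core_xml)
buffer.seek(0)
properties = read_docx_core_properties(buffer)
```

Only the fields that are actually present in the file end up in the result, which matches the wording "can be added" above.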

class dedoc.metadata_extractors.ImageMetadataExtractor(*, config: dict)

Bases: BaseMetadataExtractor

This class is used to extract metadata from images. It extends the metadata retrieved by BaseMetadataExtractor.

In addition, the following fields can be added to the other fields of the metadata:
  • date time, date time digitized, date time original;

  • digital zoom ratio;

  • EXIF image height, image width and version;

  • light source;

  • make;

  • model;

  • orientation;

  • resolution unit;

  • software;

  • subject distance range;

  • user comment.

__init__(*, config: dict) -> None
Parameters:

config – configuration of the extractor, e.g. a logger for logging

add_metadata(document: UnstructuredDocument, directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) -> UnstructuredDocument

Add the predefined list of metadata for images. See the add_metadata() documentation for a description of the parameters.

can_extract(document: UnstructuredDocument, directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) -> bool

Check whether the document has an image-like extension (“.png”, “.jpg”, “.jpeg”). See the can_extract() documentation for a description of the parameters.
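An extension check of this kind can be sketched in a few lines. The helper name is hypothetical, and comparing in lower case is an assumption of this sketch. Whether the actual check is case-sensitive is not stated here.

```python
import os

IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def has_image_extension(filename: str) -> bool:
    """Return True if the last extension of the file name is image-like."""
    # Assumption: the comparison ignores case, so "scan.JPG" is accepted.
    return os.path.splitext(filename)[1].lower() in IMAGE_EXTENSIONS
```

Only the last extension is considered, so a name such as photo.png.zip is rejected.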

class dedoc.metadata_extractors.NoteMetadataExtractor

Bases: BaseMetadataExtractor

This class is used to extract metadata from documents with the extension .note.pickle. It extends the metadata retrieved by BaseMetadataExtractor.

In addition, the author field can be added to the other fields of the metadata.

__init__() -> None

add_metadata(document: UnstructuredDocument, directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) -> UnstructuredDocument

Add the predefined list of metadata for .note.pickle documents. See the add_metadata() documentation for a description of the parameters.

can_extract(document: UnstructuredDocument, directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) -> bool

Check whether the document has the .note.pickle extension. See the can_extract() documentation for a description of the parameters.
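Because .note.pickle is a compound extension, a check based on os.path.splitext() sees only the last part of it. The sketch below uses a suffix comparison instead. The helper name is hypothetical, and ignoring case is an assumption of this sketch.

```python
import os

NOTE_SUFFIX = ".note.pickle"


def has_note_extension(filename: str) -> bool:
    """Return True if the file name ends with the compound .note.pickle suffix."""
    return filename.lower().endswith(NOTE_SUFFIX)


# splitext() keeps only the last extension, so it cannot detect the full suffix.
last_suffix = os.path.splitext("meeting.note.pickle")[1]
```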

class dedoc.metadata_extractors.PdfMetadataExtractor(*, config: dict)

Bases: BaseMetadataExtractor

This class is used to extract metadata from pdf documents. It extends the metadata retrieved by BaseMetadataExtractor.

In addition, the following fields can be added to the other fields of the metadata:
  • producer;

  • creator;

  • author;

  • title;

  • subject;

  • keywords;

  • creation date;

  • modification date.

__init__(*, config: dict) -> None
Parameters:

config – configuration of the extractor, e.g. a logger for logging

add_metadata(document: UnstructuredDocument, directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) -> UnstructuredDocument

Add the predefined list of metadata for pdf documents. See the add_metadata() documentation for a description of the parameters.

can_extract(document: UnstructuredDocument, directory: str, filename: str, converted_filename: str, original_filename: str, parameters: dict | None = None, other_fields: dict | None = None) -> bool

Check whether the document has the .pdf extension. See the can_extract() documentation for a description of the parameters.
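PDF files usually store the creation and modification dates as strings of the form D:YYYYMMDDHHmmSS followed by an optional time zone offset. The sketch below converts such a string to a datetime object. The function name is hypothetical, and how PdfMetadataExtractor itself represents these dates is not stated here.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# Every component after the year is optional in a PDF date string.
_PDF_DATE = re.compile(
    r"^D:(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?P<sign>[+\-Z])?(?:(?P<tz_hour>\d{2})'?)?(?:(?P<tz_minute>\d{2})'?)?$"
)


def parse_pdf_date(value: str) -> Optional[datetime]:
    """Parse a PDF date string; return None if it is malformed."""
    match = _PDF_DATE.match(value.strip())
    if match is None:
        return None
    parts = match.groupdict()
    tz = None  # no offset in the string: return a naive datetime
    if parts["sign"] is not None:
        offset = timedelta(hours=int(parts["tz_hour"] or 0),
                           minutes=int(parts["tz_minute"] or 0))
        tz = timezone(-offset if parts["sign"] == "-" else offset)
    try:
        return datetime(
            int(parts["year"]), int(parts["month"] or 1), int(parts["day"] or 1),
            int(parts["hour"] or 0), int(parts["minute"] or 0),
            int(parts["second"] or 0), tzinfo=tz,
        )
    except ValueError:
        # Out-of-range components, e.g. month 13.
        return None


created = parse_pdf_date("D:20230115103000+03'00'")
```

Missing components fall back to the first month, the first day and midnight, so a value that contains only the year is still accepted.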