dedoc.metadata_extractors

class dedoc.metadata_extractors.MetadataExtractorComposition(extractors: List[AbstractMetadataExtractor])

This class extracts metadata from any document using the available list of metadata extractors. The list is set via the class constructor. The first suitable extractor is used (the first one whose can_extract() method returns True), so the order of extractors matters.

__init__(extractors: List[AbstractMetadataExtractor]) → None

Parameters: extractors – the list of extractors, each providing the methods can_extract() and extract(), used to extract metadata from a file

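extract(file_path: str, converted_filename: str | None = None, original_filename: str | None = None, parameters: dict | None = None, extension: str | None = None, mime: str | None = None) → dict

Extract metadata using the first suitable extractor, if one is found.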

Parameters:

file_path – path to the file to extract metadata from. If dedoc manager is used, the file gets a new name during processing; this new name should be passed here (for example 23141.doc)
converted_filename – name of the file after renaming and conversion (if dedoc manager is used, for example 23141.docx); by default, the name from file_path. The converted file should be located in the same directory as the file before conversion.
original_filename – name of the file before renaming (if dedoc manager is used); by default, the name from file_path
parameters – additional parameters for document parsing, see Parameters description for more details
extension – file extension, for example .doc or .pdf
mime – MIME type of file

Returns:

dict with metadata information about the document
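The "first suitable extractor wins" rule described above can be illustrated with a minimal, self-contained sketch. It does not import dedoc: AlwaysExtractor, PdfOnlyExtractor and extract_with_first_suitable are hypothetical stand-ins, and raising ValueError when nothing matches is an assumption of this sketch, not documented behavior.

```python
class AlwaysExtractor:
    """Hypothetical stand-in that accepts every file (a catch-all fallback)."""

    def can_extract(self, file_path: str) -> bool:
        return True

    def extract(self, file_path: str) -> dict:
        return {"extractor": "fallback", "file_name": file_path}


class PdfOnlyExtractor:
    """Hypothetical stand-in that accepts only paths ending in .pdf."""

    def can_extract(self, file_path: str) -> bool:
        return file_path.lower().endswith(".pdf")

    def extract(self, file_path: str) -> dict:
        return {"extractor": "pdf", "file_name": file_path}


def extract_with_first_suitable(extractors: list, file_path: str) -> dict:
    """Return metadata from the first extractor whose can_extract() is True."""
    for extractor in extractors:
        if extractor.can_extract(file_path):
            return extractor.extract(file_path)
    # Assumption of this sketch: fail loudly when no extractor matches.
    raise ValueError(f"no suitable extractor for {file_path}")
```

With the order [PdfOnlyExtractor(), AlwaysExtractor()] a .pdf path reaches the specific extractor. With the reversed order the catch-all answers first and the specific extractor is never consulted, which is why a catch-all extractor usually belongs at the end of the list.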

class dedoc.metadata_extractors.AbstractMetadataExtractor(*, config: dict | None = None, recognized_extensions: Set[str] | None = None, recognized_mimes: Set[str] | None = None)

This class is responsible for extracting metadata from documents of different formats.

__init__(*, config: dict | None = None, recognized_extensions: Set[str] | None = None, recognized_mimes: Set[str] | None = None) → None

Parameters:

config – configuration of the extractor, e.g. logger for logging
recognized_extensions – set of supported file extensions, each with a leading dot, for example {.doc, .pdf}
recognized_mimes – set of supported MIME types of files

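can_extract(file_path: str, converted_filename: str | None = None, original_filename: str | None = None, parameters: dict | None = None, mime: str | None = None, extension: str | None = None) → bool

Check if this extractor can handle the given file.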

Parameters:

file_path – path to the file to extract metadata from. If dedoc manager is used, the file gets a new name during processing; this new name should be passed here (for example 23141.doc)
converted_filename – name of the file after renaming and conversion (if dedoc manager is used, for example 23141.docx); by default, the name from file_path. The converted file should be located in the same directory as the file before conversion.
original_filename – name of the file before renaming (if dedoc manager is used); by default, the name from file_path
parameters – additional parameters for document parsing, see Parameters description for more details
mime – MIME type of a file
extension – file extension, for example .doc or .pdf

Returns:

True if the extractor can handle the given file and False otherwise

abstract extract(file_path: str, converted_filename: str | None = None, original_filename: str | None = None, parameters: dict | None = None) → dict

Extract metadata from the file if possible, i.e. if the method can_extract() returned True.

Parameters:

file_path – path to the file to extract metadata from. If dedoc manager is used, the file gets a new name during processing; this new name should be passed here (for example 23141.doc)
converted_filename – name of the file after renaming and conversion (if dedoc manager is used, for example 23141.docx); by default, the name from file_path. The converted file should be located in the same directory as the file before conversion.
original_filename – name of the file before renaming (if dedoc manager is used); by default, the name from file_path
parameters – additional parameters for document parsing, see Parameters description for more details

Returns:

dict with metadata information about the document
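The division of labor between can_extract() and extract() can be sketched with a small stand-in hierarchy. This is not the dedoc class: SketchMetadataExtractor only mirrors the constructor arguments documented above, its signatures are shortened, and the matching rule in can_extract() (extension or MIME type found in the recognized sets, with the extension derived from file_path when not given) is an assumption made for illustration. TxtLineCountExtractor and its line_count field are hypothetical.

```python
import os
from abc import ABC, abstractmethod
from typing import Optional, Set


class SketchMetadataExtractor(ABC):
    """Stand-in for the abstract interface; not the dedoc class itself."""

    def __init__(self, *, config: Optional[dict] = None,
                 recognized_extensions: Optional[Set[str]] = None,
                 recognized_mimes: Optional[Set[str]] = None) -> None:
        self.config = {} if config is None else config
        self.recognized_extensions = recognized_extensions or set()
        self.recognized_mimes = recognized_mimes or set()

    def can_extract(self, file_path: str, mime: Optional[str] = None,
                    extension: Optional[str] = None) -> bool:
        # Assumption: when no extension is passed, take it from file_path.
        if extension is None:
            extension = os.path.splitext(file_path)[1]
        return (extension.lower() in self.recognized_extensions
                or mime in self.recognized_mimes)

    @abstractmethod
    def extract(self, file_path: str, parameters: Optional[dict] = None) -> dict:
        """Return a dict with metadata; call only if can_extract() is True."""


class TxtLineCountExtractor(SketchMetadataExtractor):
    """Hypothetical extractor that reports the number of lines in a text file."""

    def __init__(self, *, config: Optional[dict] = None) -> None:
        super().__init__(config=config,
                         recognized_extensions={".txt"},
                         recognized_mimes={"text/plain"})

    def extract(self, file_path: str, parameters: Optional[dict] = None) -> dict:
        with open(file_path, encoding="utf-8") as file:
            return {"line_count": sum(1 for _ in file)}
```

In this sketch the subclass fixes its own recognized extensions and MIME types and exposes only config, which matches the shape of the concrete constructors listed further down in this reference.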

class dedoc.metadata_extractors.BaseMetadataExtractor(*, config: dict | None = None, recognized_extensions: Set[str] | None = None, recognized_mimes: Set[str] | None = None)

Bases: AbstractMetadataExtractor

This metadata extractor extracts metadata from documents of any format.

It returns the following information about the given file:

file name;
file name during parsing (unique);
file type (MIME);
file size in bytes;
time when the file was last accessed;
time when the file was created;
time when the file was last modified.

can_extract(file_path: str, converted_filename: str | None = None, original_filename: str | None = None, parameters: dict | None = None, mime: str | None = None, extension: str | None = None) → bool: This extractor can handle any file, so the method always returns True. See the AbstractMetadataExtractor.can_extract() documentation for details about the parameters.

extract(file_path: str, converted_filename: str | None = None, original_filename: str | None = None, parameters: dict | None = None) → dict: Gets the basic meta-information about the file. See the AbstractMetadataExtractor.extract() documentation for details about the parameters.
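The file-level fields listed above can all be read from the file system with the standard library. The following is a rough sketch under stated assumptions, not dedoc's implementation: basic_file_metadata and the dictionary keys are illustrative names, and the MIME type is guessed from the file name only.

```python
import mimetypes
import os
from typing import Optional


def basic_file_metadata(file_path: str, original_filename: Optional[str] = None) -> dict:
    """Collect file-system level metadata; the key names are illustrative."""
    stat = os.stat(file_path)
    name_on_disk = os.path.basename(file_path)
    # Guessed from the file name only; None when the extension is unknown.
    mime, _ = mimetypes.guess_type(file_path)
    return {
        "file_name": original_filename or name_on_disk,
        "temporary_file_name": name_on_disk,
        "file_type": mime,
        "size": stat.st_size,
        "access_time": int(stat.st_atime),
        # On Unix st_ctime is the last metadata change, not the creation time.
        "created_time": int(stat.st_ctime),
        "modified_time": int(stat.st_mtime),
    }
```

The two name fields correspond to original_filename (the name before renaming) and the name the file carries during parsing, for example 23141.doc.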

class dedoc.metadata_extractors.DocxMetadataExtractor(*, config: dict | None = None)

Bases: AbstractMetadataExtractor

This class is used to extract metadata from docx documents. It extends the metadata retrieved by BaseMetadataExtractor.

In addition to the base fields, the following fields can be added to the metadata:

document subject;
keywords;
category;
comments;
author;
author who last modified the file;
created, modified and last printed date.

extract(file_path: str, converted_filename: str | None = None, original_filename: str | None = None, parameters: dict | None = None) → dict: Add the predefined list of metadata for docx documents. See the AbstractMetadataExtractor.extract() documentation for details about the parameters.
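A docx file is a ZIP package, and fields like the ones listed above are stored in its docProps/core.xml part. The sketch below reads them with the standard library only. It is an illustration, not the way DocxMetadataExtractor is implemented: read_docx_core_properties and the output key names are hypothetical.

```python
import xml.etree.ElementTree as ET
import zipfile

NAMESPACES = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Illustrative output key -> element inside docProps/core.xml.
CORE_FIELDS = {
    "document_subject": "dc:subject",
    "keywords": "cp:keywords",
    "category": "cp:category",
    "comments": "dc:description",
    "author": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created_date": "dcterms:created",
    "modified_date": "dcterms:modified",
    "last_printed_date": "cp:lastPrinted",
}


def read_docx_core_properties(file_path: str) -> dict:
    """Return the core properties of a docx file; absent fields map to None."""
    with zipfile.ZipFile(file_path) as archive:
        try:
            root = ET.fromstring(archive.read("docProps/core.xml"))
        except KeyError:  # the core properties part is optional
            return {key: None for key in CORE_FIELDS}
    result = {}
    for key, tag in CORE_FIELDS.items():
        element = root.find(tag, NAMESPACES)
        result[key] = element.text if element is not None else None
    return result
```

Dates are returned here as the raw strings stored in the file; converting them to timestamps is left out of the sketch.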

class dedoc.metadata_extractors.ImageMetadataExtractor(*, config: dict | None = None)

Bases: AbstractMetadataExtractor

This class is used to extract metadata from images. It extends the metadata retrieved by BaseMetadataExtractor.

In addition to the base fields, the following fields can be added to the metadata:

date time, date time digitized, date time original;
digital zoom ratio;
exif image height, image width and version;
light source;
make;
model;
orientation;
resolution unit;
software;
subject distance range;
user comment.

extract(file_path: str, converted_filename: str | None = None, original_filename: str | None = None, parameters: dict | None = None) → dict: Add the predefined list of metadata for images. See the AbstractMetadataExtractor.extract() documentation for details about the parameters.

class dedoc.metadata_extractors.NoteMetadataExtractor(*, config: dict | None = None)

Bases: AbstractMetadataExtractor

This class is used to extract metadata from documents with the extension .note.pickle. It extends the metadata retrieved by BaseMetadataExtractor.

In addition to the base fields, the author field can be added to the metadata.

can_extract(file_path: str, converted_filename: str | None = None, original_filename: str | None = None, parameters: dict | None = None, mime: str | None = None, extension: str | None = None) → bool: Check if the document has the .note.pickle extension. See the AbstractMetadataExtractor.can_extract() documentation for details about the parameters.

extract(file_path: str, converted_filename: str | None = None, original_filename: str | None = None, parameters: dict | None = None) → dict: Add the predefined list of metadata for .note.pickle documents. See the AbstractMetadataExtractor.extract() documentation for details about the parameters.
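The extension check for .note.pickle needs some care because the extension has two parts: os.path.splitext() would return only .pickle. A minimal sketch of such a check follows. The helper names are hypothetical, and the layout of the pickled note (a dict with an "author" key) is an assumption of this sketch, not something this reference specifies.

```python
import pickle
from typing import Optional


def is_note_pickle(file_path: str) -> bool:
    """True when the path carries the double extension .note.pickle."""
    return file_path.lower().endswith(".note.pickle")


def read_note_author(file_path: str) -> Optional[str]:
    """Load a pickled note and return its author, or None if it is absent."""
    with open(file_path, "rb") as file:
        # Unpickling can execute arbitrary code: only load trusted files.
        note = pickle.load(file)
    # Assumption: the note is a dict that may contain an "author" key.
    return note.get("author") if isinstance(note, dict) else None
```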

class dedoc.metadata_extractors.PdfMetadataExtractor(*, config: dict | None = None)

Bases: AbstractMetadataExtractor

This class is used to extract metadata from pdf documents. It extends the metadata retrieved by BaseMetadataExtractor.

In addition to the base fields, the following fields can be added to the metadata:

producer;
creator;
author;
title;
subject;
keywords;
creation date;
modification date.

extract(file_path: str, converted_filename: str | None = None, original_filename: str | None = None, parameters: dict | None = None) → dict: Add the predefined list of metadata for pdf documents. See the AbstractMetadataExtractor.extract() documentation for details about the parameters.