dedoc.metadata_extractors
- class dedoc.metadata_extractors.MetadataExtractorComposition(extractors: List[AbstractMetadataExtractor])[source]
This class extracts metadata from any document using the available list of metadata extractors. The list of metadata extractors is set via the class constructor. The first suitable extractor is used (the one whose method can_extract() returns True), so the order of extractors is important.
- __init__(extractors: List[AbstractMetadataExtractor]) None [source]
- Parameters:
extractors – the list of extractors with methods can_extract() and extract() to extract metadata from file
- extract(file_path: str, converted_filename: str | None = None, original_filename: str | None = None, parameters: dict | None = None, extension: str | None = None, mime: str | None = None) dict [source]
Extract metadata using the first suitable extractor, if one is found.
- Parameters:
file_path – path to the file to extract metadata from. If dedoc manager is used, the file gets a new name during processing; this name should be passed here (for example 23141.doc)
converted_filename – name of the file after renaming and conversion (if dedoc manager is used, for example 23141.docx); by default it is the name taken from file_path. The converted file should be located in the same directory as the file before conversion.
original_filename – name of the file before renaming (if dedoc manager is used); by default it is the name taken from file_path
parameters – additional parameters for document parsing, see Parameters description for more details
extension – file extension, for example .doc or .pdf
mime – MIME type of file
- Returns:
dict with metadata information about the document
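The effect of extractor order can be illustrated with a minimal standalone sketch. It does not import dedoc: `FirstMatchComposition`, `PdfOnly` and `Fallback` are hypothetical stand-ins with simplified signatures, the returned keys are invented for illustration, and raising an error when nothing matches is an assumption about the behaviour, not a documented fact.

```python
from typing import List, Optional


class PdfOnly:
    """Hypothetical extractor that accepts only files with the .pdf extension."""

    def can_extract(self, file_path: str, extension: Optional[str] = None) -> bool:
        return (extension or "").lower() == ".pdf"

    def extract(self, file_path: str) -> dict:
        return {"extractor": "pdf", "file_path": file_path}


class Fallback:
    """Hypothetical extractor that accepts any file."""

    def can_extract(self, file_path: str, extension: Optional[str] = None) -> bool:
        return True

    def extract(self, file_path: str) -> dict:
        return {"extractor": "fallback", "file_path": file_path}


class FirstMatchComposition:
    """Stand-in composition: the first extractor whose can_extract() is True wins."""

    def __init__(self, extractors: List) -> None:
        self.extractors = extractors

    def extract(self, file_path: str, extension: Optional[str] = None) -> dict:
        for extractor in self.extractors:
            if extractor.can_extract(file_path, extension=extension):
                return extractor.extract(file_path)
        # Assumption: with no suitable extractor, the sketch raises an error.
        raise ValueError(f"no suitable extractor for {file_path}")


# Same extractors, different order.
specific_first = FirstMatchComposition([PdfOnly(), Fallback()])
generic_first = FirstMatchComposition([Fallback(), PdfOnly()])

pdf_result = specific_first.extract("23141.pdf", extension=".pdf")
shadowed_result = generic_first.extract("23141.pdf", extension=".pdf")
```

In `generic_first` the catch-all extractor is listed first, so the PDF-specific one is never reached. Under the same logic, a catch-all extractor would normally go last in the list.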
- class dedoc.metadata_extractors.AbstractMetadataExtractor(*, config: dict | None = None, recognized_extensions: Set[str] | None = None, recognized_mimes: Set[str] | None = None)[source]
This class is responsible for extracting metadata from the documents of different formats.
- __init__(*, config: dict | None = None, recognized_extensions: Set[str] | None = None, recognized_mimes: Set[str] | None = None) None [source]
- Parameters:
config – configuration of the extractor, e.g. logger for logging
recognized_extensions – set of supported file extensions with a leading dot, for example {.doc, .pdf}
recognized_mimes – set of supported MIME types of files
- can_extract(file_path: str, converted_filename: str | None = None, original_filename: str | None = None, parameters: dict | None = None, mime: str | None = None, extension: str | None = None) bool [source]
Check if this extractor can handle the given file.
- Parameters:
file_path – path to the file to extract metadata from. If dedoc manager is used, the file gets a new name during processing; this name should be passed here (for example 23141.doc)
converted_filename – name of the file after renaming and conversion (if dedoc manager is used, for example 23141.docx); by default it is the name taken from file_path. The converted file should be located in the same directory as the file before conversion.
original_filename – name of the file before renaming (if dedoc manager is used); by default it is the name taken from file_path
parameters – additional parameters for document parsing, see Parameters description for more details
mime – MIME type of a file
extension – file extension, for example .doc or .pdf
- Returns:
True if the extractor can handle the given file and False otherwise
- abstract extract(file_path: str, converted_filename: str | None = None, original_filename: str | None = None, parameters: dict | None = None) dict [source]
Extract metadata from the file if possible, i.e. if the method can_extract() returned True.
- Parameters:
file_path – path to the file to extract metadata from. If dedoc manager is used, the file gets a new name during processing; this name should be passed here (for example 23141.doc)
converted_filename – name of the file after renaming and conversion (if dedoc manager is used, for example 23141.docx); by default it is the name taken from file_path. The converted file should be located in the same directory as the file before conversion.
original_filename – name of the file before renaming (if dedoc manager is used); by default it is the name taken from file_path
parameters – additional parameters for document parsing, see Parameters description for more details
- Returns:
dict with metadata information about the document
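A rough sketch of how a concrete extractor might be built on such an interface is shown below. It is self-contained and does not import dedoc: `SketchExtractor` and `TxtExtractor` are hypothetical, the signatures are simplified, and the rule "accept the file if its extension or its MIME type is recognized" is an assumption about how can_extract() uses the two sets.

```python
import os
from abc import ABC, abstractmethod
from typing import Optional, Set


class SketchExtractor(ABC):
    """Hypothetical stand-in for the abstract interface; not dedoc's code."""

    def __init__(self, *, config: Optional[dict] = None,
                 recognized_extensions: Optional[Set[str]] = None,
                 recognized_mimes: Optional[Set[str]] = None) -> None:
        self.config = {} if config is None else config
        self.recognized_extensions = set() if recognized_extensions is None else recognized_extensions
        self.recognized_mimes = set() if recognized_mimes is None else recognized_mimes

    def can_extract(self, file_path: str, mime: Optional[str] = None,
                    extension: Optional[str] = None) -> bool:
        # Assumption: the extension is derived from the path when not given explicitly.
        if extension is None:
            extension = os.path.splitext(file_path)[1]
        # Assumption: a match on either the extension or the MIME type is enough.
        return extension.lower() in self.recognized_extensions or mime in self.recognized_mimes

    @abstractmethod
    def extract(self, file_path: str, parameters: Optional[dict] = None) -> dict:
        """Return a dict with metadata; subclasses must implement this."""


class TxtExtractor(SketchExtractor):
    """Hypothetical extractor for plain-text files."""

    def __init__(self, *, config: Optional[dict] = None) -> None:
        super().__init__(config=config,
                         recognized_extensions={".txt"},
                         recognized_mimes={"text/plain"})

    def extract(self, file_path: str, parameters: Optional[dict] = None) -> dict:
        # The single key returned here is illustrative only.
        return {"file_name": os.path.basename(file_path)}


extractor = TxtExtractor()
if extractor.can_extract("docs/notes.txt"):
    result = extractor.extract("docs/notes.txt")
```

The subclass only fixes the recognized sets and implements extract(); the suitability check is inherited, which mirrors the split between can_extract() and the abstract extract() described above.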
- class dedoc.metadata_extractors.BaseMetadataExtractor(*, config: dict | None = None, recognized_extensions: Set[str] | None = None, recognized_mimes: Set[str] | None = None)[source]
Bases: AbstractMetadataExtractor
This metadata extractor extracts metadata from documents of any format.
- It returns the following information about the given file:
file name;
file name during parsing (unique);
file type (MIME);
file size in bytes;
time when the file was last accessed;
time when the file was created;
time when the file was last modified.
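The fields listed above are all available from the file system. The following sketch collects them with the standard library only; `basic_file_metadata` and the dictionary key names are hypothetical and are not claimed to match the keys that dedoc returns.

```python
import mimetypes
import os
import tempfile
from typing import Optional


def basic_file_metadata(file_path: str, original_filename: Optional[str] = None) -> dict:
    """Collect format-independent metadata for a file (illustrative key names)."""
    stat = os.stat(file_path)
    # Guess the MIME type from the file name; returns None if it is unknown.
    mime, _ = mimetypes.guess_type(file_path)
    return {
        "file_name": original_filename or os.path.basename(file_path),
        "temporary_file_name": os.path.basename(file_path),
        "file_type": mime,
        "size": stat.st_size,  # size in bytes
        "access_time": int(stat.st_atime),
        # Note: st_ctime is the creation time on Windows,
        # but the last metadata change time on most Unix systems.
        "created_time": int(stat.st_ctime),
        "modified_time": int(stat.st_mtime),
    }


# Usage: a temporary file stands in for a renamed upload such as 23141.txt.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "23141.txt")
    with open(path, "wb") as file:
        file.write(b"hello")
    metadata = basic_file_metadata(path, original_filename="report.txt")
```

Keeping both the original name and the name used during parsing lets the caller map results back to the uploaded file even after it has been renamed.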
- can_extract(file_path: str, converted_filename: str | None = None, original_filename: str | None = None, parameters: dict | None = None, mime: str | None = None, extension: str | None = None) bool [source]
This extractor can handle any file, so the method always returns True. See the can_extract() documentation for information about the parameters.
- class dedoc.metadata_extractors.DocxMetadataExtractor(*, config: dict | None = None)[source]
Bases: AbstractMetadataExtractor
This class is used to extract metadata from docx documents. It extends the metadata retrieved by BaseMetadataExtractor.
- In addition to those fields, the following fields can be added to the metadata:
document subject;
keywords;
category;
comments;
author;
author who last modified the file;
created, modified and last printed date.
- class dedoc.metadata_extractors.ImageMetadataExtractor(*, config: dict | None = None)[source]
Bases: AbstractMetadataExtractor
This class is used to extract metadata from images. It extends the metadata retrieved by BaseMetadataExtractor.
- In addition to those fields, the following fields can be added to the metadata:
date time, date time digitized, date time original;
digital zoom ratio;
exif image height, image width and version;
light source;
make;
model;
orientation;
resolution unit;
software;
subject distance range;
user comment.
- class dedoc.metadata_extractors.NoteMetadataExtractor(*, config: dict | None = None)[source]
Bases: AbstractMetadataExtractor
This class is used to extract metadata from documents with the extension .note.pickle. It extends the metadata retrieved by BaseMetadataExtractor. In addition to those fields, the author field can be added to the metadata.
- can_extract(file_path: str, converted_filename: str | None = None, original_filename: str | None = None, parameters: dict | None = None, mime: str | None = None, extension: str | None = None) bool [source]
Check if the document has the .note.pickle extension. See the can_extract() documentation for information about the parameters.
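A compound extension such as .note.pickle cannot be detected with `os.path.splitext`, which returns only the last suffix. A minimal sketch of a suffix check follows; `has_note_pickle_extension` is a hypothetical helper, and the case-insensitive comparison is an assumption, not documented behaviour.

```python
import os


def has_note_pickle_extension(file_name: str) -> bool:
    """Return True if the name ends with the compound .note.pickle extension."""
    # Assumption: the comparison ignores letter case.
    return file_name.lower().endswith(".note.pickle")


# os.path.splitext only sees the final suffix of a compound extension.
last_suffix = os.path.splitext("meeting.note.pickle")[1]  # ".pickle"

is_note = has_note_pickle_extension("meeting.note.pickle")
is_plain_pickle = has_note_pickle_extension("model.pickle")
```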
- class dedoc.metadata_extractors.PdfMetadataExtractor(*, config: dict | None = None)[source]
Bases: AbstractMetadataExtractor
This class is used to extract metadata from pdf documents. It extends the metadata retrieved by BaseMetadataExtractor.
- In addition to those fields, the following fields can be added to the metadata:
producer;
creator;
author;
title;
subject;
keywords;
creation date;
modification date.