dedoc.attachments_extractors
- class dedoc.attachments_extractors.AbstractAttachmentsExtractor(*, config: dict | None = None, recognized_extensions: Set[str] | None = None, recognized_mimes: Set[str] | None = None)[source]
This class is responsible for extracting files attached to the documents of different formats.
- __init__(*, config: dict | None = None, recognized_extensions: Set[str] | None = None, recognized_mimes: Set[str] | None = None) None [source]
- Parameters:
config – configuration of the attachments extractor, e.g. logger for logging
recognized_extensions – set of supported file extensions, each with a leading dot, for example {.doc, .pdf}
recognized_mimes – set of supported MIME types of files
- can_extract(file_path: str | None = None, extension: str | None = None, mime: str | None = None, parameters: dict | None = None) bool [source]
Check whether this attachments extractor can get attachments from the file. Provide at least one of the following parameters: file_path, extension, mime.
- Parameters:
file_path – the path of the file to extract attachments from
extension – file extension with a dot, for example .doc or .pdf
mime – MIME type of file
parameters – any additional parameters for the given document
- Returns:
True if attachments can be extracted from this file, False otherwise
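The check described above can be pictured with a minimal sketch. This is not dedoc's implementation: the function name can_extract_sketch and the rule "match by extension or by MIME type" are assumptions made for illustration only, based on the recognized_extensions and recognized_mimes arguments documented for the constructor.

```python
import os
from typing import Optional, Set


def can_extract_sketch(recognized_extensions: Set[str],
                       recognized_mimes: Set[str],
                       file_path: Optional[str] = None,
                       extension: Optional[str] = None,
                       mime: Optional[str] = None) -> bool:
    """Hypothetical stand-in for can_extract(); dedoc's real logic may differ."""
    if file_path is None and extension is None and mime is None:
        raise ValueError("provide at least one of file_path, extension, mime")

    # Assumption: when no extension is given, derive it from the file path.
    if extension is None and file_path is not None:
        extension = os.path.splitext(file_path)[1]

    extension_ok = extension is not None and extension.lower() in recognized_extensions
    mime_ok = mime is not None and mime in recognized_mimes
    return extension_ok or mime_ok


pdf_by_path = can_extract_sketch({".pdf"}, {"application/pdf"}, file_path="report.PDF")
pdf_by_mime = can_extract_sketch({".pdf"}, {"application/pdf"}, mime="application/pdf")
txt_by_extension = can_extract_sketch({".pdf"}, {"application/pdf"}, extension=".txt")
```

In this sketch the extension keeps its leading dot, matching the format that the parameter descriptions above ask for.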
- abstract extract(file_path: str, parameters: dict | None = None) List[AttachedFile] [source]
Extract attachments from the given file. This method can only be called on appropriate files; ensure that can_extract() is True for the given file.
- Parameters:
file_path – path of the file to extract attachments from
parameters – dict with different parameters for extracting, see Attachments handling for more details
- Returns:
list of file’s attachments
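The requirement to call can_extract() before extract() can be sketched as a small guard. The classes below are hypothetical stand-ins that only mirror the documented method names; they do not import dedoc, they return plain strings instead of AttachedFile objects, and FakeDocxExtractor and safe_extract are names invented for this illustration.

```python
from abc import ABC, abstractmethod
from typing import List, Optional, Set


class ExtractorSketch(ABC):
    """Hypothetical stand-in mirroring the documented method names."""

    def __init__(self, *, recognized_extensions: Optional[Set[str]] = None) -> None:
        self.recognized_extensions = recognized_extensions or set()

    def can_extract(self, extension: Optional[str] = None) -> bool:
        return extension is not None and extension.lower() in self.recognized_extensions

    @abstractmethod
    def extract(self, file_path: str, parameters: Optional[dict] = None) -> List[str]:
        """Return the attachments of the file (plain names in this sketch)."""


class FakeDocxExtractor(ExtractorSketch):
    def __init__(self) -> None:
        super().__init__(recognized_extensions={".docx"})

    def extract(self, file_path: str, parameters: Optional[dict] = None) -> List[str]:
        # A real extractor would open the file; this one returns a fixed name.
        return ["embedded.bin"]


def safe_extract(extractor: ExtractorSketch, file_path: str, extension: str) -> List[str]:
    # Call extract() only after can_extract() confirms the file is suitable.
    if not extractor.can_extract(extension=extension):
        return []
    return extractor.extract(file_path)


docx_attachments = safe_extract(FakeDocxExtractor(), "letter.docx", ".docx")
pdf_attachments = safe_extract(FakeDocxExtractor(), "letter.pdf", ".pdf")
```

Returning an empty list for unsupported files is a design choice of this sketch; raising an error would be an equally reasonable way to handle them.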
- class dedoc.attachments_extractors.AbstractOfficeAttachmentsExtractor(*, config: dict | None = None, recognized_extensions: Set[str] | None = None, recognized_mimes: Set[str] | None = None)[source]
Bases: AbstractAttachmentsExtractor, ABC
Extract attachments from files in Microsoft Office formats such as docx, pptx, and xlsx.
- class dedoc.attachments_extractors.DocxAttachmentsExtractor(*, config: dict | None = None)[source]
Bases: AbstractOfficeAttachmentsExtractor
Extract attachments from docx files.
- extract(file_path: str, parameters: dict | None = None) List[AttachedFile] [source]
Get attachments from the given docx document.
See the AbstractAttachmentsExtractor documentation for information about this method's parameters.
- class dedoc.attachments_extractors.ExcelAttachmentsExtractor(*, config: dict | None = None)[source]
Bases: AbstractOfficeAttachmentsExtractor
Extract attachments from xlsx files.
- extract(file_path: str, parameters: dict | None = None) List[AttachedFile] [source]
Get attachments from the given xlsx document.
See the AbstractAttachmentsExtractor documentation for information about this method's parameters.
- class dedoc.attachments_extractors.JsonAttachmentsExtractor(*, config: dict | None = None)[source]
Bases: AbstractAttachmentsExtractor
Extract attachments from json files.
- extract(file_path: str, parameters: dict | None = None) List[AttachedFile] [source]
Get attachments from the given json document. Attached files are html files if the option html_fields is given in the parameters. This option should contain a list of lists of keys, converted to a string. Each list of keys is the path to html content inside the json file (the end node should be a string) that needs to be converted into a file attachment.
For example, for json like {"a": {"b": "Some html string"}, "c": "Another html string"}, a possible value for the html_fields parameter is '[["a", "b"], ["c"]]'.
See the AbstractAttachmentsExtractor documentation for information about this method's parameters.
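The html_fields convention can be illustrated with a short sketch that uses only the standard library. It assumes that "converted to string" means the list of key paths is serialized as JSON, which matches the example value above. The helper resolve_path is hypothetical and only shows how a key path leads to a string node; it is not dedoc's code.

```python
import json
from typing import Any, List


def resolve_path(document: Any, keys: List[str]) -> str:
    """Follow a list of keys into a nested dict and return the string at the end."""
    node = document
    for key in keys:
        node = node[key]
    if not isinstance(node, str):
        raise TypeError("the end node should be a string")
    return node


document = {"a": {"b": "Some html string"}, "c": "Another html string"}
paths = [["a", "b"], ["c"]]

# Assumption: html_fields is passed as a JSON string, so serialize the key paths.
parameters = {"html_fields": json.dumps(paths)}

# Each key path selects one html string that would become an attachment.
html_contents = [
    resolve_path(document, keys)
    for keys in json.loads(parameters["html_fields"])
]
```

A parameters dict built this way is the kind of value that would be passed as the parameters argument of extract().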
- class dedoc.attachments_extractors.PptxAttachmentsExtractor(*, config: dict | None = None)[source]
Bases: AbstractOfficeAttachmentsExtractor
Extract attachments from pptx files.
- extract(file_path: str, parameters: dict | None = None) List[AttachedFile] [source]
Get attachments from the given pptx document.
See the AbstractAttachmentsExtractor documentation for information about this method's parameters.
- class dedoc.attachments_extractors.PDFAttachmentsExtractor(*, config: dict | None = None)[source]
Bases: AbstractAttachmentsExtractor
Extract attachments from pdf files.
- extract(file_path: str, parameters: dict | None = None) List[AttachedFile] [source]
Get attachments from the given pdf document.
See the AbstractAttachmentsExtractor documentation for information about this method's parameters.