dedoc.attachments_extractors
- class dedoc.attachments_extractors.AbstractAttachmentsExtractor
This class is responsible for extracting files attached to the documents of different formats.
- abstract can_extract(extension: str, mime: str, parameters: dict | None = None) → bool
Check if this attachments extractor can get attachments of the file with the given extension.
- Parameters:
extension – file extension, for example .doc or .pdf
mime – MIME type of file
parameters – any additional parameters for given document
- Returns:
the indicator of possibility to get attachments of this file
- abstract get_attachments(tmpdir: str, filename: str, parameters: dict) → List[AttachedFile]
Extract attachments from the given file. This method can only be called on appropriate files; ensure that can_extract() is True for the given file.
- Parameters:
tmpdir – directory where file is located and where the attached files will be saved
filename – name of the file to extract attachments (not absolute path)
parameters – dict with different parameters for extracting
- Returns:
list of file’s attachments
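The contract described above (call get_attachments() only after can_extract() returns True) can be sketched without installing dedoc. The block below is a minimal illustration, not dedoc's implementation: AttachedFileStub, AttachmentsExtractorStub, TxtLinesExtractor and extract_if_supported are hypothetical names, and the stub classes only mirror the two documented method signatures.

```python
import os
import tempfile
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AttachedFileStub:
    """Hypothetical stand-in for dedoc's AttachedFile (the real class has its own fields)."""
    original_name: str
    tmp_file_path: str


class AttachmentsExtractorStub(ABC):
    """Stand-in that mirrors the two documented abstract methods."""

    @abstractmethod
    def can_extract(self, extension: str, mime: str, parameters: Optional[dict] = None) -> bool:
        ...

    @abstractmethod
    def get_attachments(self, tmpdir: str, filename: str, parameters: dict) -> List[AttachedFileStub]:
        ...


class TxtLinesExtractor(AttachmentsExtractorStub):
    """Toy extractor: treats every non-empty line of a .txt file as one attachment."""

    def can_extract(self, extension: str, mime: str, parameters: Optional[dict] = None) -> bool:
        return extension.lower() == ".txt"

    def get_attachments(self, tmpdir: str, filename: str, parameters: dict) -> List[AttachedFileStub]:
        # filename is relative to tmpdir, as in the documented interface
        with open(os.path.join(tmpdir, filename), encoding="utf-8") as source:
            lines = [line.strip() for line in source if line.strip()]
        attachments = []
        for index, line in enumerate(lines):
            name = f"line_{index}.txt"
            path = os.path.join(tmpdir, name)  # attachments are saved next to the source file
            with open(path, "w", encoding="utf-8") as out:
                out.write(line)
            attachments.append(AttachedFileStub(original_name=name, tmp_file_path=path))
        return attachments


def extract_if_supported(extractor: AttachmentsExtractorStub, tmpdir: str, filename: str, mime: str = "") -> List[AttachedFileStub]:
    """Call get_attachments() only after can_extract() has confirmed the file is suitable."""
    extension = os.path.splitext(filename)[1]
    if not extractor.can_extract(extension, mime, parameters={}):
        return []
    return extractor.get_attachments(tmpdir, filename, parameters={})
```

Returning an empty list for unsupported files is a choice made for this sketch; a caller could equally raise an error or try another extractor.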
- class dedoc.attachments_extractors.AbstractOfficeAttachmentsExtractor
Bases: AbstractAttachmentsExtractor, ABC
Extract attachments from files of Microsoft Office formats such as docx, pptx, and xlsx.
- class dedoc.attachments_extractors.DocxAttachmentsExtractor
Bases: AbstractOfficeAttachmentsExtractor
Extract attachments from docx files.
- can_extract(extension: str, mime: str, parameters: dict | None = None) → bool
Checks if this extractor can get attachments from the document (it should have .docx extension)
- get_attachments(tmpdir: str, filename: str, parameters: dict) → List[AttachedFile]
Get attachments from the given docx document.
See the AbstractAttachmentsExtractor documentation for information about the method parameters.
- class dedoc.attachments_extractors.ExcelAttachmentsExtractor
Bases: AbstractOfficeAttachmentsExtractor
Extract attachments from xlsx files.
- can_extract(extension: str, mime: str, parameters: dict | None = None) → bool
Checks if this extractor can get attachments from the document (it should have .xlsx extension)
- get_attachments(tmpdir: str, filename: str, parameters: dict) → List[AttachedFile]
Get attachments from the given xlsx document.
See the AbstractAttachmentsExtractor documentation for information about the method parameters.
- class dedoc.attachments_extractors.JsonAttachmentsExtractor
Bases: AbstractAttachmentsExtractor
Extract attachments from json files.
- can_extract(extension: str, mime: str, parameters: dict | None = None) → bool
Checks if this extractor can get attachments from the document (it should have .json extension)
- get_attachments(tmpdir: str, filename: str, parameters: dict) → List[AttachedFile]
Get attachments from the given json document. Attachments are produced as html files when the html_fields option is given in the parameters. This option should contain a list of lists of keys, converted to a string. Each list of keys is the path to html content inside the json file (the end node should be a string) that needs to be converted into a file attachment.
For example:
For json like {"a": {"b": "Some html string"}, "c": "Another html string"}
the possible value for the html_fields parameter is '[["a", "b"], ["c"]]'.
See the AbstractAttachmentsExtractor documentation for information about the method parameters.
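To make the html_fields format concrete, the sketch below shows how such key paths could be resolved against a parsed json document. It only illustrates the path semantics described above and is not dedoc's own code; collect_html_fields is a hypothetical helper, and skipping paths that are missing or do not end in a string is an assumption of this sketch.

```python
import json
from typing import Any, List


def collect_html_fields(document: dict, html_fields: str) -> List[str]:
    """Return the string found at the end of each key path listed in html_fields."""
    found = []
    # The option is a list of lists of keys, encoded as a string
    for path in json.loads(html_fields):
        node: Any = document
        for key in path:
            if not isinstance(node, dict) or key not in node:
                node = None  # path does not exist in this document
                break
            node = node[key]
        if isinstance(node, str):  # the end node should be a string
            found.append(node)
    return found


document = {"a": {"b": "Some html string"}, "c": "Another html string"}
parameters = {"html_fields": '[["a", "b"], ["c"]]'}
html_strings = collect_html_fields(document, parameters["html_fields"])
```

Here html_strings holds the two html strings from the example document; each of them is what would become a separate html attachment.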
- class dedoc.attachments_extractors.PptxAttachmentsExtractor
Bases: AbstractOfficeAttachmentsExtractor
Extract attachments from pptx files.
- can_extract(extension: str, mime: str, parameters: dict | None = None) → bool
Checks if this extractor can get attachments from the document (it should have .pptx extension)
- get_attachments(tmpdir: str, filename: str, parameters: dict) → List[AttachedFile]
Get attachments from the given pptx document.
See the AbstractAttachmentsExtractor documentation for information about the method parameters.
- class dedoc.attachments_extractors.PDFAttachmentsExtractor(*, config: dict)
Bases: AbstractAttachmentsExtractor
Extract attachments from pdf files.
- __init__(*, config: dict) → None
- Parameters:
config – configuration of the extractor, e.g. logger for logging
- can_extract(extension: str, mime: str, parameters: dict | None = None) → bool
Checks if this extractor can get attachments from the document (it should have .pdf extension)
- get_attachments(tmpdir: str, filename: str, parameters: dict) → List[AttachedFile]
Get attachments from the given pdf document.
See the AbstractAttachmentsExtractor documentation for information about the method parameters.
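Because every extractor in this module exposes the same can_extract() check, a caller can choose an extractor by asking each one in turn. The following is a rough sketch under stated assumptions: ExtensionExtractorStub and pick_extractor are hypothetical names, the stubs stand in for the real classes so the example runs without dedoc, and matching on a lowercased extension alone is an assumption (the real classes may also consider the MIME type).

```python
from typing import Iterable, Optional, Protocol


class SupportsCanExtract(Protocol):
    """Anything that offers the documented can_extract() signature."""

    def can_extract(self, extension: str, mime: str, parameters: Optional[dict] = None) -> bool:
        ...


class ExtensionExtractorStub:
    """Hypothetical stand-in that accepts exactly one file extension."""

    def __init__(self, extension: str) -> None:
        self.extension = extension

    def can_extract(self, extension: str, mime: str, parameters: Optional[dict] = None) -> bool:
        return extension.lower() == self.extension


def pick_extractor(extractors: Iterable[SupportsCanExtract], extension: str, mime: str = "") -> Optional[SupportsCanExtract]:
    """Return the first extractor that reports it can handle the file, or None."""
    for extractor in extractors:
        if extractor.can_extract(extension, mime, parameters={}):
            return extractor
    return None


# One stub per format covered by the classes documented above
extractors = [ExtensionExtractorStub(ext) for ext in (".docx", ".xlsx", ".json", ".pptx", ".pdf")]
chosen = pick_extractor(extractors, ".PDF")
```

With real dedoc classes the list would hold extractor instances instead of stubs; pick_extractor itself only relies on the can_extract() signature shown in this reference.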