dedoc.attachments_extractors

class dedoc.attachments_extractors.AbstractAttachmentsExtractor

This class is responsible for extracting files attached to documents of different formats.

abstract can_extract(extension: str, mime: str, parameters: dict | None = None) -> bool

Check whether this attachments extractor can get attachments from a file with the given extension.

Parameters:
  • extension – file extension, for example .doc or .pdf

  • mime – MIME type of file

  • parameters – any additional parameters for given document

Returns:

True if attachments can be extracted from this file, False otherwise

abstract get_attachments(tmpdir: str, filename: str, parameters: dict) -> List[AttachedFile]

Extract attachments from the given file. This method should only be called on appropriate files: ensure that can_extract() returns True for the given file first.

Parameters:
  • tmpdir – directory where file is located and where the attached files will be saved

  • filename – name of the file to extract attachments from (not an absolute path)

  • parameters – dict with parameters for the extraction

Returns:

list of file’s attachments
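The calling convention described above (check can_extract() first, then pass a directory and a bare file name to get_attachments()) can be sketched as follows. To keep the sketch runnable without dedoc installed, it uses a hypothetical stand-in class, StandInExtractor, which only mirrors the documented method signatures and returns plain file names instead of AttachedFile objects. It is an illustration under those assumptions, not dedoc code.

```python
import os
import tempfile
from typing import List, Optional


class StandInExtractor:
    """Hypothetical stand-in mirroring the documented interface; not part of dedoc."""

    def can_extract(self, extension: str, mime: str, parameters: Optional[dict] = None) -> bool:
        # Assumption: this toy extractor only handles ".txt" files.
        return extension.lower() == ".txt"

    def get_attachments(self, tmpdir: str, filename: str, parameters: dict) -> List[str]:
        # Toy behaviour: save a copy of the file into tmpdir and report it
        # as the single "attachment" (a real extractor returns AttachedFile objects).
        src = os.path.join(tmpdir, filename)
        dst = os.path.join(tmpdir, "attachment_" + filename)
        with open(src, "rb") as fin, open(dst, "wb") as fout:
            fout.write(fin.read())
        return [os.path.basename(dst)]


def extract_if_supported(extractor, path: str, mime: str, parameters: dict) -> list:
    """Call get_attachments() only when can_extract() returns True."""
    # get_attachments() expects the directory and the bare file name separately.
    tmpdir, filename = os.path.split(path)
    extension = os.path.splitext(filename)[1]
    if not extractor.can_extract(extension, mime, parameters):
        return []
    return extractor.get_attachments(tmpdir, filename, parameters)


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as tmpdir:
        txt_path = os.path.join(tmpdir, "note.txt")
        with open(txt_path, "w") as f:
            f.write("hello")
        pdf_path = os.path.join(tmpdir, "scan.pdf")  # never opened: rejected by can_extract()
        supported = extract_if_supported(StandInExtractor(), txt_path, "text/plain", {})
        unsupported = extract_if_supported(StandInExtractor(), pdf_path, "application/pdf", {})
        return supported, unsupported
```

The guard in extract_if_supported keeps get_attachments() from ever being called on a file the extractor has not accepted, which is the precondition the method description states.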

static with_attachments(parameters: dict) -> bool

Check if the option with_attachments is true in the parameters.

Parameters:

parameters – parameters for the attachment extractor

Returns:

True if the with_attachments option is true, False otherwise
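One plausible way such a check could work is sketched below. It assumes the option may arrive either as a bool or as a string such as "true" (which is common when parameters come from a web form); the helper name option_is_true is hypothetical, and this is not necessarily how dedoc implements with_attachments().

```python
def option_is_true(parameters: dict, name: str = "with_attachments") -> bool:
    """Hypothetical helper: True / "true" / "True" count as enabled;
    anything else, including a missing key, counts as disabled."""
    value = parameters.get(name, False)
    return str(value).strip().lower() == "true"
```

Normalizing through str() lets the same comparison handle both Python booleans and their string forms.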

class dedoc.attachments_extractors.AbstractOfficeAttachmentsExtractor

Bases: AbstractAttachmentsExtractor, ABC

Extract attachments from Microsoft Office files such as docx, pptx, and xlsx.

class dedoc.attachments_extractors.DocxAttachmentsExtractor

Bases: AbstractOfficeAttachmentsExtractor

Extract attachments from docx files.

can_extract(extension: str, mime: str, parameters: dict | None = None) -> bool

Check whether this extractor can get attachments from the document (it should have the .docx extension).

get_attachments(tmpdir: str, filename: str, parameters: dict) -> List[AttachedFile]

Get attachments from the given docx document.

See the AbstractAttachmentsExtractor documentation for information about the method's parameters.

class dedoc.attachments_extractors.ExcelAttachmentsExtractor

Bases: AbstractOfficeAttachmentsExtractor

Extract attachments from xlsx files.

can_extract(extension: str, mime: str, parameters: dict | None = None) -> bool

Check whether this extractor can get attachments from the document (it should have the .xlsx extension).

get_attachments(tmpdir: str, filename: str, parameters: dict) -> List[AttachedFile]

Get attachments from the given xlsx document.

See the AbstractAttachmentsExtractor documentation for information about the method's parameters.

class dedoc.attachments_extractors.JsonAttachmentsExtractor

Bases: AbstractAttachmentsExtractor

Extract attachments from json files.

can_extract(extension: str, mime: str, parameters: dict | None = None) -> bool

Check whether this extractor can get attachments from the document (it should have the .json extension).

get_attachments(tmpdir: str, filename: str, parameters: dict) -> List[AttachedFile]

Get attachments from the given json document. The attached files are html files, produced if the html_fields option is given in the parameters. This option should contain a list of lists of keys, serialized as a string. Each list of keys is a path to html content inside the json file (the end node should be a string) that needs to be converted into a file attachment.

For example:

For json like {"a": {"b": "Some html string"}, "c": "Another html string"}

a possible value for the html_fields parameter is '[["a", "b"], ["c"]]'.

See the AbstractAttachmentsExtractor documentation for information about the method's parameters.
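The html_fields example can be sketched in plain Python. The snippet builds the parameter value with json.dumps and shows how each key path points at a string node in the document. The helper resolve_path is hypothetical and only illustrates the path semantics described here, not dedoc's internal implementation.

```python
import json
from typing import List

document = {"a": {"b": "Some html string"}, "c": "Another html string"}

# html_fields is a list of key paths, serialized to a string.
html_fields = json.dumps([["a", "b"], ["c"]])
parameters = {"html_fields": html_fields}


def resolve_path(data: dict, keys: List[str]) -> str:
    """Hypothetical helper: follow the keys from the root of the json;
    the end node is expected to be a string."""
    node = data
    for key in keys:
        node = node[key]
    if not isinstance(node, str):
        raise ValueError(f"end node for path {keys} is not a string")
    return node


# Each path selects one html string that would become a separate attachment.
html_contents = [resolve_path(document, keys) for keys in json.loads(parameters["html_fields"])]
```

Serializing the paths with json.dumps avoids hand-writing the nested quotes of the string value.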

class dedoc.attachments_extractors.PptxAttachmentsExtractor

Bases: AbstractOfficeAttachmentsExtractor

Extract attachments from pptx files.

can_extract(extension: str, mime: str, parameters: dict | None = None) -> bool

Check whether this extractor can get attachments from the document (it should have the .pptx extension).

get_attachments(tmpdir: str, filename: str, parameters: dict) -> List[AttachedFile]

Get attachments from the given pptx document.

See the AbstractAttachmentsExtractor documentation for information about the method's parameters.

class dedoc.attachments_extractors.PDFAttachmentsExtractor(*, config: dict)

Bases: AbstractAttachmentsExtractor

Extract attachments from pdf files.

__init__(*, config: dict) -> None

Parameters:

config – configuration of the extractor, e.g. a logger for logging

can_extract(extension: str, mime: str, parameters: dict | None = None) -> bool

Check whether this extractor can get attachments from the document (it should have the .pdf extension).

get_attachments(tmpdir: str, filename: str, parameters: dict) -> List[AttachedFile]

Get attachments from the given pdf document.

See the AbstractAttachmentsExtractor documentation for information about the method's parameters.