dedoc.attachments_extractors

class dedoc.attachments_extractors.AbstractAttachmentsExtractor(*, config: dict | None = None, recognized_extensions: Set[str] | None = None, recognized_mimes: Set[str] | None = None)

This class is responsible for extracting files attached to documents of different formats.

__init__(*, config: dict | None = None, recognized_extensions: Set[str] | None = None, recognized_mimes: Set[str] | None = None) → None
Parameters:
  • config – configuration of the attachments extractor, e.g. logger for logging

  • recognized_extensions – set of supported file extensions, each with a leading dot, for example {.doc, .pdf}

  • recognized_mimes – set of supported MIME types of files

can_extract(file_path: str | None = None, extension: str | None = None, mime: str | None = None, parameters: dict | None = None) → bool

Check whether this attachments extractor can extract attachments from the file. Provide at least one of the following parameters: file_path, extension, mime.

Parameters:
  • file_path – the path of the file to extract attachments from

  • extension – file extension with a dot, for example .doc or .pdf

  • mime – MIME type of file

  • parameters – any additional parameters for the given document

Returns:

True if attachments can be extracted from this file, False otherwise
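The check described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not dedoc's actual implementation: the class name ToyExtractor is hypothetical, and the matching rule (lower-cased extension or exact MIME type, with the extension derived from file_path when it is not given) is an assumption made for the example.

```python
import os
from typing import Optional, Set


class ToyExtractor:
    """Hypothetical stand-in that mirrors the documented can_extract signature."""

    def __init__(self, *, recognized_extensions: Optional[Set[str]] = None,
                 recognized_mimes: Optional[Set[str]] = None) -> None:
        self.recognized_extensions = recognized_extensions or set()
        self.recognized_mimes = recognized_mimes or set()

    def can_extract(self, file_path: Optional[str] = None, extension: Optional[str] = None,
                    mime: Optional[str] = None, parameters: Optional[dict] = None) -> bool:
        # Assumption: if no extension is given, derive it from the file path.
        if extension is None and file_path is not None:
            extension = os.path.splitext(file_path)[1]
        extension_ok = extension is not None and extension.lower() in self.recognized_extensions
        mime_ok = mime is not None and mime in self.recognized_mimes
        return extension_ok or mime_ok


extractor = ToyExtractor(recognized_extensions={".docx"}, recognized_mimes={"application/pdf"})
print(extractor.can_extract(extension=".docx"))
print(extractor.can_extract(mime="application/pdf"))
print(extractor.can_extract(file_path="notes.txt"))
```

Accepting any one of the three arguments keeps the check usable both when only a path is known and when the caller has already detected the file type.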

abstract extract(file_path: str, parameters: dict | None = None) → List[AttachedFile]

Extract attachments from the given file. Call this method only on supported files: make sure that can_extract() returns True for the given file first.

Parameters:
  • file_path – path of the file to extract attachments from

  • parameters – dict with extraction parameters; see Attachments handling for more details

Returns:

list of the file’s attachments

static with_attachments(parameters: dict) → bool

Check if the option with_attachments is true in the parameters.

Parameters:
  • parameters – parameters for the attachments extractor

Returns:

indicator if with_attachments option is true
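As a rough illustration of such a check, the sketch below reads the with_attachments key from a parameters dict. It is not taken from the dedoc source: the default of "false" when the key is missing and the case-insensitive comparison with the string "true" are assumptions made for the example.

```python
def with_attachments(parameters: dict) -> bool:
    # Assumption: the option may arrive as a string ("true"/"false") or as a bool,
    # and a missing key means attachments are not requested.
    value = parameters.get("with_attachments", "false")
    return str(value).lower() == "true"


print(with_attachments({"with_attachments": "true"}))
print(with_attachments({}))
```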

class dedoc.attachments_extractors.AbstractOfficeAttachmentsExtractor(*, config: dict | None = None, recognized_extensions: Set[str] | None = None, recognized_mimes: Set[str] | None = None)

Bases: AbstractAttachmentsExtractor, ABC

Extract attachments from Microsoft Office files such as docx, pptx, and xlsx.

class dedoc.attachments_extractors.DocxAttachmentsExtractor(*, config: dict | None = None)

Bases: AbstractOfficeAttachmentsExtractor

Extract attachments from docx files.

extract(file_path: str, parameters: dict | None = None) → List[AttachedFile]

Get attachments from the given docx document.

See the AbstractAttachmentsExtractor documentation for details on the method’s parameters.

class dedoc.attachments_extractors.ExcelAttachmentsExtractor(*, config: dict | None = None)

Bases: AbstractOfficeAttachmentsExtractor

Extract attachments from xlsx files.

extract(file_path: str, parameters: dict | None = None) → List[AttachedFile]

Get attachments from the given xlsx document.

See the AbstractAttachmentsExtractor documentation for details on the method’s parameters.

class dedoc.attachments_extractors.JsonAttachmentsExtractor(*, config: dict | None = None)

Bases: AbstractAttachmentsExtractor

Extract attachments from json files.

extract(file_path: str, parameters: dict | None = None) → List[AttachedFile]

Get attachments from the given json document. Attached files are html files, produced only if the option html_fields is given in the parameters. This option should contain a list of lists of keys, serialized as a string. Each list of keys is a path to html content inside the json file (the end node should be a string) that needs to be converted into a file attachment.

For example, for a json document like {"a": {"b": "Some html string"}, "c": "Another html string"}, a possible value of the html_fields parameter is '[["a", "b"], ["c"]]'.

See the AbstractAttachmentsExtractor documentation for details on the method’s parameters.
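To make the html_fields key paths concrete, here is a small sketch of how such paths could be resolved against a parsed json document. It only illustrates the lookup described above and is not the extractor's actual code: the helper name select_html_fields is hypothetical, and silently skipping paths that are missing or that do not end in a string is an assumption.

```python
import json
from typing import Any, List


def select_html_fields(document: Any, html_fields: str) -> List[str]:
    """Return the strings found at each key path listed in html_fields (hypothetical helper)."""
    key_paths = json.loads(html_fields)  # e.g. '[["a", "b"], ["c"]]' -> [["a", "b"], ["c"]]
    found = []
    for path in key_paths:
        node = document
        for key in path:
            # Assumption: a missing key or a non-dict intermediate node ends the lookup.
            if not isinstance(node, dict) or key not in node:
                node = None
                break
            node = node[key]
        if isinstance(node, str):  # the end node should be a string
            found.append(node)
    return found


document = {"a": {"b": "Some html string"}, "c": "Another html string"}
print(select_html_fields(document, '[["a", "b"], ["c"]]'))
```

Each string selected this way corresponds to one html attachment in the description above.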

class dedoc.attachments_extractors.PptxAttachmentsExtractor(*, config: dict | None = None)

Bases: AbstractOfficeAttachmentsExtractor

Extract attachments from pptx files.

extract(file_path: str, parameters: dict | None = None) → List[AttachedFile]

Get attachments from the given pptx document.

See the AbstractAttachmentsExtractor documentation for details on the method’s parameters.

class dedoc.attachments_extractors.PDFAttachmentsExtractor(*, config: dict | None = None)

Bases: AbstractAttachmentsExtractor

Extract attachments from pdf files.

extract(file_path: str, parameters: dict | None = None) → List[AttachedFile]

Get attachments from the given pdf document.

See the AbstractAttachmentsExtractor documentation for details on the method’s parameters.