Dedoc pipeline

class dedoc.DedocManager(config: dict | None = None, manager_config: dict | None = None)

This class runs the whole document processing pipeline:

  1. Converting

  2. Reading

  3. Metadata extraction

  4. Structure extraction

  5. Output structure construction

  6. Attachments handling

__init__(config: dict | None = None, manager_config: dict | None = None) → None
Parameters:
  • config – configuration for document processing

  • manager_config – dictionary with the document processors for each pipeline stage.

The following keys should be in the manager_config dictionary:
parse(file_path: str, parameters: Dict[str, str] | None = None) → ParsedDocument

Run the whole document processing pipeline. If an error occurs, the file metadata is stored in the exception’s metadata field.

Parameters:
  • file_path – full path where the file is located

  • parameters – parameters that specify how to parse the file; see the parameters description for more details

Returns:

parsed document
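A minimal sketch of calling parse and reading the metadata from a failed run. The helper name parse_or_metadata and its manager argument are hypothetical and exist only for illustration; the sketch assumes DedocManager can be created with its default arguments and that the exception exposes the metadata as an attribute named metadata.

```python
from typing import Any, Dict, Optional, Tuple


def parse_or_metadata(
    file_path: str,
    parameters: Optional[Dict[str, str]] = None,
    manager: Any = None,
) -> Tuple[Any, Any]:
    """Hypothetical helper: return (parsed_document, None) or (None, metadata)."""
    if manager is None:
        # Imported lazily so the sketch can be loaded without dedoc installed.
        from dedoc import DedocManager

        manager = DedocManager()  # assumption: default config and manager_config
    try:
        return manager.parse(file_path=file_path, parameters=parameters), None
    except Exception as error:
        # The file metadata is described as stored in the exception's metadata field;
        # getattr with a default covers exceptions that do not carry it.
        return None, getattr(error, "metadata", None)
```

Passing an existing manager avoids rebuilding the pipeline for every file. A real caller would probably log or re-raise the error rather than discard it.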

class dedoc.attachments_handler.AttachmentsHandler(*, config: dict | None = None)

This class is used for handling attached files:

  • they may be stored in a custom directory (use the attachments_dir key in the parameters to set the output directory path);

  • they may be ignored (if with_attachments=false in the parameters);

  • their metadata may be added without parsing the files (if with_attachments=true and need_content_analysis=false in the parameters);

  • they may be parsed (if with_attachments=true and need_content_analysis=true in the parameters); the recursion depth of the parsing may be set via the recursion_deep_attachments parameter.
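The parameter combinations above can be summarized as a small decision function. This is an illustrative sketch, not the handler's implementation: the function name attachment_mode and the mode labels are hypothetical, and treating a missing key as "false" is an assumption.

```python
from typing import Dict


def _flag(parameters: Dict[str, str], key: str) -> bool:
    # Assumption: a missing key counts as "false"; values are compared case-insensitively.
    return str(parameters.get(key, "false")).strip().lower() == "true"


def attachment_mode(parameters: Dict[str, str]) -> str:
    """Hypothetical helper: name the handling mode implied by the parameters."""
    if not _flag(parameters, "with_attachments"):
        return "ignored"
    if not _flag(parameters, "need_content_analysis"):
        return "metadata_only"
    return "parsed"


print(attachment_mode({"with_attachments": "false"}))
print(attachment_mode({"with_attachments": "true", "need_content_analysis": "false"}))
print(attachment_mode({"with_attachments": "true", "need_content_analysis": "true"}))
```

The three calls print ignored, metadata_only, and parsed in that order. The attachments_dir and recursion_deep_attachments keys do not change the mode, so the sketch leaves them out.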

__init__(*, config: dict | None = None) → None
Parameters:

config – configuration of the handler, e.g. logger for logging

handle_attachments(document_parser: DedocManager, document: UnstructuredDocument, parameters: dict) → List[ParsedDocument]

Handle attachments of the document in the intermediate representation.

Parameters:
  • document_parser – object with a parse method, used to parse attachments if needed;

  • document – intermediate representation of the document whose attachments need to be handled;

  • parameters – parameters for attachments handling (with_attachments, need_content_analysis, recursion_deep_attachments and attachments_dir are the important ones; see the API parameters documentation for more details).

Returns:

list of parsed document attachments