Dedoc pipeline
- class dedoc.DedocManager(config: dict | None = None, manager_config: dict | None = None)
This class runs the whole document processing pipeline, which consists of the following stages:
Converting
Reading
Metadata extraction
Structure extraction
Output structure construction
Attachments handling
- __init__(config: dict | None = None, manager_config: dict | None = None) -> None
- Parameters:
config – config for document processing
manager_config – dictionary with the document processors for each pipeline stage.
- The following keys should be in the manager_config dictionary:
converter (optional) (ConverterComposition)
reader (ReaderComposition)
structure_extractor (StructureExtractorComposition)
structure_constructor (StructureConstructorComposition)
document_metadata_extractor (MetadataExtractorComposition)
attachments_handler (AttachmentsHandler)
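The key list above can be checked before a manager is constructed. The following is a minimal sketch in plain Python: the helper `missing_manager_keys` and the two constants are hypothetical names introduced here for illustration and are not part of dedoc. It only inspects key names and does not verify that the values are the composition objects named above.

```python
# Hypothetical helper, not part of dedoc: it checks a manager_config
# dictionary against the key names listed in this page.
REQUIRED_KEYS = (
    "reader",
    "structure_extractor",
    "structure_constructor",
    "document_metadata_extractor",
    "attachments_handler",
)
OPTIONAL_KEYS = ("converter",)  # marked "(optional)" in the list above


def missing_manager_keys(manager_config: dict) -> list:
    """Return the required keys that are absent, in the documented order."""
    return [key for key in REQUIRED_KEYS if key not in manager_config]
```

An empty result means every non-optional key from the list is present, so the dictionary has the documented shape.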
- parse(file_path: str, parameters: Dict[str, str] | None = None) -> ParsedDocument
Run the whole document processing pipeline. If an error occurs, the file metadata is stored in the exception’s metadata field.
- Parameters:
file_path – full path to the file
parameters – parameters that specify how to parse the file; see the Parameters description for more details
- Returns:
parsed document
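A call to parse might look like the sketch below. It assumes that dedoc is installed and that a manager built with both arguments left at None is usable as is. The helper names `stringify_parameters` and `parse_file` are hypothetical and exist only for this example. Because the signature declares parameters as Dict[str, str], the first helper converts Python values to strings.

```python
from typing import Dict, Optional


def stringify_parameters(**options) -> Dict[str, str]:
    """Convert keyword options to the Dict[str, str] shape taken by parse.

    Booleans become "true"/"false", the spelling used for flags in this page.
    """
    result = {}
    for name, value in options.items():
        if isinstance(value, bool):
            result[name] = "true" if value else "false"
        else:
            result[name] = str(value)
    return result


def parse_file(file_path: str, parameters: Optional[Dict[str, str]] = None):
    """Parse one file with a DedocManager created from default arguments."""
    # Imported lazily so that this module loads even without dedoc installed.
    from dedoc import DedocManager

    manager = DedocManager()  # config and manager_config both default to None
    return manager.parse(file_path=file_path, parameters=parameters)
```

For example, `parse_file("/path/to/file.docx", stringify_parameters(with_attachments=True))` would return a ParsedDocument; the path here is a placeholder. If parsing fails, a caller can wrap the call in try/except and read the file metadata from the exception's metadata field.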
- class dedoc.attachments_handler.AttachmentsHandler(*, config: dict | None = None)
This class is used for handling attached files:
they may be stored in a custom directory (use the attachments_dir key in the parameters to set the output directory path);
they may be ignored (if with_attachments=false in the parameters);
the metadata of the attachments may be added without parsing the files (if with_attachments=true and need_content_analysis=false in the parameters);
they may be parsed (if with_attachments=true and need_content_analysis=true in the parameters); the recursion depth for parsing nested attachments may be set via the recursion_deep_attachments parameter.
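The combinations of flags listed above can be summarized as a small decision function. This is an illustrative sketch, not dedoc's implementation: the names `is_true` and `attachment_mode` and the mode labels are hypothetical, and treating a missing flag as "false" is an assumption of this example.

```python
def is_true(parameters: dict, name: str, default: str = "false") -> bool:
    """Read a string flag; the "false" default is an assumption of this sketch."""
    return str(parameters.get(name, default)).lower() == "true"


def attachment_mode(parameters: dict) -> str:
    """Classify parameters into one of the handling modes described above."""
    if not is_true(parameters, "with_attachments"):
        return "ignore"  # attachments are skipped entirely
    if not is_true(parameters, "need_content_analysis"):
        return "metadata_only"  # metadata is added, files are not parsed
    return "parse"  # attachments are parsed as documents
```

Storing attachments in a custom directory via attachments_dir is independent of these modes, so it is left out of the classification.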
- __init__(*, config: dict | None = None) -> None
- Parameters:
config – configuration of the handler, e.g. logger for logging
- handle_attachments(document_parser: DedocManager, document: UnstructuredDocument, parameters: dict) -> List[ParsedDocument]
Handle attachments of the document in the intermediate representation.
- Parameters:
document_parser – class with a parse method for parsing attachments if needed;
document – intermediate representation of the document whose attachments need to be handled;
parameters – parameters for attachments handling (with_attachments, need_content_analysis, recursion_deep_attachments and attachments_dir are the important ones; see the API parameters documentation for more details).
- Returns:
list of parsed document attachments
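The idea of limiting recursion over nested attachments can be sketched with plain dictionaries. This does not reproduce dedoc's code: the function `parse_attachments`, the dictionary layout, and the exact meaning given to the depth value (number of attachment levels that are descended into) are all assumptions made for illustration.

```python
def parse_attachments(document: dict, depth_left: int) -> list:
    """Walk nested attachments, descending at most depth_left levels.

    Each document is a hypothetical dict with "name" and "attachments" keys.
    """
    if depth_left <= 0:
        return []  # depth budget is exhausted, deeper attachments are skipped
    parsed = []
    for attachment in document.get("attachments", []):
        parsed.append({
            "name": attachment["name"],
            # Each nested level gets one unit less of the depth budget.
            "attachments": parse_attachments(attachment, depth_left - 1),
        })
    return parsed
```

With a document whose attachment itself contains an attachment, a depth of 1 keeps only the first level, while a depth of 2 also includes the nested one.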