Adding support for a new document format to Dedoc

Suppose you need to add support for a new format “newtype”. Several ways of document processing exist:

Converter - you can write a converter from one document format to another;
Reader - you can write a format-specific handler;
AttachmentExtractor - if a document contains attachments, an attachment extractor lets you extract them.

General scheme of adding Converter

When a parser already exists for a format that another format converts to well, it is convenient to write a converter. For example, if we know how to parse documents in docx format, but we need to process documents in doc format, we can write a converter from doc to docx.

Implement NewtypeConverter class. This class must inherit from the abstract class AbstractConverter in dedoc/converters/concrete_converters/abstract_converter.py. You should call the constructor of the base class in the constructor of the current class.

from typing import Optional

from dedoc.converters.concrete_converters.abstract_converter import AbstractConverter

class NewtypeConverter(AbstractConverter):
    def __init__(self, config: Optional[dict] = None) -> None:
        super().__init__(config=config)

    def can_convert(self,
                    file_path: Optional[str] = None,
                    extension: Optional[str] = None,
                    mime: Optional[str] = None,
                    parameters: Optional[dict] = None) -> bool:
        pass  # some code here

    def convert(self, file_path: str, parameters: Optional[dict] = None) -> str:
        pass  # some code here

Implement converter methods to convert other formats to this format:

can_convert() method checks whether the new converter can process the file; for example, you can return True for a list of specific file extensions.
convert() method performs the required file conversion.
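As an illustration of the extension check described above, here is a minimal standard-library sketch. The helper has_supported_extension and the extension set are hypothetical names chosen for the imaginary “newtype” format; they are not part of dedoc.

```python
import os
from typing import Optional

# Hypothetical extensions for the imaginary "newtype" format (an assumption).
SUPPORTED_EXTENSIONS = frozenset({".newtype", ".nt"})


def has_supported_extension(file_path: Optional[str] = None, extension: Optional[str] = None) -> bool:
    """Return True if the explicit extension, or the one taken from file_path, is supported."""
    if extension is None and file_path is not None:
        # splitext returns (root, ext); ext keeps the leading dot
        extension = os.path.splitext(file_path)[1]
    return extension is not None and extension.lower() in SUPPORTED_EXTENSIONS
```

A can_convert() implementation could delegate to a check of this kind and stay cheap, since it never opens the file.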

Add the converter to manager config, see Adding the implemented handlers to the manager config.

General scheme of adding Reader

Implement NewtypeReader class. This class must inherit the abstract class BaseReader.

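from typing import Optional

from dedoc.data_structures.unstructured_document import UnstructuredDocument
from dedoc.readers.base_reader import BaseReader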

class NewtypeReader(BaseReader):

    def can_read(self, file_path: Optional[str] = None, mime: Optional[str] = None, extension: Optional[str] = None, parameters: Optional[dict] = None) -> bool:
        pass  # some code here

    def read(self, file_path: str, parameters: Optional[dict] = None) -> UnstructuredDocument:
        pass  # some code here

You should implement reader methods according to specific file format processing.

can_read() method checks if the given file can be processed. The check relies on the path to the file, the file extension or the mime type. It is better to make this method fast because it will be called frequently.
read() method must form UnstructuredDocument (document lines, tables and attachments).

Add the reader to manager config, see Adding the implemented handlers to the manager config.

General scheme of adding AttachmentExtractor

Implement the class NewtypeAttachmentsExtractor. This class must inherit the abstract class AbstractAttachmentsExtractor.

from typing import List, Optional
from dedoc.data_structures.attached_file import AttachedFile
from dedoc.attachments_extractors.abstract_attachment_extractor import AbstractAttachmentsExtractor

class NewtypeAttachmentsExtractor(AbstractAttachmentsExtractor):
    def can_extract(self,
                    file_path: Optional[str] = None,
                    extension: Optional[str] = None,
                    mime: Optional[str] = None,
                    parameters: Optional[dict] = None) -> bool:
        pass  # some code here

    def extract(self, file_path: str, parameters: Optional[dict] = None) -> List[AttachedFile]:
        pass  # some code here

You should implement methods according to the specifics of extracting attachments for this format.

can_extract() method checks whether the new extractor can process the file; for example, you can return True for a list of specific file extensions.
extract() method should return a list of attachments extracted from the document, one AttachedFile per attachment. You can see its code in dedoc/data_structures/attached_file.py.

Add attachments extractor to the reader’s code.

You should add the line self.attachment_extractor = NewtypeAttachmentsExtractor(config=self.config) to the constructor of the NewtypeReader class and add attachments extraction to the read method:

class NewtypeReader(BaseReader):
    def __init__(self, config: Optional[dict] = None) -> None:
        super().__init__(config=config)
        self.attachment_extractor = NewtypeAttachmentsExtractor(config=self.config)

    def read(self, file_path: str, parameters: Optional[dict] = None) -> UnstructuredDocument:
        # some code
        attachments = self.attachment_extractor.extract(file_path=file_path, parameters=parameters)
        # some code

Example of adding pdf/djvu handlers

Suppose we want to add the ability to handle pdf/djvu documents with a text layer. We don’t want to deal with two formats, because we can convert djvu to pdf. The following steps are proposed:

Implementing the converter from djvu to pdf DjvuConverter.
Implementing PdfAttachmentsExtractor.
Implementing PdfReader.
Adding the implemented handlers to the manager config.

Let’s describe each step in more detail.

Implementing the converter from djvu to pdf DjvuConverter

Implement class DjvuConverter.

import os
from typing import Optional

from dedoc.converters.concrete_converters.abstract_converter import AbstractConverter
from dedoc.utils.utils import get_mime_extension, splitext_


class DjvuConverter(AbstractConverter):

    def __init__(self, config: Optional[dict] = None) -> None:
        super().__init__(config=config)

    def can_convert(self,
                    file_path: Optional[str] = None,
                    extension: Optional[str] = None,
                    mime: Optional[str] = None,
                    parameters: Optional[dict] = None) -> bool:
        _, extension = get_mime_extension(file_path=file_path, mime=mime, extension=extension)
        return extension == ".djvu"

    def convert(self, file_path: str, parameters: Optional[dict] = None) -> str:
        file_dir, file_name = os.path.split(file_path)
        name_wo_ext, _ = splitext_(file_name)
        converted_file_path = os.path.join(file_dir, f"{name_wo_ext}.pdf")
        command = ["ddjvu", "--format=pdf", file_path, converted_file_path]
        self._run_subprocess(command=command, filename=file_name, expected_path=converted_file_path)

        return converted_file_path

You should implement the following methods:

can_convert(): return True if the file extension is .djvu. See the file dedoc/extensions.py for more precise handling of extensions.
convert(): use the ddjvu utility and run it with the _run_subprocess method, which ensures that the converted file was saved.
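The internals of _run_subprocess are not shown here. The following standard-library sketch only illustrates the general idea of running an external converter and confirming that the output file exists; converted_path and run_and_check are hypothetical helpers, not dedoc functions, and the 60-second timeout is an arbitrary assumption.

```python
import os
import subprocess
from typing import List


def converted_path(file_path: str, new_extension: str) -> str:
    """Build the output path next to the source file, swapping only the extension."""
    file_dir, file_name = os.path.split(file_path)
    name_wo_ext, _ = os.path.splitext(file_name)
    return os.path.join(file_dir, f"{name_wo_ext}{new_extension}")


def run_and_check(command: List[str], expected_path: str, timeout: int = 60) -> str:
    """Run an external command and fail loudly if it did not produce expected_path."""
    # check=True raises CalledProcessError on a non-zero exit code
    subprocess.run(command, check=True, capture_output=True, timeout=timeout)
    if not os.path.isfile(expected_path):
        raise FileNotFoundError(f"conversion produced no file at {expected_path}")
    return expected_path
```

Checking for the output file in addition to the exit code guards against tools that exit successfully without writing anything.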

You can use the converter in your code:

file_path = "test_dir/The_New_Yorker_Case_Study.djvu"

djvu_converter = DjvuConverter()
djvu_converter.can_convert(file_path)  # True
djvu_converter.convert(file_path)  # 'test_dir/The_New_Yorker_Case_Study.pdf'

Implementing PdfAttachmentsExtractor

Implement PdfAttachmentsExtractor.

import os
from typing import List, Optional

from dedoc.attachments_extractors.abstract_attachment_extractor import AbstractAttachmentsExtractor
from dedoc.data_structures import AttachedFile
from dedoc.extensions import recognized_extensions, recognized_mimes
from dedoc.utils.parameter_utils import get_param_attachments_dir, get_param_need_content_analysis
from dedoc.utils.utils import get_mime_extension


class PdfAttachmentsExtractor(AbstractAttachmentsExtractor):
    def can_extract(self,
                    file_path: Optional[str] = None,
                    extension: Optional[str] = None,
                    mime: Optional[str] = None,
                    parameters: Optional[dict] = None) -> bool:
        mime, extension = get_mime_extension(file_path=file_path, mime=mime, extension=extension)
        return extension in recognized_extensions.pdf_like_format or mime in recognized_mimes.pdf_like_format

    def extract(self, file_path: str, parameters: Optional[dict] = None) -> List[AttachedFile]:
        from pypdf import PdfReader

        parameters = {} if parameters is None else parameters
        with open(file_path, "rb") as f:
            reader = PdfReader(f)
            catalog = reader.trailer["/Root"]

            if "/Names" not in catalog or "/EmbeddedFiles" not in catalog["/Names"]:
                return []

            attachments = []
            filenames = catalog["/Names"]["/EmbeddedFiles"]["/Names"]
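            # the array alternates file names and file specifications: [name, spec, name, spec, ...]
            for index, filename in enumerate(filenames):
                if isinstance(filename, str):
                    # enumerate keeps the right position even when two attachments share a name
                    f_dict = filenames[index + 1].get_object()
                    f_data = f_dict["/EF"]["/F"].get_data()
                    attachments.append((filename, f_data))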

        attachments_dir = get_param_attachments_dir(parameters, file_path)
        need_content_analysis = get_param_need_content_analysis(parameters)
        attachments = self._content2attach_file(content=attachments, tmpdir=attachments_dir, need_content_analysis=need_content_analysis, parameters=parameters)
        return attachments
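The loop in extract() relies on the embedded-files array alternating names and file specifications. The sketch below shows that pairing on plain Python data, assuming this flat layout; pair_names_with_specs is a hypothetical helper and the dictionaries stand in for pypdf objects.

```python
from typing import Any, List, Tuple


def pair_names_with_specs(names_array: List[Any]) -> List[Tuple[str, Any]]:
    """Pair each name with the entry that follows it in a flat [name, spec, name, spec, ...] list."""
    pairs = []
    # the last element can never be a name with a following spec, so it is skipped
    for index, item in enumerate(names_array[:-1]):
        if isinstance(item, str):
            pairs.append((item, names_array[index + 1]))
    return pairs
```

Walking the list by position rather than looking each name up by value keeps the pairing correct when two attachments have the same name.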

You should implement the following methods:

can_extract(): use the file extension or mime to check whether attachments can be extracted from the given file. You can learn more about extensions and mime in the file dedoc/extensions.py.
extract(): use the file path and file name to extract attachments from the given file.

The extract() method returns the list of AttachedFile using the _content2attach_file method. This method is inherited from the abstract class; it builds the list of AttachedFile from a list of tuples, each holding the name of the attached file and the binary content of the file.
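To make the input shape concrete, here is a minimal sketch of turning (name, content) tuples into saved files. It is not the implementation of _content2attach_file; SavedAttachment and save_attachments are hypothetical names, and the index prefix in file names is an assumption made to avoid collisions.

```python
import os
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SavedAttachment:
    """Hypothetical stand-in for AttachedFile: original name plus the path of the saved copy."""
    original_name: str
    tmp_file_path: str


def save_attachments(content: List[Tuple[str, bytes]], tmpdir: str) -> List[SavedAttachment]:
    """Write every (name, bytes) tuple into tmpdir and describe the saved files."""
    saved = []
    for index, (name, data) in enumerate(content):
        # basename drops any directory part; the index keeps equal names from overwriting each other
        path = os.path.join(tmpdir, f"{index}_{os.path.basename(name)}")
        with open(path, "wb") as out:
            out.write(data)
        saved.append(SavedAttachment(original_name=name, tmp_file_path=path))
    return saved
```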

Implementing PdfReader

Implement PdfReader.

from typing import List, Optional

import tabula
from pdf_attachment_extractor import PdfAttachmentsExtractor

from dedoc.data_structures import CellWithMeta, LineMetadata
from dedoc.data_structures.line_with_meta import LineWithMeta
from dedoc.data_structures.table import Table
from dedoc.data_structures.table_metadata import TableMetadata
from dedoc.data_structures.unstructured_document import UnstructuredDocument
from dedoc.extensions import recognized_extensions, recognized_mimes
from dedoc.readers.base_reader import BaseReader
from dedoc.utils.utils import get_mime_extension


class PdfReader(BaseReader):

    def __init__(self, config: Optional[dict] = None) -> None:
        super().__init__(config=config)
        self.attachment_extractor = PdfAttachmentsExtractor(config=self.config)

    def can_read(self, file_path: Optional[str] = None, mime: Optional[str] = None, extension: Optional[str] = None, parameters: Optional[dict] = None) -> bool:
        mime, extension = get_mime_extension(file_path=file_path, mime=mime, extension=extension)
        return extension in recognized_extensions.pdf_like_format or mime in recognized_mimes.pdf_like_format

    def read(self, file_path: str, parameters: Optional[dict] = None) -> UnstructuredDocument:
        parameters = {} if parameters is None else parameters
        lines = self.__process_lines(file_path)
        tables = self.__process_tables(file_path)
        attachments = self.attachment_extractor.extract(file_path=file_path, parameters=parameters)
        return UnstructuredDocument(lines=lines, tables=tables, attachments=attachments)

    def __process_tables(self, path: str) -> List[Table]:
        dfs = tabula.read_pdf(path, stream=True, pages="all")
        tables = []
        for df in dfs:
            metadata = TableMetadata(page_id=None)
            cells = [[CellWithMeta(lines=[LineWithMeta(line=text_cell)]) for text_cell in row] for row in df.values.tolist()]
            tables.append(Table(cells=cells, metadata=metadata))
        return tables

    def __process_lines(self, path: str) -> List[LineWithMeta]:
        from pypdf import PdfReader as PdfFileReader
        with open(path, "rb") as file:
            lines_with_meta = []
            pdf = PdfFileReader(file)
            for page_id, page in enumerate(pdf.pages):
                text = page.extract_text()
                lines = text.split("\n")
                for line_id, line in enumerate(lines):
                    metadata = LineMetadata(page_id=page_id, line_id=line_id)
                    lines_with_meta.append(LineWithMeta(line=line, metadata=metadata, annotations=[]))
        return lines_with_meta

You should implement the following methods:

can_read(): use the file extension or mime to check whether the given file can be read. You can learn more about extensions and mime in the file dedoc/extensions.py.
read(): return the document content as an UnstructuredDocument, consisting of a list of document lines LineWithMeta, tables Table and attachments AttachedFile.

For each line, add its text, metadata, hierarchy level (if it exists) and annotations (if any). For tables, add a list of rows (each row is a list of table cells) and metadata. See dedoc.data_structures to learn more about the structures described here. We use pypdf to extract the text and tabula to extract tables; both must be added to the project’s requirements.txt. We use the PdfAttachmentsExtractor class described above for attachments extraction: it is created in the reader’s constructor and used in the read method.
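The shape of the data built in read() can be sketched without dedoc or pypdf. SimpleLine, pages_to_lines and rows_to_cells below are hypothetical stand-ins that only mirror the page/line numbering and the row-of-cells layout described above; converting cell values to text is an assumption, not something the reader above does.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SimpleLine:
    """Hypothetical stand-in for LineWithMeta with its page and line numbers."""
    text: str
    page_id: int
    line_id: int


def pages_to_lines(page_texts: List[str]) -> List[SimpleLine]:
    """Split the text of each page into lines; line_id restarts on every page."""
    lines = []
    for page_id, text in enumerate(page_texts):
        for line_id, line in enumerate(text.split("\n")):
            lines.append(SimpleLine(text=line, page_id=page_id, line_id=line_id))
    return lines


def rows_to_cells(rows: List[list]) -> List[List[str]]:
    """Turn every cell value into text so that numbers and missing values become strings."""
    return [["" if cell is None else str(cell) for cell in row] for row in rows]
```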

You can use the reader in your code:

pdf_reader = PdfReader()

file_path = "test_dir/pdf_with_attachment.pdf"
pdf_reader.can_read(file_path)  # True
pdf_reader.read(file_path, parameters={"with_attachments": "true"})  # <dedoc.data_structures.UnstructuredDocument>

document = pdf_reader.read(file_path, parameters={"with_attachments": "true"})
list(vars(document))  # ['tables', 'lines', 'attachments', 'warnings', 'metadata']
len(document.attachments)  # 1
len(document.lines)  # 11
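Note that the parameters are passed as strings ("true"), while the read() method shown above calls the extractor unconditionally. If you want your reader to honor such a flag, a minimal sketch of interpreting it could look like this; flag_enabled is a hypothetical helper, not a dedoc function.

```python
from typing import Optional


def flag_enabled(parameters: Optional[dict], name: str, default: bool = False) -> bool:
    """Interpret a parameter that may arrive as a bool or as a string such as "true"."""
    parameters = {} if parameters is None else parameters
    value = parameters.get(name, default)
    if isinstance(value, bool):
        return value
    # anything other than the text "true" (in any letter case) counts as disabled
    return str(value).strip().lower() == "true"
```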

Adding the implemented handlers to the manager config

All implemented document handlers are linked to dedoc in dedoc/manager_config.py.

You do not have to edit this file. Create your own manager_config with dedoc handlers you need and your custom handlers directly in your code. Example of a manager config with the new handlers:

import os

from djvu_converter import DjvuConverter
from pdf_reader import PdfReader

from dedoc import DedocManager
from dedoc.attachments_handler import AttachmentsHandler
from dedoc.converters import ConverterComposition
from dedoc.metadata_extractors import BaseMetadataExtractor, DocxMetadataExtractor, MetadataExtractorComposition
from dedoc.readers import ReaderComposition
from dedoc.structure_constructors import LinearConstructor, StructureConstructorComposition, TreeConstructor
from dedoc.structure_extractors import DefaultStructureExtractor, StructureExtractorComposition


manager_config = dict(
    converter=ConverterComposition(converters=[DjvuConverter()]),
    reader=ReaderComposition(readers=[PdfReader()]),
    structure_extractor=StructureExtractorComposition(extractors={DefaultStructureExtractor.document_type: DefaultStructureExtractor()}, default_key="other"),
    structure_constructor=StructureConstructorComposition(
        constructors={"linear": LinearConstructor(), "tree": TreeConstructor()},
        default_constructor=LinearConstructor()
    ),
    document_metadata_extractor=MetadataExtractorComposition(extractors=[DocxMetadataExtractor(), BaseMetadataExtractor()]),
    attachments_handler=AttachmentsHandler(),
)

Then create a DedocManager object and call its parse() method:

file_path = "test_dir/The_New_Yorker_Case_Study.djvu"
manager = DedocManager(manager_config=manager_config)
result = manager.parse(file_path=file_path, parameters={"with_attachments": "true"})

The result is a ParsedDocument:

result  # <dedoc.data_structures.ParsedDocument>
result.to_api_schema().model_dump()  # {'content': {'structure': {'node_id': '0', 'text': '', 'annotations': [], 'metadata': {'paragraph_type': 'root', ...
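The manager config passes lists of handlers to composition classes such as ConverterComposition and ReaderComposition. Their dispatch logic is not shown here; the sketch below assumes that handlers are asked in list order and that the first one accepting the file is used, so check the dedoc source before relying on this. ExtensionHandler and pick_handler are hypothetical names.

```python
import os
from typing import Iterable, List, Optional


class ExtensionHandler:
    """Hypothetical handler that accepts files by extension."""

    def __init__(self, name: str, extensions: Iterable[str]) -> None:
        self.name = name
        self.extensions = frozenset(extensions)

    def can_handle(self, file_path: str) -> bool:
        return os.path.splitext(file_path)[1].lower() in self.extensions


def pick_handler(handlers: List[ExtensionHandler], file_path: str) -> Optional[ExtensionHandler]:
    """Return the first handler that accepts the file, or None if nobody does."""
    for handler in handlers:
        if handler.can_handle(file_path):
            return handler
    return None
```

Under this assumption the order of the list matters: a more specific handler should come before a more general one.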

Support for the new document type is now complete.