Adding support for a new document format to Dedoc
Suppose you need to add support for a new format “newtype”. Several ways of document processing exist:
Converter - you can write a converter from one document format to another;
Reader - you can write a special format-specific handler;
AttachmentExtractor - if a document contains attachments, an attachment extractor lets you extract them.
General scheme of adding Converter
When a parser already exists for one format and another format converts well into it, it’s convenient to write a converter. For example, if we know how to parse documents in docx format but need to process documents in doc format, we can write a converter from doc to docx.
Implement the NewtypeConverter class. This class must inherit the abstract class AbstractConverter from dedoc/converters/concrete_converters/abstract_converter.py.
Call the constructor of the base class in the constructor of the current class.
from typing import Optional

from dedoc.converters.concrete_converters.abstract_converter import AbstractConverter


class NewtypeConverter(AbstractConverter):

    def __init__(self, config: Optional[dict] = None) -> None:
        super().__init__(config=config)

    def can_convert(self,
                    file_path: Optional[str] = None,
                    extension: Optional[str] = None,
                    mime: Optional[str] = None,
                    parameters: Optional[dict] = None) -> bool:
        pass  # some code here

    def convert(self, file_path: str, parameters: Optional[dict] = None) -> str:
        pass  # some code here
Implement the converter methods that convert other formats to this format:
can_convert() checks whether the new converter can process the file; for example, you can return True for a list of specific file extensions.
convert() performs the required file conversion.
Add the converter to manager config, see Adding the implemented handlers to the manager config.
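The extension check described for can_convert() can be sketched with the standard library alone. The helper name has_supported_extension and the ".newtype" / ".nt" extensions are hypothetical and not part of dedoc; in a real converter the same logic would sit inside can_convert().

```python
import os
from typing import Iterable, Optional


def has_supported_extension(file_path: Optional[str] = None,
                            extension: Optional[str] = None,
                            supported: Iterable[str] = (".newtype", ".nt")) -> bool:
    """Return True if the explicit extension, or the one taken from file_path, is supported."""
    if not extension and file_path:
        # Fall back to the extension found in the file name
        extension = os.path.splitext(file_path)[1]
    if not extension:
        return False
    # Compare case-insensitively so that "REPORT.NEWTYPE" is accepted too
    return extension.lower() in {ext.lower() for ext in supported}
```

The check never opens the file, so it stays cheap even when it is called for every incoming document.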
General scheme of adding Reader
Implement the NewtypeReader class. This class must inherit the abstract class BaseReader.
from typing import Optional

from dedoc.data_structures.unstructured_document import UnstructuredDocument
from dedoc.readers.base_reader import BaseReader


class NewtypeReader(BaseReader):

    def can_read(self, file_path: Optional[str] = None, mime: Optional[str] = None, extension: Optional[str] = None, parameters: Optional[dict] = None) -> bool:
        pass  # some code here

    def read(self, file_path: str, parameters: Optional[dict] = None) -> UnstructuredDocument:
        pass  # some code here
Implement the reader methods according to the specifics of processing the file format:
can_read() checks whether the given file can be processed. The check relies on the path to the file, the file extension or the mime type. Keep this method fast because it is called frequently.
read() must form an UnstructuredDocument (document lines, tables and attachments).
Add the reader to manager config, see Adding the implemented handlers to the manager config.
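One way to see why can_read() should stay cheap: assuming handlers are tried in order and the first one that accepts the file does the work (an assumption made for this illustration, not a description of dedoc's internals), a single file may trigger several can_read() calls before anything is read. The classes below are stand-ins and do not use the dedoc API.

```python
import os
from typing import List, Optional


class StubReader:
    """Stand-in for a reader: accepts files by extension and never opens them in can_read()."""

    def __init__(self, name: str, extensions: List[str]) -> None:
        self.name = name
        self.extensions = {ext.lower() for ext in extensions}
        self.can_read_calls = 0

    def can_read(self, file_path: str) -> bool:
        self.can_read_calls += 1
        # Only the file name is inspected, so the check costs no disk I/O
        return os.path.splitext(file_path)[1].lower() in self.extensions

    def read(self, file_path: str) -> str:
        return f"{self.name} read {file_path}"


def read_with_first_match(readers: List[StubReader], file_path: str) -> Optional[str]:
    """Ask each reader in turn and use the first one that accepts the file."""
    for reader in readers:
        if reader.can_read(file_path):
            return reader.read(file_path)
    return None


readers = [StubReader("txt", [".txt"]), StubReader("newtype", [".newtype"])]
result = read_with_first_match(readers, "docs/example.newtype")
```

Here both stub readers are asked once before the second one handles the file, which is why an expensive can_read() would be paid for repeatedly.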
General scheme of adding AttachmentExtractor
Implement the NewtypeAttachmentsExtractor class. This class must inherit the abstract class AbstractAttachmentsExtractor.
from typing import List, Optional

from dedoc.attachments_extractors.abstract_attachment_extractor import AbstractAttachmentsExtractor
from dedoc.data_structures.attached_file import AttachedFile


class NewtypeAttachmentsExtractor(AbstractAttachmentsExtractor):

    def can_extract(self,
                    file_path: Optional[str] = None,
                    extension: Optional[str] = None,
                    mime: Optional[str] = None,
                    parameters: Optional[dict] = None) -> bool:
        pass  # some code here

    def extract(self, file_path: str, parameters: Optional[dict] = None) -> List[AttachedFile]:
        pass  # some code here
Implement the methods according to the specifics of extracting attachments for this format:
can_extract() checks whether the new extractor can process the file; for example, you can return True for a list of specific file extensions.
extract() returns the list of attachments extracted from the document, one AttachedFile per attachment; you can see its code in dedoc/data_structures/attached_file.py.
Add the attachments extractor to the reader’s code.
Add the line self.attachment_extractor = NewtypeAttachmentsExtractor(config=self.config) to the constructor of the NewtypeReader class and add attachments extraction to the read method:
class NewtypeReader(BaseReader):

    def __init__(self, config: Optional[dict] = None) -> None:
        super().__init__(config=config)
        self.attachment_extractor = NewtypeAttachmentsExtractor(config=self.config)

    def read(self, file_path: str, parameters: Optional[dict] = None) -> UnstructuredDocument:
        # some code
        attachments = self.attachment_extractor.extract(file_path=file_path, parameters=parameters)
        # some code
Example of adding pdf/djvu handlers
Suppose we want to add the ability to handle pdf/djvu documents with a text layer. We don’t want to deal with two formats, because we can convert djvu to pdf. The following steps are proposed:
Implementing the converter from djvu to pdf DjvuConverter.
Implementing PdfAttachmentsExtractor.
Implementing PdfReader.
Adding the implemented handlers to the manager config.
Let’s describe each step in more detail.
Implementing the converter from djvu to pdf DjvuConverter
Implement the DjvuConverter class.
import os
from typing import Optional

from dedoc.converters.concrete_converters.abstract_converter import AbstractConverter
from dedoc.utils.utils import get_mime_extension, splitext_


class DjvuConverter(AbstractConverter):

    def __init__(self, config: Optional[dict] = None) -> None:
        super().__init__(config=config)

    def can_convert(self,
                    file_path: Optional[str] = None,
                    extension: Optional[str] = None,
                    mime: Optional[str] = None,
                    parameters: Optional[dict] = None) -> bool:
        _, extension = get_mime_extension(file_path=file_path, mime=mime, extension=extension)
        return extension == ".djvu"

    def convert(self, file_path: str, parameters: Optional[dict] = None) -> str:
        file_dir, file_name = os.path.split(file_path)
        name_wo_ext, _ = splitext_(file_name)
        converted_file_path = os.path.join(file_dir, f"{name_wo_ext}.pdf")
        command = ["ddjvu", "--format=pdf", file_path, converted_file_path]
        self._run_subprocess(command=command, filename=file_name, expected_path=converted_file_path)
        return converted_file_path
You should implement the following methods:
can_convert(): return True if the file extension is .djvu. See the file dedoc/extensions.py for more accurate work with extensions.
convert(): use the ddjvu utility and run it with the _run_subprocess method, which ensures that the converted file was saved.
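The "run an external tool, then make sure the output file exists" pattern can be sketched independently of dedoc. This is not the code of _run_subprocess; run_and_check and replace_extension are hypothetical helpers, and the Python interpreter stands in for ddjvu so that the sketch runs without it installed.

```python
import os
import subprocess
import sys
import tempfile
from typing import List


def run_and_check(command: List[str], expected_path: str, timeout: int = 60) -> str:
    """Run an external command and verify that it produced the expected file."""
    completed = subprocess.run(command, capture_output=True, timeout=timeout)
    if completed.returncode != 0:
        raise RuntimeError(f"Command failed: {completed.stderr.decode(errors='replace')}")
    if not os.path.isfile(expected_path):
        raise RuntimeError(f"Expected output {expected_path} was not created")
    return expected_path


def replace_extension(file_path: str, new_extension: str) -> str:
    """Build the output path next to the input file, e.g. a.djvu -> a.pdf."""
    return os.path.splitext(file_path)[0] + new_extension


# Demonstration: a tiny Python script plays the role of the conversion utility
with tempfile.TemporaryDirectory() as tmp_dir:
    source = os.path.join(tmp_dir, "sample.djvu")
    target = replace_extension(source, ".pdf")
    command = [sys.executable, "-c", "import sys; open(sys.argv[1], 'wb').close()", target]
    converted = run_and_check(command, expected_path=target)
    created = os.path.isfile(converted)
```

Checking both the return code and the existence of the output file catches tools that exit successfully without writing anything.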
You can use the converter in your code:
file_path = "test_dir/The_New_Yorker_Case_Study.djvu"
djvu_converter = DjvuConverter()
djvu_converter.can_convert(file_path) # True
djvu_converter.convert(file_path) # 'test_dir/The_New_Yorker_Case_Study.pdf'
Implementing PdfAttachmentsExtractor
Implement PdfAttachmentsExtractor.
from typing import List, Optional

import PyPDF2

from dedoc.attachments_extractors.abstract_attachment_extractor import AbstractAttachmentsExtractor
from dedoc.data_structures import AttachedFile
from dedoc.extensions import recognized_extensions, recognized_mimes
from dedoc.utils.parameter_utils import get_param_attachments_dir, get_param_need_content_analysis
from dedoc.utils.utils import get_mime_extension


class PdfAttachmentsExtractor(AbstractAttachmentsExtractor):

    def can_extract(self,
                    file_path: Optional[str] = None,
                    extension: Optional[str] = None,
                    mime: Optional[str] = None,
                    parameters: Optional[dict] = None) -> bool:
        mime, extension = get_mime_extension(file_path=file_path, mime=mime, extension=extension)
        return extension in recognized_extensions.pdf_like_format or mime in recognized_mimes.pdf_like_format

    def extract(self, file_path: str, parameters: Optional[dict] = None) -> List[AttachedFile]:
        parameters = {} if parameters is None else parameters
        attachments = []

        # Keep the file open until all embedded streams have been read
        with open(file_path, "rb") as handler:
            reader = PyPDF2.PdfFileReader(handler)
            catalog = reader.trailer["/Root"]
            if "/Names" not in catalog or "/EmbeddedFiles" not in catalog["/Names"]:
                return attachments

            # The names array is flat: [name_1, file_spec_1, name_2, file_spec_2, ...]
            filenames = catalog["/Names"]["/EmbeddedFiles"]["/Names"]
            for index, filename in enumerate(filenames):
                if isinstance(filename, str):
                    f_dict = filenames[index + 1].getObject()
                    f_data = f_dict["/EF"]["/F"].getData()
                    attachments.append((filename, f_data))

        attachments_dir = get_param_attachments_dir(parameters, file_path)
        need_content_analysis = get_param_need_content_analysis(parameters)
        attachments = self._content2attach_file(content=attachments, tmpdir=attachments_dir, need_content_analysis=need_content_analysis, parameters=parameters)
        return attachments
You should implement the following methods:
can_extract(): use the file extension or mime type to check whether the given file can be processed. You can learn more about extensions and mime types in the file dedoc/extensions.py.
extract(): use the file path and file name to extract attachments from the given file. The method returns a list of AttachedFile built by the _content2attach_file method. This method is inherited from the abstract class; it makes the list of AttachedFile from a list of tuples, each holding the name of the attached file and the binary content of the file.
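To illustrate the idea of turning (name, binary content) tuples into stored attachments, here is a minimal standard-library sketch. It is not the implementation of _content2attach_file: SavedAttachment and save_attachments are hypothetical stand-ins, and the random file naming is an assumption made for this example.

```python
import os
import tempfile
import uuid
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SavedAttachment:
    """Stand-in for an attachment record; not dedoc's AttachedFile."""
    original_name: str
    saved_path: str


def save_attachments(content: List[Tuple[str, bytes]], target_dir: str) -> List[SavedAttachment]:
    """Write each (name, bytes) pair to target_dir under a collision-free file name."""
    os.makedirs(target_dir, exist_ok=True)
    saved = []
    for original_name, data in content:
        # Only the extension of the original name is reused; the file name itself is random
        extension = os.path.splitext(os.path.basename(original_name))[1]
        saved_path = os.path.join(target_dir, f"{uuid.uuid4().hex}{extension}")
        with open(saved_path, "wb") as file:
            file.write(data)
        saved.append(SavedAttachment(original_name=original_name, saved_path=saved_path))
    return saved


with tempfile.TemporaryDirectory() as tmp_dir:
    # Two attachments with the same name must not overwrite each other
    records = save_attachments([("notes.txt", b"hello"), ("notes.txt", b"world")], tmp_dir)
    contents = []
    for record in records:
        with open(record.saved_path, "rb") as file:
            contents.append(file.read())
```

Keeping the original name in the record while storing the data under a generated name avoids collisions between attachments that share a file name.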
Implementing PdfReader
Implement PdfReader.
from typing import List, Optional

import tabula
from PyPDF2 import PdfFileReader
from pdf_attachment_extractor import PdfAttachmentsExtractor

from dedoc.data_structures import CellWithMeta, LineMetadata
from dedoc.data_structures.line_with_meta import LineWithMeta
from dedoc.data_structures.table import Table
from dedoc.data_structures.table_metadata import TableMetadata
from dedoc.data_structures.unstructured_document import UnstructuredDocument
from dedoc.extensions import recognized_extensions, recognized_mimes
from dedoc.readers.base_reader import BaseReader
from dedoc.utils.utils import get_mime_extension


class PdfReader(BaseReader):

    def __init__(self, config: Optional[dict] = None) -> None:
        super().__init__(config=config)
        self.attachment_extractor = PdfAttachmentsExtractor(config=self.config)

    def can_read(self, file_path: Optional[str] = None, mime: Optional[str] = None, extension: Optional[str] = None, parameters: Optional[dict] = None) -> bool:
        mime, extension = get_mime_extension(file_path=file_path, mime=mime, extension=extension)
        return extension in recognized_extensions.pdf_like_format or mime in recognized_mimes.pdf_like_format

    def read(self, file_path: str, parameters: Optional[dict] = None) -> UnstructuredDocument:
        parameters = {} if parameters is None else parameters
        lines = self.__process_lines(file_path)
        tables = self.__process_tables(file_path)
        attachments = self.attachment_extractor.extract(file_path=file_path, parameters=parameters)
        return UnstructuredDocument(lines=lines, tables=tables, attachments=attachments)

    def __process_tables(self, path: str) -> List[Table]:
        dfs = tabula.read_pdf(path, stream=True, pages="all")
        tables = []
        for df in dfs:
            metadata = TableMetadata(page_id=None)
            cells = [[CellWithMeta(lines=[LineWithMeta(line=text_cell)]) for text_cell in row] for row in df.values.tolist()]
            tables.append(Table(cells=cells, metadata=metadata))
        return tables

    def __process_lines(self, path: str) -> List[LineWithMeta]:
        with open(path, "rb") as file:
            lines_with_meta = []
            pdf = PdfFileReader(file)
            num_pages = pdf.getNumPages()
            for page_id in range(num_pages):
                page = pdf.getPage(page_id)
                text = page.extractText()
                lines = text.split("\n")
                for line_id, line in enumerate(lines):
                    metadata = LineMetadata(page_id=page_id, line_id=line_id)
                    lines_with_meta.append(LineWithMeta(line=line, metadata=metadata, annotations=[]))
        return lines_with_meta
You should implement the following methods:
can_read(): use the file extension or mime type to check whether the given file can be read. You can learn more about extensions and mime types in the file dedoc/extensions.py.
read(): returns the document content as an UnstructuredDocument, consisting of a list of document lines (LineWithMeta), tables (Table) and attachments (AttachedFile).
For each line, you need to add its text, metadata, hierarchy level (if it exists) and annotations (if they exist).
For tables, you need to add a list of rows (each row is a list of table cells) and metadata.
See dedoc.data_structures to learn more about all the described structures.
We use PyPDF2 to extract the text and tabula to extract tables. They must be added to the requirements.txt of the project.
We use the PdfAttachmentsExtractor class for attachments extraction (it was mentioned before). It must be added to the reader’s constructor and used in the read method.
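The shape of the data that read() assembles can be sketched without dedoc or a PDF library. LineRecord, pages_to_lines and rows_to_cells are hypothetical stand-ins for illustration only; they mirror the idea of attaching a page and line index to every line and of storing table rows as lists of text cells.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LineRecord:
    """Stand-in for a line with metadata; not dedoc's LineWithMeta."""
    text: str
    page_id: int
    line_id: int


def pages_to_lines(page_texts: List[str]) -> List[LineRecord]:
    """Split each page's text into lines and remember where every line came from."""
    records = []
    for page_id, text in enumerate(page_texts):
        # line_id restarts from zero on every page
        for line_id, line in enumerate(text.split("\n")):
            records.append(LineRecord(text=line, page_id=page_id, line_id=line_id))
    return records


def rows_to_cells(rows: List[list]) -> List[List[str]]:
    """Turn arbitrary cell values into strings so that every cell holds text."""
    return [["" if value is None else str(value) for value in row] for row in rows]


lines = pages_to_lines(["Title\nFirst paragraph", "Second page"])
cells = rows_to_cells([["Name", "Count"], ["apples", 3], ["pears", None]])
```

Converting cell values to text up front is a design choice of this sketch: table extraction tools may return numbers or empty values, while a line of text is expected to be a string.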
You can use the reader in your code:
pdf_reader = PdfReader()
file_path = "test_dir/pdf_with_attachment.pdf"
pdf_reader.can_read(file_path) # True
pdf_reader.read(file_path, parameters={"with_attachments": "true"}) # <dedoc.data_structures.UnstructuredDocument>
document = pdf_reader.read(file_path, parameters={"with_attachments": "true"})
list(vars(document)) # ['tables', 'lines', 'attachments', 'warnings', 'metadata']
len(document.attachments) # 1
len(document.lines) # 11
Adding the implemented handlers to the manager config
All implemented document handlers are linked to dedoc in dedoc/manager_config.py. You do not have to edit this file. Create your own manager_config with the dedoc handlers you need and your custom handlers directly in your code. Example of a manager config with the new handlers:
import os

from djvu_converter import DjvuConverter
from pdf_reader import PdfReader

from dedoc import DedocManager
from dedoc.attachments_handler import AttachmentsHandler
from dedoc.converters import ConverterComposition
from dedoc.metadata_extractors import BaseMetadataExtractor, DocxMetadataExtractor, MetadataExtractorComposition
from dedoc.readers import ReaderComposition
from dedoc.structure_constructors import LinearConstructor, StructureConstructorComposition, TreeConstructor
from dedoc.structure_extractors import DefaultStructureExtractor, StructureExtractorComposition

manager_config = dict(
    converter=ConverterComposition(converters=[DjvuConverter()]),
    reader=ReaderComposition(readers=[PdfReader()]),
    structure_extractor=StructureExtractorComposition(extractors={DefaultStructureExtractor.document_type: DefaultStructureExtractor()}, default_key="other"),
    structure_constructor=StructureConstructorComposition(
        constructors={"linear": LinearConstructor(), "tree": TreeConstructor()},
        default_constructor=LinearConstructor()
    ),
    document_metadata_extractor=MetadataExtractorComposition(extractors=[DocxMetadataExtractor(), BaseMetadataExtractor()]),
    attachments_handler=AttachmentsHandler(),
)
Then create a DedocManager object and use its parse() method:
file_path = "test_dir/The_New_Yorker_Case_Study.djvu"
manager = DedocManager(manager_config=manager_config)
result = manager.parse(file_path=file_path, parameters={"with_attachments": "true"})
The result is a ParsedDocument:
result # <dedoc.data_structures.ParsedDocument>
result.to_api_schema().model_dump() # {'content': {'structure': {'node_id': '0', 'text': '', 'annotations': [], 'metadata': {'paragraph_type': 'root', ...
Support for the new document type is now complete.