Dedoc: the system for document structure extraction

Dedoc is an open universal system for converting textual documents of different formats to a unified output representation.

Dedoc extracts the following data from documents:
  • Content - textual lines of the document in reading order;

  • Annotations of the lines - formatting of the text for its visual representation;

  • Structure - type of each document line and its level (importance) in the document hierarchy;

  • Tables - tables that are found in the document;

  • Attachments - files attached to the document;

  • Metadata - some additional information about the file, e.g. creation date or the author.
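
As a rough illustration of how these six kinds of data fit together, here is a minimal Python sketch. The class and field names (`Annotation`, `Line`, `ParsedDocument`, `line_type`, and so on) are hypothetical and chosen only for this example; they are not Dedoc's actual classes, and the convention that a lower level means a higher position in the hierarchy is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Formatting that applies to characters [start, end) of a line."""
    start: int
    end: int
    name: str   # hypothetical annotation name, e.g. "bold"
    value: str


@dataclass
class Line:
    """One textual line together with its structure information."""
    text: str
    line_type: str = "raw_text"  # hypothetical type name
    level: int = 0               # assumption: lower value = higher in the hierarchy
    annotations: list[Annotation] = field(default_factory=list)


@dataclass
class ParsedDocument:
    """Hypothetical container for everything extracted from one file."""
    lines: list[Line] = field(default_factory=list)               # content, in reading order
    tables: list[list[list[str]]] = field(default_factory=list)   # each table is rows of cells
    attachments: list[str] = field(default_factory=list)          # names of attached files
    metadata: dict[str, str] = field(default_factory=dict)        # e.g. author, creation date


doc = ParsedDocument(
    lines=[
        Line("1. Introduction", line_type="header", level=1,
             annotations=[Annotation(0, 15, "bold", "True")]),
        Line("Some body text."),
    ],
    tables=[[["a", "b"], ["1", "2"]]],
    attachments=["image.png"],
    metadata={"author": "unknown"},
)
```

Keeping annotations as character ranges attached to a line, rather than as markup inside the text, leaves the line content clean for later structure analysis.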

Dedoc can be integrated into a larger system for document content and structure analysis as a separate module. It can be used as a Python library, an API service, or a Docker container.

Workflow

[Figure: Dedoc workflow diagram (_images/workflow.png)]

The main workflow consists of the following stages:

1. Converting the document to one of the supported formats. Some documents can be easily converted to another well-known format, e.g. odt to docx. In this case, converters bring these documents to one common format to simplify subsequent reading. The table Supported documents formats and the reader’s output lists the supported document formats and shows which of them are converted.

2. Reading the converted document to get an intermediate representation of the document. This representation includes document lines with annotations, tables, attachments, and metadata. The table Supported documents formats and the reader’s output shows which information can be extracted for each document format.

3. Extracting structure from the document. This stage identifies line types and hierarchy levels. The supported structure types are listed in the section Structure extraction using dedoc.

4. Constructing the output structure. The resulting representation of the document structure may vary, since structure constructors may use the information about line types and levels differently. For example, a tree of document lines may be built.
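
The four stages above can be sketched as a small pipeline. This is a toy illustration under stated assumptions, not Dedoc's implementation: the functions `convert`, `read`, `extract_structure`, and `build_tree`, the `Node` class, and the numbering rule used to assign levels are all hypothetical.

```python
import re
from dataclasses import dataclass, field

TEXT_LEVEL = 10  # hypothetical level for lines without a numbering prefix


@dataclass
class Node:
    text: str
    level: int
    children: list["Node"] = field(default_factory=list)


def convert(filename: str) -> str:
    """Stage 1: map a convertible format to a common one (odt -> docx)."""
    conversions = {".odt": ".docx"}
    for src, dst in conversions.items():
        if filename.endswith(src):
            return filename[: -len(src)] + dst
    return filename


def read(raw_text: str) -> list[str]:
    """Stage 2: build an intermediate representation (here: non-empty lines)."""
    return [line.strip() for line in raw_text.splitlines() if line.strip()]


def extract_structure(lines: list[str]) -> list[tuple[str, int]]:
    """Stage 3: assign a hierarchy level to each line with a toy numbering rule."""
    result = []
    for line in lines:
        match = re.match(r"^(\d+(?:\.\d+)*)\.?\s", line)
        # "1." -> level 1, "1.1" -> level 2; unnumbered text gets TEXT_LEVEL
        level = match.group(1).count(".") + 1 if match else TEXT_LEVEL
        result.append((line, level))
    return result


def build_tree(leveled: list[tuple[str, int]]) -> Node:
    """Stage 4: nest each line under the nearest preceding line with a smaller level."""
    root = Node("root", 0)
    stack = [root]
    for text, level in leveled:
        while stack[-1].level >= level:
            stack.pop()
        node = Node(text, level)
        stack[-1].children.append(node)
        stack.append(node)
    return root


name = convert("report.odt")
tree = build_tree(extract_structure(read("1. Intro\nHello\n1.1 Scope\nDetails\n2. End\n")))
```

Here `tree` has two top-level children, "1. Intro" and "2. End"; "Hello" and "1.1 Scope" sit under "1. Intro", and "Details" sits under "1.1 Scope". A different structure constructor could consume the same `(line, level)` pairs and return, for example, a flat list instead.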

Reading documents using dedoc

Dedoc provides a common intermediate representation for documents of various formats. The output of any reader is an instance of the UnstructuredDocument class. See readers’ annotations and readers’ line types for more details about the information that each available reader can extract.

See also

Dedoc supports a fixed list of document formats, but the list can be extended with new handlers. A tutorial on how to add a new document format to Dedoc is here.

Supported documents formats and the reader’s output

| Document format | Reader | Lines | Tables | Attachments |
|---|---|---|---|---|
| zip, tar, tar.gz, rar, 7z | ArchiveReader | - | - | + |
| csv, tsv | CSVReader | - | + | - |
| docx | DocxReader | + | + | + |
| doc, odt, rtf | convert to docx using DocxConverter | + | + | + |
| xlsx | ExcelReader | - | + | + |
| xls, ods | convert to xlsx using ExcelConverter | - | + | + |
| pptx | PptxReader | + | + | + |
| ppt, odp | convert to pptx using PptxConverter | + | + | + |
| eml | EmailReader | + | + | + |
| html, shtml | HtmlReader | + | + | - |
| mhtml, mhtml.gz, mht, mht.gz | MhtmlReader | + | + | + |
| json | JsonReader | + | - | + |
| txt, txt.gz | RawTextReader | + | - | - |
| xml | convert to txt using TxtConverter | + | - | - |
| pdf (without textual layer), png | PdfImageReader | + | + | - |
| pdf (with textual layer) | PdfTabbyReader, PdfTxtlayerReader | + | + | + |
| pdf | PdfAutoReader | + | + | + |
| bmp, dib, eps, gif, hdr, j2k, jfif, jp2, jpe, jpeg, jpg, pbm, pcx, pgm, pic, png, pnm, ppm, ras, sgi, sr, tiff, webp | convert to png using PNGConverter | + | + | - |
| djvu | convert to pdf using PDFConverter | + | + | + |
| note.pickle | NoteReader | + | - | - |
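
To make the "convert, then read" relationship in the table concrete, the sketch below resolves a file name to a reader by following conversions first. It covers only a subset of the rows, and the dictionaries and the `resolve_reader` function are hypothetical helpers written for this example; Dedoc's own selection logic may work differently. The reader names are used only as strings taken from the table.

```python
# Subset of the table: formats that are read directly.
READERS = {
    "docx": "DocxReader",
    "xlsx": "ExcelReader",
    "pptx": "PptxReader",
    "txt": "RawTextReader",
    "png": "PdfImageReader",
    "pdf": "PdfAutoReader",
}

# Subset of the table: formats that are converted before reading.
CONVERSIONS = {
    "doc": "docx", "odt": "docx", "rtf": "docx",
    "xls": "xlsx", "ods": "xlsx",
    "ppt": "pptx", "odp": "pptx",
    "xml": "txt",
    "jpg": "png", "jpeg": "png", "tiff": "png",
    "djvu": "pdf",
}


def resolve_reader(filename: str) -> tuple[str, list[str]]:
    """Return (reader name, chain of formats the file passes through)."""
    ext = filename.rsplit(".", 1)[-1].lower()
    chain = [ext]
    # Follow conversions until a directly readable format is reached.
    while ext in CONVERSIONS:
        ext = CONVERSIONS[ext]
        chain.append(ext)
    if ext not in READERS:
        raise ValueError(f"unsupported format: {chain[0]}")
    return READERS[ext], chain


reader, chain = resolve_reader("report.odt")
```

For `report.odt` this yields the reader name `DocxReader` and the chain `["odt", "docx"]`, matching the "doc, odt, rtf" row of the table.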

Structure extraction using dedoc

Dedoc can extract structure from documents of certain domains. Classifiers predict the type of each document line or paragraph, and then rules (mostly based on regular expressions) determine the hierarchy level of each line for the document tree representation.
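
The two steps described above, type prediction followed by rule-based level assignment, might look roughly like the sketch below. Everything in it is an assumption made for illustration: `classify` is a keyword heuristic standing in for a trained classifier, and the type names, patterns, and level numbers in `LEVEL_RULES` are invented rather than taken from Dedoc.

```python
import re


def classify(line: str) -> str:
    """Stand-in for a trained classifier: predict the type of a line."""
    if re.match(r"^(chapter|section)\b", line, flags=re.IGNORECASE):
        return "header"
    if re.match(r"^\s*(\d+[.)]|[-*])\s", line):
        return "list_item"
    return "raw_text"


# Hypothetical rules: for each line type, (pattern, level) pairs tried in order.
LEVEL_RULES = {
    "header": [
        (re.compile(r"^chapter\b", re.IGNORECASE), 1),
        (re.compile(r"^section\b", re.IGNORECASE), 2),
    ],
    "list_item": [
        (re.compile(r"^\s*\d+[.)]\s"), 3),   # numbered items
        (re.compile(r"^\s*[-*]\s"), 4),      # bulleted items
    ],
}
DEFAULT_LEVEL = 5  # plain text goes to the deepest level


def find_level(line: str, line_type: str) -> int:
    """Apply the regex rules for the predicted type to get a hierarchy level."""
    for pattern, level in LEVEL_RULES.get(line_type, []):
        if pattern.search(line):
            return level
    return DEFAULT_LEVEL


lines = ["Chapter 1. Scope", "Section 1.1", "1) first item", "- bullet", "Plain text"]
typed = [(line, classify(line)) for line in lines]
leveled = [(line, kind, find_level(line, kind)) for line, kind in typed]
```

Separating the type prediction from the level rules means a new domain can reuse the same rule-matching code with its own classifier and its own rule table.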

See also

It’s possible to define a new structure extractor to handle documents of new domains. A tutorial on how to add a new structure type to Dedoc is here.

Currently the following domains can be handled:

For a document of an unknown or unsupported domain, the default structure extractor can be used (document_type=other in Api parameters description); the default document structure is described here.