Dedoc: the system for document structure extraction

Dedoc is an open universal system for converting textual documents of different formats to a unified output representation.

Dedoc can extract the following data from documents:

Content - textual lines of the document in the reading order;
Annotations of the lines - formatting of the text for its visual representation;
Structure - type of each document line and its level (importance) in the document hierarchy;
Tables - tables that are found in the document;
Attachments - files attached to the document;
Metadata - some additional information about the file, e.g. creation date or the author.

Dedoc can be integrated into a larger system for document content and structure analysis as a separate module. It can be used as a Python library, an API service, or a Docker container.

Workflow

The main workflow consists of the following stages:

1. Converting the document to one of the supported formats. Some documents can be easily converted to another well-known format, e.g. odt to docx. In such cases, converters bring these documents to one common format to simplify the subsequent reading. The table Supported documents formats and the reader’s output lists the supported formats and shows which of them are converted.

2. Reading the converted document to get an intermediate representation of the document. This representation includes document lines with annotations, tables, attachments and metadata. The table Supported documents formats and the reader’s output shows which information can be extracted depending on the document’s format.

3. Extracting structure from the document. This stage identifies line types and hierarchy levels. The supported structure types are listed in the section Structure extraction using dedoc.

4. Constructing the output structure. The resulting structure representation may vary, because structure constructors may use the information about line types and levels differently. For example, a tree of document lines may be built.
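The tree mentioned in stage 4 can be illustrated with a minimal sketch. It assumes each line carries a single integer level (smaller means more important) and nests every line under the nearest preceding line with a smaller level. The names `TreeNode` and `build_tree` are hypothetical and are not part of dedoc; the actual structure constructors may work differently.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TreeNode:
    """Hypothetical node type for illustration; not a dedoc class."""
    text: str
    level: int
    children: List["TreeNode"] = field(default_factory=list)


def build_tree(lines: List[Tuple[str, int]]) -> TreeNode:
    """Nest each (text, level) line under the nearest preceding line with a smaller level."""
    root = TreeNode(text="", level=0)
    stack = [root]  # path from the root to the most recently added node
    for text, level in lines:
        node = TreeNode(text=text, level=level)
        # Drop nodes that cannot be a parent (their level is not smaller).
        while len(stack) > 1 and stack[-1].level >= level:
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
    return root


# Assumed input: line types and levels were already identified in stage 3.
lines = [
    ("1. Introduction", 1),
    ("Some paragraph text.", 3),
    ("1.1 Background", 2),
    ("More text.", 3),
    ("2. Methods", 1),
]
tree = build_tree(lines)
```

Here `tree.children` holds the two top-level headers, and "1.1 Background" becomes a child of "1. Introduction" together with the paragraph that precedes it.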

Reading documents using dedoc

Dedoc produces a common intermediate representation for documents of various formats. The output of any reader is an instance of the class UnstructuredDocument. See readers’ annotations and readers’ line types for more details about the information that each available reader can extract.
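To make the idea of a common intermediate representation concrete, here is a simplified stand-in. It is not the real UnstructuredDocument class: the names `LineSketch` and `DocumentSketch` and all of their fields are assumptions chosen to mirror the four kinds of data described above (lines with annotations, tables, attachments, metadata).

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class LineSketch:
    """Hypothetical line: text plus formatting annotations."""
    text: str
    annotations: List[Dict[str, Any]] = field(default_factory=list)


@dataclass
class DocumentSketch:
    """Hypothetical container that every reader would fill in the same way."""
    lines: List[LineSketch] = field(default_factory=list)
    tables: List[List[List[str]]] = field(default_factory=list)  # table -> rows -> cells
    attachments: List[str] = field(default_factory=list)         # attached file names
    metadata: Dict[str, Any] = field(default_factory=dict)


# The same container can describe a docx, a pdf or an html file.
doc = DocumentSketch(
    lines=[LineSketch("Title", [{"name": "bold", "start": 0, "end": 5}])],
    tables=[[["a", "b"], ["1", "2"]]],
    metadata={"author": "unknown"},
)
text = "\n".join(line.text for line in doc.lines)
```

Because later stages depend only on this container, they do not need to know which reader produced it.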

See also

Dedoc supports a fixed list of document formats, but the list can be extended with new handlers. A tutorial on adding a new document format is available in the Dedoc documentation.

Supported documents formats and the reader’s output
Document format	Reader	Lines	Tables	Attachments
zip, tar, tar.gz, rar, 7z	`ArchiveReader`	-	-	+
csv, tsv	`CSVReader`	-	+	-
docx	`DocxReader`	+	+	+
doc, odt, rtf	convert to docx using `DocxConverter`	+	+	+
xlsx	`ExcelReader`	-	+	+
xls, ods	convert to xlsx using `ExcelConverter`	-	+	+
pptx	`PptxReader`	+	+	+
ppt, odp	convert to pptx using `PptxConverter`	+	+	+
eml	`EmailReader`	+	+	+
html, shtml	`HtmlReader`	+	+	-
mhtml, mhtml.gz, mht, mht.gz	`MhtmlReader`	+	+	+
json	`JsonReader`	+	-	+
txt, txt.gz	`RawTextReader`	+	-	-
xml	convert to txt using `TxtConverter`	+	-	-
pdf (without textual layer), png	`PdfImageReader`	+	+	-
pdf (with textual layer)	`PdfTabbyReader`, `PdfTxtlayerReader`	+	+	+
pdf	`PdfAutoReader`	+	+	+
bmp, dib, eps, gif, hdr, j2k, jfif, jp2, jpe, jpeg, jpg, pbm, pcx, pgm, pic, png, pnm, ppm, ras, sgi, sr, tiff, webp	convert to png using `PNGConverter`	+	+	-
djvu	convert to pdf using `PDFConverter`	+	+	+
note.pickle	`NoteReader`	+	-	-
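
The table can be read as a two-step lookup: an optional conversion followed by a reader. The sketch below encodes a subset of its rows in plain dictionaries. The function `plan_reading` is hypothetical, it ignores compound extensions such as tar.gz, and it is not a description of how dedoc itself selects a reader.

```python
from typing import Dict, Optional, Tuple

# Formats that are converted first: source extension -> target extension.
CONVERT_TO: Dict[str, str] = {
    "doc": "docx", "odt": "docx", "rtf": "docx",
    "xls": "xlsx", "ods": "xlsx",
    "ppt": "pptx", "odp": "pptx",
    "xml": "txt",
    "jpg": "png", "jpeg": "png", "tiff": "png",
    "djvu": "pdf",
}

# Formats that are read directly: extension -> reader name from the table.
READERS: Dict[str, str] = {
    "docx": "DocxReader",
    "xlsx": "ExcelReader",
    "pptx": "PptxReader",
    "csv": "CSVReader",
    "html": "HtmlReader",
    "json": "JsonReader",
    "eml": "EmailReader",
    "txt": "RawTextReader",
    "png": "PdfImageReader",
    "pdf": "PdfAutoReader",
}


def plan_reading(filename: str) -> Tuple[Optional[str], str]:
    """Return (conversion target or None, reader name) for a file name."""
    extension = filename.lower().rsplit(".", 1)[-1]
    target = CONVERT_TO.get(extension)
    reader = READERS.get(target or extension)
    if reader is None:
        raise ValueError(f"Unsupported format in this sketch: {filename}")
    return target, reader
```

For example, `plan_reading("report.odt")` yields a conversion to docx followed by `DocxReader`, while `plan_reading("data.csv")` needs no conversion.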

Structure extraction using dedoc

Dedoc can extract structure from documents of certain specific domains. To do so, classifiers predict the type of each document line or paragraph. Then rules (mostly based on regular expressions) determine the hierarchy level of each line for the document tree representation.
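
The two steps can be sketched as follows. This is a toy illustration, not dedoc's implementation: `classify_line` stands in for a trained classifier and uses a simple rule instead, and the regular expression, the type names "header" and "raw_text", and the level convention (depth of the numbering) are all assumptions.

```python
import re
from typing import Optional, Tuple

# Matches numbering such as "1.", "2.3" or "2.3.1" followed by text.
NUMBERING = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+\S")


def classify_line(text: str) -> str:
    """Toy stand-in for a trained line classifier."""
    return "header" if NUMBERING.match(text) else "raw_text"


def hierarchy_level(text: str) -> Tuple[str, Optional[int]]:
    """Return (line type, level); the level is the numbering depth for headers."""
    line_type = classify_line(text)
    if line_type != "header":
        return line_type, None  # plain text gets no level in this sketch
    numbering = NUMBERING.match(text).group(1)
    return line_type, numbering.count(".") + 1


for line in ["1. Introduction", "2.3.1 Details", "Plain text"]:
    print(line, "->", hierarchy_level(line))
```

A rule this simple misfires on lines such as "2024 was a good year", which is one reason to predict the line type with a classifier before applying the rules.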