Using dedoc via API

Dedoc can be used as a web application that runs on localhost:1231. The port can be changed in the config.py file (if you clone the repository and run dedoc as a docker container).

There are two ways to install and run dedoc as a web application:

Install and run dedoc using docker.
Install dedoc using pip. After installing the library this way, you can run the application with the following command:
dedoc -m main

Application usage

Once the dedoc application is running, you can open localhost:1231 to see the main page with information about dedoc. From this page you can go to the upload page and manually choose the settings for file parsing. Pressing the upload button returns the result.

If you want to use the application from your program, you can send requests, e.g. using the requests Python library. POST requests should be sent to http://localhost:1231/upload.

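import requests

# Path to the document to parse (placeholder, replace with your own file)
filename = "example.pdf"

# Parsing parameters are sent as form fields, so all values are strings
data = {
    "pdf_with_text_layer": "auto_tabby",
    "document_type": "diploma",
    "language": "rus",
    "need_pdf_table_analysis": "true",
    "need_header_footer_analysis": "false",
    "is_one_column_document": "true",
    "return_format": "html",
}
with open(filename, "rb") as file:
    files = {"file": (filename, file)}
    r = requests.post("http://localhost:1231/upload", files=files, data=data)
    r.raise_for_status()  # fail early on HTTP errors
    result = r.content.decode("utf-8")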

The data dictionary in the example contains some parameters to parse the given file. They are described in the section Api parameters description.
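
The parameter values in the example are passed as lowercase strings such as "true" and "false". If your program keeps settings as native Python values, a small conversion step may be convenient. The following is a minimal sketch; `to_form_data` is a hypothetical helper name and is not part of dedoc.

```python
from typing import Any, Dict


def to_form_data(params: Dict[str, Any]) -> Dict[str, str]:
    """Convert Python values to the string form used in the request above.

    Hypothetical helper (not part of dedoc): booleans become "true"/"false",
    None values are dropped so that the server-side defaults apply, and
    everything else is passed through str().
    """
    form = {}
    for name, value in params.items():
        if value is None:
            continue
        if isinstance(value, bool):
            form[name] = "true" if value else "false"
        else:
            form[name] = str(value)
    return form


data = to_form_data({
    "document_type": "diploma",
    "need_pdf_table_analysis": True,
    "need_header_footer_analysis": False,
    "pages": None,  # omitted from the request, so the default is used
})
print(data)
```

The resulting dictionary can be passed as the `data` argument of `requests.post` in the same way as in the example above.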

Api parameters description

Api parameters for files parsing via dedoc
Parameter	Values	Default value	Description
Type of document structure parsing
document_type	other, law, tz, diploma, article, fintoc	other	Type of the document structure according to a specific domain. The following values are available: other – structure for documents of any domain (Default document structure type); law – Russian laws (Law structure type); tz – Russian technical specifications (Technical specification structure type); diploma – Russian thesis (Diploma structure type); article – scientific article (Article structure type (GROBID)); fintoc – English, French and Spanish financial prospectuses (FinTOC structure type). This type is used for choosing a specific structure extractor (and, in some cases, a specific reader).
patterns	list of patterns dictionaries converted to string	None	This parameter is used only when `document_type="other"`. It configures the default document structure; see Configure structure extraction using patterns for more details.
structure_type	tree, linear	tree	The type of output document representation: tree – the document is represented as a hierarchical structure where nodes are document lines/paragraphs and child nodes have a greater hierarchy level than their parents according to the level found by the structure extractor; linear – the document is represented as a tree where the root is an empty node, and all document lines are children of the root. This type is used for choosing a specific structure constructor after document structure extraction.
return_format	json, pretty_json, html, plain_text, tree	json	The output format of the result data. The document structure from a structure constructor (see `ParsedDocument`) is transformed to one of the following formats: json – simple json structure obtained via recursive transformation of `ParsedDocument` into a dictionary, see JSON output format for examples; pretty_json – the aforesaid json structure prettified by adding indentation; html – `ParsedDocument` is transformed into an html file with styles and headers according to the extracted annotations and structure; plain_text – simple textual lines of the document; tree – simple document tree representation in html format (useful for structure visualization).
Attachments handling
with_attachments	true, false	false	The option to enable attached files extraction. Some documents can have attached files (attachments), e.g. images or videos. Dedoc can find attachments of the given file, get their metadata and save them in the directory where the given file is located. If the option is `false`, all attached files will be ignored.
need_content_analysis	true, false	false	The option to enable file’s attachments parsing along with the given file. The content of the parsed attachments will be represented as `ParsedDocument` and saved in the specified return format in the `attachments` field (see JSON output format for examples). Use `true` value to enable this behaviour.
recursion_deep_attachments	integer value >= 0	10	If the attached files of the given file contain some attachments, they can also be extracted. The level of this recursion can be set via this parameter.
return_base64	true, false	false	Attached files can be encoded in base64 and their contents will be saved instead of saving attached file on disk. The encoded contents will be saved in the attachment’s metadata in the `base64_encode` field. Use `true` value to enable this behaviour.
PDF handling
need_pdf_table_analysis	true, false	true	This option is used for PDF documents which are images with text (PDF without a textual layer). It is also used for PDF documents when `pdf_with_text_layer` is `true`, `false`, `auto` or `auto_tabby`. Since costly table recognition methods are used to get tables, you may need to use `need_pdf_table_analysis=false` to increase parsing speed and get text without tables. If the document has a textual layer, it is recommended to use `pdf_with_text_layer=tabby`; in this case tables will be parsed much more easily and quickly.
pdf_with_text_layer	true, false, tabby, auto, auto_tabby, bad_encoding	auto_tabby	This option is used for choosing a specific reader of PDF documents. The following options are available: true – use this option if you are sure that the PDF file has a textual layer (its text is copyable). In this case tables will be parsed using the table recognition method for documents without a textual layer (if you set `need_pdf_table_analysis=false`, parsing will be faster but tables will be ignored). It is recommended to use `pdf_with_text_layer=tabby` instead of `pdf_with_text_layer=true`, but you can try this option as well. false – this value forces the use of the PDF reader for scanned documents (images, PDF without a textual layer) even if the document has a textual layer (is copyable). It is highly recommended to use this value only if you are sure that the documents for parsing are images or PDF without a textual layer, because this method is more costly in time and resources. tabby – use this option if you are sure that the PDF file has a textual layer (its text is copyable). This value forces the use of the PDF reader for documents with a textual layer only; it also allows tables to be extracted easily and quickly. The method enabled by this option is much faster than the method enabled by `pdf_with_text_layer=true`. auto – automatic detection of textual layer presence in the PDF document. If the document has a textual layer (is copyable), PDF document parsing works like with `pdf_with_text_layer=true`. If the document doesn’t have a textual layer (it is an image, scanned document), PDF document parsing works like with `pdf_with_text_layer=false`. It is recommended to use `pdf_with_text_layer=auto_tabby` instead of `pdf_with_text_layer=auto`, but you can try this option as well. auto_tabby – automatic detection of textual layer presence in the PDF document. If the document has a textual layer (is copyable), PDF document parsing works like with `pdf_with_text_layer=tabby`. If the document doesn’t have a textual layer (it is an image, scanned document), PDF document parsing works like with `pdf_with_text_layer=false`. It is highly recommended to use this option value for any PDF document parsing. bad_encoding – automatic correction of PDF with a textual layer but broken encoding (for Russian and English languages). Use this option if you are sure that the PDF file has a textual layer with broken encoding, i.e. its text is copyable, but incorrect (gibberish). Otherwise use `pdf_with_text_layer=auto_tabby`, because it’s faster and it will provide better results on usual PDF files.
textual_layer_classifier	ml, simple, letter	ml	Type of the classifier for textual layer detection. Works only when `pdf_with_text_layer` is set to auto or auto_tabby. ml – use the ML textual layer classifier trained on heuristic features, works only for Russian and English languages. simple – if any text is detected in a PDF file, Dedoc assumes that the textual layer is detected and it is correct. Much faster but less accurate. letter – the textual layer is considered correct if the percentage of letters in the text is > 50%.
each_page_textual_layer_detection	true, false	false	Enable textual layer detection separately on each document page. Works only when `pdf_with_text_layer` is set to auto or auto_tabby. true – classify the text of each page as correct or not. The classification algorithm depends on the fast_textual_layer_detection parameter. This option is recommended for PDF documents with scanned (or corrupted) pages in the middle. false – classify the text of only the first 8 pages as correct or not. The first page is handled separately. Much faster but less accurate.
need_gost_frame_analysis	true, false	false	This option is used to enable GOST (Russian government standard) frame recognition for PDF documents or images. The GOST frame recognizer is used to recognize and ignore the GOST frame on images and PDF documents. See GOST frame handling for more details.
language	rus, eng, rus+eng, fra, spa	rus+eng	Language of the parsed PDF document without a textual layer. The following values are available: rus – Russian; eng – English; rus+eng – both Russian and English; fra – French (for fintoc structure type); spa – Spanish (for fintoc structure type).
pages	:, start:, :end, start:end	:	If you need to read only a part of the PDF document, you can use a page slice to define the reading range. If the range is set as `start_page:end_page`, the document will be processed from `start_page` to `end_page` (both `start_page` and `end_page` are included in the range). Using : means reading all document pages; using an empty `end` – start: (e.g. 5:) means reading the document from `start` up to the end of the document; using an empty `start` – :end (e.g. :5) means reading the document from the beginning up to the `end` page; using start:end means reading document pages from `start` to `end` inclusively. If `start` > `end` or `start` > the number of pages in the document, an empty document will be returned. If `end` > the number of pages in the document, the document will be read up to its end. For example, if `1:3` is given, pages 1, 2 and 3 of the document will be processed.
is_one_column_document	true, false, auto	auto	This option is used to set the number of columns of a PDF document without a textual layer when it is known beforehand. The following values are available: true – the document is single column; false – the document is multi-column (two columns parsing is supported); auto – automatic detection of the number of columns in the document. If you are not sure about the number of columns in the documents you need to parse, it is recommended to use `auto`.
document_orientation	auto, no_change	auto	This option is used to control document orientation analysis for PDF documents without a textual layer. The following values are available: auto – automatic detection of rotated document pages (rotation angle 0, 90, 180, 270 degrees) and rotation of document pages; no_change – parse document pages as they are without rotated pages detection. If you are sure that the documents you need to parse consist of vertical (not rotated) pages, you can use `no_change`.
need_header_footer_analysis	true, false	false	This option is used to remove headers and footers of PDF documents from the output result. If `need_header_footer_analysis=false`, header and footer lines will be present in the output along with all other document lines. The algorithm is implemented and described in the class `HeaderFooterDetector`.
table_type	“”, wo_external_bounds, one_cell_table, split_last_column and their combinations	“”	Setting up the table recognition method. This option is used for PDF documents which are images with text (PDF without a textual layer). It is also used for PDF documents when `pdf_with_text_layer` is `true`, `false`, `auto` or `auto_tabby`. The value of the parameter specifies the type of tables for recognition; supported table types are described in `TableTypeAdditionalOptions`. You can use a combination of values (for example, `wo_external_bounds+one_cell_table`).
need_binarization	true, false	false	This option is used to clean background (binarize) for pages of PDF documents without a textual layer. If the document’s background is heterogeneous, this option may help to improve the result of document text recognition. By default `need_binarization=false` because its usage may decrease the quality of the document page (and the recognised text on it).
extract_notes	true, false	false	This option is used to enable notes and comments extraction from PDF documents. The notes/comments text is stored in `LinkedTextAnnotation`. By default `extract_notes=false` because its usage may slow down the processing, especially for big PDF files.
Other formats handling
delimiter	any string	None	A column separator for files in CSV and TSV format. By default “,” (comma) is used for CSV and “\t” (tabulation) for TSV.
encoding	any string	None	The encoding of documents of textual formats like TXT, CSV, TSV. Look here to get the list of possible values for the `encoding` parameter. By default the encoding of the document is detected automatically.
handle_invisible_table	true, false	false	Handle tables without visible borders as tables for HTML documents. By default tables without visible borders are parsed as usual textual lines.
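
With `return_format=json` and `structure_type=tree`, the result can be walked recursively. The sketch below assumes that the document tree is located at `content.structure` and that each node has `text`, `metadata.paragraph_type` and `subparagraphs` fields; check these names against the JSON output format description before relying on them. The `sample` dictionary is hand-written for illustration and is not real dedoc output.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_nodes(node: Dict[str, Any], depth: int = 0) -> Iterator[Tuple[int, str, str]]:
    """Yield (depth, paragraph_type, text) for a node and all its descendants."""
    paragraph_type = node.get("metadata", {}).get("paragraph_type", "")
    yield depth, paragraph_type, node.get("text", "")
    for child in node.get("subparagraphs", []):
        yield from iter_nodes(child, depth + 1)


# Hand-written sample; the field names are assumptions about the JSON output
sample = {
    "content": {
        "structure": {
            "text": "",
            "metadata": {"paragraph_type": "root"},
            "subparagraphs": [
                {
                    "text": "Introduction",
                    "metadata": {"paragraph_type": "header"},
                    "subparagraphs": [
                        {
                            "text": "First paragraph.",
                            "metadata": {"paragraph_type": "raw_text"},
                            "subparagraphs": [],
                        }
                    ],
                },
                {
                    "text": "Conclusion",
                    "metadata": {"paragraph_type": "header"},
                    "subparagraphs": [],
                },
            ],
        }
    }
}

outline = list(iter_nodes(sample["content"]["structure"]))
for depth, paragraph_type, text in outline:
    print("  " * depth + f"[{paragraph_type}] {text}")
```

With `structure_type=linear` the same function applies; under these assumptions every document line would simply appear at depth 1 below the empty root.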
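
When `with_attachments=true` and `return_base64=true` are both set, attachment contents can be restored from the `base64_encode` metadata field. The following is a minimal sketch, assuming that each element of the `attachments` list has a `metadata` dictionary; the `file_name` key is an assumption (only `base64_encode` is named in the table), and the sample is hand-written rather than real dedoc output.

```python
import base64
from typing import Any, Dict


def decode_attachments(result: Dict[str, Any]) -> Dict[str, bytes]:
    """Map attachment names to their decoded bytes.

    Attachments without a `base64_encode` metadata field are skipped.
    """
    decoded = {}
    for index, attachment in enumerate(result.get("attachments", [])):
        metadata = attachment.get("metadata", {})
        encoded = metadata.get("base64_encode")
        if encoded is None:
            continue
        # `file_name` is an assumed key; fall back to a generated name
        name = metadata.get("file_name", f"attachment_{index}")
        decoded[name] = base64.b64decode(encoded)
    return decoded


# Hand-written sample that imitates a parsed JSON response
sample = {
    "attachments": [
        {"metadata": {"file_name": "note.txt",
                      "base64_encode": base64.b64encode(b"hello").decode("ascii")}},
        {"metadata": {"file_name": "skipped.png"}},
    ]
}
files = decode_attachments(sample)
print(files)
```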
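
The `letter` value of `textual_layer_classifier` is described as a threshold on the share of letters in the text. The sketch below illustrates that idea only; it is not dedoc's implementation, and ignoring whitespace when counting characters is an assumption made here.

```python
def letter_share(text: str) -> float:
    """Return the fraction of non-whitespace characters that are letters."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(ch.isalpha() for ch in chars) / len(chars)


def looks_like_correct_text(text: str, threshold: float = 0.5) -> bool:
    """Treat the textual layer as correct if more than `threshold` of it is letters."""
    return letter_share(text) > threshold


print(looks_like_correct_text("Dedoc parses documents"))
print(looks_like_correct_text("12 34 %% ## ab"))
```

Text with a broken encoding can still consist mostly of letters, which may be why `ml` is the default classifier.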
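
The rules for the `pages` parameter can be summarized in a few lines of code. This is a sketch of the semantics described in the table, not dedoc's own parser; `pages_to_read` is a hypothetical name, and page numbers are assumed to start at 1, as in the `1:3` example.

```python
from typing import List


def pages_to_read(pages: str, page_count: int) -> List[int]:
    """Return the 1-based page numbers selected by a `start:end` slice."""
    if ":" not in pages:
        raise ValueError("expected a slice like ':', '5:', ':5' or '1:3'")
    start_text, _, end_text = pages.partition(":")
    # An empty start means the first page, an empty end means the last page
    start = int(start_text) if start_text else 1
    end = int(end_text) if end_text else page_count
    # Clamping start to 1 is an assumption; the table does not cover values below 1
    start = max(start, 1)
    # An end beyond the document is cut down to the last page
    end = min(end, page_count)
    if start > end:
        return []  # also covers start > page_count
    return list(range(start, end + 1))


print(pages_to_read("1:3", 10))
print(pages_to_read("5:", 7))
print(pages_to_read("4:2", 10))
```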
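
Combined `table_type` values are joined with `+`. A minimal sketch of building such a value from a list is given below; `build_table_type` is a hypothetical helper, and the list of known values is copied from the table above.

```python
from typing import Iterable

# Values listed for the table_type parameter
KNOWN_TABLE_TYPES = ("wo_external_bounds", "one_cell_table", "split_last_column")


def build_table_type(options: Iterable[str]) -> str:
    """Join table type options with '+', rejecting values not listed above."""
    options = list(options)
    unknown = [option for option in options if option not in KNOWN_TABLE_TYPES]
    if unknown:
        raise ValueError(f"Unknown table type(s): {unknown}")
    return "+".join(options)


table_type = build_table_type(["wo_external_bounds", "one_cell_table"])
print(table_type)
```

An empty list produces an empty string, which matches the default value of the parameter.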