Using dedoc via API
Dedoc can be used as a web application that runs on localhost:1231. The port can be changed in the config.py file (if you clone the repository and run dedoc as a docker container).
There are two ways to install and run dedoc as a web application: as a docker container or with pip. After installing the library with pip, you can run the application via the following command:
dedoc -m main
Application usage
Once the dedoc application is running, you can open localhost:1231 to see the main page with information about dedoc. From this page you can go to the upload page and manually choose the settings for file parsing. The result is shown after you press the upload button.
If you want to use the application from your own program, you can send requests to it, e.g. using the requests Python library. POST requests should be sent to http://localhost:1231/upload.
    import requests

    # Path to the document to parse (placeholder, replace with your own file)
    filename = "example.pdf"

    # Parsing parameters, sent as form fields along with the file
    data = {
        "pdf_with_text_layer": "auto_tabby",
        "document_type": "diploma",
        "language": "rus",
        "need_pdf_table_analysis": "true",
        "need_header_footer_analysis": "false",
        "is_one_column_document": "true",
        "return_format": "html",
    }

    with open(filename, "rb") as file:
        files = {"file": (filename, file)}
        r = requests.post("http://localhost:1231/upload", files=files, data=data)
        result = r.content.decode("utf-8")
The data dictionary in the example contains parameters that control how the given file is parsed. They are described in the section API parameters description.
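In the example above every value in the data dictionary is a string, including the boolean options ("true", "false"). When the settings live in Python code as real booleans or numbers, a small helper can build that dictionary. The sketch below is only an illustration: the name to_form_data is hypothetical and not part of dedoc, and skipping None values assumes that a parameter left out of the request falls back to its default.

```python
from typing import Any, Dict


def to_form_data(**params: Any) -> Dict[str, str]:
    """Convert Python values to the string form used in the request data.

    Hypothetical helper, not part of dedoc.
    """
    data: Dict[str, str] = {}
    for name, value in params.items():
        if value is None:
            # Assumption: an omitted parameter takes its default value.
            continue
        if isinstance(value, bool):
            # Boolean options are written as "true" / "false".
            data[name] = "true" if value else "false"
        else:
            data[name] = str(value)
    return data


# The same settings as in the example above, written with Python booleans.
data = to_form_data(
    pdf_with_text_layer="auto_tabby",
    document_type="diploma",
    language="rus",
    need_pdf_table_analysis=True,
    need_header_footer_analysis=False,
    is_one_column_document=True,
    return_format="html",
)
```

The resulting dictionary can be passed as the data argument of requests.post in the same way as the hand-written one.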
API parameters description
| Parameter | Values | Default value | Description |
|---|---|---|---|
| Type of document structure parsing | | | |
| document_type | other, law, tz, diploma | other | Type of the document structure according to the specific domain. This type is used for choosing a specific structure extractor after document reading. |
| structure_type | tree, linear | tree | The type of output document representation. This type is used for choosing a specific structure constructor after document structure extraction. |
| return_format | json, pretty_json, html, tree | json | The output format of the result data. The document structure from a structure constructor (see |
| Attachments handling | | | |
| with_attachments | true, false | false | The option to enable attached files extraction. Some documents can have attached files (attachments), e.g. images or videos. Dedoc can find the attachments of the given file, get their metadata and save them in the directory where the given file is located. If the option is false, all attached files will be ignored. |
| need_content_analysis | true, false | false | The option to enable parsing of the file’s attachments along with the given file. The content of the parsed attachments will be represented as |
| recursion_deep_attachments | integer value >= 0 | 10 | If the attached files of the given file contain attachments of their own, they can also be extracted. The depth of this recursion can be set via this parameter. |
| return_base64 | true, false | false | Attached files can be encoded in base64, and their contents will be saved instead of saving the attached file on disk. The encoded contents will be saved in the attachment’s metadata in the base64_encode field. Use the true value to enable this behaviour. |
| attachments_dir | optional string with a valid path | None | The path to the directory where the document’s attached files can be saved instead of a temporary directory. |
| Tables handling | | | |
| insert_table | true, false | false | Parameter for inserting tables into the result content. By default tables are returned separately from the main document tree. If insert_table is set to true, tables will be inserted into the document tree. See JSON output format for examples of the result structure in both cases. |
| need_pdf_table_analysis | true, false | true | This option is used for PDF documents which are images with text (PDF without a textual layer). It is also used for PDF documents when pdf_with_text_layer is true, false, auto or auto_tabby. Since costly table recognition methods are used to get tables, you may need to use need_pdf_table_analysis=false to increase parsing speed and get text without tables. If the document has a textual layer, it is recommended to use pdf_with_text_layer=tabby; in this case tables will be parsed much more easily and quickly. |
| orient_analysis_cells | true, false | false | This option is used for table recognition in case of PDF documents without a textual layer (images, scanned documents or when pdf_with_text_layer is true, false or auto). When set to true, it enables analysis of rotated cells in table headers. Use this option if you are sure that the cells of the table header are rotated. |
| orient_cell_angle | 90, 270 | 90 | This option is used for table recognition in case of PDF documents without a textual layer (images, scanned documents or when pdf_with_text_layer is true, false or auto). It is ignored when orient_analysis_cells=false. The option sets the orientation of cells in table headers. |
| PDF handling | | | |
| pdf_with_text_layer | true, false, tabby, auto, auto_tabby | auto_tabby | This option is used for choosing a specific reader of PDF documents. |
| language | rus, eng, rus+eng | rus+eng | Language of the parsed PDF document without a textual layer. |
| pages | :, start:, :end, start:end | : | If you need to read a part of the PDF document, you can use a page slice to define the reading range. If the range is set like start_page:end_page, the document will be processed from start_page to end_page (both start_page and end_page are included in the range). If start > end or start > the number of pages in the document, an empty document will be returned. If end > the number of pages in the document, the document will be read up to its end. For example, if 1:3 is given, document pages 1, 2 and 3 will be processed. |
| is_one_column_document | true, false, auto | auto | This option sets the number of columns for a PDF document without a textual layer when it is known beforehand. If you are not sure about the number of columns in the documents you need to parse, it is recommended to use auto. |
| document_orientation | auto, no_change | auto | This option is used to control document orientation analysis for PDF documents without a textual layer. If you are sure that the documents you need to parse consist of vertical (not rotated) pages, you can use no_change. |
| need_header_footer_analysis | true, false | false | This option is used to remove headers and footers of PDF documents from the output result. If need_header_footer_analysis=false, header and footer lines will be present in the output along with all other document lines. |
| need_binarization | true, false | false | This option is used to clean the background (binarize) for pages of PDF documents without a textual layer. If the document’s background is heterogeneous, this option may help to improve the result of document text recognition. By default need_binarization=false because its usage may decrease the quality of the document page (and the recognised text on it). |
| Other formats handling | | | |
| delimiter | any string | None | A column separator for files in CSV and TSV format. By default “,” (comma) is used for CSV and “\t” (tabulation) for TSV. |
| encoding | any string | None | The encoding of documents of textual formats like TXT, CSV, TSV. Look here to get the list of possible values for the encoding parameter. By default the encoding of the document is detected automatically. |
| handle_invisible_table | true, false | false | Handle tables without visible borders as tables for HTML documents. By default tables without visible borders are parsed as usual textual lines. |
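The range rules given for the pages parameter can be checked with a few lines of code. The function below is a minimal sketch of those rules as stated in the table, not dedoc's own implementation: the name resolve_pages is hypothetical, and it assumes page numbers start at 1, as the 1:3 example suggests.

```python
from typing import List


def resolve_pages(pages: str, page_count: int) -> List[int]:
    """Return the page numbers selected by a `pages` value.

    Hypothetical illustration of the documented rules, not dedoc code.
    Assumes 1-based page numbers and a value of the form "start:end",
    where either side may be empty.
    """
    start_text, separator, end_text = pages.partition(":")
    if separator != ":":
        raise ValueError("expected a value like ':', 'start:', ':end' or 'start:end'")

    # A missing bound means "from the first page" or "to the last page".
    start = int(start_text) if start_text else 1
    end = int(end_text) if end_text else page_count

    # start > end, or start beyond the last page: nothing is read.
    if start > end or start > page_count:
        return []

    # An end beyond the last page is cut at the document's end; both bounds are included.
    return list(range(start, min(end, page_count) + 1))


print(resolve_pages("1:3", 10))
print(resolve_pages("8:20", 10))
print(resolve_pages("5:2", 10))
```

For a ten-page document, 1:3 selects pages 1 to 3, 8:20 is cut at page 10, and 5:2 selects nothing.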