Using dedoc via API
Dedoc can be used as a web application that runs on localhost:1231.
It’s possible to change the port via the config.py file (if you clone the repository and run dedoc as a docker container).
There are two ways to install and run dedoc as a web application:
Install dedoc using pip. After installing the library this way, you can run the application with the following command:
dedoc -m main
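A script that starts the application may need to wait until the server actually answers before sending files. The following is a minimal sketch using only the Python standard library, assuming the default address localhost:1231. The function name `wait_until_ready` and its parameters are illustrative and not part of dedoc.

```python
import time
import urllib.request


def wait_until_ready(url="http://localhost:1231", timeout=30.0,
                     interval=0.5, request_timeout=5.0):
    """Poll `url` until it is served successfully or `timeout` seconds pass.

    Hypothetical helper: returns True once a request succeeds, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=request_timeout):
                return True  # the page was served successfully
        except OSError:
            # Covers refused connections, timeouts and HTTP error statuses
            # (urllib.error.URLError and HTTPError are subclasses of OSError).
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller could, for example, stop with an error when `wait_until_ready()` returns `False` instead of sending requests to a server that has not started.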
Application usage
Once you run the dedoc application, you can go to localhost:1231 and look at the main page with the information about dedoc.
From this page you can go to the upload page and manually choose settings for the file parsing.
Then you can get the result after pressing the upload button.
If you want to use the application in your program, you can send requests, e.g. using the requests Python library.
POST requests should be sent to http://localhost:1231/upload.

    import requests

    filename = "document.pdf"  # hypothetical path to the file you want to parse

    # Parsing parameters (see the section Api parameters description)
    data = {
        "pdf_with_text_layer": "auto_tabby",
        "document_type": "diploma",
        "language": "rus",
        "need_pdf_table_analysis": "true",
        "need_header_footer_analysis": "false",
        "is_one_column_document": "true",
        "return_format": "html",
    }

    # Send the file and the parameters in one multipart POST request
    with open(filename, "rb") as file:
        files = {"file": (filename, file)}
        r = requests.post("http://localhost:1231/upload", files=files, data=data)
        result = r.content.decode("utf-8")

The data dictionary in the example contains some parameters to parse the given file.
They are described in the section Api parameters description.
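The request above can be wrapped in a small reusable helper. This is only a sketch: the names `build_form_data` and `upload_document` are hypothetical, the timeout value is arbitrary, and it assumes that a parameter left out of the request falls back to its documented default. Boolean values are converted to the "true"/"false" strings used in the example.

```python
from pathlib import Path


def build_form_data(parameters):
    """Convert Python values to the string form used in the request body."""
    form = {}
    for name, value in parameters.items():
        if value is None:
            # Assumption: an omitted field falls back to its documented default.
            continue
        if isinstance(value, bool):
            form[name] = "true" if value else "false"
        else:
            form[name] = str(value)
    return form


def upload_document(path, parameters=None,
                    url="http://localhost:1231/upload", timeout=300):
    """Send one file to a running dedoc application and return the response text."""
    # Third-party dependency; imported here so that build_form_data
    # can be used even when requests is not installed.
    import requests

    path = Path(path)
    with path.open("rb") as file:
        response = requests.post(
            url,
            files={"file": (path.name, file)},
            data=build_form_data(parameters or {}),
            timeout=timeout,
        )
    response.raise_for_status()  # raise an exception on HTTP error statuses
    return response.content.decode("utf-8")
```

With such a helper, the example becomes `upload_document("document.pdf", {"document_type": "diploma", "need_pdf_table_analysis": True, "return_format": "html"})`, where the file name is again a placeholder.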
Api parameters description
Parameter | Values | Default value | Description |
---|---|---|---|
Type of document structure parsing | | | |
document_type | other, law, tz, diploma, article, fintoc | other | Type of the document structure according to a specific domain. The available values are listed in the Values column. This type is used for choosing a specific structure extractor (and, in some cases, a specific reader). |
patterns | list of patterns dictionaries converted to string | None | This parameter is used only when |
structure_type | tree, linear | tree | The type of output document representation (see the Values column). This type is used for choosing a specific structure constructor after document structure extraction. |
return_format | json, pretty_json, html, plain_text, tree | json | The output format of the result data. The document structure from a structure constructor (see |
Attachments handling | | | |
with_attachments | true, false | false | The option to enable attached files extraction. Some documents can have attached files (attachments), e.g. images or videos. Dedoc can find attachments of the given file, get their metadata and save them in the directory where the given file is located. If the option is |
need_content_analysis | true, false | false | The option to enable parsing of the file’s attachments along with the given file. The content of the parsed attachments will be represented as |
recursion_deep_attachments | integer value >= 0 | 10 | If the attached files of the given file contain some attachments, they can also be extracted. The level of this recursion can be set via this parameter. |
return_base64 | true, false | false | Attached files can be encoded in base64, in which case their contents are saved instead of saving the attached file on disk. The encoded contents will be saved in the attachment’s metadata in the |
PDF handling | | | |
need_pdf_table_analysis | true, false | true | This option is used for PDF documents which are images with text (PDF without a textual layer). It is also used for PDF documents when |
pdf_with_text_layer | true, false, tabby, auto, auto_tabby | auto_tabby | This option is used for choosing a specific reader of PDF documents. The available options are listed in the Values column. |
fast_textual_layer_detection | true, false | false | Enable fast textual layer detection. Works only when auto or auto_tabby is selected for pdf_with_text_layer. |
need_gost_frame_analysis | true, false | false | This option is used to enable GOST (Russian government standard) frame recognition for PDF documents or images. The GOST frame recognizer is used to recognize and ignore GOST frames on images and PDF documents. |
language | rus, eng, rus+eng, fra, spa | rus+eng | Language of the parsed PDF document without a textual layer. The available values are listed in the Values column. |
pages | :, start:, :end, start:end | : | If you need to read a part of the PDF document, you can use a page slice to define the reading range. If the range is set like If |
is_one_column_document | true, false, auto | auto | This option is used to set the number of columns of a PDF document without a textual layer when it is known beforehand. The available values are listed in the Values column. If you are not sure about the number of columns in the documents you need to parse, it is recommended to use |
document_orientation | auto, no_change | auto | This option is used to control document orientation analysis for PDF documents without a textual layer. The available values are listed in the Values column. If you are sure that the documents you need to parse consist of vertical (not rotated) pages, you can use |
need_header_footer_analysis | true, false | false | This option is used to remove headers and footers of PDF documents from the output result. If |
need_binarization | true, false | false | This option is used to clean background (binarize) for pages of PDF documents without a textual layer. If the document’s background is heterogeneous, this option may help to improve the result of document text recognition. By default |
Other formats handling | | | |
delimiter | any string | None | A column separator for files in CSV and TSV format. By default “,” (comma) is used for CSV and “\t” (tabulation) for TSV. |
encoding | any string | None | The encoding of documents of textual formats like TXT, CSV, TSV. Look here to get the list of possible values for the |
handle_invisible_table | true, false | false | Handle tables without visible borders as tables for HTML documents. By default tables without visible borders are parsed as usual textual lines. |
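Because all parameters are sent as strings, a typo in a value is easy to miss. The sketch below checks a `data` dictionary against the Values column of the table before a request is sent. It is an illustrative client-side check only: the name `find_invalid_parameters` is hypothetical, the sets are copied from the table, and the server may accept values that are not listed here.

```python
import re

# Allowed values copied from the table above. Parameters with free-form
# values (patterns, delimiter, encoding) are not checked.
ALLOWED_VALUES = {
    "document_type": {"other", "law", "tz", "diploma", "article", "fintoc"},
    "structure_type": {"tree", "linear"},
    "return_format": {"json", "pretty_json", "html", "plain_text", "tree"},
    "pdf_with_text_layer": {"true", "false", "tabby", "auto", "auto_tabby"},
    "language": {"rus", "eng", "rus+eng", "fra", "spa"},
    "is_one_column_document": {"true", "false", "auto"},
    "document_orientation": {"auto", "no_change"},
}
BOOLEAN_PARAMETERS = {
    "with_attachments", "need_content_analysis", "return_base64",
    "need_pdf_table_analysis", "fast_textual_layer_detection",
    "need_gost_frame_analysis", "need_header_footer_analysis",
    "need_binarization", "handle_invisible_table",
}
# Matches the four forms from the table: ":", "start:", ":end", "start:end".
PAGES_PATTERN = re.compile(r"\d*:\d*")


def find_invalid_parameters(data):
    """Return {name: problem} for string values that do not match the table."""
    problems = {}
    for name, value in data.items():
        if name in ALLOWED_VALUES:
            if value not in ALLOWED_VALUES[name]:
                allowed = ", ".join(sorted(ALLOWED_VALUES[name]))
                problems[name] = f"expected one of: {allowed}; got {value!r}"
        elif name in BOOLEAN_PARAMETERS:
            if value not in {"true", "false"}:
                problems[name] = f"expected 'true' or 'false'; got {value!r}"
        elif name == "pages":
            if not PAGES_PATTERN.fullmatch(str(value)):
                problems[name] = f"expected a slice like 'start:end'; got {value!r}"
        elif name == "recursion_deep_attachments":
            if not str(value).isdecimal():
                problems[name] = f"expected an integer >= 0; got {value!r}"
    return problems
```

An empty result means that every checked value appears in the table. Parameters the function does not know are passed through without a check.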