PDF and images handling

Parameters for PDF and images handling

Parameter

Possible values

Default value

Where can be used

Description

pdf_with_text_layer

true, false, tabby, auto, auto_tabby, bad_encoding

auto_tabby

This option is used for choosing a specific reader of PDF documents for dedoc.DedocManager or dedoc.readers.ReaderComposition. For readers, the option is used to check if the reader is able to parse the file. The following options are available:

  • true – parsing PDF files with a textual layer (text is copiable). This option is used to choose dedoc.readers.PdfTxtlayerReader for parsing.

  • false – parsing scanned documents (images, PDF without a textual layer) even if the document has a textual layer (is copyable). This option is used to choose dedoc.readers.PdfImageReader for parsing. Note: dedoc.readers.PdfImageReader doesn’t check the option because it can handle both scanned and copyable documents.

  • tabby – parsing PDF files with a textual layer (text is copiable). This option is used to choose dedoc.readers.PdfTabbyReader for parsing.

  • auto – automatic detection of textual layer presence in the PDF document. This option is used to choose dedoc.readers.PdfAutoReader for parsing. If the document has a textual layer (is copyable), dedoc.readers.PdfTxtlayerReader will be used for parsing. If the document doesn’t have a textual layer (it is an image, scanned document), dedoc.readers.PdfImageReader will be used.

  • auto_tabby – automatic detection of textual layer presence in the PDF document. This option is used to choose dedoc.readers.PdfAutoReader for parsing. If the document has a textual layer (is copyable), dedoc.readers.PdfTabbyReader will be used for parsing. If the document doesn’t have a textual layer (it is an image, scanned document), dedoc.readers.PdfImageReader will be used. It is highly recommended to use this option value for any PDF document parsing.

  • bad_encoding – automatic correction of PDF with textual layer but broken encoding (for Russian and English languages). Use this option if you are sure that PDF file has a textual layer with broken encoding, i.e. its text is copyable, but incorrect (gibberish). Otherwise use dedoc.readers.PdfAutoReader or dedoc.readers.PdfTabbyReader, because it’s faster and it will provide better results on usual PDF files.

textual_layer_classifier

ml, simple, letter

ml

Type of the classifier for textual layer detection. Works only when auto or auto_tabby is selected at pdf_with_text_layer.

  • ml – use ML textual layer classifier trained on heuristic features, works only for Russian and English languages.

  • simple – if any text is detected in a PDF file, Dedoc assumes that textual layer is detected and it is correct. Much faster but less accurate.

  • letter – textual layer is considered as correct if percent of letters in the text > 50%.

each_page_textual_layer_detection

true, false

false

Enable textual layer detection separately on each document page. Works only when auto or auto_tabby is selected at pdf_with_text_layer.

  • true – classify text of each page if it’s correct or not. Classification algorythm depends on the fast_textual_layer_detection parameter. This option is recommended for PDF documents with scanned (or corrupted) pages in the middle.

  • false – classify text of only first 8 pages if it’s correct or not. First page is handled separately. Much faster but less accurate.

language

rus, eng, rus+eng, fra, spa

rus+eng

Language of the document without a textual layer. The following values are available:

pages

:, start:, :end, start:end

:

If you need to read a part of the PDF document, you can use page slice to define the reading range. If the range is set like start_page:end_page, document will be processed from start_page to end_page (start_page to end_page are included to the range).

  • using : means reading all document pages;

  • using empty endstart: (e.g. 5:) means reading the document from start up to the end of the document;

  • using empty start:end (e.g. :5) means reading the document from the beginning up to the end page;

  • using start:end means reading document pages from start to end inclusively.

If start > end or start > the number of pages in the document, the empty document will be returned. If end > the number of pages in the document, the document will be read up to its end. For example, if 1:3 is given, 1, 2 and 3 document pages will be processed.

is_one_column_document

true, false, auto

auto

This option is used to set the number of columns if the PDF document is without a textual layer in case it’s known beforehand. The following values are available:

  • true – the document is single column;

  • false – the document is multi-column (two columns parsing is supported);

  • auto – automatic detection of the number of columns in the document.

If you are not sure about the number of columns in the documents you need to parse, it is recommended to use auto.

document_orientation

auto, no_change

auto

This option is used to control document orientation analysis for PDF documents without a textual layer. The following values are available:

  • auto – automatic detection of rotated document pages (rotation angle 0, 90, 180, 270 degrees) and rotation of document pages;

  • no_change – parse document pages as they are without rotated pages detection.

If you are sure that the documents you need to parse consist of vertical (not rotated) pages, you can use no_change.

need_header_footer_analysis

True, False

False

This option is used to remove headers and footers of PDF documents from the output result. If need_header_footer_analysis=False, header and footer lines will present in the output as well as all other document lines. The algorithm is implemented and described in the class HeaderFooterDetector.

need_binarization

True, False

False

This option is used to clean background (binarize) for pages of PDF documents without a textual layer. If the document’s background is heterogeneous, this option may help to improve the result of document text recognition. By default need_binarization=False because its usage may decrease the quality of the document page (and the recognised text on it).

need_pdf_table_analysis

True, False

True

This option is used to enable table recognition for PDF documents or images. The table recognition method is used in dedoc.readers.PdfImageReader and dedoc.readers.PdfTxtlayerReader. If the document has a textual layer, it is recommended to use dedoc.readers.PdfTabbyReader, in this case tables will be parsed much easier and faster.

table_type

“”, wo_external_bounds, one_cell_table, split_last_column and their combinaton

“”

Setting up the table recognition method. The table recognition method is used in dedoc.readers.PdfImageReader and dedoc.readers.PdfTxtlayerReader. The value of the parameter specifies the type of tables recognized when processed by class TableRecognizer. More details about each parameter value are disclosed in the class TableTypeAdditionalOptions description. You can use combination of values (for example, wo_external_bounds+one_cell_table).

need_gost_frame_analysis

True, False

False

This option is used to enable GOST (Russian government standard) frame recognition for PDF documents or images. It allows dedoc.readers.PdfImageReader, dedoc.readers.PdfTxtlayerReader and dedoc.readers.PdfTabbyReader to properly process the content of the document containing GOST frame, see GOST frame handling for more details.