Auxiliary data structures for PDF and image parsing
- class dedoc.readers.pdf_reader.pdf_image_reader.table_recognizer.table_recognizer.TableRecognizer(*, config: dict | None = None)[source]
The class recognizes tables in document images. It is internal to the system and is called from readers such as dedoc.readers.PdfTxtlayerReader or dedoc.readers.PdfImageReader.
The class recognizes tables with borders in a document image using recognize_tables_from_image().
The class also analyzes the recognized single-page tables and combines them into multi-page ones using convert_to_multipages_tables().
- convert_to_multipages_tables(all_single_tables: List[ScanTable], lines_with_meta: List[LineWithMeta]) -> List[ScanTable][source]
The function analyzes the tables recognized in the entire document (all pages) to see whether any of them are multi-page. If several single-page tables are parts of one multi-page table, they are combined into a single multi-page table.
- recognize_tables_from_image(image: ndarray, page_number: int, language: str, table_type: str = '') -> Tuple[ndarray, List[ScanTable]][source]
The function recognizes tables with borders in a scanned document image. Contour analysis is used to determine the boundaries of table cells. Then a set of heuristics is used to detect tables, and finally the detected table cells are converted to matrix form (merged cells are detected and separated).
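The conversion to matrix form mentioned above can be illustrated with a minimal sketch. It is not dedoc's implementation: the Cell type and the cells_to_matrix function are hypothetical names introduced here, and the sketch assumes that cell borders line up exactly, which real scans rarely do. Every distinct border coordinate becomes a grid line, and a merged cell is separated by copying its text into each grid slot it spans.

```python
from typing import List, NamedTuple


class Cell(NamedTuple):
    """Hypothetical cell box in pixel coordinates (not dedoc's own cell type)."""
    x0: int
    y0: int
    x1: int
    y1: int
    text: str


def cells_to_matrix(cells: List[Cell]) -> List[List[str]]:
    """Place cell boxes on a regular grid; a merged cell fills every slot it spans."""
    # Every distinct border coordinate becomes a grid line.
    xs = sorted({c.x0 for c in cells} | {c.x1 for c in cells})
    ys = sorted({c.y0 for c in cells} | {c.y1 for c in cells})
    matrix = [["" for _ in range(len(xs) - 1)] for _ in range(len(ys) - 1)]
    for cell in cells:
        # A cell covers all grid rows and columns between its borders.
        for row in range(ys.index(cell.y0), ys.index(cell.y1)):
            for col in range(xs.index(cell.x0), xs.index(cell.x1)):
                matrix[row][col] = cell.text
    return matrix


# A 2x2 table whose left column is one merged cell spanning both rows.
example_cells = [
    Cell(0, 0, 50, 40, "text0"),
    Cell(50, 0, 100, 20, "a"),
    Cell(50, 20, 100, 40, "b"),
]
example_matrix = cells_to_matrix(example_cells)
print(example_matrix)  # [['text0', 'a'], ['text0', 'b']]
```

In practice the coordinates returned by contour analysis are noisy, so a real implementation would have to cluster nearby border coordinates before building the grid.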
- class dedoc.readers.pdf_reader.data_classes.tables.table_type.TableTypeAdditionalOptions[source]
Enum of table types for the table recognizer. The value of the table_type parameter specifies which type of tables the TableRecognizer class recognizes.
Parameter table_type=wo_external_bounds - recognize tables without external bounds.
Example of a table of type wo_external_bounds:
    text    | text | text
    --------+------+------
    text    | text | text
    --------+------+------
    text    | text | text
    --------+------+------
    text    | text | text
Parameter table_type=one_cell_table - if a document contains a bounding box with text, it will be considered a table.
Example of a page with a table of type one_cell_table:
    _________________________
    Header of document
    text text text  +------+
    text            | text |     <--- it is a table
                    +------+
    ________________________
Parameter table_type=split_last_column - if the last column of a table is a single merged cell, it is split into separate cells, one per table row.
Example of a table of type split_last_column:
    +--------+------+-------+
    | text   | text | text1 |
    +--------+------+       |
    | text0  | text | text2 |
    |        | -----|       |
    |        | text | text3 |
    +--------+------+       |
    | text   | text | text4 |
    +--------+------+-------+
                |
           Recognition
                |
                V
    +--------+------+-------+
    | text   | text | text1 |
    +--------+------+-------|
    | text0  | text | text2 |
    |--------+ -----+------ |
    | text0  | text | text3 |
    +--------+------+------ |
    | text   | text | text4 |
    +--------+------+-------+
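The effect shown in the split_last_column illustration can be mimicked on a table that is already in matrix form. The sketch below is only a toy model and not the recognizer's logic: the split_last_column function is a hypothetical helper, and it assumes that the merged last-column cell holds exactly one text line per table row and that its text has been copied into every row.

```python
from typing import List


def split_last_column(rows: List[List[str]]) -> List[List[str]]:
    """Give each row its own line from a last-column cell merged across all rows.

    Assumption of this sketch: every row carries the same merged text in its
    last slot, and that text has exactly one line per table row.
    """
    if not rows or not rows[0]:
        return rows
    merged_text = rows[0][-1]
    merged_lines = merged_text.split("\n")
    same_cell_everywhere = all(row and row[-1] == merged_text for row in rows)
    if not same_cell_everywhere or len(merged_lines) != len(rows):
        return rows  # not the pattern this sketch handles; leave the table as is
    return [row[:-1] + [line] for row, line in zip(rows, merged_lines)]


merged = "text1\ntext2\ntext3\ntext4"
table = [
    ["text", "text", merged],
    ["text0", "text", merged],
    ["text0", "text", merged],
    ["text", "text", merged],
]
split_table = split_last_column(table)
for row in split_table:
    print(row)
```

The real recognizer works on cell geometry and text line positions in the image, so a faithful implementation would match lines to rows by their coordinates rather than by counting them.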
The class detects header and footer text lines. The algorithm is implemented according to the article:
Lin X. Header and footer extraction by page association // Document Recognition and Retrieval X. – SPIE, 2003. – Vol. 5010. – P. 164-171.
Notes on the algorithm:
For documents of 6 pages or more, lines on even and odd pages are compared in order to detect alternating headers and footers. For documents of fewer than 6 pages, only lines on adjacent pages (an even page and an odd page) are compared, so alternating headers and footers are not detected in such documents.
The algorithm analyzes the first 4 and the last 4 lines of each page and, by comparing lines across pages, identifies common header and footer patterns using Levenshtein similarity.
The algorithm needs a document with at least two pages of text. It is not an ML algorithm, so it cannot work with a single page.
The more pages, the better. Keep in mind that the pages parameter limits the number of pages processed in a document.
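The page comparison described in these notes can be sketched as follows. This is a simplified illustration, not the class's actual code: the function names, the step parameter and the 0.8 threshold are assumptions made for the example. Lines at the same position among the first and last window lines of two pages are compared with a normalized Levenshtein similarity, and lines that are similar enough are flagged as header or footer candidates.

```python
from typing import List, Set


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Levenshtein similarity normalized to the range [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def detect_repeated_edge_lines(pages: List[List[str]], window: int = 4,
                               threshold: float = 0.8, step: int = 1) -> List[Set[int]]:
    """For every page, return indices of top/bottom lines repeated on page + step.

    window, threshold and step are illustrative parameters of this sketch.
    """
    flagged: List[Set[int]] = [set() for _ in pages]
    for p in range(len(pages) - step):
        first_page, second_page = pages[p], pages[p + step]
        depth = min(window, len(first_page), len(second_page))
        for k in range(depth):
            # Index pairs of the k-th line from the top and from the bottom.
            pairs = ((k, k), (len(first_page) - 1 - k, len(second_page) - 1 - k))
            for i, j in pairs:
                a, b = first_page[i].strip(), second_page[j].strip()
                if a and b and similarity(a, b) >= threshold:
                    flagged[p].add(i)
                    flagged[p + step].add(j)
    return flagged


pages = [
    ["Annual report 2003", "Introduction", "Some body text.", "Page 1"],
    ["Annual report 2003", "Methods are described.",
     "A much longer sentence about tables.", "Page 2"],
]
flagged_lines = detect_repeated_edge_lines(pages)
print(flagged_lines)  # [{0, 3}, {0, 3}]
```

With step=1 the sketch compares adjacent pages. Passing step=2 compares pages of the same parity, which is one possible way to approximate the detection of alternating headers and footers in longer documents. In both cases at least two pages are required, since a single page has nothing to be compared with.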