Auxiliary data structures for PDF and image parsing

class dedoc.readers.pdf_reader.pdf_image_reader.table_recognizer.table_recognizer.TableRecognizer(*, config: dict | None = None)[source]

The class recognizes tables from document images. This class is internal to the system. It is called from readers such as dedoc.readers.PdfTxtlayerReader or dedoc.readers.PdfImageReader.

convert_to_multipages_tables(all_single_tables: List[ScanTable], lines_with_meta: List[LineWithMeta]) List[ScanTable][source]

The function analyzes the tables recognized across the entire document (all pages) to determine whether any of them belong to a multi-page table. Single-page tables that are parts of the same multi-page table are combined into one multi-page table.
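
The idea of combining single-page fragments can be illustrated with a minimal sketch. It is not dedoc's implementation: PageTable is a hypothetical stand-in for ScanTable, the lines_with_meta argument is ignored, and the merge rule (consecutive pages and an equal number of columns) is an assumption made only for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageTable:
    """Hypothetical stand-in for ScanTable: a page number and a matrix of cell texts."""
    page_number: int
    rows: List[List[str]]


def merge_multipage(tables: List[PageTable]) -> List[PageTable]:
    """Glue a table to the previous one when it looks like its continuation.

    Assumed rule (not dedoc's heuristics): the fragment lies on the page right
    after the previous fragment and has the same number of columns.
    Every table is assumed to have at least one row.
    """
    merged: List[PageTable] = []
    last_page: Optional[int] = None  # page of the most recently processed fragment
    for table in sorted(tables, key=lambda t: t.page_number):
        previous = merged[-1] if merged else None
        continues = (
            previous is not None
            and last_page is not None
            and table.page_number == last_page + 1
            and len(table.rows[0]) == len(previous.rows[0])
        )
        if continues:
            previous.rows.extend(table.rows)
        else:
            merged.append(PageTable(table.page_number, list(table.rows)))
        last_page = table.page_number
    return merged


single_page_tables = [
    PageTable(1, [["a", "b"]]),
    PageTable(2, [["c", "d"]]),       # same width, next page: continuation
    PageTable(3, [["x", "y", "z"]]),  # different width: a new table
]
multipage_tables = merge_multipage(single_page_tables)
```

In this sketch the first two fragments end up in one table that starts on page 1, and the three-column table stays separate.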

recognize_tables_from_image(image: ndarray, page_number: int, language: str, table_type: str = '') Tuple[ndarray, List[ScanTable]][source]

The function recognizes bordered tables in a scanned document image. Contour analysis is used to determine the boundaries of table cells. Then a set of heuristics is used to detect tables, and finally the detected table cells are converted to matrix form (merged cells are detected and separated).
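
The last step, conversion of detected cells to matrix form, can be sketched as follows. This is an illustration under stated assumptions, not dedoc's code: Cell is a hypothetical class, and cell borders are assumed to align exactly (real contours from an image would need a pixel tolerance). A merged cell is separated by repeating its text in every grid position it covers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    """Hypothetical detected cell: bounding box in pixels and recognized text."""
    x0: int
    y0: int
    x1: int
    y1: int
    text: str


def cells_to_matrix(cells: List[Cell]) -> List[List[str]]:
    # Every distinct border coordinate becomes a grid line.
    xs = sorted({c.x0 for c in cells} | {c.x1 for c in cells})
    ys = sorted({c.y0 for c in cells} | {c.y1 for c in cells})
    matrix = [["" for _ in range(len(xs) - 1)] for _ in range(len(ys) - 1)]
    for cell in cells:
        # A merged cell spans several grid rows or columns; copy its text into each.
        for row in range(ys.index(cell.y0), ys.index(cell.y1)):
            for col in range(xs.index(cell.x0), xs.index(cell.x1)):
                matrix[row][col] = cell.text
    return matrix


detected_cells = [
    Cell(0, 0, 10, 20, "text0"),  # merged vertically over two rows
    Cell(10, 0, 20, 10, "a"),
    Cell(10, 10, 20, 20, "b"),
]
matrix = cells_to_matrix(detected_cells)
print(matrix)  # [['text0', 'a'], ['text0', 'b']]
```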

class dedoc.readers.pdf_reader.data_classes.tables.table_type.TableTypeAdditionalOptions[source]

Enum of additional table types for the table recognizer. The value of the table_type parameter specifies which kinds of tables the TableRecognizer class recognizes.

  • Parameter table_type=wo_external_bounds - recognize tables without external bounds.

Example of a table of type wo_external_bounds:

 text   | text | text
--------+------+------
 text   | text | text
--------+------+------
 text   | text | text
--------+------+------
 text   | text | text

  • Parameter table_type=one_cell_table - if a document contains a bounding box with text, it will be considered a table.

Example of a page with a table of type one_cell_table:

_________________________
Header of document
text text text +------+
text           | text |  <--- it is a table
               +------+
________________________

  • Parameter table_type=split_last_column - split a merged last column of the table into separate cells, one per row.

Example of a table of type split_last_column:

+--------+------+-------+
| text   | text | text1 |
+--------+------+       |
| text0  | text | text2 |
|        +------+       |
|        | text | text3 |
+--------+------+       |
| text   | text | text4 |
+--------+------+-------+
            |
       Recognition
            |
            V
+--------+------+-------+
| text   | text | text1 |
+--------+------+-------+
| text0  | text | text2 |
+--------+------+-------+
| text0  | text | text3 |
+--------+------+-------+
| text   | text | text4 |
+--------+------+-------+
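
The effect of split_last_column in the example above can be imitated on a matrix of cell texts. The sketch below is only an illustration: the function name and the rule (one text line of the merged cell per table row) are assumptions, not dedoc's implementation.

```python
from typing import List


def split_last_column(rows: List[List[str]]) -> List[List[str]]:
    """Distribute the lines of a merged last-column cell over the table rows.

    Assumption: after merged cells are separated, every row carries the same
    multi-line text in its last cell, with one text line per table row.
    """
    merged_lines = rows[0][-1].splitlines()
    if len(merged_lines) != len(rows):
        # The split would be ambiguous, so the table is returned unchanged.
        return [list(row) for row in rows]
    return [row[:-1] + [line] for row, line in zip(rows, merged_lines)]


merged_text = "text1\ntext2\ntext3\ntext4"
table = [
    ["text", "text", merged_text],
    ["text0", "text", merged_text],
    ["text0", "text", merged_text],
    ["text", "text", merged_text],
]
split_table = split_last_column(table)
print([row[-1] for row in split_table])  # ['text1', 'text2', 'text3', 'text4']
```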

class dedoc.readers.pdf_reader.utils.header_footers_analysis.HeaderFooterDetector[source]

The class detects header and footer text lines. The algorithm is implemented according to the article:

Lin X. Header and footer extraction by page association // Document Recognition and Retrieval X. – SPIE, 2003. – Vol. 5010. – pp. 164-171.

Notes on the algorithm:

  1. For documents of 6 pages or more, lines are compared among the even pages and among the odd pages separately, which makes it possible to detect alternating headers and footers. For documents of fewer than 6 pages, lines are compared between adjacent pages (an even page with an odd page). Therefore, alternating headers and footers are not detected in documents of fewer than 6 pages.

  2. The algorithm analyzes the first 4 and last 4 lines on each page of the document and, by comparing lines across pages, identifies common footer-header patterns using Levenshtein similarity.

  3. For the algorithm to work, the document must have at least two pages of text. It is not an ML algorithm, so it cannot work with just one page.

  4. The more pages, the better. Keep in mind that the pages parameter limits the number of document pages that are processed.
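
The comparison described in the notes can be sketched in a few lines. This is a simplified illustration, not the HeaderFooterDetector code: it only compares adjacent pages (the even/odd mode for long documents is left out), and the function names and the 0.8 similarity threshold are assumptions. Only the window of 4 lines comes from the description above.

```python
from typing import List, Set


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def repeated_lines(pages: List[List[str]], window: int = 4, threshold: float = 0.8) -> List[Set[int]]:
    """For each page, return indices of lines that repeat on an adjacent page."""
    found: List[Set[int]] = [set() for _ in pages]
    for page_id in range(len(pages) - 1):
        current, following = pages[page_id], pages[page_id + 1]
        for offset in range(window):
            if offset >= len(current) or offset >= len(following):
                break
            # Same position counted from the top and from the bottom of the page.
            for index in (offset, -1 - offset):
                if similarity(current[index], following[index]) >= threshold:
                    found[page_id].add(index % len(current))
                    found[page_id + 1].add(index % len(following))
    return found


pages = [
    ["Annual report 2023", "Intro text", "more body", "Page 1"],
    ["Annual report 2023", "Other content", "different body here", "Page 2"],
]
candidates = repeated_lines(pages)
print(candidates)  # [{0, 3}, {0, 3}]
```

With a single page there is nothing to compare against, so the function returns an empty set for it, which mirrors note 3.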