dedoc.data_structures

Main classes defining a document

class dedoc.data_structures.UnstructuredDocument(tables: List[Table], lines: List[LineWithMeta], attachments: List[AttachedFile], warnings: List[str] | None = None, metadata: dict | None = None)[source]

This class holds the raw document content (text, tables and attachments) obtained by one of the readers. Text is represented as a flat list of lines; the hierarchy level of each line isn’t defined (only the tag hierarchy level may exist).

Variables:

lines (List[LineWithMeta]) – list of textual lines with metadata returned by a reader
tables (List[Table]) – list of document tables returned by a reader
attachments (List[AttachedFile]) – list of document attached files
metadata (dict) – information about the document (like in DocumentMetadata)
warnings (List[str]) – list of warnings, obtained in the process of the document parsing

class dedoc.data_structures.ParsedDocument(metadata: DocumentMetadata, content: DocumentContent, warnings: List[str] | None = None, attachments: List[ParsedDocument] | None = None)[source]

Bases: Serializable

This class holds information about the document content, metadata and attachments.

Variables:

content (DocumentContent) – document text (hierarchy of nodes) and tables
attachments (List[ParsedDocument]) – result of analysis of attached files (empty if with_attachments=False)
metadata (DocumentMetadata) – document metadata such as size, creation date and so on.
warnings (List[str]) – list of warnings and possible errors, arising in the process of document parsing

to_api_schema() → ParsedDocument[source]

Convert class data into the corresponding API schema class.

Returns:: API schema class

class dedoc.data_structures.DocumentContent(tables: List[Table], structure: TreeNode, warnings: List[str] | None = None)[source]

Bases: Serializable

This class holds the document content - structured text and tables.

Variables:

tables (List[Table]) – list of document tables
structure (TreeNode) – tree structure of the document nodes with text and additional metadata
warnings (List[str]) – list of warnings, obtained in the process of the document parsing

to_api_schema() → DocumentContent[source]

Convert class data into the corresponding API schema class.

Returns:: API schema class

class dedoc.data_structures.DocumentMetadata(file_name: str, temporary_file_name: str, size: int, modified_time: int, created_time: int, access_time: int, file_type: str, uid: str | None = None, **kwargs: Dict[str, str | int | float])[source]

Bases: Serializable

This class holds information about document metadata.

Variables:

file_name (str) – original document name (before rename and conversion, so it can contain non-ascii symbols, spaces and so on)
temporary_file_name (str) – file name during parsing (unique name after rename and conversion)
size (int) – size of the original file in bytes
modified_time (int) – time of the last modification in unix time format (seconds since the epoch)
created_time (int) – time of the creation in unix time format (seconds since the epoch)
access_time (int) – time of the last access to the file in unix time format (seconds since the epoch)
file_type (str) – mime type of the file
uid (str) – document unique identifier (useful for attached files)

Additional variables may be added with other file metadata.

to_api_schema() → DocumentMetadata[source]

Convert class data into the corresponding API schema class.

Returns:: API schema class
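The fields above map closely onto what the operating system reports for a file. The following is a minimal sketch of how such values could be gathered with the standard library; `collect_file_metadata` and the uid format are hypothetical and are not part of dedoc.

```python
import mimetypes
import os
import tempfile
import uuid


def collect_file_metadata(path: str, original_name: str) -> dict:
    """Gather DocumentMetadata-like fields for a file on disk (illustrative helper)."""
    stat = os.stat(path)
    return {
        "file_name": original_name,                      # name before rename and conversion
        "temporary_file_name": os.path.basename(path),   # name used during parsing
        "size": stat.st_size,                            # bytes
        "modified_time": int(stat.st_mtime),             # seconds since the epoch
        "created_time": int(stat.st_ctime),              # meaning is platform-dependent
        "access_time": int(stat.st_atime),
        "file_type": mimetypes.guess_type(path)[0],      # guessed from the extension only
        "uid": f"doc_uid_{uuid.uuid4()}",                # assumed uid format
    }


# Create a small temporary file to inspect.
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write(b"hello")
    tmp_path = tmp.name

metadata = collect_file_metadata(tmp_path, original_name="my report.txt")
os.remove(tmp_path)
print(metadata["size"])  # 5
```

All timestamps are truncated to whole seconds so that they match the `int` type of the documented fields.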

class dedoc.data_structures.TreeNode(node_id: str, text: str, annotations: List[Annotation], metadata: LineMetadata, subparagraphs: List[TreeNode], parent: TreeNode | None)[source]

Bases: Serializable

TreeNode helps to represent a document as a recursive tree structure. It has a parent node (None for the root of the tree) and a list of child nodes (empty list for a leaf node).

Variables:

node_id (str) – unique node identifier
text (str) – text of the node (may contain several lines)
annotations (List[Annotation]) – some metadata related to the part of the text (as font size)
metadata (LineMetadata) – metadata refers to entire node (as node type)
subparagraphs (List[TreeNode]) – list of children of this node
parent (TreeNode) – parent node (None for the root, not None for other nodes)

add_child(line: LineWithMeta) → TreeNode[source]

Create a new tree node from the given line and attach it as a child of the current node. Return the newly created node.

Parameters:: line – line with metadata, the new node will be built from this line
Returns:: the created node (a child of self)

add_text(line: LineWithMeta) → None[source]

Add the text and annotations from the given line; the added text is separated with a newline symbol.

Parameters:: line – line with text to add

static create(lines: List[LineWithMeta] | None = None) → TreeNode[source]

Creates a root node with given text.

Parameters:: lines – these lines should be the title of the document (or should be empty for documents without a title)
Returns:: root of the document tree

get_root() → TreeNode[source]

Returns:: root of the tree

to_api_schema() → TreeNode[source]

Convert class data into the corresponding API schema class.

Returns:: API schema class
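The parent and child links described for TreeNode can be pictured with a small stand-in class. This is only a sketch: `SimpleNode` is hypothetical, it stores plain strings instead of LineWithMeta objects, and its node id scheme (parent id plus child index) is an assumption rather than documented behaviour.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class SimpleNode:
    """Hypothetical stand-in for TreeNode: text, children and a parent link."""
    node_id: str
    text: str
    subparagraphs: List["SimpleNode"] = field(default_factory=list)
    parent: Optional["SimpleNode"] = field(default=None, repr=False)

    def add_child(self, text: str) -> "SimpleNode":
        # Assumed id scheme: parent id followed by the index of the child.
        child = SimpleNode(node_id=f"{self.node_id}.{len(self.subparagraphs)}", text=text, parent=self)
        self.subparagraphs.append(child)
        return child

    def add_text(self, text: str) -> None:
        # Extra text is appended on a new line.
        self.text = f"{self.text}\n{text}" if self.text else text

    def get_root(self) -> "SimpleNode":
        # Follow parent links until a node without a parent is reached.
        node = self
        while node.parent is not None:
            node = node.parent
        return node


root = SimpleNode(node_id="0", text="Document title")
chapter = root.add_child("Chapter 1")
paragraph = chapter.add_child("First line")
paragraph.add_text("second line")

print(paragraph.node_id)             # 0.0.0
print(paragraph.get_root() is root)  # True
```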

class dedoc.data_structures.LineWithMeta(line: str, metadata: LineMetadata | None = None, annotations: List[Annotation] | None = None, uid: str | None = None)[source]

Bases: Sized, Serializable

Structural unit of document - line (or paragraph) of text and its metadata. One LineWithMeta should not contain text from different logical parts of the document (for example, document title and raw text of the document should not be in the same line). Still the logical part of the document may be represented by more than one line (for example, document title may consist of many lines).

__len__() → int[source]

__getitem__(index: slice | int) → LineWithMeta[source]

__add__(other: LineWithMeta | str) → LineWithMeta[source]

__init__(line: str, metadata: LineMetadata | None = None, annotations: List[Annotation] | None = None, uid: str | None = None) → None[source]

Parameters:

line – raw text of the document line
metadata – metadata (related to the entire line, as line or page number, its hierarchy level)
annotations – metadata that refers to some part of the text, for example, font size, font type, etc.
uid – unique identifier of the line

__lt__(other: LineWithMeta) → bool[source]: Return self<value.

property annotations: List[Annotation]: Metadata that refers to some part of the text, for example, font size, font type, etc.

static join(lines: List[LineWithMeta], delimiter: str = '\n') → LineWithMeta[source]

Join a list of lines with the given delimiter, keeping annotations consistent. This method is similar to the Python built-in join method for strings.

Parameters:

lines – list of lines to join
delimiter – delimiter to insert between lines

Returns:

merged line
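Keeping annotations consistent during a join presumably means shifting their offsets so that they still point at the same characters in the merged text. A minimal sketch of that idea follows; it uses plain tuples instead of LineWithMeta and Annotation, and `join_lines` is a hypothetical helper, not the library method.

```python
from typing import List, Tuple

Span = Tuple[int, int, str, str]   # (start, end, name, value)
Line = Tuple[str, List[Span]]      # (text, annotations)


def join_lines(lines: List[Line], delimiter: str = "\n") -> Line:
    """Concatenate texts and shift annotation offsets to match the merged text."""
    text_parts: List[str] = []
    merged_spans: List[Span] = []
    offset = 0
    for index, (text, spans) in enumerate(lines):
        if index > 0:
            text_parts.append(delimiter)
            offset += len(delimiter)
        # Each annotation moves right by the length of everything before its line.
        merged_spans.extend((start + offset, end + offset, name, value) for start, end, name, value in spans)
        text_parts.append(text)
        offset += len(text)
    return "".join(text_parts), merged_spans


first = ("Annual", [(0, 6, "bold", "True")])
second = ("report", [(0, 6, "italic", "True")])
joined_text, joined_spans = join_lines([first, second], delimiter=" ")

print(joined_text)   # Annual report
print(joined_spans)  # [(0, 6, 'bold', 'True'), (7, 13, 'italic', 'True')]
```

After the join, slicing the merged text with the shifted offsets (`joined_text[7:13]`) still yields the text the italic annotation originally covered.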

property line: str: Raw text of the document line

property metadata: LineMetadata: Line metadata related to the entire line, as line or page number, hierarchy level

set_line(line: str) → None[source]

set_metadata(metadata: LineMetadata) → None[source]

shift(shift_x: int, shift_y: int, image_width: int, image_height: int) → None[source]

split(sep: str) → List[LineWithMeta][source]

Split this line into a list of lines, keep annotations consistent. This method does not remove any text from the line.

Parameters:: sep – separator for splitting
Returns:: list of split lines
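Because no text is removed, concatenating the pieces should reproduce the original line. The sketch below shows this property on plain strings. It assumes the separator stays at the end of the preceding piece, which the description above does not state; `split_keep_separator` is a hypothetical helper.

```python
from typing import List


def split_keep_separator(text: str, sep: str) -> List[str]:
    """Split text after every occurrence of sep, keeping sep inside the pieces."""
    if not sep:
        raise ValueError("empty separator")
    pieces: List[str] = []
    start = 0
    while True:
        position = text.find(sep, start)
        if position == -1:
            break
        end = position + len(sep)
        pieces.append(text[start:end])  # the piece includes its trailing separator
        start = end
    if start < len(text):
        pieces.append(text[start:])     # remaining text without a separator
    return pieces


original = "first\nsecond\nthird"
pieces = split_keep_separator(original, "\n")

print(pieces)                       # ['first\n', 'second\n', 'third']
print("".join(pieces) == original)  # True
```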

to_api_schema() → LineWithMeta[source]

Convert class data into the corresponding API schema class.

Returns:: API schema class

property uid: str: Unique identifier of the line

class dedoc.data_structures.LineMetadata

Bases: Serializable

This class holds information about document node (and document line) metadata, such as page number or line level in a document hierarchy.

Variables:

tag_hierarchy_level (HierarchyLevel) – the hierarchy level of the line with its type directly extracted by some of the readers (usually information got from tags e.g. in docx or html readers)
hierarchy_level (Optional[HierarchyLevel]) – the hierarchy level of the line extracted by one of the structure extractors, i.e. the resulting type and level of the line. The lower the hierarchy level, the closer the line is to the root; this level is used to construct the document tree.
page_id (int) – page number where paragraph starts, the numeration starts from page 0
line_id (Optional[int]) – line number inside the entire document, the numeration starts from line 0

Additional variables may be added with other line metadata.

to_api_schema() → LineMetadata[source]

Convert class data into the corresponding API schema class.

Returns:: API schema class
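The role of hierarchy_level in tree construction can be illustrated with a stack-based sketch: each line is attached to the nearest preceding line with a strictly lower level. This is a simplified illustration with integer levels and dictionaries, not the algorithm used by dedoc; `build_tree` is a hypothetical name.

```python
from typing import Dict, List, Tuple


def build_tree(lines: List[Tuple[int, str]]) -> Dict:
    """Nest a flat list of (level, text) pairs; a smaller level is closer to the root."""
    root = {"level": 0, "text": "", "children": []}
    stack = [root]
    for level, text in lines:
        node = {"level": level, "text": text, "children": []}
        # Pop until the top of the stack is strictly closer to the root than the new node.
        while len(stack) > 1 and stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


flat_lines = [(1, "Chapter 1"), (2, "Section 1.1"), (2, "Section 1.2"), (1, "Chapter 2")]
tree = build_tree(flat_lines)

print([child["text"] for child in tree["children"]])
# ['Chapter 1', 'Chapter 2']
print([child["text"] for child in tree["children"][0]["children"]])
# ['Section 1.1', 'Section 1.2']
```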

class dedoc.data_structures.HierarchyLevel(level_1: int | None, level_2: int | None, can_be_multiline: bool, line_type: str)[source]

This class defines the level of the document line. The lower its value, the more important the line is.

The level of the line consists of two parts:

level_1 defines primary importance (e.g. root - level_1=0, header - level_1=1, etc.);
level_2 defines the level inside lines of equal type (e.g. for list items - “1.” - level_2=1, “1.1.” - level_2=2, etc.).

For the least important lines (line_type=raw_text) both levels are None.

See the hierarchy level description for more details.

Variables:

level_1 (Optional[int]) – value of a line’s primary importance
level_2 (Optional[int]) – level of the line inside specific class
can_be_multiline (bool) – is used to unify lines inside a tree node: if a line can be multiline, it can be joined with another line
line_type (str) – type of the line, e.g. raw text, list item, header, etc.

__eq__(other: HierarchyLevel) → bool[source]

Defines the equality of two hierarchy levels:

two lines with equal level_1 and level_2 are equal;
if either of the levels is None, its value is considered to be +inf (infinities are equal to each other)

Parameters:: other – other hierarchy level
Returns:: whether current hierarchy level == other hierarchy level

__lt__(other: HierarchyLevel) → bool[source]

Defines the comparison of hierarchy levels:

current level < other level if (level_1, level_2) < other (level_1, level_2);
if either of the levels is None, its value is considered to be +inf (infinities are equal to each other)

Parameters:: other – other hierarchy level
Returns:: whether current hierarchy level < other hierarchy level
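The comparison rules above amount to a lexicographic comparison of (level_1, level_2) with None replaced by +inf. A minimal sketch of those rules follows; `Level` is a hypothetical stand-in and the concrete level values in the example are illustrative only.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(eq=False)
class Level:
    """Hypothetical stand-in for HierarchyLevel with the ordering described above."""
    level_1: Optional[int]
    level_2: Optional[int]
    line_type: str = "raw_text"

    def _key(self) -> Tuple[float, float]:
        # None is treated as +inf, so lines without levels sort last.
        first = math.inf if self.level_1 is None else self.level_1
        second = math.inf if self.level_2 is None else self.level_2
        return first, second

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, Level):
            return NotImplemented
        return self._key() == other._key()

    def __lt__(self, other: "Level") -> bool:
        return self._key() < other._key()


root = Level(0, 0, "root")
header = Level(1, 1, "header")
list_item = Level(2, 1, "list_item")
raw_text = Level(None, None, "raw_text")

ordered = sorted([raw_text, list_item, root, header])
print([level.line_type for level in ordered])
# ['root', 'header', 'list_item', 'raw_text']
print(raw_text == Level(None, None, "unknown"))  # True: infinities compare equal
```

Note that in this sketch equality ignores line_type, following the rule that two lines with equal level_1 and level_2 are equal.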

static create_raw_text() → HierarchyLevel[source]: Create hierarchy level for a raw textual line.

static create_root() → HierarchyLevel[source]: Create hierarchy level for the document root.

static create_unknown() → HierarchyLevel[source]: Create hierarchy level for a line with unknown type.

is_list_item() → bool[source]: Check if the line is a list item.

is_raw_text() → bool[source]: Check if the line is raw text.

is_unknown() → bool[source]: Check if the type of the line is unknown (only for levels from readers).

class dedoc.data_structures.Table(cells: List[List[CellWithMeta]], metadata: TableMetadata)[source]

Bases: Serializable

This class holds information about tables in the document. We assume that a table is rectangular (has the same number of columns in each row). If some cells are merged, they are duplicated and information about the merge is stored in rowspan and colspan. Table representation is row-based, i.e. the outer list contains a list of rows.

Variables:

metadata (TableMetadata) – table metadata such as location, title and so on
cells (List[List[CellWithMeta]]) – a list of lists of table cells (a cell has text lines, colspan and rowspan attributes)

to_api_schema() → Table[source]

Convert class data into the corresponding API schema class.

Returns:: API schema class

class dedoc.data_structures.TableMetadata(page_id: int | None, uid: str | None = None, rotated_angle: float = 0.0, title: str = '')[source]

Bases: Serializable

This class holds the information about table unique identifier, rotation angle (if table has been rotated - for images) and so on.

Variables:

page_id (Optional[int]) – number of the page where table starts
uid (str) – unique identifier of the table (used for linking table to text)
rotated_angle (float) – value of the rotation angle by which the table was rotated during recognition
title (str) – table’s title

to_api_schema() → TableMetadata[source]

Convert class data into the corresponding API schema class.

Returns:: API schema class

class dedoc.data_structures.CellWithMeta(lines: List[LineWithMeta] | None, colspan: int = 1, rowspan: int = 1, invisible: bool = False)[source]

Bases: Serializable

This class holds the information about the cell: list of lines and cell properties (rowspan, colspan, invisible).

Variables:

lines (List[LineWithMeta]) – list of textual lines of the cell
colspan (int) – number of columns to span (for cells merged horizontally)
rowspan (int) – number of rows to span (for cells merged vertically)
invisible (bool) – indicator for displaying or hiding cell text - cells that are merged with others are hidden (for HTML display)

get_annotations() → List[Annotation][source]: Get merged annotations of all cell lines (start/end of annotations moved according to the merged text)

get_text() → str[source]: Get merged text of all cell lines

to_api_schema() → CellWithMeta[source]

Convert class data into the corresponding API schema class.

Returns:: API schema class
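The row-based layout with duplicated merged cells can be sketched with a small stand-in class. `Cell` is hypothetical and holds plain text instead of LineWithMeta objects; the span values given to the hidden duplicate are an assumption, since the description above only says that merged cells are duplicated and hidden.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    """Hypothetical stand-in for CellWithMeta holding plain text."""
    text: str
    colspan: int = 1
    rowspan: int = 1
    invisible: bool = False


def is_rectangular(rows: List[List[Cell]]) -> bool:
    """Every row must contain the same number of cells."""
    return len({len(row) for row in rows}) <= 1


def visible_text(rows: List[List[Cell]]) -> List[List[str]]:
    """Text of the cells that would be displayed, skipping hidden duplicates."""
    return [[cell.text for cell in row if not cell.invisible] for row in rows]


# "Name" spans two columns: the cell is duplicated and the copy is hidden.
rows = [
    [Cell("Name", colspan=2), Cell("Name", invisible=True), Cell("Age")],
    [Cell("First"), Cell("Last"), Cell("30")],
]

print(is_rectangular(rows))  # True
print(visible_text(rows))    # [['Name', 'Age'], ['First', 'Last', '30']]
```

Duplicating merged cells keeps every row the same length, so a cell can always be addressed as `rows[i][j]` regardless of merges.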

Helper classes

class dedoc.data_structures.Serializable[source]

Base class for the API schema objects which we later need to convert to dict.

abstract to_api_schema() → BaseModel[source]

Convert class data into the corresponding API schema class.

Returns:: API schema class

class dedocutils.data_structures.BBox(x_top_left: int, y_top_left: int, width: int, height: int)[source]

Bounding box around some page object, the coordinate system starts from top left corner.

x_top_left

y_top_left

x_bottom_right

y_bottom_right

width

height

__init__(x_top_left: int, y_top_left: int, width: int, height: int) → None[source]

The following parameters should be given in pixels.

Parameters:

x_top_left – x coordinate of the bbox top left corner
y_top_left – y coordinate of the bbox top left corner
width – bounding box width
height – bounding box height

static from_two_points(top_left: Tuple[int, int], bottom_right: Tuple[int, int]) → BBox[source]

Make the bounding box from two points.

Parameters:

top_left – (x, y) point of the bbox top left corner
bottom_right – (x, y) point of the bbox bottom right corner

have_intersection_with_box(box: BBox, threshold: float = 0.3) → bool[source]

Check if the current bounding box has the intersection with another one.

Parameters:

box – another bounding box to check intersection with
threshold – the lowest value of the intersection over union used to get a boolean result

property square: int: Area of the bbox.
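The geometry behind these members can be sketched with a stand-in class. `Box` is hypothetical: it follows the description above (top left origin, threshold on intersection over union), and the exact formula and comparison used by the library may differ.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for BBox; coordinates are in pixels, y grows downwards."""
    x_top_left: int
    y_top_left: int
    width: int
    height: int

    @property
    def x_bottom_right(self) -> int:
        return self.x_top_left + self.width

    @property
    def y_bottom_right(self) -> int:
        return self.y_top_left + self.height

    @property
    def area(self) -> int:
        return self.width * self.height

    @staticmethod
    def from_two_points(top_left: Tuple[int, int], bottom_right: Tuple[int, int]) -> "Box":
        x1, y1 = top_left
        x2, y2 = bottom_right
        return Box(x1, y1, x2 - x1, y2 - y1)

    def iou(self, other: "Box") -> float:
        # Overlap along each axis, clamped to zero when the boxes are disjoint.
        inter_w = max(0, min(self.x_bottom_right, other.x_bottom_right) - max(self.x_top_left, other.x_top_left))
        inter_h = max(0, min(self.y_bottom_right, other.y_bottom_right) - max(self.y_top_left, other.y_top_left))
        intersection = inter_w * inter_h
        union = self.area + other.area - intersection
        return intersection / union if union > 0 else 0.0

    def intersects(self, other: "Box", threshold: float = 0.3) -> bool:
        return self.iou(other) > threshold


first = Box.from_two_points((0, 0), (10, 10))
second = Box(5, 0, 10, 10)   # overlaps the right half of the first box
third = Box(9, 9, 10, 10)    # touches only a one pixel corner

print(first.intersects(second))  # True: IoU = 50 / 150
print(first.intersects(third))   # False: IoU = 1 / 199
```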

class dedoc.data_structures.AttachedFile(original_name: str, tmp_file_path: str, need_content_analysis: bool, uid: str)[source]

Holds information about files, attached to the parsed document.

Variables:

original_name (str) – original name of the attached file if it was possible to extract it
tmp_file_path (str) – path to the attached file on disk - its name is different from original_name
need_content_analysis (bool) – does the attached file need parsing (enable recursive parsing in DedocManager)
uid (str) – unique identifier of the attached file

class dedoc.readers.pdf_reader.data_classes.tables.scantable.ScanTable(page_number: int, cells: List[List[CellWithMeta]], bbox: BBox, order: int = -1, page_width: int | None = None, page_height: int | None = None)[source]

Bases: Table

Utility class for storing recognized tables from document images. The class TableRecognizer works with this class.

Annotations of the text lines

class dedoc.data_structures.Annotation(start: int, end: int, name: str, value: str, is_mergeable: bool = True)[source]

Bases: Serializable

Base class for text annotations of all kinds. An annotation is a piece of information about the text line: its appearance or links to another document object. See the concrete kinds of annotations for more examples.

Variables:

start (int) – start of the annotated text
end (int) – end of the annotated text (end isn’t included)
name (str) – annotation’s name, specific for each type of annotation
value (str) – information about annotated text, depends on the type of annotation, e.g. “True”/“False”, “10.0”, etc.
is_mergeable (bool) – is it possible to merge annotations with the same value
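Since end isn’t included, start and end behave like Python slice bounds. The sketch below shows this on a plain string and adds one possible reading of merging: touching or overlapping annotations with the same name and value are combined. `Span` and `merge_spans` are hypothetical, and the merge rule is an assumption rather than the documented behaviour of is_mergeable.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    """Hypothetical stand-in for Annotation; end is exclusive."""
    start: int
    end: int
    name: str
    value: str


def annotated_text(text: str, span: Span) -> str:
    """Return the part of the text covered by the annotation."""
    return text[span.start:span.end]


def merge_spans(spans: List[Span]) -> List[Span]:
    """Merge touching or overlapping spans that share the same name and value."""
    merged: List[Span] = []
    for span in sorted(spans, key=lambda s: (s.name, s.value, s.start)):
        last = merged[-1] if merged else None
        if last and (last.name, last.value) == (span.name, span.value) and span.start <= last.end:
            last.end = max(last.end, span.end)
        else:
            # Copy the span so the input list is left unchanged.
            merged.append(Span(span.start, span.end, span.name, span.value))
    return merged


text = "Hello, world"
spans = [Span(0, 5, "bold", "True"), Span(5, 12, "bold", "True"), Span(7, 12, "size", "10.0")]

print(annotated_text(text, spans[0]))  # Hello
print(annotated_text(text, spans[2]))  # world
print(merge_spans(spans))
# [Span(start=0, end=12, name='bold', value='True'), Span(start=7, end=12, name='size', value='10.0')]
```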

Concrete annotations

class dedoc.data_structures.AttachAnnotation(attach_uid: str, start: int, end: int)[source]