dedoc.data_structures

Main classes defining a document

class dedoc.data_structures.UnstructuredDocument(tables: List[Table], lines: List[LineWithMeta], attachments: List[AttachedFile], warnings: List[str] | None = None, metadata: dict | None = None)[source]

This class holds the raw document content (its text, tables and attachments) obtained by one of the readers. Text is represented as a flat list of lines; the hierarchy level of each line isn’t defined yet (only the tag hierarchy level may exist).

__init__(tables: List[Table], lines: List[LineWithMeta], attachments: List[AttachedFile], warnings: List[str] | None = None, metadata: dict | None = None) → None[source]

Parameters:

tables – list of document tables
lines – list of raw document lines
attachments – list of document attachments
warnings – list of warnings obtained while parsing the document
metadata – additional data

class dedoc.data_structures.ParsedDocument(metadata: DocumentMetadata, content: DocumentContent | None, warnings: List[str] | None = None, attachments: List[ParsedDocument] | None = None)[source]

Bases: Serializable

This class holds information about the document content, metadata and attachments.

__init__(metadata: DocumentMetadata, content: DocumentContent | None, warnings: List[str] | None = None, attachments: List[ParsedDocument] | None = None) → None[source]

Parameters:

metadata – document metadata such as size, creation date and so on.
content – text and tables
attachments – result of analysis of attached files
warnings – list of warnings and possible errors that arose during document parsing

to_api_schema() → ParsedDocument[source]

Convert class data into the corresponding API schema class.

Returns:: API schema class

class dedoc.data_structures.DocumentContent(tables: List[Table], structure: TreeNode, warnings: List[str] | None = None)[source]

Bases: Serializable

This class holds the document content - structured text and tables.

__init__(tables: List[Table], structure: TreeNode, warnings: List[str] | None = None) → None[source]

Parameters:

tables – list of document tables
structure – tree structure in which content of the document is organized
warnings – list of warnings obtained while constructing the document structure

to_api_schema() → DocumentContent[source]

Convert class data into the corresponding API schema class.

Returns:: API schema class

class dedoc.data_structures.DocumentMetadata(file_name: str, temporary_file_name: str, size: int, modified_time: int, created_time: int, access_time: int, file_type: str, other_fields: dict | None = None, uid: str | None = None)[source]

Bases: Serializable

This class holds information about document metadata.

__init__(file_name: str, temporary_file_name: str, size: int, modified_time: int, created_time: int, access_time: int, file_type: str, other_fields: dict | None = None, uid: str | None = None) → None[source]

Parameters:

uid – document unique identifier (useful for attached files)
file_name – original document name (before renaming and conversion, so it can contain non-ASCII symbols, spaces and so on)
temporary_file_name – file name during parsing (unique name after renaming and conversion)
size – size of the original file in bytes
modified_time – time of the last modification in unix time format (seconds since the epoch)
created_time – time of the creation in unix time format
access_time – time of the last access to the file in unix time format
file_type – mime type of the file
other_fields – additional fields of user metadata
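
The time fields above are plain integers. A minimal sketch of turning such a value into a readable timestamp with the standard library; the value of `modified_time` is made up for illustration and does not come from a real document.

```python
from datetime import datetime, timezone

# Hypothetical value: seconds since the epoch, as stored in the metadata
modified_time = 1700000000

# Interpret the value as UTC to avoid depending on the local time zone
as_datetime = datetime.fromtimestamp(modified_time, tz=timezone.utc)
print(as_datetime.isoformat())
```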

extend_other_fields(new_fields: dict) → None[source]

Add new attributes to the class and to the other_fields dictionary.

Parameters:: new_fields – fields to add
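
One way to picture "add new attributes to the class and to the other_fields dictionary" is the sketch below. `MetadataSketch` is a hypothetical stand-in written for this illustration; it is not dedoc's implementation and only mirrors the described behavior.

```python
from typing import Optional


class MetadataSketch:
    """Hypothetical stand-in for a metadata holder; not dedoc's actual class."""

    def __init__(self, file_name: str, other_fields: Optional[dict] = None) -> None:
        self.file_name = file_name
        self.other_fields = {}
        if other_fields:
            self.extend_other_fields(other_fields)

    def extend_other_fields(self, new_fields: dict) -> None:
        # Each new field becomes an attribute and is also recorded in other_fields
        for key, value in new_fields.items():
            setattr(self, key, value)
            self.other_fields[key] = value


meta = MetadataSketch("report.docx")
meta.extend_other_fields({"author": "unknown", "pages": 3})
print(meta.author, meta.other_fields)
```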

to_api_schema() → DocumentMetadata[source]

Convert class data into the corresponding API schema class.

Returns:: API schema class

class dedoc.data_structures.TreeNode(node_id: str, text: str, annotations: List[Annotation], metadata: LineMetadata, subparagraphs: List[TreeNode], parent: TreeNode | None)[source]

Bases: Serializable

TreeNode represents a document as a recursive tree structure. Each node has a parent node (None for the root of the tree) and a list of child nodes (an empty list for a leaf node).

__init__(node_id: str, text: str, annotations: List[Annotation], metadata: LineMetadata, subparagraphs: List[TreeNode], parent: TreeNode | None) → None[source]

Parameters:

node_id – node id, unique within one document
text – text of the node
annotations – metadata related to a part of the text (such as font size)
metadata – metadata related to the entire node (such as node type)
subparagraphs – list of children of this node
parent – parent node (None for the root, not None for other nodes)

add_child(line: LineWithMeta) → TreeNode[source]

Create a new tree node from the given line and add it as a child of this node. Return the newly created node.

Parameters:: line – line with metadata; the new node is built from this line
Returns:: the created node (a child of self)

add_text(line: LineWithMeta) → None[source]

Add the text and annotations from the given line; the text is separated with a new line symbol.

Parameters:: line – line with text to add

static create(lines: List[LineWithMeta] | None = None) → TreeNode[source]

Create a root node with the given text.

Parameters:: lines – these lines should form the title of the document (or should be empty for documents without a title)
Returns:: root of the document tree

get_root() → TreeNode[source]

Returns:: root of the tree

to_api_schema() → TreeNode[source]

Convert class data into the corresponding API schema class.

Returns:: API schema class
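
The parent/children relationship described for TreeNode can be sketched with a small stand-in class. `NodeSketch` and its id scheme (parent id plus child position) are assumptions made for this illustration, not dedoc's code; the sketch builds nodes from plain strings rather than LineWithMeta objects.

```python
from typing import List, Optional


class NodeSketch:
    """Hypothetical stand-in for a tree node; not dedoc's TreeNode."""

    def __init__(self, node_id: str, text: str, parent: Optional["NodeSketch"] = None) -> None:
        self.node_id = node_id
        self.text = text
        self.parent = parent
        self.subparagraphs: List["NodeSketch"] = []

    def add_child(self, text: str) -> "NodeSketch":
        # Assumed id scheme: parent id followed by the child's position
        child = NodeSketch(f"{self.node_id}.{len(self.subparagraphs)}", text, parent=self)
        self.subparagraphs.append(child)
        return child

    def get_root(self) -> "NodeSketch":
        # Follow the parent links until a node without a parent is reached
        node = self
        while node.parent is not None:
            node = node.parent
        return node


root = NodeSketch(node_id="0", text="Title")
chapter = root.add_child("Chapter 1")
paragraph = chapter.add_child("Some text")
print(paragraph.node_id, paragraph.get_root().text)
```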

class dedoc.data_structures.LineWithMeta(line: str, metadata: LineMetadata | None = None, annotations: List[Annotation] | None = None, uid: str | None = None)[source]

Bases: Sized, Serializable

Structural unit of document - line (or paragraph) of text and its metadata. One LineWithMeta should not contain text from different logical parts of the document (for example, document title and raw text of the document should not be in the same line). Still the logical part of the document may be represented by more than one line (for example, document title may consist of many lines).

__len__() → int[source]

__getitem__(index: slice | int) → LineWithMeta[source]

__add__(other: LineWithMeta | str) → LineWithMeta[source]

__init__(line: str, metadata: LineMetadata | None = None, annotations: List[Annotation] | None = None, uid: str | None = None) → None[source]

Parameters:

line – raw text of the document line
metadata – metadata related to the entire line, such as line or page number and its hierarchy level
annotations – metadata that refers to some part of the text, for example, font size, font type, etc.
uid – unique identifier of the line

__lt__(other: LineWithMeta) → bool[source]: Return self<value.

property annotations: List[Annotation]

static join(lines: List[LineWithMeta], delimiter: str = '\n') → LineWithMeta[source]

Join a list of lines with the given delimiter, keeping annotations consistent. This method is similar to the Python built-in join method for strings.

Parameters:

lines – list of lines to join
delimiter – delimiter to insert between lines

Returns:

merged line

property line: str

property metadata: LineMetadata

set_line(line: str) → None[source]

split(sep: str) → List[LineWithMeta][source]

Split this line into a list of lines, keep annotations consistent. This method does not remove any text from the line.

Parameters:: sep – separator for splitting
Returns:: list of split lines
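
A rough sketch of splitting and joining while "keeping annotations consistent". It works on plain strings and `(start, end, name, value)` tuples instead of LineWithMeta objects, and it assumes the separator stays attached to the preceding piece, which is one way to satisfy "does not remove any text". The helper names are hypothetical.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int, str, str]  # (start, end, name, value); end is exclusive
Piece = Tuple[str, List[Span]]


def split_keep_text(text: str, spans: List[Span], sep: str) -> List[Piece]:
    """Cut the text after every occurrence of sep; no character is dropped."""
    if not sep:
        raise ValueError("empty separator")
    borders = sorted({0, len(text)} | {m.end() for m in re.finditer(re.escape(sep), text)})
    pieces = []
    for left, right in zip(borders, borders[1:]):
        # Clip each span to the piece and shift it to piece-local offsets
        local = [(max(s, left) - left, min(e, right) - left, name, value)
                 for s, e, name, value in spans if max(s, left) < min(e, right)]
        pieces.append((text[left:right], local))
    return pieces


def join_pieces(pieces: List[Piece], delimiter: str = "\n") -> Piece:
    """Concatenate pieces, shifting spans by the length of the text before them."""
    text, spans = "", []
    for index, (piece, local) in enumerate(pieces):
        if index > 0:
            text += delimiter
        spans.extend((s + len(text), e + len(text), name, value) for s, e, name, value in local)
        text += piece
    return text, spans


parts = split_keep_text("bold text\nplain", [(0, 4, "bold", "True")], "\n")
merged_text, merged_spans = join_pieces(parts, delimiter="")
print(parts)
print(repr(merged_text), merged_spans)
```

Because the pieces still contain their separators, joining them with an empty delimiter restores the original text and span offsets.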

to_api_schema() → LineWithMeta[source]

Convert class data into the corresponding API schema class.

Returns:: API schema class

property uid: str

class dedoc.data_structures.LineMetadata(page_id: int, line_id: int | None, tag_hierarchy_level: HierarchyLevel | None = None, hierarchy_level: HierarchyLevel | None = None, other_fields: dict | None = None)[source]

Bases: Serializable

This class holds information about document node (and document line) metadata, such as page number or line level in a document hierarchy.

__init__(page_id: int, line_id: int | None, tag_hierarchy_level: HierarchyLevel | None = None, hierarchy_level: HierarchyLevel | None = None, other_fields: dict | None = None) → None[source]

Parameters:

page_id – page number where paragraph starts, the numeration starts from page 0
line_id – line number inside the entire document, the numeration starts from line 0
tag_hierarchy_level – the hierarchy level of the line with its type, extracted directly by one of the readers (usually obtained from tags, e.g. in docx or html readers)
hierarchy_level – the hierarchy level of the line extracted by one of the structure extractors: the resulting type and level of the line. The lower the level, the closer the line is to the root; this level is used to construct the document tree.
other_fields – additional fields of user metadata

extend_other_fields(new_fields: dict) → None[source]

Add new attributes to the class and to the other_fields dictionary.

Parameters:: new_fields – fields to add

to_api_schema() → LineMetadata[source]

Convert class data into the corresponding API schema class.

Returns:: API schema class

class dedoc.data_structures.HierarchyLevel(level_1: int | None, level_2: int | None, can_be_multiline: bool, line_type: str)[source]

This class defines the level of the document line. The lower its value, the more important the line is.

The level of the line consists of two parts:

level_1 defines primary importance (e.g. root - level_1=0, header - level_1=1, etc.);
level_2 defines the level inside lines of equal type (e.g. for list items - “1.” - level_2=1, “1.1.” - level_2=2, etc.).

For the least important lines like raw_text both levels are None.

__eq__(other: HierarchyLevel) → bool[source]

Defines the equality of two hierarchy levels:

two raw text lines or lines with unknown type are equal;
two lines with equal level_1, level_2 are equal.

__init__(level_1: int | None, level_2: int | None, can_be_multiline: bool, line_type: str) → None[source]

Parameters:

level_1 – value of a line’s primary importance
level_2 – level of the line inside specific class
can_be_multiline – is used to unify lines inside tree node, if line can be multiline, it can be joined with another line
line_type – type of the line, e.g. raw text, list item, header, etc.

__lt__(other: HierarchyLevel) → bool[source]

Defines the comparison of hierarchy levels:

line1 < line2 if (level_1, level_2) of line1 < (level_1, level_2) of line2;
line1 < line2 if line2 is raw text or unknown, and line1 has another type.

Else line1 >= line2.

Parameters:: other – hierarchy level of the line2
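
The equality and ordering rules can be illustrated with a stand-in class. `LevelSketch` is hypothetical and simplified: it assumes a strict tuple comparison for lines with defined levels and treats every line whose levels are None (raw text, unknown) as least important. It is a sketch of the rules as described, not dedoc's HierarchyLevel.

```python
from typing import Optional


class LevelSketch:
    """Hypothetical stand-in illustrating the ordering rules; not dedoc's HierarchyLevel."""

    def __init__(self, level_1: Optional[int], level_2: Optional[int], line_type: str) -> None:
        self.level_1 = level_1
        self.level_2 = level_2
        self.line_type = line_type

    def _is_defined(self) -> bool:
        return self.level_1 is not None and self.level_2 is not None

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, LevelSketch):
            return NotImplemented
        if self._is_defined() and other._is_defined():
            return (self.level_1, self.level_2) == (other.level_1, other.level_2)
        # Lines without defined levels (raw text, unknown) are equal to each other
        return not self._is_defined() and not other._is_defined()

    def __lt__(self, other: "LevelSketch") -> bool:
        if self._is_defined() and other._is_defined():
            return (self.level_1, self.level_2) < (other.level_1, other.level_2)
        # A line with defined levels precedes raw text and unknown lines
        return self._is_defined() and not other._is_defined()


root = LevelSketch(0, 0, "root")
header = LevelSketch(1, 1, "header")
raw = LevelSketch(None, None, "raw_text")
unknown = LevelSketch(None, None, "unknown")
print(root < header, header < raw, raw < header, raw == unknown)
```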

static create_raw_text() → HierarchyLevel[source]: Create hierarchy level for a raw textual line.

static create_root() → HierarchyLevel[source]: Create hierarchy level for the document root.

static create_unknown() → HierarchyLevel[source]: Create hierarchy level for a line with unknown type.

is_list_item() → bool[source]: Check if the line is a list item.

is_raw_text() → bool[source]: Check if the line is raw text.

is_unknown() → bool[source]: Check if the type of the line is unknown (only for levels from readers).

class dedoc.data_structures.Table(cells: List[List[CellWithMeta]], metadata: TableMetadata)[source]

Bases: Serializable

This class holds information about tables in the document. We assume that a table has a rectangular form (the same number of columns in each row). Table representation is row-based, i.e. the outer list contains the rows.

__init__(cells: List[List[CellWithMeta]], metadata: TableMetadata) → None[source]

Parameters:

cells – a list of lists of cells (cell has text, colspan and rowspan attributes)
metadata – table metadata such as location, size and so on

to_api_schema() → Table[source]

Convert class data into the corresponding API schema class.

Returns:: API schema class

class dedoc.data_structures.TableMetadata(page_id: int | None, uid: str | None = None, rotated_angle: float = 0.0)[source]

Bases: Serializable

This class holds the information about table unique identifier, rotation angle (if table has been rotated - for images) and so on.

__init__(page_id: int | None, uid: str | None = None, rotated_angle: float = 0.0) → None[source]

Parameters:

page_id – number of the page where table starts
uid – unique identifier of the table
rotated_angle – value of the rotation angle by which the table was rotated during recognition

to_api_schema() → TableMetadata[source]

Convert class data into the corresponding API schema class.

Returns:: API schema class

class dedoc.data_structures.CellWithMeta(lines: List[LineWithMeta], colspan: int = 1, rowspan: int = 1, invisible: bool = False)[source]

Bases: Serializable

This class holds the information about the cell: list of lines and cell properties (rowspan, colspan, invisible).

__init__(lines: List[LineWithMeta], colspan: int = 1, rowspan: int = 1, invisible: bool = False) → None[source]

Parameters:

lines – textual lines of the cell
colspan – number of columns to span like in HTML format
rowspan – number of rows to span like in HTML format
invisible – indicator for displaying or hiding cell text
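
A small sketch of the row-based layout with spanning cells. `CellSketch` is a hypothetical stand-in that stores plain text instead of a list of lines, and the convention of keeping a covered position as an invisible cell (so that every row has the same length) is an assumption made for this illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CellSketch:
    """Hypothetical stand-in for a table cell; not dedoc's CellWithMeta."""
    text: str
    colspan: int = 1
    rowspan: int = 1
    invisible: bool = False


# A 2x2 table whose first row is a single header cell spanning two columns.
# Assumption: the covered position is kept as an invisible cell.
rows: List[List[CellSketch]] = [
    [CellSketch("Header", colspan=2), CellSketch("Header", invisible=True)],
    [CellSketch("a"), CellSketch("b")],
]

# The table is rectangular if all rows have the same number of cells
is_rectangular = len({len(row) for row in rows}) == 1
visible_text = [[cell.text for cell in row if not cell.invisible] for row in rows]
print(is_rectangular, visible_text)
```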

get_annotations() → List[Annotation][source]: Get merged annotations of all cell lines (start/end of annotations moved according to the merged text)

get_text() → str[source]: Get merged text of all cell lines

to_api_schema() → CellWithMeta[source]

Convert class data into the corresponding API schema class.

Returns:: API schema class

Helper classes

class dedoc.data_structures.Serializable[source]

Base class for the API schema objects which we later need to convert to dict.

abstract to_api_schema() → BaseModel[source]

Convert class data into the corresponding API schema class.

Returns:: API schema class

class dedocutils.data_structures.BBox(x_top_left: int, y_top_left: int, width: int, height: int)[source]

Bounding box around some page object, the coordinate system starts from top left corner.

x_top_left

y_top_left

x_bottom_right

y_bottom_right

width

height

__init__(x_top_left: int, y_top_left: int, width: int, height: int) → None[source]

The following parameters are measured in pixels.

Parameters:

x_top_left – x coordinate of the bbox top left corner
y_top_left – y coordinate of the bbox top left corner
width – bounding box width
height – bounding box height

static from_two_points(top_left: Tuple[int, int], bottom_right: Tuple[int, int]) → BBox[source]

Make the bounding box from two points.

Parameters:

top_left – (x, y) point of the bbox top left corner
bottom_right – (x, y) point of the bbox bottom right corner

have_intersection_with_box(box: BBox, threshold: float = 0.3) → bool[source]

Check if the current bounding box has the intersection with another one.

Parameters:

box – another bounding box to check intersection with
threshold – the lowest value of the intersection over union used to get the boolean result

property square: int: Square of the bbox.
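
The geometry behind these members can be sketched as follows. `BoxSketch` is a hypothetical stand-in, not the dedocutils class; in particular, the strict `>` comparison against the threshold and the separate `iou` helper are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BoxSketch:
    """Hypothetical stand-in for a pixel bounding box; not the dedocutils BBox."""
    x_top_left: int
    y_top_left: int
    width: int
    height: int

    @property
    def x_bottom_right(self) -> int:
        return self.x_top_left + self.width

    @property
    def y_bottom_right(self) -> int:
        return self.y_top_left + self.height

    @property
    def square(self) -> int:
        # Area of the box in square pixels
        return self.width * self.height

    @staticmethod
    def from_two_points(top_left: Tuple[int, int], bottom_right: Tuple[int, int]) -> "BoxSketch":
        (x1, y1), (x2, y2) = top_left, bottom_right
        return BoxSketch(x1, y1, x2 - x1, y2 - y1)

    def iou(self, box: "BoxSketch") -> float:
        # Overlap width and height are clamped at zero for disjoint boxes
        overlap_w = max(0, min(self.x_bottom_right, box.x_bottom_right) - max(self.x_top_left, box.x_top_left))
        overlap_h = max(0, min(self.y_bottom_right, box.y_bottom_right) - max(self.y_top_left, box.y_top_left))
        intersection = overlap_w * overlap_h
        union = self.square + box.square - intersection
        return intersection / union if union > 0 else 0.0

    def have_intersection_with_box(self, box: "BoxSketch", threshold: float = 0.3) -> bool:
        return self.iou(box) > threshold


a = BoxSketch.from_two_points((0, 0), (10, 10))
b = BoxSketch(x_top_left=5, y_top_left=0, width=10, height=10)
print(a.square, a.iou(b), a.have_intersection_with_box(b))
```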

class dedoc.data_structures.AttachedFile(original_name: str, tmp_file_path: str, need_content_analysis: bool, uid: str)[source]

Holds information about files, attached to the parsed document.

__init__(original_name: str, tmp_file_path: str, need_content_analysis: bool, uid: str) → None[source]

Parameters:

original_name – Name of the file from which the attachments are extracted
tmp_file_path – path to the attachment file
need_content_analysis – indicates whether to parse the attachment’s content or simply save it without parsing
uid – unique identifier of the attachment

Annotations of the text lines

class dedoc.data_structures.Annotation(start: int, end: int, name: str, value: str, is_mergeable: bool = True)[source]

Bases: Serializable

Base class for text annotations of all kinds. An annotation is a piece of information about the text line: its appearance or links to another document object. See the concrete kinds of annotations for more examples.

__init__(start: int, end: int, name: str, value: str, is_mergeable: bool = True) → None[source]

Some kind of information about the symbols between start and end. For example, Annotation(1, 13, "italic", "True") says that the text between the 1st and 13th symbols was written in italic.

Parameters:

start – start of the annotated text
end – end of the annotated text (end isn’t included)
name – annotation’s name
value – information about annotated text
is_mergeable – is it possible to merge annotations with the same value
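
Since end isn’t included, the annotated text corresponds to an ordinary half-open slice. The sketch below shows that, along with one possible way to merge touching or overlapping annotations that share a name and value. The tuples and the `merge_spans` helper are hypothetical and are not dedoc's merging code.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Span = Tuple[int, int, str, str]  # (start, end, name, value); end is exclusive

text = "An italic phrase here"
italic: Span = (3, 16, "italic", "True")  # hypothetical annotation
annotated = text[italic[0]:italic[1]]  # half-open slice, like Python slicing


def merge_spans(spans: List[Span]) -> List[Span]:
    """Merge touching or overlapping spans that share both name and value."""
    groups: Dict[Tuple[str, str], List[Tuple[int, int]]] = defaultdict(list)
    for start, end, name, value in spans:
        groups[(name, value)].append((start, end))
    merged: List[Span] = []
    for (name, value), intervals in groups.items():
        intervals.sort()
        current_start, current_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= current_end:
                # The next interval touches or overlaps the current one
                current_end = max(current_end, end)
            else:
                merged.append((current_start, current_end, name, value))
                current_start, current_end = start, end
        merged.append((current_start, current_end, name, value))
    return sorted(merged)


spans = [(0, 4, "bold", "True"), (4, 9, "bold", "True"), (12, 15, "bold", "True"), (0, 9, "italic", "True")]
merged = merge_spans(spans)
print(repr(annotated), merged)
```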

Concrete annotations

class dedoc.data_structures.AttachAnnotation(attach_uid: str, start: int, end: int)[source]

Bases: Annotation

This annotation indicates the place of the attachment in the original document (for example, the place where an image was placed in the docx document). The line containing this annotation is placed directly before the referred attachment.

name = 'attachment'

__init__(attach_uid: str, start: int, end: int) → None[source]

Parameters:

attach_uid – unique identifier of the attachment which is referenced inside this annotation
start – start of the annotated text (usually zero)
end – end of the annotated text (usually end of the line)

class dedoc.data_structures.TableAnnotation(name: str, start: int, end: int)[source]

Bases: Annotation

This annotation indicates the place of the table in the original document. The line containing this annotation is placed directly before the referred table.

name = 'table'

__init__(name: str, start: int, end: int) → None[source]

Parameters:

name – unique identifier of the table which is referenced inside this annotation
start – start of the annotated text (usually zero)
end – end of the annotated text (usually end of the line)

class dedoc.data_structures.LinkedTextAnnotation(start: int, end: int, value: str)[source]

Bases: Annotation

This annotation is used when some text is linked to the line or its part. For example, a line can contain a number that refers to a footnote; the text of this footnote will be the value of this annotation.

name = 'linked_text'

__init__(start: int, end: int, value: str) → None[source]

Parameters:

start – start of the annotated text
end – end of the annotated text (not included)
value – text, linked to given one, for example text of the footnote

class dedoc.data_structures.BBoxAnnotation(start: int, end: int, value: BBox, page_width: int, page_height: int)[source]

Bases: Annotation

Coordinates of the line’s bounding box (in relative coordinates) - for pdf documents.

name = 'bounding box'

__init__(start: int, end: int, value: BBox, page_width: int, page_height: int) → None[source]

Parameters:

start – start of the annotated text (usually zero)
end – end of the annotated text (usually end of the line)
value – bounding box where line is located
page_width – width of original image with this bbox
page_height – height of original image with this bbox
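
One plausible reading of "relative coordinates" is that the pixel values are divided by the page size. The sketch below shows that conversion only; the `to_relative` helper and the dictionary keys are assumptions, and the actual format in which dedoc stores the annotation value is not described here.

```python
from typing import Dict


def to_relative(x_top_left: int, y_top_left: int, width: int, height: int,
                page_width: int, page_height: int) -> Dict[str, float]:
    """Scale a pixel bounding box by the page size (values lie in [0, 1] for a box inside the page)."""
    return {
        "x_top_left": x_top_left / page_width,
        "y_top_left": y_top_left / page_height,
        "width": width / page_width,
        "height": height / page_height,
    }


# Hypothetical box of 300x50 pixels on a 1000x2000 pixel page
relative = to_relative(100, 200, 300, 50, page_width=1000, page_height=2000)
print(relative)
```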

class dedoc.data_structures.AlignmentAnnotation(start: int, end: int, value: str)[source]

Bases: Annotation

This annotation defines the alignment of the entire line in the document: left, right, both sides of the page (justified) or center.

name = 'alignment'

__init__(start: int, end: int, value: str) → None[source]

Parameters:

start – start of the annotated text (usually zero)
end – end of the annotated text (usually end of the line)
value – kind of the line alignment: left, right, both or center

class dedoc.data_structures.IndentationAnnotation(start: int, end: int, value: str)[source]

Bases: Annotation

This annotation contains the indentation of the entire line in twentieths of a point (1/1440 of an inch). These units of measurement are taken from the standard Office Open XML File Formats.

name = 'indentation'

__init__(start: int, end: int, value: str) → None[source]

Parameters:

start – start of the annotated text (usually zero)
end – end of the annotated text (usually end of the line)
value – text indentation in twentieths of a point (1/1440 of an inch) as it’s defined in DOCX format
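
The unit arithmetic is simple: a point is 1/72 of an inch, so a twentieth of a point is 1/1440 of an inch. A minimal conversion sketch, assuming the annotation value is a numeric string; the helper names and the sample value are hypothetical.

```python
TWENTIETHS_PER_POINT = 20
POINTS_PER_INCH = 72


def to_points(value: str) -> float:
    # The annotation value is a string, so it is parsed first
    return float(value) / TWENTIETHS_PER_POINT


def to_inches(value: str) -> float:
    return float(value) / (TWENTIETHS_PER_POINT * POINTS_PER_INCH)


indentation = "720"  # hypothetical annotation value
print(to_points(indentation), to_inches(indentation))
```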

class dedoc.data_structures.SpacingAnnotation(start: int, end: int, value: str)[source]

Bases: Annotation

This annotation contains the spacing between the current line and the previous one. It’s measured in twentieths of a point or hundredths of a line according to the standard Office Open XML File Formats.

name = 'spacing'

__init__(start: int, end: int, value: str) → None[source]

Parameters:

start – start of the annotated text (usually zero)
end – end of the annotated text (usually end of the line)
value – spacing between the current line and the previous one as it’s defined in DOCX format (integer value)

class dedoc.data_structures.BoldAnnotation(start: int, end: int, value: str)[source]

Bases: Annotation

Boldness of some text inside the line.

name = 'bold'

__init__(start: int, end: int, value: str) → None[source]

Parameters:

start – start of the bold text
end – end of the bold text (not included)
value – True if bold else False (False is rarely used because the annotation can simply be omitted)

class dedoc.data_structures.ItalicAnnotation(start: int, end: int, value: str)[source]

Bases: Annotation

Text written in italic inside the line.

name = 'italic'

__init__(start: int, end: int, value: str) → None[source]

Parameters:

start – start of the italic text
end – end of the italic text (not included)
value – True if italic else False (False is rarely used because the annotation can simply be omitted)

class dedoc.data_structures.UnderlinedAnnotation(start: int, end: int, value: str)[source]

Bases: Annotation

Underlined text inside the line.

name = 'underlined'

__init__(start: int, end: int, value: str) → None[source]

Parameters:

start – start of the underlined text
end – end of the underlined text (not included)
value – True if underlined else False (False is rarely used because the annotation can simply be omitted)

class dedoc.data_structures.StrikeAnnotation(start: int, end: int, value: str)[source]

Bases: Annotation

Strikethrough of some text inside the line.

name = 'strike'

__init__(start: int, end: int, value: str) → None[source]

Parameters:

start – start of the strikethrough text
end – end of the strikethrough text (not included)
value – True if strikethrough else False (False is rarely used because the annotation can simply be omitted)

class dedoc.data_structures.SubscriptAnnotation(start: int, end: int, value: str)[source]

Bases: Annotation

Subscript text inside the line.

name = 'subscript'

__init__(start: int, end: int, value: str) → None[source]

Parameters:

start – start of the subscript text
end – end of the subscript text (not included)
value – True if subscript else False (False is rarely used because the annotation can simply be omitted)

class dedoc.data_structures.SuperscriptAnnotation(start: int, end: int, value: str)[source]

Bases: Annotation

Superscript text inside the line.

name = 'superscript'

__init__(start: int, end: int, value: str) → None[source]

Parameters:

start – start of the superscript text
end – end of the superscript text (not included)
value – True if superscript else False (False is rarely used because the annotation can simply be omitted)

class dedoc.data_structures.ColorAnnotation(start: int, end: int, red: float, green: float, blue: float)[source]

Bases: Annotation

Color of some text inside the line in the RGB format.

name = 'color_annotation'

__init__(start: int, end: int, red: float, green: float, blue: float) → None[source]

Parameters:

start – start of the colored text
end – end of the colored text (not included)
red – mean value of the red color component in the pixels that are not white in the given bounding box
green – mean value of the green color component in the pixels that are not white in the given bounding box
blue – mean value of the blue color component in the pixels that are not white in the given bounding box
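
The "mean value over pixels that are not white" can be sketched as below. The function name, the whiteness threshold of 250 and the fallback for an all-white region are assumptions of this illustration; how dedoc actually decides that a pixel is white is not stated here.

```python
from typing import Iterable, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each in 0..255


def mean_non_white(pixels: Iterable[Pixel], white_threshold: int = 250) -> Tuple[float, float, float]:
    """Average each color component over the pixels that are not (almost) white."""
    kept = [p for p in pixels if not all(channel >= white_threshold for channel in p)]
    if not kept:
        # Assumed fallback for a region that contains only white pixels
        return (255.0, 255.0, 255.0)
    count = len(kept)
    return (sum(p[0] for p in kept) / count,
            sum(p[1] for p in kept) / count,
            sum(p[2] for p in kept) / count)


# Hypothetical pixels inside a bounding box: two white, two reddish
region = [(255, 255, 255), (200, 0, 0), (100, 0, 0), (255, 255, 255)]
red, green, blue = mean_non_white(region)
print(red, green, blue)
```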

class dedoc.data_structures.SizeAnnotation(start: int, end: int, value: str)[source]

Bases: Annotation

This annotation contains the font size of some part of the line in points (1/72 of an inch). These units of measurement are taken from the standard Office Open XML File Formats.

name = 'size'

__init__(start: int, end: int, value: str) → None[source]

Parameters:

start – start of the annotated text
end – end of the annotated text (not included)
value – font size in points (1/72 of an inch) as it’s defined in DOCX format

class dedoc.data_structures.StyleAnnotation(start: int, end: int, value: str)[source]

Bases: Annotation

This annotation contains the information about style of the line in the document. For example, in docx documents lines can be highlighted using Heading styles.

name = 'style'

__init__(start: int, end: int, value: str) → None[source]

Parameters:

start – start of the annotated text
end – end of the annotated text (not included)
value – style name of the text obtained from the document formatting, if it exists (e.g. Heading 1)

class dedoc.data_structures.ConfidenceAnnotation(start: int, end: int, value: str)[source]

Bases: Annotation

Confidence level of some OCR-recognized text inside the line.

name = 'confidence'

__init__(start: int, end: int, value: str) → None[source]

Parameters:

start – start of the text
end – end of the text (not included)
value – confidence level in “percents” (float number from 0 to 1)