dedoc.data_structures

Main classes defining a document

class dedoc.data_structures.UnstructuredDocument(tables: List[Table], lines: List[LineWithMeta], attachments: List[AttachedFile], warnings: List[str] | None = None, metadata: dict | None = None)[source]

This class holds information about raw document content: its text, tables and attachments, that have been procured using one of the readers. Text is represented as a flat list of lines, hierarchy level of each line isn’t defined (only tag hierarchy level may exist).

Variables:
  • lines (List[LineWithMeta]) – list of textual lines with metadata returned by a reader

  • tables (List[Table]) – list of document tables returned by a reader

  • attachments (List[AttachedFile]) – list of document attached files

  • metadata (dict) – information about the document (like in DocumentMetadata)

  • warnings (List[str]) – list of warnings, obtained in the process of the document parsing

class dedoc.data_structures.ParsedDocument(metadata: DocumentMetadata, content: DocumentContent, warnings: List[str] | None = None, attachments: List[ParsedDocument] | None = None)[source]

Bases: Serializable

This class holds information about the document content, metadata and attachments.

Variables:
  • content (DocumentContent) – document text (hierarchy of nodes) and tables

  • attachments (List[ParsedDocument]) – result of analysis of attached files (empty if with_attachments=False)

  • metadata (DocumentMetadata) – document metadata such as size, creation date and so on.

  • warnings (List[str]) – list of warnings and possible errors, arising in the process of document parsing

to_api_schema() ParsedDocument[source]

Convert class data into the corresponding API schema class.

Returns:

API schema class

class dedoc.data_structures.DocumentContent(tables: List[Table], structure: TreeNode, warnings: List[str] | None = None)[source]

Bases: Serializable

This class holds the document content - structured text and tables.

Variables:
  • tables (List[Table]) – list of document tables

  • structure (TreeNode) – tree structure of the document nodes with text and additional metadata

  • warnings (List[str]) – list of warnings, obtained in the process of the document parsing

to_api_schema() DocumentContent[source]

Convert class data into the corresponding API schema class.

Returns:

API schema class

class dedoc.data_structures.DocumentMetadata(file_name: str, temporary_file_name: str, size: int, modified_time: int, created_time: int, access_time: int, file_type: str, uid: str | None = None, **kwargs: Dict[str, str | int | float])[source]

Bases: Serializable

This class holds information about document metadata.

Variables:
  • file_name (str) – original document name (before rename and conversion, so it can contain non-ascii symbols, spaces and so on)

  • temporary_file_name (str) – file name during parsing (unique name after rename and conversion)

  • size (int) – size of the original file in bytes

  • modified_time (int) – time of the last modification in unix time format (seconds since the epoch)

  • created_time (int) – time of the creation in unix time format (seconds since the epoch)

  • access_time (int) – time of the last access to the file in unix time format (seconds since the epoch)

  • file_type (str) – mime type of the file

  • uid (str) – document unique identifier (useful for attached files)

Additional variables may be added with other file metadata.
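The three time fields are plain integers, so they usually need converting before display. A minimal sketch using only the standard library; the helper name unix_to_utc and the sample value are illustrative and not part of dedoc:

```python
from datetime import datetime, timezone


def unix_to_utc(seconds: int) -> datetime:
    """Convert seconds since the epoch into a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Hypothetical value of DocumentMetadata.modified_time
modified = unix_to_utc(1700000000)
print(modified.isoformat())
```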

to_api_schema() DocumentMetadata[source]

Convert class data into the corresponding API schema class.

Returns:

API schema class

class dedoc.data_structures.TreeNode(node_id: str, text: str, annotations: List[Annotation], metadata: LineMetadata, subparagraphs: List[TreeNode], parent: TreeNode | None)[source]

Bases: Serializable

TreeNode helps to represent a document as a recursive tree structure. It has a parent node (None for the root of the tree) and a list of child nodes (an empty list for a leaf node).

Variables:
  • node_id (str) – unique node identifier

  • text (str) – text of the node (may contain several lines)

  • annotations (List[Annotation]) – some metadata related to the part of the text (as font size)

  • metadata (LineMetadata) – metadata refers to entire node (as node type)

  • subparagraphs (List[TreeNode]) – list of children of this node

  • parent (TreeNode) – parent node (None for root, not none for other nodes)

add_child(line: LineWithMeta) TreeNode[source]

Create a new tree node (a child of the given node) from the given line. Return the newly created node.

Parameters:

line – Line with meta, new node will be built from this line

Returns:

the created node (a child of self)

add_text(line: LineWithMeta) None[source]

Add the text and annotations from the given line; the text is separated with a newline symbol.

Parameters:

line – line with text to add

static create(lines: List[LineWithMeta] | None = None) TreeNode[source]

Creates a root node with given text.

Parameters:

lines – these lines should be the title of the document (or should be empty for documents without a title)

Returns:

root of the document tree

get_root() TreeNode[source]
Returns:

root of the tree

to_api_schema() TreeNode[source]

Convert class data into the corresponding API schema class.

Returns:

API schema class
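The parent/subparagraphs relationship described above can be illustrated with a simplified stand-in. This sketch is not the dedoc TreeNode class: the Node type, its plain-string add_child argument and the dotted node_id scheme are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity comparison avoids recursing through parent links
class Node:
    node_id: str
    text: str
    parent: Optional["Node"] = None
    subparagraphs: List["Node"] = field(default_factory=list)

    def add_child(self, text: str) -> "Node":
        """Create a child of this node and return it."""
        # Hypothetical id scheme: parent id plus the child's index
        child = Node(f"{self.node_id}.{len(self.subparagraphs)}", text, parent=self)
        self.subparagraphs.append(child)
        return child

    def get_root(self) -> "Node":
        """Follow parent links until the node without a parent is reached."""
        node = self
        while node.parent is not None:
            node = node.parent
        return node


root = Node("0", "Document title")
chapter = root.add_child("Chapter 1")
paragraph = chapter.add_child("Some raw text")
```

Here root.parent is None, paragraph.subparagraphs is an empty list (a leaf), and paragraph.get_root() walks back up to root.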

class dedoc.data_structures.LineWithMeta(line: str, metadata: LineMetadata | None = None, annotations: List[Annotation] | None = None, uid: str | None = None)[source]

Bases: Sized, Serializable

Structural unit of document - line (or paragraph) of text and its metadata. One LineWithMeta should not contain text from different logical parts of the document (for example, document title and raw text of the document should not be in the same line). Still the logical part of the document may be represented by more than one line (for example, document title may consist of many lines).

__len__() int[source]
__getitem__(index: slice | int) LineWithMeta[source]
__add__(other: LineWithMeta | str) LineWithMeta[source]
__init__(line: str, metadata: LineMetadata | None = None, annotations: List[Annotation] | None = None, uid: str | None = None) None[source]
Parameters:
  • line – raw text of the document line

  • metadata – metadata (related to the entire line, as line or page number, its hierarchy level)

  • annotations – metadata that refers to some part of the text, for example, font size, font type, etc.

  • uid – unique identifier of the line

__lt__(other: LineWithMeta) bool[source]

Return self<value.

property annotations: List[Annotation]

Metadata that refers to some part of the text, for example, font size, font type, etc.

static join(lines: List[LineWithMeta], delimiter: str = '\n') LineWithMeta[source]

Join a list of lines with the given delimiter, keeping annotations consistent. This method is similar to the Python built-in join method for strings.

Parameters:
  • lines – list of lines to join

  • delimiter – delimiter to insert between lines

Returns:

merged line
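To show what keeping annotations consistent means when joining, here is a minimal sketch on plain tuples. It is not dedoc's implementation: lines are modelled as (text, spans) pairs and each span as a (start, end, name, value) tuple with an exclusive end, all of which are assumptions for illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int, str, str]  # (start, end, name, value), end is exclusive


def join_with_annotations(lines: List[Tuple[str, List[Span]]], delimiter: str = "\n") -> Tuple[str, List[Span]]:
    """Join texts and shift every span by the length of the text placed before it."""
    texts, merged, offset = [], [], 0
    for index, (text, spans) in enumerate(lines):
        if index > 0:
            offset += len(delimiter)  # the delimiter precedes every line but the first
        for start, end, name, value in spans:
            merged.append((start + offset, end + offset, name, value))
        texts.append(text)
        offset += len(text)
    return delimiter.join(texts), merged


joined_text, joined_spans = join_with_annotations([
    ("ab", [(0, 2, "bold", "True")]),
    ("cd", [(1, 2, "italic", "True")]),
])
```

After joining, the italic span still covers the same character ("d"), only at a shifted position.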

property line: str

Raw text of the document line

property metadata: LineMetadata

Line metadata related to the entire line, as line or page number, hierarchy level

set_line(line: str) None[source]
set_metadata(metadata: LineMetadata) None[source]
shift(shift_x: int, shift_y: int, image_width: int, image_height: int) None[source]
split(sep: str) List[LineWithMeta][source]

Split this line into a list of lines, keep annotations consistent. This method does not remove any text from the line.

Parameters:

sep – separator for splitting

Returns:

list of split lines
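The contract above (no text is removed, annotations stay consistent) can be sketched on plain strings. This is an illustration under assumptions, not dedoc's code: the separator is kept at the end of the left piece, sep is treated as a literal string, and spans are (start, end, name, value) tuples with an exclusive end.

```python
from typing import List, Tuple

Span = Tuple[int, int, str, str]  # (start, end, name, value), end is exclusive


def split_keep_annotations(text: str, spans: List[Span], sep: str) -> List[Tuple[str, List[Span]]]:
    """Cut text after every occurrence of sep and clip spans to each piece."""
    if not sep:
        raise ValueError("empty separator")
    borders = [0]
    pos = text.find(sep)
    while pos != -1:
        borders.append(pos + len(sep))  # the separator stays in the left piece
        pos = text.find(sep, pos + len(sep))
    if borders[-1] != len(text):
        borders.append(len(text))
    pieces = []
    for left, right in zip(borders[:-1], borders[1:]):
        clipped = []
        for start, end, name, value in spans:
            new_start, new_end = max(start, left), min(end, right)
            if new_start < new_end:  # the span overlaps this piece
                clipped.append((new_start - left, new_end - left, name, value))
        pieces.append((text[left:right], clipped))
    return pieces


pieces = split_keep_annotations("ab\ncd", [(1, 4, "bold", "True")], "\n")
```

Concatenating the piece texts reproduces the original string, and the bold span that crossed the separator is divided between the two pieces.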

to_api_schema() LineWithMeta[source]

Convert class data into the corresponding API schema class.

Returns:

API schema class

property uid: str

Unique identifier of the line

class dedoc.data_structures.LineMetadata(page_id: int, line_id: int | None, tag_hierarchy_level: HierarchyLevel | None = None, hierarchy_level: HierarchyLevel | None = None, **kwargs: Dict[str, str | int | float])[source]

Bases: Serializable

This class holds information about document node (and document line) metadata, such as page number or line level in a document hierarchy.

Variables:
  • tag_hierarchy_level (HierarchyLevel) – the hierarchy level of the line with its type directly extracted by some of the readers (usually information got from tags e.g. in docx or html readers)

  • hierarchy_level (Optional[HierarchyLevel]) – the hierarchy level of the line extracted by some of the structure extractors - the result type and level of the line. The lower the level of the hierarchy, the closer it is to the root, it’s used to construct document tree.

  • page_id (int) – page number where paragraph starts, the numeration starts from page 0

  • line_id (Optional[int]) – line number inside the entire document, the numeration starts from line 0

Additional variables may be added with other line metadata.

to_api_schema() LineMetadata[source]

Convert class data into the corresponding API schema class.

Returns:

API schema class

class dedoc.data_structures.HierarchyLevel(level_1: int | None, level_2: int | None, can_be_multiline: bool, line_type: str)[source]

This class defines the level of the document line. The lower its value, the more important the line is.

The level of the line consists of two parts:
  • level_1 defines primary importance (e.g. root - level_1=0, header - level_1=1, etc.);

  • level_2 defines the level inside lines of equal type (e.g. for list items - “1.” - level_2=1, “1.1.” - level_2=2, etc.).

For the least important lines (line_type=raw_text) both levels are None.

See the hierarchy level description for more details.

Variables:
  • level_1 (Optional[int]) – value of a line’s primary importance

  • level_2 (Optional[int]) – level of the line inside specific class

  • can_be_multiline (bool) – is used to unify lines inside tree node, if line can be multiline, it can be joined with another line

  • line_type (str) – type of the line, e.g. raw text, list item, header, etc.

__eq__(other: HierarchyLevel) bool[source]
Defines the equality of two hierarchy levels:
  • two lines with equal level_1, level_2 are equal.

  • if some of the levels is None, its value is considered as +inf (infinities have equal value)

Parameters:

other – other hierarchy level

Returns:

whether current hierarchy level == other hierarchy level

__lt__(other: HierarchyLevel) bool[source]
Defines the comparison of hierarchy levels:
  • current level < other level if (level_1, level_2) < other (level_1, level_2);

  • if some of the levels is None, its value is considered as +inf (infinities have equal value)

Parameters:

other – other hierarchy level

Returns:

whether current hierarchy level < other hierarchy level
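The comparison rules stated for __eq__ and __lt__ amount to ordering (level_1, level_2) tuples with None replaced by +inf. A minimal sketch of that rule alone; the level_key helper and the concrete level values for list items are hypothetical, and the real class may apply extra checks (for example on line_type) that are not shown here.

```python
import math
from typing import Optional, Tuple


def level_key(level_1: Optional[int], level_2: Optional[int]) -> Tuple[float, float]:
    """Map a pair of levels to a sortable key, treating None as +inf."""
    return (
        math.inf if level_1 is None else level_1,
        math.inf if level_2 is None else level_2,
    )


# Hypothetical levels; only root level_1=0 and header level_1=1 come from the text above
levels = {
    "raw_text": (None, None),
    "list 1.1.": (2, 2),
    "header": (1, 1),
    "list 1.": (2, 1),
    "root": (0, 0),
}
ordered = sorted(levels, key=lambda name: level_key(*levels[name]))
```

Sorting by this key places the root first and raw text last, which matches the rule that lower levels are closer to the root.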

static create_raw_text() HierarchyLevel[source]

Create hierarchy level for a raw textual line.

static create_root() HierarchyLevel[source]

Create hierarchy level for the document root.

static create_unknown() HierarchyLevel[source]

Create hierarchy level for a line with unknown type.

is_list_item() bool[source]

Check if the line is a list item.

is_raw_text() bool[source]

Check if the line is raw text.

is_unknown() bool[source]

Check if the type of the line is unknown (only for levels from readers).

class dedoc.data_structures.Table(cells: List[List[CellWithMeta]], metadata: TableMetadata)[source]

Bases: Serializable

This class holds information about tables in the document. We assume that a table has rectangle form (has the same number of columns in each row). If some cells are merged, they are duplicated and information about merge is stored in rowspan and colspan. Table representation is row-based i.e. external list contains list of rows.

Variables:
  • metadata (TableMetadata) – table metadata such as location, title and so on

  • cells (List[List[CellWithMeta]]) – a list of lists of table cells (a cell has text lines, colspan and rowspan attributes)

to_api_schema() Table[source]

Convert class data into the corresponding API schema class.

Returns:

API schema class

class dedoc.data_structures.TableMetadata(page_id: int | None, uid: str | None = None, rotated_angle: float = 0.0, title: str = '')[source]

Bases: Serializable

This class holds the information about table unique identifier, rotation angle (if table has been rotated - for images) and so on.

Variables:
  • page_id (Optional[int]) – number of the page where table starts

  • uid (str) – unique identifier of the table (used for linking table to text)

  • rotated_angle (float) – value of the rotation angle by which the table was rotated during recognition

  • title (str) – table’s title

to_api_schema() TableMetadata[source]

Convert class data into the corresponding API schema class.

Returns:

API schema class

class dedoc.data_structures.CellWithMeta(lines: List[LineWithMeta] | None, colspan: int = 1, rowspan: int = 1, invisible: bool = False)[source]

Bases: Serializable

This class holds the information about the cell: list of lines and cell properties (rowspan, colspan, invisible).

Variables:
  • lines (List[LineWithMeta]) – list of textual lines of the cell

  • colspan (int) – number of columns to span (for cells merged horizontally)

  • rowspan (int) – number of rows to span (for cells merged vertically)

  • invisible (bool) – indicator for displaying or hiding cell text - cells that are merged with others are hidden (for HTML display)
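The rectangular, row-based layout with duplicated merged cells can be pictured with a small stand-in. This sketch does not use the dedoc classes: the Cell dataclass and place_merged helper are hypothetical, and it assumes the top-left cell of a merged region carries the spans while its duplicates are marked invisible with spans of 1.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    text: str
    colspan: int = 1
    rowspan: int = 1
    invisible: bool = False


def place_merged(grid: List[List[Cell]], row: int, col: int, text: str, colspan: int, rowspan: int) -> None:
    """Fill a merged region: the top-left cell keeps the spans, duplicates are hidden."""
    for r in range(row, row + rowspan):
        for c in range(col, col + colspan):
            is_anchor = r == row and c == col
            grid[r][c] = Cell(
                text,
                colspan if is_anchor else 1,
                rowspan if is_anchor else 1,
                invisible=not is_anchor,
            )


# A 2x3 table whose first header cell spans two columns
grid = [[Cell("") for _ in range(3)] for _ in range(2)]
place_merged(grid, 0, 0, "Name", colspan=2, rowspan=1)
grid[0][2] = Cell("Age")
visible_header = [cell.text for cell in grid[0] if not cell.invisible]
```

Every row keeps the same number of cells, and filtering out invisible cells recovers what would be rendered.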

get_annotations() List[Annotation][source]

Get merged annotations of all cell lines (start/end of annotations moved according to the merged text)

get_text() str[source]

Get merged text of all cell lines

to_api_schema() CellWithMeta[source]

Convert class data into the corresponding API schema class.

Returns:

API schema class

Helper classes

class dedoc.data_structures.Serializable[source]

Base class for the API schema objects which we later need to convert to dict.

abstract to_api_schema() BaseModel[source]

Convert class data into the corresponding API schema class.

Returns:

API schema class

class dedocutils.data_structures.BBox(x_top_left: int, y_top_left: int, width: int, height: int)[source]

Bounding box around some page object, the coordinate system starts from top left corner.

x_top_left
y_top_left
x_bottom_right
y_bottom_right
width
height
__init__(x_top_left: int, y_top_left: int, width: int, height: int) None[source]

The following parameters should have values of pixels number.

Parameters:
  • x_top_left – x coordinate of the bbox top left corner

  • y_top_left – y coordinate of the bbox top left corner

  • width – bounding box width

  • height – bounding box height

static from_two_points(top_left: Tuple[int, int], bottom_right: Tuple[int, int]) BBox[source]

Make the bounding box from two points.

Parameters:
  • top_left – (x, y) point of the bbox top left corner

  • bottom_right – (x, y) point of the bbox bottom right corner

have_intersection_with_box(box: BBox, threshold: float = 0.3) bool[source]

Check if the current bounding box has the intersection with another one.

Parameters:
  • box – another bounding box to check intersection with

  • threshold – the lowest value of the intersection over union used to get the boolean result
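For readers unfamiliar with the term, intersection over union (IoU) can be sketched as follows. This follows the parameter description only and may differ from what the library actually computes; the function names, the (x_top_left, y_top_left, width, height) tuple layout and the use of >= for the threshold are assumptions.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x_top_left, y_top_left, width, height)


def intersection_over_union(a: Box, b: Box) -> float:
    """Area of the overlap divided by the area covered by both boxes together."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = min(ax + aw, bx + bw) - max(ax, bx)
    inter_h = min(ay + ah, by + bh) - max(ay, by)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # the boxes do not overlap
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union


def have_intersection(a: Box, b: Box, threshold: float = 0.3) -> bool:
    """Boolean decision based on the IoU value."""
    return intersection_over_union(a, b) >= threshold


overlap = intersection_over_union((0, 0, 10, 10), (5, 0, 10, 10))
```

For the two boxes above, the overlap is 50 pixels out of a union of 150, so the ratio is one third and passes the default threshold.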

property square: int

Area of the bbox.

class dedoc.data_structures.AttachedFile(original_name: str, tmp_file_path: str, need_content_analysis: bool, uid: str)[source]

Holds information about files, attached to the parsed document.

Variables:
  • original_name (str) – original name of the attached file if it was possible to extract it

  • tmp_file_path (str) – path to the attached file on disk - its name is different from original_name

  • need_content_analysis (bool) – does the attached file need parsing (enable recursive parsing in DedocManager)

  • uid (str) – unique identifier of the attached file

Annotations of the text lines

class dedoc.data_structures.Annotation(start: int, end: int, name: str, value: str, is_mergeable: bool = True)[source]

Bases: Serializable

Base class for text annotations of all kinds. An annotation is a piece of information about the text line: its appearance or links to another document object. See the concrete kinds of annotations for more examples.

Variables:
  • start (int) – start of the annotated text

  • end (int) – end of the annotated text (end isn’t included)

  • name (str) – annotation’s name, specific for each type of annotation

  • value (str) – information about annotated text, depends on the type of annotation, e.g. “True”/”False”, “10.0”, etc.

  • is_mergeable (bool) – is it possible to merge annotations with the same value
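Since end is exclusive, an annotation covers exactly line[start:end]. The sketch below also illustrates one possible reading of is_mergeable: touching or overlapping ranges with the same name and value collapse into one. It works on plain (start, end, name, value) tuples, and merge_adjacent is a hypothetical helper, not a dedoc function.

```python
from typing import List, Tuple

Span = Tuple[int, int, str, str]  # (start, end, name, value), end is exclusive


def merge_adjacent(spans: List[Span]) -> List[Span]:
    """Merge touching or overlapping spans that share the same name and value."""
    merged: List[Span] = []
    for start, end, name, value in sorted(spans, key=lambda s: (s[2], s[3], s[0])):
        if merged and merged[-1][2:] == (name, value) and merged[-1][1] >= start:
            prev_start, prev_end, _, _ = merged.pop()
            merged.append((prev_start, max(prev_end, end), name, value))
        else:
            merged.append((start, end, name, value))
    return merged


line = "Hello world"
covered = line[6:11]  # text covered by an annotation with start=6, end=11
merged_spans = merge_adjacent([
    (0, 5, "bold", "True"),
    (5, 11, "bold", "True"),
    (0, 3, "italic", "True"),
])
```

The two bold spans become a single span over the whole line, while the italic span is left untouched.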

Concrete annotations

class dedoc.data_structures.AttachAnnotation(attach_uid: str, start: int, end: int)[source]

Bases: Annotation

This annotation indicates the place of the attachment in the original document (for example, the place where an image was placed in the docx document). The line containing this annotation is placed directly before the referred attachment.

name: str = 'attachment'
__init__(attach_uid: str, start: int, end: int) None[source]
Parameters:
  • attach_uid – unique identifier of the attachment which is referenced inside this annotation

  • start – start of the annotated text (usually zero)

  • end – end of the annotated text (usually end of the line)

class dedoc.data_structures.TableAnnotation(value: str, start: int, end: int)[source]

Bases: Annotation

This annotation indicates the place of the table in the original document. The line containing this annotation is placed directly before the referred table.

name: str = 'table'
__init__(value: str, start: int, end: int) None[source]
Parameters:
  • value – unique identifier of the table which is referenced inside this annotation

  • start – start of the annotated text (usually zero)

  • end – end of the annotated text (usually end of the line)

class dedoc.data_structures.LinkedTextAnnotation(start: int, end: int, value: str)[source]

Bases: Annotation

This annotation is used when some text is linked to the line or its part. For example, a line can contain a number that refers to a footnote: the text of this footnote will be the value of this annotation.

name: str = 'linked_text'
__init__(start: int, end: int, value: str) None[source]
Parameters:
  • start – start of the annotated text

  • end – end of the annotated text (not included)

  • value – text, linked to given one, for example text of the footnote

class dedoc.data_structures.ReferenceAnnotation(value: str, start: int, end: int)[source]

Bases: Annotation

This annotation points to a place in the document text that is a link to another line in the document (for example, another textual line).

Example of usage for document_type="article": a link to a bibliography_item LineWithMeta.

LineWithMeta:

LineWithMeta(   # the line with the reference annotation
    line="As for the PRF, we use the tree-based construction from Goldreich, Goldwasser and Micali [18]",
    metadata=LineMetadata(page_id=0, line_id=32),
    annotations=[ReferenceAnnotation(start=90, end=92, value="97cfac39-f0e3-11ee-b81c-b88584b4e4a1"), ...]
)

other LineWithMeta:

LineWithMeta(   # the line referenced by the previous one
    line="some your text (can be empty)",
    metadata=LineMetadata(
        page_id=10,
        line_id=189,
        # can_be_multiline is required by the constructor; False is an illustrative value
        tag_hierarchy_level=HierarchyLevel(level_1=2, level_2=0, can_be_multiline=False, line_type="bibliography_item")
    ),
    annotations=[],
    uid="97cfac39-f0e3-11ee-b81c-b88584b4e4a1"
)
name: str = 'reference'
__init__(value: str, start: int, end: int) None[source]
Parameters:
  • value – unique identifier of the line to which this annotation refers

  • start – start of the annotated text with a link

  • end – end of the annotated text with a link

class dedoc.data_structures.BBoxAnnotation(start: int, end: int, value: BBox, page_width: int, page_height: int)[source]

Bases: Annotation

Coordinates of the line’s bounding box (in relative coordinates) - for pdf documents.

name: str = 'bounding box'
__init__(start: int, end: int, value: BBox, page_width: int, page_height: int) None[source]
Parameters:
  • start – start of the annotated text (usually zero)

  • end – end of the annotated text (usually end of the line)

  • value – bounding box where line is located

  • page_width – width of original image with this bbox

  • page_height – height of original image with this bbox
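Relative coordinates here can be read as pixel values divided by the page size, which is why page_width and page_height are passed alongside the box. A minimal sketch of that normalization; the to_relative helper and its tuple output are assumptions and do not show the format dedoc actually serializes.

```python
from typing import Tuple


def to_relative(x: int, y: int, width: int, height: int, page_width: int, page_height: int) -> Tuple[float, float, float, float]:
    """Scale pixel coordinates into the 0..1 range relative to the page size."""
    return (x / page_width, y / page_height, width / page_width, height / page_height)


# Hypothetical box on a 1000x500 pixel page
relative_box = to_relative(100, 50, 200, 25, page_width=1000, page_height=500)
```

Relative values stay valid when the page image is rescaled, which is the usual reason for storing them this way.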

class dedoc.data_structures.AlignmentAnnotation(start: int, end: int, value: str)[source]

Bases: Annotation

This annotation defines the alignment of the entire line in the document: left, right, to the both sides of the page or in the center.

name: str = 'alignment'
__init__(start: int, end: int, value: str) None[source]
Parameters:
  • start – start of the annotated text (usually zero)

  • end – end of the annotated text (usually end of the line)

  • value – kind of the line alignment: left, right, both or center

class dedoc.data_structures.IndentationAnnotation(start: int, end: int, value: str)[source]

Bases: Annotation

This annotation contains the indentation of the entire line in twentieths of a point (1/1440 of an inch). These units of measurement are taken from the standard Office Open XML File Formats.

name: str = 'indentation'
__init__(start: int, end: int, value: str) None[source]
Parameters:
  • start – start of the annotated text (usually zero)

  • end – end of the annotated text (usually end of the line)

  • value – text indentation in twentieths of a point (1/1440 of an inch) as it’s defined in DOCX format
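Because the value is a string in twentieths of a point, converting it to more familiar units is a matter of two divisions. A minimal sketch; the helper names are illustrative and not part of dedoc.

```python
TWIPS_PER_POINT = 20   # a twip is one twentieth of a point
POINTS_PER_INCH = 72


def twips_to_points(value: str) -> float:
    """Convert an annotation value given in twentieths of a point to points."""
    return float(value) / TWIPS_PER_POINT


def twips_to_inches(value: str) -> float:
    """Convert an annotation value given in twentieths of a point to inches."""
    return twips_to_points(value) / POINTS_PER_INCH


indent_points = twips_to_points("720")
indent_inches = twips_to_inches("720")
```

An indentation value of "720" therefore corresponds to 36 points, or half an inch.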

class dedoc.data_structures.SpacingAnnotation(start: int, end: int, value: str)[source]

Bases: Annotation

This annotation contains the spacing between the current line and the previous one. It’s measured in twentieths of a point or hundredths of a line according to the standard Office Open XML File Formats.

name: str = 'spacing'
__init__(start: int, end: int, value: str) None[source]
Parameters:
  • start – start of the annotated text (usually zero)

  • end – end of the annotated text (usually end of the line)

  • value – spacing between the current line and the previous one as it’s defined in DOCX format (integer value)

class dedoc.data_structures.BoldAnnotation(start: int, end: int, value: str)[source]

Bases: Annotation

Boldness of some text inside the line.

name: str = 'bold'
__init__(start: int, end: int, value: str) None[source]
Parameters:
  • start – start of the bold text

  • end – end of the bold text (not included)

  • value – True if bold else False (False usually isn’t used because you may not use this annotation at all)

class dedoc.data_structures.ItalicAnnotation(start: int, end: int, value: str)[source]

Bases: Annotation

Text written in italic inside the line.

name: str = 'italic'
__init__(start: int, end: int, value: str) None[source]
Parameters:
  • start – start of the italic text

  • end – end of the italic text (not included)

  • value – True if italic else False (False usually isn’t used because you may not use this annotation at all)

class dedoc.data_structures.UnderlinedAnnotation(start: int, end: int, value: str)[source]

Bases: Annotation

Underlined text inside the line.

name: str = 'underlined'
__init__(start: int, end: int, value: str) None[source]
Parameters:
  • start – start of the underlined text

  • end – end of the underlined text (not included)

  • value – True if underlined else False (False usually isn’t used because you may not use this annotation at all)

class dedoc.data_structures.StrikeAnnotation(start: int, end: int, value: str)[source]

Bases: Annotation

Strikethrough of some text inside the line.

name: str = 'strike'
__init__(start: int, end: int, value: str) None[source]
Parameters:
  • start – start of the strikethrough text

  • end – end of the strikethrough text (not included)

  • value – True if strikethrough else False (False usually isn’t used because you may not use this annotation at all)

class dedoc.data_structures.SubscriptAnnotation(start: int, end: int, value: str)[source]

Bases: Annotation

Subscript text inside the line.

name: str = 'subscript'
__init__(start: int, end: int, value: str) None[source]
Parameters:
  • start – start of the subscript text

  • end – end of the subscript text (not included)

  • value – True if subscript else False (False usually isn’t used because you may not use this annotation at all)

class dedoc.data_structures.SuperscriptAnnotation(start: int, end: int, value: str)[source]

Bases: Annotation

Superscript text inside the line.

name: str = 'superscript'
__init__(start: int, end: int, value: str) None[source]
Parameters:
  • start – start of the superscript text

  • end – end of the superscript text (not included)

  • value – True if superscript else False (False usually isn’t used because you may not use this annotation at all)

class dedoc.data_structures.ColorAnnotation(start: int, end: int, red: float, green: float, blue: float)[source]

Bases: Annotation

Color of some text inside the line in the RGB format.

name: str = 'color_annotation'
__init__(start: int, end: int, red: float, green: float, blue: float) None[source]
Parameters:
  • start – start of the colored text

  • end – end of the colored text (not included)

  • red – mean value of the red color component in the pixels that are not white in the given bounding box

  • green – mean value of the green color component in the pixels that are not white in the given bounding box

  • blue – mean value of the blue color component in the pixels that are not white in the given bounding box
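The three components are described as means over the non-white pixels of a region. A minimal sketch of that averaging on a list of RGB tuples; treating only pure (255, 255, 255) as white and returning None for an all-white region are assumptions, and mean_non_white_color is a hypothetical helper.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0..255


def mean_non_white_color(pixels: List[Pixel], white: Pixel = (255, 255, 255)) -> Optional[Tuple[float, float, float]]:
    """Average each color channel over the pixels that are not white."""
    colored = [pixel for pixel in pixels if pixel != white]
    if not colored:
        return None  # nothing but background in this region
    count = len(colored)
    red = sum(pixel[0] for pixel in colored) / count
    green = sum(pixel[1] for pixel in colored) / count
    blue = sum(pixel[2] for pixel in colored) / count
    return (red, green, blue)


mean_color = mean_non_white_color([(255, 255, 255), (0, 0, 0), (100, 50, 0)])
```

Excluding the white background keeps the result close to the ink color even when the text occupies a small part of the box.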

class dedoc.data_structures.SizeAnnotation(start: int, end: int, value: str)[source]

Bases: Annotation

This annotation contains the font size of some part of the line in points (1/72 of an inch). These units of measurement are taken from the standard Office Open XML File Formats.

name: str = 'size'
__init__(start: int, end: int, value: str) None[source]
Parameters:
  • start – start of the annotated text

  • end – end of the annotated text (not included)

  • value – font size in points (1/72 of an inch) as it’s defined in DOCX format

class dedoc.data_structures.StyleAnnotation(start: int, end: int, value: str)[source]

Bases: Annotation

This annotation contains the information about style of the line in the document. For example, in docx documents lines can be highlighted using Heading styles.

name: str = 'style'
__init__(start: int, end: int, value: str) None[source]
Parameters:
  • start – start of the annotated text

  • end – end of the annotated text (not included)

  • value – style name of the text taken from the document formatting, if it exists (e.g. Heading 1)

class dedoc.data_structures.ConfidenceAnnotation(start: int, end: int, value: str)[source]

Bases: Annotation

Confidence level of some text inside the line that was recognized with OCR.

name: str = 'confidence'
__init__(start: int, end: int, value: str) None[source]
Parameters:
  • start – start of the text

  • end – end of the text (not included)

  • value – confidence level in “percents” (float number from 0 to 1)