dedoc.data_structures
Main classes defining a document
- class dedoc.data_structures.UnstructuredDocument(tables: List[Table], lines: List[LineWithMeta], attachments: List[AttachedFile], warnings: List[str] | None = None, metadata: dict | None = None)[source]
This class holds the raw document content (text, tables and attachments) obtained by one of the readers. Text is represented as a flat list of lines; the hierarchy level of each line isn’t defined (only the tag hierarchy level may exist).
- __init__(tables: List[Table], lines: List[LineWithMeta], attachments: List[AttachedFile], warnings: List[str] | None = None, metadata: dict | None = None) None[source]
- Parameters:
tables – list of document tables
lines – list of raw document lines
attachments – list of document attachments
warnings – list of warnings, obtained in the process of the document parsing
metadata – additional data
- class dedoc.data_structures.ParsedDocument(metadata: DocumentMetadata, content: DocumentContent | None, warnings: List[str] | None = None, attachments: List[ParsedDocument] | None = None)[source]
Bases: Serializable
This class holds information about the document content, metadata and attachments.
- __init__(metadata: DocumentMetadata, content: DocumentContent | None, warnings: List[str] | None = None, attachments: List[ParsedDocument] | None = None) None[source]
- Parameters:
metadata – document metadata such as size, creation date and so on.
content – text and tables
attachments – result of analysis of attached files
warnings – list of warnings and possible errors, arising in the process of document parsing
- to_api_schema() ParsedDocument[source]
Convert class data into the corresponding API schema class.
- Returns:
API schema class
- class dedoc.data_structures.DocumentContent(tables: List[Table], structure: TreeNode, warnings: List[str] | None = None)[source]
Bases: Serializable
This class holds the document content - structured text and tables.
- __init__(tables: List[Table], structure: TreeNode, warnings: List[str] | None = None) None[source]
- Parameters:
tables – list of document tables
structure – tree structure in which content of the document is organized
warnings – list of warnings, obtained in the process of the document structure constructing
- to_api_schema() DocumentContent[source]
Convert class data into the corresponding API schema class.
- Returns:
API schema class
- class dedoc.data_structures.DocumentMetadata(file_name: str, temporary_file_name: str, size: int, modified_time: int, created_time: int, access_time: int, file_type: str, other_fields: dict | None = None, uid: str | None = None)[source]
Bases: Serializable
This class holds information about document metadata.
- __init__(file_name: str, temporary_file_name: str, size: int, modified_time: int, created_time: int, access_time: int, file_type: str, other_fields: dict | None = None, uid: str | None = None) None[source]
- Parameters:
uid – document unique identifier (useful for attached files)
file_name – original document name (before rename and conversion, so it can contain non-ascii symbols, spaces and so on)
temporary_file_name – file name during parsing (unique name after rename and conversion)
size – size of the original file in bytes
modified_time – time of the last modification in unix time format (seconds since the epoch)
created_time – time of the creation in unix time format
access_time – time of the last access to the file in unix time format
file_type – mime type of the file
other_fields – additional fields of user metadata
- extend_other_fields(new_fields: dict) None[source]
Add new attributes to the class and to the other_fields dictionary.
- Parameters:
new_fields – fields to add
- to_api_schema() DocumentMetadata[source]
Convert class data into the corresponding API schema class.
- Returns:
API schema class
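The behaviour of extend_other_fields described above (new fields become both attributes and entries of other_fields) can be sketched as follows. This is only an illustration: MetadataSketch is a hypothetical stand-in, not the dedoc class, and the use of setattr is an assumption about how such behaviour could be achieved.

```python
from typing import Optional


class MetadataSketch:
    """Hypothetical stand-in for a metadata holder; not the dedoc implementation."""

    def __init__(self, file_name: str, other_fields: Optional[dict] = None) -> None:
        self.file_name = file_name
        self.other_fields: dict = {}
        if other_fields:
            self.extend_other_fields(other_fields)

    def extend_other_fields(self, new_fields: dict) -> None:
        # Assumption: each new field is stored twice, as an attribute and as a dict entry.
        for key, value in new_fields.items():
            setattr(self, key, value)
            self.other_fields[key] = value


meta = MetadataSketch(file_name="report.docx")
meta.extend_other_fields({"author": "Jane Doe", "pages": 12})
```

After the call, the value is reachable both as meta.author and as meta.other_fields["author"].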
- class dedoc.data_structures.TreeNode(node_id: str, text: str, annotations: List[Annotation], metadata: LineMetadata, subparagraphs: List[TreeNode], parent: TreeNode | None)[source]
Bases: Serializable
TreeNode helps to represent a document as a recursive tree structure. It has a parent node (None for the root of the tree) and a list of child nodes (an empty list for a leaf node).
- __init__(node_id: str, text: str, annotations: List[Annotation], metadata: LineMetadata, subparagraphs: List[TreeNode], parent: TreeNode | None) None[source]
- Parameters:
node_id – node identifier, unique within one document
text – text of the node
annotations – metadata related to a part of the text (such as font size)
metadata – metadata related to the entire node (such as node type)
subparagraphs – list of children of this node
parent – parent node (None for the root, not None for other nodes)
- add_child(line: LineWithMeta) TreeNode[source]
Create a new tree node, a child of the given node, from the given line. Return the newly created node.
- Parameters:
line – line with meta from which the new node will be built
- Returns:
the created node (a child of self)
- add_text(line: LineWithMeta) None[source]
Add the text and annotations from the given line; the text is separated with a new line symbol.
- Parameters:
line – line with text to add
- static create(lines: List[LineWithMeta] | None = None) TreeNode[source]
Creates a root node with given text.
- Parameters:
lines – these lines should be the title of the document (or should be empty for documents without a title)
- Returns:
root of the document tree
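To illustrate the recursive structure described for TreeNode, here is a minimal sketch that builds a tree from a flat list of lines and walks it depth-first. Everything in it is hypothetical: NodeSketch, build_tree and collect_ids are illustrative names, a single integer stands in for the hierarchy level, and the dotted node ids are an assumption, not a statement about how dedoc numbers nodes.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(eq=False)
class NodeSketch:
    """Hypothetical stand-in for a tree node; not the dedoc TreeNode."""
    node_id: str
    text: str
    level: int
    parent: Optional["NodeSketch"] = None
    subparagraphs: List["NodeSketch"] = field(default_factory=list)

    def add_child(self, text: str, level: int) -> "NodeSketch":
        # Child ids extend the parent id, which keeps them unique within one tree.
        child = NodeSketch(f"{self.node_id}.{len(self.subparagraphs)}", text, level, parent=self)
        self.subparagraphs.append(child)
        return child


def build_tree(lines: List[Tuple[str, int]]) -> NodeSketch:
    """Attach each (text, level) pair under the nearest preceding line with a lower level."""
    root = NodeSketch(node_id="0", text="", level=0)
    current = root
    for text, level in lines:
        # Climb towards the root until a node with a strictly lower level is found.
        while current.parent is not None and current.level >= level:
            current = current.parent
        current = current.add_child(text, level)
    return root


def collect_ids(node: NodeSketch) -> List[str]:
    """Depth-first traversal that returns the ids of the node and all its descendants."""
    ids = [node.node_id]
    for child in node.subparagraphs:
        ids.extend(collect_ids(child))
    return ids


tree = build_tree([("Chapter 1", 1), ("Section 1.1", 2), ("Section 1.2", 2), ("Chapter 2", 1)])
```

The root has no parent and two children here; both sections end up under the first chapter because their level is higher than the chapter's.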
- class dedoc.data_structures.LineWithMeta(line: str, metadata: LineMetadata | None = None, annotations: List[Annotation] | None = None, uid: str | None = None)[source]
Bases: Sized, Serializable
Structural unit of document - line (or paragraph) of text and its metadata. One LineWithMeta should not contain text from different logical parts of the document (for example, document title and raw text of the document should not be in the same line). Still the logical part of the document may be represented by more than one line (for example, document title may consist of many lines).
- __getitem__(index: slice | int) LineWithMeta[source]
- __add__(other: LineWithMeta | str) LineWithMeta[source]
- __init__(line: str, metadata: LineMetadata | None = None, annotations: List[Annotation] | None = None, uid: str | None = None) None[source]
- Parameters:
line – raw text of the document line
metadata – metadata related to the entire line, such as line number, page number or hierarchy level
annotations – metadata that refers to some part of the text, for example, font size, font type, etc.
uid – unique identifier of the line
- __lt__(other: LineWithMeta) bool[source]
Return self<value.
- property annotations: List[Annotation]
- static join(lines: List[LineWithMeta], delimiter: str = '\n') LineWithMeta[source]
Join a list of lines with the given delimiter, keeping annotations consistent. This method is similar to the Python built-in join method for strings.
- Parameters:
lines – list of lines to join
delimiter – delimiter to insert between lines
- Returns:
merged line
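What "keep annotations consistent" means for join can be shown with a small sketch: when texts are concatenated, the offsets of each annotation have to be shifted by the length of everything placed before its line. The tuple-based types and the join_lines function are hypothetical simplifications, not the dedoc classes.

```python
from typing import List, Tuple

# Hypothetical simplified types: an annotation is (start, end, name, value),
# a line is (text, annotations). These are not the dedoc classes.
AnnotationSketch = Tuple[int, int, str, str]
LineSketch = Tuple[str, List[AnnotationSketch]]


def join_lines(lines: List[LineSketch], delimiter: str = "\n") -> LineSketch:
    """Concatenate texts and shift each annotation by the length of the text before it."""
    parts: List[str] = []
    merged: List[AnnotationSketch] = []
    offset = 0
    for index, (text, annotations) in enumerate(lines):
        if index > 0:
            # The delimiter also takes up space, so it moves later annotations.
            parts.append(delimiter)
            offset += len(delimiter)
        for start, end, name, value in annotations:
            merged.append((start + offset, end + offset, name, value))
        parts.append(text)
        offset += len(text)
    return "".join(parts), merged


first = ("Title", [(0, 5, "bold", "True")])
second = ("body text", [(5, 9, "italic", "True")])
joined_text, joined_annotations = join_lines([first, second])
```

The italic annotation covered "text" in the second line, and after the shift by six symbols ("Title" plus the delimiter) it still covers the same word in the joined text.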
- property line: str
- property metadata: LineMetadata
- split(sep: str) List[LineWithMeta][source]
Split this line into a list of lines, keep annotations consistent. This method does not remove any text from the line.
- Parameters:
sep – separator for splitting
- Returns:
list of split lines
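A split that "does not remove any text" differs from str.split, which drops the separator. The sketch below keeps each separator at the end of its piece and clips annotations to the piece they fall into. Where exactly the separator ends up is an assumption, and split_line with its tuple-based types is hypothetical, not the dedoc implementation.

```python
from typing import List, Tuple

AnnotationSketch = Tuple[int, int, str, str]  # (start, end, name, value), hypothetical
LineSketch = Tuple[str, List[AnnotationSketch]]  # (text, annotations), hypothetical


def split_line(line: LineSketch, sep: str) -> List[LineSketch]:
    """Cut the text after every separator; the separator stays at the end of its piece."""
    if not sep:
        raise ValueError("empty separator")
    text, annotations = line
    borders = [0]
    position = text.find(sep)
    while position != -1:
        borders.append(position + len(sep))
        position = text.find(sep, position + len(sep))
    if borders[-1] != len(text):
        borders.append(len(text))

    pieces: List[LineSketch] = []
    for left, right in zip(borders[:-1], borders[1:]):
        clipped: List[AnnotationSketch] = []
        for start, end, name, value in annotations:
            # Keep only the overlap with [left, right) and re-base it to the piece.
            new_start, new_end = max(start, left), min(end, right)
            if new_start < new_end:
                clipped.append((new_start - left, new_end - left, name, value))
        pieces.append((text[left:right], clipped))
    return pieces


source = ("one\ntwo\nthree", [(2, 6, "bold", "True")])
pieces = split_line(source, "\n")
```

Concatenating the texts of the pieces gives back the original line, which is the property the description above refers to.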
- to_api_schema() LineWithMeta[source]
Convert class data into the corresponding API schema class.
- Returns:
API schema class
- property uid: str
- class dedoc.data_structures.LineMetadata(page_id: int, line_id: int | None, tag_hierarchy_level: HierarchyLevel | None = None, hierarchy_level: HierarchyLevel | None = None, other_fields: dict | None = None)[source]
Bases: Serializable
This class holds information about document node (and document line) metadata, such as page number or line level in a document hierarchy.
- __init__(page_id: int, line_id: int | None, tag_hierarchy_level: HierarchyLevel | None = None, hierarchy_level: HierarchyLevel | None = None, other_fields: dict | None = None) None[source]
- Parameters:
page_id – page number where paragraph starts, the numeration starts from page 0
line_id – line number inside the entire document, the numeration starts from line 0
tag_hierarchy_level – the hierarchy level of the line with its type directly extracted by some of the readers (usually information obtained from tags, e.g. in docx or html readers)
hierarchy_level – the hierarchy level of the line extracted by some of the structure extractors - the resulting type and level of the line. The lower the level of the hierarchy, the closer the line is to the root. This level is used to construct the document tree.
other_fields – additional fields of user metadata
- extend_other_fields(new_fields: dict) None[source]
Add new attributes to the class and to the other_fields dictionary.
- Parameters:
new_fields – fields to add
- to_api_schema() LineMetadata[source]
Convert class data into the corresponding API schema class.
- Returns:
API schema class
- class dedoc.data_structures.HierarchyLevel(level_1: int | None, level_2: int | None, can_be_multiline: bool, line_type: str)[source]
This class defines the level of the document line. The lower its value, the more important the line is.
- The level of the line consists of two parts:
level_1 defines primary importance (e.g. root - level_1=0, header - level_1=1, etc.);
level_2 defines the level inside lines of equal type (e.g. for list items - “1.” - level_2=1, “1.1.” - level_2=2, etc.).
For the least important lines like raw_text both levels are None.
- __eq__(other: HierarchyLevel) bool[source]
- Defines the equality of two hierarchy levels:
two raw text lines or lines with unknown type are equal;
two lines with equal level_1, level_2 are equal.
- __init__(level_1: int | None, level_2: int | None, can_be_multiline: bool, line_type: str) None[source]
- Parameters:
level_1 – value of a line’s primary importance
level_2 – level of the line inside specific class
can_be_multiline – is used to unify lines inside tree node, if line can be multiline, it can be joined with another line
line_type – type of the line, e.g. raw text, list item, header, etc.
- __lt__(other: HierarchyLevel) bool[source]
- Defines the comparison of hierarchy levels:
line1 < line2 if (level_1, level_2) of line1 < (level_1, level_2) of line2;
line1 < line2 if line2 is raw text or unknown, and line1 has another type.
Else line1 >= line2.
- Parameters:
other – hierarchy level of the line2
- static create_raw_text() HierarchyLevel[source]
Create hierarchy level for a raw textual line.
- static create_root() HierarchyLevel[source]
Create hierarchy level for the document root.
- static create_unknown() HierarchyLevel[source]
Create hierarchy level for a line with unknown type.
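The equality and ordering rules listed for HierarchyLevel can be sketched with a small stand-in class. LevelSketch is hypothetical and not the dedoc implementation; the type names "raw_text" and "unknown" are assumed spellings, and a strict tuple comparison is assumed so that two equal levels are never "less than" each other.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class LevelSketch:
    """Hypothetical stand-in that follows the ordering rules described above."""
    level_1: Optional[int]
    level_2: Optional[int]
    line_type: str

    def _defined(self) -> bool:
        return self.level_1 is not None and self.level_2 is not None

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, LevelSketch):
            return NotImplemented
        if self._defined() and other._defined():
            return (self.level_1, self.level_2) == (other.level_1, other.level_2)
        # Two raw text lines, or two lines of unknown type, are treated as equal.
        return self.line_type == other.line_type and self.line_type in ("raw_text", "unknown")

    def __lt__(self, other: "LevelSketch") -> bool:
        if self._defined() and other._defined():
            # Tuples compare lexicographically: level_1 first, then level_2.
            return (self.level_1, self.level_2) < (other.level_1, other.level_2)
        # A line with defined levels precedes raw text and unknown lines.
        return self._defined() and not other._defined()


header = LevelSketch(1, 1, "header")
list_item = LevelSketch(2, 1, "list_item")
nested_item = LevelSketch(2, 2, "list_item")
raw_text = LevelSketch(None, None, "raw_text")
```

With these definitions a header precedes any list item, "1." precedes "1.1.", and every typed line precedes raw text.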
- class dedoc.data_structures.Table(cells: List[List[CellWithMeta]], metadata: TableMetadata)[source]
Bases: Serializable
This class holds information about tables in the document. We assume that a table is rectangular (has the same number of columns in each row). Table representation is row-based, i.e. the outer list contains a list of rows.
- __init__(cells: List[List[CellWithMeta]], metadata: TableMetadata) None[source]
- Parameters:
cells – a list of lists of cells (cell has text, colspan and rowspan attributes)
metadata – table metadata such as location, size and so on
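The row-based, rectangular layout assumed for Table cells can be illustrated with plain nested lists. In this sketch each cell is reduced to its text, and is_rectangular and column are hypothetical helper names, not part of dedoc.

```python
from typing import List

# Hypothetical simplification: each cell is just its text.
Rows = List[List[str]]


def is_rectangular(cells: Rows) -> bool:
    """True when every row has the same number of columns."""
    return len({len(row) for row in cells}) <= 1


def column(cells: Rows, index: int) -> List[str]:
    """Row-based storage: the outer list holds rows, so a column is gathered across rows."""
    return [row[index] for row in cells]


cells = [
    ["Name", "Qty"],
    ["Bolt", "10"],
    ["Nut", "25"],
]
```

Indexing is therefore cells[row][column]; a column has to be collected by visiting every row.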
- class dedoc.data_structures.TableMetadata(page_id: int | None, uid: str | None = None, rotated_angle: float = 0.0)[source]
Bases: Serializable
This class holds the information about table unique identifier, rotation angle (if table has been rotated - for images) and so on.
- __init__(page_id: int | None, uid: str | None = None, rotated_angle: float = 0.0) None[source]
- Parameters:
page_id – number of the page where table starts
uid – unique identifier of the table
rotated_angle – value of the rotation angle by which the table was rotated during recognition
- to_api_schema() TableMetadata[source]
Convert class data into the corresponding API schema class.
- Returns:
API schema class
- class dedoc.data_structures.CellWithMeta(lines: List[LineWithMeta], colspan: int = 1, rowspan: int = 1, invisible: bool = False)[source]
Bases: Serializable
This class holds the information about the cell: list of lines and cell properties (rowspan, colspan, invisible).
- __init__(lines: List[LineWithMeta], colspan: int = 1, rowspan: int = 1, invisible: bool = False) None[source]
- Parameters:
lines – textual lines of the cell
colspan – number of columns to span like in HTML format
rowspan – number of rows to span like in HTML format
invisible – indicator for displaying or hiding cell text
- get_annotations() List[Annotation][source]
Get merged annotations of all cell lines (start/end of annotations moved according to the merged text)
- to_api_schema() CellWithMeta[source]
Convert class data into the corresponding API schema class.
- Returns:
API schema class
Helper classes
- class dedoc.data_structures.Serializable[source]
Base class for the API schema objects which we later need to convert to dict.
- class dedocutils.data_structures.BBox(x_top_left: int, y_top_left: int, width: int, height: int)[source]
Bounding box around some page object, the coordinate system starts from top left corner.
- x_top_left
- y_top_left
- x_bottom_right
- y_bottom_right
- width
- height
- __init__(x_top_left: int, y_top_left: int, width: int, height: int) None[source]
The following parameters are measured in pixels.
- Parameters:
x_top_left – x coordinate of the bbox top left corner
y_top_left – y coordinate of the bbox top left corner
width – bounding box width
height – bounding box height
- static from_two_points(top_left: Tuple[int, int], bottom_right: Tuple[int, int]) BBox[source]
Make the bounding box from two points.
- Parameters:
top_left – (x, y) point of the bbox top left corner
bottom_right – (x, y) point of the bbox bottom right corner
- have_intersection_with_box(box: BBox, threshold: float = 0.3) bool[source]
Check if the current bounding box has the intersection with another one.
- Parameters:
box – another bounding box to check intersection with
threshold – the lowest value of the intersection over union for which the result is True
- property square: int
Area (square) of the bbox.
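The relation between the two constructors, the derived corner coordinates and an intersection-over-union check can be sketched as follows. BoxSketch is a hypothetical stand-in and not the dedocutils class; in particular, the intersection_over_union helper and the strict "greater than threshold" rule are assumptions based only on the parameter description above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BoxSketch:
    """Hypothetical stand-in for a pixel bounding box; not the dedocutils class."""
    x_top_left: int
    y_top_left: int
    width: int
    height: int

    @property
    def x_bottom_right(self) -> int:
        return self.x_top_left + self.width

    @property
    def y_bottom_right(self) -> int:
        return self.y_top_left + self.height

    @property
    def square(self) -> int:
        return self.width * self.height

    @staticmethod
    def from_two_points(top_left: Tuple[int, int], bottom_right: Tuple[int, int]) -> "BoxSketch":
        x1, y1 = top_left
        x2, y2 = bottom_right
        return BoxSketch(x1, y1, x2 - x1, y2 - y1)

    def intersection_over_union(self, other: "BoxSketch") -> float:
        inter_w = min(self.x_bottom_right, other.x_bottom_right) - max(self.x_top_left, other.x_top_left)
        inter_h = min(self.y_bottom_right, other.y_bottom_right) - max(self.y_top_left, other.y_top_left)
        # Negative extents mean the boxes do not overlap along that axis.
        intersection = max(0, inter_w) * max(0, inter_h)
        union = self.square + other.square - intersection
        return intersection / union if union > 0 else 0.0

    def have_intersection_with_box(self, box: "BoxSketch", threshold: float = 0.3) -> bool:
        # Assumption: the boxes "intersect" when their IoU exceeds the threshold.
        return self.intersection_over_union(box) > threshold


a = BoxSketch(0, 0, 10, 10)
b = BoxSketch(5, 0, 10, 10)   # overlaps half of a: IoU = 50 / 150
c = BoxSketch(8, 0, 10, 10)   # overlaps a narrow strip of a: IoU = 20 / 180
```

With the default threshold of 0.3, a and b count as intersecting while a and c do not, even though they share some pixels.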
- class dedoc.data_structures.AttachedFile(original_name: str, tmp_file_path: str, need_content_analysis: bool, uid: str)[source]
Holds information about files, attached to the parsed document.
- __init__(original_name: str, tmp_file_path: str, need_content_analysis: bool, uid: str) None[source]
- Parameters:
original_name – Name of the file from which the attachments are extracted
tmp_file_path – path to the attachment file.
need_content_analysis – indicates whether the attachment’s content should be parsed or simply saved without parsing
uid – unique identifier of the attachment
Annotations of the text lines
- class dedoc.data_structures.Annotation(start: int, end: int, name: str, value: str, is_mergeable: bool = True)[source]
Bases: Serializable
Base class for text annotations of all kinds. Annotation is a piece of information about the text line: its appearance or links to another document object. Look at the concrete kinds of annotations to get more examples.
- __init__(start: int, end: int, name: str, value: str, is_mergeable: bool = True) None[source]
Some kind of text information about symbols between start and end. For example Annotation(1, 13, “italic”, “True”) says that text between 1st and 13th symbol was written in italic.
- Parameters:
start – start of the annotated text
end – end of the annotated text (end isn’t included)
name – annotation’s name
value – information about annotated text
is_mergeable – is it possible to merge annotations with the same value
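Since end is not included, start and end behave like the bounds of a Python slice. The sketch below shows this with a hypothetical AnnotationSketch class (not the dedoc Annotation); zero-based indexing is an assumption.

```python
from dataclasses import dataclass


@dataclass
class AnnotationSketch:
    """Hypothetical stand-in for an annotation over the range [start, end)."""
    start: int
    end: int
    name: str
    value: str

    def covered_text(self, line: str) -> str:
        # The end index is excluded, exactly as in slicing.
        return line[self.start:self.end]


line = "An italic phrase here"
italic = AnnotationSketch(3, 16, "italic", "True")
```

The number of annotated symbols is end - start, and an annotation that covers a whole line runs from 0 to len(line).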
Concrete annotations
- class dedoc.data_structures.AttachAnnotation(attach_uid: str, start: int, end: int)[source]
Bases: Annotation
This annotation indicates the place of the attachment in the original document (for example, the place where image was placed in the docx document). The line containing this annotation is placed directly before the referred attachment.
- name = 'attachment'
- class dedoc.data_structures.TableAnnotation(name: str, start: int, end: int)[source]
Bases: Annotation
This annotation indicates the place of the table in the original document. The line containing this annotation is placed directly before the referred table.
- name = 'table'
- class dedoc.data_structures.LinkedTextAnnotation(start: int, end: int, value: str)[source]
Bases: Annotation
This annotation is used when some text is linked to the line or its part. For example, a line can contain a number that refers to a footnote - the text of this footnote will be the value of this annotation.
- name = 'linked_text'
- class dedoc.data_structures.BBoxAnnotation(start: int, end: int, value: BBox, page_width: int, page_height: int)[source]
Bases: Annotation
Coordinates of the line’s bounding box (in relative coordinates) - for pdf documents.
- name = 'bounding box'
- __init__(start: int, end: int, value: BBox, page_width: int, page_height: int) None[source]
- Parameters:
start – start of the annotated text (usually zero)
end – end of the annotated text (usually end of the line)
value – bounding box where line is located
page_width – width of original image with this bbox
page_height – height of original image with this bbox
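Relative coordinates are pixel coordinates divided by the page size, which is why page_width and page_height are passed along with the bounding box. The following sketch shows the arithmetic only. The to_relative function, the dictionary keys and the JSON serialization of the value are assumptions for illustration, not the format dedoc is documented to produce here.

```python
import json


def to_relative(x_top_left: int, y_top_left: int, width: int, height: int,
                page_width: int, page_height: int) -> dict:
    """Scale pixel coordinates into the [0, 1] range (hypothetical helper)."""
    return {
        "x_top_left": x_top_left / page_width,
        "y_top_left": y_top_left / page_height,
        "width": width / page_width,
        "height": height / page_height,
    }


relative = to_relative(100, 50, 400, 20, page_width=1000, page_height=2000)
# Assumption: an annotation value has to be a string, so the dict is serialized.
value = json.dumps(relative)
```

Multiplying the relative values by the size of any rendering of the page gives the box position in that rendering.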
- class dedoc.data_structures.AlignmentAnnotation(start: int, end: int, value: str)[source]
Bases: Annotation
This annotation defines the alignment of the entire line in the document: left, right, to the both sides of the page or in the center.
- name = 'alignment'
- class dedoc.data_structures.IndentationAnnotation(start: int, end: int, value: str)[source]
Bases: Annotation
This annotation contains the indentation of the entire line in twentieths of a point (1/1440 of an inch). These units of measurement are taken from the standard Office Open XML File Formats.
- name = 'indentation'
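The unit arithmetic above follows from 72 points per inch: a twentieth of a point is 1/(20 · 72) = 1/1440 of an inch. A short conversion sketch, with hypothetical helper names and a made-up sample value:

```python
TWENTIETHS_PER_POINT = 20
POINTS_PER_INCH = 72


def twentieths_to_points(value: float) -> float:
    """Convert twentieths of a point into points."""
    return value / TWENTIETHS_PER_POINT


def twentieths_to_inches(value: float) -> float:
    """Convert twentieths of a point into inches (1/1440 of an inch each)."""
    return value / (TWENTIETHS_PER_POINT * POINTS_PER_INCH)


# Annotation values are strings, so a sample value is parsed first.
indentation_value = "720"
indentation_points = twentieths_to_points(float(indentation_value))
indentation_inches = twentieths_to_inches(float(indentation_value))
```

An indentation of 720 units is thus 36 points, or half an inch.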
- class dedoc.data_structures.SpacingAnnotation(start: int, end: int, value: str)[source]
Bases: Annotation
This annotation contains spacing between the current line and the previous one. It’s measured in twentieths of a point or hundredths of a line according to the standard Office Open XML File Formats.
- name = 'spacing'
- class dedoc.data_structures.BoldAnnotation(start: int, end: int, value: str)[source]
Bases: Annotation
Boldness of some text inside the line.
- name = 'bold'
- class dedoc.data_structures.ItalicAnnotation(start: int, end: int, value: str)[source]
Bases: Annotation
Text written in italic inside the line.
- name = 'italic'
- class dedoc.data_structures.UnderlinedAnnotation(start: int, end: int, value: str)[source]
Bases: Annotation
Underlined text inside the line.
- name = 'underlined'
- class dedoc.data_structures.StrikeAnnotation(start: int, end: int, value: str)[source]
Bases: Annotation
Strikethrough of some text inside the line.
- name = 'strike'
- class dedoc.data_structures.SubscriptAnnotation(start: int, end: int, value: str)[source]
Bases: Annotation
Subscript text inside the line.
- name = 'subscript'
- class dedoc.data_structures.SuperscriptAnnotation(start: int, end: int, value: str)[source]
Bases: Annotation
Superscript text inside the line.
- name = 'superscript'
- class dedoc.data_structures.ColorAnnotation(start: int, end: int, red: float, green: float, blue: float)[source]
Bases: Annotation
Color of some text inside the line in the RGB format.
- name = 'color_annotation'
- __init__(start: int, end: int, red: float, green: float, blue: float) None[source]
- Parameters:
start – start of the colored text
end – end of the colored text (not included)
red – mean value of the red color component in the pixels that are not white in the given bounding box
green – mean value of the green color component in the pixels that are not white in the given bounding box
blue – mean value of the blue color component in the pixels that are not white in the given bounding box
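The "mean value over pixels that are not white" described for the three color components can be sketched in plain Python. The mean_non_white_color function, the definition of white as all components at 255, and the fallback for a fully white region are assumptions for illustration, not the dedoc implementation.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each in 0..255


def mean_non_white_color(pixels: List[Pixel], white_threshold: int = 255) -> Tuple[float, float, float]:
    """Average each color component over the pixels that are not white."""
    colored = [p for p in pixels if not all(c >= white_threshold for c in p)]
    if not colored:
        # Assumption: a fully white region is reported as white.
        return (255.0, 255.0, 255.0)
    count = len(colored)
    return (
        sum(p[0] for p in colored) / count,
        sum(p[1] for p in colored) / count,
        sum(p[2] for p in colored) / count,
    )


# Pixels of a hypothetical bounding box: two white background pixels, two text pixels.
box_pixels = [(255, 255, 255), (0, 0, 0), (100, 50, 0), (255, 255, 255)]
red, green, blue = mean_non_white_color(box_pixels)
```

Excluding white pixels keeps the page background from washing out the color of the text itself.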
- class dedoc.data_structures.SizeAnnotation(start: int, end: int, value: str)[source]
Bases: Annotation
This annotation contains the font size of some part of the line in points (1/72 of an inch). These units of measurement are taken from the standard Office Open XML File Formats.
- name = 'size'
- class dedoc.data_structures.StyleAnnotation(start: int, end: int, value: str)[source]
Bases: Annotation
This annotation contains the information about style of the line in the document. For example, in docx documents lines can be highlighted using Heading styles.
- name = 'style'
- class dedoc.data_structures.ConfidenceAnnotation(start: int, end: int, value: str)[source]
Bases: Annotation
Confidence level of some text inside the line that was recognized with OCR.
- name = 'confidence'