dedoc.data_structures
Main classes defining a document
- class dedoc.data_structures.UnstructuredDocument(tables: List[Table], lines: List[LineWithMeta], attachments: List[AttachedFile], warnings: List[str] | None = None, metadata: dict | None = None)[source]
This class holds information about raw document content: its text, tables and attachments, that have been procured using one of the readers. Text is represented as a flat list of lines, hierarchy level of each line isn’t defined (only tag hierarchy level may exist).
- Variables:
lines (List[LineWithMeta]) – list of textual lines with metadata returned by a reader
tables (List[Table]) – list of document tables returned by a reader
attachments (List[AttachedFile]) – list of document attached files
metadata (dict) – information about the document (like in DocumentMetadata)
warnings (List[str]) – list of warnings obtained in the process of the document parsing
- class dedoc.data_structures.ParsedDocument(metadata: DocumentMetadata, content: DocumentContent, warnings: List[str] | None = None, attachments: List[ParsedDocument] | None = None)[source]
Bases:
Serializable
This class holds information about the document content, metadata and attachments.
- Variables:
content (DocumentContent) – document text (hierarchy of nodes) and tables
attachments (List[ParsedDocument]) – result of analysis of attached files (empty if with_attachments=False)
metadata (DocumentMetadata) – document metadata such as size, creation date and so on.
warnings (List[str]) – list of warnings and possible errors, arising in the process of document parsing
- to_api_schema() ParsedDocument [source]
Convert class data into the corresponding API schema class.
- Returns:
API schema class
- class dedoc.data_structures.DocumentContent(tables: List[Table], structure: TreeNode, warnings: List[str] | None = None)[source]
Bases:
Serializable
This class holds the document content - structured text and tables.
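- Variables:
tables (List[Table]) – list of document tables
structure (TreeNode) – structured document text as a tree of nodes
warnings (List[str]) – list of warnings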
- to_api_schema() DocumentContent [source]
Convert class data into the corresponding API schema class.
- Returns:
API schema class
- class dedoc.data_structures.DocumentMetadata(file_name: str, temporary_file_name: str, size: int, modified_time: int, created_time: int, access_time: int, file_type: str, uid: str | None = None, **kwargs: Dict[str, str | int | float])[source]
Bases:
Serializable
This class holds information about document metadata.
- Variables:
file_name (str) – original document name (before rename and conversion, so it can contain non-ascii symbols, spaces and so on)
temporary_file_name (str) – file name during parsing (unique name after rename and conversion)
size (int) – size of the original file in bytes
modified_time (int) – time of the last modification in unix time format (seconds since the epoch)
created_time (int) – time of the creation in unix time format
access_time (int) – time of the last access to the file in unix time format
file_type (str) – mime type of the file
uid (str) – document unique identifier (useful for attached files)
Additional variables may be added with other file metadata.
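The time fields above are plain integers, so they usually need converting before display. A minimal sketch, using a hypothetical dictionary of metadata values rather than a real DocumentMetadata object:

```python
from datetime import datetime, timezone

# Hypothetical metadata values, chosen only for illustration.
metadata = {"file_name": "report.docx", "size": 20480, "modified_time": 1700000000}

# Unix time is seconds since the epoch; convert it to an aware UTC datetime.
modified = datetime.fromtimestamp(metadata["modified_time"], tz=timezone.utc)

# The size is given in bytes; convert it to KiB for display.
size_kib = metadata["size"] / 1024

print(modified.isoformat(), size_kib)
```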
- to_api_schema() DocumentMetadata [source]
Convert class data into the corresponding API schema class.
- Returns:
API schema class
- class dedoc.data_structures.TreeNode(node_id: str, text: str, annotations: List[Annotation], metadata: LineMetadata, subparagraphs: List[TreeNode], parent: TreeNode | None)[source]
Bases:
Serializable
TreeNode helps to represent a document as a recursive tree structure. It has a parent node (None for the root of the tree) and a list of child nodes (an empty list for a leaf node).
- Variables:
node_id (str) – unique node identifier
text (str) – text of the node (may contain several lines)
annotations (List[Annotation]) – some metadata related to the part of the text (as font size)
metadata (LineMetadata) – metadata refers to entire node (as node type)
subparagraphs (List[TreeNode]) – list of children of this node
parent (TreeNode) – parent node (None for the root, not None for other nodes)
- add_child(line: LineWithMeta) TreeNode [source]
Create a new tree node, a child of the given node, from the given line. Return the newly created node.
- Parameters:
line – Line with meta, new node will be built from this line
- Returns:
the created node (a child of self)
- add_text(line: LineWithMeta) None [source]
Add the text and annotations from the given line; the text is separated with a new line symbol.
- Parameters:
line – line with text to add
- static create(lines: List[LineWithMeta] | None = None) TreeNode [source]
Creates a root node with given text.
- Parameters:
lines – these lines should be the title of the document (or should be empty for documents without a title)
- Returns:
root of the document tree
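The relationship between create, add_child and add_text can be illustrated with a small stand-in class. This is only a sketch: `Node` is hypothetical, takes plain strings instead of LineWithMeta, and its node id scheme is an assumption, not dedoc's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:  # hypothetical stand-in, not dedoc's TreeNode
    node_id: str
    text: str
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    subparagraphs: List["Node"] = field(default_factory=list)

    def add_child(self, text: str) -> "Node":
        # Child ids extend the parent's id; this numbering scheme is an assumption.
        child = Node(node_id=f"{self.node_id}.{len(self.subparagraphs)}", text=text, parent=self)
        self.subparagraphs.append(child)
        return child

    def add_text(self, text: str) -> None:
        # Append more text to the same node, separated by a newline.
        self.text = f"{self.text}\n{text}" if self.text else text


root = Node(node_id="0", text="Document title")   # root: parent is None
chapter = root.add_child("1. Introduction")
chapter.add_text("(heading continued)")            # same node, second line
leaf = chapter.add_child("Some raw text")          # leaf: no children
```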
- class dedoc.data_structures.LineWithMeta(line: str, metadata: LineMetadata | None = None, annotations: List[Annotation] | None = None, uid: str | None = None)[source]
Bases:
Sized, Serializable
Structural unit of document - line (or paragraph) of text and its metadata. One LineWithMeta should not contain text from different logical parts of the document (for example, document title and raw text of the document should not be in the same line). Still the logical part of the document may be represented by more than one line (for example, document title may consist of many lines).
- __getitem__(index: slice | int) LineWithMeta [source]
- __add__(other: LineWithMeta | str) LineWithMeta [source]
- __init__(line: str, metadata: LineMetadata | None = None, annotations: List[Annotation] | None = None, uid: str | None = None) None [source]
- Parameters:
line – raw text of the document line
metadata – metadata (related to the entire line, as line or page number, its hierarchy level)
annotations – metadata that refers to some part of the text, for example, font size, font type, etc.
uid – unique identifier of the line
- __lt__(other: LineWithMeta) bool [source]
Return self<value.
- property annotations: List[Annotation]
Metadata that refers to some part of the text, for example, font size, font type, etc.
- static join(lines: List[LineWithMeta], delimiter: str = '\n') LineWithMeta [source]
Join a list of lines with the given delimiter, keeping annotations consistent. This method is similar to the Python built-in join method for strings.
- Parameters:
lines – list of lines to join
delimiter – delimiter to insert between lines
- Returns:
merged line
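"Keeping annotations consistent" presumably means that annotation offsets still point at the same characters after the texts are concatenated. A minimal sketch of that idea, assuming annotations are simple (start, end, name, value) tuples; `Line` and `join_lines` are hypothetical names, not the library's code:

```python
from dataclasses import dataclass
from typing import List, Tuple

# (start, end, name, value) with end excluded; a simplified stand-in for Annotation.
Ann = Tuple[int, int, str, str]


@dataclass
class Line:  # hypothetical stand-in, not dedoc's LineWithMeta
    text: str
    annotations: List[Ann]


def join_lines(lines: List[Line], delimiter: str = "\n") -> Line:
    parts, annotations, offset = [], [], 0
    for i, line in enumerate(lines):
        if i > 0:
            offset += len(delimiter)
        # Shift every annotation by the length of the text placed before this line.
        annotations += [(s + offset, e + offset, n, v) for s, e, n, v in line.annotations]
        parts.append(line.text)
        offset += len(line.text)
    return Line(delimiter.join(parts), annotations)


merged = join_lines([Line("Title", [(0, 5, "bold", "True")]),
                     Line("body text", [(0, 4, "italic", "True")])])
start, end, _, _ = merged.annotations[1]
print(merged.text[start:end])  # the italic span still covers "body"
```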
- property line: str
Raw text of the document line
- property metadata: LineMetadata
Line metadata related to the entire line, as line or page number, hierarchy level
- set_metadata(metadata: LineMetadata) None [source]
- split(sep: str) List[LineWithMeta] [source]
Split this line into a list of lines, keep annotations consistent. This method does not remove any text from the line.
- Parameters:
sep – separator for splitting
- Returns:
list of split lines
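Since split does not remove any text, the separator has to stay inside one of the resulting pieces. The sketch below assumes it stays at the end of the piece that precedes it, and that annotations crossing a boundary are clipped; both are assumptions made for illustration, and `split_keep_sep` is a hypothetical helper:

```python
from typing import List, Tuple

Ann = Tuple[int, int, str, str]  # (start, end, name, value), end excluded


def split_keep_sep(text: str, annotations: List[Ann], sep: str) -> List[Tuple[str, List[Ann]]]:
    if not sep:
        raise ValueError("empty separator")
    pieces, lo = [], 0
    while lo < len(text):
        idx = text.find(sep, lo)
        hi = len(text) if idx == -1 else idx + len(sep)  # keep the separator in the piece
        # Clip each overlapping annotation to [lo, hi) and make it relative to the piece.
        clipped = [(max(s, lo) - lo, min(e, hi) - lo, n, v)
                   for s, e, n, v in annotations if s < hi and e > lo]
        pieces.append((text[lo:hi], clipped))
        lo = hi
    return pieces


text = "First\nSecond"
pieces = split_keep_sep(text, [(3, 9, "bold", "True")], "\n")
```

Concatenating the piece texts gives back the original string, which is the property the description above relies on.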
- to_api_schema() LineWithMeta [source]
Convert class data into the corresponding API schema class.
- Returns:
API schema class
- property uid: str
Unique identifier of the line
- class dedoc.data_structures.LineMetadata(page_id: int, line_id: int | None, tag_hierarchy_level: HierarchyLevel | None = None, hierarchy_level: HierarchyLevel | None = None, **kwargs: Dict[str, str | int | float])[source]
Bases:
Serializable
This class holds information about document node (and document line) metadata, such as page number or line level in a document hierarchy.
- Variables:
tag_hierarchy_level (HierarchyLevel) – the hierarchy level of the line with its type directly extracted by some of the readers (usually information got from tags e.g. in docx or html readers)
hierarchy_level (Optional[HierarchyLevel]) – the hierarchy level of the line extracted by some of the structure extractors - the result type and level of the line. The lower the hierarchy level, the closer the line is to the root; this level is used to construct the document tree.
page_id (int) – page number where paragraph starts, the numeration starts from page 0
line_id (Optional[int]) – line number inside the entire document, the numeration starts from line 0
Additional variables may be added with other line metadata.
- to_api_schema() LineMetadata [source]
Convert class data into the corresponding API schema class.
- Returns:
API schema class
- class dedoc.data_structures.HierarchyLevel(level_1: int | None, level_2: int | None, can_be_multiline: bool, line_type: str)[source]
This class defines the level of the document line. The lower is its value, the more important the line is.
- The level of the line consists of two parts:
level_1 defines primary importance (e.g. root - level_1=0, header - level_1=1, etc.);
level_2 defines the level inside lines of equal type (e.g. for list items - “1.” - level_2=1, “1.1.” - level_2=2, etc.).
For the least important lines (line_type=raw_text) both levels are None.
See the hierarchy level description for more details.
- Variables:
level_1 (Optional[int]) – value of a line’s primary importance
level_2 (Optional[int]) – level of the line inside specific class
can_be_multiline (bool) – used to unify lines inside a tree node: if a line can be multiline, it can be joined with another line
line_type (str) – type of the line, e.g. raw text, list item, header, etc.
- __eq__(other: HierarchyLevel) bool [source]
- Defines the equality of two hierarchy levels:
two lines with equal level_1, level_2 are equal.
if some of the levels is None, its value is considered as +inf (infinities have equal value)
- Parameters:
other – other hierarchy level
- Returns:
whether current hierarchy level == other hierarchy level
- __lt__(other: HierarchyLevel) bool [source]
- Defines the comparison of hierarchy levels:
current level < other level if (level_1, level_2) < other (level_1, level_2);
if some of the levels is None, its value is considered as +inf (infinities have equal value)
- Parameters:
other – other hierarchy level
- Returns:
whether current hierarchy level < other hierarchy level
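The comparison rules above amount to ordering (level_1, level_2) pairs with None replaced by +inf. A minimal sketch; `level_key` is a hypothetical helper, and the concrete level_2 values are illustrative only:

```python
import math
from typing import Optional, Tuple


def level_key(level_1: Optional[int], level_2: Optional[int]) -> Tuple[float, float]:
    # None is treated as +inf, so undefined levels compare greater than any defined one.
    return (math.inf if level_1 is None else level_1,
            math.inf if level_2 is None else level_2)


root = level_key(0, 0)            # level_1=0 for the root, as described above
header = level_key(1, 1)          # level_1=1 for a header
raw_text = level_key(None, None)  # least important: both levels are None

ordered = sorted([raw_text, header, root])
```

Tuples compare element by element, so level_1 decides first and level_2 only breaks ties, and two infinite values compare as equal.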
- static create_raw_text() HierarchyLevel [source]
Create hierarchy level for a raw textual line.
- static create_root() HierarchyLevel [source]
Create hierarchy level for the document root.
- static create_unknown() HierarchyLevel [source]
Create hierarchy level for a line with unknown type.
- class dedoc.data_structures.Table(cells: List[List[CellWithMeta]], metadata: TableMetadata)[source]
Bases:
Serializable
This class holds information about tables in the document. We assume that a table has a rectangular form (the same number of columns in each row). If some cells are merged, they are duplicated, and information about the merge is stored in rowspan and colspan. Table representation is row-based, i.e. the external list contains a list of rows.
- Variables:
metadata (TableMetadata) – table metadata as location, title and so on
cells (List[List[CellWithMeta]]) – a list of lists of table cells (cell has text lines, colspan and rowspan attributes)
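The row-based layout with duplicated merged cells can be pictured as follows. This is a sketch with a hypothetical `Cell` class; which copy of a merged cell carries the span and which is marked invisible is an assumption here:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:  # hypothetical stand-in for CellWithMeta
    text: str
    colspan: int = 1
    rowspan: int = 1
    invisible: bool = False


# A 2x3 table whose header "Full name" is merged across the first two columns.
# The merged cell is duplicated so every row keeps the same number of columns.
cells: List[List[Cell]] = [
    [Cell("Full name", colspan=2), Cell("Full name", invisible=True), Cell("Age")],
    [Cell("Ann"), Cell("Lee"), Cell("30")],
]

# Rectangular form: every row has the same length.
is_rectangular = len({len(row) for row in cells}) == 1

# For display, the hidden duplicates are skipped.
visible_text = [[c.text for c in row if not c.invisible] for row in cells]
```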
- class dedoc.data_structures.TableMetadata(page_id: int | None, uid: str | None = None, rotated_angle: float = 0.0, title: str = '')[source]
Bases:
Serializable
This class holds the information about table unique identifier, rotation angle (if table has been rotated - for images) and so on.
- Variables:
page_id (Optional[int]) – number of the page where table starts
uid (str) – unique identifier of the table (used for linking table to text)
rotated_angle (float) – value of the rotation angle by which the table was rotated during recognition
title (str) – table’s title
- to_api_schema() TableMetadata [source]
Convert class data into the corresponding API schema class.
- Returns:
API schema class
- class dedoc.data_structures.CellWithMeta(lines: List[LineWithMeta] | None, colspan: int = 1, rowspan: int = 1, invisible: bool = False)[source]
Bases:
Serializable
This class holds the information about the cell: list of lines and cell properties (rowspan, colspan, invisible).
- Variables:
lines (List[LineWithMeta]) – list of textual lines of the cell
colspan (int) – number of columns to span (for cells merged horizontally)
rowspan (int) – number of rows to span (for cells merged vertically)
invisible (bool) – indicator for displaying or hiding cell text - cells that are merged with others are hidden (for HTML display)
- get_annotations() List[Annotation] [source]
Get merged annotations of all cell lines (start/end of annotations moved according to the merged text)
- to_api_schema() CellWithMeta [source]
Convert class data into the corresponding API schema class.
- Returns:
API schema class
Helper classes
- class dedoc.data_structures.Serializable[source]
Base class for the API schema objects which we later need to convert to dict.
- class dedocutils.data_structures.BBox(x_top_left: int, y_top_left: int, width: int, height: int)[source]
Bounding box around some page object, the coordinate system starts from top left corner.
- x_top_left
- y_top_left
- x_bottom_right
- y_bottom_right
- width
- height
- __init__(x_top_left: int, y_top_left: int, width: int, height: int) None [source]
The following parameters should be given in pixels.
- Parameters:
x_top_left – x coordinate of the bbox top left corner
y_top_left – y coordinate of the bbox top left corner
width – bounding box width
height – bounding box height
- static from_two_points(top_left: Tuple[int, int], bottom_right: Tuple[int, int]) BBox [source]
Make the bounding box from two points.
- Parameters:
top_left – (x, y) point of the bbox top left corner
bottom_right – (x, y) point of the bbox bottom right corner
- have_intersection_with_box(box: BBox, threshold: float = 0.3) bool [source]
Check if the current bounding box has the intersection with another one.
- Parameters:
box – another bounding box to check intersection with
threshold – the lowest value of the intersection over union used to get the boolean result
- property square: int
Area (square) of the bbox.
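The geometry behind these members can be sketched with a stand-in class. `Box` below is hypothetical and is not the dedocutils implementation; in particular, whether the library compares against the threshold strictly is not stated above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Box:  # hypothetical stand-in, not the dedocutils BBox
    x_top_left: int
    y_top_left: int
    width: int
    height: int

    @property
    def x_bottom_right(self) -> int:
        return self.x_top_left + self.width

    @property
    def y_bottom_right(self) -> int:
        return self.y_top_left + self.height

    @property
    def area(self) -> int:
        return self.width * self.height

    @staticmethod
    def from_two_points(top_left: Tuple[int, int], bottom_right: Tuple[int, int]) -> "Box":
        (x1, y1), (x2, y2) = top_left, bottom_right
        return Box(x1, y1, x2 - x1, y2 - y1)

    def iou(self, other: "Box") -> float:
        # Overlap width and height; non-positive values mean the boxes do not intersect.
        w = min(self.x_bottom_right, other.x_bottom_right) - max(self.x_top_left, other.x_top_left)
        h = min(self.y_bottom_right, other.y_bottom_right) - max(self.y_top_left, other.y_top_left)
        if w <= 0 or h <= 0:
            return 0.0
        inter = w * h
        return inter / (self.area + other.area - inter)


a = Box(0, 0, 10, 10)
b = Box.from_two_points((5, 0), (15, 10))  # overlaps half of a
c = Box(9, 9, 10, 10)                      # touches only a corner of a

print(a.iou(b) > 0.3, a.iou(c) > 0.3)
```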
- class dedoc.data_structures.AttachedFile(original_name: str, tmp_file_path: str, need_content_analysis: bool, uid: str)[source]
Holds information about files, attached to the parsed document.
- Variables:
original_name (str) – original name of the attached file if it was possible to extract it
tmp_file_path (str) – path to the attached file on disk - its name is different from original_name
need_content_analysis (bool) – does the attached file need parsing (enable recursive parsing in DedocManager)
uid (str) – unique identifier of the attached file
Annotations of the text lines
- class dedoc.data_structures.Annotation(start: int, end: int, name: str, value: str, is_mergeable: bool = True)[source]
Bases:
Serializable
Base class for text annotations of all kinds. An annotation is a piece of information about the text line: its appearance or links to another document object. See the concrete kinds of annotations for more examples.
- Variables:
start (int) – start of the annotated text
end (int) – end of the annotated text (end isn’t included)
name (str) – annotation’s name, specific for each type of annotation
value (str) – information about annotated text, depends on the type of annotation, e.g. “True”/”False”, “10.0”, etc.
is_mergeable (bool) – is it possible to merge annotations with the same value
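Because end is excluded, an annotation covers exactly `text[start:end]`. The sketch below also shows one plausible reading of merging: touching or overlapping annotations with the same name and value are combined. `Ann` and `merge_adjacent` are hypothetical and do not reproduce the library's merging logic.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Ann:  # hypothetical stand-in for Annotation
    start: int
    end: int  # excluded
    name: str
    value: str


def merge_adjacent(annotations: List[Ann]) -> List[Ann]:
    merged: List[Ann] = []
    for ann in sorted(annotations, key=lambda a: (a.name, a.value, a.start)):
        last = merged[-1] if merged else None
        if last and (last.name, last.value) == (ann.name, ann.value) and ann.start <= last.end:
            # Same kind and value, and the spans touch or overlap: extend the previous one.
            last.end = max(last.end, ann.end)
        else:
            merged.append(Ann(ann.start, ann.end, ann.name, ann.value))
    return merged


text = "Bold text here"
anns = [Ann(0, 4, "bold", "True"), Ann(4, 9, "bold", "True"), Ann(10, 14, "italic", "True")]
result = merge_adjacent(anns)
covered = [text[a.start:a.end] for a in result]
```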
Concrete annotations
- class dedoc.data_structures.AttachAnnotation(attach_uid: str, start: int, end: int)[source]
Bases:
Annotation
This annotation indicates the place of the attachment in the original document (for example, the place where an image was placed in the docx document). The line containing this annotation is placed directly before the referred attachment.
- name: str = 'attachment'
- class dedoc.data_structures.TableAnnotation(value: str, start: int, end: int)[source]
Bases:
Annotation
This annotation indicates the place of the table in the original document. The line containing this annotation is placed directly before the referred table.
- name: str = 'table'
- class dedoc.data_structures.LinkedTextAnnotation(start: int, end: int, value: str)[source]
Bases:
Annotation
This annotation is used when some text is linked to the line or its part. For example, a line can contain a number that refers to a footnote - the text of this footnote will be the value of this annotation.
- name: str = 'linked_text'
- class dedoc.data_structures.ReferenceAnnotation(value: str, start: int, end: int)[source]
Bases:
Annotation
This annotation points to a place in the document text that is a link to another line in the document (for example, another textual line).
Example of usage for document_type="article": a line with a link to a bibliography_item LineWithMeta.
LineWithMeta:
    LineWithMeta(  # the line with the reference annotation
        line="As for the PRF, we use the tree-based construction from Goldreich, Goldwasser and Micali [18]",
        metadata=LineMetadata(page_id=0, line_id=32),
        annotations=[ReferenceAnnotation(start=90, end=92, value="97cfac39-f0e3-11ee-b81c-b88584b4e4a1"), ...]
    )
other LineWithMeta:
    LineWithMeta(  # the line referenced by the previous one
        line="some text (can be empty)",
        metadata=LineMetadata(
            page_id=10,
            line_id=189,
            # can_be_multiline is required by the HierarchyLevel signature; its value here is illustrative
            tag_hierarchy_level=HierarchyLevel(level_1=2, level_2=0, can_be_multiline=False, line_type="bibliography_item")
        ),
        uid="97cfac39-f0e3-11ee-b81c-b88584b4e4a1",
        annotations=[]
    )
- name: str = 'reference'
- class dedoc.data_structures.BBoxAnnotation(start: int, end: int, value: BBox, page_width: int, page_height: int)[source]
Bases:
Annotation
Coordinates of the line’s bounding box (in relative coordinates) - for pdf documents.
- name: str = 'bounding box'
- __init__(start: int, end: int, value: BBox, page_width: int, page_height: int) None [source]
- Parameters:
start – start of the annotated text (usually zero)
end – end of the annotated text (usually end of the line)
value – bounding box where line is located
page_width – width of original image with this bbox
page_height – height of original image with this bbox
- class dedoc.data_structures.AlignmentAnnotation(start: int, end: int, value: str)[source]
Bases:
Annotation
This annotation defines the alignment of the entire line in the document: left, right, both sides of the page, or center.
- name: str = 'alignment'
- class dedoc.data_structures.IndentationAnnotation(start: int, end: int, value: str)[source]
Bases:
Annotation
This annotation contains the indentation of the entire line in twentieths of a point (1/1440 of an inch). These units of measurement are taken from the standard Office Open XML File Formats.
- name: str = 'indentation'
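Since the annotation value is a string holding a number of twentieths of a point, converting it to points or inches is simple arithmetic. A minimal sketch with hypothetical helper names and an illustrative value:

```python
TWIPS_PER_POINT = 20   # one twentieth of a point
POINTS_PER_INCH = 72


def twips_to_points(value: str) -> float:
    # Annotation values are strings, so parse before converting.
    return float(value) / TWIPS_PER_POINT


def twips_to_inches(value: str) -> float:
    return float(value) / (TWIPS_PER_POINT * POINTS_PER_INCH)  # 1/1440 of an inch each


indentation_value = "720"  # illustrative value, not taken from a real document
print(twips_to_points(indentation_value), twips_to_inches(indentation_value))
```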
- class dedoc.data_structures.SpacingAnnotation(start: int, end: int, value: str)[source]
Bases:
Annotation
This annotation contains the spacing between the current line and the previous one. It’s measured in twentieths of a point or hundredths of a line according to the standard Office Open XML File Formats.
- name: str = 'spacing'
- class dedoc.data_structures.BoldAnnotation(start: int, end: int, value: str)[source]
Bases:
Annotation
Boldness of some text inside the line.
- name: str = 'bold'
- class dedoc.data_structures.ItalicAnnotation(start: int, end: int, value: str)[source]
Bases:
Annotation
Text written in italic inside the line.
- name: str = 'italic'
- class dedoc.data_structures.UnderlinedAnnotation(start: int, end: int, value: str)[source]
Bases:
Annotation
Underlined text inside the line.
- name: str = 'underlined'
- class dedoc.data_structures.StrikeAnnotation(start: int, end: int, value: str)[source]
Bases:
Annotation
Strikethrough of some text inside the line.
- name: str = 'strike'
- class dedoc.data_structures.SubscriptAnnotation(start: int, end: int, value: str)[source]
Bases:
Annotation
Subscript text inside the line.
- name: str = 'subscript'
- class dedoc.data_structures.SuperscriptAnnotation(start: int, end: int, value: str)[source]
Bases:
Annotation
Superscript text inside the line.
- name: str = 'superscript'
- class dedoc.data_structures.ColorAnnotation(start: int, end: int, red: float, green: float, blue: float)[source]
Bases:
Annotation
Color of some text inside the line in the RGB format.
- name: str = 'color_annotation'
- __init__(start: int, end: int, red: float, green: float, blue: float) None [source]
- Parameters:
start – start of the colored text
end – end of the colored text (not included)
red – mean value of the red color component in the pixels that are not white in the given bounding box
green – mean value of the green color component in the pixels that are not white in the given bounding box
blue – mean value of the blue color component in the pixels that are not white in the given bounding box
- class dedoc.data_structures.SizeAnnotation(start: int, end: int, value: str)[source]
Bases:
Annotation
This annotation contains the font size of some part of the line in points (1/72 of an inch). These units of measurement are taken from the standard Office Open XML File Formats.
- name: str = 'size'
- class dedoc.data_structures.StyleAnnotation(start: int, end: int, value: str)[source]
Bases:
Annotation
This annotation contains the information about style of the line in the document. For example, in docx documents lines can be highlighted using Heading styles.
- name: str = 'style'
- class dedoc.data_structures.ConfidenceAnnotation(start: int, end: int, value: str)[source]
Bases:
Annotation
Confidence level of some text inside the line that was recognized with OCR.
- name: str = 'confidence'