API schema
The output JSON format has a strict schema: a serialized ParsedDocument is returned.
The JSON schema of the output is also available while the dedoc application is running, at http://localhost:1231/docs.
- class dedoc.api.schema.ParsedDocument(*, content: DocumentContent, metadata: DocumentMetadata, version: str, warnings: List[str], attachments: List[ParsedDocument])
Holds information about the document content, metadata and attachments.
- Variables:
content (DocumentContent) – document text (hierarchy of nodes) and tables
attachments (List[ParsedDocument]) – result of analysis of attached files (empty if with_attachments=False)
metadata (DocumentMetadata) – document metadata such as size, creation date and so on.
warnings (List[str]) – list of warnings and possible errors that arise during document parsing
version (str) – version of the program that parsed this document
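As a rough illustration of how the fields above fit together, the sketch below treats a parsed response as plain Python dictionaries and gathers warnings from the document and its nested attachments. The sample data and the helper name collect_warnings are hypothetical and not part of dedoc; a real response carries complete content and metadata objects.

```python
from typing import Any, Dict, List, Tuple


def collect_warnings(doc: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (file_name, warning) pairs for a document and all nested attachments."""
    file_name = doc["metadata"]["file_name"]
    found = [(file_name, warning) for warning in doc["warnings"]]
    for attachment in doc["attachments"]:  # each attachment is itself a ParsedDocument
        found.extend(collect_warnings(attachment))
    return found


# Hypothetical, heavily trimmed response (assumption: only the fields used here are filled in).
attachment = {
    "content": {"structure": {}, "tables": []},
    "metadata": {"file_name": "scan.png"},
    "version": "0.0.0",
    "warnings": ["sample warning"],
    "attachments": [],
}
document = {
    "content": {"structure": {}, "tables": []},
    "metadata": {"file_name": "report.docx"},
    "version": "0.0.0",
    "warnings": [],
    "attachments": [attachment],
}

for name, warning in collect_warnings(document):
    print(f"{name}: {warning}")
```

Because attachments are ParsedDocument objects themselves, the same function handles any nesting depth.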
- class dedoc.api.schema.DocumentContent(*, structure: TreeNode, tables: List[Table])
Content of the document - structured text and tables.
- class dedoc.api.schema.DocumentMetadata(*, uid: str, file_name: str, temporary_file_name: str, size: int, modified_time: int, created_time: int, access_time: int, file_type: str, **extra_data: Any)
Document metadata like its name, size, author, etc.
- Variables:
file_name (str) – original document name (before rename and conversion, so it can contain non-ascii symbols, spaces and so on)
temporary_file_name (str) – file name during parsing (unique name after rename and conversion)
size (int) – size of the original file in bytes
modified_time (int) – time of the last modification in unix time format (seconds since the epoch)
created_time (int) – time of the creation in unix time format
access_time (int) – time of the last access to the file in unix time format
file_type (str) – mime type of the file
uid (str) – document unique identifier (useful for attached files)
Additional variables may be added with other file metadata.
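The three time fields are plain integers, so a consumer usually converts them before display. A minimal sketch, assuming the metadata is available as a dictionary; the values below and the helper name readable_times are made up for illustration:

```python
from datetime import datetime, timezone
from typing import Any, Dict

TIME_FIELDS = ("modified_time", "created_time", "access_time")


def readable_times(metadata: Dict[str, Any]) -> Dict[str, datetime]:
    """Convert the unix-time fields of a metadata dict into timezone-aware UTC datetimes."""
    return {name: datetime.fromtimestamp(metadata[name], tz=timezone.utc) for name in TIME_FIELDS}


# Hypothetical metadata values, invented for this example.
metadata = {
    "uid": "doc_uid_example",
    "file_name": "report.docx",
    "temporary_file_name": "tmp_example.docx",
    "size": 11534,
    "modified_time": 1700000000,
    "created_time": 1700000000,
    "access_time": 1700000000,
    "file_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
}

for name, moment in readable_times(metadata).items():
    print(name, moment.isoformat())
```

Passing tz explicitly avoids depending on the local time zone of the machine that reads the output.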
- class dedoc.api.schema.TreeNode(*, node_id: str, text: str, annotations: List[Annotation], metadata: LineMetadata, subparagraphs: List[TreeNode])
Represents the document as a recursive tree structure. Each node holds a list of child TreeNode nodes (an empty list for a leaf node).
- Variables:
node_id (str) – unique node identifier
text (str) – text of the node (may contain several lines)
annotations (List[Annotation]) – metadata related to a part of the text (such as font size)
metadata (LineMetadata) – metadata that refers to the entire node (such as node type)
subparagraphs (List[TreeNode]) – list of children of this node
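Since every node carries its children in subparagraphs, the whole document can be visited with a short recursive generator. The sketch below works on TreeNode-shaped dictionaries; the node identifiers, paragraph types and the helpers iter_nodes and make_node are assumptions made for the example, not values guaranteed by dedoc.

```python
from typing import Any, Dict, Iterator, Sequence, Tuple


def iter_nodes(node: Dict[str, Any], depth: int = 0) -> Iterator[Tuple[int, Dict[str, Any]]]:
    """Yield (depth, node) pairs in depth-first, document order."""
    yield depth, node
    for child in node["subparagraphs"]:
        yield from iter_nodes(child, depth + 1)


def make_node(node_id: str, text: str, paragraph_type: str,
              children: Sequence[Dict[str, Any]] = ()) -> Dict[str, Any]:
    """Build a hypothetical TreeNode-shaped dict (helper for this example only)."""
    return {
        "node_id": node_id,
        "text": text,
        "annotations": [],
        "metadata": {"paragraph_type": paragraph_type, "page_id": 0, "line_id": None},
        "subparagraphs": list(children),
    }


# Made-up tree: a root with two headers, the first of which has one paragraph.
root = make_node("0", "", "root", [
    make_node("0.0", "Introduction", "header", [
        make_node("0.0.0", "First paragraph.", "raw_text"),
    ]),
    make_node("0.1", "Conclusion", "header"),
])

# Print an indented outline of the document.
for depth, node in iter_nodes(root):
    print("  " * depth + f"[{node['metadata']['paragraph_type']}] {node['text']}")
```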
- class dedoc.api.schema.LineWithMeta(*, text: str, annotations: List[Annotation])
Textual line with text annotations.
- Variables:
text (str) – text of the line
annotations (List[Annotation]) – text annotations (font, size, bold, italic, etc.)
- class dedoc.api.schema.LineMetadata(*, paragraph_type: str, page_id: int, line_id: int | None, **extra_data: Any)
Holds information about document node/line metadata, such as page number or line type.
- Variables:
paragraph_type (str) – type of the document line/paragraph (header, list_item, list, etc.)
page_id (int) – number of the page where the paragraph starts; numbering starts from page 0
line_id (Optional[int]) – line number within the entire document; numbering starts from line 0
Additional variables may be added with other line metadata.
- class dedoc.api.schema.Table(*, cells: List[List[CellWithMeta]], metadata: TableMetadata)
Holds information about tables in the document. We assume that a table is rectangular (has the same number of columns in each row). The table representation is row-based, i.e. the outer list contains the rows.
- Variables:
metadata (TableMetadata) – table metadata such as location, title and so on
cells (List[List[CellWithMeta]]) – a list of lists of table cells (each cell has text lines, colspan and rowspan attributes)
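The row-based layout means the outer list can be iterated directly to obtain rows. A minimal sketch that flattens a Table-shaped dictionary into rows of plain strings; joining a cell's lines with a newline is an assumption of this example, and the sample table and helper names are hypothetical:

```python
from typing import Any, Dict, List


def cell_text(cell: Dict[str, Any]) -> str:
    """Join the text of all lines of a cell (newline separator is an assumption)."""
    return "\n".join(line["text"] for line in cell["lines"])


def table_to_rows(table: Dict[str, Any]) -> List[List[str]]:
    """Convert a Table-shaped dict into a list of rows of plain strings."""
    return [[cell_text(cell) for cell in row] for row in table["cells"]]


def make_cell(text: str) -> Dict[str, Any]:
    """Build a hypothetical CellWithMeta-shaped dict holding a single line."""
    return {"lines": [{"text": text, "annotations": []}],
            "rowspan": 1, "colspan": 1, "invisible": False}


# Made-up 2x2 table.
table = {
    "cells": [
        [make_cell("Name"), make_cell("Age")],
        [make_cell("Alice"), make_cell("30")],
    ],
    "metadata": {"page_id": 0, "uid": "table_uid_example", "rotated_angle": 0.0, "title": ""},
}

rows = table_to_rows(table)
# A rectangular table has exactly one distinct row length.
assert len({len(row) for row in rows}) == 1
for row in rows:
    print(" | ".join(row))
```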
- class dedoc.api.schema.TableMetadata(*, page_id: int | None, uid: str, rotated_angle: float, title: str)
Holds information about the table: its unique identifier, rotation angle (if the table has been rotated, for images) and so on.
- Variables:
page_id (Optional[int]) – number of the page where table starts
uid (str) – unique identifier of the table (used for linking table to text)
rotated_angle (float) – value of the rotation angle by which the table was rotated during recognition
title (str) – table’s title
- class dedoc.api.schema.CellWithMeta(*, lines: List[LineWithMeta], rowspan: int, colspan: int, invisible: bool)
Holds the information about the cell: list of lines and cell properties (rowspan, colspan, invisible).
- Variables:
lines (List[LineWithMeta]) – list of textual lines of the cell
colspan (int) – number of columns to span (for cells merged horizontally)
rowspan (int) – number of rows to span (for cells merged vertically)
invisible (bool) – indicator for displaying or hiding cell text - cells that are merged with others are hidden (for HTML display)
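One way to read these properties together is as instructions for HTML rendering: visible cells become td elements with their spans, and invisible cells are skipped because a merged neighbour already covers them. The sketch below follows that interpretation; it is not dedoc's own renderer, and the sample cells and helper names are hypothetical.

```python
from html import escape
from typing import Any, Dict, List


def render_table(cells: List[List[Dict[str, Any]]]) -> str:
    """Render CellWithMeta-shaped dicts as an HTML table, skipping invisible cells."""
    html_rows = []
    for row in cells:
        html_cells = []
        for cell in row:
            if cell["invisible"]:
                continue  # assumed to be covered by a merged (spanning) cell
            text = escape("\n".join(line["text"] for line in cell["lines"]))
            html_cells.append(
                f'<td colspan="{cell["colspan"]}" rowspan="{cell["rowspan"]}">{text}</td>'
            )
        html_rows.append("<tr>" + "".join(html_cells) + "</tr>")
    return "<table>" + "".join(html_rows) + "</table>"


def make_cell(text: str, colspan: int = 1, rowspan: int = 1, invisible: bool = False) -> Dict[str, Any]:
    """Build a hypothetical CellWithMeta-shaped dict holding a single line."""
    return {"lines": [{"text": text, "annotations": []}],
            "rowspan": rowspan, "colspan": colspan, "invisible": invisible}


# Made-up table: the first row is one header cell merged across two columns.
cells = [
    [make_cell("Header", colspan=2), make_cell("Header", invisible=True)],
    [make_cell("a"), make_cell("b & c")],
]

print(render_table(cells))
```

Each row still contains the same number of cell objects, which keeps the table rectangular even when cells are merged.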
- class dedoc.api.schema.Annotation(*, start: int, end: int, name: str, value: str)
A piece of information about a text line: its appearance or links to another document object. For example, Annotation(start=1, end=13, name="italic", value="True") says that the text from position 1 up to, but not including, position 13 is written in italic.
- Variables:
start (int) – start of the annotated text
end (int) – end of the annotated text (end isn’t included)
name (str) – annotation’s name, specific for each type of annotation
value (str) – information about the annotated text; depends on the type of annotation, e.g. “True”/“False”, “10.0”, etc.
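Because end is not included, start and end map directly onto Python slicing. A small sketch that extracts the text covered by each annotation of a line; the sample line, the annotation values and the helper name annotated_spans are made up for illustration:

```python
from typing import Any, Dict, List, Tuple


def annotated_spans(line: Dict[str, Any]) -> List[Tuple[str, str, str]]:
    """Return (name, value, covered_text) for every annotation of a line."""
    text = line["text"]
    return [(a["name"], a["value"], text[a["start"]:a["end"]]) for a in line["annotations"]]


# Hypothetical LineWithMeta-shaped dict.
line = {
    "text": "Hello, world",
    "annotations": [
        {"start": 0, "end": 5, "name": "bold", "value": "True"},
        {"start": 7, "end": 12, "name": "italic", "value": "True"},
        {"start": 0, "end": 12, "name": "size", "value": "10.0"},
    ],
}

for name, value, covered in annotated_spans(line):
    print(f"{name}={value}: {covered!r}")
```

Note that value is always a string, so numeric or boolean annotations need an explicit conversion such as float(value) or value == "True".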