API schema

The output JSON has a strict schema: the API returns a serialized ParsedDocument. While the dedoc application is running, the JSON schema of the output is also available at http://localhost:1231/docs.

class dedoc.api.schema.ParsedDocument(*, content: DocumentContent, metadata: DocumentMetadata, version: str, warnings: List[str], attachments: List[ParsedDocument])

Holds information about the document content, metadata and attachments.

content: DocumentContent
metadata: DocumentMetadata
version: str
warnings: List[str]
attachments: List[ParsedDocument]
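
Because attachments is itself a list of ParsedDocument, the returned structure is recursive. The following is a minimal sketch of walking it, assuming the response has already been decoded into plain Python dicts (for example with json.loads). The helper name iter_documents and the sample payload are hypothetical, and the payload is trimmed to the fields used here.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_documents(doc: Dict[str, Any], depth: int = 0) -> Iterator[Tuple[int, Dict[str, Any]]]:
    """Yield (depth, document) for a parsed document and all nested attachments."""
    yield depth, doc
    for attachment in doc.get("attachments", []):
        yield from iter_documents(attachment, depth + 1)


# Hypothetical, heavily trimmed response: "content" is omitted and
# "metadata" keeps only file_name. All values are made up.
response = {
    "version": "x.y.z",
    "warnings": ["first warning"],
    "metadata": {"file_name": "report.docx"},
    "attachments": [
        {
            "version": "x.y.z",
            "warnings": [],
            "metadata": {"file_name": "image.png"},
            "attachments": [],
        }
    ],
}

# Collect file names with their nesting depth, and all warnings in one list.
file_names = [(depth, doc["metadata"]["file_name"]) for depth, doc in iter_documents(response)]
all_warnings = [w for _, doc in iter_documents(response) for w in doc["warnings"]]
```

A generator keeps the traversal lazy, which may matter for documents with many nested attachments.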

class dedoc.api.schema.DocumentContent(*, structure: TreeNode, tables: List[Table])

Content of the document: structured text and tables.

structure: TreeNode
tables: List[Table]

class dedoc.api.schema.DocumentMetadata(*, uid: str, file_name: str, temporary_file_name: str, size: int, modified_time: int, created_time: int, access_time: int, file_type: str, other_fields: dict | None, **extra_data: Any)

Document metadata like its name, size, author, etc.

uid: str
file_name: str
temporary_file_name: str
size: int
modified_time: int
created_time: int
access_time: int
file_type: str
other_fields: dict | None
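
The time fields are plain integers, so a client usually has to convert them before display. The sketch below assumes they are Unix timestamps in seconds and that size is a byte count; neither is stated on this page, so treat both as assumptions. The helper name describe_metadata and all sample values are hypothetical.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def describe_metadata(metadata: Dict[str, Any]) -> str:
    """Build a one-line summary of a DocumentMetadata dict."""
    # Assumption: modified_time is a Unix timestamp in seconds.
    modified = datetime.fromtimestamp(metadata["modified_time"], tz=timezone.utc)
    # other_fields may be None, so fall back to an empty dict.
    extra = metadata.get("other_fields") or {}
    summary = (
        f"{metadata['file_name']} ({metadata['file_type']}, "
        f"{metadata['size']} bytes), modified {modified.isoformat()}"
    )
    if extra:
        summary += f", extra fields: {sorted(extra)}"
    return summary


# Hypothetical metadata; every value here is made up for illustration.
sample_metadata = {
    "uid": "doc_uid_example",
    "file_name": "report.pdf",
    "temporary_file_name": "tmp_report.pdf",
    "size": 1024,
    "modified_time": 1700000000,
    "created_time": 1700000000,
    "access_time": 1700000000,
    "file_type": "application/pdf",
    "other_fields": None,
}

summary = describe_metadata(sample_metadata)
```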

class dedoc.api.schema.TreeNode(*, node_id: str, text: str, annotations: List[Annotation], metadata: LineMetadata, subparagraphs: List[TreeNode])

Represents the document as a recursive tree structure. Each node holds a list of child TreeNode nodes (an empty list for a leaf node).

node_id: str
text: str
annotations: List[Annotation]
metadata: LineMetadata
subparagraphs: List[TreeNode]
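
Since every node carries its children in subparagraphs, the tree can be processed with a short depth-first recursion. The sketch below renders a tree as indented lines, assuming the structure has been decoded into plain dicts. The helper name render_tree, the node_id values, and the paragraph types in the sample are illustrative only.

```python
from typing import Any, Dict, List


def render_tree(node: Dict[str, Any], depth: int = 0) -> List[str]:
    """Return one indented line per node, visiting children depth-first."""
    paragraph_type = node["metadata"]["paragraph_type"]
    # rstrip() drops the trailing space left when a node has no text.
    label = f"[{paragraph_type}] {node['text'].strip()}".rstrip()
    lines = ["  " * depth + label]
    for child in node["subparagraphs"]:
        lines.extend(render_tree(child, depth + 1))
    return lines


# Hypothetical three-level tree; all values are made up for illustration.
root = {
    "node_id": "0",
    "text": "",
    "annotations": [],
    "metadata": {"paragraph_type": "root", "page_id": 0, "line_id": 0},
    "subparagraphs": [
        {
            "node_id": "0.0",
            "text": "Introduction\n",
            "annotations": [],
            "metadata": {"paragraph_type": "header", "page_id": 0, "line_id": 0},
            "subparagraphs": [
                {
                    "node_id": "0.0.0",
                    "text": "Some text.\n",
                    "annotations": [],
                    "metadata": {"paragraph_type": "raw_text", "page_id": 0, "line_id": 1},
                    "subparagraphs": [],  # leaf node: empty list of children
                }
            ],
        }
    ],
}

tree_lines = render_tree(root)
```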

class dedoc.api.schema.LineWithMeta(*, text: str, annotations: List[Annotation])

Textual line with text annotations.

text: str
annotations: List[Annotation]

class dedoc.api.schema.LineMetadata(*, paragraph_type: str, page_id: int, line_id: int | None, other_fields: dict | None, **extra_data: Any)

Holds information about document node/line metadata, such as page number or line type.

paragraph_type: str
page_id: int
line_id: int | None
other_fields: dict | None

class dedoc.api.schema.Table(*, cells: List[List[CellWithMeta]], metadata: TableMetadata)

Holds information about a table in the document. A table is assumed to be rectangular (each row has the same number of columns). The representation is row-based, i.e. the outer list contains the rows and each row is a list of cells.

cells: List[List[CellWithMeta]]
metadata: TableMetadata

class dedoc.api.schema.TableMetadata(*, page_id: int | None, uid: str, rotated_angle: float)

Holds information about the table: its unique identifier, its rotation angle (for images, if the table has been rotated) and so on.

page_id: int | None
uid: str
rotated_angle: float

class dedoc.api.schema.CellWithMeta(*, lines: List[LineWithMeta], rowspan: int, colspan: int, invisible: bool)

Holds information about a cell: its list of lines and its properties (rowspan, colspan, invisible).

lines: List[LineWithMeta]
rowspan: int
colspan: int
invisible: bool
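
Table, CellWithMeta and LineWithMeta nest three levels deep (rows, cells, lines), so a common first step is flattening a table into rows of strings. The sketch below does that on plain dicts and checks the rectangular-form assumption stated for Table. The helper names (make_cell, cell_text, table_to_rows, is_rectangular) and the sample values are hypothetical, and joining a cell's lines with a newline is an assumption about how they should be displayed.

```python
from typing import Any, Dict, List


def make_cell(*texts: str, rowspan: int = 1, colspan: int = 1, invisible: bool = False) -> Dict[str, Any]:
    """Build a CellWithMeta-shaped dict (sample-data helper for this sketch only)."""
    lines = [{"text": text, "annotations": []} for text in texts]
    return {"lines": lines, "rowspan": rowspan, "colspan": colspan, "invisible": invisible}


def cell_text(cell: Dict[str, Any]) -> str:
    """Join the text of all lines in a cell (assumption: newline-separated)."""
    return "\n".join(line["text"] for line in cell["lines"])


def table_to_rows(table: Dict[str, Any]) -> List[List[str]]:
    """Convert the row-based cells structure into rows of plain strings."""
    return [[cell_text(cell) for cell in row] for row in table["cells"]]


def is_rectangular(table: Dict[str, Any]) -> bool:
    """Check that every row has the same number of columns."""
    return len({len(row) for row in table["cells"]}) <= 1


# Hypothetical 2x2 table; all values are made up for illustration.
table = {
    "cells": [
        [make_cell("Name", colspan=2), make_cell("Name", invisible=True)],
        [make_cell("first"), make_cell("second", "line")],
    ],
    "metadata": {"page_id": 0, "uid": "table_uid_example", "rotated_angle": 0.0},
}

rows = table_to_rows(table)
rectangular = is_rectangular(table)
```

The sketch ignores rowspan, colspan and invisible when flattening; a renderer that needs merged cells would have to take them into account.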

class dedoc.api.schema.Annotation(*, start: int, end: int, name: str, value: str)

A piece of information about a text line: its appearance or links to another document object. For example, Annotation(start=1, end=13, name="italic", value="True") says that the text between the 1st and 13th symbols is written in italic.

start: int
end: int
name: str
value: str
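
An annotation only stores offsets, so the annotated text has to be recovered from the line it belongs to. The sketch below does this on a plain dict shaped like LineWithMeta. It assumes that start and end are character offsets forming a half-open range, so that text[start:end] is the annotated fragment; this page does not say whether end is inclusive, so verify that before relying on it. The helper name annotated_spans and the sample annotations are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def annotated_spans(line: Dict[str, Any], name: str) -> List[Tuple[str, str]]:
    """Return (fragment, value) for every annotation with the given name.

    Assumption: start/end form a half-open character range [start, end),
    so they map directly onto a Python slice.
    """
    text = line["text"]
    return [
        (text[annotation["start"]:annotation["end"]], annotation["value"])
        for annotation in line["annotations"]
        if annotation["name"] == name
    ]


# Hypothetical line; annotation names and values are made up for illustration.
line = {
    "text": "A short italic phrase.",
    "annotations": [
        {"start": 8, "end": 14, "name": "italic", "value": "True"},
        {"start": 0, "end": 22, "name": "size", "value": "12"},
    ],
}

italic = annotated_spans(line, "italic")
sized = annotated_spans(line, "size")
```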