API schema
The output JSON format has a strict schema: a serialized ParsedDocument is returned.
The JSON schema of the output is also available while the dedoc application is running, at http://localhost:1231/docs.
- class dedoc.api.schema.ParsedDocument(*, content: DocumentContent, metadata: DocumentMetadata, version: str, warnings: List[str], attachments: List[ParsedDocument])
Holds information about the document content, metadata and attachments.
- content: DocumentContent
- metadata: DocumentMetadata
- version: str
- warnings: List[str]
- attachments: List[ParsedDocument]
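As a rough illustration of how the recursive attachments field can be consumed, the sketch below walks a serialized ParsedDocument that has already been loaded into a Python dict. The helper names (make_doc, iter_documents) and all sample values are hypothetical; only the field names come from the schema above.

```python
def make_doc(file_name, attachments=()):
    """Build a minimal dict shaped like a serialized ParsedDocument (sample values only)."""
    return {
        "content": {"structure": None, "tables": []},  # content omitted for brevity
        "metadata": {"file_name": file_name},           # other metadata fields omitted
        "version": "sample",
        "warnings": [],
        "attachments": list(attachments),
    }


def iter_documents(doc, depth=0):
    """Yield (depth, document) for a document and, recursively, its attachments."""
    yield depth, doc
    for attachment in doc["attachments"]:
        yield from iter_documents(attachment, depth + 1)


parsed = make_doc("report.docx", [make_doc("figure.png"), make_doc("data.xlsx")])
names = [(depth, doc["metadata"]["file_name"]) for depth, doc in iter_documents(parsed)]
print(names)
```

Because attachments are themselves ParsedDocument objects, the same function handles any nesting depth.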
- class dedoc.api.schema.DocumentContent(*, structure: TreeNode, tables: List[Table])
Content of the document: structured text and tables.
- structure: TreeNode
- tables: List[Table]
- class dedoc.api.schema.DocumentMetadata(*, uid: str, file_name: str, temporary_file_name: str, size: int, modified_time: int, created_time: int, access_time: int, file_type: str, other_fields: dict | None, **extra_data: Any)
Document metadata like its name, size, author, etc.
- uid: str
- file_name: str
- temporary_file_name: str
- size: int
- modified_time: int
- created_time: int
- access_time: int
- file_type: str
- other_fields: dict | None
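The time fields are plain integers. The sketch below shows one way to make them readable, assuming they are Unix timestamps in seconds; the schema above states only the type, so that interpretation is an assumption. The helper to_utc and every sample value are hypothetical.

```python
from datetime import datetime, timezone


def to_utc(timestamp):
    """Convert an integer timestamp (assumed: Unix seconds) to an aware UTC datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


# Sample values only; a real response carries the actual file data.
metadata = {
    "uid": "doc_example",
    "file_name": "report.docx",
    "temporary_file_name": "tmp_report.docx",
    "size": 11534,
    "modified_time": 1700000000,
    "created_time": 1700000000,
    "access_time": 1700000000,
    "file_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "other_fields": None,
}

time_keys = ("created_time", "modified_time", "access_time")
times = {key: to_utc(metadata[key]) for key in time_keys}
print(f'{metadata["file_name"]}: modified {times["modified_time"].isoformat()}')
```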
- class dedoc.api.schema.TreeNode(*, node_id: str, text: str, annotations: List[Annotation], metadata: LineMetadata, subparagraphs: List[TreeNode])
Represents the document as a recursive tree structure. Each node holds a list of child TreeNode objects (an empty list for a leaf node).
- node_id: str
- text: str
- annotations: List[Annotation]
- metadata: LineMetadata
- subparagraphs: List[TreeNode]
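The recursive structure lends itself to a depth-first traversal. The following is a minimal sketch over a serialized TreeNode loaded as a dict; the helpers node and walk are hypothetical, and the node_id and paragraph_type values are illustrative rather than values the service is guaranteed to produce.

```python
def node(node_id, text, children=()):
    """Build a dict shaped like a serialized TreeNode (sample values only)."""
    return {
        "node_id": node_id,
        "text": text,
        "annotations": [],
        "metadata": {"paragraph_type": "raw_text", "page_id": 0, "line_id": None, "other_fields": None},
        "subparagraphs": list(children),
    }


def walk(tree_node, depth=0):
    """Depth-first, pre-order traversal yielding (depth, node) pairs."""
    yield depth, tree_node
    for child in tree_node["subparagraphs"]:
        yield from walk(child, depth + 1)


root = node("0", "", [
    node("0.0", "Chapter 1", [node("0.0.0", "First paragraph.")]),
    node("0.1", "Chapter 2"),
])

visited = [(depth, n["node_id"]) for depth, n in walk(root)]
for depth, n in walk(root):
    print("  " * depth + repr(n["text"]))
```

A leaf node has an empty subparagraphs list, so the recursion stops without a special case.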
- class dedoc.api.schema.LineWithMeta(*, text: str, annotations: List[Annotation])
Textual line with text annotations.
- text: str
- annotations: List[Annotation]
- class dedoc.api.schema.LineMetadata(*, paragraph_type: str, page_id: int, line_id: int | None, other_fields: dict | None, **extra_data: Any)
Holds information about document node/line metadata, such as page number or line type.
- paragraph_type: str
- page_id: int
- line_id: int | None
- other_fields: dict | None
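Since every tree node carries a LineMetadata, the metadata can be aggregated over the whole tree. The sketch below counts nodes per paragraph_type; the helper names and the paragraph_type values ("root", "header", "raw_text", "list_item") are illustrative assumptions, not an exhaustive list of what the service returns.

```python
from collections import Counter


def node(paragraph_type, children=()):
    """Build a dict with only the TreeNode fields this example needs (sample values)."""
    return {
        "metadata": {"paragraph_type": paragraph_type, "page_id": 0, "line_id": None, "other_fields": None},
        "subparagraphs": list(children),
    }


def count_paragraph_types(tree_node, counts=None):
    """Recursively count how many nodes have each paragraph_type."""
    counts = Counter() if counts is None else counts
    counts[tree_node["metadata"]["paragraph_type"]] += 1
    for child in tree_node["subparagraphs"]:
        count_paragraph_types(child, counts)
    return counts


tree = node("root", [
    node("header", [node("raw_text"), node("raw_text")]),
    node("list_item"),
])
counts = count_paragraph_types(tree)
print(counts.most_common())
```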
- class dedoc.api.schema.Table(*, cells: List[List[CellWithMeta]], metadata: TableMetadata)
Holds information about tables in the document. A table is assumed to be rectangular (every row has the same number of columns). The representation is row-based, i.e. the outer list is a list of rows, and each row is a list of cells.
- cells: List[List[CellWithMeta]]
- metadata: TableMetadata
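To make the row-based layout concrete, the sketch below flattens a serialized Table (loaded as a dict) into a plain list of rows of strings. The helpers cell, cell_text and table_to_rows are hypothetical, joining a cell's lines with a newline is a choice made for this example, and the sample values are invented.

```python
def cell(text):
    """Build a dict shaped like a serialized CellWithMeta holding one line (sample values)."""
    return {"lines": [{"text": text, "annotations": []}], "rowspan": 1, "colspan": 1, "invisible": False}


def cell_text(cell_dict):
    """Join the text of all lines in a cell."""
    return "\n".join(line["text"] for line in cell_dict["lines"])


def table_to_rows(table):
    """Outer list = rows, inner list = cells of that row."""
    return [[cell_text(c) for c in row] for row in table["cells"]]


table = {
    "cells": [
        [cell("Name"), cell("Qty")],
        [cell("Apples"), cell("3")],
    ],
    "metadata": {"page_id": 0, "uid": "table_example", "rotated_angle": 0.0},
}

rows = table_to_rows(table)
# The rectangular assumption means every row has the same length.
is_rectangular = len({len(row) for row in rows}) <= 1
print(rows, is_rectangular)
```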
- class dedoc.api.schema.TableMetadata(*, page_id: int | None, uid: str, rotated_angle: float)
Holds information about the table: its unique identifier, rotation angle (if the table was rotated, which applies to images), and so on.
- page_id: int | None
- uid: str
- rotated_angle: float
- class dedoc.api.schema.CellWithMeta(*, lines: List[LineWithMeta], rowspan: int, colspan: int, invisible: bool)
Holds the information about the cell: list of lines and cell properties (rowspan, colspan, invisible).
- lines: List[LineWithMeta]
- rowspan: int
- colspan: int
- invisible: bool
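The cell properties map naturally onto HTML table attributes. The sketch below renders a serialized Table as HTML, assuming that a cell with invisible set to true should simply be skipped (for example, because a neighbouring cell spans over it); that reading of the flag is an assumption, and table_to_html and the sample data are hypothetical.

```python
from html import escape


def table_to_html(table):
    """Render rows of CellWithMeta dicts as an HTML table, skipping invisible cells."""
    rows_html = []
    for row in table["cells"]:
        cells_html = []
        for cell in row:
            if cell["invisible"]:
                continue  # assumed: invisible cells are not displayed
            text = "<br>".join(escape(line["text"]) for line in cell["lines"])
            cells_html.append(
                f'<td rowspan="{cell["rowspan"]}" colspan="{cell["colspan"]}">{text}</td>'
            )
        rows_html.append("<tr>" + "".join(cells_html) + "</tr>")
    return "<table>" + "".join(rows_html) + "</table>"


# One row: a visible cell spanning two columns, followed by the invisible cell it covers.
table = {
    "cells": [[
        {"lines": [{"text": "A & B", "annotations": []}], "rowspan": 1, "colspan": 2, "invisible": False},
        {"lines": [{"text": "A & B", "annotations": []}], "rowspan": 1, "colspan": 1, "invisible": True},
    ]],
    "metadata": {"page_id": 0, "uid": "table_example", "rotated_angle": 0.0},
}

html_table = table_to_html(table)
print(html_table)
```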
- class dedoc.api.schema.Annotation(*, start: int, end: int, name: str, value: str)
A piece of information about a text line: its appearance or links to another document object. For example, Annotation(start=1, end=13, name="italic", value="True") says that the text between the 1st and 13th symbols is written in italic.
- start: int
- end: int
- name: str
- value: str
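Annotations can be mapped back onto the text they describe. The sketch below extracts the annotated substrings from a serialized line, assuming that start and end are zero-based character offsets with end exclusive (Python slice semantics); the schema above does not state this, so treat it as an assumption. The helper spans_with and the sample line are hypothetical.

```python
def spans_with(line, name, value="True"):
    """Return the substrings of line["text"] covered by annotations with the given name and value."""
    return [
        line["text"][annotation["start"]:annotation["end"]]  # assumed: end is exclusive
        for annotation in line["annotations"]
        if annotation["name"] == name and annotation["value"] == value
    ]


# Sample line: works for a TreeNode or a LineWithMeta, since both carry text and annotations.
line = {
    "text": "An italic word",
    "annotations": [
        {"start": 3, "end": 9, "name": "italic", "value": "True"},
        {"start": 10, "end": 14, "name": "bold", "value": "True"},
    ],
}

italic_parts = spans_with(line, "italic")
bold_parts = spans_with(line, "bold")
print(italic_parts, bold_parts)
```

Note that value is a string, so boolean-like annotations are compared against "True" rather than the Python constant True.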