API schema

The output JSON has a strict schema: a serialized ParsedDocument is returned. The JSON schema of the output is also available while the dedoc application is running, at http://localhost:1231/docs.

class dedoc.api.schema.ParsedDocument(*, content: DocumentContent, metadata: DocumentMetadata, version: str, warnings: List[str], attachments: List[ParsedDocument])

Holds information about the document content, metadata and attachments.

Variables:
  • content (DocumentContent) – document text (hierarchy of nodes) and tables

  • attachments (List[ParsedDocument]) – result of analysis of attached files (empty if with_attachments=False)

  • metadata (DocumentMetadata) – document metadata such as size, creation date and so on.

  • warnings (List[str]) – list of warnings and possible errors that arose during document parsing

  • version (str) – version of the program that parsed this document
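
As a rough illustration of how the nested attachments field can be consumed, the sketch below walks a response that has already been decoded into a plain Python dict (for example with json.loads) and gathers the warnings of the document and of all its attachments. The helper names iter_documents and collect_warnings and all sample values are hypothetical; only the field names come from the schema above.

```python
from typing import Any, Dict, Iterator, List


def iter_documents(doc: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield the document itself, then every nested attachment (depth first)."""
    yield doc
    for attachment in doc["attachments"]:
        yield from iter_documents(attachment)


def collect_warnings(doc: Dict[str, Any]) -> List[str]:
    """Gather the warnings of a document and of all its attachments."""
    return [warning for d in iter_documents(doc) for warning in d["warnings"]]


# Hand-written sample shaped like a ParsedDocument; content and metadata are
# left empty here, and the version string is made up for the example.
sample_document = {
    "content": {},
    "metadata": {},
    "version": "0.0.0",
    "warnings": ["outer warning"],
    "attachments": [
        {
            "content": {},
            "metadata": {},
            "version": "0.0.0",
            "warnings": ["inner warning"],
            "attachments": [],
        }
    ],
}

all_warnings = collect_warnings(sample_document)
print(all_warnings)
```

Because attachments are themselves ParsedDocument objects, the same recursion works at any nesting depth.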

class dedoc.api.schema.DocumentContent(*, structure: TreeNode, tables: List[Table])

Content of the document - structured text and tables.

Variables:
  • tables (List[Table]) – list of document tables

  • structure (TreeNode) – tree structure of the document nodes with text and additional metadata

class dedoc.api.schema.DocumentMetadata(*, uid: str, file_name: str, temporary_file_name: str, size: int, modified_time: int, created_time: int, access_time: int, file_type: str, **extra_data: Any)

Document metadata like its name, size, author, etc.

Variables:
  • file_name (str) – original document name (before rename and conversion, so it can contain non-ascii symbols, spaces and so on)

  • temporary_file_name (str) – file name during parsing (unique name after rename and conversion)

  • size (int) – size of the original file in bytes

  • modified_time (int) – time of the last modification in unix time format (seconds since the epoch)

  • created_time (int) – time of creation in unix time format (seconds since the epoch)

  • access_time (int) – time of the last access to the file in unix time format (seconds since the epoch)

  • file_type (str) – mime type of the file

  • uid (str) – document unique identifier (useful for attached files)

Additional variables may be added with other file metadata.
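
The time fields are plain integers, so they usually need converting before display. The following sketch, which assumes the metadata has been decoded into a dict, turns modified_time into a timezone-aware datetime and separates the documented fields from any extra ones. The function name summarize_metadata, the sample values, and the extra "author" key are hypothetical.

```python
from datetime import datetime, timezone
from typing import Any, Dict

# Field names listed in the DocumentMetadata signature above.
KNOWN_FIELDS = {
    "uid", "file_name", "temporary_file_name", "size",
    "modified_time", "created_time", "access_time", "file_type",
}


def summarize_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Convert the raw metadata into a few display-friendly values."""
    return {
        "file_name": metadata["file_name"],
        "size_kib": metadata["size"] / 1024,
        # Seconds since the epoch -> aware datetime in UTC.
        "modified": datetime.fromtimestamp(metadata["modified_time"], tz=timezone.utc),
        # Anything outside the documented fields arrived through **extra_data.
        "extra": {k: v for k, v in metadata.items() if k not in KNOWN_FIELDS},
    }


# Sample values are invented for the example.
sample_metadata = {
    "uid": "doc_uid_example",
    "file_name": "my report.docx",
    "temporary_file_name": "1700000000_123.docx",
    "size": 2048,
    "modified_time": 1700000000,
    "created_time": 1700000000,
    "access_time": 1700000000,
    "file_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "author": "unknown",  # hypothetical extra field
}

summary = summarize_metadata(sample_metadata)
print(summary["modified"].isoformat())  # 2023-11-14T22:13:20+00:00
```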

class dedoc.api.schema.TreeNode(*, node_id: str, text: str, annotations: List[Annotation], metadata: LineMetadata, subparagraphs: List[TreeNode])

Represents the document as a recursive tree structure. Each node has a list of child TreeNode nodes (an empty list for a leaf node).

Variables:
  • node_id (str) – unique node identifier

  • text (str) – text of the node (may contain several lines)

  • annotations (List[Annotation]) – metadata related to a part of the text (such as font size)

  • metadata (LineMetadata) – metadata that refers to the entire node (such as node type)

  • subparagraphs (List[TreeNode]) – list of child nodes of this node
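
Since every node carries its children in subparagraphs, the tree can be traversed with a short recursive generator. The sketch below assumes a TreeNode decoded into a dict and builds an indented outline; iter_nodes and outline are hypothetical helper names, and the node_id and paragraph_type values in the sample are invented.

```python
from typing import Any, Dict, Iterator, List, Tuple


def iter_nodes(node: Dict[str, Any], depth: int = 0) -> Iterator[Tuple[int, Dict[str, Any]]]:
    """Yield (depth, node) pairs in document order (pre-order traversal)."""
    yield depth, node
    for child in node["subparagraphs"]:
        yield from iter_nodes(child, depth + 1)


def outline(root: Dict[str, Any]) -> List[str]:
    """Render each node as an indented line of text."""
    return ["  " * depth + node["text"] for depth, node in iter_nodes(root)]


def make_node(node_id: str, text: str, children: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build a sample node; metadata values here are placeholders."""
    return {
        "node_id": node_id,
        "text": text,
        "annotations": [],
        "metadata": {"paragraph_type": "header", "page_id": 0, "line_id": None},
        "subparagraphs": children,
    }


sample_tree = make_node("a", "Title", [
    make_node("b", "Chapter 1", [make_node("c", "Section 1.1", [])]),
    make_node("d", "Chapter 2", []),
])

for line in outline(sample_tree):
    print(line)
```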

class dedoc.api.schema.LineWithMeta(*, text: str, annotations: List[Annotation])

Textual line with text annotations.

Variables:
  • text (str) – text of the line

  • annotations (List[Annotation]) – text annotations (font, size, bold, italic, etc.)

class dedoc.api.schema.LineMetadata(*, paragraph_type: str, page_id: int, line_id: int | None, **extra_data: Any)

Holds information about document node/line metadata, such as page number or line type.

Variables:
  • paragraph_type (str) – type of the document line/paragraph (header, list_item, list, etc.)

  • page_id (int) – number of the page where the paragraph starts; numbering starts from 0

  • line_id (Optional[int]) – line number within the entire document; numbering starts from 0

Additional variables may be added with other line metadata.
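
Two details of this class are easy to trip over: page_id counts from 0, and line_id may be absent. A minimal sketch of handling both, assuming the metadata has been decoded into a dict (describe_line is a hypothetical helper, not part of dedoc):

```python
from typing import Any, Dict


def describe_line(metadata: Dict[str, Any]) -> str:
    """Build a human-readable location string from LineMetadata fields."""
    # page_id is zero-based, so add 1 for a page number a reader expects.
    page_number = metadata["page_id"] + 1
    line_id = metadata.get("line_id")
    # line_id is Optional[int]: compare with None, because 0 is a valid value.
    line_part = "unknown line" if line_id is None else f"line {line_id}"
    return f"{metadata['paragraph_type']} on page {page_number}, {line_part}"


print(describe_line({"paragraph_type": "header", "page_id": 0, "line_id": 0}))
print(describe_line({"paragraph_type": "list_item", "page_id": 2, "line_id": None}))
```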

class dedoc.api.schema.Table(*, cells: List[List[CellWithMeta]], metadata: TableMetadata)

Holds information about tables in the document. We assume that a table is rectangular (it has the same number of columns in each row). The table representation is row-based, i.e. the outer list contains the list of rows.

Variables:
  • metadata (TableMetadata) – table metadata such as location, title and so on

  • cells (List[List[CellWithMeta]]) – a list of lists of table cells (each cell has text lines, colspan and rowspan attributes)
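
The row-based layout means cells[i][j] is the cell in row i and column j. The sketch below, assuming a Table decoded into a dict, flattens each cell to a string and checks the rectangular assumption stated above. The name table_to_rows and the choice to join a cell's lines with a newline are assumptions made for this example.

```python
from typing import Any, Dict, List


def table_to_rows(table: Dict[str, Any]) -> List[List[str]]:
    """Convert a table into a list of rows, each a list of cell strings."""
    rows = [
        ["\n".join(line["text"] for line in cell["lines"]) for cell in row]
        for row in table["cells"]
    ]
    # The schema assumes every row has the same number of columns.
    if len({len(row) for row in rows}) > 1:
        raise ValueError("table is not rectangular")
    return rows


def make_cell(text: str) -> Dict[str, Any]:
    """Build a sample single-line cell with default span values."""
    return {
        "lines": [{"text": text, "annotations": []}],
        "rowspan": 1,
        "colspan": 1,
        "invisible": False,
    }


# Sample table; the uid and title are invented.
sample_table = {
    "cells": [
        [make_cell("Name"), make_cell("Size")],
        [make_cell("a.txt"), make_cell("12")],
    ],
    "metadata": {"page_id": 0, "uid": "table_uid_example", "rotated_angle": 0.0, "title": ""},
}

rows = table_to_rows(sample_table)
print(rows)
```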

class dedoc.api.schema.TableMetadata(*, page_id: int | None, uid: str, rotated_angle: float, title: str)

Holds the information about table unique identifier, rotation angle (if table has been rotated - for images) and so on.

Variables:
  • page_id (Optional[int]) – number of the page where table starts

  • uid (str) – unique identifier of the table (used for linking table to text)

  • rotated_angle (float) – value of the rotation angle by which the table was rotated during recognition

  • title (str) – table’s title
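
Because uid is described as the key for linking a table to the text, a plausible first step is a lookup of tables by uid. How the text refers to that uid is not specified in this section, so the sketch stops at the lookup. It assumes the tables have been decoded into dicts; index_tables_by_uid and the sample uid values are hypothetical.

```python
from typing import Any, Dict, List


def index_tables_by_uid(tables: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Map each table's unique identifier to the table itself."""
    return {table["metadata"]["uid"]: table for table in tables}


# Sample tables with invented identifiers; cells are omitted for brevity.
sample_tables = [
    {"cells": [], "metadata": {"page_id": 0, "uid": "uid-1", "rotated_angle": 0.0, "title": "Prices"}},
    {"cells": [], "metadata": {"page_id": None, "uid": "uid-2", "rotated_angle": 1.5, "title": ""}},
]

tables_by_uid = index_tables_by_uid(sample_tables)
print(tables_by_uid["uid-1"]["metadata"]["title"])  # Prices
```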

class dedoc.api.schema.CellWithMeta(*, lines: List[LineWithMeta], rowspan: int, colspan: int, invisible: bool)

Holds the information about the cell: list of lines and cell properties (rowspan, colspan, invisible).

Variables:
  • lines (List[LineWithMeta]) – list of textual lines of the cell

  • colspan (int) – number of columns to span (for cells merged horizontally)

  • rowspan (int) – number of rows to span (for cells merged vertically)

  • invisible (bool) – indicator for displaying or hiding cell text - cells that are merged with others are hidden (for HTML display)
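
To show how these properties could drive an HTML view, here is a minimal sketch that renders one row of cells, hiding those marked invisible. It assumes cells decoded into dicts; cell_to_html and row_to_html are hypothetical helpers, and using an inline "display: none" style is one possible way to hide a cell, not necessarily what dedoc itself does.

```python
from html import escape
from typing import Any, Dict, List


def cell_to_html(cell: Dict[str, Any]) -> str:
    """Render a single cell as a <td> element."""
    # Escape the text so that characters like < and & are displayed literally.
    text = "<br>".join(escape(line["text"]) for line in cell["lines"])
    hidden = ' style="display: none"' if cell["invisible"] else ""
    return f'<td colspan="{cell["colspan"]}" rowspan="{cell["rowspan"]}"{hidden}>{text}</td>'


def row_to_html(row: List[Dict[str, Any]]) -> str:
    """Render a list of cells as a <tr> element."""
    return "<tr>" + "".join(cell_to_html(cell) for cell in row) + "</tr>"


# A cell spanning two columns, followed by the hidden cell it was merged with.
sample_row = [
    {"lines": [{"text": "Total", "annotations": []}], "rowspan": 1, "colspan": 2, "invisible": False},
    {"lines": [{"text": "Total", "annotations": []}], "rowspan": 1, "colspan": 1, "invisible": True},
]

print(row_to_html(sample_row))
```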

class dedoc.api.schema.Annotation(*, start: int, end: int, name: str, value: str)

A piece of information about the text line: its appearance or links to another document object. For example, Annotation(start=1, end=13, name="italic", value="True") says that the text from position 1 up to (but not including) position 13 is written in italic.

Variables:
  • start (int) – start of the annotated text

  • end (int) – end of the annotated text (end isn’t included)

  • name (str) – annotation’s name, specific for each type of annotation

  • value (str) – information about the annotated text; depends on the type of annotation, e.g. "True"/"False", "10.0", etc.
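
Since end is not included, an annotation's range behaves like a half-open interval. The sketch below assumes that start and end can be used directly as Python slice indices (that is, that they are zero-based character offsets, which this section does not state explicitly) and extracts the text each annotation covers. The helper name annotated_spans and the sample line are invented.

```python
from typing import Any, Dict, List, Tuple


def annotated_spans(text: str, annotations: List[Dict[str, Any]]) -> List[Tuple[str, str, str]]:
    """Return (name, value, covered_text) for every annotation."""
    # text[start:end] excludes the character at `end`, matching the schema.
    return [(a["name"], a["value"], text[a["start"]:a["end"]]) for a in annotations]


# Sample line shaped like a LineWithMeta; the annotation values are examples.
sample_line = {
    "text": "An italic word",
    "annotations": [
        {"start": 3, "end": 9, "name": "italic", "value": "True"},
        {"start": 0, "end": 14, "name": "size", "value": "10.0"},
    ],
}

spans = annotated_spans(sample_line["text"], sample_line["annotations"])
for name, value, covered in spans:
    print(f"{name}={value}: {covered!r}")
```

Note that value is always a string, so a flag such as "True" or a number such as "10.0" has to be parsed by the consumer if a typed value is needed.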