Description of the API output format
Let’s consider the example of a file example.docx.

document example
The document contains lines of different types (heading, list item, raw text), a table, and an attached image with text. Let’s parse it using the dedoc API and look at how the output depends on the API parameters. The instructions for running the dedoc API may be useful.
JSON output format
Dedoc can return a JSON representation of the class ParsedDocument.
This format is returned by default, or if you set return_format="json" in the dictionary with API parameters.
The output structure may vary depending on the other API parameters (see Api parameters for files parsing via dedoc for more details).
Basic example
Let’s parse the example file using default parameters:
import requests

filename = "example.docx"  # path to the example file (adjust as needed)
with open(filename, "rb") as file:
    files = {"file": (filename, file)}
    r = requests.post("http://localhost:1231/upload", files=files, data=dict())
    result = r.content.decode("utf-8")
The full output json file contains the serialized class ParsedDocument with its content, tables, metadata and attachments.
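Once the response text is available, it can be loaded into a Python dictionary for inspection. The sketch below is only an illustration: it assumes the top-level keys are the ones visible in the output fragments on this page (content, metadata, version, warnings, attachments), the sample string stands in for a real response, and the helper name summarize is hypothetical.

```python
import json

# A tiny stand-in for the `result` string returned by the API; the key names
# are taken from the output fragments shown on this page.
sample_result = """
{
  "content": {
    "structure": {"node_id": "0", "text": "", "subparagraphs": []},
    "tables": []
  },
  "metadata": {"file_name": "example_return_format.docx"},
  "version": "2.2",
  "warnings": [],
  "attachments": []
}
"""


def summarize(result: str) -> dict:
    """Parse the JSON text and report a few basic facts about the document."""
    document = json.loads(result)
    return {
        "top_level_keys": sorted(document),
        "tables": len(document["content"]["tables"]),
        "attachments": len(document["attachments"]),
    }


summary = summarize(sample_result)
print(summary)
```

With a real response, the same function can be called as summarize(result).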
The beginning of the document’s content:
"content": {
  "structure": {
    "node_id": "0",
    "text": "",
    "annotations": [],
    "metadata": {
      "paragraph_type": "root",
      "page_id": 0,
      "line_id": 0
    },
    "subparagraphs": [
      {
        "node_id": "0.0",
        "text": "Document example",
        "annotations": [
          {
            "start": 0,
            "end": 16,
            "name": "indentation",
            "value": "0"
          },
The key “node_id” encodes the position of the line in the document tree. The count of dot-separated numbers shows the depth of the line in the tree, while the numbers themselves give the line’s index within the list of lines at each depth.
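The node_id convention described above can be unpacked with a few lines of Python. This is a minimal sketch: the helper names parse_node_id and node_depth are hypothetical, and counting the root ("0") as depth 0 is a choice made here for illustration rather than something the API defines.

```python
from typing import Tuple


def parse_node_id(node_id: str) -> Tuple[int, ...]:
    """Split a node_id such as "0.1.2" into its integer components."""
    return tuple(int(part) for part in node_id.split("."))


def node_depth(node_id: str) -> int:
    """Depth in the tree, counting the root ("0") as depth 0 (an assumption)."""
    return len(parse_node_id(node_id)) - 1


for node_id in ("0", "0.0", "0.0.0"):
    print(node_id, parse_node_id(node_id), node_depth(node_id))
```

The last component is the line’s index among its siblings, so "0.1" is the second child of the root.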
According to the document’s styles, the line with text “Heading” is less important, so it is a subparagraph of the line with text “Document example”:
"subparagraphs": [
  {
    "node_id": "0.0.0",
    "text": "Heading",
The beginning of the document’s tables:
"tables": [
  {
    "cells": [
      [
        {
          "lines": [
            {
              "text": "Table header",
              "annotations": [
                {
                  "start": 0,
                  "end": 12,
                  "name": "indentation",
                  "value": "0"
                },
                {
                  "start": 0,
                  "end": 12,
                  "name": "alignment",
                  "value": "center"
                },
                {
                  "start": 0,
                  "end": 12,
                  "name": "spacing",
                  "value": "0"
                },
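The nesting in the fragment above (a table holds rows in "cells", each cell holds "lines", each line holds "text") can be flattened into plain rows of strings. The sketch below assumes exactly that nesting. Only "Table header" comes from the example output; the other cell texts are invented placeholders, and the helper name table_to_rows is hypothetical.

```python
from typing import List

# Stand-in table: "Table header" is from the example output, the remaining
# cell texts are invented placeholders used only to exercise the function.
sample_table = {
    "cells": [
        [
            {"lines": [{"text": "Table header", "annotations": []}]},
            {"lines": [{"text": "placeholder A", "annotations": []}]},
        ],
        [
            {"lines": [{"text": "cell 1\n", "annotations": []},
                       {"text": "cell 2", "annotations": []}]},
            {"lines": []},
        ],
    ]
}


def table_to_rows(table: dict) -> List[List[str]]:
    """Return the table as rows of cell texts, joining multi-line cells."""
    return [
        ["\n".join(line["text"].rstrip("\n") for line in cell["lines"])
         for cell in row]
        for row in table["cells"]
    ]


rows = table_to_rows(sample_table)
for row in rows:
    print(row)
```

Annotations are ignored here on purpose, since the goal is only to recover the cell text.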
The beginning of the document’s metadata:
"metadata": {
  "uid": "doc_uid_auto_ff95f898-0871-11ef-b95c-0242ac120002",
  "file_name": "example_return_format.docx",
  "temporary_file_name": "1714647118_806.docx",
  "size": 21270,
  "modified_time": 1714647118,
  "created_time": 1714647118,
  "access_time": 1714647118,
  "file_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
The document’s attachments:
"attachments": []
As we see, the attachments field is empty because the option with_attachments is set to "false" by default (see Api parameters for files parsing via dedoc).
Example of linear structure type
Let’s parse the example file using the linear output structure type:
with open(filename, "rb") as file:
    files = {"file": (filename, file)}
    r = requests.post("http://localhost:1231/upload", files=files, data=dict(structure_type="linear"))
    result = r.content.decode("utf-8")
The full output json file is almost the same, but it has some differences from the basic output.
The beginning of the document’s content is the same as in the previous example with default parameters:
"content": {
  "structure": {
    "node_id": "0",
    "text": "",
    "annotations": [],
    "metadata": {
      "paragraph_type": "root",
      "page_id": 0,
      "line_id": 0
    },
    "subparagraphs": [
      {
        "node_id": "0.0",
        "text": "Document example",
        "annotations": [
          {
            "start": 0,
            "end": 16,
            "name": "indentation",
            "value": "0"
          },
But the next document line isn’t a subparagraph of the document’s title (the line with text “Document example”); it has the same level in the document’s tree hierarchy:
"node_id": "0.1",
"text": "Heading",
All remaining document lines have the same level as well.
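The difference between the two structure types can be checked programmatically by measuring how deep the "subparagraphs" nesting goes. The sketch below uses trimmed stand-ins for the two outputs (only the node_id, text and subparagraphs keys are kept), and the helper name max_depth is hypothetical.

```python
def max_depth(node: dict) -> int:
    """Number of nesting levels below the given node (0 for a leaf)."""
    children = node.get("subparagraphs", [])
    if not children:
        return 0
    return 1 + max(max_depth(child) for child in children)


# Trimmed stand-in for the default (tree) output.
tree_structure = {
    "node_id": "0", "text": "", "subparagraphs": [
        {"node_id": "0.0", "text": "Document example", "subparagraphs": [
            {"node_id": "0.0.0", "text": "Heading", "subparagraphs": []},
        ]},
    ],
}

# Trimmed stand-in for the structure_type="linear" output.
linear_structure = {
    "node_id": "0", "text": "", "subparagraphs": [
        {"node_id": "0.0", "text": "Document example", "subparagraphs": []},
        {"node_id": "0.1", "text": "Heading", "subparagraphs": []},
    ],
}

print(max_depth(tree_structure))    # nested output of the basic example
print(max_depth(linear_structure))  # flat output of the linear example
```

In the linear case every line is a direct child of the root, so the depth never exceeds one level.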
Example with attachments
Let’s parse the example file using the with_attachments parameter:
with open(filename, "rb") as file:
    files = {"file": (filename, file)}
    r = requests.post("http://localhost:1231/upload", files=files, data=dict(with_attachments="true"))
    result = r.content.decode("utf-8")
The full output json file has the same document content, tables and metadata.
Unlike the previous examples, in this case the attachments field is filled:
"attachments": [
  {
    "content": {
      "structure": {
        "node_id": "0",
        "text": "",
        "annotations": [],
        "metadata": {
          "paragraph_type": "root",
          "page_id": 0,
          "line_id": 0
        },
        "subparagraphs": []
      },
      "tables": []
    },
    "metadata": {
      "uid": "attach_7098fafc-e566-46d5-9125-adb6e9b047d8",
      "file_name": "image1.png",
      "temporary_file_name": "1714647118_301.png",
      "size": 14874,
      "modified_time": 1714647118,
      "created_time": 1714647118,
      "access_time": 1714647118,
      "file_type": "image/png"
    },
    "version": "2.2",
    "warnings": [],
    "attachments": []
  }
]
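Each attachment in the fragment above has the same shape as the parent document, including its own "attachments" list, so attachments can be listed with a short recursive walk. This is a sketch over a trimmed stand-in for the parsed response; the helper name iter_attachments is hypothetical.

```python
from typing import Iterator, Tuple

# Trimmed stand-in for a parsed response; values mirror the fragment above.
sample_document = {
    "metadata": {"file_name": "example_return_format.docx"},
    "attachments": [
        {
            "metadata": {
                "file_name": "image1.png",
                "file_type": "image/png",
                "size": 14874,
            },
            "attachments": [],
        }
    ],
}


def iter_attachments(document: dict, level: int = 1) -> Iterator[Tuple[int, dict]]:
    """Yield (nesting level, metadata) for every attachment, depth-first."""
    for attachment in document.get("attachments", []):
        yield level, attachment["metadata"]
        yield from iter_attachments(attachment, level + 1)


for level, meta in iter_attachments(sample_document):
    print(level, meta["file_name"], meta["file_type"], meta["size"])
```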
Example with base64 attachments
Let’s parse the example file with attachments in base64 format:
with open(filename, "rb") as file:
    files = {"file": (filename, file)}
    r = requests.post("http://localhost:1231/upload", files=files, data=dict(with_attachments="true", return_base64="true"))
    result = r.content.decode("utf-8")
The full output json file has the same document content, tables, metadata and filled attachments as the previous example output.
The only difference is in the attachment’s metadata: the attachment’s content is encoded and stored in the "base64_encode" field:
"attachments": [
  {
    "content": {
      "structure": {
        "node_id": "0",
        "text": "",
        "annotations": [],
        "metadata": {
          "paragraph_type": "root",
          "page_id": 0,
          "line_id": 0
        },
        "subparagraphs": []
      },
      "tables": []
    },
    "metadata": {
      "uid": "attach_e2f42908-09a9-40fc-9b75-2e1413abc275",
      "file_name": "image1.png",
      "temporary_file_name": "1714647118_610.png",
      "size": 14874,
      "modified_time": 1714647118,
      "created_time": 1714647118,
      "access_time": 1714647118,
      "file_type": "image/png",
      "base64_encode": ""
    },
    "version": "2.2",
    "warnings": [],
    "attachments": []
  }
]
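The "base64_encode" value can be turned back into the original bytes with the standard base64 module. The sketch below assumes the field holds a standard base64 string; the payload is an invented placeholder rather than the real image, and the helper name decode_attachment is hypothetical.

```python
import base64

# Placeholder payload standing in for the real image bytes.
placeholder_bytes = b"not a real image"

sample_attachment = {
    "metadata": {
        "file_name": "image1.png",
        "file_type": "image/png",
        "base64_encode": base64.b64encode(placeholder_bytes).decode("ascii"),
    }
}


def decode_attachment(attachment: dict) -> bytes:
    """Return the attachment's raw bytes from its base64-encoded metadata field."""
    return base64.b64decode(attachment["metadata"]["base64_encode"])


decoded = decode_attachment(sample_attachment)
print(len(decoded), "bytes decoded for", sample_attachment["metadata"]["file_name"])
```

The decoded bytes can then be written to disk by opening a file in "wb" mode, for example under the attachment’s file_name.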
Example with parsed attachments
Let’s parse the example file with attachments and their content:
with open(filename, "rb") as file:
    files = {"file": (filename, file)}
    r = requests.post("http://localhost:1231/upload", files=files, data=dict(with_attachments="true", need_content_analysis="true"))
    result = r.content.decode("utf-8")
The full output json file has the same document content, tables and metadata.
The attachments field is filled, and the attachments are also parsed.
In the document example, the attached image has some text on it; this text has also been parsed and saved in the attachment’s content.
The beginning of the document’s attachments:
"attachments": [
  {
    "content": {
      "structure": {
        "node_id": "0",
        "text": "",
        "annotations": [],
        "metadata": {
          "paragraph_type": "root",
          "page_id": 0,
          "line_id": 0
        },
        "subparagraphs": [
          {
            "node_id": "0.0",
            "text": "THE GREAT ENGLISH DOCUMENT\n",
            "annotations": [
              {
                "start": 0,
                "end": 3,
                "name": "confidence",
                "value": "0.96"
              },
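Each annotation in the fragment above covers a span of the line’s text given by "start" and "end". The sketch below pairs each annotated span with its value, assuming "end" is exclusive; this matches the earlier fragments, where "Document example" has 16 characters and its annotation ends at 16. The helper name annotated_spans is hypothetical.

```python
from typing import List, Tuple

# Line taken from the fragment above, with only its first annotation.
sample_line = {
    "text": "THE GREAT ENGLISH DOCUMENT\n",
    "annotations": [
        {"start": 0, "end": 3, "name": "confidence", "value": "0.96"},
    ],
}


def annotated_spans(line: dict, name: str) -> List[Tuple[str, str]]:
    """Return (covered text, value) pairs for annotations with the given name."""
    return [
        (line["text"][annotation["start"]:annotation["end"]], annotation["value"])
        for annotation in line["annotations"]
        if annotation["name"] == name
    ]


spans = annotated_spans(sample_line, "confidence")
print(spans)
```

The same function works for the other annotation names seen on this page, such as "indentation" or "alignment".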