dedoc.structure_constructors
- class dedoc.structure_constructors.AbstractStructureConstructor
This class constructs a structured representation of a document from its unstructured representation (a list of lines).
The resulting ParsedDocument contains a DocumentContent, which consists of the tables and the structure itself. The structure is built from the list of LineWithMeta objects, using the line types and hierarchy levels assigned by a structure extractor.
The order of the document lines and their hierarchy can be represented in different ways, e.g. as a standard tree of lines. A specific constructor may also define its own custom structure.
- abstract construct(document: UnstructuredDocument, parameters: dict | None = None) → ParsedDocument
Process the unstructured document and build the parsed document representation from it.
- Parameters:
document – intermediate representation of the document received from a structure extractor (hierarchy levels must be filled in for all lines)
parameters – additional parameters for document parsing, see Structure type configuring for more details
- Returns:
the structured representation of the given document
- class dedoc.structure_constructors.StructureConstructorComposition(constructors: Dict[str, AbstractStructureConstructor], default_constructor: AbstractStructureConstructor)
Bases: AbstractStructureConstructor
This class builds the structure of any document using the available structure constructors. The constructors and the names of their structure types are passed to the class constructor. Each structure type defines a specific document representation, which is produced by the corresponding structure constructor.
- __init__(constructors: Dict[str, AbstractStructureConstructor], default_constructor: AbstractStructureConstructor) → None
- Parameters:
constructors – mapping structure_type -> structure constructor for each supported structure representation
default_constructor – the structure constructor used by default when the given structure type is empty
- construct(document: UnstructuredDocument, parameters: dict | None = None) → ParsedDocument
Construct the resulting document structure according to the structure_type parameter. If structure_type is an empty string or None, the default constructor is used. For information about the parameters, see the documentation of AbstractStructureConstructor.
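The dispatch described above can be illustrated with a minimal, self-contained sketch. It does not use dedoc: the constructors are plain hypothetical functions (linear, tree), and raising ValueError for an unknown structure type is an assumption of this sketch, not documented behaviour of StructureConstructorComposition.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in: here a "constructor" is a plain function, not a dedoc class.
Constructor = Callable[[List[str]], dict]


def linear(lines: List[str]) -> dict:
    return {"structure": "linear", "lines": list(lines)}


def tree(lines: List[str]) -> dict:
    return {"structure": "tree", "lines": list(lines)}


def construct(lines: List[str],
              constructors: Dict[str, Constructor],
              default_constructor: Constructor,
              parameters: Optional[dict] = None) -> dict:
    parameters = parameters or {}
    structure_type = parameters.get("structure_type")
    # Both None and the empty string select the default constructor.
    if not structure_type:
        return default_constructor(lines)
    if structure_type not in constructors:
        # Assumption of this sketch: unknown types are rejected explicitly.
        raise ValueError(f"Unknown structure type: {structure_type}")
    return constructors[structure_type](lines)


constructors = {"linear": linear, "tree": tree}
lines = ["Title", "Some text"]
print(construct(lines, constructors, tree)["structure"])
print(construct(lines, constructors, tree, {"structure_type": ""})["structure"])
print(construct(lines, constructors, tree, {"structure_type": "linear"})["structure"])
```

With tree passed as the default, the first two calls fall back to it, and only the third call selects the linear function by name.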
- class dedoc.structure_constructors.LinearConstructor
Bases: AbstractStructureConstructor
This class forms a simple document structure representation as a list of document lines. The result contains an empty root node whose children are all the document lines in their original order.
- construct(document: UnstructuredDocument, parameters: dict | None = None) → ParsedDocument
Build the linear structure representation from the given intermediate representation of the document. For information about the parameters, see the documentation of AbstractStructureConstructor.
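As a rough illustration of the linear representation, the sketch below builds an empty root node with every line as a direct child. Node and build_linear are hypothetical names introduced for this example; they are not dedoc classes and carry none of the metadata that real dedoc nodes have.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    # Hypothetical simplified node: just a text and a list of children.
    text: str
    children: List["Node"] = field(default_factory=list)


def build_linear(lines: List[str]) -> Node:
    # The root carries no text; all lines become its children in order.
    root = Node(text="")
    root.children = [Node(text=line) for line in lines]
    return root


root = build_linear(["Title", "First paragraph", "Second paragraph"])
for child in root.children:
    print(child.text)
```

No line has children of its own here, so the hierarchy levels of the lines play no role in this representation.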
- class dedoc.structure_constructors.TreeConstructor
Bases: AbstractStructureConstructor
This class is used to form a basic hierarchical document structure representation as a tree.
- The structure is built according to the lines’ hierarchy levels and their types:
  - lines with hierarchy level (0, 0) are merged and become the root of the document;
  - lines of type list_item become children of a new empty auxiliary list node;
  - each line is added as a separate tree node in the document hierarchy according to its hierarchy level:
    - if the hierarchy level of the current line is greater than that of the previous line, the current line becomes a child of the previous line;
    - otherwise, the line becomes a child of the nearest preceding line whose hierarchy level is less than its own.
Hierarchy levels of the lines are compared lexicographically.
- Example:
  - root line (0, 0)
    - first child line (1, 0)
      - line (2, 0)
        - line (2, 1)
      - line (2, 0)
    - second child line (1, 0)
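The nesting rules can be sketched with a stack of open ancestors and Python's built-in lexicographic tuple comparison. This is a simplified illustration, not dedoc's implementation: Node, build_tree and render are hypothetical names, hierarchy levels are modelled as plain (int, int) tuples, the handling of list_item lines is omitted, and joining root lines with a space is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Level = Tuple[int, int]


@dataclass
class Node:
    # Hypothetical simplified node, not a dedoc class.
    text: str
    level: Level
    children: List["Node"] = field(default_factory=list)


def build_tree(lines: List[Tuple[str, Level]]) -> Node:
    root = Node(text="", level=(0, 0))
    stack = [root]  # path from the root to the most recently added node
    for text, level in lines:
        if level == (0, 0):
            # Root-level lines are merged into the root (space-joined by assumption).
            root.text = (root.text + " " + text).strip()
            continue
        # Drop ancestors whose level is not strictly less than the current one.
        # Tuples compare lexicographically, e.g. (2, 0) < (2, 1).
        while len(stack) > 1 and stack[-1].level >= level:
            stack.pop()
        node = Node(text=text, level=level)
        stack[-1].children.append(node)
        stack.append(node)
    return root


def render(node: Node, depth: int = 0) -> List[str]:
    # Indent each node by two spaces per tree depth.
    out = ["  " * depth + f"{node.text} {node.level}"]
    for child in node.children:
        out.extend(render(child, depth + 1))
    return out


example = [
    ("root line", (0, 0)),
    ("first child line", (1, 0)),
    ("line a", (2, 0)),
    ("line b", (2, 1)),
    ("line c", (2, 0)),
    ("second child line", (1, 0)),
]
root = build_tree(example)
print("\n".join(render(root)))
```

On the example levels, the (2, 1) line is attached under the first (2, 0) line because (2, 0) < (2, 1), while the second (2, 0) line returns to the (1, 0) parent.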
- construct(document: UnstructuredDocument, parameters: dict | None = None) → ParsedDocument
Build the tree structure representation from the given intermediate representation of the document. For information about the parameters, see the documentation of AbstractStructureConstructor.