dedoc.structure_extractors

class dedoc.structure_extractors.AbstractStructureExtractor[source]

This class enriches an unstructured document (a list of lines) received from one of the readers. It adds the type of each line (paragraph_type) and its level (hierarchy_level) in the document.

The hierarchy level of a line shows its importance in the document: the more important the line, the lower its level value. See the class dedoc.data_structures.HierarchyLevel for more information.
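To illustrate the "lower value means more important" convention, here is a minimal sketch that uses a plain integer as a simplified stand-in for HierarchyLevel (the real class carries more information). The find_parent helper is hypothetical and only shows how a nesting could be derived from such levels; it is not taken from dedoc.

```python
from typing import List, Optional, Tuple

# (text, level) pairs; the integer level is a simplified stand-in for HierarchyLevel.
lines: List[Tuple[str, int]] = [
    ("Chapter 1", 1),
    ("Section 1.1", 2),
    ("Body text of section 1.1", 3),
    ("Chapter 2", 1),
    ("Body text of chapter 2", 3),
]


def find_parent(index: int) -> Optional[str]:
    """Return the nearest preceding line that is more important (smaller level)."""
    _, level = lines[index]
    for text, candidate_level in reversed(lines[:index]):
        if candidate_level < level:
            return text
    return None  # no more important line precedes this one


parents = [find_parent(i) for i in range(len(lines))]
```

Under this assumption, top-level lines have no parent, and each body line attaches to the closest heading above it.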

The paragraph type of a line should be one of the types predefined for a particular document domain, e.g. header, list_item, raw_text, etc. Each concrete structure extractor defines its own structuring rules: the levels and the possible types of lines.

abstract extract_structure(document: UnstructuredDocument, parameters: dict) → UnstructuredDocument[source]

This method extracts the structure of the document content received from a reader: it determines line types and their hierarchy levels and adds them to the lines’ metadata.

Parameters:

document – document content that has been received from some of the readers
parameters – additional parameters for document parsing

Returns:

document content with added information about line types and hierarchy levels
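A concrete extractor implements extract_structure. The sketch below shows the general shape, using small stand-in classes so that it runs without dedoc installed. Line, the simplified UnstructuredDocument, the metadata dictionary layout, and UppercaseHeaderExtractor are all assumptions made for illustration and do not reproduce the real dedoc data structures.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List


@dataclass
class Line:
    """Hypothetical stand-in for a line produced by a reader."""
    text: str
    metadata: dict = field(default_factory=dict)


@dataclass
class UnstructuredDocument:
    """Hypothetical stand-in: only the list of lines is modelled here."""
    lines: List[Line]


class AbstractStructureExtractor(ABC):
    """Stand-in that mirrors the interface documented above."""

    @abstractmethod
    def extract_structure(self, document: UnstructuredDocument, parameters: dict) -> UnstructuredDocument:
        ...


class UppercaseHeaderExtractor(AbstractStructureExtractor):
    """Toy rule: fully upper-case lines are headers, everything else is raw text."""

    def extract_structure(self, document: UnstructuredDocument, parameters: dict) -> UnstructuredDocument:
        for line in document.lines:
            is_header = line.text.strip().isupper()
            line.metadata["paragraph_type"] = "header" if is_header else "raw_text"
            # A lower value marks a more important line.
            line.metadata["hierarchy_level"] = 1 if is_header else 2
        return document


doc = UnstructuredDocument(lines=[Line("INTRODUCTION"), Line("Some body text.")])
result = UppercaseHeaderExtractor().extract_structure(doc, parameters={})
paragraph_types = [line.metadata["paragraph_type"] for line in result.lines]
```

The extractor returns the same document object it received, with the type and level written into each line's metadata.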

class dedoc.structure_extractors.StructureExtractorComposition(extractors: Dict[str, AbstractStructureExtractor], default_key: str)[source]

Bases: AbstractStructureExtractor

This class extracts structure from any document using one of the available structure extractors. The structure extractors and the names of the document types they handle are set via the class constructor. Each document type defines a specific document domain, whose structure is extracted by the corresponding structure extractor.

__init__(extractors: Dict[str, AbstractStructureExtractor], default_key: str) → None[source]

Parameters:

extractors – mapping document_type -> structure extractor, defined for certain document domains
default_key – the document_type of the structure extractor that is used by default when the given parameters do not select a supported extractor. default_key must exist as a key in the extractors dictionary.

extract_structure(document: UnstructuredDocument, parameters: dict) → UnstructuredDocument[source]

Adds information about the document structure according to the document type received from parameters (the document_type key). If parameters has no document_type key, or its value is not among the supported types, the default extractor is used. For a description of the method’s parameters, see the documentation of the class AbstractStructureExtractor.
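The dispatch rule described for StructureExtractorComposition can be sketched as follows. This is not the library's implementation: CompositionSketch and NamedExtractor are hypothetical stand-ins, and a plain dict replaces UnstructuredDocument so that the example is self-contained.

```python
from typing import Dict


class NamedExtractor:
    """Hypothetical stand-in that records which extractor handled the document."""

    def __init__(self, name: str) -> None:
        self.name = name

    def extract_structure(self, document: dict, parameters: dict) -> dict:
        document["handled_by"] = self.name
        return document


class CompositionSketch:
    """Sketch of the documented behaviour: pick an extractor by document_type."""

    def __init__(self, extractors: Dict[str, NamedExtractor], default_key: str) -> None:
        if default_key not in extractors:
            raise ValueError("default_key must be a key of extractors")
        self.extractors = extractors
        self.default_key = default_key

    def extract_structure(self, document: dict, parameters: dict) -> dict:
        # A missing or unsupported document_type falls back to the default extractor.
        document_type = parameters.get("document_type", self.default_key)
        extractor = self.extractors.get(document_type, self.extractors[self.default_key])
        return extractor.extract_structure(document, parameters)


composition = CompositionSketch(
    {"other": NamedExtractor("other"), "law": NamedExtractor("law")},
    default_key="other",
)
handled_law = composition.extract_structure({}, {"document_type": "law"})["handled_by"]
handled_missing = composition.extract_structure({}, {})["handled_by"]
handled_unknown = composition.extract_structure({}, {"document_type": "unknown"})["handled_by"]
```

Here a request for "law" reaches the law extractor, while a missing key and an unknown type both end up at the extractor registered under default_key.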

class dedoc.structure_extractors.DefaultStructureExtractor[source]

Bases: AbstractStructureExtractor

This class performs basic structure extraction from documents.

You can find the description of this type of structure in the section Default document structure type.

document_type = 'other'

extract_structure(document: UnstructuredDocument, parameters: dict) → UnstructuredDocument[source]

Extracts the basic structure of the given document and adds the resulting information to the lines’ metadata. For a description of the method’s parameters, see the documentation of the class AbstractStructureExtractor.

class dedoc.structure_extractors.AbstractLawStructureExtractor(*, config: dict)[source]

Bases: AbstractStructureExtractor, ABC

This class is used to extract structure from laws.

You can find the description of this type of structure in the section Law structure type.

__init__(*, config: dict) → None[source]

Parameters:

config – some configuration for document parsing

extract_structure(document: UnstructuredDocument, parameters: dict) → UnstructuredDocument[source]

Extracts the law structure of the given document and adds the resulting information to the lines’ metadata. For a description of the method’s parameters, see the documentation of the class AbstractStructureExtractor.

class dedoc.structure_extractors.ClassifyingLawStructureExtractor(extractors: Dict[str, AbstractStructureExtractor], *, config: dict)[source]

Bases: AbstractStructureExtractor, ABC

This class dynamically classifies a law document as one of two types, law or foiv, and then calls the specific extractor that matches the classification result.

document_type = 'law'

__init__(extractors: Dict[str, AbstractStructureExtractor], *, config: dict) → None[source]

Parameters:

extractors – mapping law_type -> structure extractor, defined for certain law types
config – configuration of the extractor, e.g. a logger

extract_structure(document: UnstructuredDocument, parameters: dict) → UnstructuredDocument[source]

Classifies the kind of law and extracts the structure according to that specific law format. For a description of the method’s parameters, see the documentation of the class AbstractStructureExtractor.
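The classify-then-dispatch idea can be sketched as below. Everything here is hypothetical: classify_law is a toy keyword check and not dedoc's classification rule, a plain dict replaces UnstructuredDocument, and the mapping keys "law" and "foiv_law" are assumed to match the document_type attributes of the two law extractors.

```python
from typing import Dict


class TaggingExtractor:
    """Hypothetical stand-in that records which extractor handled the document."""

    def __init__(self, name: str) -> None:
        self.name = name

    def extract_structure(self, document: dict, parameters: dict) -> dict:
        document["handled_by"] = self.name
        return document


def classify_law(text: str) -> str:
    """Toy classifier for illustration only; NOT the rule used by dedoc."""
    return "foiv_law" if "order of the ministry" in text.lower() else "law"


class ClassifyingSketch:
    """Sketch: classify the law kind first, then delegate to the matching extractor."""

    document_type = "law"

    def __init__(self, extractors: Dict[str, TaggingExtractor], *, config: dict) -> None:
        self.extractors = extractors
        self.config = config

    def extract_structure(self, document: dict, parameters: dict) -> dict:
        law_type = classify_law(document["text"])
        return self.extractors[law_type].extract_structure(document, parameters)


extractor = ClassifyingSketch(
    {"law": TaggingExtractor("law"), "foiv_law": TaggingExtractor("foiv_law")},
    config={},
)
federal = extractor.extract_structure({"text": "Federal law on education"}, {})
ministry = extractor.extract_structure({"text": "Order of the Ministry of Finance"}, {})
```

The caller always asks for the "law" document type; the choice between the two concrete extractors is made from the document content.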

class dedoc.structure_extractors.LawStructureExtractor(*, config: dict)[source]

Bases: AbstractLawStructureExtractor

This class is used to extract structure from common laws.

You can find the description of this type of structure in the section Simple law structure type.

document_type = 'law'

__init__(*, config: dict) → None[source]

Parameters:

config – some configuration for document parsing

class dedoc.structure_extractors.FoivLawStructureExtractor(*, config: dict)[source]

Bases: AbstractLawStructureExtractor

This class is used to extract structure from laws of the foiv type.

You can find the description of this type of structure in the section Foiv law structure type.

document_type = 'foiv_law'

__init__(*, config: dict) → None[source]

Parameters:

config – some configuration for document parsing

class dedoc.structure_extractors.DiplomaStructureExtractor(*, config: dict)[source]

Bases: AbstractStructureExtractor

This class is used to extract structure from Russian diplomas, master's dissertations, theses, etc.

You can find the description of this type of structure in the section Diploma structure type.

document_type = 'diploma'

__init__(*, config: dict) → None[source]

Parameters:

config – some configuration for document parsing

extract_structure(document: UnstructuredDocument, parameters: dict) → UnstructuredDocument[source]

Extracts the diploma structure of the given document and adds the resulting information to the lines’ metadata. For a description of the method’s parameters, see the documentation of the class AbstractStructureExtractor.

class dedoc.structure_extractors.TzStructureExtractor(*, config: dict)[source]

Bases: AbstractStructureExtractor

This class is used to extract structure from technical specifications.

You can find the description of this type of structure in the section Technical specification structure type.

document_type = 'tz'

__init__(*, config: dict) → None[source]

Parameters:

config – some configuration for document parsing

extract_structure(document: UnstructuredDocument, parameters: dict) → UnstructuredDocument[source]

Extracts the technical specification structure of the given document and adds the resulting information to the lines’ metadata. For a description of the method’s parameters, see the documentation of the class AbstractStructureExtractor.