dedoc.structure_extractors

class dedoc.structure_extractors.AbstractStructureExtractor(*, config: dict | None = None)[source]

This class enriches an unstructured document (a list of lines) received from one of the readers: it adds the types of the lines (paragraph_type) and their levels (hierarchy_level) in the document.

The hierarchy level of a line reflects its importance in the document: the more important the line, the lower its level value. See the class dedoc.data_structures.HierarchyLevel for more information.

The paragraph type of a line should be one of the types predefined for a particular document domain, e.g. header, list_item, raw_text, etc. Each concrete structure extractor defines its own structuring rules: the levels and the possible types of the lines.
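The ordering rule described above (a lower level value means a more important line) can be illustrated with a minimal sketch. The Level class is a hypothetical stand-in, not dedoc's HierarchyLevel, and comparing the pair (level_1, level_2) is an assumption made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    """Hypothetical stand-in for a hierarchy level (not dedoc's HierarchyLevel)."""
    line_type: str
    level_1: int
    level_2: int


def is_more_important(first: Level, second: Level) -> bool:
    # Assumption: levels are compared as (level_1, level_2) pairs,
    # and a lower value means a more important line.
    return (first.level_1, first.level_2) < (second.level_1, second.level_2)


header = Level(line_type="header", level_1=1, level_2=1)
list_item = Level(line_type="list_item", level_1=2, level_2=1)
print(is_more_important(header, list_item))
```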

__init__(*, config: dict | None = None) → None[source]

Parameters:

config – configuration of the extractor, e.g. logger for logging

abstract extract(document: UnstructuredDocument, parameters: dict | None = None) → UnstructuredDocument[source]

This method extracts structure from the document content received from a reader: it finds line types and their hierarchy levels and adds them to the lines’ metadata.

Parameters:

document – document content that has been received from some of the readers
parameters – additional parameters for document parsing, see Structure type configuring for more details

Returns:

document content enriched with information about line types and hierarchy levels

class dedoc.structure_extractors.StructureExtractorComposition(extractors: Dict[str, AbstractStructureExtractor], default_key: str, *, config: dict | None = None)[source]

Bases: AbstractStructureExtractor

This class extracts structure from any document using the available structure extractors. The structure extractors and the names of their document types are set via the class constructor. Each document type defines a specific document domain, whose structure is extracted by the corresponding structure extractor.

__init__(extractors: Dict[str, AbstractStructureExtractor], default_key: str, *, config: dict | None = None) → None[source]

Parameters:

extractors – mapping document_type -> structure extractor, defined for certain document domains
default_key – the document_type of the structure extractor that is used by default if the given parameters are wrong. default_key must exist as a key in the extractors’ dictionary.

extract(document: UnstructuredDocument, parameters: dict | None = None) → UnstructuredDocument[source]

Adds information about the document structure according to the document type received from parameters (the key document_type). If parameters has no document_type key, or the given document_type is not among the supported types, the default extractor is used. For details on the method’s parameters, see the documentation of the class AbstractStructureExtractor.
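The fallback rule described above can be sketched in plain Python. This is only an illustration of the selection logic under stated assumptions, not the library's implementation: choose_extractor_key is a hypothetical helper, and strings stand in for real extractor objects.

```python
from typing import Dict, Optional


def choose_extractor_key(extractors: Dict[str, object], default_key: str, parameters: Optional[dict] = None) -> str:
    """Return the key of the extractor that would handle the document (illustrative only)."""
    parameters = parameters or {}
    document_type = parameters.get("document_type")
    # Fall back to the default when the key is missing or not supported.
    return document_type if document_type in extractors else default_key


# Strings are stand-ins for real extractor objects.
extractors = {"other": "default extractor", "law": "law extractor"}
print(choose_extractor_key(extractors, "other", {"document_type": "law"}))
print(choose_extractor_key(extractors, "other", {"document_type": "unknown_type"}))
print(choose_extractor_key(extractors, "other"))
```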

class dedoc.structure_extractors.DefaultStructureExtractor(*, config: dict | None = None)[source]

Bases: AbstractStructureExtractor

This class performs basic structure extraction from documents.

You can find the description of this type of structure in the section Default document structure type.

document_type = 'other'

extract(document: UnstructuredDocument, parameters: dict | None = None) → UnstructuredDocument[source]

Extract basic structure from the given document and add additional information to the lines’ metadata. To get the information about the method’s parameters look at the documentation of the class AbstractStructureExtractor.

The parameters argument can contain patterns (the “patterns” key) for configuring line types and their levels in the output document tree. See Patterns for DefaultStructureExtractor and Configure structure extraction using patterns for information on how to use patterns to build a custom structure.

class dedoc.structure_extractors.AbstractLawStructureExtractor(*, config: dict | None = None)[source]

Bases: AbstractStructureExtractor, ABC

This class is used for extracting structure from laws.

You can find the description of this type of structure in the section Law structure type.

extract(document: UnstructuredDocument, parameters: dict | None = None) → UnstructuredDocument[source]

Extract law structure from the given document and add additional information to the lines’ metadata. For details on the method’s parameters, see the documentation of the class AbstractStructureExtractor.

class dedoc.structure_extractors.ClassifyingLawStructureExtractor(extractors: Dict[str, AbstractStructureExtractor], *, config: dict | None = None)[source]

Bases: AbstractStructureExtractor, ABC

This class is used to dynamically classify laws into two types: laws and foiv. The specific extractors are called according to the classifying results.

document_type = 'law'

__init__(extractors: Dict[str, AbstractStructureExtractor], *, config: dict | None = None) → None[source]

Parameters:

extractors – mapping law_type -> structure extractor, defined for certain law types
config – configuration of the extractor, e.g. logger for logging

extract(document: UnstructuredDocument, parameters: dict | None = None) → UnstructuredDocument[source]

Classify the kind of law and extract structure according to the specific law format. For details on the method’s parameters, see the documentation of the class AbstractStructureExtractor.

class dedoc.structure_extractors.LawStructureExtractor(*, config: dict | None = None)[source]

Bases: AbstractLawStructureExtractor

This class is used for extracting structure from common laws.

You can find the description of this type of structure in the section Simple law structure type.

document_type = 'law'

class dedoc.structure_extractors.FoivLawStructureExtractor(*, config: dict | None = None)[source]

Bases: AbstractLawStructureExtractor

This class is used for extracting structure from the foiv type of law.

You can find the description of this type of structure in the section Foiv law structure type.

document_type = 'foiv_law'

class dedoc.structure_extractors.DiplomaStructureExtractor(*, config: dict | None = None)[source]

Bases: AbstractStructureExtractor

This class is used for extracting structure from Russian diplomas, master’s dissertations, theses, etc.

You can find the description of this type of structure in the section Diploma structure type.

document_type = 'diploma'

extract(document: UnstructuredDocument, parameters: dict | None = None) → UnstructuredDocument[source]

Extract diploma structure from the given document and add additional information to the lines’ metadata. For details on the method’s parameters, see the documentation of the class AbstractStructureExtractor.

class dedoc.structure_extractors.TzStructureExtractor(*, config: dict | None = None)[source]

Bases: AbstractStructureExtractor

This class is used for extracting structure from technical tasks.

You can find the description of this type of structure in the section Technical specification structure type.

document_type = 'tz'

extract(document: UnstructuredDocument, parameters: dict | None = None) → UnstructuredDocument[source]

Extract technical task structure from the given document and add additional information to the lines’ metadata. For details on the method’s parameters, see the documentation of the class AbstractStructureExtractor.

class dedoc.structure_extractors.ArticleStructureExtractor(*, config: dict | None = None)[source]

Bases: AbstractStructureExtractor

This class corresponds to the GROBID article structure extraction.

This class saves all tag_hierarchy_levels received from the ArticleReader without using the postprocessing step (without using regular expressions).

You can find the description of this type of structure in the section Article structure type (GROBID).

document_type = 'article'

extract(document: UnstructuredDocument, parameters: dict | None = None) → UnstructuredDocument[source]

Extract article structure from the given document and add additional information to the lines’ metadata. For details on the method’s parameters, see the documentation of the class AbstractStructureExtractor.

class dedoc.structure_extractors.FintocStructureExtractor(*, config: dict | None = None)[source]

Bases: AbstractStructureExtractor

This class is an implementation of the TOC extractor for the FinTOC 2022 Shared task. The code is a modification of the winner’s solution (ISP RAS team).

This structure extractor is used for English, French and Spanish financial prospectuses in PDF format (with a textual layer). It is recommended to use PdfTxtlayerReader to obtain document lines. A more detailed description of this type of structure is given in the section FinTOC structure type.

document_type = 'fintoc'

extract(document: UnstructuredDocument, parameters: dict | None = None, file_path: str | None = None) → UnstructuredDocument[source]

According to the FinTOC 2022 title detection task, lines are classified as titles and non-titles. The information about titles is saved in line.metadata.hierarchy_level (HierarchyLevel class):

Title lines have HierarchyLevel.header type, and their depth (HierarchyLevel.level_2) is similar to the depth of TOC item from the FinTOC 2022 TOC generation task.

Non-title lines have HierarchyLevel.raw_text type, and their depth is not determined.

Parameters:

document – document content that has been received from some of the readers (PdfTxtlayerReader is recommended).
parameters – for this structure extractor, the “language” parameter sets the document’s language, e.g. parameters={"language": "en"}. The following options are supported:
- “en”, “eng” - English (default);
- “fr”, “fra” - French;
- “sp”, “spa” - Spanish.
file_path – path to the file on disk.

Returns:

document content enriched with information about title/non-title lines and hierarchy levels of titles.
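The supported values of the “language” parameter can be summarised in a small lookup. This is a sketch only: resolve_language is a hypothetical helper, and the choice of the two-letter codes as canonical forms is an assumption made for illustration, not something the extractor is known to do internally.

```python
from typing import Optional

# Aliases listed in the documentation, mapped to an assumed canonical two-letter code.
LANGUAGE_ALIASES = {
    "en": "en", "eng": "en",
    "fr": "fr", "fra": "fr",
    "sp": "sp", "spa": "sp",
}


def resolve_language(parameters: Optional[dict] = None) -> str:
    """Pick the language from parameters, using English when none is given."""
    parameters = parameters or {}
    language = parameters.get("language", "en")
    if language not in LANGUAGE_ALIASES:
        raise ValueError(f"Unsupported language: {language!r}")
    return LANGUAGE_ALIASES[language]


print(resolve_language({"language": "fra"}))
print(resolve_language())
```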

Patterns for `DefaultStructureExtractor`

Structure patterns allow more flexible configuration of line types and levels during the structure extraction step. They are used only by DefaultStructureExtractor (in the API, when “document_type”=“other”). See Configure structure extraction using patterns for examples of pattern usage.

AbstractPattern is the base class for all patterns that configure structure extraction by DefaultStructureExtractor.

_name = ''

Initialize pattern with default values of HierarchyLevel attributes. They can be used in get_hierarchy_level() according to specific pattern logic.

Parameters:

line_type – type of the line, e.g. “header”, “bullet_list_item”, “chapter”, etc.
level_1 – primary importance value of the line
level_2 – level of the line within its specific class
can_be_multiline – used by TreeConstructor to unify lines inside a tree node: if a line can be multiline, it can be joined with another line. If None is given, can_be_multiline is set to True.

abstract get_hierarchy_level(line: LineWithMeta) → HierarchyLevel[source]

This method should be applied only when match() returned True for the given line.

Get HierarchyLevel for initialising line.metadata.hierarchy_level attribute. Please see Hierarchy level for document lines to get more information about HierarchyLevel.

abstract match(line: LineWithMeta) → bool[source]

Check if the given line satisfies the pattern requirements. Line text, annotations or metadata (metadata.tag_hierarchy_level) can be used to decide whether the line matches the pattern.

classmethod name() → str[source]

Returns the _name attribute, which is used in the parameters configuration to choose a specific pattern. Each pattern has a unique non-empty name.

class dedoc.structure_extractors.patterns.pattern_composition.PatternComposition(patterns: List[AbstractPattern])[source]

Class for applying patterns to get a line’s hierarchy level.

Example of usage:

from dedoc.data_structures.line_with_meta import LineWithMeta
from dedoc.structure_extractors.patterns import TagListPattern, TagPattern
from dedoc.structure_extractors.patterns.pattern_composition import PatternComposition


pattern_composition = PatternComposition(
    patterns=[
        TagListPattern(line_type="list_item", default_level_1=2, can_be_multiline=False),
        TagPattern(default_line_type="raw_text")
    ]
)
line = LineWithMeta(line="Some text")
line.metadata.hierarchy_level = pattern_composition.get_hierarchy_level(line=line)

__init__(patterns: List[AbstractPattern]) → None[source]

Set the list of patterns to apply to lines.

Note: the order of the patterns is important. More specific patterns should go first. Otherwise, they may be shadowed by more general patterns that also apply to the given line.

Parameters:

patterns – list of patterns to apply to lines.

get_hierarchy_level(line: LineWithMeta) → HierarchyLevel[source]

Choose a suitable pattern from the list of patterns and apply it to the given line. The first applicable pattern is chosen. If no applicable pattern is found, the default raw_text HierarchyLevel is returned.

Parameters:

line – line to get hierarchy level for.
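The first-match rule and the importance of pattern order can be illustrated without dedoc. In this sketch each pattern is a hypothetical (predicate, line type) pair rather than an AbstractPattern subclass, and first_matching_type is an illustrative helper, not part of the library.

```python
from typing import Callable, List, Tuple

# Each stand-in pattern is a (match predicate, line type) pair; these are not dedoc classes.
SimplePattern = Tuple[Callable[[str], bool], str]


def first_matching_type(patterns: List[SimplePattern], text: str) -> str:
    for matches, line_type in patterns:
        if matches(text):
            return line_type  # the first applicable pattern wins
    return "raw_text"  # fallback when no pattern is applicable


specific = (lambda text: text.lower().startswith("chapter 1."), "first_chapter")
general = (lambda text: text.lower().startswith("chapter"), "chapter")

print(first_matching_type([specific, general], "Chapter 1. Introduction"))
print(first_matching_type([general, specific], "Chapter 1. Introduction"))
print(first_matching_type([specific, general], "Plain paragraph"))
```

With the general pattern listed first, the specific one is never reached for this line, which is why more specific patterns should come first.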

class dedoc.structure_extractors.patterns.RegexpPattern(regexp: str, line_type: str, level_1: int | None = None, level_2: int | None = None, can_be_multiline: bool | str | None = None)[source]

Bases: AbstractPattern

Pattern for matching line text by a regular expression.

Note

The pattern is case-insensitive (uppercase and lowercase letters are not distinguished). Before regular expression matching, the line text is stripped (whitespace is removed from both sides).

See also

Syntax for writing regular expressions is described in the Python documentation.

Example of library usage:

import re
from dedoc.structure_extractors import DefaultStructureExtractor
from dedoc.structure_extractors.patterns import RegexpPattern

reader = ...
structure_extractor = DefaultStructureExtractor()
patterns = [
    RegexpPattern(regexp=r"^chapter\s\d+\.", line_type="chapter", level_1=1, can_be_multiline=False),
    RegexpPattern(regexp=re.compile(r"^part\s\d+\.\d+\."), line_type="part", level_1=2, can_be_multiline=False)
]
document = reader.read(file_path=file_path)
document = structure_extractor.extract(document=document, parameters={"patterns": patterns})

Example of API usage:

import requests

patterns = [{"name": "regexp", "regexp": r"^chapter\s\d+\.", "line_type": "chapter", "level_1": 1, "can_be_multiline": "false"}]
parameters = {"patterns": str(patterns)}
with open(file_path, "rb") as file:
    files = {"file": (file_name, file)}
    r = requests.post("http://localhost:1231/upload", files=files, data=parameters)

_name = 'regexp'

__init__(regexp: str, line_type: str, level_1: int | None = None, level_2: int | None = None, can_be_multiline: bool | str | None = None) → None[source]

Initialize pattern with default values of HierarchyLevel attributes.

Parameters:

regexp – regular expression for checking whether the line text matches the pattern. Note that the regular expression is applied to the lowercased and stripped line.
line_type – type of the line, e.g. “header”, “bullet_list_item”, “chapter”, etc.
level_1 – primary importance value of the line
level_2 – level of the line within its specific class
can_be_multiline – used by TreeConstructor to unify lines inside a tree node: if a line can be multiline, it can be joined with another line. If None is given, can_be_multiline is set to True.

get_hierarchy_level(line: LineWithMeta) → HierarchyLevel[source]

This method should be applied only when match() returned True for the given line.

Return HierarchyLevel for initialising line.metadata.hierarchy_level. The attributes line_type, level_1, level_2, can_be_multiline are equal to values given during class initialisation.

match(line: LineWithMeta) → bool[source]

Check if the pattern is suitable for the given line. The line text is stripped and made lowercase, then checked against the pattern’s regular expression.
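The normalisation described above (strip, then lowercase, then apply the expression) can be sketched with the standard re module. regexp_matches is a hypothetical helper, and applying the expression with re.match at the start of the text is an assumption of this sketch rather than confirmed library behaviour.

```python
import re


def regexp_matches(regexp: str, text: str) -> bool:
    # Assumption: the expression is applied at the beginning of the normalised text.
    return re.match(regexp, text.strip().lower()) is not None


print(regexp_matches(r"^chapter\s\d+\.", "  Chapter 12. Results  "))
print(regexp_matches(r"^chapter\s\d+\.", "See chapter 12."))
# An expression containing uppercase letters cannot match lowercased text in this sketch.
print(regexp_matches(r"^Chapter\s\d+\.", "Chapter 12. Results"))
```

Under this sketch it is safest to write expressions in lowercase, since the text they are applied to has already been lowercased.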

class dedoc.structure_extractors.patterns.StartWordPattern(start_word: str, line_type: str, level_1: int | None = None, level_2: int | None = None, can_be_multiline: bool | str | None = None)[source]

Bases: AbstractPattern

Pattern for lines that begin with some specific text (e.g. Introduction, Chapter, etc.).

Note

The pattern is case-insensitive (uppercase and lowercase letters are not distinguished). Before matching, the line text is stripped (whitespace is removed from both sides). The start word used for matching is also stripped and made lowercase.

Example of library usage:

from dedoc.structure_extractors import DefaultStructureExtractor
from dedoc.structure_extractors.patterns import StartWordPattern

reader = ...
structure_extractor = DefaultStructureExtractor()
patterns = [StartWordPattern(start_word="chapter", line_type="chapter", level_1=1, can_be_multiline=False)]
document = reader.read(file_path=file_path)
document = structure_extractor.extract(document=document, parameters={"patterns": patterns})

Example of API usage:

import requests

patterns = [{"name": "start_word", "start_word": "chapter", "line_type": "chapter", "level_1": 1, "can_be_multiline": "false"}]
parameters = {"patterns": str(patterns)}
with open(file_path, "rb") as file:
    files = {"file": (file_name, file)}
    r = requests.post("http://localhost:1231/upload", files=files, data=parameters)

_name = 'start_word'

__init__(start_word: str, line_type: str, level_1: int | None = None, level_2: int | None = None, can_be_multiline: bool | str | None = None) → None[source]

Initialize pattern with default values of HierarchyLevel attributes.

Parameters:

start_word – string that the line text should begin with. Note that start_word is stripped and made lowercase, and is compared with the lowercased and stripped line.
line_type – type of the line, e.g. “header”, “bullet_list_item”, “chapter”, etc.
level_1 – primary importance value of the line
level_2 – level of the line within its specific class
can_be_multiline – used by TreeConstructor to unify lines inside a tree node: if a line can be multiline, it can be joined with another line. If None is given, can_be_multiline is set to True.

get_hierarchy_level(line: LineWithMeta) → HierarchyLevel[source]

This method should be applied only when match() returned True for the given line.

Return HierarchyLevel for initialising line.metadata.hierarchy_level. The attributes line_type, level_1, level_2, can_be_multiline are equal to values given during class initialisation.

match(line: LineWithMeta) → bool[source]

Check if the pattern is suitable for the given line. The line text is stripped and made lowercase, then checked for whether it starts with the given start_word.
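The check described above amounts to a normalised prefix comparison. The following sketch uses a hypothetical helper, starts_with_word, and only mirrors the documented behaviour (both the line and the start word are stripped and lowercased); it is not the library's code.

```python
def starts_with_word(start_word: str, text: str) -> bool:
    # Both sides are stripped and lowercased before the prefix comparison.
    return text.strip().lower().startswith(start_word.strip().lower())


print(starts_with_word(" Chapter ", "  CHAPTER 3. Experiments"))
print(starts_with_word("chapter", "In this chapter we discuss"))
```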

Bases: AbstractPattern

TagPattern is a pattern that uses information from readers saved in line.metadata.tag_hierarchy_level. It can be useful for paragraph extraction from PDF documents and images, because PDF and image readers save information about paragraphs in line.metadata.tag_hierarchy_level.can_be_multiline.

See also

See Types of textual lines for information on which line types each reader can extract.

Example of library usage:

from dedoc.structure_extractors import DefaultStructureExtractor
from dedoc.structure_extractors.patterns import TagPattern

reader = ...
structure_extractor = DefaultStructureExtractor()
patterns = [TagPattern(default_line_type="raw_text")]
document = reader.read(file_path=file_path)
document = structure_extractor.extract(document=document, parameters={"patterns": patterns})

Example of API usage:

import requests

patterns = [{"name": "tag", "default_line_type": "raw_text"}]
parameters = {"patterns": str(patterns)}
with open(file_path, "rb") as file:
    files = {"file": (file_name, file)}
    r = requests.post("http://localhost:1231/upload", files=files, data=parameters)

_name = 'tag'

Initialize pattern for configuring values of HierarchyLevel attributes. It is recommended to configure default_* values in case line.metadata.tag_hierarchy_level misses some values. If you want to use values from line.metadata.tag_hierarchy_level, it is recommended to leave line_type, level_1, level_2, can_be_multiline empty.

can_be_multiline is filled in by PDF and image readers during paragraph detection, so if you want to extract paragraphs, you shouldn’t set can_be_multiline during pattern initialization.

Parameters:

line_type – type of the line; replaces line_type from tag_hierarchy_level if non-empty.
level_1 – primary importance value of the line; replaces level_1 from tag_hierarchy_level if non-empty.
level_2 – level of the line within its specific class; replaces level_2 from tag_hierarchy_level if non-empty.
can_be_multiline – used by TreeConstructor to unify lines inside a tree node: if a line can be multiline, it can be joined with another line. If not None, replaces can_be_multiline from tag_hierarchy_level.
default_line_type – type of the line; used when tag_hierarchy_level.line_type == “unknown”.
default_level_1 – primary importance value of the line; used when tag_hierarchy_level.level_1 is None.
default_level_2 – level of the line within its specific class; used when tag_hierarchy_level.level_2 is None.

get_hierarchy_level(line: LineWithMeta) → HierarchyLevel[source]

This method should be applied only when match() returned True for the given line.

Return HierarchyLevel for initialising line.metadata.hierarchy_level. The attribute line_type is initialized according to the following rules:

- if a non-empty line_type is given during pattern initialisation, its value is used in the result;
- if line_type is not given (or None is given) and the line_type in line.metadata.tag_hierarchy_level is not unknown, the line_type value from line.metadata.tag_hierarchy_level is used in the result;
- otherwise (line_type is empty and the line_type in line.metadata.tag_hierarchy_level is unknown), the default_line_type value is used in the result.

Similar rules apply to level_1 and level_2, with values compared against None instead of unknown.

The can_be_multiline attribute is initialized according to the following rules:

- if a non-empty can_be_multiline is given during pattern initialisation, its value is used in the result;
- otherwise, the can_be_multiline value from line.metadata.tag_hierarchy_level is used in the result.
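The resolution rules above can be restated as three small functions. This is a sketch of the documented rules only: the function names are hypothetical, and plain values stand in for the attributes of line.metadata.tag_hierarchy_level.

```python
from typing import Optional


def resolve_line_type(given: Optional[str], tag_value: str, default: Optional[str]) -> Optional[str]:
    if given:  # a non-empty value from pattern initialisation has priority
        return given
    if tag_value != "unknown":  # otherwise use the reader's value when it is known
        return tag_value
    return default  # finally fall back to default_line_type


def resolve_level(given: Optional[int], tag_value: Optional[int], default: Optional[int]) -> Optional[int]:
    # Same idea for level_1 and level_2, comparing with None instead of "unknown".
    if given is not None:
        return given
    if tag_value is not None:
        return tag_value
    return default


def resolve_can_be_multiline(given: Optional[bool], tag_value: bool) -> bool:
    # No default here: the reader's value is used unless one was given explicitly.
    return given if given is not None else tag_value


print(resolve_line_type(None, "unknown", "raw_text"))
print(resolve_level(None, 2, 3))
print(resolve_can_be_multiline(None, False))
```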

match(line: LineWithMeta) → bool[source]

Check if the pattern is suitable for the given line: line.metadata.tag_hierarchy_level should not be empty. line.metadata.tag_hierarchy_level is filled during the reading step; some readers may skip tag_hierarchy_level initialisation.

class dedoc.structure_extractors.patterns.BracketListPattern(line_type: str, level_1: int, level_2: int, can_be_multiline: bool | str | None = None)[source]

Bases: RegexpPattern

Pattern for matching numbered lists with brackets, e.g.

1) first element
2) second element
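For illustration, list items of this form can be recognised with a short regular expression. The expression and the helper bracket_item_number below are assumptions made for this sketch; the pattern's actual regular expression is not shown in this reference.

```python
import re
from typing import Optional

# Assumed shape of an item: digits, a closing bracket, then whitespace or end of line.
BRACKET_ITEM = re.compile(r"^(\d+)\)(?:\s|$)")


def bracket_item_number(text: str) -> Optional[int]:
    match = BRACKET_ITEM.match(text.strip())
    return int(match.group(1)) if match else None


print(bracket_item_number("1) first element"))
print(bracket_item_number("1. first element"))
```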

Example of library usage:

from dedoc.structure_extractors import DefaultStructureExtractor
from dedoc.structure_extractors.patterns import BracketListPattern

reader = ...
structure_extractor = DefaultStructureExtractor()
patterns = [BracketListPattern(line_type="list_item", level_1=1, level_2=1, can_be_multiline=False)]
document = reader.read(file_path=file_path)
document = structure_extractor.extract(document=document, parameters={"patterns": patterns})

Example of API usage:

import requests

patterns = [{"name": "bracket_list", "line_type": "list_item", "level_1": 1, "level_2": 1, "can_be_multiline": "false"}]
parameters = {"patterns": str(patterns)}
with open(file_path, "rb") as file:
    files = {"file": (file_name, file)}
    r = requests.post("http://localhost:1231/upload", files=files, data=parameters)

_name = 'bracket_list'

__init__(line_type: str, level_1: int, level_2: int, can_be_multiline: bool | str | None = None) → None[source]

Initialize pattern with default values of HierarchyLevel attributes.

Parameters:

line_type – type of the line, e.g. “header”, “bullet_list_item”, “chapter”, etc.
level_1 – primary importance value of the line
level_2 – level of the line within its specific class
can_be_multiline – used by TreeConstructor to unify lines inside a tree node: if a line can be multiline, it can be joined with another line. If None is given, can_be_multiline is set to True.

class dedoc.structure_extractors.patterns.BracketRomanListPattern(line_type: str, level_1: int, level_2: int, can_be_multiline: bool | str | None = None)[source]

Bases: RegexpPattern

Pattern for matching roman lists with brackets, e.g.

i) first item
ii) second item
iii) third item
iv) fourth item

Note

The pattern is case-insensitive (uppercase and lowercase letters are not distinguished).

Example of library usage:

from dedoc.structure_extractors import DefaultStructureExtractor
from dedoc.structure_extractors.patterns import BracketRomanListPattern

reader = ...
structure_extractor = DefaultStructureExtractor()
patterns = [BracketRomanListPattern(line_type="list_item", level_1=1, level_2=1, can_be_multiline=False)]
document = reader.read(file_path=file_path)
document = structure_extractor.extract(document=document, parameters={"patterns": patterns})

Example of API usage:

import requests

patterns = [{"name": "bracket_roman_list", "line_type": "list_item", "level_1": 1, "level_2": 1, "can_be_multiline": "false"}]
parameters = {"patterns": str(patterns)}
with open(file_path, "rb") as file:
    files = {"file": (file_name, file)}
    r = requests.post("http://localhost:1231/upload", files=files, data=parameters)

_name = 'bracket_roman_list'

__init__(line_type: str, level_1: int, level_2: int, can_be_multiline: bool | str | None = None) → None[source]

Initialize pattern with default values of HierarchyLevel attributes.

Parameters:

line_type – type of the line, e.g. “header”, “bullet_list_item”, “chapter”, etc.
level_1 – primary importance value of the line
level_2 – level of the line within its specific class
can_be_multiline – used by TreeConstructor to unify lines inside a tree node: if a line can be multiline, it can be joined with another line. If None is given, can_be_multiline is set to True.

class dedoc.structure_extractors.patterns.BulletListPattern(line_type: str, level_1: int, level_2: int, can_be_multiline: bool | str | None = None)[source]

Bases: RegexpPattern

Pattern for matching bulleted lists, e.g.

- first item
- second item

or with other bullet markers -, —, −, –, ®, ., •, ,, ‚, ©, ⎯, °, *, >, ●, ♣, ①, ▪, *, +.
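A check of this kind can be sketched as follows. The marker set is copied from the list above; requiring whitespace after the marker and the helper name looks_like_bullet_item are assumptions of this sketch, not documented behaviour.

```python
# Markers copied from the documentation; the rule "marker followed by whitespace" is assumed.
BULLET_MARKERS = ("-", "—", "−", "–", "®", ".", "•", ",", "‚", "©", "⎯", "°", "*", ">", "●", "♣", "①", "▪", "+")


def looks_like_bullet_item(text: str) -> bool:
    stripped = text.strip()
    return len(stripped) > 1 and stripped[0] in BULLET_MARKERS and stripped[1].isspace()


print(looks_like_bullet_item("- first item"))
print(looks_like_bullet_item("• second item"))
print(looks_like_bullet_item("-5 degrees"))
```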

Example of library usage:

from dedoc.structure_extractors import DefaultStructureExtractor
from dedoc.structure_extractors.patterns import BulletListPattern

reader = ...
structure_extractor = DefaultStructureExtractor()
patterns = [BulletListPattern(line_type="list_item", level_1=1, level_2=1, can_be_multiline=False)]
document = reader.read(file_path=file_path)
document = structure_extractor.extract(document=document, parameters={"patterns": patterns})

Example of API usage:

import requests

patterns = [{"name": "bullet_list", "line_type": "list_item", "level_1": 1, "level_2": 1, "can_be_multiline": "false"}]
parameters = {"patterns": str(patterns)}
with open(file_path, "rb") as file:
    files = {"file": (file_name, file)}
    r = requests.post("http://localhost:1231/upload", files=files, data=parameters)

_name = 'bullet_list'

__init__(line_type: str, level_1: int, level_2: int, can_be_multiline: bool | str | None = None) → None[source]

Initialize pattern with default values of HierarchyLevel attributes.

Parameters:

line_type – type of the line, e.g. “header”, “bullet_list_item”, “chapter”, etc.
level_1 – primary importance value of the line
level_2 – level of the line within its specific class
can_be_multiline – used by TreeConstructor to unify lines inside a tree node: if a line can be multiline, it can be joined with another line. If None is given, can_be_multiline is set to True.

class dedoc.structure_extractors.patterns.DottedListPattern(line_type: str, level_1: int, can_be_multiline: bool | str | None = None)[source]

Bases: RegexpPattern

Pattern for matching numbered lists with dots, e.g.

1. first element
1.1. first sub-element
1.2. second sub-element
2. second element

The number of dots is unlimited. This pattern has no level_2 parameter: level_2 is calculated as the count of numbers separated by dots, e.g.

1. → level_2=1
1.1 or 1.1. → level_2=2
1.2.3.4 or 1.2.3.4. → level_2=4
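The level_2 calculation illustrated above can be sketched with a regular expression that captures the dotted prefix and counts its numbers. The expression and the helper dotted_level_2 are assumptions of this sketch, not the pattern's actual implementation.

```python
import re
from typing import Optional

# Assumed prefix shape: a number with a dot, optionally followed by more dotted numbers,
# with an optional final dot, then whitespace or end of line.
DOTTED_PREFIX = re.compile(r"^(\d+\.(?:\d+\.)*\d*)(?=\s|$)")


def dotted_level_2(text: str) -> Optional[int]:
    match = DOTTED_PREFIX.match(text.strip())
    if match is None:
        return None
    # Count the non-empty numbers between the dots.
    return len([part for part in match.group(1).split(".") if part])


for sample in ["1. item", "1.1 item", "1.1. item", "1.2.3.4 item", "1.2.3.4. item"]:
    print(sample, dotted_level_2(sample))
```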

Example of library usage:

from dedoc.structure_extractors import DefaultStructureExtractor
from dedoc.structure_extractors.patterns import DottedListPattern

reader = ...
structure_extractor = DefaultStructureExtractor()
patterns = [DottedListPattern(line_type="list_item", level_1=1, can_be_multiline=False)]
document = reader.read(file_path=file_path)
document = structure_extractor.extract(document=document, parameters={"patterns": patterns})

Example of API usage:

import requests

patterns = [{"name": "dotted_list", "line_type": "list_item", "level_1": 1, "can_be_multiline": "false"}]
parameters = {"patterns": str(patterns)}
with open(file_path, "rb") as file:
    files = {"file": (file_name, file)}
    r = requests.post("http://localhost:1231/upload", files=files, data=parameters)

_name = 'dotted_list'

__init__(line_type: str, level_1: int, can_be_multiline: bool | str | None = None) → None[source]

Initialize pattern with default values of HierarchyLevel attributes.

Parameters:

line_type – type of the line, e.g. “header”, “bullet_list_item”, “chapter”, etc.
level_1 – primary importance value of the line
can_be_multiline – used by TreeConstructor to unify lines inside a tree node: if a line can be multiline, it can be joined with another line. If None is given, can_be_multiline is set to True.

get_hierarchy_level(line: LineWithMeta) → HierarchyLevel[source]

This method should be applied only when match() returned True for the given line.

Return HierarchyLevel for initialising line.metadata.hierarchy_level. The attributes line_type, level_1 and can_be_multiline are equal to the values given during class initialisation, while level_2 is calculated from the line text as the number of numbers between dots.

class dedoc.structure_extractors.patterns.LetterListPattern(line_type: str, level_1: int, level_2: int, can_be_multiline: bool | str | None = None)[source]

Bases: RegexpPattern

Pattern for matching lists with letters and brackets, e.g.

a) first element
b) second element

or (example for Armenian language)

ա) տեղաբաշխել
բ) Հայաստանի Հանրապետության
գ) սահմանապահ վերակարգերի

Note

The pattern is case-insensitive (uppercase and lowercase letters are not distinguished).
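A language-independent check for such items can be sketched with str.isalpha, which recognises letters of any alphabet, including Armenian. The helper letter_marker and the exact rule (one letter immediately followed by a closing bracket) are assumptions of this sketch, not the pattern's actual regular expression.

```python
from typing import Optional


def letter_marker(text: str) -> Optional[str]:
    stripped = text.strip().lower()  # lowercasing makes the check case-insensitive
    if len(stripped) >= 2 and stripped[0].isalpha() and stripped[1] == ")":
        return stripped[0]
    return None


print(letter_marker("a) first element"))
print(letter_marker("ա) տեղաբաշխել"))
print(letter_marker("1) first element"))
```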

Example of library usage:

from dedoc.structure_extractors import DefaultStructureExtractor
from dedoc.structure_extractors.patterns import LetterListPattern

reader = ...
structure_extractor = DefaultStructureExtractor()
patterns = [LetterListPattern(line_type="list_item", level_1=1, level_2=1, can_be_multiline=False)]
document = reader.read(file_path=file_path)
document = structure_extractor.extract(document=document, parameters={"patterns": patterns})

Example of API usage:

import requests

patterns = [{"name": "letter_list", "line_type": "list_item", "level_1": 1, "level_2": 1, "can_be_multiline": "false"}]
parameters = {"patterns": str(patterns)}
with open(file_path, "rb") as file:
    files = {"file": (file_name, file)}
    r = requests.post("http://localhost:1231/upload", files=files, data=parameters)

_name = 'letter_list'

__init__(line_type: str, level_1: int, level_2: int, can_be_multiline: bool | str | None = None) → None[source]

Initialize pattern with default values of HierarchyLevel attributes.

Parameters:

line_type – type of the line, e.g. “header”, “bullet_list_item”, “chapter”, etc.
level_1 – value of the line’s primary importance
level_2 – level of the line inside a specific class
can_be_multiline – used by TreeConstructor to merge lines inside a tree node: a line that can be multiline may be joined with another line. If None is given, can_be_multiline is set to True.

The regular expression is predefined by the class and is not a constructor argument. As in RegexpPattern, it is applied to the lowercased and stripped line.
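
The signature accepts can_be_multiline as bool, str, or None, and the API example passes it as the string "false". How dedoc converts such strings is not described here, so the helper below is only an assumption: a hypothetical normalize_can_be_multiline that reads "true" and "false" case-insensitively and maps None to True, the documented default.

```python
from typing import Optional, Union


def normalize_can_be_multiline(value: Optional[Union[bool, str]]) -> bool:
    """Hypothetical helper: turn bool | str | None into a plain bool."""
    if value is None:
        return True  # documented default when None is given
    if isinstance(value, bool):
        return value
    text = value.strip().lower()
    if text in ("true", "false"):
        return text == "true"
    raise ValueError(f"unexpected can_be_multiline value: {value!r}")
```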

class dedoc.structure_extractors.patterns.RomanListPattern(line_type: str, level_1: int, level_2: int, can_be_multiline: bool | str | None = None)[source]

Bases: RegexpPattern

Pattern for matching lists numbered with Roman numerals followed by dots, e.g.

I. first item
II. second item
III. third item
IV. fourth item

Note

The pattern is case-insensitive (uppercase and lowercase letters are not distinguished).
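
A minimal sketch of such a check, assuming a deliberately loose regular expression that is not the one dedoc ships: it accepts any run of Roman-numeral letters followed by a dot and does not verify that the run is a valid numeral. ROMAN_ITEM and looks_like_roman_item are hypothetical names.

```python
import re

# Hypothetical regular expression, not dedoc's own: one or more Roman-numeral
# letters, a dot, then whitespace or the end of the line.
ROMAN_ITEM = re.compile(r"^[ivxlcdm]+\.(\s|$)")


def looks_like_roman_item(text: str) -> bool:
    """Apply the pattern to the stripped, lowercased line text."""
    return ROMAN_ITEM.match(text.strip().lower()) is not None
```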

Example of library usage:

from dedoc.structure_extractors import DefaultStructureExtractor
from dedoc.structure_extractors.patterns import RomanListPattern

reader = ...
structure_extractor = DefaultStructureExtractor()
patterns = [RomanListPattern(line_type="list_item", level_1=1, level_2=1, can_be_multiline=False)]
document = reader.read(file_path=file_path)
document = structure_extractor.extract(document=document, parameters={"patterns": patterns})

Example of API usage:

import requests

patterns = [{"name": "roman_list", "line_type": "list_item", "level_1": 1, "level_2": 1, "can_be_multiline": "false"}]
parameters = {"patterns": str(patterns)}
with open(file_path, "rb") as file:
    files = {"file": (file_name, file)}
    r = requests.post("http://localhost:1231/upload", files=files, data=parameters)

_name = 'roman_list'

__init__(line_type: str, level_1: int, level_2: int, can_be_multiline: bool | str | None = None) → None[source]

Initialize pattern with default values of HierarchyLevel attributes.

Parameters:

line_type – type of the line, e.g. “header”, “bullet_list_item”, “chapter”, etc.
level_1 – value of the line’s primary importance
level_2 – level of the line inside a specific class
can_be_multiline – used by TreeConstructor to merge lines inside a tree node: a line that can be multiline may be joined with another line. If None is given, can_be_multiline is set to True.

The regular expression is predefined by the class and is not a constructor argument. As in RegexpPattern, it is applied to the lowercased and stripped line.

class dedoc.structure_extractors.patterns.TagHeaderPattern(line_type: str | None = None, level_1: int | None = None, level_2: int | None = None, can_be_multiline: bool | str | None = None, default_line_type: str = 'header', default_level_1: int = 1, default_level_2: int | None = None)[source]

Bases: TagPattern

Pattern that uses the information about heading lines (header) that readers save in line.metadata.tag_hierarchy_level. It can also calculate level_2 from the dotted list depth (as DottedListPattern does) if level_2, tag_hierarchy_level.level_2 and default_level_2 are all empty.
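
The exact rule for the dotted list depth belongs to DottedListPattern and is not restated here. One plausible reading, shown purely as an assumption, is that the depth equals the number of dot-separated numbers at the start of the line. The helper dotted_depth below is hypothetical and not part of dedoc.

```python
import re
from typing import Optional

# Assumed shape of a dotted list prefix: "1.", "1.2.", "2.1.3" and so on.
DOTTED_PREFIX = re.compile(r"^(\d+\.)+(\d+)?(\s|$)")


def dotted_depth(text: str) -> Optional[int]:
    """Count the numbers in a leading dotted prefix, or return None."""
    match = DOTTED_PREFIX.match(text.strip())
    if match is None:
        return None
    return len(re.findall(r"\d+", match.group(0)))
```

Under this assumption a heading such as "2.1.3 Details" would get a depth of 3, while a line without a dotted prefix yields None.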

See also

Please see Types of textual lines to find out which readers can extract lines with type “header”.

Example of library usage:

from dedoc.structure_extractors import DefaultStructureExtractor
from dedoc.structure_extractors.patterns import TagHeaderPattern

reader = ...
structure_extractor = DefaultStructureExtractor()
patterns = [TagHeaderPattern(line_type="header", level_1=1, can_be_multiline=False)]
document = reader.read(file_path=file_path)
document = structure_extractor.extract(document=document, parameters={"patterns": patterns})

Example of API usage:

import requests

patterns = [{"name": "tag_header", "line_type": "header", "level_1": 1, "can_be_multiline": "False"}]
parameters = {"patterns": str(patterns)}
with open(file_path, "rb") as file:
    files = {"file": (file_name, file)}
    r = requests.post("http://localhost:1231/upload", files=files, data=parameters)

_name = 'tag_header'

Initialize pattern for configuring values of HierarchyLevel attributes. It is recommended to configure default_* values in case line.metadata.tag_hierarchy_level is missing some values. If you want to use values from line.metadata.tag_hierarchy_level, it is recommended to leave line_type, level_1, level_2, can_be_multiline empty.

can_be_multiline is filled in by the PDF and image readers during paragraph detection, so if you want to extract paragraphs, you shouldn’t set can_be_multiline during pattern initialization.

Parameters:

line_type – type of the line; replaces line_type from tag_hierarchy_level if non-empty.
level_1 – value of the line’s primary importance; replaces level_1 from tag_hierarchy_level if non-empty.
level_2 – level of the line inside a specific class; replaces level_2 from tag_hierarchy_level if non-empty.
can_be_multiline – used by TreeConstructor to merge lines inside a tree node: a line that can be multiline may be joined with another line. If not None, replaces can_be_multiline from tag_hierarchy_level.
default_line_type – type of the line; used when tag_hierarchy_level.line_type == “unknown”.
default_level_1 – value of the line’s primary importance; used when tag_hierarchy_level.level_1 is None.
default_level_2 – level of the line inside a specific class; used when tag_hierarchy_level.level_2 is None.

get_hierarchy_level(line: LineWithMeta) → HierarchyLevel[source]

This method should be applied only when match() returned True for the given line.

Return HierarchyLevel for initialising line.metadata.hierarchy_level. The attribute line_type is initialised according to the following rules:

if a non-empty line_type is given during pattern initialisation, its value is used in the result;
if line_type is not given (or None is given) and line.metadata.tag_hierarchy_level.line_type is not “unknown”, the line_type value from line.metadata.tag_hierarchy_level is used in the result;
otherwise (line_type is empty and line.metadata.tag_hierarchy_level.line_type is “unknown”), the default_line_type value is used in the result.

The same rules apply to level_1 and level_2, except that the values are compared with None instead of “unknown”.

The can_be_multiline attribute is initialized according to the following rules:

if non-empty can_be_multiline is given during pattern initialisation, then its value is used in the result;
otherwise can_be_multiline value from line.metadata.tag_hierarchy_level is used in the result.
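
The precedence rules above can be sketched in plain Python. This is an illustration under stated assumptions, not dedoc's implementation: Level is a hypothetical stand-in for HierarchyLevel, resolve and first_not_none are made-up helpers, and the dotted-list fallback for level_2 is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Level:
    """Hypothetical stand-in for HierarchyLevel, for illustration only."""
    line_type: str
    level_1: Optional[int]
    level_2: Optional[int]
    can_be_multiline: bool


def first_not_none(*values):
    """Return the first value that is not None, or None if all are."""
    return next((v for v in values if v is not None), None)


def resolve(tag: Level, line_type=None, level_1=None, level_2=None,
            can_be_multiline=None, default_line_type="header",
            default_level_1=1, default_level_2=None) -> Level:
    if line_type:                       # explicit, non-empty value wins
        resolved_type = line_type
    elif tag.line_type != "unknown":    # then the value saved by the reader
        resolved_type = tag.line_type
    else:                               # finally the configured default
        resolved_type = default_line_type
    return Level(
        line_type=resolved_type,
        level_1=first_not_none(level_1, tag.level_1, default_level_1),
        level_2=first_not_none(level_2, tag.level_2, default_level_2),
        can_be_multiline=first_not_none(can_be_multiline, tag.can_be_multiline),
    )
```

Comparing against None rather than relying on truthiness matters for can_be_multiline, because an explicit False must override the reader's value.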

match(line: LineWithMeta) → bool[source]

Check if the pattern is suitable for the given line:

line.metadata.tag_hierarchy_level should not be empty;
line.metadata.tag_hierarchy_level.line_type == "header"

line.metadata.tag_hierarchy_level is filled in during the reading step. Please see Types of textual lines to find out which readers can extract lines with type “header”.
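
The two conditions can be expressed as a short check. The sketch below uses SimpleNamespace objects in place of LineWithMeta and assumes that "empty" means None; tag_matches and make_line are hypothetical helpers, not dedoc functions.

```python
from types import SimpleNamespace


def tag_matches(line, expected_type: str = "header") -> bool:
    """Return True if the reader tagged the line with the expected type."""
    tag = line.metadata.tag_hierarchy_level
    return tag is not None and tag.line_type == expected_type


def make_line(tag_type):
    """Build a minimal stand-in for a line carrying reader metadata."""
    tag = None if tag_type is None else SimpleNamespace(line_type=tag_type)
    return SimpleNamespace(metadata=SimpleNamespace(tag_hierarchy_level=tag))
```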

class dedoc.structure_extractors.patterns.TagListPattern(line_type: str | None = None, level_1: int | None = None, level_2: int | None = None, can_be_multiline: bool | str | None = None, default_line_type: str = 'list_item', default_level_1: int = 2, default_level_2: int | None = None)[source]

Bases: TagPattern

Pattern that uses the information about list item lines (list_item) that readers save in line.metadata.tag_hierarchy_level. It can also calculate level_2 from the dotted list depth (as DottedListPattern does) if level_2, tag_hierarchy_level.level_2 and default_level_2 are all empty.

See also

Please see Types of textual lines to find out which readers can extract lines with type “list_item”.

Example of library usage:

from dedoc.structure_extractors import DefaultStructureExtractor
from dedoc.structure_extractors.patterns import TagListPattern

reader = ...
structure_extractor = DefaultStructureExtractor()
patterns = [TagListPattern(line_type="list_item", default_level_1=2, can_be_multiline=False)]
document = reader.read(file_path=file_path)
document = structure_extractor.extract(document=document, parameters={"patterns": patterns})

Example of API usage:

import requests

patterns = [{"name": "tag_list", "line_type": "list_item", "default_level_1": 2, "can_be_multiline": "False"}]
parameters = {"patterns": str(patterns)}
with open(file_path, "rb") as file:
    files = {"file": (file_name, file)}
    r = requests.post("http://localhost:1231/upload", files=files, data=parameters)

_name = 'tag_list'

Initialize pattern for configuring values of HierarchyLevel attributes. It is recommended to configure default_* values in case line.metadata.tag_hierarchy_level is missing some values. If you want to use values from line.metadata.tag_hierarchy_level, it is recommended to leave line_type, level_1, level_2, can_be_multiline empty.

can_be_multiline is filled in by the PDF and image readers during paragraph detection, so if you want to extract paragraphs, you shouldn’t set can_be_multiline during pattern initialization.

Parameters:

line_type – type of the line; replaces line_type from tag_hierarchy_level if non-empty.
level_1 – value of the line’s primary importance; replaces level_1 from tag_hierarchy_level if non-empty.
level_2 – level of the line inside a specific class; replaces level_2 from tag_hierarchy_level if non-empty.
can_be_multiline – used by TreeConstructor to merge lines inside a tree node: a line that can be multiline may be joined with another line. If not None, replaces can_be_multiline from tag_hierarchy_level.
default_line_type – type of the line; used when tag_hierarchy_level.line_type == “unknown”.
default_level_1 – value of the line’s primary importance; used when tag_hierarchy_level.level_1 is None.
default_level_2 – level of the line inside a specific class; used when tag_hierarchy_level.level_2 is None.

get_hierarchy_level(line: LineWithMeta) → HierarchyLevel[source]

This method should be applied only when match() returned True for the given line.

Return HierarchyLevel for initialising line.metadata.hierarchy_level. The attribute line_type is initialised according to the following rules:

if a non-empty line_type is given during pattern initialisation, its value is used in the result;
if line_type is not given (or None is given) and line.metadata.tag_hierarchy_level.line_type is not “unknown”, the line_type value from line.metadata.tag_hierarchy_level is used in the result;
otherwise (line_type is empty and line.metadata.tag_hierarchy_level.line_type is “unknown”), the default_line_type value is used in the result.

The same rules apply to level_1 and level_2, except that the values are compared with None instead of “unknown”.

The can_be_multiline attribute is initialized according to the following rules:

if non-empty can_be_multiline is given during pattern initialisation, then its value is used in the result;
otherwise can_be_multiline value from line.metadata.tag_hierarchy_level is used in the result.

match(line: LineWithMeta) → bool[source]

Check if the pattern is suitable for the given line:

line.metadata.tag_hierarchy_level should not be empty;
line.metadata.tag_hierarchy_level.line_type == "list_item"

line.metadata.tag_hierarchy_level is filled in during the reading step. Please see Types of textual lines to find out which readers can extract lines with type “list_item”.
