dedoc.structure_extractors
- class dedoc.structure_extractors.AbstractStructureExtractor(*, config: dict | None = None)[source]
This class adds information to the given unstructured document (a list of lines) received from one of the readers: the types of the lines (paragraph_type) and their levels (hierarchy_level) in the document.
The hierarchy level of a line shows its importance in the document: the more important the line, the lower its level value. See the class dedoc.data_structures.HierarchyLevel for more information.
The paragraph type of a line should be one of the predefined types for a certain document domain, e.g. header, list_item, raw_text, etc. Each concrete structure extractor defines the rules of structuring: the levels and possible types of the lines.
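As a rough illustration of the ordering described above, the sketch below models a hierarchy level as a `(level_1, level_2)` pair and compares pairs lexicographically. `SketchLevel` is a hypothetical stand-in, not the dedoc `HierarchyLevel` class, and the tuple comparison is an assumption about how importance could be ordered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchLevel:
    """Hypothetical stand-in for a hierarchy level (not the dedoc class)."""
    level_1: int
    level_2: int
    line_type: str

    def is_more_important_than(self, other: "SketchLevel") -> bool:
        # Assumption: a smaller (level_1, level_2) pair means a more important line.
        return (self.level_1, self.level_2) < (other.level_1, other.level_2)


header = SketchLevel(level_1=1, level_2=1, line_type="header")
list_item = SketchLevel(level_1=2, level_2=1, line_type="list_item")
nested_item = SketchLevel(level_1=2, level_2=2, line_type="list_item")

print(header.is_more_important_than(list_item))       # True
print(list_item.is_more_important_than(nested_item))  # True
```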
- __init__(*, config: dict | None = None) None[source]
- Parameters:
config – configuration of the extractor, e.g. logger for logging
- abstract extract(document: UnstructuredDocument, parameters: dict | None = None) UnstructuredDocument[source]
This method extracts structure for the document content received from a reader: it finds line types and their hierarchy levels and adds them to the lines’ metadata.
- Parameters:
document – document content that has been received from some of the readers
parameters – additional parameters for document parsing, see Structure type configuring for more details
- Returns:
document content with added information about line types and hierarchy levels
- class dedoc.structure_extractors.StructureExtractorComposition(extractors: Dict[str, AbstractStructureExtractor], default_key: str, *, config: dict | None = None)[source]
Bases: AbstractStructureExtractor
This class extracts structure from any document using the available structure extractors. The structure extractors and the names of their document types are set via the class constructor. Each document type defines a specific document domain, whose structure is extracted by the corresponding structure extractor.
- __init__(extractors: Dict[str, AbstractStructureExtractor], default_key: str, *, config: dict | None = None) None[source]
- Parameters:
extractors – mapping document_type -> structure extractor, defined for certain document domains
default_key – the document_type of the structure extractor that is used by default if wrong parameters are given. default_key must exist as a key in the extractors dictionary.
- extract(document: UnstructuredDocument, parameters: dict | None = None) UnstructuredDocument[source]
Adds information about the document structure according to the document type received from parameters (the key document_type). If there is no document_type key in parameters or this document_type isn’t found among the supported types, the default extractor is used. To get the information about the method’s parameters look at the documentation of the class AbstractStructureExtractor.
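The fallback behaviour described above can be sketched without dedoc itself. In the sketch below, `choose_extractor` and the lambda "extractors" are hypothetical stand-ins that only illustrate the lookup by `document_type` with a default key; they are not the library's implementation.

```python
from typing import Callable, Dict, Optional

# Hypothetical stand-in: each "extractor" is just a function that labels a document.
Extractor = Callable[[str], str]


def choose_extractor(extractors: Dict[str, Extractor], default_key: str, parameters: Optional[dict] = None) -> Extractor:
    """Pick the extractor for parameters["document_type"], falling back to the default one."""
    parameters = parameters or {}
    document_type = parameters.get("document_type", default_key)
    # Unknown document types also fall back to the default extractor.
    return extractors.get(document_type, extractors[default_key])


extractors = {
    "other": lambda text: f"other:{text}",
    "law": lambda text: f"law:{text}",
}

print(choose_extractor(extractors, "other", {"document_type": "law"})("doc"))      # law:doc
print(choose_extractor(extractors, "other", {"document_type": "unknown"})("doc"))  # other:doc
print(choose_extractor(extractors, "other")("doc"))                                # other:doc
```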
- class dedoc.structure_extractors.DefaultStructureExtractor(*, config: dict | None = None)[source]
Bases: AbstractStructureExtractor
This class corresponds to the basic structure extraction from documents.
You can find the description of this type of structure in the section Default document structure type.
- document_type = 'other'
- extract(document: UnstructuredDocument, parameters: dict | None = None) UnstructuredDocument[source]
Extract basic structure from the given document and add additional information to the lines’ metadata. To get the information about the method’s parameters look at the documentation of the class AbstractStructureExtractor.
The parameters parameter can contain patterns for configuring line types and their levels in the output document tree (“patterns” key). Please see Patterns for DefaultStructureExtractor and Configure structure extraction using patterns for information on how to use patterns to build your custom structure.
- class dedoc.structure_extractors.AbstractLawStructureExtractor(*, config: dict | None = None)[source]
Bases: AbstractStructureExtractor, ABC
This class is used for extracting structure from laws.
You can find the description of this type of structure in the section Law structure type.
- extract(document: UnstructuredDocument, parameters: dict | None = None) UnstructuredDocument[source]
Extract law structure from the given document and add additional information to the lines’ metadata. To get the information about the method’s parameters look at the documentation of the class AbstractStructureExtractor.
- class dedoc.structure_extractors.ClassifyingLawStructureExtractor(extractors: Dict[str, AbstractStructureExtractor], *, config: dict | None = None)[source]
Bases: AbstractStructureExtractor, ABC
This class is used to dynamically classify laws into two types: laws and foiv. The specific extractors are called according to the classification results.
- document_type = 'law'
- __init__(extractors: Dict[str, AbstractStructureExtractor], *, config: dict | None = None) None[source]
- Parameters:
extractors – mapping law_type -> structure extractor, defined for certain law types
config – configuration of the extractor, e.g. logger for logging
- extract(document: UnstructuredDocument, parameters: dict | None = None) UnstructuredDocument[source]
Classify the law kind and extract structure according to the specific law format. To get the information about the method’s parameters look at the documentation of the class AbstractStructureExtractor.
- class dedoc.structure_extractors.LawStructureExtractor(*, config: dict | None = None)[source]
Bases: AbstractLawStructureExtractor
This class is used for extracting structure from common laws.
You can find the description of this type of structure in the section Simple law structure type.
- document_type = 'law'
- class dedoc.structure_extractors.FoivLawStructureExtractor(*, config: dict | None = None)[source]
Bases: AbstractLawStructureExtractor
This class is used for extracting structure from the foiv type of law.
You can find the description of this type of structure in the section Foiv law structure type.
- document_type = 'foiv_law'
- class dedoc.structure_extractors.DiplomaStructureExtractor(*, config: dict | None = None)[source]
Bases: AbstractStructureExtractor
This class is used for extracting structure from Russian diplomas, master's dissertations, theses, etc.
You can find the description of this type of structure in the section Diploma structure type.
- document_type = 'diploma'
- extract(document: UnstructuredDocument, parameters: dict | None = None) UnstructuredDocument[source]
Extract diploma structure from the given document and add additional information to the lines’ metadata. To get the information about the method’s parameters look at the documentation of the class AbstractStructureExtractor.
- class dedoc.structure_extractors.TzStructureExtractor(*, config: dict | None = None)[source]
Bases: AbstractStructureExtractor
This class is used for extracting structure from technical tasks.
You can find the description of this type of structure in the section Technical specification structure type.
- document_type = 'tz'
- extract(document: UnstructuredDocument, parameters: dict | None = None) UnstructuredDocument[source]
Extract technical task structure from the given document and add additional information to the lines’ metadata. To get the information about the method’s parameters look at the documentation of the class AbstractStructureExtractor.
- class dedoc.structure_extractors.ArticleStructureExtractor(*, config: dict | None = None)[source]
Bases: AbstractStructureExtractor
This class corresponds to the GROBID article structure extraction.
This class saves all tag_hierarchy_levels received from the ArticleReader without using the postprocessing step (without using regular expressions).
You can find the description of this type of structure in the section Article structure type (GROBID).
- document_type = 'article'
- extract(document: UnstructuredDocument, parameters: dict | None = None) UnstructuredDocument[source]
Extract article structure from the given document and add additional information to the lines’ metadata. To get the information about the method’s parameters look at the documentation of the class AbstractStructureExtractor.
- class dedoc.structure_extractors.FintocStructureExtractor(*, config: dict | None = None)[source]
Bases: AbstractStructureExtractor
This class is an implementation of the TOC extractor for the FinTOC 2022 Shared task. The code is a modification of the winner’s solution (ISP RAS team).
This structure extractor is used for English, French and Spanish financial prospectuses in PDF format (with a textual layer). It is recommended to use PdfTxtlayerReader to obtain document lines. You can find a more detailed description of this type of structure in the section FinTOC structure type.
- document_type = 'fintoc'
- extract(document: UnstructuredDocument, parameters: dict | None = None, file_path: str | None = None) UnstructuredDocument[source]
According to the FinTOC 2022 title detection task, lines are classified as titles and non-titles. The information about titles is saved in line.metadata.hierarchy_level (HierarchyLevel class):
- Title lines have the HierarchyLevel.header type, and their depth (HierarchyLevel.level_2) is similar to the depth of the TOC item from the FinTOC 2022 TOC generation task.
- Non-title lines have the HierarchyLevel.raw_text type, and their depth isn’t obtained.
- Parameters:
document – document content that has been received from one of the readers (PdfTxtlayerReader is recommended).
parameters – for this structure extractor, the “language” parameter sets the document’s language, e.g. parameters={"language": "en"}. The following options are supported:
- “en”, “eng” - English (default);
- “fr”, “fra” - French;
- “sp”, “spa” - Spanish.
file_path – path to the file on disk.
- Returns:
document content with added additional information about title/non-title lines and hierarchy levels of titles.
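To make the “language” options concrete, here is a minimal sketch of how the codes listed above could be resolved to a language, with English as the default. `resolve_language` and `LANGUAGE_CODES` are hypothetical names, and raising `ValueError` for an unsupported code is an assumption of this sketch; the source does not say how the extractor reacts to unknown codes.

```python
from typing import Optional

# Codes taken from the parameter description above; the mapping itself is illustrative.
LANGUAGE_CODES = {
    "en": "English", "eng": "English",
    "fr": "French", "fra": "French",
    "sp": "Spanish", "spa": "Spanish",
}


def resolve_language(parameters: Optional[dict] = None) -> str:
    """Return the language name for parameters["language"], defaulting to English."""
    code = (parameters or {}).get("language", "en")
    if code not in LANGUAGE_CODES:
        # Assumption of this sketch: unsupported codes are rejected explicitly.
        raise ValueError(f"Unsupported language code: {code!r}")
    return LANGUAGE_CODES[code]


print(resolve_language({"language": "fra"}))  # French
print(resolve_language())                     # English
```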
Patterns for DefaultStructureExtractor
Structure patterns allow more flexible configuration of line types and levels during the structure extraction step.
They are useful only for DefaultStructureExtractor (in the API, when “document_type”=“other”).
Please see Configure structure extraction using patterns for examples of pattern usage.
- class dedoc.structure_extractors.patterns.abstract_pattern.AbstractPattern(line_type: str | None, level_1: int | None, level_2: int | None, can_be_multiline: bool | str | None)[source]
Base class for all patterns to configure structure extraction by DefaultStructureExtractor.
- _name = ''
- __init__(line_type: str | None, level_1: int | None, level_2: int | None, can_be_multiline: bool | str | None) None[source]
Initialize pattern with default values of HierarchyLevel attributes. They can be used in get_hierarchy_level() according to specific pattern logic.
- Parameters:
line_type – type of the line, e.g. “header”, “bullet_list_item”, “chapter”, etc.
level_1 – value of a line primary importance
level_2 – level of the line inside specific class
can_be_multiline – is used to unify lines inside a tree node by TreeConstructor: if a line can be multiline, it can be joined with another line. If None is given, can_be_multiline is set to True.
- abstract get_hierarchy_level(line: LineWithMeta) HierarchyLevel[source]
This method should be applied only when match() returned True for the given line.
Get HierarchyLevel for initialising the line.metadata.hierarchy_level attribute. Please see Hierarchy level for document lines to get more information about HierarchyLevel.
- abstract match(line: LineWithMeta) bool[source]
Check if the given line satisfies the pattern requirements. Line text, annotations or metadata (metadata.tag_hierarchy_level) can be used to decide whether the line matches the pattern.
- class dedoc.structure_extractors.patterns.pattern_composition.PatternComposition(patterns: List[AbstractPattern])[source]
Class for applying patterns to get line’s hierarchy level.
Example of usage:
    from dedoc.data_structures.line_with_meta import LineWithMeta
    from dedoc.structure_extractors.patterns import TagListPattern, TagPattern
    from dedoc.structure_extractors.patterns.pattern_composition import PatternComposition

    pattern_composition = PatternComposition(
        patterns=[
            TagListPattern(line_type="list_item", default_level_1=2, can_be_multiline=False),
            TagPattern(default_line_type="raw_text")
        ]
    )
    line = LineWithMeta(line="Some text")
    line.metadata.hierarchy_level = pattern_composition.get_hierarchy_level(line=line)
- __init__(patterns: List[AbstractPattern]) None[source]
Set the list of patterns to apply to lines.
Note: the order of the patterns is important. More specific patterns should go first. Otherwise, they may be ignored because an earlier, more general pattern is also applicable to the given line.
- Parameters:
patterns – list of patterns to apply to lines.
- get_hierarchy_level(line: LineWithMeta) HierarchyLevel[source]
Choose the suitable pattern from the list of patterns for applying to the given line. The first applicable pattern will be chosen. If no applicable pattern is found, the default raw_text HierarchyLevel is used as the result.
- Parameters:
line – line to get hierarchy level for.
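The effect of pattern order can be shown with a small dedoc-free sketch. Here a "pattern" is a hypothetical `(name, predicate)` pair and `first_matching` is an illustrative helper, not the `PatternComposition` implementation; it only demonstrates the first-applicable-wins rule and the `raw_text` fallback described above.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a pattern: (name, predicate over the line text).
SketchPattern = Tuple[str, Callable[[str], bool]]


def first_matching(patterns: List[SketchPattern], text: str, default: str = "raw_text") -> str:
    """Return the name of the first pattern whose predicate accepts the text."""
    for name, matches in patterns:
        if matches(text):
            return name
    # No pattern is applicable: fall back to the default type.
    return default


general = ("starts_with_digit", lambda text: text[:1].isdigit())
specific = ("dotted_number", lambda text: text[:2] == "1.")

print(first_matching([specific, general], "1. Intro"))  # dotted_number
print(first_matching([general, specific], "1. Intro"))  # starts_with_digit
print(first_matching([specific, general], "Intro"))     # raw_text
```

With the general pattern listed first, the specific one is never reached for this line, which is why more specific patterns should come earlier in the list.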
- class dedoc.structure_extractors.patterns.RegexpPattern(regexp: str, line_type: str, level_1: int | None = None, level_2: int | None = None, can_be_multiline: bool | str | None = None)[source]
Bases: AbstractPattern
Pattern for matching line text by a regular expression.
Note
The pattern is case-insensitive (lowercase and uppercase letters are not distinguished). Before regular expression matching, the line text is stripped (whitespace is removed from both sides).
See also
Syntax for writing regular expressions is described in the Python documentation.
Example of library usage:
    import re

    from dedoc.structure_extractors import DefaultStructureExtractor
    from dedoc.structure_extractors.patterns import RegexpPattern

    reader = ...
    structure_extractor = DefaultStructureExtractor()
    # Raw strings keep backslashes such as \s and \d intact in the regular expressions
    patterns = [
        RegexpPattern(regexp=r"^chapter\s\d+\.", line_type="chapter", level_1=1, can_be_multiline=False),
        RegexpPattern(regexp=re.compile(r"^part\s\d+\.\d+\."), line_type="part", level_1=2, can_be_multiline=False)
    ]
    document = reader.read(file_path=file_path)
    document = structure_extractor.extract(document=document, parameters={"patterns": patterns})
Example of API usage:
    import requests

    # Raw string keeps the backslashes of the regular expression intact
    patterns = [{"name": "regexp", "regexp": r"^chapter\s\d+\.", "line_type": "chapter", "level_1": 1, "can_be_multiline": "false"}]
    parameters = {"patterns": str(patterns)}
    with open(file_path, "rb") as file:
        files = {"file": (file_name, file)}
        r = requests.post("http://localhost:1231/upload", files=files, data=parameters)
- _name = 'regexp'
- __init__(regexp: str, line_type: str, level_1: int | None = None, level_2: int | None = None, can_be_multiline: bool | str | None = None) None[source]
Initialize pattern with default values of HierarchyLevel attributes.
- Parameters:
regexp – regular expression for checking whether the line text matches the pattern. Note that the regular expression is applied to the lowercase and stripped line.
line_type – type of the line, e.g. “header”, “bullet_list_item”, “chapter”, etc.
level_1 – value of a line primary importance
level_2 – level of the line inside specific class
can_be_multiline – is used to unify lines inside a tree node by TreeConstructor: if a line can be multiline, it can be joined with another line. If None is given, can_be_multiline is set to True.
- get_hierarchy_level(line: LineWithMeta) HierarchyLevel[source]
This method should be applied only when match() returned True for the given line.
Return HierarchyLevel for initialising line.metadata.hierarchy_level. The attributes line_type, level_1, level_2, can_be_multiline are equal to the values given during class initialisation.
- match(line: LineWithMeta) bool[source]
Check if the pattern is suitable for the given line. Line text is checked by applying pattern’s regular expression, text is stripped and made lowercase beforehand.
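A minimal sketch of the matching step described above, using only the standard `re` module. `sketch_regexp_match` is a hypothetical helper, and the use of `re.match` (matching from the start of the text) is an assumption; the source only states that the text is stripped and lowercased before the regular expression is applied.

```python
import re


def sketch_regexp_match(regexp: str, text: str) -> bool:
    """Strip and lowercase the text, then try the regular expression from its start."""
    normalized = text.strip().lower()
    return re.match(regexp, normalized) is not None


print(sketch_regexp_match(r"^chapter\s\d+\.", "  Chapter 1. Introduction "))  # True
print(sketch_regexp_match(r"^chapter\s\d+\.", "See chapter 1."))              # False
# In this sketch an uppercase letter in the expression can never match,
# because the text is lowercased first.
print(sketch_regexp_match(r"^Chapter\s\d+\.", "Chapter 1."))                  # False
```

Under these assumptions it is safest to write the regular expression itself in lowercase.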
- class dedoc.structure_extractors.patterns.StartWordPattern(start_word: str, line_type: str, level_1: int | None = None, level_2: int | None = None, can_be_multiline: bool | str | None = None)[source]
Bases: AbstractPattern
Pattern for lines that begin with some specific text (e.g. Introduction, Chapter, etc.).
Note
The pattern is case-insensitive (lowercase and uppercase letters are not distinguished). Before matching, the line text is stripped (whitespace is removed from both sides). The start word for matching is also stripped and made lowercase.
Example of library usage:
    from dedoc.structure_extractors import DefaultStructureExtractor
    from dedoc.structure_extractors.patterns import StartWordPattern

    reader = ...
    structure_extractor = DefaultStructureExtractor()
    patterns = [StartWordPattern(start_word="chapter", line_type="chapter", level_1=1, can_be_multiline=False)]
    document = reader.read(file_path=file_path)
    document = structure_extractor.extract(document=document, parameters={"patterns": patterns})
Example of API usage:
    import requests

    patterns = [{"name": "start_word", "start_word": "chapter", "line_type": "chapter", "level_1": 1, "can_be_multiline": "false"}]
    parameters = {"patterns": str(patterns)}
    with open(file_path, "rb") as file:
        files = {"file": (file_name, file)}
        r = requests.post("http://localhost:1231/upload", files=files, data=parameters)
- _name = 'start_word'
- __init__(start_word: str, line_type: str, level_1: int | None = None, level_2: int | None = None, can_be_multiline: bool | str | None = None) None[source]
Initialize pattern with default values of HierarchyLevel attributes.
- Parameters:
start_word – string for checking the beginning of the line text. Note that start_word is stripped and made lowercase, and is applied to the lowercase and stripped line.
line_type – type of the line, e.g. “header”, “bullet_list_item”, “chapter”, etc.
level_1 – value of a line primary importance
level_2 – level of the line inside specific class
can_be_multiline – is used to unify lines inside a tree node by TreeConstructor: if a line can be multiline, it can be joined with another line. If None is given, can_be_multiline is set to True.
- get_hierarchy_level(line: LineWithMeta) HierarchyLevel[source]
This method should be applied only when match() returned True for the given line.
Return HierarchyLevel for initialising line.metadata.hierarchy_level. The attributes line_type, level_1, level_2, can_be_multiline are equal to the values given during class initialisation.
- match(line: LineWithMeta) bool[source]
Check if the pattern is suitable for the given line. The line text is checked for starting with the given start_word; the text is stripped and made lowercase beforehand.
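The check described above amounts to a normalized prefix test. The sketch below uses a hypothetical helper, `sketch_start_word_match`, built on `str.startswith`; it illustrates the stated normalization (strip and lowercase on both sides) and is not the library's code.

```python
def sketch_start_word_match(start_word: str, text: str) -> bool:
    """Compare the stripped, lowercased start word with the start of the stripped, lowercased text."""
    return text.strip().lower().startswith(start_word.strip().lower())


print(sketch_start_word_match("Chapter ", "  CHAPTER 2 Methods"))  # True
print(sketch_start_word_match("chapter", "The chapter"))           # False
```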
- class dedoc.structure_extractors.patterns.TagPattern(line_type: str | None = None, level_1: int | None = None, level_2: int | None = None, can_be_multiline: bool | str | None = None, default_line_type: str = 'raw_text', default_level_1: int | None = None, default_level_2: int | None = None)[source]
Bases: AbstractPattern
Pattern for using information from readers saved in line.metadata.tag_hierarchy_level. Can be useful for paragraph extraction in PDF documents and images, because PDF and image readers save information about paragraphs in line.metadata.tag_hierarchy_level.can_be_multiline.
See also
Please see Types of textual lines if you need information on which line types can be extracted by each reader.
Example of library usage:
    from dedoc.structure_extractors import DefaultStructureExtractor
    from dedoc.structure_extractors.patterns import TagPattern

    reader = ...
    structure_extractor = DefaultStructureExtractor()
    patterns = [TagPattern(default_line_type="raw_text")]
    document = reader.read(file_path=file_path)
    document = structure_extractor.extract(document=document, parameters={"patterns": patterns})
Example of API usage:
    import requests

    patterns = [{"name": "tag", "default_line_type": "raw_text"}]
    parameters = {"patterns": str(patterns)}
    with open(file_path, "rb") as file:
        files = {"file": (file_name, file)}
        r = requests.post("http://localhost:1231/upload", files=files, data=parameters)
- _name = 'tag'
- __init__(line_type: str | None = None, level_1: int | None = None, level_2: int | None = None, can_be_multiline: bool | str | None = None, default_line_type: str = 'raw_text', default_level_1: int | None = None, default_level_2: int | None = None) None[source]
Initialize pattern for configuring values of HierarchyLevel attributes. It is recommended to configure default_* values in case line.metadata.tag_hierarchy_level misses some values. If you want to use values from line.metadata.tag_hierarchy_level, it is recommended to leave line_type, level_1, level_2, can_be_multiline empty.
can_be_multiline is filled in PDF and image readers during paragraph detection, so if you want to extract paragraphs, you shouldn’t set can_be_multiline during pattern initialization.
- Parameters:
line_type – type of the line, replaces line_type from tag_hierarchy_level if non-empty.
level_1 – value of a line primary importance, replaces level_1 from tag_hierarchy_level if non-empty.
level_2 – level of the line inside specific class, replaces level_2 from tag_hierarchy_level if non-empty.
can_be_multiline – is used to unify lines inside a tree node by TreeConstructor: if a line can be multiline, it can be joined with another line. If not None, replaces can_be_multiline from tag_hierarchy_level.
default_line_type – type of the line, is used when tag_hierarchy_level.line_type == “unknown”.
default_level_1 – value of a line primary importance, is used when tag_hierarchy_level.level_1 is None.
default_level_2 – level of the line inside specific class, is used when tag_hierarchy_level.level_2 is None.
- get_hierarchy_level(line: LineWithMeta) HierarchyLevel[source]
This method should be applied only when match() returned True for the given line.
Return HierarchyLevel for initialising line.metadata.hierarchy_level. The attribute line_type is initialized according to the following rules:
- if a non-empty line_type is given during pattern initialisation, its value is used in the result;
- if line_type is not given (or None is given) and the line type in line.metadata.tag_hierarchy_level is not unknown, the line_type value from line.metadata.tag_hierarchy_level is used in the result;
- otherwise (line_type is empty and the line type in line.metadata.tag_hierarchy_level is unknown) the default_line_type value is used in the result.
Similar rules work for level_1 and level_2, comparing with None instead of unknown.
The can_be_multiline attribute is initialized according to the following rules:
- if a non-empty can_be_multiline is given during pattern initialisation, its value is used in the result;
- otherwise the can_be_multiline value from line.metadata.tag_hierarchy_level is used in the result.
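The resolution rules above can be written out as a short dedoc-free sketch. `SketchLevel` and `resolve` are hypothetical names that mirror the parameters of the pattern's constructor; the code only restates the documented precedence (explicit value, then value from the tag, then default) and is not the library's implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchLevel:
    """Hypothetical stand-in for a hierarchy level filled by a reader."""
    line_type: str = "unknown"
    level_1: Optional[int] = None
    level_2: Optional[int] = None
    can_be_multiline: bool = True


def resolve(tag: SketchLevel, line_type: Optional[str] = None, level_1: Optional[int] = None,
            level_2: Optional[int] = None, can_be_multiline: Optional[bool] = None,
            default_line_type: str = "raw_text", default_level_1: Optional[int] = None,
            default_level_2: Optional[int] = None) -> SketchLevel:
    # line_type: explicit value, else the tag value unless it is "unknown", else the default.
    if line_type:
        resolved_type = line_type
    elif tag.line_type != "unknown":
        resolved_type = tag.line_type
    else:
        resolved_type = default_line_type

    def pick(explicit: Optional[int], from_tag: Optional[int], default: Optional[int]) -> Optional[int]:
        # Levels follow the same precedence, compared with None instead of "unknown".
        if explicit is not None:
            return explicit
        return from_tag if from_tag is not None else default

    return SketchLevel(
        line_type=resolved_type,
        level_1=pick(level_1, tag.level_1, default_level_1),
        level_2=pick(level_2, tag.level_2, default_level_2),
        # can_be_multiline has no default_* counterpart: explicit value or the tag value.
        can_be_multiline=tag.can_be_multiline if can_be_multiline is None else can_be_multiline,
    )


from_reader = SketchLevel(line_type="header", level_1=1, level_2=None, can_be_multiline=False)
print(resolve(from_reader, default_level_2=1))
print(resolve(SketchLevel(), default_level_1=5))
```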
- match(line: LineWithMeta) bool[source]
Check if the pattern is suitable for the given line: line.metadata.tag_hierarchy_level should not be empty. line.metadata.tag_hierarchy_level is filled during the reading step; some readers can skip tag_hierarchy_level initialisation.
- class dedoc.structure_extractors.patterns.BracketListPattern(line_type: str, level_1: int, level_2: int, can_be_multiline: bool | str | None = None)[source]
Bases: RegexpPattern
Pattern for matching numbered lists with brackets, e.g.
    1) first element
    2) second element
Example of library usage:
    from dedoc.structure_extractors import DefaultStructureExtractor
    from dedoc.structure_extractors.patterns import BracketListPattern

    reader = ...
    structure_extractor = DefaultStructureExtractor()
    patterns = [BracketListPattern(line_type="list_item", level_1=1, level_2=1, can_be_multiline=False)]
    document = reader.read(file_path=file_path)
    document = structure_extractor.extract(document=document, parameters={"patterns": patterns})
Example of API usage:
    import requests

    patterns = [{"name": "bracket_list", "line_type": "list_item", "level_1": 1, "level_2": 1, "can_be_multiline": "false"}]
    parameters = {"patterns": str(patterns)}
    with open(file_path, "rb") as file:
        files = {"file": (file_name, file)}
        r = requests.post("http://localhost:1231/upload", files=files, data=parameters)
- _name = 'bracket_list'
- __init__(line_type: str, level_1: int, level_2: int, can_be_multiline: bool | str | None = None) None[source]
Initialize pattern with default values of HierarchyLevel attributes.
- Parameters:
line_type – type of the line, e.g. “header”, “bullet_list_item”, “chapter”, etc.
level_1 – value of a line primary importance
level_2 – level of the line inside specific class
can_be_multiline – is used to unify lines inside a tree node by TreeConstructor: if a line can be multiline, it can be joined with another line. If None is given, can_be_multiline is set to True.
- class dedoc.structure_extractors.patterns.BracketRomanListPattern(line_type: str, level_1: int, level_2: int, can_be_multiline: bool | str | None = None)[source]
Bases: RegexpPattern
Pattern for matching roman lists with brackets, e.g.
    i) first item
    ii) second item
    iii) third item
    iv) fourth item
Note
The pattern is case-insensitive (lowercase and uppercase letters are not distinguished).
Example of library usage:
    from dedoc.structure_extractors import DefaultStructureExtractor
    from dedoc.structure_extractors.patterns import BracketRomanListPattern

    reader = ...
    structure_extractor = DefaultStructureExtractor()
    patterns = [BracketRomanListPattern(line_type="list_item", level_1=1, level_2=1, can_be_multiline=False)]
    document = reader.read(file_path=file_path)
    document = structure_extractor.extract(document=document, parameters={"patterns": patterns})
Example of API usage:
    import requests

    patterns = [{"name": "bracket_roman_list", "line_type": "list_item", "level_1": 1, "level_2": 1, "can_be_multiline": "false"}]
    parameters = {"patterns": str(patterns)}
    with open(file_path, "rb") as file:
        files = {"file": (file_name, file)}
        r = requests.post("http://localhost:1231/upload", files=files, data=parameters)
- _name = 'bracket_roman_list'
- __init__(line_type: str, level_1: int, level_2: int, can_be_multiline: bool | str | None = None) None[source]
Initialize pattern with default values of HierarchyLevel attributes.
- Parameters:
line_type – type of the line, e.g. “header”, “bullet_list_item”, “chapter”, etc.
level_1 – value of a line primary importance
level_2 – level of the line inside specific class
can_be_multiline – is used to unify lines inside a tree node by TreeConstructor: if a line can be multiline, it can be joined with another line. If None is given, can_be_multiline is set to True.
- class dedoc.structure_extractors.patterns.BulletListPattern(line_type: str, level_1: int, level_2: int, can_be_multiline: bool | str | None = None)[source]
Bases: RegexpPattern
Pattern for matching bulleted lists, e.g.
    - first item
    - second item
or with other bullet markers -, —, −, –, ®, ., •, ,, ‚, ©, ⎯, °, *, >, ●, ♣, ①, ▪, *, +.
Example of library usage:
    from dedoc.structure_extractors import DefaultStructureExtractor
    from dedoc.structure_extractors.patterns import BulletListPattern

    reader = ...
    structure_extractor = DefaultStructureExtractor()
    patterns = [BulletListPattern(line_type="list_item", level_1=1, level_2=1, can_be_multiline=False)]
    document = reader.read(file_path=file_path)
    document = structure_extractor.extract(document=document, parameters={"patterns": patterns})
Example of API usage:
    import requests

    patterns = [{"name": "bullet_list", "line_type": "list_item", "level_1": 1, "level_2": 1, "can_be_multiline": "false"}]
    parameters = {"patterns": str(patterns)}
    with open(file_path, "rb") as file:
        files = {"file": (file_name, file)}
        r = requests.post("http://localhost:1231/upload", files=files, data=parameters)
- _name = 'bullet_list'
- __init__(line_type: str, level_1: int, level_2: int, can_be_multiline: bool | str | None = None) None[source]
Initialize pattern with default values of HierarchyLevel attributes.
- Parameters:
line_type – type of the line, e.g. “header”, “bullet_list_item”, “chapter”, etc.
level_1 – value of a line primary importance
level_2 – level of the line inside specific class
can_be_multiline – is used to unify lines inside a tree node by TreeConstructor: if a line can be multiline, it can be joined with another line. If None is given, can_be_multiline is set to True.
- class dedoc.structure_extractors.patterns.DottedListPattern(line_type: str, level_1: int, can_be_multiline: bool | str | None = None)[source]
Bases: RegexpPattern
Pattern for matching numbered lists with dots, e.g.
    1. first element
    1.1. first sub-element
    1.2. second sub-element
    2. second element
The number of dots is unlimited. There is no level_2 parameter in this pattern; level_2 is calculated as the count of numbers separated by dots, e.g.
- 1. → level_2=1
- 1.1 or 1.1. → level_2=2
- 1.2.3.4 or 1.2.3.4. → level_2=4
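The level_2 calculation above can be reproduced with a short standalone sketch. `DOTTED_PREFIX` and `sketch_level_2` are hypothetical names, and the regular expression is an assumption chosen to accept the examples listed above; the expression actually used by the library may differ.

```python
import re
from typing import Optional

# Assumption: a dotted prefix is one or more "digits." groups, optionally followed by more digits.
DOTTED_PREFIX = re.compile(r"^(?:\d+\.)+\d*")


def sketch_level_2(text: str) -> Optional[int]:
    """Count the numbers in the dotted prefix of the line, or return None if there is no such prefix."""
    match = DOTTED_PREFIX.match(text.strip())
    if match is None:
        return None
    # Splitting "1.2.3." on dots leaves an empty trailing part, which is skipped.
    return len([part for part in match.group().split(".") if part])


print(sketch_level_2("1. first element"))        # 1
print(sketch_level_2("1.1. first sub-element"))  # 2
print(sketch_level_2("1.2.3.4 deep element"))    # 4
print(sketch_level_2("plain text"))              # None
```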
Example of library usage:
    from dedoc.structure_extractors import DefaultStructureExtractor
    from dedoc.structure_extractors.patterns import DottedListPattern

    reader = ...
    structure_extractor = DefaultStructureExtractor()
    patterns = [DottedListPattern(line_type="list_item", level_1=1, can_be_multiline=False)]
    document = reader.read(file_path=file_path)
    document = structure_extractor.extract(document=document, parameters={"patterns": patterns})
Example of API usage:
    import requests

    patterns = [{"name": "dotted_list", "line_type": "list_item", "level_1": 1, "can_be_multiline": "false"}]
    parameters = {"patterns": str(patterns)}
    with open(file_path, "rb") as file:
        files = {"file": (file_name, file)}
        r = requests.post("http://localhost:1231/upload", files=files, data=parameters)
- _name = 'dotted_list'
- __init__(line_type: str, level_1: int, can_be_multiline: bool | str | None = None) None[source]
Initialize pattern with default values of HierarchyLevel attributes.
- Parameters:
line_type – type of the line, e.g. “header”, “bullet_list_item”, “chapter”, etc.
level_1 – value of a line primary importance
can_be_multiline – is used to unify lines inside a tree node by TreeConstructor: if a line can be multiline, it can be joined with another line. If None is given, can_be_multiline is set to True.
- get_hierarchy_level(line: LineWithMeta) HierarchyLevel[source]
This method should be applied only when match() returned True for the given line.
Return HierarchyLevel for initialising line.metadata.hierarchy_level. The attributes line_type, level_1, can_be_multiline are equal to the values given during class initialisation; level_2 is calculated from the numbering at the start of the line.
- class dedoc.structure_extractors.patterns.LetterListPattern(line_type: str, level_1: int, level_2: int, can_be_multiline: bool | str | None = None)[source]
Bases: RegexpPattern
Pattern for matching lists with letters and brackets, e.g.
    a) first element
    b) second element
or (example for Armenian language)
    ա) տեղաբաշխել
    բ) Հայաստանի Հանրապետության
    գ) սահմանապահ վերակարգերի
Note
The pattern is case-insensitive (lowercase and uppercase letters are not distinguished).
Example of library usage:
    from dedoc.structure_extractors import DefaultStructureExtractor
    from dedoc.structure_extractors.patterns import LetterListPattern

    reader = ...
    structure_extractor = DefaultStructureExtractor()
    patterns = [LetterListPattern(line_type="list_item", level_1=1, level_2=1, can_be_multiline=False)]
    document = reader.read(file_path=file_path)
    document = structure_extractor.extract(document=document, parameters={"patterns": patterns})
Example of API usage:
    import requests

    patterns = [{"name": "letter_list", "line_type": "list_item", "level_1": 1, "level_2": 1, "can_be_multiline": "false"}]
    parameters = {"patterns": str(patterns)}
    with open(file_path, "rb") as file:
        files = {"file": (file_name, file)}
        r = requests.post("http://localhost:1231/upload", files=files, data=parameters)
- _name = 'letter_list'
- __init__(line_type: str, level_1: int, level_2: int, can_be_multiline: bool | str | None = None) None[source]
Initialize pattern with default values of HierarchyLevel attributes. The regular expression is predefined by the pattern (the constructor takes no regexp argument) and is applied to the lowercased and stripped line.
- Parameters:
line_type – type of the line, e.g. “header”, “bullet_list_item”, “chapter”, etc.
level_1 – value of the line's primary importance
level_2 – level of the line inside a specific class
can_be_multiline – is used by TreeConstructor to unify lines inside a tree node: if the line can be multiline, it can be joined with another line. If None is given, can_be_multiline is set to True.
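The `can_be_multiline` parameter is typed as `bool | str | None`, and the API examples pass it as a string such as "false". The sketch below shows one plausible way such a value could be reduced to a plain boolean. The helper `normalize_can_be_multiline` is hypothetical and is not part of dedoc; only the rule "None means True" comes from the documentation above, while the string handling is an assumption.

```python
from typing import Union


def normalize_can_be_multiline(value: Union[bool, str, None]) -> bool:
    """Hypothetical helper: map bool | str | None to a plain bool."""
    if value is None:
        # Documented default: None is treated as True.
        return True
    if isinstance(value, bool):
        return value
    # Assumption: strings are compared case-insensitively, so "false",
    # "False" and any other non-"true" string all map to False.
    return value.strip().lower() == "true"
```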
- class dedoc.structure_extractors.patterns.RomanListPattern(line_type: str, level_1: int, level_2: int, can_be_multiline: bool | str | None = None)[source]
Bases: RegexpPattern
Pattern for matching roman lists with dots, e.g.
I. first item
II. second item
III. third item
IV. fourth item
Note
The pattern is case-insensitive (lowercase and uppercase letters are not distinguished).
Example of library usage:
from dedoc.structure_extractors import DefaultStructureExtractor
from dedoc.structure_extractors.patterns import RomanListPattern

reader = ...
structure_extractor = DefaultStructureExtractor()
patterns = [RomanListPattern(line_type="list_item", level_1=1, level_2=1, can_be_multiline=False)]
document = reader.read(file_path=file_path)
document = structure_extractor.extract(document=document, parameters={"patterns": patterns})
Example of API usage:
import requests

patterns = [{"name": "roman_list", "line_type": "list_item", "level_1": 1, "level_2": 1, "can_be_multiline": "false"}]
parameters = {"patterns": str(patterns)}
with open(file_path, "rb") as file:
    files = {"file": (file_name, file)}
    r = requests.post("http://localhost:1231/upload", files=files, data=parameters)
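As an illustration of what "roman lists with dots" could mean in practice, the following standalone sketch recognises a leading roman numeral followed by a dot and converts it to an integer. The regular expression, the `ROMAN_VALUES` table and the function `roman_item_number` are assumptions for this example only and are not taken from dedoc. The sketch does not validate numerals (for instance, "iiii." is accepted), and an ordinary word made of these letters followed by a dot would also match.

```python
import re
from typing import Optional

# Hypothetical regexp (NOT dedoc's own): a run of roman numeral letters,
# a dot, then whitespace. It is applied to the lowercased text.
ROMAN_ITEM = re.compile(r"^([ivxlcdm]+)\.\s+")

ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_item_number(text: str) -> Optional[int]:
    """Return the integer value of a leading roman numeral, or None."""
    match = ROMAN_ITEM.match(text.strip().lower())
    if match is None:
        return None
    numeral = match.group(1)
    total = 0
    # Pair each letter with the one after it (a space marks the end).
    for current, following in zip(numeral, numeral[1:] + " "):
        value = ROMAN_VALUES[current]
        # Subtractive notation: a smaller value before a larger one is subtracted.
        if following != " " and value < ROMAN_VALUES[following]:
            total -= value
        else:
            total += value
    return total
```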
- _name = 'roman_list'
- __init__(line_type: str, level_1: int, level_2: int, can_be_multiline: bool | str | None = None) None[source]
Initialize pattern with default values of HierarchyLevel attributes. The regular expression is predefined by the pattern (the constructor takes no regexp argument) and is applied to the lowercased and stripped line.
- Parameters:
line_type – type of the line, e.g. “header”, “bullet_list_item”, “chapter”, etc.
level_1 – value of the line's primary importance
level_2 – level of the line inside a specific class
can_be_multiline – is used by TreeConstructor to unify lines inside a tree node: if the line can be multiline, it can be joined with another line. If None is given, can_be_multiline is set to True.
- class dedoc.structure_extractors.patterns.TagHeaderPattern(line_type: str | None = None, level_1: int | None = None, level_2: int | None = None, can_be_multiline: bool | str | None = None, default_line_type: str = 'header', default_level_1: int = 1, default_level_2: int | None = None)[source]
Bases: TagPattern
Pattern for using information about heading lines (header) from readers saved in line.metadata.tag_hierarchy_level. It also allows calculating level_2 based on dotted list depth (same as in DottedListPattern) if level_2, tag_hierarchy_level.level_2 and default_level_2 are empty.
See also
Please see Types of textual lines to find out which readers can extract lines with type “header”.
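The fallback that derives level_2 from dotted list depth can be pictured with a small standalone sketch. It assumes that "depth" means the number of dot-terminated numbers at the start of the line (for example "1.2.3." has depth 3); the regular expression and the function `dotted_depth` are hypothetical and may differ from what DottedListPattern actually does.

```python
import re
from typing import Optional

# Hypothetical regexp: numbers each followed by a dot, e.g. "1.", "2.1.",
# "3.4.5.", and then whitespace or the end of the line.
DOTTED_PREFIX = re.compile(r"^(\d+\.)+(?=\s|$)")


def dotted_depth(text: str) -> Optional[int]:
    """Return the number of dotted components in the prefix, or None."""
    match = DOTTED_PREFIX.match(text.strip())
    if match is None:
        return None
    # Each component ends with exactly one dot, so counting dots gives the depth.
    return match.group(0).count(".")
```

Under this assumption, a header such as "1.2.3. Results" would receive a deeper level_2 than "1. Introduction" when no explicit, tag-provided or default level_2 is available.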
Example of library usage:
from dedoc.structure_extractors import DefaultStructureExtractor
from dedoc.structure_extractors.patterns import TagHeaderPattern

reader = ...
structure_extractor = DefaultStructureExtractor()
patterns = [TagHeaderPattern(line_type="header", level_1=1, can_be_multiline=False)]
document = reader.read(file_path=file_path)
document = structure_extractor.extract(document=document, parameters={"patterns": patterns})
Example of API usage:
import requests

patterns = [{"name": "tag_header", "line_type": "header", "level_1": 1, "can_be_multiline": "False"}]
parameters = {"patterns": str(patterns)}
with open(file_path, "rb") as file:
    files = {"file": (file_name, file)}
    r = requests.post("http://localhost:1231/upload", files=files, data=parameters)
- _name = 'tag_header'
- __init__(line_type: str | None = None, level_1: int | None = None, level_2: int | None = None, can_be_multiline: bool | str | None = None, default_line_type: str = 'header', default_level_1: int = 1, default_level_2: int | None = None) None[source]
Initialize pattern for configuring values of HierarchyLevel attributes. It is recommended to configure default_* values in case line.metadata.tag_hierarchy_level lacks some values. If you want to use values from line.metadata.tag_hierarchy_level, it is recommended to leave line_type, level_1, level_2, can_be_multiline empty. can_be_multiline is filled in by the PDF and image readers during paragraph detection, so if you want to extract paragraphs, you shouldn’t set can_be_multiline during pattern initialization.
- Parameters:
line_type – type of the line, replaces line_type from tag_hierarchy_level if non-empty.
level_1 – value of the line's primary importance, replaces level_1 from tag_hierarchy_level if non-empty.
level_2 – level of the line inside a specific class, replaces level_2 from tag_hierarchy_level if non-empty.
can_be_multiline – is used by TreeConstructor to unify lines inside a tree node: if the line can be multiline, it can be joined with another line. If not None, replaces can_be_multiline from tag_hierarchy_level.
default_line_type – type of the line, used when tag_hierarchy_level.line_type == “unknown”.
default_level_1 – value of the line's primary importance, used when tag_hierarchy_level.level_1 is None.
default_level_2 – level of the line inside a specific class, used when tag_hierarchy_level.level_2 is None.
- get_hierarchy_level(line: LineWithMeta) HierarchyLevel[source]
This method should be applied only when match() returned True for the given line.
Return HierarchyLevel for initialising line.metadata.hierarchy_level. The attribute line_type is initialized according to the following rules:
if a non-empty line_type is given during pattern initialisation, then its value is used in the result;
if line_type is not given (or None is given) and line.metadata.tag_hierarchy_level.line_type is not unknown, the line_type value from line.metadata.tag_hierarchy_level is used in the result;
otherwise (line_type is empty and line.metadata.tag_hierarchy_level.line_type is unknown) the default_line_type value is used in the result.
Similar rules work for level_1 and level_2, with comparison against None instead of unknown.
The can_be_multiline attribute is initialized according to the following rules:
if a non-empty can_be_multiline is given during pattern initialisation, then its value is used in the result;
otherwise the can_be_multiline value from line.metadata.tag_hierarchy_level is used in the result.
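The precedence rules above (explicit value, then the value from tag_hierarchy_level, then the default) can be sketched in plain Python. The `Level` dataclass is a simplified stand-in for HierarchyLevel, and `resolve` and `pick` are hypothetical names; none of them is the dedoc implementation. The sketch assumes that "empty" means None and leaves out the dotted-list fallback for level_2.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Level:
    """Simplified stand-in for HierarchyLevel (hypothetical, not the dedoc class)."""
    line_type: str = "unknown"
    level_1: Optional[int] = None
    level_2: Optional[int] = None
    can_be_multiline: bool = True


def resolve(tag: Level, line_type=None, level_1=None, level_2=None, can_be_multiline=None,
            default_line_type="header", default_level_1=1, default_level_2=None) -> Level:
    """Combine pattern arguments, the reader's tag level and defaults."""

    def pick(explicit, from_tag, tag_is_empty, default):
        # Rule 1: an explicit pattern argument always wins.
        if explicit is not None:
            return explicit
        # Rules 2 and 3: use the tag value unless it is empty, else the default.
        return default if tag_is_empty else from_tag

    return Level(
        line_type=pick(line_type, tag.line_type, tag.line_type == "unknown", default_line_type),
        level_1=pick(level_1, tag.level_1, tag.level_1 is None, default_level_1),
        level_2=pick(level_2, tag.level_2, tag.level_2 is None, default_level_2),
        # can_be_multiline has no default_* counterpart: explicit value or tag value.
        can_be_multiline=tag.can_be_multiline if can_be_multiline is None else can_be_multiline,
    )
```

For example, a tag level with line_type "header" and no level_1 resolves to level_1 equal to default_level_1, while passing level_1 explicitly overrides whatever the reader stored.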
- match(line: LineWithMeta) bool[source]
Check if the pattern is suitable for the given line:
line.metadata.tag_hierarchy_level should not be empty;
line.metadata.tag_hierarchy_level.line_type == "header".
line.metadata.tag_hierarchy_level is filled in during the reading step; please see Types of textual lines to find out which readers can extract lines with type “header”.
- class dedoc.structure_extractors.patterns.TagListPattern(line_type: str | None = None, level_1: int | None = None, level_2: int | None = None, can_be_multiline: bool | str | None = None, default_line_type: str = 'list_item', default_level_1: int = 2, default_level_2: int | None = None)[source]
Bases: TagPattern
Pattern for using information about list item lines (list_item) from readers saved in line.metadata.tag_hierarchy_level. It also allows calculating level_2 based on dotted list depth (same as in DottedListPattern) if level_2, tag_hierarchy_level.level_2 and default_level_2 are empty.
See also
Please see Types of textual lines to find out which readers can extract lines with type “list_item”.
Example of library usage:
from dedoc.structure_extractors import DefaultStructureExtractor
from dedoc.structure_extractors.patterns import TagListPattern

reader = ...
structure_extractor = DefaultStructureExtractor()
patterns = [TagListPattern(line_type="list_item", default_level_1=2, can_be_multiline=False)]
document = reader.read(file_path=file_path)
document = structure_extractor.extract(document=document, parameters={"patterns": patterns})
Example of API usage:
import requests

patterns = [{"name": "tag_list", "line_type": "list_item", "default_level_1": 2, "can_be_multiline": "False"}]
parameters = {"patterns": str(patterns)}
with open(file_path, "rb") as file:
    files = {"file": (file_name, file)}
    r = requests.post("http://localhost:1231/upload", files=files, data=parameters)
- _name = 'tag_list'
- __init__(line_type: str | None = None, level_1: int | None = None, level_2: int | None = None, can_be_multiline: bool | str | None = None, default_line_type: str = 'list_item', default_level_1: int = 2, default_level_2: int | None = None) None[source]
Initialize pattern for configuring values of HierarchyLevel attributes. It is recommended to configure default_* values in case line.metadata.tag_hierarchy_level lacks some values. If you want to use values from line.metadata.tag_hierarchy_level, it is recommended to leave line_type, level_1, level_2, can_be_multiline empty. can_be_multiline is filled in by the PDF and image readers during paragraph detection, so if you want to extract paragraphs, you shouldn’t set can_be_multiline during pattern initialization.
- Parameters:
line_type – type of the line, replaces line_type from tag_hierarchy_level if non-empty.
level_1 – value of the line's primary importance, replaces level_1 from tag_hierarchy_level if non-empty.
level_2 – level of the line inside a specific class, replaces level_2 from tag_hierarchy_level if non-empty.
can_be_multiline – is used by TreeConstructor to unify lines inside a tree node: if the line can be multiline, it can be joined with another line. If not None, replaces can_be_multiline from tag_hierarchy_level.
default_line_type – type of the line, used when tag_hierarchy_level.line_type == “unknown”.
default_level_1 – value of the line's primary importance, used when tag_hierarchy_level.level_1 is None.
default_level_2 – level of the line inside a specific class, used when tag_hierarchy_level.level_2 is None.
- get_hierarchy_level(line: LineWithMeta) HierarchyLevel[source]
This method should be applied only when match() returned True for the given line.
Return HierarchyLevel for initialising line.metadata.hierarchy_level. The attribute line_type is initialized according to the following rules:
if a non-empty line_type is given during pattern initialisation, then its value is used in the result;
if line_type is not given (or None is given) and line.metadata.tag_hierarchy_level.line_type is not unknown, the line_type value from line.metadata.tag_hierarchy_level is used in the result;
otherwise (line_type is empty and line.metadata.tag_hierarchy_level.line_type is unknown) the default_line_type value is used in the result.
Similar rules work for level_1 and level_2, with comparison against None instead of unknown.
The can_be_multiline attribute is initialized according to the following rules:
if a non-empty can_be_multiline is given during pattern initialisation, then its value is used in the result;
otherwise the can_be_multiline value from line.metadata.tag_hierarchy_level is used in the result.
- match(line: LineWithMeta) bool[source]
Check if the pattern is suitable for the given line:
line.metadata.tag_hierarchy_level should not be empty;
line.metadata.tag_hierarchy_level.line_type == "list_item".
line.metadata.tag_hierarchy_level is filled in during the reading step; please see Types of textual lines to find out which readers can extract lines with type “list_item”.
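The two matching conditions can be expressed as a small predicate. The sketch below uses `SimpleNamespace` objects in place of LineWithMeta, and the names `tag_matches` and `make_line` are hypothetical; it assumes that an "empty" tag_hierarchy_level is represented by None.

```python
from types import SimpleNamespace


def tag_matches(line, expected_type: str) -> bool:
    """Hypothetical predicate mirroring the two documented conditions."""
    tag = getattr(line.metadata, "tag_hierarchy_level", None)
    # Condition 1: the tag level is present; condition 2: its type matches.
    return tag is not None and tag.line_type == expected_type


def make_line(line_type):
    """Build a stand-in line object; None means the reader stored no tag level."""
    tag = None if line_type is None else SimpleNamespace(line_type=line_type)
    return SimpleNamespace(metadata=SimpleNamespace(tag_hierarchy_level=tag))
```

With `expected_type="list_item"` the predicate corresponds to the check described here, and with `expected_type="header"` to the analogous check for header lines.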