dedoc.structure_extractors
- class dedoc.structure_extractors.AbstractStructureExtractor(*, config: dict | None = None)[source]
This class adds additional information to the given unstructured document (list of lines) received from some of the readers. Types of lines (paragraph_type) and their levels (hierarchy_level) in the document are added.
The hierarchy level of a line shows its importance in the document: the more important the line is, the lower its level value. See the class dedoc.data_structures.HierarchyLevel for more information.
The paragraph type of the line should be one of the predefined types for a certain document domain, e.g. header, list_item, raw_text, etc. Each concrete structure extractor defines the rules of structuring: the levels and possible types of the lines.
- __init__(*, config: dict | None = None) None [source]
- Parameters:
config – configuration of the extractor, e.g. logger for logging
- abstract extract(document: UnstructuredDocument, parameters: dict | None = None) UnstructuredDocument [source]
This method extracts structure for the document content received from some reader: it finds line types and their hierarchy levels and adds them to the lines’ metadata.
- Parameters:
document – document content that has been received from some of the readers
parameters – additional parameters for document parsing, see Structure type configuring for more details
- Returns:
document content with added information about line types and hierarchy levels
- class dedoc.structure_extractors.StructureExtractorComposition(extractors: Dict[str, AbstractStructureExtractor], default_key: str, *, config: dict | None = None)[source]
Bases:
AbstractStructureExtractor
This class extracts structure from any document using the available set of structure extractors. The structure extractors and the names of the document types they handle are set via the class constructor. Each document type defines a specific document domain, whose structure is extracted by the corresponding structure extractor.
- __init__(extractors: Dict[str, AbstractStructureExtractor], default_key: str, *, config: dict | None = None) None [source]
- Parameters:
extractors – mapping document_type -> structure extractor, defined for certain document domains
default_key – the document_type of the structure extractor that will be used by default if wrong parameters are given. default_key must exist as a key in the extractors dictionary.
- extract(document: UnstructuredDocument, parameters: dict | None = None) UnstructuredDocument [source]
Adds information about the document structure according to the document type received from parameters (the key document_type). If there is no document_type key in parameters, or this document_type is not among the supported types, the default extractor is used. For information about the method’s parameters, see the documentation of the class AbstractStructureExtractor.
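The fallback rule described above can be sketched without dedoc itself. This is a minimal illustration, not the library's implementation: `choose_extractor` and the lambda stand-ins are hypothetical names, and raising an error for a missing `default_key` is an assumption.

```python
from typing import Callable, Dict, Optional


def choose_extractor(extractors: Dict[str, Callable], default_key: str,
                     parameters: Optional[dict] = None) -> Callable:
    """Pick an extractor by parameters["document_type"], falling back to default_key."""
    # Assumption: a default_key that is not among the extractors is treated as an error
    if default_key not in extractors:
        raise ValueError(f"default_key {default_key!r} is not a key of extractors")
    parameters = parameters or {}
    document_type = parameters.get("document_type", default_key)
    # Unknown document types fall back to the default extractor
    return extractors.get(document_type, extractors[default_key])


# Hypothetical stand-ins for real structure extractors
extractors = {"other": lambda doc: f"default({doc})", "law": lambda doc: f"law({doc})"}

chosen_law = choose_extractor(extractors, "other", {"document_type": "law"})
chosen_unknown = choose_extractor(extractors, "other", {"document_type": "recipe"})
chosen_missing = choose_extractor(extractors, "other", None)
```

Here `chosen_law` is the law stand-in, while both an unsupported type and missing parameters resolve to the default one.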
- class dedoc.structure_extractors.DefaultStructureExtractor(*, config: dict | None = None)[source]
Bases:
AbstractStructureExtractor
This class performs basic structure extraction from documents.
You can find the description of this type of structure in the section Default document structure type.
- document_type = 'other'
- extract(document: UnstructuredDocument, parameters: dict | None = None) UnstructuredDocument [source]
Extract basic structure from the given document and add additional information to the lines’ metadata. For information about the method’s parameters, see the documentation of the class AbstractStructureExtractor.
The parameters argument can contain patterns for configuring line types and their levels in the output document tree (the “patterns” key). Please see Patterns for DefaultStructureExtractor and Configure structure extraction using patterns for information on how to use patterns to build a custom structure.
- class dedoc.structure_extractors.AbstractLawStructureExtractor(*, config: dict | None = None)[source]
Bases: AbstractStructureExtractor, ABC
This class is used to extract structure from laws.
You can find the description of this type of structure in the section Law structure type.
- extract(document: UnstructuredDocument, parameters: dict | None = None) UnstructuredDocument [source]
Extract law structure from the given document and add additional information to the lines’ metadata. For information about the method’s parameters, see the documentation of the class AbstractStructureExtractor.
- class dedoc.structure_extractors.ClassifyingLawStructureExtractor(extractors: Dict[str, AbstractStructureExtractor], *, config: dict | None = None)[source]
Bases: AbstractStructureExtractor, ABC
This class is used to dynamically classify laws into two types: laws and foiv. The specific extractors are called according to the classification results.
- document_type = 'law'
- __init__(extractors: Dict[str, AbstractStructureExtractor], *, config: dict | None = None) None [source]
- Parameters:
extractors – mapping law_type -> structure extractor, defined for certain law types
config – configuration of the extractor, e.g. logger for logging
- extract(document: UnstructuredDocument, parameters: dict | None = None) UnstructuredDocument [source]
Classify the law kind and extract structure according to the specific law format. For information about the method’s parameters, see the documentation of the class AbstractStructureExtractor.
- class dedoc.structure_extractors.LawStructureExtractor(*, config: dict | None = None)[source]
Bases:
AbstractLawStructureExtractor
This class is used to extract structure from common laws.
You can find the description of this type of structure in the section Simple law structure type.
- document_type = 'law'
- class dedoc.structure_extractors.FoivLawStructureExtractor(*, config: dict | None = None)[source]
Bases:
AbstractLawStructureExtractor
This class is used to extract structure from laws of the foiv type.
You can find the description of this type of structure in the section Foiv law structure type.
- document_type = 'foiv_law'
- class dedoc.structure_extractors.DiplomaStructureExtractor(*, config: dict | None = None)[source]
Bases:
AbstractStructureExtractor
This class is used to extract structure from Russian diplomas, master’s dissertations, theses, etc.
You can find the description of this type of structure in the section Diploma structure type.
- document_type = 'diploma'
- extract(document: UnstructuredDocument, parameters: dict | None = None) UnstructuredDocument [source]
Extract diploma structure from the given document and add additional information to the lines’ metadata. For information about the method’s parameters, see the documentation of the class AbstractStructureExtractor.
- class dedoc.structure_extractors.TzStructureExtractor(*, config: dict | None = None)[source]
Bases:
AbstractStructureExtractor
This class is used to extract structure from technical tasks.
You can find the description of this type of structure in the section Technical specification structure type.
- document_type = 'tz'
- extract(document: UnstructuredDocument, parameters: dict | None = None) UnstructuredDocument [source]
Extract technical task structure from the given document and add additional information to the lines’ metadata. For information about the method’s parameters, see the documentation of the class AbstractStructureExtractor.
- class dedoc.structure_extractors.ArticleStructureExtractor(*, config: dict | None = None)[source]
Bases:
AbstractStructureExtractor
This class corresponds to the GROBID article structure extraction.
This class saves all tag_hierarchy_levels received from the ArticleReader without using the postprocessing step (without using regular expressions).
You can find the description of this type of structure in the section Article structure type (GROBID).
- document_type = 'article'
- extract(document: UnstructuredDocument, parameters: dict | None = None) UnstructuredDocument [source]
Extract article structure from the given document and add additional information to the lines’ metadata. For information about the method’s parameters, see the documentation of the class AbstractStructureExtractor.
- class dedoc.structure_extractors.FintocStructureExtractor(*, config: dict | None = None)[source]
Bases:
AbstractStructureExtractor
This class is an implementation of the TOC extractor for the FinTOC 2022 Shared task. The code is a modification of the winner’s solution (ISP RAS team).
This structure extractor is used for English, French and Spanish financial prospectuses in PDF format (with a textual layer). It is recommended to use PdfTxtlayerReader to obtain document lines. You can find a more detailed description of this type of structure in the section FinTOC structure type.
- document_type = 'fintoc'
- extract(document: UnstructuredDocument, parameters: dict | None = None, file_path: str | None = None) UnstructuredDocument [source]
According to the FinTOC 2022 title detection task, lines are classified as titles and non-titles. The information about titles is saved in line.metadata.hierarchy_level (HierarchyLevel class):
  - Title lines have the HierarchyLevel.header type, and their depth (HierarchyLevel.level_2) is similar to the depth of the TOC item from the FinTOC 2022 TOC generation task.
  - Non-title lines have the HierarchyLevel.raw_text type, and their depth isn’t obtained.
- Parameters:
document – document content that has been received from some of the readers (PdfTxtlayerReader is recommended).
parameters – for this structure extractor, the “language” parameter is used for setting the document’s language, e.g. parameters={"language": "en"}. The following options are supported:
  - “en”, “eng” - English (default);
  - “fr”, “fra” - French;
  - “sp”, “spa” - Spanish.
file_path – path to the file on disk.
- Returns:
document content with added additional information about title/non-title lines and hierarchy levels of titles.
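The language options listed above can be read as a small alias table. The sketch below only illustrates that table: `LANGUAGE_ALIASES` and `resolve_language` are hypothetical names, and raising an error for an unsupported code is an assumption, since the documentation does not say how the extractor reacts to unknown values.

```python
from typing import Optional

# Aliases listed in the documentation above, mapped to a readable language name
LANGUAGE_ALIASES = {
    "en": "English", "eng": "English",
    "fr": "French", "fra": "French",
    "sp": "Spanish", "spa": "Spanish",
}


def resolve_language(parameters: Optional[dict] = None) -> str:
    """Return the language selected by parameters["language"]."""
    parameters = parameters or {}
    code = parameters.get("language", "en")  # English is the documented default
    # Assumption: unsupported codes are rejected; the real behaviour may differ
    if code not in LANGUAGE_ALIASES:
        raise ValueError(f"Unsupported language code: {code!r}")
    return LANGUAGE_ALIASES[code]


french = resolve_language({"language": "fra"})
default_language = resolve_language()
```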
Patterns for DefaultStructureExtractor
Structure patterns allow more flexible configuration of line types and levels during the structure extraction step.
They are useful only for DefaultStructureExtractor (in the API, when “document_type”=“other”).
Please see Configure structure extraction using patterns to get examples of patterns usage.
- class dedoc.structure_extractors.patterns.abstract_pattern.AbstractPattern(line_type: str | None, level_1: int | None, level_2: int | None, can_be_multiline: bool | str | None)[source]
Base class for all patterns to configure structure extraction by DefaultStructureExtractor.
- _name = ''
- __init__(line_type: str | None, level_1: int | None, level_2: int | None, can_be_multiline: bool | str | None) None [source]
Initialize the pattern with default values of HierarchyLevel attributes. They can be used in get_hierarchy_level() according to specific pattern logic.
- Parameters:
line_type – type of the line, e.g. “header”, “bullet_list_item”, “chapter”, etc.
level_1 – value of the line’s primary importance
level_2 – level of the line inside a specific class
can_be_multiline – is used to unify lines inside a tree node by TreeConstructor: if a line can be multiline, it can be joined with another line. If None is given, can_be_multiline is set to True.
- abstract get_hierarchy_level(line: LineWithMeta) HierarchyLevel [source]
This method should be applied only when match() returned True for the given line.
Get HierarchyLevel for initialising the line.metadata.hierarchy_level attribute. Please see Hierarchy level for document lines to get more information about HierarchyLevel.
- abstract match(line: LineWithMeta) bool [source]
Check if the given line satisfies the pattern requirements. Line text, annotations or metadata (metadata.tag_hierarchy_level) can be used to decide whether the line matches the pattern.
- class dedoc.structure_extractors.patterns.pattern_composition.PatternComposition(patterns: List[AbstractPattern])[source]
Class for applying patterns to get line’s hierarchy level.
Example of usage:
    from dedoc.data_structures.line_with_meta import LineWithMeta
    from dedoc.structure_extractors.patterns import TagListPattern, TagPattern
    from dedoc.structure_extractors.patterns.pattern_composition import PatternComposition

    # More specific patterns go first: the first applicable pattern is used
    pattern_composition = PatternComposition(
        patterns=[
            TagListPattern(line_type="list_item", default_level_1=2, can_be_multiline=False),
            TagPattern(default_line_type="raw_text")
        ]
    )
    line = LineWithMeta(line="Some text")
    line.metadata.hierarchy_level = pattern_composition.get_hierarchy_level(line=line)
- __init__(patterns: List[AbstractPattern]) None [source]
Set the list of patterns to apply to lines.
Note: the order of the patterns is important. More specific patterns should go first; otherwise, they may never be applied because a more general pattern that also fits the line is chosen before them.
- Parameters:
patterns – list of patterns to apply to lines.
- get_hierarchy_level(line: LineWithMeta) HierarchyLevel [source]
Choose the suitable pattern from the list of patterns for applying to the given line. The first applicable pattern will be chosen. If no applicable pattern is found, the default raw_text HierarchyLevel is used as the result.
- Parameters:
line – line to get hierarchy level for.
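The "first applicable pattern wins" rule, and why pattern order matters, can be sketched with plain Python stand-ins. `SketchPattern` and `first_matching_level` are hypothetical names used only for illustration; the real classes operate on LineWithMeta and return HierarchyLevel objects, and the `None` level of the fallback is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class SketchPattern:
    """Hypothetical stand-in for a pattern: a match predicate plus the level it assigns."""
    line_type: str
    level_1: Optional[int]
    matches: Callable[[str], bool]


def first_matching_level(patterns: List[SketchPattern], text: str) -> Tuple[str, Optional[int]]:
    """Return (line_type, level_1) of the first pattern that matches the text."""
    for pattern in patterns:
        if pattern.matches(text):
            return pattern.line_type, pattern.level_1
    # No pattern matched: fall back to a raw_text result (level assumed to be None)
    return "raw_text", None


specific = SketchPattern("chapter", 1, lambda t: t.lower().startswith("chapter"))
general = SketchPattern("text_line", 3, lambda t: len(t) > 0)

good_order = first_matching_level([specific, general], "Chapter 1. Intro")
bad_order = first_matching_level([general, specific], "Chapter 1. Intro")
fallback = first_matching_level([specific], "plain paragraph")
```

With the general pattern placed first, the chapter line is classified as `text_line` and the specific pattern is never consulted.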
- class dedoc.structure_extractors.patterns.RegexpPattern(regexp: str, line_type: str, level_1: int | None = None, level_2: int | None = None, can_be_multiline: bool | str | None = None)[source]
Bases:
AbstractPattern
Pattern for matching line text by a regular expression.
Note
The pattern is case-insensitive (upper and lower case letters are not distinguished). Before regular expression matching, the line text is stripped (whitespace is removed from both sides).
See also
Syntax for writing regular expressions is described in the Python documentation.
Example of library usage:
    import re

    from dedoc.structure_extractors import DefaultStructureExtractor
    from dedoc.structure_extractors.patterns import RegexpPattern

    reader = ...
    structure_extractor = DefaultStructureExtractor()
    # Raw strings keep the backslashes of the regular expressions intact
    patterns = [
        RegexpPattern(regexp=r"^chapter\s\d+\.", line_type="chapter", level_1=1, can_be_multiline=False),
        RegexpPattern(regexp=re.compile(r"^part\s\d+\.\d+\."), line_type="part", level_1=2, can_be_multiline=False)
    ]
    document = reader.read(file_path=file_path)
    document = structure_extractor.extract(document=document, parameters={"patterns": patterns})
Example of API usage:
    import requests

    patterns = [{"name": "regexp", "regexp": r"^chapter\s\d+\.", "line_type": "chapter", "level_1": 1, "can_be_multiline": "false"}]
    parameters = {"patterns": str(patterns)}
    with open(file_path, "rb") as file:
        files = {"file": (file_name, file)}
        r = requests.post("http://localhost:1231/upload", files=files, data=parameters)
- _name = 'regexp'
- __init__(regexp: str, line_type: str, level_1: int | None = None, level_2: int | None = None, can_be_multiline: bool | str | None = None) None [source]
Initialize the pattern with default values of HierarchyLevel attributes.
- Parameters:
regexp – regular expression for checking whether the line text matches the pattern. Note that the regular expression is applied to the lowercase and stripped line.
line_type – type of the line, e.g. “header”, “bullet_list_item”, “chapter”, etc.
level_1 – value of the line’s primary importance
level_2 – level of the line inside a specific class
can_be_multiline – is used to unify lines inside a tree node by TreeConstructor: if a line can be multiline, it can be joined with another line. If None is given, can_be_multiline is set to True.
- get_hierarchy_level(line: LineWithMeta) HierarchyLevel [source]
This method should be applied only when match() returned True for the given line.
Return HierarchyLevel for initialising line.metadata.hierarchy_level. The attributes line_type, level_1, level_2 and can_be_multiline are equal to the values given during class initialisation.
- match(line: LineWithMeta) bool [source]
Check if the pattern is suitable for the given line. The line text is checked against the pattern’s regular expression; the text is stripped and made lowercase beforehand.
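The normalisation step described above (strip, lowercase, then apply the regular expression) can be sketched as follows. `regexp_matches` is a hypothetical helper, and matching from the start of the text with `re.match` is an assumption; the library may apply the expression differently.

```python
import re


def regexp_matches(regexp: str, line_text: str) -> bool:
    """Apply the regular expression to the stripped, lowercased line text."""
    normalized = line_text.strip().lower()
    # Assumption: the expression is matched from the beginning of the text
    return re.match(regexp, normalized) is not None


chapter_regexp = r"^chapter\s\d+\."

padded_upper = regexp_matches(chapter_regexp, "  CHAPTER 1. Overview ")
no_number = regexp_matches(chapter_regexp, "Chapter one")
upper_in_regexp = regexp_matches(r"^Chapter\s\d+\.", "Chapter 1.")
```

In this sketch an expression containing upper case letters (`upper_in_regexp`) does not match, because only the line text is lowercased; under that reading it is safest to write expressions in lower case.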
- class dedoc.structure_extractors.patterns.StartWordPattern(start_word: str, line_type: str, level_1: int | None = None, level_2: int | None = None, can_be_multiline: bool | str | None = None)[source]
Bases:
AbstractPattern
Pattern for lines that begin with some specific text (e.g. Introduction, Chapter, etc.).
Note
The pattern is case-insensitive (upper and lower case letters are not distinguished). Before matching, the line text is stripped (whitespace is removed from both sides). The start word for matching is also stripped and made lowercase.
Example of library usage:
    from dedoc.structure_extractors import DefaultStructureExtractor
    from dedoc.structure_extractors.patterns import StartWordPattern

    reader = ...
    structure_extractor = DefaultStructureExtractor()
    patterns = [StartWordPattern(start_word="chapter", line_type="chapter", level_1=1, can_be_multiline=False)]
    document = reader.read(file_path=file_path)
    document = structure_extractor.extract(document=document, parameters={"patterns": patterns})
Example of API usage:
    import requests

    patterns = [{"name": "start_word", "start_word": "chapter", "line_type": "chapter", "level_1": 1, "can_be_multiline": "false"}]
    parameters = {"patterns": str(patterns)}
    with open(file_path, "rb") as file:
        files = {"file": (file_name, file)}
        r = requests.post("http://localhost:1231/upload", files=files, data=parameters)
- _name = 'start_word'
- __init__(start_word: str, line_type: str, level_1: int | None = None, level_2: int | None = None, can_be_multiline: bool | str | None = None) None [source]
Initialize the pattern with default values of HierarchyLevel attributes.
- Parameters:
start_word – string to compare with the beginning of the line text. Note that start_word is stripped and made lowercase, and is compared with the lowercase and stripped line.
line_type – type of the line, e.g. “header”, “bullet_list_item”, “chapter”, etc.
level_1 – value of the line’s primary importance
level_2 – level of the line inside a specific class
can_be_multiline – is used to unify lines inside a tree node by TreeConstructor: if a line can be multiline, it can be joined with another line. If None is given, can_be_multiline is set to True.
- get_hierarchy_level(line: LineWithMeta) HierarchyLevel [source]
This method should be applied only when match() returned True for the given line.
Return HierarchyLevel for initialising line.metadata.hierarchy_level. The attributes line_type, level_1, level_2 and can_be_multiline are equal to the values given during class initialisation.
- match(line: LineWithMeta) bool [source]
Check if the pattern is suitable for the given line. The line text is checked for starting with the given start_word; the text is stripped and made lowercase beforehand.
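As a rough illustration of the check described above, both the start word and the line text can be normalised before a prefix comparison. `starts_with_word` is a hypothetical helper, not part of dedoc.

```python
def starts_with_word(start_word: str, line_text: str) -> bool:
    """Compare the stripped, lowercased start word with the beginning of the normalised line."""
    normalized_word = start_word.strip().lower()
    normalized_text = line_text.strip().lower()
    return normalized_text.startswith(normalized_word)


padded_match = starts_with_word(" Chapter ", "  CHAPTER 2: Methods")
inner_word = starts_with_word("chapter", "In chapter 2 we show")
```

The second call is false because the word occurs inside the line rather than at its beginning.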
- class dedoc.structure_extractors.patterns.TagPattern(line_type: str | None = None, level_1: int | None = None, level_2: int | None = None, can_be_multiline: bool | str | None = None, default_line_type: str = 'raw_text', default_level_1: int | None = None, default_level_2: int | None = None)[source]
Bases:
AbstractPattern
Pattern for using information from readers saved in line.metadata.tag_hierarchy_level. Can be useful for paragraph extraction in PDF documents and images, because PDF and image readers save information about paragraphs in line.metadata.tag_hierarchy_level.can_be_multiline.
See also
Please see Types of textual lines if you need information about which line types can be extracted by each reader.
Example of library usage:
    from dedoc.structure_extractors import DefaultStructureExtractor
    from dedoc.structure_extractors.patterns import TagPattern

    reader = ...
    structure_extractor = DefaultStructureExtractor()
    patterns = [TagPattern(default_line_type="raw_text")]
    document = reader.read(file_path=file_path)
    document = structure_extractor.extract(document=document, parameters={"patterns": patterns})
Example of API usage:
    import requests

    patterns = [{"name": "tag", "default_line_type": "raw_text"}]
    parameters = {"patterns": str(patterns)}
    with open(file_path, "rb") as file:
        files = {"file": (file_name, file)}
        r = requests.post("http://localhost:1231/upload", files=files, data=parameters)
- _name = 'tag'
- __init__(line_type: str | None = None, level_1: int | None = None, level_2: int | None = None, can_be_multiline: bool | str | None = None, default_line_type: str = 'raw_text', default_level_1: int | None = None, default_level_2: int | None = None) None [source]
Initialize the pattern for configuring values of HierarchyLevel attributes. It is recommended to configure default_* values in case line.metadata.tag_hierarchy_level misses some values. If you want to use values from line.metadata.tag_hierarchy_level, it is recommended to leave line_type, level_1, level_2 and can_be_multiline empty.
can_be_multiline is filled in by PDF and image readers during paragraph detection, so if you want to extract paragraphs, you shouldn’t set can_be_multiline during pattern initialization.
- Parameters:
line_type – type of the line, replaces line_type from tag_hierarchy_level if non-empty.
level_1 – value of the line’s primary importance, replaces level_1 from tag_hierarchy_level if non-empty.
level_2 – level of the line inside a specific class, replaces level_2 from tag_hierarchy_level if non-empty.
can_be_multiline – is used to unify lines inside a tree node by TreeConstructor: if a line can be multiline, it can be joined with another line. If not None, replaces can_be_multiline from tag_hierarchy_level.
default_line_type – type of the line, is used when tag_hierarchy_level.line_type == “unknown”.
default_level_1 – value of the line’s primary importance, is used when tag_hierarchy_level.level_1 is None.
default_level_2 – level of the line inside a specific class, is used when tag_hierarchy_level.level_2 is None.
- get_hierarchy_level(line: LineWithMeta) HierarchyLevel [source]
This method should be applied only when match() returned True for the given line.
Return HierarchyLevel for initialising line.metadata.hierarchy_level. The attribute line_type is initialized according to the following rules:
  - if a non-empty line_type is given during pattern initialisation, its value is used in the result;
  - if line_type is not given (or None is given) and the line_type of line.metadata.tag_hierarchy_level is not unknown, the line_type value from line.metadata.tag_hierarchy_level is used in the result;
  - otherwise (line_type is empty and the line_type of line.metadata.tag_hierarchy_level is unknown), the default_line_type value is used in the result.
Similar rules apply to level_1 and level_2, comparing with None instead of unknown.
The can_be_multiline attribute is initialized according to the following rules:
  - if a non-empty can_be_multiline is given during pattern initialisation, its value is used in the result;
  - otherwise, the can_be_multiline value from line.metadata.tag_hierarchy_level is used in the result.
- match(line: LineWithMeta) bool [source]
Check if the pattern is suitable for the given line: line.metadata.tag_hierarchy_level should not be empty. line.metadata.tag_hierarchy_level is filled during the reading step; some readers can skip tag_hierarchy_level initialisation.
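The precedence rules of TagPattern (explicit value, then the value from the reader's tag, then the default) can be sketched with a small stand-in. `Level` and `resolve_tag_level` are hypothetical names; the sketch treats only `None` as "empty", which is an assumption about how the library interprets empty values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Level:
    """Hypothetical stand-in for HierarchyLevel."""
    line_type: str = "unknown"
    level_1: Optional[int] = None
    level_2: Optional[int] = None
    can_be_multiline: bool = True


def resolve_tag_level(tag: Level, line_type=None, level_1=None, level_2=None, can_be_multiline=None,
                      default_line_type="raw_text", default_level_1=None, default_level_2=None) -> Level:
    def pick(explicit, from_tag, empty_marker, default):
        if explicit is not None:       # 1. value given at pattern initialisation
            return explicit
        if from_tag != empty_marker:   # 2. value filled in by the reader
            return from_tag
        return default                 # 3. configured default

    return Level(
        line_type=pick(line_type, tag.line_type, "unknown", default_line_type),
        level_1=pick(level_1, tag.level_1, None, default_level_1),
        level_2=pick(level_2, tag.level_2, None, default_level_2),
        # can_be_multiline has no default_* counterpart: explicit value or the tag's value
        can_be_multiline=tag.can_be_multiline if can_be_multiline is None else can_be_multiline,
    )


header_tag = Level("header", 1, None, False)
from_tag = resolve_tag_level(header_tag, default_level_2=1)
from_defaults = resolve_tag_level(Level(), default_line_type="raw_text")
overridden = resolve_tag_level(header_tag, line_type="title", can_be_multiline=True)
```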
- class dedoc.structure_extractors.patterns.BracketListPattern(line_type: str, level_1: int, level_2: int, can_be_multiline: bool | str | None = None)[source]
Bases:
RegexpPattern
Pattern for matching numbered lists with brackets, e.g.
    1) first element
    2) second element
Example of library usage:
    from dedoc.structure_extractors import DefaultStructureExtractor
    from dedoc.structure_extractors.patterns import BracketListPattern

    reader = ...
    structure_extractor = DefaultStructureExtractor()
    patterns = [BracketListPattern(line_type="list_item", level_1=1, level_2=1, can_be_multiline=False)]
    document = reader.read(file_path=file_path)
    document = structure_extractor.extract(document=document, parameters={"patterns": patterns})
Example of API usage:
    import requests

    patterns = [{"name": "bracket_list", "line_type": "list_item", "level_1": 1, "level_2": 1, "can_be_multiline": "false"}]
    parameters = {"patterns": str(patterns)}
    with open(file_path, "rb") as file:
        files = {"file": (file_name, file)}
        r = requests.post("http://localhost:1231/upload", files=files, data=parameters)
- _name = 'bracket_list'
- __init__(line_type: str, level_1: int, level_2: int, can_be_multiline: bool | str | None = None) None [source]
Initialize the pattern with default values of HierarchyLevel attributes.
- Parameters:
line_type – type of the line, e.g. “header”, “bullet_list_item”, “chapter”, etc.
level_1 – value of the line’s primary importance
level_2 – level of the line inside a specific class
can_be_multiline – is used to unify lines inside a tree node by TreeConstructor: if a line can be multiline, it can be joined with another line. If None is given, can_be_multiline is set to True.
- class dedoc.structure_extractors.patterns.BracketRomanListPattern(line_type: str, level_1: int, level_2: int, can_be_multiline: bool | str | None = None)[source]
Bases:
RegexpPattern
Pattern for matching roman lists with brackets, e.g.
    i) first item
    ii) second item
    iii) third item
    iv) fourth item
Note
The pattern is case-insensitive (upper and lower case letters are not distinguished).
Example of library usage:
    from dedoc.structure_extractors import DefaultStructureExtractor
    from dedoc.structure_extractors.patterns import BracketRomanListPattern

    reader = ...
    structure_extractor = DefaultStructureExtractor()
    patterns = [BracketRomanListPattern(line_type="list_item", level_1=1, level_2=1, can_be_multiline=False)]
    document = reader.read(file_path=file_path)
    document = structure_extractor.extract(document=document, parameters={"patterns": patterns})
Example of API usage:
    import requests

    patterns = [{"name": "bracket_roman_list", "line_type": "list_item", "level_1": 1, "level_2": 1, "can_be_multiline": "false"}]
    parameters = {"patterns": str(patterns)}
    with open(file_path, "rb") as file:
        files = {"file": (file_name, file)}
        r = requests.post("http://localhost:1231/upload", files=files, data=parameters)
- _name = 'bracket_roman_list'
- __init__(line_type: str, level_1: int, level_2: int, can_be_multiline: bool | str | None = None) None [source]
Initialize the pattern with default values of HierarchyLevel attributes.
- Parameters:
line_type – type of the line, e.g. “header”, “bullet_list_item”, “chapter”, etc.
level_1 – value of the line’s primary importance
level_2 – level of the line inside a specific class
can_be_multiline – is used to unify lines inside a tree node by TreeConstructor: if a line can be multiline, it can be joined with another line. If None is given, can_be_multiline is set to True.
- class dedoc.structure_extractors.patterns.BulletListPattern(line_type: str, level_1: int, level_2: int, can_be_multiline: bool | str | None = None)[source]
Bases:
RegexpPattern
Pattern for matching bulleted lists, e.g.
    - first item
    - second item
or with other bullet markers:
    -, —, −, –, ®, ., •, ,, ‚, ©, ⎯, °, *, >, ●, ♣, ①, ▪, *, +
Example of library usage:
    from dedoc.structure_extractors import DefaultStructureExtractor
    from dedoc.structure_extractors.patterns import BulletListPattern

    reader = ...
    structure_extractor = DefaultStructureExtractor()
    patterns = [BulletListPattern(line_type="list_item", level_1=1, level_2=1, can_be_multiline=False)]
    document = reader.read(file_path=file_path)
    document = structure_extractor.extract(document=document, parameters={"patterns": patterns})
Example of API usage:
    import requests

    patterns = [{"name": "bullet_list", "line_type": "list_item", "level_1": 1, "level_2": 1, "can_be_multiline": "false"}]
    parameters = {"patterns": str(patterns)}
    with open(file_path, "rb") as file:
        files = {"file": (file_name, file)}
        r = requests.post("http://localhost:1231/upload", files=files, data=parameters)
- _name = 'bullet_list'
- __init__(line_type: str, level_1: int, level_2: int, can_be_multiline: bool | str | None = None) None [source]
Initialize the pattern with default values of HierarchyLevel attributes.
- Parameters:
line_type – type of the line, e.g. “header”, “bullet_list_item”, “chapter”, etc.
level_1 – value of the line’s primary importance
level_2 – level of the line inside a specific class
can_be_multiline – is used to unify lines inside a tree node by TreeConstructor: if a line can be multiline, it can be joined with another line. If None is given, can_be_multiline is set to True.
- class dedoc.structure_extractors.patterns.DottedListPattern(line_type: str, level_1: int, can_be_multiline: bool | str | None = None)[source]
Bases:
RegexpPattern
Pattern for matching numbered lists with dots, e.g.
    1. first element
    1.1. first sub-element
    1.2. second sub-element
    2. second element
The number of dots is unlimited. There is no level_2 parameter in this pattern; level_2 is calculated as the count of numbers separated by dots, e.g.
    1. → level_2=1
    1.1 or 1.1. → level_2=2
    1.2.3.4 or 1.2.3.4. → level_2=4
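The level_2 rule above amounts to counting the numbers in the list marker. The sketch below is one way to express it: `DOTTED_MARKER` and `dotted_level_2` are hypothetical names, and the regular expression is an assumed shape of the marker rather than the one used by the library.

```python
import re
from typing import Optional

# Assumed marker shape: numbers separated by dots, with an optional trailing dot,
# followed by whitespace or the end of the line
DOTTED_MARKER = re.compile(r"^(\d+(?:\.\d+)*)\.?(?=\s|$)")


def dotted_level_2(line_text: str) -> Optional[int]:
    """Return the count of numbers in the dotted marker, or None if there is no marker."""
    match = DOTTED_MARKER.match(line_text.strip())
    if match is None:
        return None
    # "1.2.3.4" -> ["1", "2", "3", "4"] -> 4
    return len(match.group(1).split("."))


top_level = dotted_level_2("1. first element")
sub_level = dotted_level_2("1.1. first sub-element")
deep_level = dotted_level_2("1.2.3.4")
no_marker = dotted_level_2("plain text")
```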
Example of library usage:
    from dedoc.structure_extractors import DefaultStructureExtractor
    from dedoc.structure_extractors.patterns import DottedListPattern

    reader = ...
    structure_extractor = DefaultStructureExtractor()
    patterns = [DottedListPattern(line_type="list_item", level_1=1, can_be_multiline=False)]
    document = reader.read(file_path=file_path)
    document = structure_extractor.extract(document=document, parameters={"patterns": patterns})
Example of API usage:
    import requests

    patterns = [{"name": "dotted_list", "line_type": "list_item", "level_1": 1, "can_be_multiline": "false"}]
    parameters = {"patterns": str(patterns)}
    with open(file_path, "rb") as file:
        files = {"file": (file_name, file)}
        r = requests.post("http://localhost:1231/upload", files=files, data=parameters)
- _name = 'dotted_list'
- __init__(line_type: str, level_1: int, can_be_multiline: bool | str | None = None) None [source]
Initialize the pattern with default values of HierarchyLevel attributes.
- Parameters:
line_type – type of the line, e.g. “header”, “bullet_list_item”, “chapter”, etc.
level_1 – value of the line’s primary importance
can_be_multiline – is used to unify lines inside a tree node by TreeConstructor: if a line can be multiline, it can be joined with another line. If None is given, can_be_multiline is set to True.
- get_hierarchy_level(line: LineWithMeta) HierarchyLevel [source]
This method should be applied only when match() returned True for the given line.
Return HierarchyLevel for initialising line.metadata.hierarchy_level. The attributes line_type, level_1 and can_be_multiline are equal to the values given during class initialisation; level_2 is calculated from the numbering of the line, as described in the class documentation.
- class dedoc.structure_extractors.patterns.LetterListPattern(line_type: str, level_1: int, level_2: int, can_be_multiline: bool | str | None = None)[source]
Bases:
RegexpPattern
Pattern for matching lists with letters and brackets, e.g.
    a) first element
    b) second element
or (example for the Armenian language)
    ա) տեղաբաշխել
    բ) Հայաստանի Հանրապետության
    գ) սահմանապահ վերակարգերի
Note
The pattern is case-insensitive (upper and lower case letters are not distinguished).
Example of library usage:
from dedoc.structure_extractors import DefaultStructureExtractor from dedoc.structure_extractors.patterns import LetterListPattern reader = ... structure_extractor = DefaultStructureExtractor() patterns = [LetterListPattern(line_type="list_item", level_1=1, level_2=1, can_be_multiline=False)] document = reader.read(file_path=file_path) document = structure_extractor.extract(document=document, parameters={"patterns": patterns})
Example of API usage:
import requests

patterns = [{"name": "letter_list", "line_type": "list_item", "level_1": 1, "level_2": 1, "can_be_multiline": "false"}]
parameters = {"patterns": str(patterns)}
with open(file_path, "rb") as file:
    files = {"file": (file_name, file)}
    r = requests.post("http://localhost:1231/upload", files=files, data=parameters)
- _name = 'letter_list'
- __init__(line_type: str, level_1: int, level_2: int, can_be_multiline: bool | str | None = None) None [source]
Initialize pattern with default values of HierarchyLevel attributes.
- Parameters:
regexp – regular expression for checking whether the line text matches the pattern. Note that the regular expression is applied to the lowercased and stripped line.
line_type – type of the line, e.g. “header”, “bullet_list_item”, “chapter”, etc.
level_1 – value of the line’s primary importance
level_2 – level of the line inside a specific class
can_be_multiline – used by TreeConstructor to unify lines inside a tree node: if a line can be multiline, it can be joined with another line. If None is given, can_be_multiline is set to True.
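The letter-list matching described for LetterListPattern can be approximated with a short regular expression. This is only a sketch: the expression LETTER_ITEM and the helper match_letter_item are hypothetical names, and the expression actually used by dedoc may differ. Following the note on the regexp parameter, the line is stripped and lowercased before matching.

```python
import re

# Hypothetical expression (an assumption, not dedoc's own):
# one Unicode letter, a closing bracket, then the item text.
LETTER_ITEM = re.compile(r"^([^\W\d_])\)\s*(.*)$")


def match_letter_item(text: str):
    """Return (letter, rest) if the line looks like 'a) text', else None."""
    # The pattern is applied to the stripped, lowercased line,
    # which makes the match case-insensitive.
    found = LETTER_ITEM.match(text.strip().lower())
    if found is None:
        return None
    return found.group(1), found.group(2)
```

Because `[^\W\d_]` matches any Unicode letter, the same expression accepts both Latin and Armenian items.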
- class dedoc.structure_extractors.patterns.RomanListPattern(line_type: str, level_1: int, level_2: int, can_be_multiline: bool | str | None = None)[source]
Bases: RegexpPattern
Pattern for matching roman lists with dots, e.g.
I. first item
II. second item
III. third item
IV. fourth item
Note
The pattern is case-insensitive (uppercase and lowercase letters are not distinguished).
Example of library usage:
from dedoc.structure_extractors import DefaultStructureExtractor
from dedoc.structure_extractors.patterns import RomanListPattern

reader = ...
structure_extractor = DefaultStructureExtractor()
patterns = [RomanListPattern(line_type="list_item", level_1=1, level_2=1, can_be_multiline=False)]
document = reader.read(file_path=file_path)
document = structure_extractor.extract(document=document, parameters={"patterns": patterns})
Example of API usage:
import requests

patterns = [{"name": "roman_list", "line_type": "list_item", "level_1": 1, "level_2": 1, "can_be_multiline": "false"}]
parameters = {"patterns": str(patterns)}
with open(file_path, "rb") as file:
    files = {"file": (file_name, file)}
    r = requests.post("http://localhost:1231/upload", files=files, data=parameters)
- _name = 'roman_list'
- __init__(line_type: str, level_1: int, level_2: int, can_be_multiline: bool | str | None = None) None [source]
Initialize pattern with default values of HierarchyLevel attributes.
- Parameters:
regexp – regular expression for checking whether the line text matches the pattern. Note that the regular expression is applied to the lowercased and stripped line.
line_type – type of the line, e.g. “header”, “bullet_list_item”, “chapter”, etc.
level_1 – value of the line’s primary importance
level_2 – level of the line inside a specific class
can_be_multiline – used by TreeConstructor to unify lines inside a tree node: if a line can be multiline, it can be joined with another line. If None is given, can_be_multiline is set to True.
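A roman list item of the form shown for RomanListPattern can be recognised along the following lines. This is a rough sketch under stated assumptions: ROMAN_ITEM and match_roman_item are hypothetical names, the expression is not taken from dedoc, and it does not check that the letters form a valid roman numeral.

```python
import re

# Hypothetical expression (an assumption, not dedoc's own):
# a run of roman-numeral letters, a dot, whitespace, then the item text.
ROMAN_ITEM = re.compile(r"^([ivxlcdm]+)\.\s+(.*)$")


def match_roman_item(text: str):
    """Return (numeral, rest) if the line looks like 'iv. text', else None."""
    # Lowercasing first makes 'IV.' and 'iv.' equivalent.
    found = ROMAN_ITEM.match(text.strip().lower())
    if found is None:
        return None
    return found.group(1), found.group(2)
```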
- class dedoc.structure_extractors.patterns.TagHeaderPattern(line_type: str | None = None, level_1: int | None = None, level_2: int | None = None, can_be_multiline: bool | str | None = None, default_line_type: str = 'header', default_level_1: int = 1, default_level_2: int | None = None)[source]
Bases: TagPattern
Pattern for using information about heading lines (header) from readers saved in line.metadata.tag_hierarchy_level. It can also calculate level_2 based on dotted list depth (same as in DottedListPattern) if level_2, tag_hierarchy_level.level_2 and default_level_2 are all empty.
See also
Please see Types of textual lines to find out which readers can extract lines with type “header”.
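The fallback to "dotted list depth" mentioned above can be pictured with a small helper. This sketch assumes that depth means the count of numbers in a dotted prefix such as "1.2.3."; the exact rule dedoc applies may differ, and DOTTED_PREFIX and dotted_depth are hypothetical names.

```python
import re

# Hypothetical expression: numbers separated by dots, with an optional
# trailing dot, followed by whitespace or the end of the line.
DOTTED_PREFIX = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?(?:\s|$)")


def dotted_depth(text: str):
    """Return the count of numbers in a dotted prefix, or None if absent."""
    found = DOTTED_PREFIX.match(text)
    if found is None:
        return None
    return len(found.group(1).split("."))
```

Under this assumption, a heading starting with "1.2." would receive level_2 = 2 when no other source of level_2 is available.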
Example of library usage:
from dedoc.structure_extractors import DefaultStructureExtractor
from dedoc.structure_extractors.patterns import TagHeaderPattern

reader = ...
structure_extractor = DefaultStructureExtractor()
patterns = [TagHeaderPattern(line_type="header", level_1=1, can_be_multiline=False)]
document = reader.read(file_path=file_path)
document = structure_extractor.extract(document=document, parameters={"patterns": patterns})
Example of API usage:
import requests

patterns = [{"name": "tag_header", "line_type": "header", "level_1": 1, "can_be_multiline": "False"}]
parameters = {"patterns": str(patterns)}
with open(file_path, "rb") as file:
    files = {"file": (file_name, file)}
    r = requests.post("http://localhost:1231/upload", files=files, data=parameters)
- _name = 'tag_header'
- __init__(line_type: str | None = None, level_1: int | None = None, level_2: int | None = None, can_be_multiline: bool | str | None = None, default_line_type: str = 'header', default_level_1: int = 1, default_level_2: int | None = None) None [source]
Initialize pattern for configuring values of HierarchyLevel attributes. It is recommended to configure default_* values in case line.metadata.tag_hierarchy_level is missing some values. If you want to use values from line.metadata.tag_hierarchy_level, it is recommended to leave line_type, level_1, level_2, can_be_multiline empty. can_be_multiline is filled in by the PDF and image readers during paragraph detection, so if you want to extract paragraphs, you shouldn’t set can_be_multiline during pattern initialization.
- Parameters:
line_type – type of the line, replaces line_type from tag_hierarchy_level if non-empty.
level_1 – value of the line’s primary importance, replaces level_1 from tag_hierarchy_level if non-empty.
level_2 – level of the line inside a specific class, replaces level_2 from tag_hierarchy_level if non-empty.
can_be_multiline – used by TreeConstructor to unify lines inside a tree node: if a line can be multiline, it can be joined with another line. If not None, replaces can_be_multiline from tag_hierarchy_level.
default_line_type – type of the line, used when tag_hierarchy_level.line_type == “unknown”.
default_level_1 – value of the line’s primary importance, used when tag_hierarchy_level.level_1 is None.
default_level_2 – level of the line inside a specific class, used when tag_hierarchy_level.level_2 is None.
- get_hierarchy_level(line: LineWithMeta) HierarchyLevel [source]
This method should be applied only when match() returned True for the given line.
Return HierarchyLevel for initialising line.metadata.hierarchy_level. The attribute line_type is initialized according to the following rules:
if a non-empty line_type is given during pattern initialisation, then its value is used in the result;
if line_type is not given (or None is given) and line.metadata.tag_hierarchy_level.line_type is not unknown, the line_type value from line.metadata.tag_hierarchy_level is used in the result;
otherwise (line_type is empty and line.metadata.tag_hierarchy_level.line_type is unknown) the default_line_type value is used in the result.
Similar rules work for level_1 and level_2, comparing with None instead of unknown.
The can_be_multiline attribute is initialized according to the following rules:
if a non-empty can_be_multiline is given during pattern initialisation, then its value is used in the result;
otherwise the can_be_multiline value from line.metadata.tag_hierarchy_level is used in the result.
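The precedence rules above (explicit value, then the reader's tag value, then the default) can be sketched in plain Python. This is an illustration only: Level is a simplified stand-in for HierarchyLevel, resolve is a hypothetical function rather than part of dedoc, and the dotted-list fallback for level_2 is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Level:
    """Simplified stand-in for HierarchyLevel (not the dedoc class)."""
    level_1: Optional[int]
    level_2: Optional[int]
    can_be_multiline: bool
    line_type: str


def resolve(tag: Level, line_type=None, level_1=None, level_2=None,
            can_be_multiline=None, default_line_type="header",
            default_level_1=1, default_level_2=None) -> Level:
    """Combine pattern arguments with the tag level set by a reader."""
    # line_type: explicit value, else a known tag value, else the default.
    if line_type:
        resolved_type = line_type
    elif tag.line_type != "unknown":
        resolved_type = tag.line_type
    else:
        resolved_type = default_line_type

    def pick(explicit, from_tag, default):
        # Same precedence for levels, comparing with None.
        if explicit is not None:
            return explicit
        return from_tag if from_tag is not None else default

    return Level(
        level_1=pick(level_1, tag.level_1, default_level_1),
        level_2=pick(level_2, tag.level_2, default_level_2),
        # can_be_multiline has no default: the tag value is the fallback.
        can_be_multiline=tag.can_be_multiline if can_be_multiline is None else can_be_multiline,
        line_type=resolved_type,
    )
```

For example, a tag with an unknown line type and no levels resolves to the defaults, while any explicitly passed argument overrides what the reader stored.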
- match(line: LineWithMeta) bool [source]
Check if the pattern is suitable for the given line:
line.metadata.tag_hierarchy_level should not be empty;
line.metadata.tag_hierarchy_level.line_type == "header"
line.metadata.tag_hierarchy_level is filled during the reading step; please see Types of textual lines to find out which readers can extract lines with type “header”.
- class dedoc.structure_extractors.patterns.TagListPattern(line_type: str | None = None, level_1: int | None = None, level_2: int | None = None, can_be_multiline: bool | str | None = None, default_line_type: str = 'list_item', default_level_1: int = 2, default_level_2: int | None = None)[source]
Bases: TagPattern
Pattern for using information about list item lines (list_item) from readers saved in line.metadata.tag_hierarchy_level. It can also calculate level_2 based on dotted list depth (same as in DottedListPattern) if level_2, tag_hierarchy_level.level_2 and default_level_2 are all empty.
See also
Please see Types of textual lines to find out which readers can extract lines with type “list_item”.
Example of library usage:
from dedoc.structure_extractors import DefaultStructureExtractor
from dedoc.structure_extractors.patterns import TagListPattern

reader = ...
structure_extractor = DefaultStructureExtractor()
patterns = [TagListPattern(line_type="list_item", default_level_1=2, can_be_multiline=False)]
document = reader.read(file_path=file_path)
document = structure_extractor.extract(document=document, parameters={"patterns": patterns})
Example of API usage:
import requests

patterns = [{"name": "tag_list", "line_type": "list_item", "default_level_1": 2, "can_be_multiline": "False"}]
parameters = {"patterns": str(patterns)}
with open(file_path, "rb") as file:
    files = {"file": (file_name, file)}
    r = requests.post("http://localhost:1231/upload", files=files, data=parameters)
- _name = 'tag_list'
- __init__(line_type: str | None = None, level_1: int | None = None, level_2: int | None = None, can_be_multiline: bool | str | None = None, default_line_type: str = 'list_item', default_level_1: int = 2, default_level_2: int | None = None) None [source]
Initialize pattern for configuring values of HierarchyLevel attributes. It is recommended to configure default_* values in case line.metadata.tag_hierarchy_level is missing some values. If you want to use values from line.metadata.tag_hierarchy_level, it is recommended to leave line_type, level_1, level_2, can_be_multiline empty. can_be_multiline is filled in by the PDF and image readers during paragraph detection, so if you want to extract paragraphs, you shouldn’t set can_be_multiline during pattern initialization.
- Parameters:
line_type – type of the line, replaces line_type from tag_hierarchy_level if non-empty.
level_1 – value of the line’s primary importance, replaces level_1 from tag_hierarchy_level if non-empty.
level_2 – level of the line inside a specific class, replaces level_2 from tag_hierarchy_level if non-empty.
can_be_multiline – used by TreeConstructor to unify lines inside a tree node: if a line can be multiline, it can be joined with another line. If not None, replaces can_be_multiline from tag_hierarchy_level.
default_line_type – type of the line, used when tag_hierarchy_level.line_type == “unknown”.
default_level_1 – value of the line’s primary importance, used when tag_hierarchy_level.level_1 is None.
default_level_2 – level of the line inside a specific class, used when tag_hierarchy_level.level_2 is None.
- get_hierarchy_level(line: LineWithMeta) HierarchyLevel [source]
This method should be applied only when match() returned True for the given line.
Return HierarchyLevel for initialising line.metadata.hierarchy_level. The attribute line_type is initialized according to the following rules:
if a non-empty line_type is given during pattern initialisation, then its value is used in the result;
if line_type is not given (or None is given) and line.metadata.tag_hierarchy_level.line_type is not unknown, the line_type value from line.metadata.tag_hierarchy_level is used in the result;
otherwise (line_type is empty and line.metadata.tag_hierarchy_level.line_type is unknown) the default_line_type value is used in the result.
Similar rules work for level_1 and level_2, comparing with None instead of unknown.
The can_be_multiline attribute is initialized according to the following rules:
if a non-empty can_be_multiline is given during pattern initialisation, then its value is used in the result;
otherwise the can_be_multiline value from line.metadata.tag_hierarchy_level is used in the result.
- match(line: LineWithMeta) bool [source]
Check if the pattern is suitable for the given line:
line.metadata.tag_hierarchy_level should not be empty;
line.metadata.tag_hierarchy_level.line_type == "list_item"
line.metadata.tag_hierarchy_level is filled during the reading step; please see Types of textual lines to find out which readers can extract lines with type “list_item”.
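The two match() conditions shared by the tag patterns amount to a single check. The sketch below is an assumption-laden illustration: tag_matches and make_line are hypothetical helpers, and SimpleNamespace objects stand in for LineWithMeta so the example runs without dedoc installed.

```python
from types import SimpleNamespace


def make_line(line_type=None):
    """Build a stand-in line; line_type=None means the reader set no tag."""
    tag = None if line_type is None else SimpleNamespace(line_type=line_type)
    return SimpleNamespace(metadata=SimpleNamespace(tag_hierarchy_level=tag))


def tag_matches(line, expected_type: str) -> bool:
    """True if the reader filled a tag level whose type is expected_type."""
    tag = line.metadata.tag_hierarchy_level
    return tag is not None and tag.line_type == expected_type
```

With these stand-ins, `tag_matches(make_line("list_item"), "list_item")` holds, while a line without a tag level, or one tagged "header", does not match the list pattern.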