dedoc.readers

class dedoc.readers.BaseReader(*, config: dict | None = None, recognized_extensions: Set[str] | None = None, recognized_mimes: Set[str] | None = None)[source]

This is the base class for reading documents of any format. It lets you check whether a specific reader can handle a document of a given format, and extract the document’s text with metadata, tables and attachments.

The text metadata (annotations) vary and may include text boldness and color, footnotes, or links to tables. Some readers can also extract the line type and hierarchy level (for example, list item); this information is stored in the tag_hierarchy_level attribute of the LineMetadata class.

can_read(file_path: str | None = None, mime: str | None = None, extension: str | None = None, parameters: dict | None = None) → bool[source]

Check if this reader can handle the given file. You should provide at least one of the following parameters: file_path, extension, mime.

Parameters:

file_path – path to the file in the file system
mime – MIME type of a file
extension – file extension, for example .doc or .pdf
parameters – dict with additional parameters for document reader, see Parameters description for more details

Returns:

True if this reader can handle the file, False otherwise
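
The check can be pictured as a lookup against the reader’s sets of recognized extensions and MIME types. The sketch below uses only the standard library and is not dedoc’s implementation: the class name SketchReader and the fallback to guessing values from file_path are assumptions made for illustration.

```python
import mimetypes
import os
from typing import Optional, Set


class SketchReader:
    """Hypothetical stand-in for a reader; not part of dedoc."""

    def __init__(self, recognized_extensions: Set[str], recognized_mimes: Set[str]) -> None:
        self.recognized_extensions = {e.lower() for e in recognized_extensions}
        self.recognized_mimes = {m.lower() for m in recognized_mimes}

    def can_read(self, file_path: Optional[str] = None, mime: Optional[str] = None,
                 extension: Optional[str] = None, parameters: Optional[dict] = None) -> bool:
        # Assumption: missing values are derived from file_path when it is given.
        if file_path is not None:
            if extension is None:
                extension = os.path.splitext(file_path)[1]
            if mime is None:
                mime = mimetypes.guess_type(file_path)[0]
        ext_ok = bool(extension) and extension.lower() in self.recognized_extensions
        mime_ok = bool(mime) and mime.lower() in self.recognized_mimes
        return ext_ok or mime_ok


csv_like = SketchReader({".csv", ".tsv"}, {"text/csv"})
```

With no arguments at all the sketch returns False, which matches the advice to pass at least one of file_path, extension or mime.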

abstract read(file_path: str, parameters: dict | None = None) → UnstructuredDocument[source]

Read the file from disk and extract the document’s text with annotations, tables and attachments. The file must have an appropriate extension and MIME type, so call can_read() first and make sure it returns True.

Parameters:

file_path – path to the file in the file system
parameters – dict with additional parameters for document reader, see Parameters description for more details

Returns:

intermediate representation of the document with lines, tables and attachments

class dedoc.readers.ReaderComposition(readers: List[BaseReader])[source]

This class reads a document in any of the formats supported by the given list of readers. The list of readers is set via the class constructor. The first suitable reader (the first one whose can_read() method returns True) is used for parsing, so the order of the readers matters.

__init__(readers: List[BaseReader]) → None[source]

Parameters:

readers – the list of readers for documents of different formats that will be used for parsing

read(file_path: str, parameters: dict | None = None, extension: str | None = None, mime: str | None = None) → UnstructuredDocument[source]

Get the intermediate representation of a document in any format that one of the available readers can parse. If no reader is suitable for the given document, BadFileFormatException is raised.

Parameters:

file_path – path of the file to be parsed
parameters – dict with additional parameters for document readers, see Parameters description for more details
extension – file extension, for example .doc or .pdf
mime – MIME type of file

Returns:

intermediate representation of the document with lines, tables and attachments
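
The first-suitable-reader rule can be sketched without dedoc itself. Everything in the block below is a hypothetical stand-in (ExtensionReader, SketchComposition, BadFormatError); it only illustrates why the order of the readers matters and is not the library’s code.

```python
from typing import List, Optional, Set


class BadFormatError(Exception):
    """Hypothetical stand-in for dedoc's BadFileFormatException."""


class ExtensionReader:
    """Hypothetical reader that accepts a fixed set of extensions."""

    def __init__(self, name: str, extensions: Set[str]) -> None:
        self.name = name
        self.extensions = extensions

    def can_read(self, file_path: str, extension: Optional[str] = None) -> bool:
        return (extension or "") in self.extensions

    def read(self, file_path: str) -> str:
        return f"{self.name} parsed {file_path}"


class SketchComposition:
    def __init__(self, readers: List[ExtensionReader]) -> None:
        self.readers = readers

    def read(self, file_path: str, extension: Optional[str] = None) -> str:
        # Readers are tried in the given order; the first match wins.
        for reader in self.readers:
            if reader.can_read(file_path, extension=extension):
                return reader.read(file_path)
        raise BadFormatError(f"no reader can handle {file_path}")


composition = SketchComposition([
    ExtensionReader("first", {".txt"}),
    ExtensionReader("second", {".txt", ".md"}),
])
```

Both stand-in readers accept .txt, but only the one listed first is ever used for it; swapping the list order would change which reader parses the file.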

class dedoc.readers.ArchiveReader(*, config: dict | None = None)[source]

Bases: BaseReader

This reader returns archived files as attachments of the UnstructuredDocument. Documents with the following extensions can be parsed: .zip, .tar, .tar.gz, .rar, .7z.

read(file_path: str, parameters: dict | None = None) → UnstructuredDocument[source]

The method returns empty content for the archive itself; all archived files are placed in the attachments. See the documentation of BaseReader.read() for the method’s parameters.
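
A minimal sketch of the “empty content, everything in attachments” idea, using the standard zipfile module on an in-memory archive. The dictionary with lines, tables and attachments keys is a hypothetical simplification of UnstructuredDocument, and only .zip is covered here.

```python
import io
import zipfile
from typing import Dict


def archive_members(data: bytes) -> Dict[str, bytes]:
    """Return {member name: raw bytes} for every file stored in a zip archive."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return {
            info.filename: archive.read(info)
            for info in archive.infolist()
            if not info.is_dir()
        }


# Build a small archive in memory so the example needs no files on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("notes.txt", "hello")
    archive.writestr("data/table.csv", "a,b\n1,2\n")

# Hypothetical simplified document: no lines or tables, only attachments.
document = {"lines": [], "tables": [], "attachments": archive_members(buffer.getvalue())}
```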

class dedoc.readers.CSVReader(*, config: dict | None = None)[source]

Bases: BaseReader

This class parses files with the following extensions: .csv, .tsv.

read(file_path: str, parameters: dict | None = None) → UnstructuredDocument[source]

The method places all extracted content in the tables of the UnstructuredDocument. The lines and attachments remain empty. See the documentation of BaseReader.read() for the method’s parameters.
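
As an illustration of content ending up only in tables, the sketch below parses delimited text with the standard csv module. The read_table helper and the dictionary layout are hypothetical and stand in for the real UnstructuredDocument.

```python
import csv
import io
from typing import List


def read_table(text: str, delimiter: str = ",") -> List[List[str]]:
    """Parse delimited text into a list of rows, each a list of cell strings."""
    return [row for row in csv.reader(io.StringIO(text), delimiter=delimiter)]


csv_table = read_table("name,qty\nbolt,4\n")
tsv_table = read_table("name\tqty\nnut\t7\n", delimiter="\t")

# Hypothetical simplified document: one table, no lines, no attachments.
document = {"lines": [], "tables": [csv_table], "attachments": []}
```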

class dedoc.readers.DocxReader(*, config: dict | None = None)[source]

Bases: BaseReader

This class is used for parsing documents with the .docx extension. Use DocxConverter to obtain a .docx file from similar formats.

read(file_path: str, parameters: dict | None = None) → UnstructuredDocument[source]

The method returns the document content with all of the document’s lines, tables and attachments. This reader can add some additional information to the tag_hierarchy_level of LineMetadata. See the documentation of BaseReader.read() for the method’s parameters.

class dedoc.readers.EmailReader(*, config: dict | None = None)[source]

Bases: BaseReader

This class is used for parsing documents with the .eml extension (e-mail messages saved to files).

can_read(file_path: str | None = None, mime: str | None = None, extension: str | None = None, parameters: dict | None = None) → bool[source]

Check if the document extension or MIME type is suitable for this reader. See the documentation of BaseReader.can_read() for the method’s parameters.

read(file_path: str, parameters: dict | None = None) → UnstructuredDocument[source]

The method returns the document content with all of the document’s lines, tables and attachments. This reader can add some additional information to the tag_hierarchy_level of LineMetadata. It also saves some data from the message header (the fields “subject”, “from”, “to”, “cc”, “bcc”, “date”, “reply-to”) to an attached JSON file whose name starts with the prefix message_header_.

See the documentation of BaseReader.read() for the method’s parameters.
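
The header-saving step can be sketched with the standard email and json modules. The field list and the message_header_ prefix come from the description above; the helper name save_header, the .json suffix and the use of tempfile are assumptions for illustration, not dedoc’s code.

```python
import json
import os
import tempfile
from email import message_from_string

HEADER_FIELDS = ["subject", "from", "to", "cc", "bcc", "date", "reply-to"]


def save_header(raw_message: str, directory: str) -> str:
    """Write the selected header fields to a JSON file and return its path."""
    message = message_from_string(raw_message)
    # Header lookup is case-insensitive; absent fields are stored as null.
    header = {field: message.get(field) for field in HEADER_FIELDS}
    fd, path = tempfile.mkstemp(prefix="message_header_", suffix=".json", dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as out:
        json.dump(header, out)
    return path


raw = "Subject: Hello\nFrom: a@example.com\nTo: b@example.com\n\nBody text\n"
with tempfile.TemporaryDirectory() as directory:
    header_path = save_header(raw, directory)
    header_file_name = os.path.basename(header_path)
    with open(header_path, encoding="utf-8") as saved_file:
        saved_header = json.load(saved_file)
```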

class dedoc.readers.ExcelReader(*, config: dict | None = None)[source]

Bases: BaseReader

This class is used for parsing documents with the .xlsx extension. Use ExcelConverter to obtain an .xlsx file from similar formats.

read(file_path: str, parameters: dict | None = None) → UnstructuredDocument[source]

This method extracts tables and attachments from the document; the lines attribute remains empty. See the documentation of BaseReader.read() for the method’s parameters.

class dedoc.readers.HtmlReader(*, config: dict | None = None)[source]

Bases: BaseReader

This reader handles documents with the following extensions: .htm, .html, .shtml.

read(file_path: str, parameters: dict | None = None) → UnstructuredDocument[source]

The method returns the document content with all of the document’s lines and tables; attachments remain empty. This reader can add some additional information to the tag_hierarchy_level of LineMetadata. See the documentation of BaseReader.read() for the method’s parameters.

class dedoc.readers.JsonReader(*, config: dict | None = None)[source]

Bases: BaseReader

This reader handles .json files.

read(file_path: str, parameters: dict | None = None) → UnstructuredDocument[source]

The method returns the document content with all of the document’s lines and attachments; tables remain empty. This reader treats JSON lists as list items and adds this information to the tag_hierarchy_level of LineMetadata. Each dictionary entry is processed by creating a key line with the type key and a value line as its child. See the documentation of BaseReader.read() for the method’s parameters.
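
The key/child and list-item treatment can be pictured as a recursive walk over the parsed JSON. In this sketch a line is a hypothetical (depth, line type, text) tuple rather than dedoc’s line class, the type name “value” is invented for plain scalars, and nested containers inside lists are handled in the simplest possible way.

```python
import json
from typing import Any, List, Tuple

Line = Tuple[int, str, str]  # (depth, line type, text); a hypothetical simplification


def flatten(value: Any, depth: int = 0, line_type: str = "value") -> List[Line]:
    if isinstance(value, dict):
        lines: List[Line] = []
        for key, child in value.items():
            lines.append((depth, "key", str(key)))   # key line
            lines.extend(flatten(child, depth + 1))  # value becomes its child
        return lines
    if isinstance(value, list):
        lines = []
        for item in value:
            # Scalars in a list are labelled list_item; nested dicts keep their own types.
            lines.extend(flatten(item, depth, line_type="list_item"))
        return lines
    return [(depth, line_type, str(value))]


parsed = json.loads('{"title": "Report", "tags": ["a", "b"]}')
flat_lines = flatten(parsed)
```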

class dedoc.readers.MhtmlReader(*, config: dict | None = None)[source]

Bases: BaseReader

This reader can process files with the following extensions: .mhtml, .mht, .mhtml.gz, .mht.gz.

can_read(file_path: str | None = None, mime: str | None = None, extension: str | None = None, parameters: dict | None = None) → bool[source]

Check if the document extension is suitable for this reader. See the documentation of BaseReader.can_read() for the method’s parameters.

read(file_path: str, parameters: dict | None = None) → UnstructuredDocument[source]

The method returns the document content with all of the document’s lines, tables and attachments. This reader can add some additional information to the tag_hierarchy_level of LineMetadata. See the documentation of BaseReader.read() for the method’s parameters.

class dedoc.readers.NoteReader(*, config: dict | None = None)[source]

Bases: BaseReader

This class is used for parsing documents with the .note.pickle extension.

read(file_path: str, parameters: dict | None = None) → UnstructuredDocument[source]

The method returns the document content with all of the document’s lines. See the documentation of BaseReader.read() for the method’s parameters.

class dedoc.readers.PptxReader(*, config: dict | None = None)[source]

Bases: BaseReader

This class is used for parsing documents with the .pptx extension. Use PptxConverter to obtain a .pptx file from similar formats.

read(file_path: str, parameters: dict | None = None) → UnstructuredDocument[source]

The method returns the document content with all of the document’s lines, tables and attachments. See the documentation of BaseReader.read() for the method’s parameters.

class dedoc.readers.PdfBaseReader(*, config: dict | None = None, recognized_extensions: Set[str] | None = None, recognized_mimes: Set[str] | None = None)[source]

Bases: BaseReader

Base class for parsing PDF documents.

read(file_path: str, parameters: dict | None = None) → UnstructuredDocument[source]

The method returns the document content with all of the document’s lines, tables and attachments. This reader can add some additional information to the tag_hierarchy_level of LineMetadata (the can_be_multiline attribute is important for paragraph extraction). See the documentation of BaseReader.read() for the method’s parameters.

See also PDF and images handling for the possible arguments of the parameters dictionary.

class dedoc.readers.PdfImageReader(*, config: dict | None = None)[source]

Bases: PdfBaseReader

This class extracts content from .pdf documents without a textual layer (documents whose text cannot be copied), as well as from images (scanned documents).

The following features are implemented to enhance the recognition results:

optical character recognition using Tesseract OCR;
table detection and recognition;
document binarization (configure via need_binarization parameter);
document orientation correction (automatic rotation by 90, 180 or 270 degrees when needed);
classification of one-column and two-column documents;
detection of bold text.

This reader is not recommended for extracting content from PDF documents with a correct textual layer; use the other PDF readers instead.

read(file_path: str, parameters: dict | None = None) → UnstructuredDocument[source]

The method returns the document content with all of the document’s lines, tables and attachments. This reader can add some additional information to the tag_hierarchy_level of LineMetadata (the can_be_multiline attribute is important for paragraph extraction). See the documentation of BaseReader.read() for the method’s parameters.

See also PDF and images handling for the possible arguments of the parameters dictionary.

class dedoc.readers.PdfTabbyReader(*, config: dict | None = None)[source]

Bases: PdfBaseReader

This class extracts content (text and tables) from .pdf documents with a textual layer (documents whose text can be copied). It relies on Java code to produce the result.

This class is the recommended handler for PDF documents with a correct textual layer when you do not need to verify that the layer is correct. For more information, see the description of the pdf_with_text_layer option in PDF and images handling.

can_read(file_path: str | None = None, mime: str | None = None, extension: str | None = None, parameters: dict | None = None) → bool[source]

Check if the document extension is suitable for this reader (only the PDF format is supported). This method returns True only when the parameters dictionary contains the key pdf_with_text_layer with the value tabby.

See PDF and images handling for the possible arguments of the parameters dictionary.

See the documentation of BaseReader.can_read() for the method’s parameters.

read(file_path: str, parameters: dict | None = None) → UnstructuredDocument[source]

The method returns the document content with all of the document’s lines, tables and attachments. This reader can add some additional information to the tag_hierarchy_level of LineMetadata. See the documentation of BaseReader.read() for the method’s parameters.

See also PDF and images handling for the possible arguments of the parameters dictionary.

class dedoc.readers.PdfTxtlayerReader(*, config: dict | None = None)[source]

Bases: PdfBaseReader

This class extracts content (text, tables, attachments) from .pdf documents with a textual layer (documents whose text can be copied). It uses the pdfminer library for content extraction.

For more information, see the description of the pdf_with_text_layer option in PDF and images handling.

can_read(file_path: str | None = None, mime: str | None = None, extension: str | None = None, parameters: dict | None = None) → bool[source]

Check if the document extension is suitable for this reader (only the PDF format is supported). This method returns True only when the parameters dictionary contains the key pdf_with_text_layer with the value true.

See PDF and images handling for the possible arguments of the parameters dictionary.

See the documentation of BaseReader.can_read() for the method’s parameters.

read(file_path: str, parameters: dict | None = None) → UnstructuredDocument[source]

The method returns the document content with all of the document’s lines, tables and attachments. This reader can add some additional information to the tag_hierarchy_level of LineMetadata (the can_be_multiline attribute is important for paragraph extraction). See the documentation of BaseReader.read() for the method’s parameters.

See also PDF and images handling for the possible arguments of the parameters dictionary.

class dedoc.readers.PdfAutoReader(*, config: dict | None = None)[source]

Bases: BaseReader

This class extracts content from .pdf documents of any kind. PDF documents can have a textual layer (documents whose text can be copied) or lack one (images, scanned documents).

PdfAutoReader is used for automatic detection of a correct textual layer in the given PDF file:

if PDF document has a correct textual layer then PdfTxtlayerReader or PdfTabbyReader is used for document content extraction;
if PDF document doesn’t have a correct textual layer then PdfImageReader is used for document content extraction.

For more information, see the description of the pdf_with_text_layer option in PDF and images handling.
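
The two-branch choice above can be written as a small function. This is a sketch, not PdfAutoReader’s code: how the correctness of the textual layer is detected is out of scope here, so it is passed in as a flag, and the pairing of auto_tabby with PdfTabbyReader and auto with PdfTxtlayerReader is an assumption inferred from the option names.

```python
def choose_pdf_reader(has_correct_text_layer: bool, option: str = "auto") -> str:
    """Return the name of the reader that would extract the content."""
    if not has_correct_text_layer:
        # No usable textual layer: fall back to OCR-based extraction.
        return "PdfImageReader"
    # Assumption: auto_tabby selects PdfTabbyReader, auto selects PdfTxtlayerReader.
    return "PdfTabbyReader" if option == "auto_tabby" else "PdfTxtlayerReader"


scanned_choice = choose_pdf_reader(False, "auto_tabby")
copyable_choice = choose_pdf_reader(True, "auto_tabby")
```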

can_read(file_path: str | None = None, mime: str | None = None, extension: str | None = None, parameters: dict | None = None) → bool[source]

Check if the document extension is suitable for this reader (only the PDF format is supported). This method returns True only when the parameters dictionary contains the key pdf_with_text_layer with the value auto or auto_tabby.

Using pdf_with_text_layer=auto_tabby is recommended because it is faster and gives better results. See PDF and images handling for the possible arguments of the parameters dictionary.

read(file_path: str, parameters: dict | None = None) → UnstructuredDocument[source]

The method returns the document content with all of the document’s lines, tables and attachments. This reader can add some additional information to the tag_hierarchy_level of LineMetadata. See the documentation of BaseReader.read() for the method’s parameters. See also PDF and images handling for the possible arguments of the parameters dictionary.

class dedoc.readers.RawTextReader(*, config: dict | None = None)[source]

Bases: BaseReader

This class parses files with the following extensions: .txt, .txt.gz.

can_read(file_path: str | None = None, mime: str | None = None, extension: str | None = None, parameters: dict | None = None) → bool[source]

Check if the document extension is suitable for this reader. See the documentation of BaseReader.can_read() for the method’s parameters.

read(file_path: str, parameters: dict | None = None) → UnstructuredDocument[source]

This method returns only the document’s lines. See the documentation of BaseReader.read() for the method’s parameters.
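
Handling both .txt and .txt.gz can be sketched by picking the opener from the file name. The read_lines helper and the fixed UTF-8 encoding are assumptions for illustration; the real reader’s encoding handling is not described in this section.

```python
import gzip
import os
import tempfile
from typing import List


def read_lines(file_path: str) -> List[str]:
    """Read a plain or gzip-compressed text file into a list of lines."""
    opener = gzip.open if file_path.endswith(".gz") else open
    with opener(file_path, "rt", encoding="utf-8") as handle:
        return handle.read().splitlines()


with tempfile.TemporaryDirectory() as directory:
    plain_path = os.path.join(directory, "sample.txt")
    packed_path = os.path.join(directory, "sample.txt.gz")
    with open(plain_path, "w", encoding="utf-8") as handle:
        handle.write("first\nsecond\n")
    with gzip.open(packed_path, "wt", encoding="utf-8") as handle:
        handle.write("first\nsecond\n")
    plain_lines = read_lines(plain_path)
    packed_lines = read_lines(packed_path)
```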

class dedoc.readers.ArticleReader(config: dict | None = None)[source]

Bases: BaseReader

This class is used for parsing scientific articles with the .pdf extension using the GROBID system.

can_read(file_path: str | None = None, mime: str | None = None, extension: str | None = None, parameters: dict | None = None) → bool[source]

Check if:

the document extension is suitable for this reader (.pdf);
parameter “document_type” is “article”;
GROBID service is running on port 8070.

See the documentation of BaseReader.can_read() for the method’s parameters.
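
The three conditions can be combined as in the sketch below. It is not ArticleReader’s code: the host name localhost, the plain TCP connection attempt used as the “service is running” probe, and the helper names are all assumptions. The probe is injectable so that the logic can be exercised without a GROBID instance.

```python
import socket
from typing import Callable, Optional


def port_is_open(host: str = "localhost", port: int = 8070, timeout: float = 1.0) -> bool:
    """Assumed liveness probe: try to open a TCP connection to the service."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def can_read_article(extension: Optional[str], parameters: Optional[dict],
                     service_is_up: Callable[[], bool] = port_is_open) -> bool:
    is_pdf = (extension or "").lower() == ".pdf"
    is_article = (parameters or {}).get("document_type") == "article"
    # The service is probed only when the two cheap checks pass.
    return is_pdf and is_article and service_is_up()


accepted = can_read_article(".pdf", {"document_type": "article"}, service_is_up=lambda: True)
```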

read(file_path: str, parameters: dict | None = None) → UnstructuredDocument[source]

The method calls the /api/processFulltextDocument method of the GROBID service and parses the result for the recognized article (in XML/TEI format) with the beautifulsoup library, then fills an UnstructuredDocument. ArticleReader adds additional information to the tag_hierarchy_level of LineMetadata. The method extracts information about authors, keywords, bibliography items, sections, and tables. In table cells, the colspan attribute can be filled from GROBID’s “cols” attribute. For more about what is extracted from the GROBID system, see the page Article structure type (GROBID).

See the documentation of BaseReader.read() for the method’s parameters.

class dedoc.readers.PdfBrokenEncodingReader(*, config: dict | None = None)[source]

Bases: PdfTxtlayerReader

This class extracts content (text, tables, attachments) from .pdf documents that have a textual layer with broken encoding (copyable documents whose copied text is incorrect) and a complex background. It uses the pdfminer library for text extraction and a CNN to predict font glyphs. Currently, only Russian and English are supported.

For more information, see the description of the pdf_with_text_layer option in PDF and images handling.

can_read(file_path: str | None = None, mime: str | None = None, extension: str | None = None, parameters: dict | None = None) → bool[source]

Check if the document extension is suitable for this reader (only the PDF format is supported). This method returns True only when the parameters dictionary contains the key pdf_with_text_layer with the value bad_encoding.

See PDF and images handling for the possible arguments of the parameters dictionary.

See the documentation of BaseReader.can_read() for the method’s parameters.
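
The can_read() descriptions in this section tie each PDF reader to a value of pdf_with_text_layer. The lookup below collects those values in one place. It is a summary sketch, not library code: the helper name, the string comparison and the None result for any other value are assumptions, since this section does not say which reader handles other values.

```python
from typing import Optional

# Values taken from the can_read() descriptions in this section.
READER_BY_TEXT_LAYER_OPTION = {
    "tabby": "PdfTabbyReader",
    "true": "PdfTxtlayerReader",
    "auto": "PdfAutoReader",
    "auto_tabby": "PdfAutoReader",
    "bad_encoding": "PdfBrokenEncodingReader",
}


def reader_name_for(parameters: Optional[dict]) -> Optional[str]:
    """Return the reader name documented for the given option, if any."""
    option = str((parameters or {}).get("pdf_with_text_layer", "")).lower()
    return READER_BY_TEXT_LAYER_OPTION.get(option)


recommended = reader_name_for({"pdf_with_text_layer": "auto_tabby"})
```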