dedoc.readers

class dedoc.readers.BaseReader

This is the base class for reading documents of any format. It lets you check whether a specific reader can read a document of a given format, and get the document’s text with metadata, tables and attachments.

Text metadata (annotations) varies and may include boldness and color, footnotes, or links to tables. Some readers can also extract information about line type and hierarchy level (for example, list item); this information is stored in the tag_hierarchy_level attribute of the LineMetadata class.

abstract can_read(path: str, mime: str, extension: str, document_type: str | None = None, parameters: dict | None = None) → bool

Check if this reader can handle the given file.

Parameters:

path – path to the file in the file system
mime – MIME type of a file
extension – file extension, for example .doc or .pdf
document_type – type of file, for example scientific article, presentation slides and so on
parameters – dict with additional parameters for the document reader (such as the language for scans or the delimiter for csv)

Returns:

True if this reader can handle the file, False otherwise

abstract read(path: str, document_type: str | None = None, parameters: dict | None = None) → UnstructuredDocument

Read the file from disk and extract text with annotations, tables and attachments from the document. The file must have a suitable extension and type, so check it with can_read() first and make sure that it returns True.

Parameters:

path – path to the file in the file system
document_type – type of the file, for example scientific article, presentation slides and so on
parameters – dict with additional parameters for the document reader (such as the language for scans or the delimiter for csv)

Returns:

intermediate representation of the document with lines, tables and attachments
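
The can_read() then read() contract described above can be sketched as follows. This is a minimal illustration, not dedoc's implementation: SketchDocument, SketchBaseReader and UpperTxtReader are hypothetical stand-ins that only mirror the two abstract methods and the lines/tables/attachments shape of the result.

```python
import os
import tempfile
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SketchDocument:
    """Hypothetical stand-in for UnstructuredDocument."""
    lines: List[str] = field(default_factory=list)
    tables: List[list] = field(default_factory=list)
    attachments: List[str] = field(default_factory=list)


class SketchBaseReader(ABC):
    """Hypothetical stand-in mirroring the two abstract methods above."""

    @abstractmethod
    def can_read(self, path: str, mime: str, extension: str,
                 document_type: Optional[str] = None,
                 parameters: Optional[dict] = None) -> bool:
        ...

    @abstractmethod
    def read(self, path: str, document_type: Optional[str] = None,
             parameters: Optional[dict] = None) -> SketchDocument:
        ...


class UpperTxtReader(SketchBaseReader):
    """Toy reader: accepts .txt files and returns their lines upper-cased."""

    def can_read(self, path, mime, extension, document_type=None, parameters=None):
        return extension.lower() == ".txt"

    def read(self, path, document_type=None, parameters=None):
        with open(path, encoding="utf-8") as f:
            return SketchDocument(lines=[line.rstrip("\n").upper() for line in f])


# Usage: check with can_read() first, then call read().
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "note.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("first line\nsecond line\n")
    reader = UpperTxtReader()
    if reader.can_read(path, mime="text/plain", extension=".txt"):
        document = reader.read(path)

print(document.lines)  # ['FIRST LINE', 'SECOND LINE']
```

A reader that fills only some parts of the result (for example, only lines) leaves the other lists empty, which matches how the concrete readers below describe their output.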

class dedoc.readers.ReaderComposition(readers: List[BaseReader])

This class reads a document in any of a predefined list of formats, determined by the available readers. The list of readers is set via the class constructor. The first suitable reader (the first one whose can_read() method returns True) is used for parsing, so the order of the given readers matters.

__init__(readers: List[BaseReader]) → None

Parameters:

readers – the list of readers for documents of different formats that will be used for parsing

parse_file(tmp_dir: str, filename: str, parameters: Dict[str, str]) → UnstructuredDocument

Get the intermediate representation of a document in any format that one of the available readers can parse. If there is no suitable reader for the given document, BadFileFormatException is raised.

Parameters:

tmp_dir – the directory where the file is located
filename – name of the given file
parameters – dict with additional parameters for the document reader (such as the language for scans or the delimiter for csv)

Returns:

intermediate representation of the document with lines, tables and attachments
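
The "first suitable reader wins" rule can be sketched as below. This is an illustration under stated assumptions rather than the library's code: ExtensionReader, parse_with_first_suitable and NoSuitableReaderError are hypothetical names (the last one stands in for BadFileFormatException), and the toy can_read() takes fewer arguments than the real method.

```python
import os
from typing import Dict, List


class NoSuitableReaderError(Exception):
    """Hypothetical stand-in for BadFileFormatException."""


class ExtensionReader:
    """Toy reader that accepts one extension and tags its output with a name."""

    def __init__(self, name: str, extension: str) -> None:
        self.name = name
        self.extension = extension

    def can_read(self, path: str, extension: str, parameters: Dict[str, str]) -> bool:
        return extension == self.extension

    def read(self, path: str, parameters: Dict[str, str]) -> str:
        return f"{self.name} parsed {path}"


def parse_with_first_suitable(readers: List[ExtensionReader], path: str,
                              parameters: Dict[str, str]) -> str:
    """Use the first reader in the list whose can_read() returns True."""
    extension = os.path.splitext(path)[1].lower()
    for reader in readers:
        if reader.can_read(path, extension, parameters):
            return reader.read(path, parameters)
    raise NoSuitableReaderError(f"no reader can handle {path}")


fast = ExtensionReader("fast", ".pdf")
slow = ExtensionReader("slow", ".pdf")

# Both readers accept .pdf, so the list order decides which one is used.
print(parse_with_first_suitable([fast, slow], "report.pdf", {}))  # fast parsed report.pdf
print(parse_with_first_suitable([slow, fast], "report.pdf", {}))  # slow parsed report.pdf
```

When several readers accept the same format, as the PDF readers below do, putting the preferred reader earlier in the list is what selects it.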

class dedoc.readers.ArchiveReader(*, config: dict)

Bases: BaseReader

This reader returns archived files as attachments of the UnstructuredDocument. Documents with the following extensions can be parsed: .zip, .tar, .tar.gz, .rar, .7z.

__init__(*, config: dict) → None

Parameters:

config – configuration of the reader, e.g. logger for logging

can_read(path: str, mime: str, extension: str, document_type: str | None = None, parameters: dict | None = None) → bool

Check if the document extension is suitable for this reader. See the documentation of BaseReader.can_read() for the method’s parameters.

read(path: str, document_type: str | None = None, parameters: dict | None = None) → UnstructuredDocument

This method returns empty document content for the archive; all archived files are placed in the attachments. See the documentation of BaseReader.read() for the method’s parameters.
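
To illustrate the idea of "no content, only attachments", here is a small sketch that lists the files of a .zip archive with the standard zipfile module. It covers only the zip case and is not how ArchiveReader is implemented; archive_attachment_names is a hypothetical helper.

```python
import io
import zipfile
from typing import List


def archive_attachment_names(data: bytes) -> List[str]:
    """Return the names of the files (not directories) stored in a zip archive."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return [info.filename for info in archive.infolist() if not info.is_dir()]


# Build a small archive in memory so the example needs no files on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("a.txt", "alpha")
    archive.writestr("docs/b.csv", "x,y")

names = archive_attachment_names(buffer.getvalue())
print(names)  # ['a.txt', 'docs/b.csv']
```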

class dedoc.readers.CSVReader

Bases: BaseReader

This class parses files with the following extensions: .csv, .tsv.

__init__() → None

can_read(path: str, mime: str, extension: str, document_type: str | None = None, parameters: dict | None = None) → bool

Check if the document extension is suitable for this reader. See the documentation of BaseReader.can_read() for the method’s parameters.

read(path: str, document_type: str | None = None, parameters: dict | None = None) → UnstructuredDocument

This method places all extracted content in the tables of the UnstructuredDocument; the lines and attachments remain empty. See the documentation of BaseReader.read() for the method’s parameters.
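
A rough sketch of turning delimited text into a single table, using the standard csv module. The parameter key "delimiter" and its default are assumptions made for this example (the text above only says that a csv delimiter can be passed in parameters), and csv_text_to_table is a hypothetical helper, not part of dedoc.

```python
import csv
import io
from typing import List, Optional


def csv_text_to_table(text: str, parameters: Optional[dict] = None) -> List[List[str]]:
    """Parse delimited text into a list of rows, each row a list of cell strings."""
    parameters = parameters or {}
    # Assumed key name; falls back to a comma when no delimiter is given.
    delimiter = parameters.get("delimiter", ",")
    return [row for row in csv.reader(io.StringIO(text), delimiter=delimiter)]


table = csv_text_to_table("name;age\nAnn;31\n", {"delimiter": ";"})
print(table)  # [['name', 'age'], ['Ann', '31']]
```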

class dedoc.readers.DocxReader(*, config: dict)

Bases: BaseReader

This class parses documents with the .docx extension. Use DocxConverter to get a docx file from similar formats.

__init__(*, config: dict) → None

Parameters:

config – configuration of the reader, e.g. logger for logging

can_read(path: str, mime: str, extension: str, document_type: str | None = None, parameters: dict | None = None) → bool

Check if the document extension is suitable for this reader. See the documentation of BaseReader.can_read() for the method’s parameters.

read(path: str, document_type: str | None = None, parameters: dict | None = None) → UnstructuredDocument

This method returns the document content with all of the document’s lines, tables and attachments. This reader can add additional information to the tag_hierarchy_level of LineMetadata. See the documentation of BaseReader.read() for the method’s parameters.

class dedoc.readers.EmailReader(*, config: dict)

Bases: BaseReader

This class parses documents with the .eml extension (e-mail messages saved to files).

__init__(*, config: dict) → None

Parameters:

config – configuration of the reader, e.g. logger for logging

can_read(path: str, mime: str, extension: str, document_type: str | None = None, parameters: dict | None = None) → bool

Check if the document extension or MIME type is suitable for this reader. See the documentation of BaseReader.can_read() for the method’s parameters.

read(path: str, document_type: str | None = None, parameters: dict | None = None) → UnstructuredDocument

This method returns the document content with all of the document’s lines, tables and attachments. This reader can add additional information to the tag_hierarchy_level of LineMetadata. It also saves some data from the message’s header (fields “subject”, “from”, “to”, “cc”, “bcc”, “date”, “reply-to”) to an attached json file with the prefix message_header_.

See the documentation of BaseReader.read() for the method’s parameters.
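
The header-saving step can be approximated with the standard email and json modules. This is a sketch of the idea only: save_message_header is a hypothetical helper, and the exact file name and JSON layout that EmailReader produces are not specified here beyond the message_header_ prefix and the listed fields.

```python
import json
import os
import tempfile
from email import message_from_string

# Header fields named in the description above.
HEADER_FIELDS = ("subject", "from", "to", "cc", "bcc", "date", "reply-to")


def save_message_header(raw_message: str, out_dir: str) -> str:
    """Write selected header fields to a json file and return its path."""
    message = message_from_string(raw_message)
    # Header lookup is case-insensitive; missing fields become None.
    header = {name: message.get(name) for name in HEADER_FIELDS}
    fd, path = tempfile.mkstemp(prefix="message_header_", suffix=".json", dir=out_dir)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(header, f, ensure_ascii=False)
    return path


raw = "Subject: Hello\nFrom: ann@example.com\nTo: bob@example.com\n\nBody text\n"
with tempfile.TemporaryDirectory() as out_dir:
    header_path = save_message_header(raw, out_dir)
    with open(header_path, encoding="utf-8") as f:
        saved = json.load(f)

print(saved["subject"], saved["cc"])  # Hello None
```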

class dedoc.readers.ExcelReader

Bases: BaseReader

This class parses documents with the .xlsx extension. Use ExcelConverter to get an xlsx file from similar formats.

__init__() → None

can_read(path: str, mime: str, extension: str, document_type: str | None = None, parameters: dict | None = None) → bool

Check if the document extension is suitable for this reader. See the documentation of BaseReader.can_read() for the method’s parameters.

read(path: str, document_type: str | None = None, parameters: dict | None = None) → UnstructuredDocument

This method extracts tables and attachments from the document; the lines attribute remains empty. See the documentation of BaseReader.read() for the method’s parameters.

class dedoc.readers.HtmlReader(*, config: dict)

Bases: BaseReader

This reader handles documents with the following extensions: .html, .shtml.

__init__(*, config: dict) → None

Parameters:

config – configuration of the reader, e.g. logger for logging

can_read(path: str, mime: str, extension: str, document_type: str | None = None, parameters: dict | None = None) → bool

Check if the document extension is suitable for this reader. See the documentation of BaseReader.can_read() for the method’s parameters.

read(path: str, document_type: str | None = None, parameters: dict | None = None) → UnstructuredDocument

This method returns the document content with all of the document’s lines and tables; attachments remain empty. This reader can add additional information to the tag_hierarchy_level of LineMetadata. See the documentation of BaseReader.read() for the method’s parameters.

class dedoc.readers.JsonReader

Bases: BaseReader

This reader handles json files.

__init__() → None

can_read(path: str, mime: str, extension: str, document_type: str | None = None, parameters: dict | None = None) → bool

Check if the document extension is suitable for this reader (it has the .json extension). See the documentation of BaseReader.can_read() for the method’s parameters.

read(path: str, document_type: str | None = None, parameters: dict | None = None) → UnstructuredDocument

This method returns the document content with all of the document’s lines and attachments; tables remain empty. This reader treats json lists as list items and adds this information to the tag_hierarchy_level of LineMetadata. Dictionaries are processed by creating a key line of type key, with the value line as its child. See the documentation of BaseReader.read() for the method’s parameters.
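
One way to picture the key/child and list-item treatment is the recursive flattening below. It is a sketch under assumptions: the (depth, type, text) tuple, the depth numbering and the raw_text type name are invented for this example, and json_to_lines is a hypothetical helper rather than the reader's actual algorithm.

```python
import json
from typing import Any, List, Tuple

Line = Tuple[int, str, str]  # (depth, line type, text)


def json_to_lines(value: Any, depth: int = 0) -> List[Line]:
    """Flatten parsed json into lines: dict keys become "key" lines with their
    values one level deeper, and scalar list elements become "list_item" lines."""
    lines: List[Line] = []
    if isinstance(value, dict):
        for key, child in value.items():
            lines.append((depth, "key", str(key)))
            lines.extend(json_to_lines(child, depth + 1))
    elif isinstance(value, list):
        for item in value:
            if isinstance(item, (dict, list)):
                # Nested containers are flattened one level deeper.
                lines.extend(json_to_lines(item, depth + 1))
            else:
                lines.append((depth, "list_item", str(item)))
    else:
        lines.append((depth, "raw_text", str(value)))
    return lines


parsed = json.loads('{"title": "Report", "tags": ["a", "b"]}')
for line in json_to_lines(parsed):
    print(line)
# (0, 'key', 'title')
# (1, 'raw_text', 'Report')
# (0, 'key', 'tags')
# (1, 'list_item', 'a')
# (1, 'list_item', 'b')
```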

class dedoc.readers.MhtmlReader(*, config: dict)

Bases: BaseReader

This reader can process files with the following extensions: .mhtml, .mht, .mhtml.gz, .mht.gz.

__init__(*, config: dict) → None

Parameters:

config – configuration of the reader, e.g. logger for logging

can_read(path: str, mime: str, extension: str, document_type: str | None = None, parameters: dict | None = None) → bool

Check if the document extension is suitable for this reader. See the documentation of BaseReader.can_read() for the method’s parameters.

read(path: str, document_type: str | None = None, parameters: dict | None = None) → UnstructuredDocument

This method returns the document content with all of the document’s lines, tables and attachments. This reader can add additional information to the tag_hierarchy_level of LineMetadata. See the documentation of BaseReader.read() for the method’s parameters.

class dedoc.readers.NoteReader(*, config: dict)

Bases: BaseReader

This class parses documents with the .note.pickle extension.

__init__(*, config: dict) → None

Parameters:

config – configuration of the reader, e.g. logger for logging

can_read(path: str, mime: str, extension: str, document_type: str | None = None, parameters: dict | None = None) → bool

Check if the document extension is suitable for this reader. See the documentation of BaseReader.can_read() for the method’s parameters.

read(path: str, document_type: str | None = None, parameters: dict | None = None) → UnstructuredDocument

This method returns the document content with all of the document’s lines. See the documentation of BaseReader.read() for the method’s parameters.

class dedoc.readers.PptxReader

Bases: BaseReader

This class parses documents with the .pptx extension. Use PptxConverter to get a pptx file from similar formats.

__init__() → None

can_read(path: str, mime: str, extension: str, document_type: str | None = None, parameters: dict | None = None) → bool

Check if the document extension is suitable for this reader. See the documentation of BaseReader.can_read() for the method’s parameters.

read(path: str, document_type: str | None = None, parameters: dict | None = None) → UnstructuredDocument

This method returns the document content with all of the document’s lines, tables and attachments. See the documentation of BaseReader.read() for the method’s parameters.

class dedoc.readers.PdfBaseReader(config: dict)

Bases: BaseReader

Base class for parsing PDF documents.

__init__(config: dict) → None

Parameters:

config – configuration of the reader, e.g. logger for logging

read(path: str, document_type: str | None = None, parameters: dict | None = None) → UnstructuredDocument

This method returns the document content with all of the document’s lines, tables and attachments. This reader can add additional information to the tag_hierarchy_level of LineMetadata. See the documentation of BaseReader.read() for the method’s parameters.

class dedoc.readers.PdfImageReader(*, config: dict)

Bases: PdfBaseReader

This class extracts content from .pdf documents without a textual layer (documents whose text cannot be copied), as well as from images (scanned documents).

The following features are implemented to improve the recognition results:

optical character recognition using Tesseract OCR;
table detection and recognition;
document binarization (configured via the need_binarization parameter);
document orientation correction (automatic rotation by 90, 180 or 270 degrees when needed);
classification of one- and two-column documents;
detection of bold text.

Using this reader to extract content from PDF documents with a correct textual layer is not recommended; use the other PDF readers instead.

__init__(*, config: dict) → None

Parameters:

config – configuration of the reader, e.g. logger for logging

can_read(path: str, mime: str, extension: str, document_type: str | None = None, parameters: dict | None = None) → bool

Check if the document is suitable for this reader, i.e. it has the .pdf extension or it is an image. See the documentation of BaseReader.can_read() for the method’s parameters.

class dedoc.readers.PdfTabbyReader(*, config: dict)

Bases: PdfBaseReader

This class extracts content (text and tables) from .pdf documents with a textual layer (copyable documents). It uses Java code to get the result.

It is recommended to use this class as a handler for PDF documents with a correct textual layer if you don’t need to check textual layer correctness. For more information, see the description of the pdf_with_text_layer option in the table “Api parameters for files parsing via dedoc”.

__init__(*, config: dict) → None

Parameters:

config – configuration of the reader, e.g. logger for logging

can_read(path: str, mime: str, extension: str, document_type: str | None = None, parameters: dict | None = None) → bool

Check if the document extension is suitable for this reader (only the PDF format is supported). This method returns True only when the parameters dictionary contains the key pdf_with_text_layer with the value tabby.

See the table “Api parameters for files parsing via dedoc” for more information about the possible keys of the parameters dictionary.

See the documentation of BaseReader.can_read() for the method’s parameters.

read(path: str, document_type: str | None = None, parameters: dict | None = None) → UnstructuredDocument

This method returns the document content with all of the document’s lines, tables and attachments. This reader can add additional information to the tag_hierarchy_level of LineMetadata. See the documentation of BaseReader.read() for the method’s parameters.

class dedoc.readers.PdfTxtlayerReader(*, config: dict)

Bases: PdfBaseReader

This class extracts content (text, tables, attachments) from .pdf documents with a textual layer (copyable documents). It uses the pdfminer library for content extraction.

For more information, see the description of the pdf_with_text_layer option in the table “Api parameters for files parsing via dedoc”.

__init__(*, config: dict) → None

Parameters:

config – configuration of the reader, e.g. logger for logging

can_read(path: str, mime: str, extension: str, document_type: str | None = None, parameters: dict | None = None) → bool

Check if the document extension is suitable for this reader (only the PDF format is supported). This method returns True only when the parameters dictionary contains the key pdf_with_text_layer with the value true.

See the table “Api parameters for files parsing via dedoc” for more information about the possible keys of the parameters dictionary.

See the documentation of BaseReader.can_read() for the method’s parameters.

class dedoc.readers.PdfAutoReader(*, config: dict)

Bases: BaseReader

This class extracts content from .pdf documents of any kind. PDF documents can have a textual layer (copyable documents) or lack one (images, scanned documents).

PdfAutoReader automatically detects whether the given PDF file has a correct textual layer:

if the PDF document has a correct textual layer, then PdfTxtlayerReader or PdfTabbyReader is used for document content extraction;
if the PDF document doesn’t have a correct textual layer, then PdfImageReader is used for document content extraction.

For more information, see the description of the pdf_with_text_layer option in the table “Api parameters for files parsing via dedoc”.

__init__(*, config: dict) → None

Parameters:

config – configuration of the reader, e.g. logger for logging

can_read(path: str, mime: str, extension: str, document_type: str | None = None, parameters: dict | None = None) → bool

Check if the document extension is suitable for this reader (only the PDF format is supported). This method returns True only when the parameters dictionary contains the key pdf_with_text_layer with the value auto or auto_tabby.

It is recommended to use pdf_with_text_layer=auto_tabby because it is faster and gives better results. See the table “Api parameters for files parsing via dedoc” for more information about the possible keys of the parameters dictionary.

See the documentation of BaseReader.can_read() for the method’s parameters.

read(path: str, document_type: str | None = None, parameters: dict | None = None) → UnstructuredDocument

This method returns the document content with all of the document’s lines, tables and attachments. This reader can add additional information to the tag_hierarchy_level of LineMetadata. See the documentation of BaseReader.read() for the method’s parameters.
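
The can_read() conditions of PdfTabbyReader, PdfTxtlayerReader and PdfAutoReader all depend on the pdf_with_text_layer value, so they can be summarized as a lookup. The sketch below only restates the values named in this section; reader_for_pdf is a hypothetical helper, and values not described here (including whatever selects PdfImageReader) simply map to None.

```python
from typing import Optional

# pdf_with_text_layer value -> reader whose can_read() accepts it,
# as described in this section.
READER_BY_OPTION = {
    "tabby": "PdfTabbyReader",
    "true": "PdfTxtlayerReader",
    "auto": "PdfAutoReader",
    "auto_tabby": "PdfAutoReader",
}


def reader_for_pdf(parameters: Optional[dict]) -> Optional[str]:
    """Return the name of the reader selected by pdf_with_text_layer, if any."""
    option = str((parameters or {}).get("pdf_with_text_layer", "")).lower()
    return READER_BY_OPTION.get(option)


print(reader_for_pdf({"pdf_with_text_layer": "auto_tabby"}))  # PdfAutoReader
print(reader_for_pdf({"pdf_with_text_layer": "tabby"}))       # PdfTabbyReader
print(reader_for_pdf({}))                                     # None
```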

class dedoc.readers.RawTextReader(*, config: dict)

Bases: BaseReader

This class parses files with the following extensions: .txt, .txt.gz.

__init__(*, config: dict) → None

Parameters:

config – configuration of the reader, e.g. logger for logging

can_read(path: str, mime: str, extension: str, document_type: str | None = None, parameters: dict | None = None) → bool

Check if the document extension is suitable for this reader. See the documentation of BaseReader.can_read() for the method’s parameters.

read(path: str, document_type: str | None = None, parameters: dict | None = None) → UnstructuredDocument

This method returns only the document’s lines. Some line types (e.g. list_item) may be detected using regular expressions. See the documentation of BaseReader.read() for the method’s parameters.
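
Regex-based detection of list items might look like the sketch below. The pattern, the raw_text label and the classify_lines helper are assumptions made for illustration; the expressions that RawTextReader actually uses are not given in this section.

```python
import re
from typing import List, Tuple

# Assumed pattern: "1." / "2)" style numbering or a "-", "*", "•" bullet,
# followed by whitespace and some text.
LIST_ITEM = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s+\S")


def classify_lines(text: str) -> List[Tuple[str, str]]:
    """Label each non-empty line as "list_item" or "raw_text"."""
    return [("list_item" if LIST_ITEM.match(line) else "raw_text", line)
            for line in text.splitlines() if line.strip()]


labeled = classify_lines("Shopping\n1. milk\n- eggs\n2) tea\nDone.")
for line_type, line in labeled:
    print(line_type, "|", line)
# raw_text | Shopping
# list_item | 1. milk
# list_item | - eggs
# list_item | 2) tea
# raw_text | Done.
```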