Changelog
v2.7 (2026-06-24)
Release note: v2.7
Added comments extraction from DOCX files. Extracted comments are stored in LinkedTextAnnotation of the commented line.
Added attachment extraction (images) to PdfImageReader (for images and PDF without a textual layer).
Small bug fix in PdfTabbyReader.
Added notes/comments extraction from PDF (parameter extract_notes) with linking to the commented text. Extracted comments are stored in LinkedTextAnnotation of the commented line.
v2.6.1 (2025-12-16)
Release note: v2.6.1
Fixed some bugs in DocxReader.
Replaced the outdated pylzma dependency with py7zr.
v2.6 (2025-09-19)
Release note: v2.6
Improved table merge algorithm (added check on table layout).
Improved header footer analysis of HeaderFooterDetector.
Added header footer analysis support in PdfTabbyReader.
Added header footer analysis info (parameter need_header_footer_analysis) in documentation.
Updated to python3.10.
Updated to ubuntu22.04.
Added Support and Contributing (project rules, how to build, how to develop) in documentation.
v2.5 (2025-09-05)
Release note: v2.5
Added simple multilingual textual layer correctness classification based on letter percentage calculation (textual_layer_classifier=letter).
Added a new parameter textual_layer_classifier = [simple, ml (default), letter].
Removed parameter fast_textual_layer_detection. Its behavior is now available as textual_layer_classifier=simple.
Fixed bug with table_type=table_wo_external_bounds.
Added parameter table_type and TableRecognizer info into documentation.
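Callers that still pass the removed fast_textual_layer_detection parameter may want to translate it before upgrading. The following is a minimal sketch of such a translation. It assumes parameters are kept in a plain dict and that the legacy flag was enabled with a true value. The helper name migrate_textual_layer_parameters is hypothetical and is not part of dedoc.

```python
def migrate_textual_layer_parameters(parameters: dict) -> dict:
    """Return a copy of `parameters` without the pre-v2.5 flag.

    Hypothetical helper, not part of dedoc: it only rewrites a plain dict.
    """
    migrated = dict(parameters)  # do not mutate the caller's dict
    legacy = migrated.pop("fast_textual_layer_detection", None)

    # Assumption: the legacy flag was enabled with a true value ("true" or True).
    # An explicitly set textual_layer_classifier always takes precedence.
    if str(legacy).lower() == "true" and "textual_layer_classifier" not in migrated:
        migrated["textual_layer_classifier"] = "simple"
    return migrated


old_parameters = {"fast_textual_layer_detection": "true"}
new_parameters = migrate_textual_layer_parameters(old_parameters)
```

A disabled or missing legacy flag is simply dropped, so the default classifier (ml) stays in effect.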
v2.4 (2025-07-28)
Release note: v2.4
Upgrade PyPDF2 to pypdf>4 and fix bugs in attachments extraction from PDF files.
Added each_page_textual_layer_detection parameter for textual layer detection on each page of PDF documents (for PdfAutoReader).
Added ENABLE_CANCELLATION env variable for enabling/disabling parsing cancellation after client disconnection (enabled by default).
Fixed location coordinates of attached images extracted by PdfTabbyReader.
Added new reader PdfBrokenEncodingReader for PDF documents with textual layer but broken encoding (pdf_with_text_layer=bad_encoding).
v2.3.2 (2024-12-25)
Release note: v2.3.2
Improve merging multi-page tables in PdfTabbyReader.
Stop parsing after client disconnection (for API usage, see issue 488).
v2.3.1 (2024-11-15)
Release note: v2.3.1
Fix bug with bold lines in DocxReader (see issue 479).
Upgraded requirements.txt (beautifulsoup4 to 4.12.3 version).
Added support for external grobid (added support env variable GROBID_AUTH_KEY for “Authorization” in request header).
Added GOST (Russian government standard) frame recognition in PdfTabbyReader (need_gost_frame_analysis parameter).
Update documentation (added GOST frame handling).
v2.3 (2024-09-19)
Release note: v2.3
Dedoc telegram chat created.
Added patterns parameter for configuring default structure type (Configure structure extraction using patterns).
Added notebooks with Dedoc usage examples (see issue 484).
Fix bug OutOfMemoryError: Java heap space in PdfTabbyReader (see issue 489).
Fix bug with numeration in DocxReader (see issue 494).
Added GOST (Russian government standard) frame recognition in PdfImageReader and PdfTxtlayerReader (need_gost_frame_analysis parameter).
v2.2.7 (2024-08-16)
Release note: v2.2.7
Fix bugs with start, end of BBoxAnnotation in PdfTabbyReader.
Improve columns classification and orientation detection for PDF and images (is_one_column_document and document_orientation parameters).
Upgrade docker: docker-compose is no longer supported, use docker compose instead.
Fix bug of tables parsing in DocxReader (see issue 478).
Added simple textual layer detection in PdfAutoReader (fast_textual_layer_detection parameter).
Improve paragraph extraction from PDF documents and images.
Retrain a classifier for diplomas (document_type="diploma") on a new dataset.
v2.2.6 (2024-07-22)
Release note: v2.2.6
Upgrade dependencies: numpy<2.0 and dedoc-utils==0.3.7.
Since this version, dedoc is supported by langchain (langchain-community>=0.2.10).
v2.2.5 (2024-07-15)
Release note: v2.2.5
Added internal functions and classes to support integration of Dedoc into langchain.
Upgrade some dependencies, in particular, xgboost>=1.6.0, pandas, pdfminer.six.
v2.2.4 (2024-06-20)
Release note: v2.2.4
Show page division and page numbers in the HTML output representation (API usage, return_format="html").
Make imports from dedoc library faster.
Added tutorial how to add a new language to dedoc (not finished entirely).
Added additional page_id metadata for multi-page nodes (structure_type="tree" in API, TreeConstructor in the library).
Updated OCR and orientation/columns classification benchmarks.
Minor edits of README.md.
Fixed empty cells handling in CSVReader.
Fixed bounding boxes extraction for text in tables for PdfTabbyReader.
v2.2.3 (2024-06-05)
Release note: v2.2.3
Show attached images and added ability to download attached files in the HTML output representation (API usage, return_format="html").
Added hierarchy level information and annotations to PptxReader.
v2.2.2 (2024-05-21)
Release note: v2.2.2
Added images extraction to ArticleReader.
Added attachments and references to them in the HTML output representation (return_format="html").
Fixed functionality of parameter need_content_analysis.
Fixed CSVReader (exclude BOM character from the output).
Added handling files with wrong extension or without extension to DedocManager (detect file type by its content).
Update README.md.
v2.2.1 (2024-05-03)
Release note: v2.2.1
Added fintoc structure type for parsing financial prospects according to the FinTOC 2022 Shared task (FintocStructureExtractor).
Fixed small bugs in ArticleReader: colspan for tables, keywords, sections numbering, etc.
Added references to nodes and fixed small bugs in the HTML output representation (return_format="html").
Removed other_fields from LineMetadata and DocumentMetadata.
Update README.md.
v2.2 (2024-04-17)
Release note: v2.2
PdfTabbyReader improved: bugs fixes, speed increase of partial PDF extraction (with parameter pages).
Added benchmarks for evaluation of PDF readers performance.
Added ReferenceAnnotation class.
Fixed bug in can_read method for all readers.
Added article structure type for parsing scientific articles using GROBID (ArticleReader, ArticleStructureExtractor).
v2.1.1 (2024-03-21)
Release note: v2.1.1
Update README.md.
Update table and time benchmarks.
Re-label line-classifier datasets (law, tz, diploma, paragraphs datasets).
Update tasker creators (for the labeling system).
Fix HTML table parsing.
v2.1 (2024-03-05)
Release note: v2.1
Custom loggers deleted (the common logger is used for all dedoc classes).
Do not change the document image if it has a correct orientation (orientation correction function changed).
Use only PdfTabbyReader during detection of a textual layer in PDF files.
Code related to the labeling mode refactored and removed from the library package (it is located in the separate directory).
Added BoldAnnotation for words in PdfImageReader.
More benchmarks are added: images of tables parsing, postprocessing of Tesseract OCR.
Some fixes are made in a web-form of Dedoc.
Added a tutorial on how to add a new structure type to Dedoc (Adding support for a new structure type to Dedoc).
Parsing of EML and HTML files fixed.
v2.0 (2023-12-25)
Release note: v2.0
Fix table extraction from PDF using empty config (see issue).
Add more benchmarks for Tesseract.
Fix extension extraction for file names with several dots.
Change names of some methods and their parameters for all main classes (attachments extractors, converters, readers, metadata extractors, structure extractors, structure constructors). Please see the Package reference in the documentation for more details.
Add AttachAnnotation and TableAnnotation to PPTX (see discussion).
v1.1.1 (2023-11-24)
Release note: v1.1.1
Use older pydantic version for improving compatibility with other libraries.
Add support for RTF format (DocxConverter).
Fix bug in handling files’ names with dots and spaces.
Fix bug in non-integer values of text formatting in DocxReader.
Add support of on_gpu parameter in config.
Add attached images extraction for PdfTabbyReader.
Fix partial file reading for PdfTabbyReader.
Add tutorial how to create dedoc’s basic data structures.
Fix attachments_dir parameter for readers and attachments extractors.
v1.1.0 (2023-10-24)
Release note: v1.1.0
Add BBoxAnnotation to table cells for PdfTabbyReader.
Fix swagger, add api schema classes, remove to_dict method from ParsedDocument.
Improve parsing PDF by PdfTxtlayerReader, add benchmarks.
Fix BBoxAnnotation extraction for tables in PdfImageReader using table_type=split_last_column parameter.
Change base method of metadata extractors, rename it to extract_metadata.
Unify BBoxAnnotation extraction for all PDF readers - return only words bboxes.
Increase timeout value for all converters.
v1.0 (2023-10-10)
Release note: v1.0
Remove is_one_column_document_list parameter.
Add tutorial about support for a new document type to the documentation.
Improve textual layer correctness classifier.
Improve orientation and columns classifier.
Change table’s output structure - added CellWithMeta instead of a textual string.
Add BBoxAnnotation to table cells for PdfTxtlayerReader and PdfImageReader.
Add ConfidenceAnnotation to table cells for PdfImageReader.
Remove insert_table parameter.
Added information about table and page rotation to the table and document metadata respectively.
Use dedoc-utils library for document images preprocessing.
Change web interface, fix online-examples of document processing.
Add comparison operator to LineWithMeta.
v0.11.2 (2023-09-06)
Release note: v0.11.2
Remove plexus-utils-1.1.jar.
Update installation documentation.
Add documentation for Tesseract OCR installation.
Add documentation for annotations.
Add documentation for secure torch.
Fix examples.
v0.11.1 (2023-08-30)
Release note: v0.11.1
Add bbox annotations in PdfTabbyReader.
Add bbox annotations for words in PdfTxtlayerReader.
Add an option plain_text to the return_format parameter.
Reduce size of the dedoc base image, move dockerfiles to the separate repository.
Refactor script for tesseract benchmarking.
Make fixed dedoc dependencies as ranges.
Add table cell properties in PdfTabbyReader.
v0.11.0 (2023-08-22)
Release note: v0.11.0
Rename exceptions classes.
Update style tests.
Change ConfidenceAnnotation value range to [0, 1].
Add bbox annotations for words in PdfImageReader.
v0.10.0 (2023-08-01)
Release note: v0.10.0
Add ConfidenceAnnotation annotation for PdfImageReader.
Remove version parameter from metadata extractors, structure constructors and parsed document methods.
Add version file and version resolving for the library.
Add recursive handling of attachments.
Add parameter for saving attachments in a custom directory.
Remove dedoc threaded manager.
Improve PdfAutoReader.
Add temporary file name to DocumentMetadata.
v0.9.2 (2023-07-18)
Release note: v0.9.2
Fix bug for diplomas with insert_table=true.
Fix logging in PDF slicing.
Make PdfAutoReader faster.
Update bold classifier.
Tests Refactoring.
Fix bug in models downloading inside docker container.
v0.9.1 (2023-07-05)
Release note: v0.9.1
Fixed bug with AttachAnnotation in docx: its value is equal to attachment uid instead of file name.
v0.9 (2023-06-26)
Release note: v0.9
Publication of the first version of dedoc library.