Changelog
=========

v2.6.1 (2025-12-16)
-------------------
Release note: `v2.6.1 `_

* Fixed some bugs in `DocxReader`.
* Replace outdated `pylzma` dependency with `py7zr`.

v2.6 (2025-09-19)
-----------------
Release note: `v2.6 `_

* Improved table merge algorithm (added check on table layout) in `MultiPageTableExtractor`.
* Improved header and footer analysis in `HeaderFooterDetector`.
* Added header and footer analysis support in `PdfTabbyReader`.
* Added header and footer analysis info (parameter `need_header_footer_analysis`) to the documentation.
* Updated to python3.10.
* Updated to ubuntu22.04.
* Added `Support and Contributing` (project rules, how to build, how to develop) to the documentation.

v2.5 (2025-09-05)
-----------------
Release note: `v2.5 `_

* Added simple multilingual textual layer correctness classification based on letter percentage calculation (`textual_layer_classifier=letter`).
* Added a new parameter `textual_layer_classifier = [simple, ml (default), letter]`.
* Removed parameter `fast_textual_layer_detection`. It is replaced by `textual_layer_classifier=simple`.
* Fixed bug with `table_type=table_wo_external_bounds` (fixed cv2.BoundingRect).
* Added parameter `table_type` and `TableRecognition` info to the documentation.

v2.4 (2025-07-28)
-----------------
Release note: `v2.4 `_

* Upgrade `PyPDF2` to `pypdf>4` and fix bugs in attachments extraction from PDF files.
* Added `each_page_textual_layer_detection` parameter for textual layer detection on each page of PDF documents (for `PdfAutoReader`).
* Added `ENABLE_CANCELLATION` env variable for enabling/disabling parsing cancellation after client disconnection (enabled by default).
* Fixed location coordinates of attached images extracted by `PdfTabbyReader`.
* Added new reader `PdfBrokenEncodingReader` for PDF documents with textual layer but broken encoding (`pdf_with_text_layer=bad_encoding`).

v2.3.2 (2024-12-25)
-------------------
Release note: `v2.3.2 `_

* Improve merging of multi-page tables in `PdfTabbyReader`.
* Stop parsing after client disconnection (for API usage, see `issue 488 `_).

v2.3.1 (2024-11-15)
-------------------
Release note: `v2.3.1 `_

* Fix bug with bold lines in `DocxReader` (see `issue 479 `_).
* Upgraded requirements.txt (beautifulsoup4 to 4.12.3 version).
* Added support for external GROBID (added support for the env variable `GROBID_AUTH_KEY`, used for "Authorization" in the request header).
* Added GOST (Russian government standard) frame recognition in `PdfTabbyReader` (`need_gost_frame_analysis` parameter).
* Update documentation (added GOST frame recognition).

v2.3 (2024-09-19)
-----------------
Release note: `v2.3 `_

* `Dedoc telegram chat `_ created.
* Added `patterns` parameter for configuring default structure type (:ref:`using_patterns`).
* Added notebooks with Dedoc usage :ref:`table_notebooks` (see `issue 484 `_).
* Fix bug `OutOfMemoryError: Java heap space` in `PdfTabbyReader` (see `issue 489 `_).
* Fix bug with numeration in `DocxReader` (see `issue 494 `_).
* Added GOST (Russian government standard) frame recognition in `PdfImageReader` and `PdfTxtlayerReader` (`need_gost_frame_analysis` parameter).

v2.2.7 (2024-08-16)
-------------------
Release note: `v2.2.7 `_

* Fix bugs with `start`, `end` of `BBoxAnnotation` in `PdfTabbyReader`.
* Improve columns classification and orientation detection for PDF and images (`is_one_column_document` and `document_orientation` parameters).
* Upgrade `docker`: `docker-compose` is no longer supported, use `docker compose` instead.
* Fix bug in table parsing in `DocxReader` (see `issue 478 `_).
* Added simple textual layer detection in `PdfAutoReader` (`fast_textual_layer_detection` parameter).
* Improve paragraph extraction from PDF documents and images.
* Retrain a classifier for diplomas (document_type="diploma") on a new dataset.

v2.2.6 (2024-07-22)
-------------------
Release note: `v2.2.6 `_

* Upgrade dependencies: `numpy<2.0` and `dedoc-utils==0.3.7`.
* Since this version, `dedoc` is supported by `langchain `_ (langchain-community>=0.2.10).

v2.2.5 (2024-07-15)
-------------------
Release note: `v2.2.5 `_

* Added internal functions and classes to support integration of Dedoc into `langchain `_.
* Upgrade some dependencies, in particular, `xgboost>=1.6.0`, `pandas`, `pdfminer.six`.

v2.2.4 (2024-06-20)
-------------------
Release note: `v2.2.4 `_

* Show page division and page numbers in the HTML output representation (API usage, return_format="html").
* Make imports from dedoc library faster.
* Added tutorial on how to add a new language to dedoc (not finished entirely).
* Added additional page_id metadata for multi-page nodes (structure_type="tree" in API, `TreeConstructor` in the library).
* Updated OCR and orientation/columns classification benchmarks.
* Minor edits of `README.md`.
* Fixed empty cells handling in `CSVReader`.
* Fixed bounding boxes extraction for text in tables for `PdfTabbyReader`.

v2.2.3 (2024-06-05)
-------------------
Release note: `v2.2.3 `_

* Show attached images and added ability to download attached files in the HTML output representation (API usage, return_format="html").
* Added hierarchy level information and annotations to `PptxReader`.

v2.2.2 (2024-05-21)
-------------------
Release note: `v2.2.2 `_

* Added images extraction to `ArticleReader`.
* Added attachments and references to them in the HTML output representation (return_format="html").
* Fixed functionality of parameter `need_content_analysis`.
* Fixed `CSVReader` (exclude BOM character from the output).
* Added handling of files with a wrong extension or without an extension to `DedocManager` (detect file type by its content).
* Update `README.md`.

v2.2.1 (2024-05-03)
-------------------
Release note: `v2.2.1 `_

* Added `fintoc` structure type for parsing financial prospectuses according to the `FinTOC 2022 Shared task `_ (`FintocStructureExtractor`).
* Fixed small bugs in `ArticleReader`: colspan for tables, keywords, sections numbering, etc.
* Added references to nodes and fixed small bugs in the HTML output representation (return_format="html").
* Removed `other_fields` from `LineMetadata` and `DocumentMetadata`.
* Update `README.md`.

v2.2 (2024-04-17)
-----------------
Release note: `v2.2 `_

* `PdfTabbyReader` improved: bug fixes, speed increase of partial PDF extraction (with parameter `pages`).
* Added benchmarks for evaluation of PDF readers performance.
* Added `ReferenceAnnotation` class.
* Fixed bug in `can_read` method for all readers.
* Added `article` structure type for parsing scientific articles using `GROBID `_ (`ArticleReader`, `ArticleStructureExtractor`).

v2.1.1 (2024-03-21)
-------------------
Release note: `v2.1.1 `_

* Update `README.md`.
* Update table and time benchmarks.
* Re-label line-classifier datasets (law, tz, diploma, paragraphs datasets).
* Update tasker creators (for the labeling system).
* Fix HTML table parsing.

v2.1 (2024-03-05)
-----------------
Release note: `v2.1 `_

* Custom loggers deleted (the common logger is used for all dedoc classes).
* Do not change the document image if it has a correct orientation (orientation correction function changed).
* Use only `PdfTabbyReader` during detection of a textual layer in PDF files.
* Code related to the labeling mode refactored and removed from the library package (it is located in a separate directory).
* Added `BoldAnnotation` for words in `PdfImageReader`.
* More benchmarks are added: images of tables parsing, postprocessing of Tesseract OCR.
* Some fixes are made in the web form of Dedoc.
* Tutorial on how to add a new structure type to Dedoc added.
* Parsing of EML and HTML files fixed.

v2.0 (2023-12-25)
-----------------
Release note: `v2.0 `_

* Fix table extraction from PDF using empty config (see `issue `_).
* Add more benchmarks for Tesseract.
* Fix extension extraction for file names with several dots.
* Change names of some methods and their parameters for all main classes (attachments extractors, converters, readers, metadata extractors, structure extractors, structure constructors). Please see the `Package reference` section of the `documentation `_ for more details.
* Add `AttachAnnotation` and `TableAnnotation` to `PPTX` (see `discussion `_).
* Fix bugs in `DOCX` handling (see issues `378 `_, `379 `_).

v1.1.1 (2023-11-24)
-------------------
Release note: `v1.1.1 `_

* Use older `pydantic` version for improving compatibility with other libraries.
* Add support for `RTF` format.
* Fix bug in handling file names with dots and spaces.
* Fix bug in non-integer values of text formatting in `DocxReader`.
* Add support of `on_gpu` parameter in `config`.
* Add attached images extraction for `PdfTabbyReader`.
* Fix partial file reading for `PdfTabbyReader`.
* Add tutorial on how to create dedoc's basic data structures.
* Fix `attachments_dir` parameter for readers and attachments extractors.

v1.1.0 (2023-10-24)
-------------------
Release note: `v1.1.0 `_

* Add `BBoxAnnotation` to table cells for `PdfTabbyReader`.
* Fix swagger, add api schema classes, remove `to_dict` method from `ParsedDocument`.
* Improve parsing PDF by `PdfTxtlayerReader`, add benchmarks.
* Fix `BBoxAnnotation` extraction for tables in `PdfImageReader` using `table_type=split_last_column` parameter.
* Change base method of metadata extractors, rename it to `extract_metadata`.
* Unify `BBoxAnnotation` extraction for all PDF readers - return only words bboxes.
* Increase timeout value for all converters.

v1.0 (2023-10-10)
-----------------
Release note: `v1.0 `_

* Remove `is_one_column_document_list` parameter.
* Add tutorial about support for a new document type to the documentation.
* Improve textual layer correctness classifier.
* Improve orientation and columns classifier.
* Change table's output structure - added `CellWithMeta` instead of a textual string.
* Add `BBoxAnnotation` to table cells for `PdfTxtlayerReader` and `PdfImageReader`.
* Add `ConfidenceAnnotation` to table cells for `PdfImageReader`.
* Remove `insert_table` parameter.
* Added information about table and page rotation to the table and document metadata respectively.
* Use `dedoc-utils `_ library for document images preprocessing.
* Change web interface, fix online-examples of document processing.
* Add comparison operator to `LineWithMeta`.

v0.11.2 (2023-09-06)
--------------------
Release note: `v0.11.2 `_

* Remove plexus-utils-1.1.jar.
* Update installation documentation.
* Add documentation for Tesseract OCR installation.
* Add documentation for annotations.
* Add documentation for secure torch.
* Fix examples.

v0.11.1 (2023-08-30)
--------------------
Release note: `v0.11.1 `_

* Add bbox annotations in `PdfTabbyReader`.
* Add bbox annotations for words in `PdfTxtlayerReader`.
* Add an option `plain_text` to the `return_format` parameter.
* Reduce size of the dedoc base image, move dockerfiles to the `separate repository `_.
* Refactor script for tesseract benchmarking.
* Make fixed dedoc dependencies as ranges.
* Add table cell properties in `PdfTabbyReader`.

v0.11.0 (2023-08-22)
--------------------
Release note: `v0.11.0 `_

* Rename exceptions classes.
* Update style tests.
* Change `ConfidenceAnnotation` value range to `[0, 1]`.
* Add bbox annotations for words in `PdfImageReader`.

v0.10.0 (2023-08-01)
--------------------
Release note: `v0.10.0 `_

* Add `ConfidenceAnnotation` for `PdfImageReader`.
* Remove version parameter from metadata extractors, structure constructors and parsed document methods.
* Add version file and version resolving for the library.
* Add recursive handling of attachments.
* Add parameter for saving attachments in a custom directory.
* Remove dedoc threaded manager.
* Improve `PdfAutoReader`.
* Add temporary file name to `DocumentMetadata`.

v0.9.2 (2023-07-18)
-------------------
Release note: `v0.9.2 `_

* Fix bug for diplomas with `insert_table=true`.
* Fix logging in PDF slicing.
* Make `PdfAutoReader` faster.
* Update bold classifier.
* Refactor tests.
* Fix bug in models downloading inside docker container.

v0.9.1 (2023-07-05)
-------------------
Release note: `v0.9.1 `_

* Fixed bug with `AttachAnnotation` in docx: its value is equal to the attachment uid instead of the file name.

v0.9 (2023-06-26)
-----------------
Release note: `v0.9 `_

* Publication of the first version of dedoc library.