Changelog
=========

v2.7 (2026-06-24)
-----------------
Release note: `v2.7 `_

* Added comments extraction from DOCX files. Extracted comments are stored in :class:`~dedoc.data_structures.LinkedTextAnnotation` of the commented line.
* Added attachment extraction (images) to :class:`~dedoc.readers.PdfImageReader` (for images and PDF without a textual layer).
* Small bug fix in :class:`~dedoc.readers.PdfTabbyReader`.
* Added notes/comments extraction from PDF (parameter ``extract_notes``) with linking to the commented text. Extracted comments are stored in :class:`~dedoc.data_structures.LinkedTextAnnotation` of the commented line.

v2.6.1 (2025-12-16)
-------------------
Release note: `v2.6.1 `_

* Fixed some bugs in :class:`~dedoc.readers.DocxReader`.
* Replaced the outdated ``pylzma`` dependency with ``py7zr``.

v2.6 (2025-09-19)
-----------------
Release note: `v2.6 `_

* Improved table merge algorithm (added check on table layout).
* Improved header and footer analysis of :class:`~dedoc.readers.pdf_reader.utils.header_footers_analysis.HeaderFooterDetector`.
* Added header and footer analysis support in :class:`~dedoc.readers.PdfTabbyReader`.
* Added header and footer analysis info (parameter ``need_header_footer_analysis``) in documentation.
* Updated to Python 3.10.
* Updated to Ubuntu 22.04.
* Added :ref:`contributing` (project rules, how to build, how to develop) in documentation.

v2.5 (2025-09-05)
-----------------
Release note: `v2.5 `_

* Added simple multilingual textual layer correctness classification based on letter percentage calculation (``textual_layer_classifier=letter``).
* Added a new parameter ``textual_layer_classifier = [simple, ml (default), letter]``.
* Removed parameter ``fast_textual_layer_detection``. It is replaced by ``textual_layer_classifier=simple``.
* Fixed bug with ``table_type=table_wo_external_bounds``.
* Added parameter ``table_type`` and :class:`~dedoc.readers.pdf_reader.pdf_image_reader.table_recognizer.table_recognizer.TableRecognizer` info into documentation.

v2.4 (2025-07-28)
-----------------
Release note: `v2.4 `_

* Upgrade ``PyPDF2`` to ``pypdf>4`` and fix bugs in attachments extraction from PDF files.
* Added ``each_page_textual_layer_detection`` parameter for textual layer detection on each page of PDF documents (for :class:`~dedoc.readers.PdfAutoReader`).
* Added ``ENABLE_CANCELLATION`` env variable for enabling/disabling parsing cancellation after client disconnection (enabled by default).
* Fixed location coordinates of attached images extracted by :class:`~dedoc.readers.PdfTabbyReader`.
* Added new reader :class:`~dedoc.readers.PdfBrokenEncodingReader` for PDF documents with textual layer but broken encoding (``pdf_with_text_layer=bad_encoding``).

v2.3.2 (2024-12-25)
-------------------
Release note: `v2.3.2 `_

* Improve merging of multi-page tables in :class:`~dedoc.readers.PdfTabbyReader`.
* Stop parsing after client disconnection (for API usage, see `issue 488 `_).

v2.3.1 (2024-11-15)
-------------------
Release note: `v2.3.1 `_

* Fix bug with bold lines in :class:`~dedoc.readers.DocxReader` (see `issue 479 `_).
* Upgraded requirements.txt (beautifulsoup4 to 4.12.3 version).
* Added support for external grobid (added support for the env variable ``GROBID_AUTH_KEY`` for "Authorization" in request header).
* Added GOST (Russian government standard) frame recognition in :class:`~dedoc.readers.PdfTabbyReader` (``need_gost_frame_analysis`` parameter).
* Update documentation (added :ref:`gost_frame_handling`).

v2.3 (2024-09-19)
-----------------
Release note: `v2.3 `_

* `Dedoc telegram chat `_ created.
* Added ``patterns`` parameter for configuring default structure type (:ref:`using_patterns`).
* Added notebooks with Dedoc usage :ref:`table_notebooks` (see `issue 484 `_).
* Fix bug ``OutOfMemoryError: Java heap space`` in :class:`~dedoc.readers.PdfTabbyReader` (see `issue 489 `_).
* Fix bug with numeration in :class:`~dedoc.readers.DocxReader` (see `issue 494 `_).
* Added GOST (Russian government standard) frame recognition in :class:`~dedoc.readers.PdfImageReader` and :class:`~dedoc.readers.PdfTxtlayerReader` (``need_gost_frame_analysis`` parameter).

v2.2.7 (2024-08-16)
-------------------
Release note: `v2.2.7 `_

* Fix bugs with ``start``, ``end`` of :class:`~dedoc.data_structures.BBoxAnnotation` in :class:`~dedoc.readers.PdfTabbyReader`.
* Improve columns classification and orientation detection for PDF and images (``is_one_column_document`` and ``document_orientation`` parameters).
* Upgrade ``docker``: ``docker-compose`` is no longer supported, use ``docker compose`` instead.
* Fix bug in table parsing in :class:`~dedoc.readers.DocxReader` (see `issue 478 `_).
* Added simple textual layer detection in :class:`~dedoc.readers.PdfAutoReader` (``fast_textual_layer_detection`` parameter).
* Improve paragraph extraction from PDF documents and images.
* Retrain a classifier for diplomas (``document_type="diploma"``) on a new dataset.

v2.2.6 (2024-07-22)
-------------------
Release note: `v2.2.6 `_

* Upgrade dependencies: ``numpy<2.0`` and ``dedoc-utils==0.3.7``.
* Since this version, ``dedoc`` is supported by `langchain `_ (langchain-community>=0.2.10).

v2.2.5 (2024-07-15)
-------------------
Release note: `v2.2.5 `_

* Added internal functions and classes to support integration of Dedoc into `langchain `_.
* Upgrade some dependencies, in particular, ``xgboost>=1.6.0``, ``pandas``, ``pdfminer.six``.

v2.2.4 (2024-06-20)
-------------------
Release note: `v2.2.4 `_

* Show page division and page numbers in the HTML output representation (API usage, ``return_format="html"``).
* Make imports from dedoc library faster.
* Added a tutorial on how to add a new language to dedoc (not yet complete).
* Added additional ``page_id`` metadata for multi-page nodes (``structure_type="tree"`` in API, :class:`~dedoc.structure_constructors.TreeConstructor` in the library).
* Updated OCR and orientation/columns classification benchmarks.
* Minor edits of ``README.md``.
* Fixed empty cells handling in :class:`~dedoc.readers.CSVReader`.
* Fixed bounding boxes extraction for text in tables for :class:`~dedoc.readers.PdfTabbyReader`.

v2.2.3 (2024-06-05)
-------------------
Release note: `v2.2.3 `_

* Show attached images and added ability to download attached files in the HTML output representation (API usage, ``return_format="html"``).
* Added hierarchy level information and annotations to :class:`~dedoc.readers.PptxReader`.

v2.2.2 (2024-05-21)
-------------------
Release note: `v2.2.2 `_

* Added images extraction to :class:`~dedoc.readers.ArticleReader`.
* Added attachments and references to them in the HTML output representation (``return_format="html"``).
* Fixed functionality of parameter ``need_content_analysis``.
* Fixed :class:`~dedoc.readers.CSVReader` (exclude BOM character from the output).
* Added handling of files with a wrong extension or without an extension to :class:`~dedoc.DedocManager` (detect file type by its content).
* Update ``README.md``.

v2.2.1 (2024-05-03)
-------------------
Release note: `v2.2.1 `_

* Added ``fintoc`` structure type for parsing financial prospectuses according to the `FinTOC 2022 Shared task `_ (`FintocStructureExtractor`).
* Fixed small bugs in :class:`~dedoc.readers.ArticleReader`: colspan for tables, keywords, sections numbering, etc.
* Added references to nodes and fixed small bugs in the HTML output representation (``return_format="html"``).
* Removed ``other_fields`` from :class:`~dedoc.data_structures.LineMetadata` and :class:`~dedoc.data_structures.DocumentMetadata`.
* Update ``README.md``.

v2.2 (2024-04-17)
-----------------
Release note: `v2.2 `_

* :class:`~dedoc.readers.PdfTabbyReader` improved: bug fixes, faster partial PDF extraction (with parameter ``pages``).
* Added benchmarks for evaluation of PDF readers performance.
* Added :class:`~dedoc.data_structures.ReferenceAnnotation` class.
* Fixed bug in ``can_read`` method for all readers.
* Added ``article`` structure type for parsing scientific articles using `GROBID `_ (:class:`~dedoc.readers.ArticleReader`, :class:`~dedoc.structure_extractors.ArticleStructureExtractor`).

v2.1.1 (2024-03-21)
-------------------
Release note: `v2.1.1 `_

* Update ``README.md``.
* Update table and time benchmarks.
* Re-label line-classifier datasets (law, tz, diploma, paragraphs datasets).
* Update tasker creators (for the labeling system).
* Fix HTML table parsing.

v2.1 (2024-03-05)
-----------------
Release note: `v2.1 `_

* Custom loggers deleted (the common logger is used for all dedoc classes).
* Do not change the document image if it has a correct orientation (orientation correction function changed).
* Use only :class:`~dedoc.readers.PdfTabbyReader` during detection of a textual layer in PDF files.
* Code related to the labeling mode refactored and removed from the library package (it is located in a separate directory).
* Added :class:`~dedoc.data_structures.BoldAnnotation` for words in :class:`~dedoc.readers.PdfImageReader`.
* More benchmarks are added: parsing of table images, postprocessing of Tesseract OCR.
* Some fixes are made in the web form of Dedoc.
* Added a tutorial on how to add a new structure type to Dedoc (:ref:`add_structure_type`).
* Parsing of EML and HTML files fixed.

v2.0 (2023-12-25)
-----------------
Release note: `v2.0 `_

* Fix table extraction from PDF using empty config (see `issue `_).
* Add more benchmarks for Tesseract.
* Fix extension extraction for file names with several dots.
* Change names of some methods and their parameters for all main classes (attachments extractors, converters, readers, metadata extractors, structure extractors, structure constructors). Please see the `Package reference` of `documentation `_ for more details.
* Add :class:`~dedoc.data_structures.AttachAnnotation` and :class:`~dedoc.data_structures.TableAnnotation` to PPTX (see `discussion `_).
* Fix bugs in DOCX handling (see issues `378 `_, `379 `_).

v1.1.1 (2023-11-24)
-------------------
Release note: `v1.1.1 `_

* Use older ``pydantic`` version for improving compatibility with other libraries.
* Add support for RTF format (:class:`~dedoc.converters.DocxConverter`).
* Fix bug in handling file names with dots and spaces.
* Fix bug in non-integer values of text formatting in :class:`~dedoc.readers.DocxReader`.
* Add support of ``on_gpu`` parameter in ``config``.
* Add attached images extraction for :class:`~dedoc.readers.PdfTabbyReader`.
* Fix partial file reading for :class:`~dedoc.readers.PdfTabbyReader`.
* Add a tutorial on how to create dedoc's basic data structures.
* Fix ``attachments_dir`` parameter for readers and attachments extractors.

v1.1.0 (2023-10-24)
-------------------
Release note: `v1.1.0 `_

* Add :class:`~dedoc.data_structures.BBoxAnnotation` to table cells for :class:`~dedoc.readers.PdfTabbyReader`.
* Fix swagger, add api schema classes, remove ``to_dict`` method from :class:`~dedoc.data_structures.ParsedDocument`.
* Improve parsing PDF by :class:`~dedoc.readers.PdfTxtlayerReader`, add benchmarks.
* Fix :class:`~dedoc.data_structures.BBoxAnnotation` extraction for tables in :class:`~dedoc.readers.PdfImageReader` using ``table_type=split_last_column`` parameter.
* Change base method of metadata extractors, rename it to ``extract_metadata``.
* Unify :class:`~dedoc.data_structures.BBoxAnnotation` extraction for all PDF readers - return only words bboxes.
* Increase timeout value for all converters.

v1.0 (2023-10-10)
-----------------
Release note: `v1.0 `_

* Remove ``is_one_column_document_list`` parameter.
* Add tutorial about support for a new document type to the documentation.
* Improve textual layer correctness classifier.
* Improve orientation and columns classifier.
* Change table's output structure - added `CellWithMeta` instead of a textual string.
* Add :class:`~dedoc.data_structures.BBoxAnnotation` to table cells for :class:`~dedoc.readers.PdfTxtlayerReader` and :class:`~dedoc.readers.PdfImageReader`.
* Add :class:`~dedoc.data_structures.ConfidenceAnnotation` to table cells for :class:`~dedoc.readers.PdfImageReader`.
* Remove ``insert_table`` parameter.
* Added information about table and page rotation to the table and document metadata respectively.
* Use `dedoc-utils `_ library for document images preprocessing.
* Change web interface, fix online examples of document processing.
* Add comparison operator to :class:`~dedoc.data_structures.LineWithMeta`.

v0.11.2 (2023-09-06)
--------------------
Release note: `v0.11.2 `_

* Remove plexus-utils-1.1.jar.
* Update installation documentation.
* Add documentation for Tesseract OCR installation.
* Add documentation for annotations.
* Add documentation for secure torch.
* Fix examples.

v0.11.1 (2023-08-30)
--------------------
Release note: `v0.11.1 `_

* Add bbox annotations in :class:`~dedoc.readers.PdfTabbyReader`.
* Add bbox annotations for words in :class:`~dedoc.readers.PdfTxtlayerReader`.
* Add an option ``plain_text`` to the ``return_format`` parameter.
* Reduce size of the dedoc base image, move dockerfiles to the `separate repository `_.
* Refactor script for tesseract benchmarking.
* Specify dedoc dependencies as version ranges instead of fixed versions.
* Add table cell properties in :class:`~dedoc.readers.PdfTabbyReader`.

v0.11.0 (2023-08-22)
--------------------
Release note: `v0.11.0 `_

* Rename exception classes.
* Update style tests.
* Change :class:`~dedoc.data_structures.ConfidenceAnnotation` value range to ``[0, 1]``.
* Add bbox annotations for words in :class:`~dedoc.readers.PdfImageReader`.

v0.10.0 (2023-08-01)
--------------------
Release note: `v0.10.0 `_

* Add :class:`~dedoc.data_structures.ConfidenceAnnotation` for :class:`~dedoc.readers.PdfImageReader`.
* Remove version parameter from metadata extractors, structure constructors and parsed document methods.
* Add version file and version resolving for the library.
* Add recursive handling of attachments.
* Add parameter for saving attachments in a custom directory.
* Remove dedoc threaded manager.
* Improve :class:`~dedoc.readers.PdfAutoReader`.
* Add temporary file name to :class:`~dedoc.data_structures.DocumentMetadata`.

v0.9.2 (2023-07-18)
-------------------
Release note: `v0.9.2 `_

* Fix bug for diplomas with ``insert_table=true``.
* Fix logging in PDF slicing.
* Make :class:`~dedoc.readers.PdfAutoReader` faster.
* Update bold classifier.
* Refactor tests.
* Fix bug in models downloading inside docker container.

v0.9.1 (2023-07-05)
-------------------
Release note: `v0.9.1 `_

* Fixed bug with :class:`~dedoc.data_structures.AttachAnnotation` in docx: its value is equal to the attachment uid instead of the file name.

v0.9 (2023-06-26)
-----------------
Release note: `v0.9 `_

* Published the first version of the dedoc library.