Changelog

v1.1.1 (2023-11-24)

Release note: v1.1.1

  • Use an older pydantic version to improve compatibility with other libraries.

  • Add support for RTF format.

  • Fix bug in handling file names that contain dots and spaces.

  • Fix bug in handling non-integer text formatting values in DocxReader.

  • Add support for the on_gpu parameter in config.

  • Add attached images extraction for PdfTabbyReader.

  • Fix partial file reading for PdfTabbyReader.

  • Add tutorial on how to create dedoc’s basic data structures.

  • Fix attachments_dir parameter for readers and attachments extractors.
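
The file-name fix in this release concerns names such as "annual report v1.2.final.DOCX". As an illustration only, here is a minimal sketch of splitting such a name into base name and extension with the standard library. The helper name split_name is hypothetical and this is not dedoc's own code.

```python
from pathlib import Path
from typing import Tuple


def split_name(file_name: str) -> Tuple[str, str]:
    """Split a file name into (base name, lower-cased extension).

    Only the last dot separates the extension, so dots and spaces
    inside the base name are preserved.
    """
    path = Path(file_name)
    return path.stem, path.suffix.lower()


base, extension = split_name("annual report v1.2.final.DOCX")
```

Splitting on the first dot instead would truncate the base name to "annual report v1", which is the kind of error such names tend to trigger.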

v1.1.0 (2023-10-24)

Release note: v1.1.0

  • Add BBoxAnnotation to table cells for PdfTabbyReader.

  • Fix Swagger, add API schema classes, remove to_dict method from ParsedDocument.

  • Improve PDF parsing by PdfTxtlayerReader, add benchmarks.

  • Fix BBoxAnnotation extraction for tables in PdfImageReader using table_type=split_last_column parameter.

  • Change base method of metadata extractors, rename it to extract_metadata.

  • Unify BBoxAnnotation extraction for all PDF readers: return only word bboxes.

  • Increase timeout value for all converters.

v1.0 (2023-10-10)

Release note: v1.0

  • Remove is_one_column_document_list parameter.

  • Add tutorial on adding support for a new document type to the documentation.

  • Improve textual layer correctness classifier.

  • Improve orientation and columns classifier.

  • Change table output structure: use CellWithMeta instead of a text string.

  • Add BBoxAnnotation to table cells for PdfTxtlayerReader and PdfImageReader.

  • Add ConfidenceAnnotation to table cells for PdfImageReader.

  • Remove insert_table parameter.

  • Add information about table and page rotation to the table and document metadata, respectively.

  • Use dedoc-utils library for document images preprocessing.

  • Change web interface, fix online examples of document processing.

  • Add comparison operator to LineWithMeta.
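
The change from text strings to CellWithMeta in table output affects any code that treated a cell as a plain string. The sketch below shows one way client code might tolerate both layouts during a migration. SketchCell and cell_text are hypothetical stand-ins for illustration and do not reproduce dedoc's actual CellWithMeta class or its fields.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class SketchCell:
    """Hypothetical cell that carries metadata next to its text (not dedoc's class)."""
    text: str
    colspan: int = 1
    rowspan: int = 1
    annotations: List[str] = field(default_factory=list)


def cell_text(cell: Union[str, SketchCell]) -> str:
    """Return the text of a cell in either the old (str) or the new (object) layout."""
    return cell if isinstance(cell, str) else cell.text


old_row = ["Name", "Price"]                                     # cells as plain strings
new_row = [SketchCell("Name"), SketchCell("Price", colspan=2)]  # cells as objects
row_texts = [cell_text(cell) for cell in new_row]
```

Wrapping cells in an object leaves room for per-cell data such as spans and annotations, which a bare string cannot carry.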

v0.11.2 (2023-09-06)

Release note: v0.11.2

  • Remove plexus-utils-1.1.jar.

  • Update installation documentation.

  • Add documentation for Tesseract OCR installation.

  • Add documentation for annotations.

  • Add documentation for secure torch.

  • Fix examples.

v0.11.1 (2023-08-30)

Release note: v0.11.1

  • Add bbox annotations in PdfTabbyReader.

  • Add bbox annotations for words in PdfTxtlayerReader.

  • Add an option plain_text to the return_format parameter.

  • Reduce size of the dedoc base image, move Dockerfiles to a separate repository.

  • Refactor script for Tesseract benchmarking.

  • Specify dedoc dependencies as version ranges instead of fixed versions.

  • Add table cell properties in PdfTabbyReader.

v0.11.0 (2023-08-22)

Release note: v0.11.0

  • Rename exception classes.

  • Update style tests.

  • Change ConfidenceAnnotation value range to [0, 1].

  • Add bbox annotations for words in PdfImageReader.
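
The ConfidenceAnnotation range change means that code written against earlier values may need rescaling. This changelog does not state the previous range. The sketch below assumes, purely for illustration, that older values were percentages in [0, 100], as OCR engines commonly report. The function name to_unit_interval is hypothetical.

```python
def to_unit_interval(confidence_percent: float) -> float:
    """Rescale a percentage confidence to [0, 1], clamping out-of-range input.

    Assumption: the input is a percentage in [0, 100]; this is not
    confirmed by the changelog.
    """
    scaled = confidence_percent / 100.0
    return min(max(scaled, 0.0), 1.0)


half = to_unit_interval(50)
```

Clamping keeps the result inside the documented range even if an upstream value is slightly out of bounds.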

v0.10.0 (2023-08-01)

Release note: v0.10.0

  • Add ConfidenceAnnotation for PdfImageReader.

  • Remove version parameter from metadata extractors, structure constructors and parsed document methods.

  • Add version file and version resolving for the library.

  • Add recursive handling of attachments.

  • Add parameter for saving attachments in a custom directory.

  • Remove dedoc threaded manager.

  • Improve PdfAutoReader.

  • Add temporary file name to DocumentMetadata.
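
Recursive handling of attachments means that an attachment can itself contain attachments, which are processed in turn. As a rough sketch of the idea, the following walks a nested structure depth-first. The dictionary layout with "name" and "attachments" keys is a hypothetical stand-in, not dedoc's output format.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_attachments(document: Dict[str, Any], depth: int = 0) -> Iterator[Tuple[int, str]]:
    """Yield (nesting depth, name) for every attachment, depth-first."""
    for attachment in document.get("attachments", []):
        yield depth + 1, attachment["name"]
        # An attachment may carry attachments of its own.
        yield from iter_attachments(attachment, depth + 1)


# Hypothetical nested document: an archive holding a docx with an embedded image.
document = {
    "name": "archive.zip",
    "attachments": [
        {"name": "report.docx", "attachments": [{"name": "image.png"}]},
        {"name": "notes.txt"},
    ],
}
found = list(iter_attachments(document))
```

A real implementation would usually also cap the recursion depth so that deeply nested archives cannot exhaust resources.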

v0.9.2 (2023-07-18)

Release note: v0.9.2

  • Fix bug for diplomas with insert_table=true.

  • Fix logging in PDF slicing.

  • Make PdfAutoReader faster.

  • Update bold classifier.

  • Refactor tests.

  • Fix bug in model downloading inside a Docker container.

v0.9.1 (2023-07-05)

Release note: v0.9.1

  • Fix bug with AttachAnnotation in docx: its value is now equal to the attachment uid instead of the file name.

v0.9 (2023-06-26)

Release note: v0.9

  • Publish the first version of the dedoc library.