Types of textual lines

Each reader returns an UnstructuredDocument containing textual lines. Readers don’t fill the hierarchy_level metadata field (structure extractors do this), but they can fill tag_hierarchy_level with information about line types. The readers listed below can return a non-empty tag_hierarchy_level in the metadata of document lines:

  • + means that the reader can return lines of this type.

  • - means that the reader doesn’t return lines of this type, either because the task is too complex or because the format doesn’t provide enough information.

Line types returned by each reader

Reader                                  header   list_item   unknown   key
DocxReader                              +        +           +         -
PptxReader                              +        +           +         -
HtmlReader, MhtmlReader, EmailReader    +        +           +         -
RawTextReader                           -        -           +         -
JsonReader                              -        +           +         +
PdfImageReader                          -        -           +         -
PdfTabbyReader                          +        +           +         -
PdfTxtlayerReader                       -        -           +         -
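
The table can also be used programmatically, for example to check which readers are able to return a given line type, or to tally the line types found in a parsed document. The sketch below is illustrative only and is not part of the library: READER_LINE_TYPES is a hand-written transcription of the table, and readers_returning, count_tag_line_types and _fake_line are hypothetical helpers. It assumes that each line exposes metadata.tag_hierarchy_level and that this object has a line_type attribute; stand-in objects are used instead of real reader output so that the example runs on its own.

```python
from collections import Counter
from types import SimpleNamespace

LINE_TYPES = ("header", "list_item", "unknown", "key")

# Hand-written transcription of the table: reader name -> line types it can return.
READER_LINE_TYPES = {
    "DocxReader": {"header", "list_item", "unknown"},
    "PptxReader": {"header", "list_item", "unknown"},
    "HtmlReader": {"header", "list_item", "unknown"},
    "MhtmlReader": {"header", "list_item", "unknown"},
    "EmailReader": {"header", "list_item", "unknown"},
    "RawTextReader": {"unknown"},
    "JsonReader": {"list_item", "unknown", "key"},
    "PdfImageReader": {"unknown"},
    "PdfTabbyReader": {"header", "list_item", "unknown"},
    "PdfTxtlayerReader": {"unknown"},
}


def readers_returning(line_type):
    """Return the sorted names of readers marked '+' for the given line type."""
    if line_type not in LINE_TYPES:
        raise ValueError(f"unknown line type: {line_type!r}")
    return sorted(name for name, types in READER_LINE_TYPES.items() if line_type in types)


def count_tag_line_types(lines):
    """Count line types stored in tag_hierarchy_level, skipping lines where it is empty.

    Assumption: line.metadata.tag_hierarchy_level.line_type holds the type name.
    """
    counts = Counter()
    for line in lines:
        tag = getattr(line.metadata, "tag_hierarchy_level", None)
        if tag is not None:
            counts[tag.line_type] += 1
    return counts


def _fake_line(line_type):
    """Build a stand-in for a document line (hypothetical, for demonstration only)."""
    tag = None if line_type is None else SimpleNamespace(line_type=line_type)
    return SimpleNamespace(metadata=SimpleNamespace(tag_hierarchy_level=tag))


sample_lines = [
    _fake_line("header"),
    _fake_line("list_item"),
    _fake_line("list_item"),
    _fake_line(None),  # a line whose tag_hierarchy_level was not filled
]

print(readers_returning("key"))
print(count_tag_line_types(sample_lines))
```

With real reader output, the same tally function could be applied to the lines of the returned UnstructuredDocument, provided the attribute names assumed here match the installed version of the library.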