Types of textual lines

Each reader returns an UnstructuredDocument with textual lines. Readers don’t fill the hierarchy_level metadata field (structure extractors do this), but they can fill hierarchy_level_tag with information about line types. The table below lists the readers that can return a non-empty hierarchy_level_tag in the metadata of document lines, where:

  • + means that the reader can return lines of this type.

  • - means that the reader doesn’t return lines of this type, either because the task is too complex or because the format doesn’t provide enough information.

Line types returned by each reader

Reader                                 header   list_item   raw_text, unknown   key
DocxReader                             +        +           +                   -
HtmlReader, MhtmlReader, EmailReader   +        +           +                   -
RawTextReader                          -        +           +                   -
JsonReader                             -        +           +                   +
PdfImageReader                         -        +           +                   -
PdfTabbyReader                         +        +           +                   -
PdfTxtlayerReader                      -        +           +                   -
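
The table above can also be written as a small lookup structure, which may be handy when choosing a reader for a particular line type. The sketch below is only an illustration: READER_LINE_TYPES and readers_returning are hypothetical names that are not part of the library, and it assumes that the combined "raw_text, unknown" column means both types can be returned.

```python
from typing import Dict, List, Set

# Hypothetical transcription of the table above; not part of the library API.
# Assumption: the "raw_text, unknown" column covers both line types.
ALL_LINE_TYPES = ("header", "list_item", "raw_text", "unknown", "key")

READER_LINE_TYPES: Dict[str, Set[str]] = {
    "DocxReader": {"header", "list_item", "raw_text", "unknown"},
    "HtmlReader": {"header", "list_item", "raw_text", "unknown"},
    "MhtmlReader": {"header", "list_item", "raw_text", "unknown"},
    "EmailReader": {"header", "list_item", "raw_text", "unknown"},
    "RawTextReader": {"list_item", "raw_text", "unknown"},
    "JsonReader": {"list_item", "raw_text", "unknown", "key"},
    "PdfImageReader": {"list_item", "raw_text", "unknown"},
    "PdfTabbyReader": {"header", "list_item", "raw_text", "unknown"},
    "PdfTxtlayerReader": {"list_item", "raw_text", "unknown"},
}


def readers_returning(line_type: str) -> List[str]:
    """Return the names of readers marked "+" for the given line type, sorted alphabetically."""
    if line_type not in ALL_LINE_TYPES:
        raise ValueError(f"Unknown line type: {line_type!r}")
    return sorted(
        name for name, types in READER_LINE_TYPES.items() if line_type in types
    )


if __name__ == "__main__":
    print(readers_returning("key"))     # ['JsonReader']
    print(readers_returning("header"))
```

Keeping the mapping as plain data means it can be checked against the table row by row; it does not inspect the readers themselves.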