Article structure type (GROBID)

This structure type is used for scientific article analysis with the GROBID system.

Note

If you use dedoc as a library or as a separate Docker image (without docker-compose) and want to use this structure extractor, you should run the GROBID service via Docker (or see the GROBID running instructions):
docker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:0.8.0
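
Before parsing, it can help to confirm that the service started by the command above is reachable. The sketch below assumes GROBID's health-check endpoint `/api/isalive` and the `8070` port mapping shown above; the helper names `alive_url` and `grobid_is_alive` are illustrative and are not part of dedoc.

```python
import urllib.error
import urllib.request


def alive_url(base_url: str) -> str:
    """Build the health-check URL (assumes GROBID's /api/isalive endpoint)."""
    return base_url.rstrip("/") + "/api/isalive"


def grobid_is_alive(base_url: str = "http://localhost:8070", timeout: float = 5.0) -> bool:
    """Return True if the service answers the health check with HTTP 200."""
    try:
        with urllib.request.urlopen(alive_url(base_url), timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, HTTP error, etc.
        return False
```

Calling `grobid_is_alive()` before submitting documents gives an early, readable failure instead of an error in the middle of parsing.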

We analyze the recognition results from GROBID. The following types of objects are included in the resulting tree:

article’s title;

authors with their affiliations to organizations and emails;

article’s section headers (for example, Abstract, Introduction, ..., Conclusion);

tables and their content;

bibliography;

references to tables and bibliography items.

The article structure type has the following line types:

root;

author (includes author_first_name, author_surname, email);

keywords (includes keyword);

author_affiliation (includes org_name, address);

abstract;

section;

bibliography;

bibliography_item (includes [title | title_journal | title_series | title_conference_proceedings], author, biblScope_volume, biblScope_pages, DOI, publisher, date);

raw_text.

You can see an example of a document of this structure type. This page provides examples of article analysis.

Below is a description of nodes in the output tree:

root: node containing the text of the article title.
There is only one root node in any document. It is obligatory for any document of article type. All other document lines are children of the root node. We take the title’s text from the <title> tag at the following path in GROBID’s TEI-XML output:
<fileDesc>
    <titleStmt>
        <title> Title's text </title>   // -> node.paragraph_type="root"
    </titleStmt>
</fileDesc>
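
The title lookup described above can be sketched with the standard library. This is a minimal illustration, not dedoc's own implementation: it assumes the TEI namespace `http://www.tei-c.org/ns/1.0`, and the sample document (including the `teiHeader` wrapper and the `title` attributes) and the function name `extract_title` are made up for the example.

```python
import xml.etree.ElementTree as ET

# Assumption: the TEI-XML uses the standard TEI namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, heavily trimmed TEI document used only for this demo.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title level="a" type="main">Title's text</title>
      </titleStmt>
    </fileDesc>
  </teiHeader>
</TEI>"""


def extract_title(tei_xml: str) -> str:
    """Return the text of fileDesc/titleStmt/title, or "" if it is absent."""
    root = ET.fromstring(tei_xml)
    title = root.find(".//tei:fileDesc/tei:titleStmt/tei:title", TEI_NS)
    if title is None:
        return ""
    # itertext() also collects text inside nested inline tags, if any.
    return "".join(title.itertext()).strip()


print(extract_title(SAMPLE_TEI))
```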
author: information about an author of the article.
author nodes are children of the root node. This type of node has subnodes:
- author_first_name - <forename> tag in GROBID’s output. The node doesn’t have child nodes.
- author_surname - <surname> tag in GROBID’s output. The node doesn’t have child nodes.
- email - author’s email, <email> tag in GROBID’s output. The node doesn’t have child nodes.
- author_affiliation - author affiliation description.
GROBID’s TEI-XML <author> name information:
<author>    // -> node.paragraph_type="author"
    <persName>
        <forename type="first">Sonia</forename>     // -> node.paragraph_type="author_first_name"
        <surname>Belaïd</surname>       // -> node.paragraph_type="author_surname"
        <email></email>     // -> node.paragraph_type="email"
    </persName>
    ...
</author>

author_affiliation: Author’s affiliation description.

author_affiliation nodes are children of the node author. This type of node has subnodes.

org_name - organization description, <orgName> tag in GROBID’s output. The node doesn’t have child nodes.
address - organization address, <address> tag in GROBID’s output. The node doesn’t have child nodes.

GROBID’s TEI-XML <author><affiliation> information corresponding to the affiliation description:

<author>    // -> node.paragraph_type="author"
...
<affiliation key="aff2">        // -> node.paragraph_type="author_affiliation"
    <orgName type="department">ICTEAM/ELEN/Crypto Group</orgName>       // -> node.paragraph_type="org_name"
    <orgName type="institution">Université catholique de Louvain</orgName>
    <address>
        <country key="BE">Belgium</country>
    </address>
</affiliation>

The result of parsing the second author of the article:

        {
          "node_id": "0.1",
          "text": "",
          "annotations": [],
          "metadata": {
            "paragraph_type": "author",
            "page_id": 0,
            "line_id": 0
          },
          "subparagraphs": [
            {
              "node_id": "0.1.0",
              "text": "Vincent",
              "annotations": [],
              "metadata": {
                "paragraph_type": "author_first_name",
                "page_id": 0,
                "line_id": 0
              },
              "subparagraphs": []
            },
            {
              "node_id": "0.1.1",
              "text": "Grosso",
              "annotations": [],
              "metadata": {
                "paragraph_type": "author_surname",
                "page_id": 0,
                "line_id": 0
              },
              "subparagraphs": []
            },
            {
              "node_id": "0.1.2",
              "text": "aff2",
              "annotations": [],
              "metadata": {
                "paragraph_type": "author_affiliation",
                "page_id": 0,
                "line_id": 0
              },
              "subparagraphs": [
                {
                  "node_id": "0.1.2.0",
                  "text": "ICTEAM/ELEN/Crypto Group",
                  "annotations": [],
                  "metadata": {
                    "paragraph_type": "org_name",
                    "page_id": 0,
                    "line_id": 0
                  },
                  "subparagraphs": []
                },
                {
                  "node_id": "0.1.2.1",
                  "text": "Belgium",
                  "annotations": [],
                  "metadata": {
                    "paragraph_type": "address",
                    "page_id": 0,
                    "line_id": 0
                  },
                  "subparagraphs": []
                }
              ]
            }
          ]
        },
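
An author node like the one above can be flattened into a plain record by reading its subnodes by `paragraph_type`. The following is a minimal sketch that operates on the JSON output loaded as Python dictionaries; the helper names (`make_node`, `children_of_type`, `first_text`, `summarize_author`) are hypothetical, and the sample node is trimmed to the fields the code reads.

```python
from typing import List, Optional


def make_node(paragraph_type: str, text: str = "", children: Optional[list] = None) -> dict:
    """Build a trimmed node for the demo (real nodes carry more fields)."""
    return {"text": text, "metadata": {"paragraph_type": paragraph_type}, "subparagraphs": children or []}


def children_of_type(node: dict, paragraph_type: str) -> List[dict]:
    """Return the direct children of `node` that have the given paragraph_type."""
    return [child for child in node.get("subparagraphs", [])
            if child["metadata"]["paragraph_type"] == paragraph_type]


def first_text(node: dict, paragraph_type: str) -> str:
    """Text of the first child of the given type, or "" if there is none."""
    found = children_of_type(node, paragraph_type)
    return found[0]["text"] if found else ""


def summarize_author(author: dict) -> dict:
    """Flatten an author node into a plain dictionary."""
    affiliations = []
    for affiliation in children_of_type(author, "author_affiliation"):
        affiliations.append({
            "org_names": [org["text"] for org in children_of_type(affiliation, "org_name")],
            "address": first_text(affiliation, "address"),
        })
    return {
        "first_name": first_text(author, "author_first_name"),
        "surname": first_text(author, "author_surname"),
        "email": first_text(author, "email"),
        "affiliations": affiliations,
    }


sample_author = make_node("author", children=[
    make_node("author_first_name", "Vincent"),
    make_node("author_surname", "Grosso"),
    make_node("author_affiliation", "aff2", [
        make_node("org_name", "ICTEAM/ELEN/Crypto Group"),
        make_node("address", "Belgium"),
    ]),
])
summary = summarize_author(sample_author)
print(summary)
```

Missing subnodes (here, the email) come back as empty strings, so callers do not need to special-case authors with incomplete metadata.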

keywords node (if it exists) is a child node of the root node.

keywords node contains keyword nodes as children. Each keyword node contains the text of one keyword.
abstract is the article’s abstract section (<abstract> tag in GROBID’s output).
section: nodes of article sections (for example “Introduction”, “Conclusion”, “V Experiments …” etc.). This type of node has a subnode raw_text.

section nodes are children of the root node and may be nested (e.g., section “2.1. Datasets” is nested in the section “2. Related work”).
bibliography is the article’s bibliography list, which contains only bibliography_item nodes.

bibliography_item is the article’s bibliography item description.

bibliography_item nodes are children of the node bibliography. This type of node has subnodes.

title or title_journal or title_series or title_conference_proceedings - name of the bibliography item. The node doesn’t have child nodes.
author - bibliography author name, <author> tag in GROBID’s output. The node doesn’t have child nodes.
biblScope_volume - volume number, <biblScope unit="volume">4</biblScope> tag in GROBID’s output. The node doesn’t have child nodes.
biblScope_pages - page range, <biblScope unit="page" from="471" to="488" /> tag in GROBID’s output. The node doesn’t have child nodes.
DOI - DOI of the bibliography item, <idno> tag in GROBID’s output. The node doesn’t have child nodes.
publisher - publisher name, <publisher> tag in GROBID’s output. The node doesn’t have child nodes.
date - publication date, <date> tag in GROBID’s output. The node doesn’t have child nodes.

GROBID’s description of a TEI-XML bibliography item is available here. We parse GROBID’s biblStruct and create a bibliography_item node. Example of GROBID’s biblStruct:

<listBibl>
    <biblStruct xml:id="b0">
        <analytic>
            <title level="a" type="main">Leakage-resilient symmetric encryption via re-keying</title>
            <author>
                <persName><forename type="first">Michel</forename><surname>Abdalla</surname></persName>
            </author>
            <author>
                <persName><forename type="first">Sonia</forename><surname>Belaïd</surname></persName>
            </author>
            <author>
                <persName><forename type="first">Pierre-Alain</forename><surname>Fouque</surname></persName>
            </author>
        </analytic>
        <monogr>
            <title level="m">Bertoni and Coron</title>
            <imprint>
                <biblScope unit="volume">4</biblScope>
                <biblScope unit="page" from="471" to="488" />
            </imprint>
        </monogr>
    </biblStruct>
    <biblStruct xml:id="b1">

We set the paragraph_type of the title according to the level attribute of the <title> tag in GROBID (see the title level’s description):

For <title level="a">, set paragraph_type="title" for an article title or chapter title (but not thesis, see below). Here “a” stands for analytic (a part of a monograph).
For <title level="j">, set paragraph_type="title_journal" for a journal title.
For <title level="s">, set paragraph_type="title_series" for a series title (e.g. “Lecture Notes in Computer Science”).
For <title level="m">, set paragraph_type="title_conference_proceedings" for a non-journal bibliographical item holding the cited article, e.g. a conference proceedings title. Note that if a book is cited, the title of the book is annotated with <title level="m">.
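
The mapping above amounts to a lookup table keyed by the `level` attribute. The sketch below applies it to a trimmed copy of the biblStruct example; it is an illustration rather than dedoc's code. The sample omits the TEI namespace for brevity (real GROBID output is namespaced), skipping titles with a missing or unknown level is an assumption, and `typed_titles` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# <title level="..."> -> paragraph_type, as listed in the rules above.
TITLE_LEVEL_TO_TYPE = {
    "a": "title",
    "j": "title_journal",
    "s": "title_series",
    "m": "title_conference_proceedings",
}

# Trimmed, namespace-free version of the biblStruct example.
SAMPLE_BIBL_STRUCT = """<biblStruct>
    <analytic>
        <title level="a" type="main">Leakage-resilient symmetric encryption via re-keying</title>
    </analytic>
    <monogr>
        <title level="m">Bertoni and Coron</title>
    </monogr>
</biblStruct>"""


def typed_titles(bibl_struct_xml: str) -> List[Tuple[str, str]]:
    """Return (paragraph_type, text) pairs for every recognized <title>."""
    root = ET.fromstring(bibl_struct_xml)
    result = []
    for title in root.iter("title"):
        paragraph_type = TITLE_LEVEL_TO_TYPE.get(title.get("level"))
        if paragraph_type is None:
            continue  # assumption: skip titles with a missing or unknown level
        result.append((paragraph_type, "".join(title.itertext()).strip()))
    return result


titles = typed_titles(SAMPLE_BIBL_STRUCT)
print(titles)
```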

We present a bibliography item as a node with the fields paragraph_type="bibliography_item" and a unique id uid="uuid". All bibliography_item nodes are children of the bibliography node. An example of a parsed bibliography item in dedoc:

            {
              "node_id": "0.12.5",
              "text": "",
              "annotations": [],
              "metadata": {
                "paragraph_type": "bibliography_item",
                "page_id": 0,
                "line_id": 0,
                "uid": "10982954-0872-11ef-b95c-0242ac120002"
              },
              "subparagraphs": [
                {
                  "node_id": "0.12.5.0",
                  "text": "Template attacks",
                  "annotations": [],
                  "metadata": {
                    "paragraph_type": "title",
                    "page_id": 0,
                    "line_id": 0
                  },
                  "subparagraphs": []
                },
                {
                  "node_id": "0.12.5.1",
                  "text": "CHES",
                  "annotations": [],
                  "metadata": {
                    "paragraph_type": "title_conference_proceedings",
                    "page_id": 0,
                    "line_id": 0
                  },
                  "subparagraphs": []
                },
                {
                  "node_id": "0.12.5.2",
                  "text": "Lecture Notes in Computer Science",
                  "annotations": [],
                  "metadata": {
                    "paragraph_type": "title_series",
                    "page_id": 0,
                    "line_id": 0
                  },
                  "subparagraphs": []
                },
                {
                  "node_id": "0.12.5.3",
                  "text": "Suresh Chari",
                  "annotations": [],
                  "metadata": {
                    "paragraph_type": "author",
                    "page_id": 0,
                    "line_id": 0
                  },
                  "subparagraphs": []
                },
                {
                  "node_id": "0.12.5.4",
                  "text": "Josyula R Rao",
                  "annotations": [],
                  "metadata": {
                    "paragraph_type": "author",
                    "page_id": 0,
                    "line_id": 0
                  },
                  "subparagraphs": []
                },
                {
                  "node_id": "0.12.5.5",
                  "text": "Pankaj Rohatgi",
                  "annotations": [],
                  "metadata": {
                    "paragraph_type": "author",
                    "page_id": 0,
                    "line_id": 0
                  },
                  "subparagraphs": []
                },
                {
                  "node_id": "0.12.5.6",
                  "text": "2523",
                  "annotations": [],
                  "metadata": {
                    "paragraph_type": "biblScope_volume",
                    "page_id": 0,
                    "line_id": 0
                  },
                  "subparagraphs": []
                },
                {
                  "node_id": "0.12.5.7",
                  "text": "13-28",
                  "annotations": [],
                  "metadata": {
                    "paragraph_type": "biblScope_page",
                    "page_id": 0,
                    "line_id": 0
                  },
                  "subparagraphs": []
                },
                {
                  "node_id": "0.12.5.8",
                  "text": "Springer",
                  "annotations": [],
                  "metadata": {
                    "paragraph_type": "publisher",
                    "page_id": 0,
                    "line_id": 0
                  },
                  "subparagraphs": []
                },
                {
                  "node_id": "0.12.5.9",
                  "text": "2002",
                  "annotations": [],
                  "metadata": {
                    "paragraph_type": "date",
                    "page_id": 0,
                    "line_id": 0
                  },
                  "subparagraphs": []
                }
              ]
            },
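
Because a bibliography_item can hold several subnodes of the same type (for example, three author nodes above), it is convenient to group subnode texts by `paragraph_type`. A minimal sketch, assuming the JSON output is loaded into Python dictionaries; `group_fields` is a hypothetical helper and the sample item is trimmed to the fields the code reads.

```python
from collections import defaultdict
from typing import Dict, List


def group_fields(item: dict) -> Dict[str, List[str]]:
    """Group the texts of an item's direct subnodes by their paragraph_type."""
    fields = defaultdict(list)
    for sub in item.get("subparagraphs", []):
        fields[sub["metadata"]["paragraph_type"]].append(sub["text"])
    return dict(fields)


def leaf(paragraph_type: str, text: str) -> dict:
    """Build a trimmed leaf node for the demo."""
    return {"text": text, "metadata": {"paragraph_type": paragraph_type}, "subparagraphs": []}


sample_item = {
    "text": "",
    "metadata": {"paragraph_type": "bibliography_item"},
    "subparagraphs": [
        leaf("title", "Template attacks"),
        leaf("title_conference_proceedings", "CHES"),
        leaf("author", "Suresh Chari"),
        leaf("author", "Josyula R Rao"),
        leaf("author", "Pankaj Rohatgi"),
        leaf("date", "2002"),
    ],
}
fields = group_fields(sample_item)
print(fields["author"])
```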

bibliography references: bibliography references in annotations of the article’s text.

Text can contain references to bibliography_item nodes. For example, “Authors in [5] describe an approach …”. Here “[5]” is the reference. We present a bibliography reference as an annotation with name="reference" whose value is the uid of the referenced bibliography item. See the documentation of the class ReferenceAnnotation for more details.

Example of a bibliography reference in dedoc is given below. There is a textual node with two bibliography references (with two annotations):

            {
              "node_id": "0.10.0",
              "text": "The results in this work essentially show that masking and leakage-resilient constructions hardly combine constructively. For (stateful) PRGs, our experiments indicate that both for software and hardware implementations, a leakageresilient design instantiated with an unprotected AES is the most efficient solution to reach any given security level. For stateless PRFs, they rather show that a bounded data complexity guarantee is (mostly) ineffective in bounding the (computational) complexity of the best attacks. So implementing masking and limiting the lifetime of the cryptographic implementation is the best solution in this case. Nevertheless, the chosen-plaintext tweak proposed in [34] is an interesting exception to this conclusion, as it leads to security-bounded hardware implementations for stateless primitives that are particularly interesting from an application point-of-view, e.g. for re-synchronization, challenge-response protocols, . . . Beyond the further analysis of such constructions, their extension to software implementations is an interesting scope for further research. In this respect, the combination of a chosen-plaintext leakage-resilient PRF with the shuffling countermeasure in [62] seems promising, as it could \"emulate\" the keydependent algorithmic noise ensuring security bounds in hardware.           ",
              "annotations": [
                {
                  "start": 690,
                  "end": 694,
                  "name": "reference",
                  "value": "109c08ee-0872-11ef-b95c-0242ac120002"
                },
                {
                  "start": 1214,
                  "end": 1218,
                  "name": "reference",
                  "value": "10a004ee-0872-11ef-b95c-0242ac120002"
                }
              ],
              "metadata": {
                "paragraph_type": "raw_text",
                "page_id": 0,
                "line_id": 0
              },
              "subparagraphs": []
            }

In the example, the annotations reference two bibliography_item nodes:

            {
              "node_id": "0.12.33",
              "text": "",
              "annotations": [],
              "metadata": {
                "paragraph_type": "bibliography_item",
                "page_id": 0,
                "line_id": 0,
                "uid": "109c08ee-0872-11ef-b95c-0242ac120002"
              },

            {
              "node_id": "0.12.61",
              "text": "",
              "annotations": [],
              "metadata": {
                "paragraph_type": "bibliography_item",
                "page_id": 0,
                "line_id": 0,
                "uid": "10a004ee-0872-11ef-b95c-0242ac120002"
              },
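
Resolving a reference therefore means matching the annotation's `value` against the `uid` stored in the metadata of a bibliography_item node. The sketch below shows one way to do this on the JSON output loaded as Python dictionaries. The helper names are hypothetical, the sample tree and its uid are invented for the demo, and `end` is treated as exclusive, which matches the offsets in the example above (690 to 694 covers the four characters of “[34]”).

```python
from typing import Dict, List, Optional, Tuple


def index_by_uid(node: dict, index: Optional[Dict[str, dict]] = None) -> Dict[str, dict]:
    """Walk the tree and map every metadata uid to its node."""
    if index is None:
        index = {}
    uid = node.get("metadata", {}).get("uid")
    if uid is not None:
        index[uid] = node
    for child in node.get("subparagraphs", []):
        index_by_uid(child, index)
    return index


def resolve_references(node: dict, uid_index: Dict[str, dict]) -> List[Tuple[str, Optional[str]]]:
    """Return (referencing text span, node_id of the target) for each reference."""
    resolved = []
    for annotation in node.get("annotations", []):
        if annotation["name"] != "reference":
            continue
        span = node["text"][annotation["start"]:annotation["end"]]
        target = uid_index.get(annotation["value"])
        resolved.append((span, target["node_id"] if target else None))
    return resolved


# Hypothetical miniature tree; the uid is made up for the demo.
text_node = {
    "node_id": "0.1.0",
    "text": "Authors in [5] describe an approach",
    "annotations": [{"start": 11, "end": 14, "name": "reference", "value": "uid-5"}],
    "metadata": {"paragraph_type": "raw_text"},
    "subparagraphs": [],
}
bibliography_item = {
    "node_id": "0.2.4",
    "text": "",
    "annotations": [],
    "metadata": {"paragraph_type": "bibliography_item", "uid": "uid-5"},
    "subparagraphs": [],
}
tree = {
    "node_id": "0",
    "text": "Title",
    "annotations": [],
    "metadata": {"paragraph_type": "root"},
    "subparagraphs": [text_node, bibliography_item],
}

references = resolve_references(text_node, index_by_uid(tree))
print(references)
```

Building the uid index once and reusing it for every text node avoids searching the bibliography again for each annotation.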

raw_text: node referring to a simple document line.

It has the lowest importance in the document tree hierarchy, so it is located in the leaves of the tree. It is nested in the node corresponding to the closest preceding line of a more important type.