Article structure type (GROBID)
This structure type is used to analyze scientific articles with the GROBID system.
Note
If you use dedoc as a library or as a separate Docker image (without docker-compose) and want to use this structure extractor, you should run the GROBID service yourself, for example via Docker (or see the GROBID running instructions):
docker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:0.8.0
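Before sending documents for analysis, it can help to confirm that the container is reachable. The sketch below is a minimal check, assuming GROBID listens on localhost:8070 as in the command above and answers its health-check endpoint /api/isalive. The helper names grobid_isalive_url and is_grobid_alive are illustrative and are not part of dedoc.

```python
import urllib.error
import urllib.request


def grobid_isalive_url(host: str = "localhost", port: int = 8070) -> str:
    """Build the URL of GROBID's health-check endpoint (host and port are assumptions)."""
    return f"http://{host}:{port}/api/isalive"


def is_grobid_alive(url: str, timeout: float = 3.0) -> bool:
    """Return True if the service answers the health check with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure or HTTP error: treat as "not running".
        return False
```

Calling is_grobid_alive(grobid_isalive_url()) after starting the container should return True once the service has finished loading; a False result usually means the port mapping or the host name differs from the assumption above.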
We analyze the recognition results from GROBID. The following types of objects are included in the resulting tree:
article’s title;
authors with their affiliations to organizations and emails;
article’s section headers (for example, Abstract, Introduction, ..., Conclusion);
tables and their content;
bibliography;
references to tables and bibliography items.
There are the following line types in the article structure type:
root;
author (includes author_first_name, author_surname);
keywords (includes keyword);
author_affiliation (includes org_name, address);
abstract;
section;
bibliography;
bibliography_item (includes [title | title_journal | title_series | title_conference_proceedings], author, biblScope_volume, biblScope_pages, DOI, publisher, date);
raw_text.
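To check which of these line types actually occur in a parsed article, the output tree can be walked recursively. The sketch below assumes the tree is available as a plain dictionary with the fields shown in the examples on this page (metadata.paragraph_type and subparagraphs). The helpers make_node, iter_nodes and count_line_types are hypothetical names, and the sample tree is written by hand rather than produced by the parser.

```python
from collections import Counter
from typing import Any, Dict, Iterator

Node = Dict[str, Any]


def make_node(text: str, paragraph_type: str, children=()) -> Node:
    """Build a minimal node in the shape used on this page (illustrative helper)."""
    return {
        "text": text,
        "annotations": [],
        "metadata": {"paragraph_type": paragraph_type},
        "subparagraphs": list(children),
    }


def iter_nodes(node: Node) -> Iterator[Node]:
    """Yield the node and all of its descendants in depth-first order."""
    yield node
    for child in node.get("subparagraphs", []):
        yield from iter_nodes(child)


def count_line_types(root: Node) -> Counter:
    """Count how many nodes of each paragraph_type the tree contains."""
    return Counter(n["metadata"]["paragraph_type"] for n in iter_nodes(root))


# Hand-written sample tree, not real parser output.
example_tree = make_node("An example title", "root", [
    make_node("", "author", [
        make_node("Vincent", "author_first_name"),
        make_node("Grosso", "author_surname"),
    ]),
    make_node("1. Introduction", "section", [
        make_node("First paragraph.", "raw_text"),
        make_node("Second paragraph.", "raw_text"),
    ]),
])

line_type_counts = count_line_types(example_tree)
print(dict(line_type_counts))
```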
You can see an example of a document of this structure type.
This page provides examples of this article analysis.
Below is a description of nodes in the output tree:
root: node containing the text of the article title.
There is only one root node in any document, and it is mandatory for any document of the article type. All other document lines are descendants of the root node. We take the title’s text from the <title> tag of GROBID’s TEI-XML:

    <fileDesc>
        <titleStmt>
            <title>Title's text</title>  // -> node.paragraph_type="root"
        </titleStmt>
    </fileDesc>

author: information about an author of the article.
author nodes are children of the root node. This type of node has subnodes:
author_first_name - <forename> tag in GROBID’s output. The node doesn’t have child nodes.
author_surname - <surname> tag in GROBID’s output. The node doesn’t have child nodes.
author_affiliation - author affiliation description.
The author’s name information in the <author> tag of GROBID’s TEI-XML:

    <author>  // -> node.paragraph_type="author"
        <persName>
            <forename type="first">Sonia</forename>  // -> node.paragraph_type="author_first_name"
            <surname>Belaïd</surname>  // -> node.paragraph_type="author_surname"
            <email></email>  // -> node.paragraph_type="email"
        </persName>
        ...
    </author>

author_affiliation: the author’s affiliation description.
author_affiliation nodes are children of the author node. This type of node has subnodes:
org_name - organization description, <orgName> tag in GROBID’s output. The node doesn’t have child nodes.
address - organization address, <address> tag in GROBID’s output. The node doesn’t have child nodes.
The affiliation information in the <author><affiliation> tag of GROBID’s TEI-XML:

    <author>  // -> node.paragraph_type="author"
        ...
        <affiliation key="aff2">  // -> node.paragraph_type="author_affiliation"
            <orgName type="department">ICTEAM/ELEN/Crypto Group</orgName>  // -> node.paragraph_type="org_name"
            <orgName type="institution">Université catholique de Louvain</orgName>
            <address>
                <country key="BE">Belgium</country>
            </address>
        </affiliation>
    </author>

The result of parsing the second author of the article:
    {
        "node_id": "0.1",
        "text": "",
        "annotations": [],
        "metadata": { "paragraph_type": "author", "page_id": 0, "line_id": 0 },
        "subparagraphs": [
            {
                "node_id": "0.1.0",
                "text": "Vincent",
                "annotations": [],
                "metadata": { "paragraph_type": "author_first_name", "page_id": 0, "line_id": 0 },
                "subparagraphs": []
            },
            {
                "node_id": "0.1.1",
                "text": "Grosso",
                "annotations": [],
                "metadata": { "paragraph_type": "author_surname", "page_id": 0, "line_id": 0 },
                "subparagraphs": []
            },
            {
                "node_id": "0.1.2",
                "text": "aff2",
                "annotations": [],
                "metadata": { "paragraph_type": "author_affiliation", "page_id": 0, "line_id": 0 },
                "subparagraphs": [
                    {
                        "node_id": "0.1.2.0",
                        "text": "ICTEAM/ELEN/Crypto Group",
                        "annotations": [],
                        "metadata": { "paragraph_type": "org_name", "page_id": 0, "line_id": 0 },
                        "subparagraphs": []
                    },
                    {
                        "node_id": "0.1.2.1",
                        "text": "Belgium",
                        "annotations": [],
                        "metadata": { "paragraph_type": "address", "page_id": 0, "line_id": 0 },
                        "subparagraphs": []
                    }
                ]
            }
        ]
    },

keywords node (if it exists) is a child node of the root node. The keywords node contains keyword nodes as children. Each keyword node contains the text of one keyword.
abstract is the article’s abstract section (<abstract> tag in GROBID’s output).
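The author subtree described above (author, author_first_name, author_surname, author_affiliation, org_name) can be flattened into a compact record. The following is a minimal sketch that assumes the node is a plain dictionary shaped like the parsing result shown earlier. The helpers make_node, children_of_type and summarize_author are illustrative names, not dedoc functions, and the sample node is an abbreviated hand-written copy of that result.

```python
from typing import Any, Dict, List

Node = Dict[str, Any]


def make_node(text: str, paragraph_type: str, children=()) -> Node:
    """Build a minimal node in the shape used on this page (illustrative helper)."""
    return {
        "text": text,
        "annotations": [],
        "metadata": {"paragraph_type": paragraph_type},
        "subparagraphs": list(children),
    }


def children_of_type(node: Node, paragraph_type: str) -> List[Node]:
    """Return the direct children of a node that have the given paragraph_type."""
    return [
        child for child in node.get("subparagraphs", [])
        if child["metadata"]["paragraph_type"] == paragraph_type
    ]


def summarize_author(author: Node) -> Dict[str, Any]:
    """Collect the full name and organization names of one author node."""
    name_parts = (
        children_of_type(author, "author_first_name")
        + children_of_type(author, "author_surname")
    )
    organizations = [
        org["text"]
        for affiliation in children_of_type(author, "author_affiliation")
        for org in children_of_type(affiliation, "org_name")
    ]
    return {
        "name": " ".join(part["text"] for part in name_parts),
        "organizations": organizations,
    }


# Abbreviated copy of the author node from the example above.
author_node = make_node("", "author", [
    make_node("Vincent", "author_first_name"),
    make_node("Grosso", "author_surname"),
    make_node("aff2", "author_affiliation", [
        make_node("ICTEAM/ELEN/Crypto Group", "org_name"),
        make_node("Belgium", "address"),
    ]),
])

author_summary = summarize_author(author_node)
print(author_summary)
```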
section: nodes of article sections (for example, “Introduction”, “Conclusion”, “V Experiments …”). This type of node has a raw_text subnode.
section nodes are children of the root node and may be nested (e.g., section “2.1. Datasets” is nested in section “2. Related work”).
bibliography is the article’s bibliography list, which contains only bibliography_item nodes.
bibliography_item is the article’s bibliography item description.
bibliography_item nodes are children of the bibliography node. This type of node has subnodes:
title or title_journal or title_series or title_conference_proceedings - name of the bibliography item. The node doesn’t have child nodes.
author - bibliography author name, <author> tag in GROBID’s output. The node doesn’t have child nodes.
biblScope_volume - volume number, <biblScope unit="volume">4</biblScope> tag in GROBID’s output. The node doesn’t have child nodes.
biblScope_pages - page range, <biblScope unit="page" from="471" to="488" /> tag in GROBID’s output. The node doesn’t have child nodes.
DOI - DOI of the bibliography item, <idno> tag in GROBID’s output. The node doesn’t have child nodes.
publisher - publisher name, <publisher> tag in GROBID’s output. The node doesn’t have child nodes.
date - publication date, <date> tag in GROBID’s output. The node doesn’t have child nodes.
See the description of the bibliography item information in GROBID’s TEI-XML. We parse GROBID’s biblStruct and create a bibliography_item node. Example of GROBID’s biblStruct:

    <listBibl>
        <biblStruct xml:id="b0">
            <analytic>
                <title level="a" type="main">Leakage-resilient symmetric encryption via re-keying</title>
                <author>
                    <persName><forename type="first">Michel</forename><surname>Abdalla</surname></persName>
                </author>
                <author>
                    <persName><forename type="first">Sonia</forename><surname>Belaïd</surname></persName>
                </author>
                <author>
                    <persName><forename type="first">Pierre-Alain</forename><surname>Fouque</surname></persName>
                </author>
            </analytic>
            <monogr>
                <title level="m">Bertoni and Coron</title>
                <imprint>
                    <biblScope unit="volume">4</biblScope>
                    <biblScope unit="page" from="471" to="488" />
                </imprint>
            </monogr>
        </biblStruct>
        <biblStruct xml:id="b1">
            ...
        </biblStruct>
    </listBibl>

We set the paragraph_type of the title according to the level attribute of the <title> tag in GROBID (see title level’s description):
For <title level="a">, set paragraph_type="title" for an article title or chapter title (but not thesis, see below). Here “a” stands for analytic (a part of a monograph).
For <title level="j">, set paragraph_type="title_journal" for a journal title.
For <title level="s">, set paragraph_type="title_series" for a series title (e.g. “Lecture Notes in Computer Science”).
For <title level="m">, set paragraph_type="title_conference_proceedings" for a non-journal bibliographical item holding the cited article, e.g. a conference proceedings title. Note that if a book is cited, the title of the book is annotated with <title level="m">.
We present a bibliography item as a node with the fields paragraph_type="bibliography_item" and a unique id uid="uuid". All bibliography_item nodes are children of the bibliography node. An example of a parsed bibliography item in dedoc:

    {
        "node_id": "0.12.5",
        "text": "",
        "annotations": [],
        "metadata": { "paragraph_type": "bibliography_item", "page_id": 0, "line_id": 0, "uid": "10982954-0872-11ef-b95c-0242ac120002" },
        "subparagraphs": [
            {
                "node_id": "0.12.5.0",
                "text": "Template attacks",
                "annotations": [],
                "metadata": { "paragraph_type": "title", "page_id": 0, "line_id": 0 },
                "subparagraphs": []
            },
            {
                "node_id": "0.12.5.1",
                "text": "CHES",
                "annotations": [],
                "metadata": { "paragraph_type": "title_conference_proceedings", "page_id": 0, "line_id": 0 },
                "subparagraphs": []
            },
            {
                "node_id": "0.12.5.2",
                "text": "Lecture Notes in Computer Science",
                "annotations": [],
                "metadata": { "paragraph_type": "title_series", "page_id": 0, "line_id": 0 },
                "subparagraphs": []
            },
            {
                "node_id": "0.12.5.3",
                "text": "Suresh Chari",
                "annotations": [],
                "metadata": { "paragraph_type": "author", "page_id": 0, "line_id": 0 },
                "subparagraphs": []
            },
            {
                "node_id": "0.12.5.4",
                "text": "Josyula R Rao",
                "annotations": [],
                "metadata": { "paragraph_type": "author", "page_id": 0, "line_id": 0 },
                "subparagraphs": []
            },
            {
                "node_id": "0.12.5.5",
                "text": "Pankaj Rohatgi",
                "annotations": [],
                "metadata": { "paragraph_type": "author", "page_id": 0, "line_id": 0 },
                "subparagraphs": []
            },
            {
                "node_id": "0.12.5.6",
                "text": "2523",
                "annotations": [],
                "metadata": { "paragraph_type": "biblScope_volume", "page_id": 0, "line_id": 0 },
                "subparagraphs": []
            },
            {
                "node_id": "0.12.5.7",
                "text": "13-28",
                "annotations": [],
                "metadata": { "paragraph_type": "biblScope_page", "page_id": 0, "line_id": 0 },
                "subparagraphs": []
            },
            {
                "node_id": "0.12.5.8",
                "text": "Springer",
                "annotations": [],
                "metadata": { "paragraph_type": "publisher", "page_id": 0, "line_id": 0 },
                "subparagraphs": []
            },
            {
                "node_id": "0.12.5.9",
                "text": "2002",
                "annotations": [],
                "metadata": { "paragraph_type": "date", "page_id": 0, "line_id": 0 },
                "subparagraphs": []
            }
        ]
    },

bibliography references: bibliography references in annotations of the article’s text.
Text can contain references to bibliography_item nodes. For example, “Authors in [5] describe an approach …”. Here “[5]” is the reference. We present the bibliography reference as an annotation with name="reference" whose value is the uuid of the bibliography item. See the documentation of the class ReferenceAnnotation for more details.
An example of a bibliography reference in dedoc is given below. It is a textual node with two bibliography references (two annotations):
    {
        "node_id": "0.10.0",
        "text": "The results in this work essentially show that masking and leakage-resilient constructions hardly combine constructively. For (stateful) PRGs, our experiments indicate that both for software and hardware implementations, a leakageresilient design instantiated with an unprotected AES is the most efficient solution to reach any given security level. For stateless PRFs, they rather show that a bounded data complexity guarantee is (mostly) ineffective in bounding the (computational) complexity of the best attacks. So implementing masking and limiting the lifetime of the cryptographic implementation is the best solution in this case. Nevertheless, the chosen-plaintext tweak proposed in [34] is an interesting exception to this conclusion, as it leads to security-bounded hardware implementations for stateless primitives that are particularly interesting from an application point-of-view, e.g. for re-synchronization, challenge-response protocols, . . . Beyond the further analysis of such constructions, their extension to software implementations is an interesting scope for further research. In this respect, the combination of a chosen-plaintext leakage-resilient PRF with the shuffling countermeasure in [62] seems promising, as it could \"emulate\" the keydependent algorithmic noise ensuring security bounds in hardware. ",
        "annotations": [
            { "start": 690, "end": 694, "name": "reference", "value": "109c08ee-0872-11ef-b95c-0242ac120002" },
            { "start": 1214, "end": 1218, "name": "reference", "value": "10a004ee-0872-11ef-b95c-0242ac120002" }
        ],
        "metadata": { "paragraph_type": "raw_text", "page_id": 0, "line_id": 0 },
        "subparagraphs": []
    }

In the example, the annotations reference two bibliography_item nodes:

    {
        "node_id": "0.12.33",
        "text": "",
        "annotations": [],
        "metadata": { "paragraph_type": "bibliography_item", "page_id": 0, "line_id": 0, "uid": "109c08ee-0872-11ef-b95c-0242ac120002" },
        ...
    },
    {
        "node_id": "0.12.61",
        "text": "",
        "annotations": [],
        "metadata": { "paragraph_type": "bibliography_item", "page_id": 0, "line_id": 0, "uid": "10a004ee-0872-11ef-b95c-0242ac120002" },
        ...
    },

raw_text: node referring to a simple document line.
It has the least importance in the document tree hierarchy, so it is located in the leaves of the tree. It is nested in the node that corresponds to the preceding line of a more important type.
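The link between a reference annotation in a raw_text node and its bibliography_item node can be followed by matching the annotation value against the uid stored in the item's metadata. The sketch below assumes the tree is a plain dictionary shaped like the examples on this page and that the annotation name is "reference", as in the example output. The helpers make_node, iter_nodes, index_bibliography and resolve_references are hypothetical names, and the sample tree with its uid values is invented for illustration.

```python
from typing import Any, Dict, Iterator, List, Tuple

Node = Dict[str, Any]


def make_node(node_id: str, text: str, paragraph_type: str,
              children=(), annotations=(), uid: str = "") -> Node:
    """Build a minimal node in the shape used on this page (illustrative helper)."""
    metadata = {"paragraph_type": paragraph_type}
    if uid:
        metadata["uid"] = uid
    return {"node_id": node_id, "text": text, "annotations": list(annotations),
            "metadata": metadata, "subparagraphs": list(children)}


def iter_nodes(node: Node) -> Iterator[Node]:
    """Yield the node and all of its descendants in depth-first order."""
    yield node
    for child in node.get("subparagraphs", []):
        yield from iter_nodes(child)


def index_bibliography(root: Node) -> Dict[str, Node]:
    """Map the uid of every bibliography_item node to the node itself."""
    return {
        n["metadata"]["uid"]: n for n in iter_nodes(root)
        if n["metadata"]["paragraph_type"] == "bibliography_item" and "uid" in n["metadata"]
    }


def resolve_references(node: Node, bibliography: Dict[str, Node]) -> List[Tuple[str, str]]:
    """Return (marker text, node_id of the cited item) for each reference annotation."""
    resolved = []
    for annotation in node.get("annotations", []):
        if annotation["name"] == "reference" and annotation["value"] in bibliography:
            marker = node["text"][annotation["start"]:annotation["end"]]
            resolved.append((marker, bibliography[annotation["value"]]["node_id"]))
    return resolved


# Hand-written sample tree; the uid values are invented for this example.
text_node = make_node("0.0.0", "See [1] and [2].", "raw_text", annotations=[
    {"start": 4, "end": 7, "name": "reference", "value": "uid-a"},
    {"start": 12, "end": 15, "name": "reference", "value": "uid-b"},
])
sample_tree = make_node("0", "An example title", "root", [
    make_node("0.0", "1. Introduction", "section", [text_node]),
    make_node("0.1", "", "bibliography", [
        make_node("0.1.0", "", "bibliography_item", uid="uid-a"),
        make_node("0.1.1", "", "bibliography_item", uid="uid-b"),
    ]),
])

resolved_references = resolve_references(text_node, index_bibliography(sample_tree))
print(resolved_references)
```

Building the uid index once and reusing it for every text node keeps the lookup cheap even for articles with long bibliographies.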