LAME: Layout-Aware Metadata Extraction Approach for Research Articles



Introduction
With the development of science and technology, the number of academic papers published periodically worldwide has grown to several hundred thousand. However, their layout styles are as diverse as their subjects and publishers, even though the portable document format (PDF) is widely used globally as a standardized text-based document format. For example, the information order becomes inconsistent when converting such a document to text because no layout information separating the document content is provided. Thus, extracting meaningful information from a document, such as metadata including the title, author names, affiliations, abstract, and keywords, is quite challenging.
Research on extracting metadata or document objects from PDF documents using machine learning has increased [1][2][3][4][5][6][7]. Among natural language processing (NLP) approaches, open-source software such as Content ExtRactor and MINEr (CERMINE) [4] and GeneRation of Bibliographic Data (GROBID) [5] automatically extracts metadata using sequential labeling but generally does not take layouts into account in detail. Therefore, these tools do not achieve reasonable metadata extraction performance for every research article, owing to the diverse (and sometimes bizarre) layout formats. Unlike existing NLP-based metadata extraction approaches, PubLayNet [1], LayoutLM [7], and DocBank [3] employ object detection models, such as the Mask region-based convolutional neural network (Mask R-CNN) [8] and Faster R-CNN [9], to detect the layout of academic literature and extract document objects (e.g., text, figures, tables, titles, and lists). Their critical weakness is low layout analysis quality on unseen journals and different document types. For example, when we apply the PubLayNet model trained with Detectron2 [1] to the first page of a Korean academic journal, it cannot capture the correct regions of document objects, as depicted in Fig. 1.
In terms of training data and coverage, PubLayNet and LayoutLM automatically construct training data using the metadata provided by PubMed Central Open Access eXtensible Markup Language (PMCOA-XML) or LaTeX. Nevertheless, these datasets are intended primarily for extracting figures and tables; they do not cover all the necessary metadata, such as the abstract, author names, and keywords [1]. Moreover, to the best of our knowledge, PMCOA-XML data are limited to biomedical journals, and only small amounts of LaTeX data are available in the public domain. Recently, training data for layout-aware metadata extraction were manually crafted for 40 selected Korean scientific journals [6]. However, the quality of these layout-aware data is unsatisfactory due to inconsistent and noisy annotations.
To guarantee consistent annotation quality when constructing layout-aware training data, and to build a more sophisticated language model for advanced metadata extraction, we propose the LAyout-aware MEtadata extraction (LAME) framework, composed of three key components. First, an automatic layout analysis for metadata is designed with PDFMiner. Second, a large amount of layout-aware metadata is constructed automatically by analyzing the first page of papers in selected journals. Finally, Layout-aware Bidirectional Encoder Representations from Transformers for Metadata (Layout-MetaBERT) models are constructed by adopting the BERT architecture [10].
In addition, to show the effectiveness of the Layout-MetaBERT models, we performed a set of experiments with other existing pretrained models and compared the results with the state-of-the-art (SOTA) model for metadata extraction (i.e., bidirectional gated recurrent units with a conditional random field (Bi-GRU-CRF)).
Our main contributions are as follows:


We proposed an automatic layout analysis method that does not require PMCOA-XML (or LaTeX) data for metadata extraction.


We automatically generated training data for the layout-aware metadata from 70 research journals (65,007 PDF documents).


We constructed a new pretrained language model, Layout-MetaBERT, to deal with the metadata of research articles more effectively.


We demonstrated the effectiveness of Layout-MetaBERT on ten unseen research journals (13,300 PDF documents) with diverse layouts compared with the existing SOTA model (i.e., Bi-GRU-CRF).
Related work

Metadata extraction
Various attempts have been made to analyze and extract information from documents and classify them into specific categories. Research on text classification has been continuous since 1990, and its performance has gradually improved with the employment of sophisticated machine learning algorithms, such as the support vector machine (SVM) [11], conditional random fields (CRF) [12], convolutional neural networks (CNN) [13], and bidirectional long short-term memory (BiLSTM) [14]. Afterward, various successful applications of bidirectional encoder representations from transformers (BERT) [10], pretrained with a large-scale corpus, were introduced in the field of NLP. In the studies by [15] and [16], the pretrained BERT model was fine-tuned on the text classification task and showed results close to or superior to the SOTA results on the target data. BERT-based pretrained models became popular due to their high performance in various NLP fields, and more advanced pretrained models [17][18][19] were introduced for various research purposes.
As the previous SOTA model for our metadata extraction task, a Bi-GRU-CRF model trained on more than 20,000 human-annotated pages of layout boxes for metadata [6] from research articles achieved an F1-score of 82.46%. However, accurately detecting and extracting regions for each type of metadata in documents remains a nontrivial task because of the various layout formats.

Document layout analysis
Document layout analysis (DLA) [7] and several PDF-handling efforts [6], [11], [20] have been conducted to understand the structure of documents. DLA aims to identify the layout of text and nontext objects on the page and to detect the layout function and format. Recently, the LayoutLM model [7] employed three different information elements for BERT pretraining to identify layouts: 1) layout coordinates, 2) text extracted using optical character recognition software, and 3) image embeddings obtained by understanding the layout structure through image processing. Moreover, NLP-based DLA research on various web documents [21], as well as layout detection and layout creation methods for finding text information and locations [8], [22], [23], have been studied. [2] and [24] applied object detection techniques to text region detection. Interestingly, widely used object detection techniques (e.g., Mask R-CNN [8] and Faster R-CNN [9]) have also been applied to the metadata extraction field [1], [3].
Due to the high cost of constructing training data for DLA, many studies have attempted to build datasets automatically. For example, the PubMed Central website, which includes academic documents in the biomedical field, provides a PMCOA-XML file for each document, enabling an analysis of the document structure. In the case of PubLayNet [1], which utilizes the PubMed dataset, the XML and PDFMiner's TextBoxes were matched to construct about one million training instances. However, this is generally possible only when accurate coordinates separating each layout, along with the text information elements for each, are provided.

Automatic layout analysis
To understand the layout that separates each metadata element in a given PDF file, we must observe the text and coordinate information on the document's first page. To this end, we employ the open-source software PDFMiner to extract meaningful information surrounding the text in PDF files.
If we parse a PDF document with this software, we obtain information on the page, TextBox, TextLine, and character (Char) hierarchically, as illustrated in Fig. 3. These objects carry various text information, such as coordinates, text, font size, and font. For example, text coordinates appear as (x, y) pairs along with the height and width of the page.

Textbox reconstruction
To reduce existing errors in TextBox recognition, as depicted in Fig. 3, TextBoxes were reconstructed starting from the Char unit using the information obtained from PDFMiner. First, the spacing between characters is analyzed using the coordinate information of each Char. Generally, the x-coordinate distance between characters (character spacing) within a token is the same, but the distance differs slightly depending on the alignment method or language. Therefore, after collecting the characters with the same y-coordinate, the corresponding characters are sorted by their x-coordinate values. As displayed in Fig. 4, if the distance between two Chars is smaller than their font size, the Chars are determined to be part of the same TextLine, even in an academic document consisting of two columns. After aligning the TextLines along the y-axis, if the distance between two y-coordinates is smaller than the height of each TextLine, the two TextLines are regarded as belonging to the same TextBox. However, this method alone cannot create TextBoxes accurately by separating paragraphs from one another. For a more elaborate TextBox composition, whether to extend the TextBox must be decided by considering the left x-coordinate x0, the right x-coordinate x1, and the width (W) of each TextLine. For example, for sentences like those in Fig. 5, two cases arise when composing a TextBox by comparing consecutive TextLines.
First, the beginning of a paragraph is usually indented. Therefore, if the difference between the x0 values of TextLines Li and Li-1 is greater than the font size of the Chars in each TextLine, the two TextLines should be assigned to different TextBoxes. Second, a TextLine that appears at the end of a paragraph has a shorter width because it has fewer Chars on average. Therefore, when the width of Li-1 is smaller than the width of Li, Li-1 and Li should be assigned to different TextBoxes.
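The spacing, y-distance, and paragraph rules above can be sketched as follows. This is a simplified illustration under assumed data structures (Char and TextLine here are plain containers, not PDFMiner's classes), not the exact implementation:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Char:
    x0: float; x1: float; y0: float; y1: float; text: str; size: float

@dataclass
class TextLine:
    chars: List[Char] = field(default_factory=list)
    @property
    def x0(self): return min(c.x0 for c in self.chars)
    @property
    def x1(self): return max(c.x1 for c in self.chars)
    @property
    def y0(self): return min(c.y0 for c in self.chars)
    @property
    def height(self): return max(c.y1 for c in self.chars) - self.y0
    @property
    def width(self): return self.x1 - self.x0
    @property
    def font_size(self): return max(c.size for c in self.chars)

def build_lines(chars):
    """Collect Chars sharing a y-coordinate, then sort by x; a Char joins
    the current TextLine while the x-gap stays below its font size."""
    lines, by_y = [], {}
    for c in chars:
        by_y.setdefault(round(c.y0, 1), []).append(c)
    for y in sorted(by_y, reverse=True):           # top of the page first
        row = sorted(by_y[y], key=lambda c: c.x0)
        line = TextLine([row[0]])
        for prev, cur in zip(row, row[1:]):
            if cur.x0 - prev.x1 < cur.size:        # same TextLine
                line.chars.append(cur)
            else:                                   # column gap -> new line
                lines.append(line)
                line = TextLine([cur])
        lines.append(line)
    return lines

def build_boxes(lines):
    """Merge vertically adjacent TextLines into TextBoxes, splitting on
    paragraph indentation and on a short (paragraph-final) previous line."""
    boxes, box = [], [lines[0]]
    for prev, cur in zip(lines, lines[1:]):
        vertical_gap = prev.y0 - (cur.y0 + cur.height)
        indented = abs(cur.x0 - prev.x0) > cur.font_size
        prev_is_short = prev.width < cur.width
        if vertical_gap < cur.height and not indented and not prev_is_short:
            box.append(cur)
        else:
            boxes.append(box)
            box = [cur]
    boxes.append(box)
    return boxes
```

The thresholds (font size for indentation, line height for vertical merging) follow the text; real pages would need tolerance for rounding noise in the coordinates.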

Refinement with font information
PDFMiner can produce various pieces of font information, such as the font name and style (e.g., bold or italic), as listed in Table 1. However, as shown in Fig. 6, English text frequently appears within Korean abstracts in some journals published in Korea. In particular, abstracts written in Korean and English appear together on the first page of some research articles. In addition, certain strings, such as section titles, are often set in bold or italic and often have different fonts and sizes. Considering this problem, when composing a TextBox using the coordinate information described above, lines are not simply judged to be different just because their font information differs. After analyzing the font information of the different languages that appear, the TextBox is determined by considering the number of appearing fonts (e.g., bold and italic).

Table 1: Example of font information when PDFMiner is applied
Although font information helps the layout composition, it remains confusing when the same font is used for marking individual information, for bold emphasis, or for different metadata. Additional processing is required to correctly connect individual fonts when constructing a layout from font information. Therefore, we compared only texts written in Korean and English and used only fonts of the same language to determine the layout.
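The same-language font comparison might be sketched as follows, under the assumption that Hangul syllables identify Korean text; the function names and the majority-font heuristic are illustrative, not the paper's exact procedure:

```python
from collections import Counter

def script_of(ch: str) -> str:
    """Crude script detector: Hangul syllables vs. everything else."""
    return "korean" if "\uac00" <= ch <= "\ud7a3" else "latin"

def dominant_fonts(line):
    """Most frequent font per script on a line.
    `line` is a list of (char, font_name) pairs."""
    per_script = {}
    for ch, font in line:
        per_script.setdefault(script_of(ch), Counter())[font] += 1
    return {s: c.most_common(1)[0][0] for s, c in per_script.items()}

def same_box_fonts(line_a, line_b) -> bool:
    """Two lines may share a TextBox if every script they have in
    common uses the same dominant font (fonts of other languages
    are ignored, as described above)."""
    fa, fb = dominant_fonts(line_a), dominant_fonts(line_b)
    shared = fa.keys() & fb.keys()
    return bool(shared) and all(fa[s] == fb[s] for s in shared)
```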

Adjustment of text box order
Academic papers may consist of one or two columns depending on each journal's format. In some cases, only the main body consists of two columns, while the title, abstract, and author names are displayed in one column. For example, in Fig. 3, information such as the title and author names is arranged in the center, but the document object identifier (DOI) information and journal name appear separately on the left and right sides. To identify metadata consistently across varied layout formats, we sorted the TextBoxes extracted from the first pages of the research articles from top to bottom based on the y-axis.

Automatic training data construction
We compared the content extracted with PDFMiner against metadata prepared in advance to construct the layout-aware metadata automatically. If no metadata is available for a given research article, the metadata can be obtained automatically through a DOI lookup. Therefore, this technique can be extended to all journal types with a registered DOI.
However, the compared textual content does not always match precisely. Therefore, to determine the extent of a match, we accepted as training data only fields with almost identical (or highly similar) matches for each layout text element acquired automatically in the previous step. For efficient computation, we used a mixed textual-similarity measure based on the Levenshtein distance and the bilingual evaluation understudy (BLEU) score.
The Levenshtein distance was calculated using Python's fuzzywuzzy. The scores calculated using the BLEU [25] measure were summed with it to determine whether the given metadata shows a degree of agreement of 80% or more. Nevertheless, some post-processing is required. When analyzing the extracted text, problems occur with expression substitutions (e.g., "<TEX>", cid:0000). To avoid encoding errors, we removed mathematical expressions as much as possible and excluded text with encoding problems.
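The matching step can be sketched as follows, with Python's stdlib difflib ratio standing in for fuzzywuzzy's Levenshtein-based ratio and a bare n-gram precision standing in for the full BLEU score; the 80% threshold follows the text, but the equal weighting of the two scores is an assumption:

```python
from difflib import SequenceMatcher

def ngram_precision(candidate: str, reference: str, n: int = 2) -> float:
    """Fraction of candidate n-grams found in the reference
    (a stripped-down stand-in for BLEU, without brevity penalty)."""
    cand, ref = candidate.split(), reference.split()
    cand_ngrams = [tuple(cand[i:i + n]) for i in range(len(cand) - n + 1)]
    ref_ngrams = {tuple(ref[i:i + n]) for i in range(len(ref) - n + 1)}
    if not cand_ngrams:
        return 0.0
    return sum(1 for g in cand_ngrams if g in ref_ngrams) / len(cand_ngrams)

def is_metadata_match(extracted: str, known: str, threshold: float = 0.8) -> bool:
    """Accept a layout box as training data only when the extracted text
    agrees with the known metadata field to >= 80%."""
    edit_score = SequenceMatcher(None, extracted, known).ratio()
    bleu_like = ngram_precision(extracted, known)
    return (edit_score + bleu_like) / 2 >= threshold
```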

Pretraining Layout-MetaBERT
Although pretraining a BERT model requires a large corpus and a long training time, the fine-tuning step can yield performance differences depending on the characteristics of the data used for pretraining. For example, models pretrained with specific domain data, such as SciBERT [19] and BioBERT [26], performed better than Google's BERT model [10] on downstream tasks in the science and technology or medical fields. However, to the best of our knowledge, no pretrained model has been designed to extract metadata from research articles. Therefore, we pretrained a new layout-aware language model, called Layout-MetaBERT, that can effectively deal with the metadata of research articles. Fig. 7 describes how the previously constructed training data are used for pretraining and fine-tuning Layout-MetaBERT. Unlike the Google BERT model [10], our Layout-MetaBERT pretraining treats each document layout as a sequence. Thus, each layout was delimited by the [SEP] token when preparing the pretraining data. In pretraining the Layout-MetaBERT models, we followed the three model sizes of Google BERT: base (L = 12, H = 768, A = 12), small (L = 4, H = 512, A = 8), and tiny (L = 2, H = 128, A = 2), where L is the number of transformer blocks, H is the hidden size, and A is the number of self-attention heads. We used a vocabulary of 10,000 words built with the WordPiece mechanism and the automatically generated training data extracted from the first pages of 60 of the 70 research journals for pretraining. The pretrained Layout-MetaBERT can be used for metadata extraction after fine-tuning.
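The [SEP]-delimited input preparation can be sketched as follows; the [CLS]/[SEP] placement follows the usual BERT convention, while the exact packing used in the paper is an assumption:

```python
def page_to_pretraining_example(layout_texts, max_layouts=None):
    """Format the layout boxes of one first page as a single BERT-style
    input, treating each layout as a sequence delimited by [SEP]."""
    if max_layouts is not None:
        layout_texts = layout_texts[:max_layouts]
    pieces = ["[CLS]"]
    for text in layout_texts:
        pieces.append(text.strip())
        pieces.append("[SEP]")   # one [SEP] closes each layout sequence
    return " ".join(pieces)
```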

Experiments
We summarize the results for three major components to examine the applicability of the LAME framework. First, we compare the results of the proposed automatic layout analysis with other layout analysis techniques. Second, we describe the statistics of the training data constructed from the results of the automatic layout analysis. Finally, we compare the metadata extraction performance of our Layout-MetaBERT models with other deep learning and machine learning techniques after fine-tuning.

Comparison with other layout analysis methods
No prepared ground truth exists for the target research articles; thus, we compared the layout boxes generated by PDFMiner, PubLayNet, and the proposed layout analysis method for two randomly selected documents (A and B), as depicted in Fig. 8. For Document A, the extraction results of PDFMiner and our layout analysis are similar; however, PubLayNet separates the paragraph information excessively, as indicated in Fig. 8-(a). For Document B, the extraction results of all techniques were somewhat similar, but PubLayNet merged the author names and affiliations into one piece of information, and PDFMiner produced a separate box for the line-wrapped title.
These comparisons show that the proposed method generates a sufficiently good layout analysis for the first page of research articles. Manually comparing all three layout analysis results for each layout box to calculate accuracy requires too much human labor and is beyond the scope of this paper. Instead, the performance of the constructed Layout-MetaBERT indirectly measures the quality of the layout analysis.

Training data construction
To reflect various layout formats, we used 70 research journals (Appendix 1) provided by the Korea Institute of Science and Technology Information (KISTI) to extract the major metadata elements, such as titles, author names, author affiliations, keywords, and abstracts in Korean and English, based on the automatic layout analysis in Section 3.1. Among the 70 journals, two were written only in Korean, 23 only in English, and 45 in both Korean and English.
For each layout separating metadata on the first pages of the 70 journals (65,007 PDF documents), automatic labeling with ten labels was performed, and layouts not containing the relevant information were labeled O. The statistics of the automatically generated training data are presented in Table 2.

Experimental results
To evaluate the performance of the proposed Layout-MetaBERT, the 70 research journals (65,007 documents) were divided into 60 journals (51,676 documents) for pretraining (and fine-tuning) and 10 journals (13,331 documents) for testing. Table 3 lists the training and testing performances of the three Layout-MetaBERT models alongside widely used metadata extraction techniques. Table 4 presents the Macro-F1 and Micro-F1 scores for metadata classification compared with existing pretrained models.

Fine-tuning and Hyperparameters
For fine-tuning with the various pretrained language models (i.e., the three sizes of Layout-MetaBERT, KoALBERT, KoELECTRA, and KoBERT), all experiments were conducted under the same configuration: 5 epochs, a batch size of 32, a learning rate of 2e-5, and a maximum sequence length of 256. In addition, we used an Nvidia RTX Titan 4-way system and Google's TensorFlow framework with Python 3.6.9 for pretraining and fine-tuning.
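For reference, the stated fine-tuning configuration collected in one place (a plain sketch; the key names are our own, not from the paper's code):

```python
# Shared fine-tuning configuration for all compared pretrained models.
# Values come from the text above; key names are illustrative.
FINETUNE_CONFIG = {
    "epochs": 5,
    "batch_size": 32,
    "learning_rate": 2e-5,
    "max_sequence_length": 256,
}
```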

Stable performances of Layout-MetaBERT
The proposed Layout-MetaBERT models can effectively extract metadata, as listed in Table 3. In particular, the Layout-MetaBERT models show significant improvements over the existing SOTA (i.e., Bi-GRU-CRF) model. Even the tiny model, with the fewest parameters among the Layout-MetaBERT models, outperforms the other pretrained models in Macro-F1 and Micro-F1 scores, as displayed in Table 4. Moreover, the three Layout-MetaBERT models show only minor differences between their Micro-F1 and Macro-F1 scores compared to the other pretrained models. The Layout-MetaBERT models also exhibit 90% or higher robustness in metadata extraction, confirming that pretraining on layout units with the BERT scheme is feasible for the metadata extraction task.

Model                              Train    Test
Bi-GRU-CRF [6] (without position)  0.8610   0.8912
Bi-GRU-CRF [6] (with position)     0.9442   0.0985
CNN [13]                           0.9425   0.824
SVM [11]                           0.9411   0.8114

Table 4: Metadata extraction performances of primary BERT models for each label

Experiments with position information
Unlike the other models, the Bi-GRU-CRF model used the absolute coordinates of the metadata along with other textual features. However, the model failed to discriminate unseen layouts from unseen journals when the coordinate information was used to train on various journal layout formats. Therefore, to determine the validity of the coordinate information, we performed additional experiments with the Bi-GRU-CRF (with position) and Bi-GRU-CRF (without position) models. Although the Bi-GRU-CRF (with position) model demonstrated high performance in the training stage, it failed to recognize metadata-related layouts in unseen journals (less than 10% F1-score). In contrast, the Bi-GRU-CRF (without position) model had somewhat lower performance in the training stage than the other models but performed well on unseen journals, similar to KoALBERT. Thus, we confirmed that absolute coordinate information can only be applied under the premise that the journals used in training are also used in testing.

Additional performance improvements
The proposed Layout-MetaBERT achieved higher results than the existing SOTA model [6], whereas absolute coordinate information yields poor results for documents in unlearned formats. In addition, the proposed layout analysis method separates the metadata well from the first pages of academic documents with various layouts. However, the accuracy of the automatically generated training data is not perfect. Errors may arise from differences between the metadata format in the document and the metadata prepared in advance. As mentioned, encoding errors also occur when extracting text containing mathematical formulas from PDF documents. Generating correct layouts has a significant effect on extracting metadata and is an essential factor in automatic data generation. Therefore, if more sophisticated training data can be generated, the performance of Layout-MetaBERT can be further improved.

Restrictions of Layout-MetaBERT
Much research has been conducted on automatically extracting layouts from PDF documents, and creating accurate layouts has a significant influence on metadata extraction. This study composed the layout of the first page of an academic document using text information. Based on this, we trained Layout-MetaBERT and confirmed positive results for its applicability to the metadata classification module. However, the proposed technique cannot be applied to all documents. An image-based PDF cannot be used unless its text is first extracted; in this case, extraction must be performed using a high-performance optical character recognition module.

Expansion to other metadata types
This study focused on extracting five major metadata elements (i.e., titles, abstracts, keywords, author names, and author affiliations). Considering that the target research articles contain elements written in English, Korean, or both, the number of metadata labels becomes ten. However, other metadata (e.g., publication year, start page, end page, DOI, volume number, and journal title) can be extracted further by applying highly refined regular expressions in a post-processing step.
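A sketch of such regular-expression post-processing; the patterns below are common forms for DOIs, years, and page ranges, labeled illustrative rather than the ones actually used in this work:

```python
import re

# Illustrative patterns (assumptions, not the paper's actual expressions):
DOI_RE = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")   # common DOI form
YEAR_RE = re.compile(r"\b(19|20)\d{2}\b")                     # 4-digit year
PAGES_RE = re.compile(r"\bpp?\.\s*(\d+)\s*[-\u2013]\s*(\d+)") # "pp. 12-25"

def extract_extra_metadata(text: str) -> dict:
    """Pull DOI, publication year, and page range from free text."""
    out = {}
    if m := DOI_RE.search(text):
        out["doi"] = m.group(0)
    if m := YEAR_RE.search(text):
        out["year"] = m.group(0)
    if m := PAGES_RE.search(text):
        out["start_page"], out["end_page"] = m.group(1), m.group(2)
    return out
```

In practice, such patterns would be applied to the already-identified layout boxes (e.g., the journal header box) rather than to the whole page, to reduce false matches.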

Conclusion
In this paper, we proposed the LAME framework to extract metadata from the PDFs of research articles with high performance. First, the automatic layout analysis detects the layout regions where metadata exist, regardless of the journal format, based on text features, text coordinates, and font information. Second, through automatic training data construction, we built high-quality metadata-separated training data for 70 journals (65,007 documents). In addition, our fine-tuned Layout-MetaBERT (base) demonstrated excellent metadata extraction performance (F1 = 94.6%) even for unseen journals with diverse layouts. Moreover, Layout-MetaBERT (tiny), with the fewest parameters, exhibited superior performance to the other pretrained models, implying that well-separated layouts enable effective metadata extraction when combined with appropriate language models.
In future work, we plan to conduct experiments to determine whether the proposed model applies to the more than 500 other journals not used in this study. Moreover, resolving potential errors in the automatically generated training data remains a concern for creating layouts that separate each metadata element more precisely. Furthermore, extending the number of metadata items extracted without post-processing is an exciting but challenging task for future work.

Proposed framework
Fig. 2 depicts our LAME framework consisting of three major components: automatic layout analysis, layout-aware training data construction, and Layout-MetaBERT generation.

Figure 3: TextBox reconstruction based on the results of PDFMiner.

Figure 4: Example of the separated column layout.

Figure 6: Example of when a Korean abstract and an English abstract exist together.

Table 2: Statistics for automatically generated training data

Table 3: Train and test performances of metadata extraction