1. Introduction
Many researchers from different research areas publish scientific papers to contribute to the scientific community. The massive flow of scientific articles makes accessing the required information more challenging. Researchers need to grasp the salient information without reading the entire document. Therefore, interest in automatic text summarization has grown. Automatic text summarization (ATS) presents the content of a document as quickly and concisely as possible while preserving the original document's integrity. ATS can be performed with an extractive or abstractive approach, depending on whether salient sentences in the source document are selected, combined, or paraphrased [1]. Extractive summarization generates a summary from sentences that best express the main idea of the source document [2], whereas abstractive summarization generates rewritten summaries with sentences or words that differ from the source document [3]. In this respect, the abstractive approach generates summaries that are similar to human-written summaries.
The rapid development of deep learning techniques has resulted in great advances in abstractive summarization methods [4, 5]. However, it is still a challenge to intelligently summarize scientific articles with traditional neural summarization methods. The main reasons are: (I) These methods are trained for task-specific applications on public datasets consisting of news articles, blog posts, tweets, etc. (II) Extracting the data presented in scientific articles is a complex process because it requires knowledge to identify new technologies and their connections [6]. Therefore, information extraction (IE) systems may be preferred in practice to determine these relations automatically. However, the relation annotations extracted by these methods are not sufficient for a scientific text summary [7, 8]. (III) In current studies [4], it has been mentioned that LSTM and attention-based models cannot handle hierarchical structures. In this paper, considering all the above problems, a novel scientific text summarization model based on SciBERT and the graph transformer network (GTN) is proposed that generates abstracts from the introduction section of scientific articles. First, entity, co-reference, and relation annotations are extracted from the source document with the Scientific Information Extractor (SciIE) to handle the hierarchical structure of the document [7]. Second, a knowledge graph is constructed with this information. The output of the last hidden state of SciBERT is used to encode the introduction section. Finally, GTN [8] is applied to generate a summary from the knowledge graph. The main contribution of this paper is:
2. Related Work
Text summarization approaches are performed by directly extracting salient sentences from the source documents or rewriting these sentences with words that differ from the source documents [
9,
10]. Previous researchers have focused on extractive methods to summarize scientific articles because abstractive methods are more difficult and complex, requiring advanced NLP techniques. These studies are summarized in
Table 1. In this study [
3], they proposed a generic summarizer model that is language-independent and based on a quantum-inspired approach for the extraction of important sentences. In this study [
11], they proposed a graph-based framework that can also be applied to scientific articles without any domain or language constraints. This model utilizes the advantages of graph-based, statistics-based, semantic-based, and centrality-based methods. In this study [
12], they proposed a regression-based model to highlight salient sentences in scientific articles. They experimented on three different scientific datasets (CSPubSum, AlPubSum, and BioPubSum) to demonstrate the effectiveness of their method. In this study [
13], they constructed a large-scale manually annotated dataset (SciSummNet) for summarizing scientific articles. In addition, a hybrid summarization model was proposed. The effectiveness of their corpus on this model and the data-driven neural models was evaluated. In this study [
14], they presented a new model for summarizing scientific articles inspired by SummPip [
15]. In this study [
16], they constructed a novel corpus (SciTLDR) consisting of scientific papers related to the computer science domain. In addition, a novel model (CATTS) was proposed to evaluate their corpus. The proposed model is appropriate for both extractive and abstractive methods.
Abstractive methods are closer to reality in that they generate summaries in a human-like way. Recently, researchers have focused on them for generating scientific summaries [
9,
10]. In this study [
5], they presented a graph network-based model consisting of a sentence-level denoiser and an auto-regressive generator. To demonstrate the effectiveness of their model, the PubMed and CORD-19 datasets containing scientific articles in the biomedical domain were used. In this study [
17], they proposed a sequence-to-sequence-based model with three encoders and one decoder. In addition, they proposed novel evaluation metrics, namely ROUGE1-NOORDER, ROUGE1-STEM, and ROUGE1-CONTEXT. In this study [
4], they presented a SciBERT-based summarization model to summarize scientific articles related to COVID-19. This model consists of a graph attention network and a pre-training language model (SciBERT). To evaluate the proposed model, the CORD-19 (COVID-19 Open Research Dataset) consisting of scientific articles was used. In this study [
18], they presented a novel model consisting of timescale adaptation over the pointer-generator-coverage network. It has been mentioned that this model is successful in summarizing long articles.
Pre-trained language models (PTLMs) have made significant progress in abstractive methods and many NLP tasks. The main aim of these models is to learn relations at the sentence/token level from large-scale corpora. Many researchers have proposed language models to enhance specific NLP tasks. These are applied through two basic strategies: feature-based and fine-tuning. The feature-based approach requires task-specific architectures with pre-trained representations as additional features. In the fine-tuning approach, a classification layer must be added to the pre-trained model. Fine-tuning approaches are widely preferred to enhance the quality of the generated summaries in ATS [19, 20].
PTLMs are preferred in general-domain text summarization tasks and have achieved successful results. The BERT [
20] model was constructed to pre-train deep bidirectional representations of unlabeled text. This model has only an encoder; therefore, it is stated that it is not suitable for abstractive approaches. To address this problem, many researchers have proposed novel models based on BERT. BERTSUMABS [21] is a model that consists of an encoder and a decoder. GSUM [22], based on a neural encoder-decoder, is a model that takes several types of external guidance as input along with the source text. The Text-To-Text Transfer Transformer (T5) [23], based on an encoder-decoder architecture, aims to generate a novel text from the text it receives as input. Refactor [24] is a model with a two-stage training process that identifies candidate summaries from both document sentences and the outputs of different base models. To obtain semantic representations of scientific articles, researchers have focused on the SciBERT model, which has the same architecture and configuration as BERT. SciBERT is a model with a maximum sequence length of 512 tokens, pre-trained on a large-scale collection of scientific papers (1.14M papers) from Semantic Scholar covering different disciplines [4, 6].
GTN is a model designed to learn node representation and identify the relations between disconnected nodes in graph structures [
25]. In most studies related to text summarization [4, 8, 26, 27, 28, 29], graph attention networks (GATs) are widely used among graph neural network-based approaches. GATs [30] are models with an attention-based architecture constructed to operate on data in a graph structure. Their main aim is to find the representation of each node in the graph by adopting the attention mechanism. With GATs, the hierarchical structure of the document can be handled as a whole. In addition, meaningful relationships between sentences, tokens, or entities can be revealed through the preservation of the global context [31].
According to [4], GATs are successful in representing word co-occurrence graphs because they utilize a masked self-attention mechanism to capture dependencies between neighbors and prevent information flow between disconnected nodes. According to [26], GATs can capture the hierarchical structure of a document simultaneously at the token, sentence, paragraph, and document levels. In addition, they are consistent with the multi-head attention module in the BERT model. According to [27], the advantage of GATs is that they can enhance the impact of the most salient parts of the source document. According to [28], GATs can effectively capture the content of the source document by propagating contextual information. The authors of [29] proposed mix-order graph attention networks, inspired by the traditional GAT model, for handling indirectly connected nodes. According to [8], the use of self-attention in GATs restricts vertex updates to information from adjacent nodes, despite eliminating the deficiencies of previous methods based on graph convolutions. Therefore, a graph transformer encoder built on the GAT architecture was proposed, which provides a more global contextualization of each vertex with a transformer-style architecture. However, GATs [29] have not achieved results as efficient as those of the GTN. In this paper, a graph-based abstractive summarization (GBAS) approach consisting of three stages, inspired by SciBERT and the graph transformer, is proposed to generate a summary from the introduction sections of the source papers.
3. Proposed Model
The framework of the proposed model is illustrated in
Figure 1. The proposed model consists of five stages: dataset preparation, introduction encoder, graph construction, graph encoder, and summary decoder. First, the dataset is prepared from the introduction section and abstract of the scientific articles together with the features extracted through the SciIE system. Second, word embeddings for the introduction section are obtained from the output of the last hidden layer of SciBERT. Then, a knowledge graph is constructed with the extracted features. The knowledge graph is encoded with a graph transformer. In the last stage, a summary is generated from the knowledge graph. These stages are explained in detail below.
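For illustration only, the sketch below shows one way a single training example for this pipeline could be represented in Python; the field names and toy values are assumptions, not the authors' actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class PaperExample:
    """One training example for the pipeline (illustrative fields, not the authors' exact schema)."""
    introduction: str                                   # source text (introduction section)
    abstract: str                                       # target text (ground-truth summary)
    entities: List[str] = field(default_factory=list)   # entity mentions from SciIE
    relations: List[Tuple[int, int, int]] = field(default_factory=list)  # (head, relation type, tail) indices
    coref_clusters: List[List[int]] = field(default_factory=list)        # clusters of co-referent entity indices

# Minimal usage with toy values
example = PaperExample(
    introduction="Deep learning techniques are used for brain tumor segmentation ...",
    abstract="We study brain tumor segmentation ...",
    entities=["brain tumor segmentation", "deep learning techniques"],
    relations=[(1, 0, 0)],      # e.g. USED-FOR(deep learning techniques, brain tumor segmentation)
    coref_clusters=[[0]],
)
print(example.entities)
```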
3.1. Dataset
In the current research, abstractive summaries are generated from the title or full text of scientific articles [8, 32]. Among the sections, the introduction is neither as short as the title nor as long as the full text. This section contains necessary and sufficiently salient information regarding the purpose and scope of the article. Therefore, it is anticipated within the scope of this study that summaries generated from this section can improve the performance of scientific text summarization. The main aim of this paper is to generate a scientific summary from the introduction section. First, scientific articles up to April 2022 were crawled from the arXiv website, covering current topics related to computer science such as "fingerprint", "image processing", "natural language processing", "cyber security", and "machine learning". The dataset properties are given in Table 2. Considering the following factors, the SciIE system was preferred for extracting salient information from scientific articles:
Most relation IE systems [33, 34] in the scientific domain are designed to extract relations within sentences. However, SciIE makes it possible to extract information across sentences [7].
The SciIE system is designed to identify six entity types (task, method, metric, material, other-scientific term, and generic) and seven relation types (compare, part-of, conjunction, evaluate-for, feature-of, used-for, and hyponym-of), and its co-reference annotations are used together with these to obtain the entity and relation annotations for each document.
3.2. The Graph-Based Abstractive Summarization Model: (GBAS)
Introduction Encoder: The GBAS model generates a summary from the introduction section. The pre-trained language model SciBERT was used as the introduction encoder. SciBERT has a multi-layer transformer architecture [
20].
In this study, the introduction section was tokenized. To obtain the corresponding sequences in the samples as word embeddings, we used the output of the last hidden state of SciBERT. Thus, word embeddings were obtained for the introduction section of each article as follows:

$S = \{w_{1,1}, w_{1,2}, \dots, w_{n,m}\}$ (1)

where $S$ represents the source word sequence. The sub-word of each word is denoted by $w_{n,m}$, in which $n$ and $m$ indicate the word order and sub-word order, respectively. In response to $S$, the target word sequence is obtained as $T = \{t_1, \dots, t_{|T|}\}$.
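As a concrete illustration of this step, the following sketch extracts last-hidden-state embeddings with the Hugging Face transformers library; the checkpoint name (allenai/scibert_scivocab_uncased) and truncation settings are assumptions about the setup rather than the authors' exact preprocessing.

```python
# Sketch: obtain sub-word embeddings for an introduction section from SciBERT's
# last hidden state (assumes the allenai/scibert_scivocab_uncased checkpoint).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

introduction = "Deep learning techniques are used for brain tumor segmentation ..."
inputs = tokenizer(introduction, return_tensors="pt",
                   truncation=True, max_length=512)  # SciBERT's maximum sequence length

with torch.no_grad():
    outputs = model(**inputs)

embeddings = outputs.last_hidden_state   # shape: (1, num_sub_words, hidden_size)
print(embeddings.shape)
```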
Graph Construction: The graph transformer takes a graph as input. To construct the graph, the entities, their relations, and the co-reference annotations for each introduction section were extracted as outlined in the dataset preparation. Then, the graph preparation process of the GraphWriter model [8], which is based on [35], was adopted. Differently from that work, the graph was constructed by considering the introduction and abstract sections together. According to the document structure given as an example in Table 3, the graph construction stage is as follows:
The "abstract-relations" and "document-relations" arrays are rebuilt according to the "relations types", "abstract-entities", and "document-entities" indexes.
For instance, the "abstract-relations" array is converted to "0 1 1", according to the example of "brain tumor segmentation – CONJUNCTION – treatment outcome evaluation", so that the index of brain tumor segmentation is "0", the index of CONJUNCTION is "1" and the index of treatment outcome evaluation is "1".
For instance, the "abstract-relations" array is converted to "2 0 3", according to the example of "Deep learning techniques – USED-FOR – brain tumor segmentation-method", so that the index of deep learning techniques is "2", the index of USED-FOR is "0" and the index of brain tumor segmentation method is "3".
Entity names that could not be extracted because of spelling errors were excluded from the novel array. Accordingly, the "abstract-relations" array is [0 1 1; 2 0 3; 4 1 5; 9 0 8; 9 1 10; 10 0 8; 12 0 8; 10 0 8; 13 0 81; 9 0 17; 21 0 17; 23 0 22; 28 0 22 ; 28 1 29 ; 29 0 22].
The same transformation was performed in the introduction section.
The array of "document-relations" is [1 1 2; 19 0 18; 15 0 21; 25 0 26; 25 1 2; 28 0 26; 25 0 30; 31 0 5; 34 0 5; 34 0 36; 37 6 36; 39 6 36; 40 0 41; 49 1 50 ; 54 0 53 ; 55 6 54 ; 60 0 52 ; 63 0 66 ; 66 0 67 ; 34 0 75].
As a result, a comprehensive graph was constructed by combining the new transformed "abstract-relations" and "document-relations" arrays; the index conversion is illustrated in the sketch below.
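A minimal sketch of this index conversion is given below; the entity list, the relation-type ordering, and the helper function are illustrative and chosen only to reproduce the worked example above.

```python
# Sketch: convert (head, relation, tail) string triples into index triples,
# mirroring the "abstract-relations" example (lists below are illustrative;
# the relation-type ordering is chosen to reproduce the worked example).
abstract_entities = ["brain tumor segmentation", "treatment outcome evaluation",
                     "deep learning techniques", "brain tumor segmentation-method"]
relation_types = ["USED-FOR", "CONJUNCTION", "COMPARE", "PART-OF",
                  "EVALUATE-FOR", "FEATURE-OF", "HYPONYM-OF"]

def triple_to_indices(head, relation, tail, entities, rel_types):
    """Return [head_index, relation_index, tail_index]; None if an entity or
    relation is missing (e.g. because of a spelling error), so it can be excluded."""
    if head not in entities or tail not in entities or relation not in rel_types:
        return None
    return [entities.index(head), rel_types.index(relation), entities.index(tail)]

triples = [
    ("brain tumor segmentation", "CONJUNCTION", "treatment outcome evaluation"),
    ("deep learning techniques", "USED-FOR", "brain tumor segmentation-method"),
]
abstract_relations = [idx for t in triples
                      if (idx := triple_to_indices(*t, abstract_entities, relation_types)) is not None]
print(abstract_relations)   # [[0, 1, 1], [2, 0, 3]]
```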
Figure 2 shows the graph constructed for the introduction and abstract sections. The two global nodes (one for the abstract and one for the introduction) contain the entity name lists in Table 3. To maintain the flow of information in the graph, each global node is connected to all nodes and is used as the start of the decoder. Nodes consist of entity names. Each labeled edge is replaced by two nodes: one represents the forward direction of the relation (Rel.), and the other represents the reverse direction of the relation. Each new relation node is connected to the nodes consisting of entities (Ent.), preserving the directions of the previous edges.
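The following sketch shows one possible implementation of this transformation, in which every relation contributes a forward and a reverse relation node and a global node is connected to all nodes; the node naming and adjacency representation are assumptions for illustration.

```python
# Sketch: build an unlabeled graph in which each relation becomes a forward and a
# reverse node, and a global node is connected to all nodes (naming is illustrative).
def build_graph(entities, relation_types, relation_triples):
    """relation_triples are [head_idx, rel_idx, tail_idx] lists as constructed above."""
    nodes = list(entities) + ["GLOBAL"]
    edges = []  # directed edges as (source node index, target node index)
    global_idx = len(nodes) - 1

    for head, rel, tail in relation_triples:
        fwd = len(nodes)
        nodes.append(relation_types[rel])            # forward relation node
        rev = len(nodes)
        nodes.append(relation_types[rel] + "-REV")   # reverse relation node
        edges += [(head, fwd), (fwd, tail),          # head -> Rel. -> tail
                  (tail, rev), (rev, head)]          # tail -> Rel.-REV -> head

    # the global node keeps information flowing and is used to start the decoder
    edges += [(global_idx, i) for i in range(len(nodes)) if i != global_idx]
    return nodes, edges

nodes, edges = build_graph(
    ["brain tumor segmentation", "treatment outcome evaluation"],
    ["USED-FOR", "CONJUNCTION"],
    [[0, 1, 1]],
)
print(len(nodes), len(edges))   # 5 nodes, 8 edges
```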
Within the scope of the study, the entity names (abstract-entities, document-entities), the relations (abstract-relations, document-relations), and the global nodes given in Equations (2), (3), and (4) were combined to create the comprehensive graph in Figure 3.
Graph Encoder: To encode the graph structure, a graph transformer architecture based on GATs is used. This model uses an N-head self-attention mechanism, as shown in Figure 4. In this model, each vertex is contextualized by attending to the other vertices connected to it in the graph. For the calculation of the N independent attention heads, equation (5) is applied:

$\hat{v}_i = \Vert_{n=1}^{N} \sum_{j \in \mathcal{N}_i} \alpha_{ij}^{n} W^{n} v_j$ (5)

where $\Vert$, $\mathcal{N}_i$, and $\alpha_{ij}^{n}$ indicate the concatenation of the N attention heads, the neighborhood of $v_i$ in the graph structure, and the attention mechanism given in equation (7), respectively. For each head, independent transformations $W^{n}$ are learned together with $\alpha^{n}$. In this model, equations (8) and (9) are applied L times for each block, where FNN(x) denotes a two-layer feedforward network. As a result, each vertex encoding is denoted by $\hat{v}_i$, and the set of encodings consists of relation, entity, and global vertices.
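To make the N-head attention in equation (5) concrete, the sketch below implements one masked multi-head graph-attention step in PyTorch; the scaled dot-product scoring and the toy dimensions are assumptions, not the exact formulation of the graph transformer in [8].

```python
# Sketch: one N-head graph-attention step. Each vertex attends only to its
# neighbors (adjacency mask), and the N head outputs are concatenated as in eq. (5).
import torch
import torch.nn.functional as F

def graph_attention(v, adj, w_q, w_k, w_v):
    """v: (num_nodes, d); adj: (num_nodes, num_nodes) 0/1 mask; w_*: (N, d, d_head)."""
    num_heads, _, d_head = w_q.shape
    heads = []
    for n in range(num_heads):
        q, k, val = v @ w_q[n], v @ w_k[n], v @ w_v[n]        # per-head projections
        scores = (q @ k.T) / d_head ** 0.5                     # pairwise attention scores
        scores = scores.masked_fill(adj == 0, float("-inf"))   # block disconnected nodes
        alpha = F.softmax(scores, dim=-1)                      # attention over neighbors
        heads.append(alpha @ val)                              # weighted neighbor sum
    return torch.cat(heads, dim=-1)                            # concatenate the N heads

num_nodes, d, num_heads, d_head = 5, 8, 2, 4
v = torch.randn(num_nodes, d)
adj = torch.ones(num_nodes, num_nodes)   # toy graph: fully connected
w_q = torch.randn(num_heads, d, d_head)
w_k = torch.randn(num_heads, d, d_head)
w_v = torch.randn(num_heads, d, d_head)
print(graph_attention(v, adj, w_q, w_k, w_v).shape)   # torch.Size([5, 8])
```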
Summary Decoder: In this stage, the context vectors $c$ obtained from the graph and the introduction sequences are calculated with the decoder hidden state $h_t$ at each timestep $t$. Vertex embeddings are used for the graph sequence, while the SciBERT token embeddings are used for the introduction sequence; this is given in equation (10). In the last stage, the context vectors from the graph and the introduction are concatenated and given, together with $h_t$, as input to the RNN.
To calculate the probability of copying from the input, equation (11) is applied [36]. Taking this equality into account, the probability of the final next token is given by equation (12), where the two terms refer to the probability distribution over the entities and input tokens (the copy distribution) and the remaining probability over the vocabulary, respectively. In the copy distribution, a softmax over attention scores computed from $[h_t \Vert c_t]$ is performed for each element of $V \Vert T$, that is, over the graph vertices and the input tokens.
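As an illustration of the copy mechanism in equations (11) and (12), the sketch below mixes a copy distribution over entities and input tokens with a vocabulary distribution using a switch probability, in the spirit of [36]; the gating function and shapes are assumptions.

```python
# Sketch: pointer-style final token distribution. A switch p_copy decides how much
# probability mass goes to copying (entities + input tokens) vs. generating from
# the vocabulary (shapes and gating are illustrative).
import torch
import torch.nn.functional as F

def final_distribution(h_t, c_t, w_copy, copy_scores, vocab_logits):
    """h_t, c_t: (d,); w_copy: (2*d,); copy_scores: (num_copy,); vocab_logits: (vocab,)."""
    p_copy = torch.sigmoid(w_copy @ torch.cat([h_t, c_t]))   # eq. (11): copy switch
    copy_dist = F.softmax(copy_scores, dim=-1)                # attention over entities and tokens
    vocab_dist = F.softmax(vocab_logits, dim=-1)
    # eq. (12): mix the two distributions with the switch probability
    return torch.cat([p_copy * copy_dist, (1 - p_copy) * vocab_dist])

d, num_copy, vocab = 8, 6, 10
probs = final_distribution(torch.randn(d), torch.randn(d), torch.randn(2 * d),
                           torch.randn(num_copy), torch.randn(vocab))
print(probs.sum())   # tensor(1.) up to floating point error
```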
5. Discussion
The experiments were conducted with the above baseline methods on the SciTLDR, SciSummNet, and ArxivComp datasets. The results of the experiments are shown in Table 4, Table 5, and Table 6.
Graph-based methods are highly successful among both abstractive and extractive methods. Because Seq2Seq models are not very good at transferring information across long sequences, abstractive methods concentrate on attention-based models. Attention-based models fix this issue, but they may cause a loss of semantic integrity when the document's hierarchical structure is taken into account. Therefore, graph-based methods are superior to other methods because they preserve the integrity of the document.
According to the results shown in Table 4, the proposed model outperforms the baseline methods on the ArxivComp dataset. The GraphWriter model achieves the result closest to that of the proposed model among the abstractive methods. However, that model generates the summary from the title, whereas the proposed model generates the summary from the introduction section; this is how it differs from the proposed model. In addition, the proposed model uses SciBERT, which was trained on scientific articles, for token embedding, and its graph is created by combining the introduction and abstract sections. The main difference between these approaches and the others (BART, T5-based, and Billsum) is the restriction on token sequence lengths. The introduction section also has a variable length for each article; therefore, the summaries produced by those models do not correspond to the author's summary for articles of different lengths. Graph-based techniques are better than other techniques because they guarantee document integrity.
In comparing the results of these models, the auto-regressive decoder in the BART model allows it to achieve the best summarization performance. As can be seen in Table 6, graph-based methods are more successful than the baseline methods in summarizing long documents because they preserve the integrity of the document.
The performance of both the Billsum and T5-based approaches dramatically declines as the document length increases. However, better performance was obtained with both models on the SciSummNet dataset, whose average document length is shorter. Based on these results, as the average document length increases, T5-based techniques are not sufficient for summarizing long scientific documents.
When the results of the extractive methods (TextRank, LexRank, and LSA) are examined, it is seen that these methods are not as successful as the abstractive methods. The lowest results are obtained on the SciTLDR dataset. The main reason for this is that as the length of the document increases, the number of sentences containing more general information also increases; when sentence selection is performed with these algorithms, the overlap rate of the summary decreases.
As can be seen in Table 4 and Table 6, the proposed method outperforms the baseline methods on long documents. As can be seen in Table 5, it is comparable for documents with a shorter average word length. An advantage of the proposed method is that it handles long documents hierarchically.
5.1. Human Evaluation
The ROUGE metric evaluates the generated summary based on the overlap of n-grams with the ground-truth summary. However, it is not sufficient to prove the quality of the generated summaries. To overcome this problem, the generated summaries were also evaluated with human judgment. The evaluation criteria are as follows: 1) Conciseness (Con.) is whether redundant information is avoided; 2) Informativeness (I) is whether the summary contains salient information; 3) Coherence (Coh.) is whether the content of the generated summary is appropriate to the ground-truth summary; 4) Readability (R) means that the generated summary is easy to understand and fluent; and 5) Grammaticality (G) is whether the sentences conform to grammar rules.
Five expert volunteers rated the summaries from 1 (worst) to 5 (best) for each criterion. Fleiss’s Kappa analysis was performed to determine whether the evaluator scores were compatible with each other.
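For reference, agreement scores of this kind can be computed as in the sketch below with the statsmodels library; the rating matrix is toy data, not the study's actual scores.

```python
# Sketch: Fleiss's kappa for 5 raters scoring summaries on a 1-5 scale (toy data).
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = summaries (subjects), columns = the 5 volunteer raters, values = 1..5 scores
ratings = np.array([
    [4, 4, 5, 4, 4],
    [3, 3, 3, 4, 3],
    [5, 5, 4, 5, 5],
    [2, 3, 2, 2, 2],
])

# aggregate_raters turns raw ratings into per-subject counts of each score category
counts, _ = aggregate_raters(ratings)
print(fleiss_kappa(counts))
```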
From the results in Table 7, it is seen that the evaluations of the volunteers mostly agree with each other, in that the result for each criterion is greater than 0.5. According to the results, the generated summaries are generally informative, fluent, and overlap with the ground-truth summaries. However, it has been observed that grammatical problems remain. For instance, some words are repeated more than once in a generated summary. In addition, conciseness and readability are adversely affected because <unk> tokens and some entities extracted as "generic" by the SciIE system are not learned; removing these words from the generated summaries at the last stage caused inconsistencies in the meaning of some sentences. Because the proposed model is pre-trained with topics related to computer science, it will produce successful summaries on these topics. Table 8 shows sample abstracts generated for scientific articles.