<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">17441</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2021.017441</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>A Semantic Supervision Method for Abstractive Summarization</article-title>
<alt-title alt-title-type="left-running-head">A Semantic Supervision Method for Abstractive Summarization</alt-title>
<alt-title alt-title-type="right-running-head">A Semantic Supervision Method for Abstractive Summarization</alt-title>
</title-group>
<contrib-group content-type="authors">
<contrib id="author-1" contrib-type="author">
<name name-style="western">
<surname>Hu</surname>
<given-names>Sunqiang</given-names>
</name>
<xref ref-type="aff" rid="aff-1">1</xref>
</contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western">
<surname>Li</surname>
<given-names>Xiaoyu</given-names>
</name>
<xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-3" contrib-type="author" corresp="yes">
<name name-style="western">
<surname>Deng</surname>
<given-names>Yu</given-names>
</name>
<xref ref-type="aff" rid="aff-1">1</xref><email>411180435@qq.com</email></contrib>
<contrib id="author-4" contrib-type="author">
<name name-style="western">
<surname>Peng</surname>
<given-names>Yu</given-names>
</name>
<xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-5" contrib-type="author">
<name name-style="western">
<surname>Lin</surname>
<given-names>Bin</given-names>
</name>
<xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-6" contrib-type="author">
<name name-style="western">
<surname>Yang</surname>
<given-names>Shan</given-names>
</name>
<xref ref-type="aff" rid="aff-3">3</xref></contrib>
<aff id="aff-1"><label>1</label><institution>School of Information and Software Engineering, University of Electronic Science and Technology of China</institution>, <addr-line>Chengdu, 610054</addr-line>, <country>China</country></aff>
<aff id="aff-2"><label>2</label><institution>School of Engineering, Sichuan Normal University</institution>, <addr-line>Chengdu, 610066</addr-line>, <country>China</country></aff>
<aff id="aff-3"><label>3</label><institution>Department of Chemistry, Physics, and Atmospheric Sciences, Jackson State University</institution>, <addr-line>Jackson</addr-line>, <country>MS</country>, <addr-line>39217</addr-line>, <country>USA</country></aff>
</contrib-group>
<author-notes><corresp id="cor1">&#x002A;Corresponding Author: Yu Deng. Email: <email>411180435@qq.com</email></corresp></author-notes>
<pub-date pub-type="epub" date-type="pub" iso-8601-date="2021-05-31"><day>31</day><month>05</month><year>2021</year></pub-date>
<volume>69</volume>
<issue>1</issue>
<fpage>145</fpage>
<lpage>158</lpage>
<history>
<date date-type="received"><day>30</day><month>01</month><year>2021</year></date>
<date date-type="accepted"><day>17</day><month>03</month><year>2021</year></date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2021 Hu et al.</copyright-statement>
<copyright-year>2021</copyright-year>
<copyright-holder>Hu et al.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_17441.pdf"></self-uri>
<abstract>
<p>In recent years, many text summarization models based on pre-training methods have achieved very good results. However, in these models, semantic deviations easily occur between the original input representation and the representation produced by the multi-layer encoder, which may cause inconsistencies between the generated summary and the source text. The Bidirectional Encoder Representations from Transformers (BERT) model improves the performance of many tasks in Natural Language Processing (NLP). Although BERT has a strong capability to encode context, it lacks a fine-grained semantic representation. To solve these two problems, we propose a semantic supervision method based on Capsule Network. First, we extract fine-grained semantic representations of both the input and BERT&#x2019;s encoded result with a Capsule Network. Second, we use the fine-grained semantic representation of the input to supervise that of the encoded result. We then evaluated our model on a popular Chinese social media dataset (LCSTS); the results showed that our model achieved higher ROUGE scores (including R-1 and R-2) and outperformed the baseline systems. Finally, we conducted a comparative study on the stability of the model, and the experimental results showed that our model was more stable.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Text summarization</kwd>
<kwd>semantic supervision</kwd>
<kwd>capsule network</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>The goal of text summarization is to deliver the important information of a source text in a small number of words. In the current era of information explosion, text information floods the Internet, so text summarization is needed to help us obtain useful information from source texts. With the rapid development of artificial intelligence, automatic text summarization was proposed: computers can aid people in the complex task of summarizing text. By using machine learning, deep learning, and other methods, we can obtain a general model for automatic text summarization that can extract summaries from source texts in place of humans.</p>
<p>Automatic text summarization is usually divided into two categories according to the implementation method: extractive summarization and abstractive summarization. Extractive summarization extracts sentences containing key information from the source text and combines them into a summary, while abstractive summarization compresses and refines the information of the source text to generate a new summary. Compared with extractive summarization, abstractive summarization is more innovative, because machines can generate summary content that is more informative and attractive. Abstractive text summarization models are usually based on the sequence-to-sequence (seq2seq) model [<xref ref-type="bibr" rid="ref-1">1</xref>]. It contains two parts: an encoder and a decoder. The encoder encodes the input as a fixed-length context vector that contains the important information of the input text, and the decoder decodes the vector into the desired output. Early models chose an RNN or LSTM [<xref ref-type="bibr" rid="ref-2">2</xref>] as the encoder-decoder structure of seq2seq and used the last hidden unit of the RNN or LSTM as the context vector for the decoder.</p>
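<p>As a minimal illustration (not the paper&#x2019;s model), the following NumPy sketch shows how a vanilla RNN encoder can reduce a token sequence to the fixed-length context vector handed to the decoder; all weights and dimensions are toy values:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 8                              # embedding size, hidden size
W = rng.normal(scale=0.1, size=(d_h, d_in))   # input-to-hidden weights
U = rng.normal(scale=0.1, size=(d_h, d_h))    # hidden-to-hidden weights

def encode(inputs):
    """Run a vanilla RNN over the input embeddings and return the last
    hidden state, which serves as the fixed-length context vector C."""
    h = np.zeros(d_h)
    for x in inputs:                          # one step per token embedding
        h = np.tanh(W @ x + U @ h)
    return h

tokens = rng.normal(size=(5, d_in))           # 5 toy token embeddings
C = encode(tokens)
print(C.shape)                                # (8,)
```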
<p>BERT [<xref ref-type="bibr" rid="ref-3">3</xref>] is a pre-trained language model that is trained in advance on a large amount of unsupervised data. With its good capability for contextual semantic representation, BERT has achieved very good performance in many NLP tasks. However, it is not suitable for generative tasks because it lacks a decoder structure. Dong et al. [<xref ref-type="bibr" rid="ref-4">4</xref>] proposed the Unified Pre-trained Language Model (UNILM), whose submodule seq2seqLM can complete natural language generation tasks by modifying BERT&#x2019;s mask matrix. BERT can encode each word accurately according to the context, but it lacks a fine-grained semantic representation of the entire input text, which results in semantic deviations between the result encoded by BERT and the original input text. The traditional seq2seq model does not perform well in text summarization, so we consider using the pre-trained model BERT to improve the practical effect of text summarization. However, BERT has the flaws mentioned above. Therefore, we aim to overcome these defects and improve the effectiveness of a BERT-based text summarization model.</p>
<p>Nowadays, neural networks have been applied in many fields [<xref ref-type="bibr" rid="ref-5">5</xref>,<xref ref-type="bibr" rid="ref-6">6</xref>], and automatic text summarization is one of their hot research topics. In this paper, following the idea of seq2seqLM, we modified the mask matrix of BERT and used BERT-base to complete abstractive summarization. To reduce semantic deviations, we introduced a semantic supervision method based on the Capsule Network [<xref ref-type="bibr" rid="ref-7">7</xref>] into our model. Following previous work, we evaluated the proposed model on the LCSTS dataset [<xref ref-type="bibr" rid="ref-8">8</xref>]. The experimental results showed that our model is superior to the baseline systems and that the proposed semantic supervision method can indeed improve the effectiveness of BERT.</p>
<p>The remainder of this paper is organized as follows. Related work is discussed in Section 2. The proposed model is presented in Section 3. Details of the experiments are explained in Section 4. Comparison and discussion of the experimental results are given in Section 5. Conclusions and future work are drawn in Section 6.</p>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Work</title>
<sec id="s2_1">
<label>2.1</label>
<title>Seq2seq Model</title>
<p>Research on abstractive summarization mainly depends on the seq2seq model proposed by Cho et al. [<xref ref-type="bibr" rid="ref-1">1</xref>], which solves the problem of unequal input and output lengths in generative tasks. The seq2seq model contains two parts: an encoder and a decoder. The encoder encodes the input into a context vector <italic>C</italic>, and the decoder decodes the output from <italic>C</italic>. The seq2seq model was originally used for Neural Machine Translation (NMT); Rush et al. [<xref ref-type="bibr" rid="ref-9">9</xref>] first applied it, combined with the attention mechanism [<xref ref-type="bibr" rid="ref-10">10</xref>], to abstractive summarization, and it proved to perform well.</p>
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>Pre-Trained Model and BERT</title>
<p>Pre-trained language models have become an important technology in the NLP field in recent years. The main idea is that the model&#x2019;s parameters are no longer randomly initialized but trained in advance on some tasks (such as language modeling) over large-scale text corpora. The parameters are then fine-tuned on the small dataset of a specific task, which makes it easy to train a model. An early pre-trained language model is Embeddings from Language Models (ELMo) [<xref ref-type="bibr" rid="ref-11">11</xref>], which performs feature extraction with a bidirectional LSTM and can be fine-tuned for downstream tasks. The Generative Pre-Training language model (GPT) achieves very good performance in text generation tasks by replacing the LSTM with a Transformer [<xref ref-type="bibr" rid="ref-12">12</xref>]. Building on GPT, Devlin et al. [<xref ref-type="bibr" rid="ref-3">3</xref>] used a bidirectional Transformer and a higher-quality large-scale dataset for pre-training and obtained a better pre-trained language model, BERT.</p>
<p>Liu et al. [<xref ref-type="bibr" rid="ref-13">13</xref>] proposed BERTSum, a simple variant of BERT for extractive summarization, and the model outperformed the baselines on the CNN/DailyMail dataset. Later, Liu et al. [<xref ref-type="bibr" rid="ref-13">13</xref>] added a decoder structure to BERTSum to complete abstractive summarization and conducted experiments on the same dataset. The experimental results showed that their model was superior to previous models in both extractive and abstractive summarization. The goal of UNILM, proposed by Dong et al. [<xref ref-type="bibr" rid="ref-4">4</xref>], is to adapt BERT to generative tasks, which is also the goal of the Masked Sequence to Sequence Pre-training model (MASS) proposed by Song et al. [<xref ref-type="bibr" rid="ref-14">14</xref>]. But UNILM is more succinct, sticking to BERT&#x2019;s idea and using only encoders to complete various NLP tasks. UNILM is trained on three objectives: unidirectional LM (left-to-right and right-to-left), bidirectional LM, and seq2seqLM. Seq2seqLM can complete abstractive summarization: it defines the source text as the first sentence and the corresponding summary as the second sentence. The first sentence is encoded by the bidirectional LM, and the second sentence is encoded by the unidirectional LM (left-to-right).</p>
</sec>
<sec id="s2_3">
<label>2.3</label>
<title>Semantic Supervision and Capsule Network</title>
<p>Ma et al. [<xref ref-type="bibr" rid="ref-15">15</xref>] proposed a method to improve semantic relevance in the seq2seq model. By calculating the cosine similarity between the semantic vector of the source text and that of the summary, we obtain a measure of the semantic relevance between them: the larger the cosine value, the more relevant they are. The negative of the cosine similarity is added to the loss function, so that minimizing the loss maximizes the semantic relevance. Ma et al. [<xref ref-type="bibr" rid="ref-16">16</xref>] also proposed using an autoencoder as an assistant supervisor to improve the text representation: by minimizing the L2 distance between the summary encoding vector and the source-text encoding vector, the semantic representation of the source text is supervised and thereby improved.</p>
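<p>The relevance term described above can be sketched as follows; the vectors and helper names are illustrative, not the authors&#x2019; implementation:</p>

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-8):
    # Cosine of the angle between two semantic vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def relevance_loss(source_vec, summary_vec):
    # The negative cosine similarity is added to the loss, so minimizing
    # the loss maximizes the semantic relevance between the two vectors.
    return -cosine_similarity(source_vec, summary_vec)

src = np.array([1.0, 2.0, 3.0])
good = np.array([2.0, 4.0, 6.0])     # same direction as src
bad = np.array([3.0, -1.0, 0.5])
print(relevance_loss(src, good))     # close to -1.0
print(relevance_loss(src, bad))      # much larger (less relevant)
```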
<p>In 2017, Sabour et al. [<xref ref-type="bibr" rid="ref-7">7</xref>] proposed a new neural network structure called the Capsule Network. Both the input and output of a Capsule Network are vectors, and the results of image classification experiments showed that the Capsule Network has a strong capability for feature aggregation. Zhao et al. [<xref ref-type="bibr" rid="ref-17">17</xref>] proposed a model based on the Capsule Network for text classification, and it performed better than the baseline systems in their experiments.</p>
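<p>The characteristic nonlinearity of the Capsule Network, the &#x201C;squash&#x201D; function of Sabour et al., can be sketched in NumPy as follows; this is a minimal illustration, not the full dynamic-routing algorithm:</p>

```python
import numpy as np

def squash(v, eps=1e-8):
    # Keeps the capsule vector's orientation but scales its length into
    # [0, 1), so the length can represent the probability that the
    # feature the capsule encodes is present.
    norm_sq = np.sum(v ** 2)
    norm = np.sqrt(norm_sq) + eps
    return (norm_sq / (1.0 + norm_sq)) * (v / norm)

short = squash(np.array([0.1, 0.0]))    # short input: length near 0
long_ = squash(np.array([100.0, 0.0]))  # long input: length near 1
print(np.linalg.norm(short), np.linalg.norm(long_))
```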
<p>Based on the methods mentioned above, we completed abstractive summarization by adopting the idea of seq2seqLM and added the semantic supervision method into the model. We conducted experiments on the Chinese dataset LCSTS [<xref ref-type="bibr" rid="ref-8">8</xref>] and analyzed the experimental results.</p>
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>Proposed Model</title>
<sec id="s3_1">
<label>3.1</label>
<title>BERT for Abstractive Summarization</title>
<p>Our model structure is shown in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>, and it is composed of four parts. The Embedding Layer transforms the input tokens into vector representations. The Transformer Layer encodes the token vector representations according to the context information. The Output Layer parses the encoded result of the Transformer Layer. The last part is the Semantic Supervision module we propose, which supervises the semantic encoding of the Transformer Layer.</p>
<p><bold><italic>Embedding Layer</italic></bold></p>
<p>BERT&#x2019;s embedding layer contains Token Embedding, Segment Embedding, and Position Embedding. Token Embedding is the vector representation of a token, obtained by looking up the embedding matrix with the token ID. Segment Embedding expresses whether the current token comes from the first segment or the second segment. Position Embedding is the position vector of the current token. <xref ref-type="fig" rid="fig-1">Fig. 1</xref> shows the Embedding Layer of BERT. The input representation follows that of BERT: we added a special token ([CLS]) at the beginning of the input and a special token ([SEP]) at the end of every segment. <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:mi>T</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula> represents the token sequence of the source text, and <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:mi>S</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula> represents the token sequence of the summary.
We got the input <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:mi>X</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>C</mml:mi><mml:mi>L</mml:mi><mml:mi>S</mml:mi></mml:mstyle></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>T</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>T</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>T</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>S</mml:mi><mml:mi>E</mml:mi><mml:mi>P</mml:mi></mml:mstyle></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>S</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>S</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>S</mml:mi><mml:mi>E</mml:mi><mml:mi>P</mml:mi></mml:mstyle></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula> of the model by splicing <italic>T</italic>, <italic>S</italic> and special token. By summing corresponding Token Embedding, Position Embedding and Segment Embedding, we can get a vector representation of each input token.</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>An overview of our model. Its main body (on the left) is composed of the embedding layer, transformer layer, and output layer. Based on the main body, it contains a semantic supervision module (on the right)</title>
</caption><graphic mimetype="image" mime-subtype="png" xlink:href="fig-1.png"/>
</fig>
<p><bold><italic>Transformer Layer</italic></bold></p>
<p>The Transformer Layer consists of <italic>N</italic> Transformer Blocks that share the same structure but have separately trained parameters. The Transformer was originally proposed by Vaswani et al. [<xref ref-type="bibr" rid="ref-12">12</xref>], but only its encoder part is used in BERT. BERT performs well in many NLP tasks because it combines a large amount of unsupervised data with the excellent semantic encoding capability of the Transformer.</p>
<p>The input of seq2seqLM is the same as that of BERT; the main difference is that seq2seqLM changes the mask matrix of the multi-head attention in the Transformer. As shown on the left of <xref ref-type="fig" rid="fig-2">Fig. 2</xref>, the source text&#x2019;s tokens can attend to each other from both directions (left-to-right and right-to-left), while every token of the summary can only attend to its left context (including itself) and all tokens of the source text. The mask matrix is designed as follows [<xref ref-type="bibr" rid="ref-4">4</xref>]:</p>
<p><disp-formula id="eqn-1">
<label>(1)</label>

<mml:math id="mml-eqn-1" display="block"><mml:mrow><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mtable equalrows="false" columnlines="none" equalcolumns="false"><mml:mtr><mml:mtd columnalign="left"><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mspace width="1em"/></mml:mtd><mml:mtd columnalign="left"><mml:mstyle><mml:mtext>allow&#x00A0;to&#x00A0;attend</mml:mtext></mml:mstyle></mml:mtd></mml:mtr><mml:mtr><mml:mtd columnalign="left"><mml:mo>-</mml:mo><mml:mi>&#x221E;</mml:mi><mml:mo>,</mml:mo><mml:mspace width="1em"/></mml:mtd><mml:mtd columnalign="left"><mml:mstyle><mml:mtext>prevent&#x00A0;from&#x00A0;attending</mml:mtext></mml:mstyle></mml:mtd></mml:mtr> </mml:mtable></mml:mrow><mml:mo></mml:mo></mml:mrow></mml:mrow></mml:math></disp-formula></p>
<p>An element of the mask matrix equal to 0 means that the <italic>i</italic>th token can attend to the <italic>j</italic>th token; conversely, an element equal to <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:mo>-</mml:mo><mml:mi>&#x221E;</mml:mi></mml:math></inline-formula> means that the <italic>i</italic>th token cannot attend to the <italic>j</italic>th token. On the right of <xref ref-type="fig" rid="fig-2">Fig. 2</xref>, we show the self-attention mask matrix <italic>M</italic> of <xref ref-type="disp-formula" rid="eqn-1">Eq. (1)</xref>, which is designed for text summarization. The left part of <italic>M</italic> is set to 0 so that all tokens can attend to the source-text tokens. Since our goal is to predict the summary, attention from the source text to the summary is unnecessary, so we set the upper-right elements to <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:mo>-</mml:mo><mml:mi>&#x221E;</mml:mi></mml:math></inline-formula>. In the bottom-right block, we set the lower-triangular elements to 0 and the other elements to <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:mo>-</mml:mo><mml:mi>&#x221E;</mml:mi></mml:math></inline-formula>, which prevents each summary token from attending to the tokens after it.</p>
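<p>The mask of Eq. (1) can be constructed as in the following sketch; the sequence lengths are illustrative, special tokens are not treated separately, and a float negative infinity stands in for the &#x2212;&#x221E; entries:</p>

```python
import numpy as np

NEG_INF = float("-inf")

def seq2seq_mask(src_len, sum_len):
    # Every token may attend to all source tokens; summary tokens may
    # attend only to themselves and tokens to their left, per Eq. (1).
    n = src_len + sum_len
    M = np.full((n, n), NEG_INF)
    M[:, :src_len] = 0.0                 # all tokens see the source
    for i in range(src_len, n):          # summary rows
        M[i, src_len:i + 1] = 0.0        # left context, including itself
    return M

M = seq2seq_mask(src_len=3, sum_len=2)
# Source tokens cannot attend to the summary (upper-right block):
assert M[0, 3] == NEG_INF
# The first summary token sees itself but not the next summary token:
assert M[3, 3] == 0.0 and M[3, 4] == NEG_INF
```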
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>The overview of the self-attention mask matrix</title>
</caption><graphic mimetype="image" mime-subtype="png" xlink:href="fig-2.png"/>
</fig>
<p>The output of Embedding Layer is defined as <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:msup><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>, where <italic>X<sub>i</sub></italic> represents the vector representation of the <italic>i</italic>th token and <italic>n</italic> represents the length of the input sequence. We abbreviated the output of the <italic>l</italic>th Transformer block as: <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:msup><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mstyle mathvariant="italic"><mml:mi>T</mml:mi><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mi>s</mml:mi><mml:mi>f</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>m</mml:mi><mml:mi>e</mml:mi></mml:mstyle><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>. In each Transformer Block, by aggregating multiple self-attention heads, we can get the output of the current multi-head attention. For the <italic>l</italic>th Transformer block, the output <italic>A<sub>l</sub></italic> of the multi-head attention is computed as follows:</p>
<p><disp-formula id="eqn-2">
<label>(2)</label>

<mml:math id="mml-eqn-2" display="block"><mml:mrow><mml:msup><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mstyle mathvariant="italic"><mml:mi>C</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:mstyle><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mspace width="0.3em"/><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mspace width="0.3em"/><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mspace width="0.3em"/><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:msup><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>s</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mstyle mathvariant="italic"><mml:mo class="qopname">max</mml:mo></mml:mstyle><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mi>Q</mml:mi><mml:msup><mml:mrow><mml:mi>K</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:msqrt><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msqrt></mml:mrow></mml:mfrac><mml:mo>+</mml:mo><mml:mi>M</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mi>V</mml:mi></mml:math></inline-formula></p>
<p>and <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:mi>Q</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:msubsup><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mspace width="1em" class="quad"/><mml:mi>K</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:msubsup><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mspace width="1em" class="quad"/><mml:mi>V</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:msubsup><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi> </mml:mrow></mml:msubsup></mml:math></inline-formula></p>
<p><inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:msup><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula> is the output of the (<italic>l</italic> &#x2212;1)th Transformer Block, where <italic>n</italic> is the length of the input sequence and <italic>d<sub>h</sub></italic> is the embedding size. <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:msubsup><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi> </mml:mrow></mml:msubsup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-13"><mml:math 
id="mml-ieqn-13"><mml:msup><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula> are the linearly projected matrices, where <italic>d<sub>k</sub></italic> = <italic>d<sub>h</sub></italic>/<italic>h</italic> and <italic>h</italic> is the number of parallel attention heads. <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:mi>M</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is the mask matrix in <xref ref-type="disp-formula" rid="eqn-1">Eq. (1)</xref>.</p>
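<p>A single masked attention head following Eq. (2) can be sketched as below, with toy dimensions and random weights (an illustration, not BERT&#x2019;s implementation); the mask <italic>M</italic> is added to the scaled scores before the softmax, so masked positions receive zero attention weight:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_h, d_k = 4, 8, 8                         # toy sequence/embedding sizes
T_prev = rng.normal(size=(n, d_h))            # output of block l-1
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d_h, d_k)) for _ in range(3))

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def attention_head(T, M):
    # head_i = softmax(Q K^T / sqrt(d_k) + M) V, as in Eq. (2)
    Q, K, V = T @ Wq, T @ Wk, T @ Wv
    scores = Q @ K.T / np.sqrt(d_k) + M       # mask added before softmax
    return softmax(scores) @ V

# Mask where token 0 may not attend to token 3:
M = np.zeros((n, n))
M[0, 3] = float("-inf")
out = attention_head(T_prev, M)
print(out.shape)  # (4, 8)
```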
<p><bold><italic>Output Layer</italic></bold></p>
<p>We took the output of the last Transformer Block as the input of the Output Layer, which consists of three parts: two fully connected layers and one Layer Normalization.</p>
<p>The first fully connected layer adds a nonlinear transformation to BERT&#x2019;s output, using GELU, the activation function widely used in BERT. In <xref ref-type="disp-formula" rid="eqn-3">Eq. (3)</xref>, <italic>T</italic><sup><italic>N</italic></sup> is the output of the last Transformer Block, <italic>W</italic><sub>1</sub> is a trainable matrix, <italic>b</italic><sub>1</sub> is the bias, and <italic>O</italic><sub>1</sub> is the output of the first fully connected layer.</p>
<p><disp-formula id="eqn-3">
<label>(3)</label>

<mml:math id="mml-eqn-3" display="block"><mml:mrow><mml:msub><mml:mrow><mml:mi>O</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>G</mml:mi><mml:mi>E</mml:mi><mml:mi>L</mml:mi><mml:mi>U</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:msup><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:math></disp-formula></p>
<p>Different from Batch Normalization [<xref ref-type="bibr" rid="ref-18">18</xref>], Layer Normalization [<xref ref-type="bibr" rid="ref-19">19</xref>] depends on neither the batch size nor the length of the input sequence. Adding Layer Normalization helps prevent vanishing gradients. In <xref ref-type="disp-formula" rid="eqn-4">Eq. (4)</xref>, <italic>LN</italic>(*) is Layer Normalization and <italic>O</italic><sub>2</sub> is the output of <italic>LN</italic>(*).</p>
<p><disp-formula id="eqn-4">
<label>(4)</label>

<mml:math id="mml-eqn-4" display="block"><mml:mrow><mml:msub><mml:mrow><mml:mi>O</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>L</mml:mi><mml:mi>N</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>O</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:math></disp-formula></p>
<p>The second fully connected layer parses the output; it contains <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>I</mml:mi></mml:math></inline-formula> units (<italic>n</italic> is the length of the output and <italic>I</italic> is the vocabulary size), and we use softmax as the activation function. The softmax function is commonly used for multi-class classification, and it maps the outputs of multiple neurons into the interval (0, 1); predicting a word is equivalent to a multi-class classification task. In <xref ref-type="disp-formula" rid="eqn-5">Eq. (5)</xref>, <italic>W</italic><sub>3</sub> is a trainable matrix, <italic>b</italic><sub>3</sub> is the bias, and <italic>O</italic><sub>3</sub> is the final output of our model.</p>
<p><disp-formula id="eqn-5">
<label>(5)</label>

<mml:math id="mml-eqn-5" display="block"><mml:msub><mml:mrow><mml:mi>O</mml:mi></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mstyle><mml:mtext>s</mml:mtext></mml:mstyle><mml:mstyle><mml:mtext>o</mml:mtext></mml:mstyle><mml:mstyle><mml:mtext>f</mml:mtext></mml:mstyle><mml:mstyle><mml:mtext>t</mml:mtext></mml:mstyle><mml:mstyle><mml:mtext>m</mml:mtext></mml:mstyle><mml:mstyle><mml:mtext>ax</mml:mtext></mml:mstyle><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>O</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Semantic Supervision Based on Capsule Network</title>
<p>Because BERT lacks fine-grained semantic representation, it cannot produce high-quality summaries when applied directly to text summarization. Moreover, semantic deviations arise between the original input and the result encoded by the multi-layer encoder. We aim to alleviate these problems by adding semantic supervision based on the Capsule Network. The implementation of semantic supervision is shown on the right side of <xref ref-type="fig" rid="fig-1">Fig. 1</xref>. At the training stage, we took the result of Token Embedding as the input of the Capsule Network and obtained the semantic representation <italic>V<sub>i</sub></italic> of the input. At the same time, we performed the same operation on the output of the last Transformer Block to obtain the semantic representation <italic>V<sub>o</sub></italic> of the output. We implemented semantic supervision by minimizing the distance <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:mi>d</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> between the semantic representations <italic>V<sub>i</sub></italic> and <italic>V<sub>o</sub></italic>. <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:mi>d</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> is calculated as in <xref ref-type="disp-formula" rid="eqn-6">Eq. (6)</xref>.</p>
<p><disp-formula id="eqn-6">
<label>(6)</label>

<mml:math id="mml-eqn-6" display="block"><mml:mi>d</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mspace width="0.3em"/><mml:msub><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mo lspace='0pt' rspace='0pt'>&#x2225;</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x2225;</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></disp-formula></p>
<p>Ma et al. [<xref ref-type="bibr" rid="ref-15">15</xref>] directly took the input and output of the model as semantic representations, which has low generalization capability. We therefore added a Capsule Network [<xref ref-type="bibr" rid="ref-7">7</xref>], which is capable of high-level feature clustering, to extract semantic features. The Capsule Network takes vectors as input and output, and vectors have good representational capability, as in word2vec, where words are represented as vectors. Our experiments also showed that the Capsule Network performed better than LSTM [<xref ref-type="bibr" rid="ref-2">2</xref>] and GRU [<xref ref-type="bibr" rid="ref-20">20</xref>]. We define a set of input vectors <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:mi>u</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>, and the output of the Capsule Network is <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:mi>v</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>. The output of the Capsule Network is calculated as follows:</p>
<p><disp-formula id="eqn-7">
<label>(7)</label>

<mml:math id="mml-eqn-7" display="block"><mml:mrow></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>|</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow></mml:mrow></mml:math>
</disp-formula></p>
<p><disp-formula id="eqn-8">
<label>(8)</label>

<mml:math id="mml-eqn-8" display="block"><mml:mrow></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>|</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow></mml:mrow></mml:math>
</disp-formula></p>
<disp-formula id="eqn-9">
<label>(9)</label>

<mml:math id="mml-eqn-9" display="block"><mml:mrow></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mstyle mathvariant="italic"><mml:mi>s</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mstyle><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:math>
</disp-formula>
<disp-formula id="eqn-10">
<label>(10)</label>

<mml:math id="mml-eqn-10" display="block"><mml:mrow></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:mstyle displaystyle='true'><mml:munderover><mml:mrow><mml:mo>&#x2211;</mml:mo> </mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo lspace='0pt' rspace='0pt'>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover></mml:mstyle></mml:mstyle><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>|</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow></mml:mrow></mml:math>
</disp-formula>
<disp-formula id="eqn-11">
<label>(11)</label>

<mml:math id="mml-eqn-11" display="block"><mml:mrow></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mstyle mathvariant="italic"><mml:mi>s</mml:mi><mml:mi>q</mml:mi><mml:mi>u</mml:mi><mml:mi>a</mml:mi><mml:mi>s</mml:mi><mml:mi>h</mml:mi></mml:mstyle><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mo>&#x2225;</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x2225;</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mo>&#x2225;</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x2225;</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:mfrac><mml:mo>&#x22C5;</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mrow><mml:mo>&#x2225;</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x2225;</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:mrow><mml:mrow></mml:mrow></mml:math>
</disp-formula>
<p>It can be seen from <xref ref-type="disp-formula" rid="eqn-8">Eq. (8)</xref> that calculating <italic>b<sub>ij</sub></italic> requires <italic>v<sub>j</sub></italic>, but <italic>v<sub>j</sub></italic> is the final output, so <italic>b<sub>ij</sub></italic> cannot be calculated directly. Instead, <italic>b<sub>ij</sub></italic> is given an initial value and computed iteratively. Based on this idea, Sabour et al. [<xref ref-type="bibr" rid="ref-7">7</xref>] proposed the Dynamic Routing algorithm.</p>
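<p>The iterative scheme of Eqs. (7)&#x2013;(11) can be sketched in NumPy as below. This is a minimal sketch, assuming <italic>b<sub>ij</sub></italic> is initialized to zero and the softmax in Eq. (9) is taken over the output capsules, as in Sabour et al. [7]; variable names are ours.</p>

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    # Eq. (11): shrink the vector length into (0, 1) while keeping its direction
    sq = (s ** 2).sum(axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_routing(u_hat, n_iters=3):
    """u_hat: (n_in, n_out, d_out) prediction vectors u_{j|i} of Eq. (7),
    already multiplied by W_ij. Returns v: (n_out, d_out) output capsules."""
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))                       # routing logits, init to zero
    for _ in range(n_iters):
        c = softmax(b, axis=1)                        # Eq. (9): coupling coefficients
        s = (c[..., None] * u_hat).sum(axis=0)        # Eq. (10): weighted sum
        v = squash(s)                                 # Eq. (11)
        b = b + (u_hat * v[None]).sum(axis=-1)        # Eq. (8): agreement update
    return v
```

Because of the squash function, the length of every output capsule lies in (0, 1), which is what lets it be read as the probability that a property exists.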
<p>We took the output of Embedding layer <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:mi>X</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula> as the input <inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:mi>u</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula> of Capsule Network and got the output <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:mi>v</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula> where <inline-formula id="ieqn-23"><mml:math 
id="mml-ieqn-23"><mml:mi>X</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula> (<italic>n</italic> is the length of the input sequence and <italic>d<sub>h</sub></italic> is the embedding size). Each vector <italic>v<sub>i</sub></italic> in <italic>v</italic> represents a property, and the length of the vector represents the probability that the property exists. We calculated the norm of each vector in <italic>v</italic> to form a new vector as shown in <xref ref-type="disp-formula" rid="eqn-12">Eq. (12)</xref>, and <italic>V<sub>i</sub></italic> is the fine-grained semantic representation of the input <italic>X</italic>. Similarly, we regarded the output <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:msup><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula> of BERT as the input <inline-formula id="ieqn-25"><mml:math 
id="mml-ieqn-25"><mml:msup><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mi>&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x2032;</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>, and got the output <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:msup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mi>&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x2032;</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula> by Capsule Network. By calculating the norm of each vector in <italic>v</italic>&#x2032;, we got a new vector as shown in <xref ref-type="disp-formula" rid="eqn-13">Eq. (13)</xref>, and <italic>V<sub>o</sub></italic> is the fine-grained semantic representation of the BERT&#x2019;s output.</p>
<p><disp-formula id="eqn-12">
<label>(12)</label>

<mml:math id="mml-eqn-12" display="block"><mml:mrow></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:mo>,</mml:mo><mml:mspace width="0.3em"/><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:mo>,</mml:mo><mml:mspace width="0.3em"/><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mspace width="0.3em"/><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:math>
</disp-formula></p>
<p><disp-formula id="eqn-13">
<label>(13)</label>

<mml:math id="mml-eqn-13" display="block"><mml:mrow></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mo>|</mml:mo><mml:msubsup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>|</mml:mo><mml:mo>,</mml:mo><mml:mspace width="0.3em"/><mml:mo>|</mml:mo><mml:msubsup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mi>&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>|</mml:mo><mml:mo>,</mml:mo><mml:mspace width="0.3em"/><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mspace width="0.3em"/><mml:mo>|</mml:mo><mml:msubsup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>|</mml:mo></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:math>
</disp-formula></p>
<p>We found that the longer the input sequence is, the larger the semantic deviations are. We therefore applied semantic supervision of different intensities to inputs of different lengths, controlling the intensity through the parameter <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:mi>&#x03BB;</mml:mi></mml:math></inline-formula> in <xref ref-type="disp-formula" rid="eqn-14">Eq. (14)</xref>, where <italic>l<sub>s</sub></italic> is the length of the input sequence: the longer the input sequence, the stronger the supervision.</p>
<p><disp-formula id="eqn-14">
<label>(14)</label>

<mml:math id="mml-eqn-14" display="block"><mml:mrow><mml:mi>&#x03BB;</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:mrow></mml:math></disp-formula></p>
<p>The loss function of Semantic Supervision can be written as follows:</p>
<p><disp-formula id="eqn-15">
<label>(15)</label>

<mml:math id="mml-eqn-15" display="block"><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03BB;</mml:mi><mml:mi>d</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>Loss Function and Training</title>
<p>There are two loss functions in our model that need to be optimized. The first one is the categorical cross-entropy loss in <xref ref-type="disp-formula" rid="eqn-16">Eq. (16)</xref>, where <italic>N</italic> is the total number of samples, <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:mi>y</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is the true label of the input sample, <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:mi>&#x0177;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is the corresponding predicted label, <italic>D</italic> is the sample set, <italic>n</italic> is the length of the summary, and <italic>m</italic> is the vocabulary size. The other is the semantic supervision loss defined in <xref ref-type="disp-formula" rid="eqn-15">Eq. (15)</xref>. Our objective is to minimize the loss function in <xref ref-type="disp-formula" rid="eqn-17">Eq. (17)</xref>.</p>
<p><disp-formula id="eqn-16">
<label>(16)</label>

<mml:math id="mml-eqn-16" display="block"><mml:mrow></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>-</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:mfrac><mml:munder><mml:mrow><mml:mo>&#x2211;</mml:mo> </mml:mrow><mml:mrow><mml:mi>y</mml:mi><mml:mo lspace='0pt' rspace='0pt'>&#x2208;</mml:mo><mml:mi>D</mml:mi></mml:mrow></mml:munder><mml:mstyle displaystyle='true'><mml:mstyle displaystyle='true'><mml:munderover><mml:mrow><mml:mo>&#x2211;</mml:mo> </mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo lspace='0pt' rspace='0pt'>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover></mml:mstyle></mml:mstyle><mml:mstyle displaystyle='true'><mml:mstyle displaystyle='true'><mml:munderover><mml:mrow><mml:mo>&#x2211;</mml:mo> </mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo lspace='0pt' rspace='0pt'>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:munderover></mml:mstyle></mml:mstyle><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>g</mml:mi><mml:msub><mml:mrow><mml:mi>&#x0177;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow></mml:mrow></mml:math>
</disp-formula></p>
<p><disp-formula id="eqn-17">
<label>(17)</label>

<mml:math id="mml-eqn-17" display="block"><mml:mrow></mml:mrow><mml:mrow><mml:mi>L</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow></mml:mrow></mml:math>
</disp-formula></p>
<p>During training, we used Adam optimizer [<xref ref-type="bibr" rid="ref-21">21</xref>] with the setting: learning rate <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn><mml:msup><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mo>-</mml:mo><mml:mn>5</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>, two momentum parameters <inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:msub><mml:mrow><mml:mi>&#x03B2;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>.</mml:mo><mml:mn>9</mml:mn></mml:math></inline-formula>, <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:msub><mml:mrow><mml:mi>&#x03B2;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>.</mml:mo><mml:mn>999</mml:mn></mml:math></inline-formula> and <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:mi>&#x03B5;</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn><mml:msup><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mo>-</mml:mo><mml:mn>8</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>.</p>
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Experiments</title>
<p>In this section, we introduce our experiments in detail, including the dataset, evaluation metrics, experimental settings, and baseline systems.</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Statistics of different datasets of LCSTS</title>
</caption>

<table>
<colgroup>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Dataset</th>
<th>Pairs</th>
<th><inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:mstyle mathvariant="normal"><mml:mi>S</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>s</mml:mi></mml:mstyle><mml:mo>&#x003E;</mml:mo><mml:mo>=</mml:mo><mml:mn>3</mml:mn></mml:math></inline-formula></th>
</tr>
</thead>
<tbody>
<tr>
<td>PART I</td>
<td>2400591</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>PART II</td>
<td>10666</td>
<td>8685</td>
</tr>
<tr>
<td>PART III</td>
<td>1106</td>
<td>725</td>
</tr>
</tbody>
</table>
</table-wrap>
<sec id="s4_1">
<label>4.1</label>
<title>Dataset</title>
<p>We conducted experiments on LCSTS dataset [<xref ref-type="bibr" rid="ref-8">8</xref>] to evaluate the proposed method. LCSTS is a large-scale Chinese short text summarization dataset collected from Sina Weibo, which is a famous social media website in China. As shown in <xref ref-type="table" rid="table-1">Tab. 1</xref>, it consists of more than 2.4 million pairs (source text and summary) and is split into three parts. PART I includes 2,400,591 pairs, PART II includes 10,666 pairs, and PART III includes 1,106 pairs. Besides, the pairs of PART II and PART III also have manual scores (according to the relevance between the source text and summary) ranging from 1 to 5. Following the previous work [<xref ref-type="bibr" rid="ref-8">8</xref>], we only chose pairs with scores no less than 3 and used PART I as the training set, PART II as the validation set, and PART III as the test set.</p>
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Evaluation Metric and Experiment Setting</title>
<p>We used ROUGE scores [<xref ref-type="bibr" rid="ref-22">22</xref>], which have been widely used for text summarization, to evaluate our summarization model. They measure the quality of a summary by computing the overlap between the generated summary and the reference summary. Following the previous work [<xref ref-type="bibr" rid="ref-8">8</xref>], we used ROUGE-1 (unigrams), ROUGE-2 (bigrams), and ROUGE-L (longest common subsequence) scores as the evaluation metrics for the experimental results.</p>
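<p>For intuition, ROUGE-1 reduces to unigram-overlap precision and recall; the simplified sketch below ignores the stemming and other preprocessing of the official ROUGE toolkit [22], and takes already-tokenized sequences.</p>

```python
from collections import Counter

def rouge_1_f1(candidate, reference):
    """Unigram-overlap F1 between a generated and a reference summary,
    both given as lists of tokens (a simplified sketch of ROUGE-1)."""
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum((cand & ref).values())   # clipped unigram matches
    if overlap == 0:
        return 0.0
    p = overlap / sum(cand.values())       # precision over candidate tokens
    r = overlap / sum(ref.values())        # recall over reference tokens
    return 2 * p * r / (p + r)
```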
<p>We used the Chinese vocabulary of BERT-base, which contains 21,128 characters, whereas we counted only 10,728 distinct characters in PART I of LCSTS. To reduce computation, we used only the 7,655 characters in the intersection of the two sets. In our model, we used the default embedding size of 768 from BERT-base, the number of heads <italic>h</italic> = 12, and the number of Transformer Blocks <italic>N</italic> = 12. For the Capsule Network, we set the number of output capsules to 50, the output dimension to 16, and the number of routing iterations to 3. We set the batch size to 16 and used Dropout [<xref ref-type="bibr" rid="ref-23">23</xref>] in our model. Our model was trained on a single NVIDIA 2080Ti GPU. Following the previous work [<xref ref-type="bibr" rid="ref-24">24</xref>], we implemented Beam Search and set the beam size to 3.</p>
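<p>Beam Search with beam size 3 keeps the three highest-scoring partial summaries at each decoding step. The sketch below is a generic illustration, not the authors' implementation; <code>step_fn</code>, <code>bos_id</code>, and <code>eos_id</code> are hypothetical placeholders for the trained model's next-token log-probabilities and its special start/end tokens.</p>

```python
import numpy as np

def beam_search(step_fn, bos_id, eos_id, beam_size=3, max_len=30):
    """step_fn(prefix) -> vector of log-probabilities over the vocabulary.
    Returns the highest-scoring token sequence found."""
    beams = [([bos_id], 0.0)]                           # (sequence, total log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_id:
                candidates.append((seq, score))         # finished hypothesis
                continue
            log_probs = step_fn(seq)
            top = np.argsort(log_probs)[-beam_size:]    # best next tokens
            for tok in top:
                candidates.append((seq + [int(tok)], score + float(log_probs[tok])))
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_size]
        if all(seq[-1] == eos_id for seq, _ in beams):  # all beams finished
            break
    return beams[0][0]
```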
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Baseline Systems</title>
<p>We compared the proposed model&#x2019;s ROUGE scores with those of the following models, which we briefly introduce below.</p>
<p><bold>RNN and RNN-context</bold> [<xref ref-type="bibr" rid="ref-8">8</xref>] are two seq2seq baseline models. The former uses GRU as the encoder and decoder; the latter adds an attention mechanism on top of it.</p>
<p><bold>CopyNet</bold> [<xref ref-type="bibr" rid="ref-25">25</xref>] is an attention-based seq2seq model with the copy mechanism, which allows some tokens of the generated summary to be copied from the source content; it effectively alleviates the problem of repeated words in abstractive summarization.</p>
<p><bold>DRGD</bold> [<xref ref-type="bibr" rid="ref-26">26</xref>] is a seq2seq-based model with a deep recurrent generative decoder. The model combines the decoder with a variational autoencoder and uses a recurrent latent random model to learn latent structure information implied in the target summaries.</p>
<p><bold>WEAN</bold> [<xref ref-type="bibr" rid="ref-27">27</xref>] is a novel model based on the encoder-decoder framework; its full name is Word Embedding Attention Network. The model generates words by querying distributed word representations, aiming to capture the meanings of the corresponding words.</p>
<p><inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:mstyle class="text"><mml:mtext class="textbf" mathvariant="bold">Seq2Seq</mml:mtext></mml:mstyle><mml:mo>+</mml:mo><mml:mstyle class="text"><mml:mtext class="textbf" mathvariant="bold">superAE</mml:mtext></mml:mstyle></mml:math></inline-formula> [<xref ref-type="bibr" rid="ref-16">16</xref>] is a seq2seq-based model with an assistant supervisor: an autoencoder whose representation of the summary supervises that of the source content. In addition, adversarial learning is introduced to determine the strength of supervision more dynamically.</p>
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>ROUGE scores of our model and baseline systems on LCSTS (W: word level; C: character level)</title>
</caption>

<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Models</th>
<th>ROUGE-1</th>
<th>ROUGE-2</th>
<th>ROUGE-L</th>
</tr>
</thead>
<tbody>
<tr>
<td>RNN(W) [<xref ref-type="bibr" rid="ref-8">8</xref>]</td>
<td>17.7</td>
<td>8.5</td>
<td>15.8</td>
</tr>
<tr>
<td>RNN(C) [<xref ref-type="bibr" rid="ref-8">8</xref>]</td>
<td>21.5</td>
<td>8.9</td>
<td>18.6</td>
</tr>
<tr>
<td>RNN-context(W) [<xref ref-type="bibr" rid="ref-8">8</xref>]</td>
<td>26.8</td>
<td>16.1</td>
<td>24.1</td>
</tr>
<tr>
<td>RNN-context(C) [<xref ref-type="bibr" rid="ref-8">8</xref>]</td>
<td>29.9</td>
<td>17.4</td>
<td>27.2</td>
</tr>
<tr>
<td>CopyNet(W) [<xref ref-type="bibr" rid="ref-25">25</xref>]</td>
<td>35.0</td>
<td>22.3</td>
<td>32.0</td>
</tr>
<tr>
<td>CopyNet(C) [<xref ref-type="bibr" rid="ref-25">25</xref>]</td>
<td>34.4</td>
<td>21.6</td>
<td>31.3</td>
</tr>
<tr>
<td>DRGD(C) [<xref ref-type="bibr" rid="ref-26">26</xref>]</td>
<td>37.0</td>
<td>24.2</td>
<td>34.2</td>
</tr>
<tr>
<td>WEAN(C) [<xref ref-type="bibr" rid="ref-27">27</xref>]</td>
<td>37.8</td>
<td>25.0</td>
<td>35.2</td>
</tr>
<tr>
<td><inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:mstyle mathvariant="normal"><mml:mi>S</mml:mi><mml:mi>e</mml:mi><mml:mi>q</mml:mi><mml:mn>2</mml:mn><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mi>q</mml:mi></mml:mstyle><mml:mo>+</mml:mo><mml:mstyle mathvariant="normal"><mml:mi>s</mml:mi><mml:mi>u</mml:mi><mml:mi>p</mml:mi><mml:mi>e</mml:mi><mml:mi>r</mml:mi><mml:mi>A</mml:mi><mml:mi>E</mml:mi></mml:mstyle></mml:math></inline-formula>(C) [<xref ref-type="bibr" rid="ref-16">16</xref>]</td>
<td>39.2</td>
<td>26.0</td>
<td>36.2</td>
</tr>
<tr>
<td>BERT-seq2seqLM(C) (our impl.)</td>
<td>39.84</td>
<td>25.47</td>
<td>34.62</td>
</tr>
<tr>
<td>+<bold>SSC</bold>(C) (this paper)</td>
<td><bold>40.63</bold></td>
<td><bold>26.4</bold></td>
<td>35.75</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Results and Discussion</title>
<p>For clarity, we name the BERT with the modified mask matrix BERT-seq2seqLM, and denote our model with semantic supervision based on the Capsule Network as <bold>SSC</bold>.</p>
<p>The experimental results of our model and the baseline systems on the LCSTS dataset are shown in <xref ref-type="table" rid="table-2">Tab. 2</xref>. First, we compared our model with BERT-seq2seqLM: SSC outperformed BERT-seq2seqLM on ROUGE-1, ROUGE-2, and ROUGE-L, indicating that the semantic supervision method can improve the generation quality of BERT-seq2seqLM. Moreover, compared with recent summarization systems, our model achieved higher ROUGE-1 and ROUGE-2 scores than the baselines, while its ROUGE-L score was slightly lower than the best baseline.</p>
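<p>The metrics in <xref ref-type="table" rid="table-2">Tab. 2</xref> can be illustrated with a minimal sketch (our simplification for exposition, not the official ROUGE toolkit [<xref ref-type="bibr" rid="ref-22">22</xref>]): ROUGE-N measures n-gram overlap between the candidate and reference summaries, and ROUGE-L measures their longest common subsequence; both are shown here as F1 variants over character sequences.</p>

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N F1 over n-gram overlap (illustrative only)."""
    def ngrams(seq):
        return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def rouge_l(candidate, reference):
    """Simplified ROUGE-L F1 via longest common subsequence (illustrative only)."""
    m, k = len(candidate), len(reference)
    dp = [[0] * (k + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(k):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if candidate[i] == reference[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[m][k]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / m, lcs / k
    return 2 * precision * recall / (precision + recall)
```

<p>Because LCSTS summaries are evaluated at the character level, each Chinese character is treated as one token in this sketch; the reported scores in the tables were computed with the standard ROUGE evaluation.</p>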
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>ROUGE scores curve of BERT-seq2seqLM and our model under different epoch training (including ROUGE-1, ROUGE-2, ROUGE-L scores curve)</title>
</caption><graphic mimetype="image" mime-subtype="png" xlink:href="fig-3.png"/>
</fig>
<p>In addition, we compared the ROUGE scores of the models at different training epochs, as shown in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>, which contains the ROUGE-1, ROUGE-2, and ROUGE-L curves. From the three subgraphs, we can see that after adding semantic supervision, the training of BERT-seq2seqLM is more stable and the overall evaluation scores are higher.</p>
<p>As for the semantic supervision network, in addition to the Capsule Network, we also tried LSTM and GRU. Comparative experiments showed that the Capsule Network was the most suitable. As shown in <xref ref-type="table" rid="table-3">Tab. 3</xref>, the ROUGE-1, ROUGE-2, and ROUGE-L scores of the LSTM-based semantic supervision were higher than those of BERT-seq2seqLM without semantic supervision, and the GRU- and Capsule-Network-based variants likewise outperformed BERT-seq2seqLM. These comparisons confirm that introducing semantic supervision into BERT-seq2seqLM helps address the problem of fine-grained semantic representation, and that the Capsule Network yields the largest improvement.</p>
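<p>The three supervision variants compared in <xref ref-type="table" rid="table-3">Tab. 3</xref> share one idea: pool the token representations of the source and of the reference summary into fixed-size vectors, then add an auxiliary loss that pulls the two vectors together. The following NumPy sketch is our illustration of such a supervision term, not the exact architecture of this paper; mean-pooling stands in for the LSTM/GRU/Capsule pooler, and the weighting scalar is a hypothetical hyperparameter.</p>

```python
import numpy as np

def supervision_loss(source_states, summary_states):
    """Auxiliary semantic-supervision term (illustrative sketch):
    pool the token representations of source and summary into single
    vectors, then penalize their cosine distance.

    source_states, summary_states: arrays of shape (num_tokens, hidden_dim).
    Mean-pooling here is a stand-in for the paper's supervision network."""
    src = source_states.mean(axis=0)
    summ = summary_states.mean(axis=0)
    cos = src @ summ / (np.linalg.norm(src) * np.linalg.norm(summ) + 1e-8)
    return 1.0 - cos  # 0 when the two representations align

# total loss = generation cross-entropy + lambda * supervision_loss(...),
# where lambda balances generation against semantic supervision.
```

<p>Under this view, swapping LSTM, GRU, or Capsule Network only changes the pooler; the Capsule Network's routing lets it aggregate fine-grained features, which is consistent with its best scores in <xref ref-type="table" rid="table-3">Tab. 3</xref>.</p>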
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>ROUGE scores of the semantic supervision network with different structures on the LCSTS</title>
</caption>

<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Models</th>
<th>ROUGE-1</th>
<th>ROUGE-2</th>
<th>ROUGE-L</th>
</tr>
</thead>
<tbody>
<tr>
<td>BERT-seq2seqLM</td>
<td>39.84</td>
<td>25.47</td>
<td>34.62</td>
</tr>
<tr>
<td>+LSTM</td>
<td>40.19</td>
<td>25.79</td>
<td>35.14</td>
</tr>
<tr>
<td>+GRU</td>
<td>40.34</td>
<td>26.0</td>
<td>35.22</td>
</tr>
<tr>
<td>+Capsule</td>
<td>40.63</td>
<td>26.4</td>
<td>35.75</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="table-4">
<label>Table 4</label>
<caption>
<title>Some generated summary examples on the LCSTS test dataset</title>
</caption>
<table>
<colgroup>
<col/>
</colgroup>
<tbody>
<tr>
<td><graphic mimetype="image" mime-subtype="png" xlink:href="T04.png"/></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>As shown in <xref ref-type="table" rid="table-4">Tab. 4</xref>, we list two examples from the test dataset. Each example includes the source text, the reference summary, the summary generated by BERT-seq2seqLM, and the summary generated by our model. The first example is about smartphones and personal computers: BERT-seq2seqLM takes the frequently appearing word &#x201C;iPhone&#x201D; as the main subject of the summary, which leads to a semantic deviation. The second example is a summary of Mark Cuban&#x2019;s life: the last sentence of the source text summarizes the whole article, but BERT-seq2seqLM chose the wrong content as the summary, whereas BERT-seq2seqLM with semantic supervision generates content close to the reference summary. Comparing the generated results, we can see that the semantic supervision method based on the Capsule Network can reduce the semantic deviations of BERT encoding to some extent.</p>
</sec>
<sec id="s6">
<label>6</label>
<title>Conclusion</title>
<p>Following the idea of UNILM, we modified the mask matrix of BERT-base to perform abstractive summarization. We also introduced a semantic supervision method based on the Capsule Network into our model and improved the performance of the text summarization model on the LCSTS dataset. Experimental results showed that our model outperformed the baseline systems. In this paper, the semantic supervision method was applied only to a pre-trained language model; we have not yet conducted experiments to verify it on other neural network models. Moreover, we used only a Chinese dataset and did not verify our method on other datasets. In future work, we will improve the semantic supervision method and conduct further experiments to address these limitations.</p>
</sec>
</body>
<back>
<ack><p>We would like to thank all the researchers of this project for their effort.</p></ack>
<fn-group><fn fn-type="other"><p><bold>Funding Statement:</bold> This work was partially supported by the National Natural Science Foundation of China (Grant No. 61502082) and the National Key R&#x0026;D Program of China (Grant No. 2018YFA0306703).</p></fn>
<fn fn-type="conflict"><p><bold>Conflicts of Interest:</bold> The authors declare that they have no conflicts of interest to report regarding the present study.</p></fn></fn-group>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>K.</given-names> <surname>Cho</surname></string-name>, <string-name><given-names>B. V.</given-names> <surname>Merrienboer</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Gulcehr</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Bahdanau</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Bengio</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Learning phrase representations using RNN encoder-decoder for statistical machine translation</article-title>,&#x201D; in <conf-name>Conf. on Empirical Methods in Natural Language Processing</conf-name>, <publisher-loc>Doha, Qatar</publisher-loc>, pp. <fpage>1724</fpage>&#x2013;<lpage>1734</lpage>, <year>2014</year>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Hochreiter</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Schmidhuber</surname></string-name></person-group>, &#x201C;<article-title>Long short-term memory</article-title>,&#x201D; <source>Neural Computation</source>, vol. <volume>9</volume>, no. <issue>8</issue>, pp. <fpage>1735</fpage>&#x2013;<lpage>1780</lpage>, <year>1997</year>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Devlin</surname></string-name>, <string-name><given-names>M. W.</given-names> <surname>Chang</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Lee</surname></string-name> and <string-name><given-names>K.</given-names> <surname>Toutanova</surname></string-name></person-group>, &#x201C;<article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>,&#x201D; in <conf-name>Conf. of the North American Chapter of the Association for Computational Linguistics</conf-name>, <publisher-loc>Minneapolis, USA</publisher-loc>, vol. <volume>1</volume>, pp. <fpage>4171</fpage>&#x2013;<lpage>4186</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Dong</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>W. H.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Wei</surname></string-name> and <string-name><given-names>X. D.</given-names> <surname>Liu</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Unified language model pre-training for natural language understanding and generation</article-title>,&#x201D; in <conf-name>Advances in Neural Information Processing Systems</conf-name>, <publisher-loc>Vancouver, Canada</publisher-loc>, pp. <fpage>13063</fpage>&#x2013;<lpage>13075</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Z. G.</given-names> <surname>Qu</surname></string-name>, <string-name><given-names>S. Y.</given-names> <surname>Chen</surname></string-name> and <string-name><given-names>X. J.</given-names> <surname>Wang</surname></string-name></person-group>, &#x201C;<article-title>A secure controlled quantum image steganography algorithm</article-title>,&#x201D; <source>Quantum Information Processing</source>, vol. <volume>19</volume>, no. <issue>380</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>25</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>D.</given-names> <surname>Zheng</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Ran</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Li</surname></string-name> and <string-name><given-names>L.</given-names> <surname>Tian</surname></string-name></person-group>, &#x201C;<article-title>An efficient bar code image recognition algorithm for sorting system</article-title>,&#x201D; <source>Computers, Materials &#x0026; Continua</source>, vol. <volume>64</volume>, no. <issue>3</issue>, pp. <fpage>1885</fpage>&#x2013;<lpage>1895</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Sabour</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Frosst</surname></string-name> and <string-name><given-names>G. E.</given-names> <surname>Hinton</surname></string-name></person-group>, &#x201C;<article-title>Dynamic routing between capsules</article-title>,&#x201D; in <conf-name> Advances in Neural Information Processing Systems</conf-name>, <publisher-loc>Long Beach, California, USA</publisher-loc>, pp. <fpage>3856</fpage>&#x2013;<lpage>3866</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>B. T.</given-names> <surname>Hu</surname></string-name>, <string-name><given-names>Q. C.</given-names> <surname>Chen</surname></string-name> and <string-name><given-names>F. Z.</given-names> <surname>Zhu</surname></string-name></person-group>, &#x201C;<article-title>LCSTS: A large scale Chinese short text summarization dataset</article-title>,&#x201D; in <conf-name>Conf. on Empirical Methods in Natural Language Processing</conf-name>, <publisher-loc>Lisbon, Portugal</publisher-loc>, pp. <fpage>1967</fpage>&#x2013;<lpage>1972</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>A. M.</given-names> <surname>Rush</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Chopra</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Weston</surname></string-name></person-group>, &#x201C;<article-title>A neural attention model for abstractive sentence summarization</article-title>,&#x201D; in <conf-name>Conf. on Empirical Methods in Natural Language Processing</conf-name>, <publisher-loc>Lisbon, Portugal</publisher-loc>, pp. <fpage>379</fpage>&#x2013;<lpage>389</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>D.</given-names> <surname>Bahdanau</surname></string-name>, <string-name><given-names>K. H.</given-names> <surname>Cho</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Bengio</surname></string-name></person-group>, &#x201C;<article-title>Neural machine translation by jointly learning to align and translate</article-title>,&#x201D; in <conf-name>Int. Conf. on Learning Representations</conf-name>, <publisher-loc>San Diego, USA</publisher-loc>, pp. <fpage>1</fpage>&#x2013;<lpage>15</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>M. E.</given-names> <surname>Peters</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Neumann</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Iyyer</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Gardner</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Clark</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Deep contextualized word representations</article-title>,&#x201D; in <conf-name>Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</conf-name>, <publisher-loc>New Orleans, Louisiana, USA</publisher-loc>, pp. <fpage>2227</fpage>&#x2013;<lpage>2237</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Vaswani</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Shazeer</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Parmar</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Uszkoreit</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Jones</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Attention is all you need</article-title>,&#x201D; in <conf-name>Advances in Neural Information Processing Systems</conf-name>, <publisher-loc>Long Beach, California, USA</publisher-loc>, pp. <fpage>5998</fpage>&#x2013;<lpage>6008</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Liu</surname></string-name> and <string-name><given-names>M.</given-names> <surname>Lapata</surname></string-name></person-group>, &#x201C;<article-title>Text summarization with pretrained encoders</article-title>,&#x201D; in <conf-name>Conf. on Empirical Methods in Natural Language Processing and the 9th Int. Joint Conf. on Natural Language Processing</conf-name>, <publisher-loc>Hong Kong, China</publisher-loc>, pp. <fpage>3721</fpage>&#x2013;<lpage>3731</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>K. T.</given-names> <surname>Song</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Tian</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Qing</surname></string-name>, <string-name><given-names>J. F.</given-names> <surname>Lu</surname></string-name> and <string-name><given-names>T. Y.</given-names> <surname>Liu</surname></string-name></person-group>, &#x201C;<article-title>MASS: Masked sequence to sequence pre-training for language generation</article-title>,&#x201D; in <conf-name>Int. Conf. on Machine Learning</conf-name>, <publisher-loc>Long Beach, California, USA</publisher-loc>, pp. <fpage>5926</fpage>&#x2013;<lpage>5936</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>S. M.</given-names> <surname>Ma</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Sun</surname></string-name>, <string-name><given-names>J. J.</given-names> <surname>Xu</surname></string-name>, <string-name><given-names>H. F.</given-names> <surname>Wang</surname></string-name> and <string-name><given-names>W. J.</given-names> <surname>Li</surname></string-name></person-group>, &#x201C;<article-title>Improving semantic relevance for sequence-to-sequence learning of Chinese social media Text Summarization</article-title>,&#x201D; in <conf-name>Annual Meeting of the Association for Computational Linguistics</conf-name>, <publisher-loc>Vancouver, Canada</publisher-loc>, vol. <volume>2</volume>, pp. <fpage>635</fpage>&#x2013;<lpage>640</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>S. M.</given-names> <surname>Ma</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Sun</surname></string-name>, <string-name><given-names>J. Y.</given-names> <surname>Lin</surname></string-name> and <string-name><given-names>H. F.</given-names> <surname>Wang</surname></string-name></person-group>, &#x201C;<article-title>Autoencoder as assistant supervisor: Improving text representation for Chinese social media text summarization</article-title>,&#x201D; in <conf-name>Annual Meeting of the Association for Computational Linguistics</conf-name>, <publisher-loc>Melbourne, Australia</publisher-loc>, vol. <volume>2</volume>, pp. <fpage>725</fpage>&#x2013;<lpage>731</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>W.</given-names> <surname>Zhao</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Ye</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Lei</surname></string-name>, <string-name><given-names>S. F.</given-names> <surname>Zhang</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Investigating capsule networks with dynamic routing for text classification</article-title>,&#x201D; in <conf-name>Conf. on Empirical Methods in Natural Language Processing</conf-name>, <publisher-loc>Brussels, Belgium</publisher-loc>, pp. <fpage>3110</fpage>&#x2013;<lpage>3119</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Ioffe</surname></string-name> and <string-name><given-names>C.</given-names> <surname>Szegedy</surname></string-name></person-group>, &#x201C;<article-title>Batch normalization: Accelerating deep network training by reducing internal covariate shift</article-title>,&#x201D; in <conf-name>Int. Conf. on Machine Learning</conf-name>, <publisher-loc>Lille, France</publisher-loc>, pp. <fpage>448</fpage>&#x2013;<lpage>456</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J. L.</given-names> <surname>Ba</surname></string-name>, <string-name><given-names>J. R.</given-names> <surname>Kiros</surname></string-name> and <string-name><given-names>G. E.</given-names> <surname>Hinton</surname></string-name></person-group>, &#x201C;<article-title>Layer Normalization</article-title>,&#x201D; <source>Stat</source>, vol. <volume>1050</volume>, pp. <fpage>21</fpage>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>K.</given-names> <surname>Cho</surname></string-name>, <string-name><given-names>B. V.</given-names> <surname>Merrienboer</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Bahdanau</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Bengio</surname></string-name></person-group>, &#x201C;<article-title>On the properties of neural machine translation: Encoder-decoder approaches</article-title>,&#x201D; in <conf-name>Conf. of the North American Chapter of the Association for Computational Linguistics</conf-name>, <publisher-loc>Denver, Colorado, USA</publisher-loc>, pp. <fpage>103</fpage>&#x2013;<lpage>112</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>D. P.</given-names> <surname>Kingma</surname></string-name> and <string-name><given-names>J. L.</given-names> <surname>Ba</surname></string-name></person-group>, &#x201C;<article-title>Adam: A method for stochastic optimization</article-title>,&#x201D; in <conf-name>Int. Conf. for Learning Representations</conf-name>, <publisher-loc>San Diego, USA</publisher-loc>, pp. <fpage>1</fpage>&#x2013;<lpage>15</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>C. Y.</given-names> <surname>Lin</surname></string-name></person-group>, &#x201C;<article-title>Rouge: A package for automatic evaluation of summaries</article-title>,&#x201D; in <conf-name>Text Summarization Branches Out</conf-name>, <publisher-loc>Barcelona, Spain</publisher-loc>, pp. <fpage>74</fpage>&#x2013;<lpage>81</lpage>, <year>2004</year>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>N.</given-names> <surname>Srivastava</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Hinton</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Krizhevsky</surname></string-name>, <string-name><given-names>I.</given-names> <surname>Sutskever</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Salakhutdinov</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Dropout: A simple way to prevent neural networks from overfitting</article-title>,&#x201D; <source>Journal of Machine Learning Research</source>, vol. <volume>15</volume>, no. <issue>1</issue>, pp. <fpage>1929</fpage>&#x2013;<lpage>1958</lpage>, <year>2014</year>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>P.</given-names> <surname>Koehn</surname></string-name></person-group>, &#x201C;<article-title>Pharaoh: A beam search decoder for phrase-based statistical machine translation models</article-title>,&#x201D; in <conf-name> Machine Translation: from Real Users to Research</conf-name>, <publisher-loc>Washington, DC, USA</publisher-loc>, pp. <fpage>115</fpage>&#x2013;<lpage>124</lpage>, <year>2004</year>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J. T.</given-names> <surname>Gu</surname></string-name>, <string-name><given-names>Z. D.</given-names> <surname>Lu</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Li</surname></string-name> and <string-name><given-names>V.</given-names> <surname>Li</surname></string-name></person-group>, &#x201C;<article-title>Incorporating copying mechanism in sequence-to-sequence learning</article-title>,&#x201D; in <conf-name>Annual Meeting of the Association for Computational Linguistics</conf-name>, <publisher-loc>Berlin, Germany</publisher-loc>, vol. <volume>1</volume>, pp. <fpage>1631</fpage>&#x2013;<lpage>1640</lpage>, <year>2016</year>. </mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>P. J.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Lam</surname></string-name>, <string-name><given-names>L. D.</given-names> <surname>Bing</surname></string-name> and <string-name><given-names>Z. H.</given-names> <surname>Wang</surname></string-name></person-group>, &#x201C;<article-title>Deep recurrent generative decoder for abstractive text summarization</article-title>,&#x201D; in <conf-name>Conf. on Empirical Methods in Natural Language Processing</conf-name>, <publisher-loc>Copenhagen, Denmark</publisher-loc>, pp. <fpage>2091</fpage>&#x2013;<lpage>2100</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>S. M.</given-names> <surname>Ma</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Sun</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>S. J.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>W. J.</given-names> <surname>Li</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Query and output: Generating words by querying distributed word representations for paraphrase generation</article-title>,&#x201D; in <conf-name>Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</conf-name>, <publisher-loc>New Orleans, Louisiana, USA</publisher-loc>, vol. <volume>1</volume>, pp. <fpage>196</fpage>&#x2013;<lpage>206</lpage>, <year>2018</year>.</mixed-citation></ref>
</ref-list>
</back>
</article>