<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">45807</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2023.045807</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Network Configuration Entity Extraction Method Based on Transformer with Multi-Head Attention Mechanism</article-title>
<alt-title alt-title-type="left-running-head">Network Configuration Entity Extraction Method Based on Transformer with Multi-Head Attention Mechanism</alt-title>
<alt-title alt-title-type="right-running-head">Network Configuration Entity Extraction Method Based on Transformer with Multi-Head Attention Mechanism</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Yang</surname><given-names>Yang</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>Qu</surname><given-names>Zhenying</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Yan</surname><given-names>Zefan</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-4" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Gao</surname><given-names>Zhipeng</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><email>gaozhipeng@bupt.edu.cn</email></contrib>
<contrib id="author-5" contrib-type="author">
<name name-style="western"><surname>Wang</surname><given-names>Ti</given-names></name><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<aff id="aff-1"><label>1</label><institution>State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications</institution>, <addr-line>Beijing, 100876</addr-line>, <country>China</country></aff>
<aff id="aff-2"><label>2</label><institution>Product Development Department, China Unicom Smart City Research Institute</institution>, <addr-line>Beijing, 100044</addr-line>, <country>China</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Zhipeng Gao. Email: <email>gaozhipeng@bupt.edu.cn</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2024</year></pub-date>
<pub-date date-type="pub" publication-format="electronic"><day>30</day>
<month>1</month>
<year>2024</year></pub-date>
<volume>78</volume>
<issue>1</issue>
<fpage>735</fpage>
<lpage>757</lpage>
<history>
<date date-type="received">
<day>08</day>
<month>9</month>
<year>2023</year>
</date>
<date date-type="accepted">
<day>01</day>
<month>11</month>
<year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2024 Yang et al.</copyright-statement>
<copyright-year>2024</copyright-year>
<copyright-holder>Yang et al.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_45807.pdf"></self-uri>
<abstract>
<p>Nowadays, ensuring the quality of network services has become increasingly vital. Experts are turning to knowledge graph technology, with a significant emphasis on entity extraction in the identification of device configurations. This research paper presents a novel entity extraction method that leverages a combination of active learning and attention mechanisms. Initially, an improved active learning approach is employed to select the most valuable unlabeled samples, which are subsequently submitted for expert labeling. This approach successfully addresses the problems of isolated points and sample redundancy within the network configuration sample set. Then the labeled samples are utilized to train the model for network configuration entity extraction. Furthermore, the multi-head self-attention of the transformer model is enhanced by introducing the Adaptive Weighting method based on the Laplace mixture distribution. This enhancement enables the transformer model to dynamically adapt its focus to words in various positions, displaying exceptional adaptability to abnormal data and further elevating the accuracy of the proposed model. Through comparisons with Random Sampling (RANDOM), Maximum Normalized Log-Probability (MNLP), Least Confidence (LC), Token Entropy (TE), and Entropy Query by Bagging (EQB), the proposed method, Entropy Query by Bagging and Maximum Influence Active Learning (EQBMIAL), achieves comparable performance with only 40% of the samples on both datasets, while other algorithms require 50% of the samples. Furthermore, the entity extraction algorithm with the Adaptive Weighted Multi-head Attention mechanism (AW-MHA) is compared with BILSTM-CRF, Mutil_Attention-Bilstm-Crf, Deep_Neural_Model_NER and BERT_Transformer, achieving precision rates of 75.98% and 98.32% on the two datasets, respectively. Statistical tests demonstrate the significance and effectiveness of the proposed algorithms in this paper.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Entity extraction</kwd>
<kwd>network configuration</kwd>
<kwd>knowledge graph</kwd>
<kwd>active learning</kwd>
<kwd>transformer</kwd>
</kwd-group>
<funding-group>
<award-group id="awg1">
<funding-source>National Key R&#x0026;D Program of China</funding-source>
<award-id>2019YFB2103202</award-id>
</award-group>
</funding-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>As information and communication technology advances, the Internet&#x2019;s role in society has become crucial and closely tied to economic and social development. The sudden outbreak of the epidemic further underscored the significance of the Internet, elevating the importance of network infrastructure and services. The increasing variety of devices and access methods has led to greater complexity, requiring efficient network management, resource allocation, and diverse services for various businesses.</p>
<p>A knowledge graph [<xref ref-type="bibr" rid="ref-1">1</xref>] is a structured knowledge base that employs graphs to represent entities, including concepts, individuals, and objects, along with their interconnected relationships in the real world. Knowledge graphs are utilized to transform various data forms into a graphical knowledge representation, where entities and relationships are depicted as nodes and edges. Once entities, such as devices, services, parameters, etc., are extracted from network configuration data, they are used to construct a graphical knowledge graph. The approach involves tracing the network&#x2019;s logical structure, mapping association relationships among network devices, and understanding the agreements within each device. Then, it analyzes business function relationships, updates the knowledge graph database in real-time based on network status changes, and supports keyword searches. Finally, knowledge interactions are identified, and data from various equipment configurations are constructed into a contactable, traceable, and extensible map. This aids in comprehending the structure and semantics of network configurations, as well as in standardizing and adding semantic meaning to network configuration data, ultimately enhancing network maintainability and scalability. The network configuration knowledge graph, derived after extracting network configuration entities, can provide support for various domains such as network configuration, fault diagnosis [<xref ref-type="bibr" rid="ref-2">2</xref>], performance analysis [<xref ref-type="bibr" rid="ref-3">3</xref>], security detection [<xref ref-type="bibr" rid="ref-4">4</xref>], and more.</p>
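The entities-as-nodes, relationships-as-edges representation described above can be sketched in a few lines of Python. The triples and relation names below are hypothetical examples, not drawn from the paper&#x2019;s dataset.

```python
# A minimal sketch of turning extracted network-configuration entities into a
# graph: entities become nodes, relationships become labeled, directed edges.
from collections import defaultdict

def build_config_graph(triples):
    """Build an adjacency-list knowledge graph from (head, relation, tail) triples."""
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
    return dict(graph)

# Hypothetical triples extracted from device configuration text.
triples = [
    ("Router-A", "has_interface", "GigabitEthernet0/1"),
    ("GigabitEthernet0/1", "has_ip", "10.0.0.1"),
    ("Router-A", "runs_protocol", "OSPF"),
]
graph = build_config_graph(triples)
```

An adjacency list keeps the graph contactable and traceable: following edges from a device node reaches its interfaces, addresses, and protocols, which is what keyword search and fault tracing over the graph rely on.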
<p>Entity extraction constitutes a pivotal component in the creation of knowledge graphs. Its primary objective is to identify and extract entities from different data origins and structures, and then convert them into structured data suitable for storage within a knowledge graph. Presently, entity extraction technologies fall into three fundamental categories. The first category relies on manually crafted rules, frequently constructed using dictionaries and knowledge bases. However, this method necessitates substantial human effort to develop language models, leading to prolonged system cycles, sluggish information updates, and limited portability. The second category is based on statistical principles. Nevertheless, these approaches suffer from several drawbacks, including the need for many handcrafted features, high costs, and limited migration and generalization capabilities due to considerable human intervention. The third category is based on deep learning techniques. These approaches incorporate neural network models combined with attention mechanisms, which consider different features and influence levels. They yield diverse entity extraction results while minimizing attention to redundant information.</p>
<p>This paper focuses on the following key issues in existing entity extraction methods: (1) Both supervised and semi-supervised learning methods require a significant amount of labeled data, and the quality of these labeled samples directly impacts the performance of the classification model. However, data acquisition can be time-consuming and labor-intensive, often resulting in a substantial number of redundant samples in the training set. (2) Current neural network models that utilize the attention mechanism face a challenge when increasing the number of attention heads. This can lead to an expression bottleneck, diminishing the model&#x2019;s capability to effectively express context vectors and potentially resulting in decreased accuracy.</p>
<p>Our contributions are as follows:
<list list-type="bullet">
<list-item>
<p>This paper presents an entity extraction method on the basis of active learning and a transformer with an attention mechanism. The approach involves utilizing an enhanced active learning method to label the training set, which is then utilized to train the deep learning model.</p></list-item>
<list-item>
<p>An enhanced active learning method is presented to improve the quality of sample selection. Initially, the active learning technique is employed to select the most informative unlabeled samples using a specific strategy algorithm. These queried samples are then utilized to train the classification model, aiming to enhance its accuracy. Unlike existing query strategies, this paper introduces novel improvements that effectively address issues of outliers and redundancy within the training set examples.</p></list-item>
<list-item>
<p>This paper proposes an adaptive weighting method on the basis of the Laplace mixture distribution idea, aiming at overcoming the bottleneck problem in the expression of the multi-head attention mechanism. The Transformer model is employed to capture word semantics and context, as well as to handle long-distance dependencies through the multi-head self-attention mechanism. By combining the idea of the Laplace mixture distribution, the weight matrix is mixed and superimposed to heighten the expression ability of the distribution of attention, thus improving the performance of the multi-head self-attention mechanism of the Transformer.</p></list-item>
<list-item>
<p>This paper conducted simulation experiments on two datasets, demonstrating the effectiveness of the proposed method. Furthermore, when compared to other algorithms, the presented framework exhibits better performance.</p></list-item>
</list></p>
<p>The following sections are organized as follows. <xref ref-type="sec" rid="s2">Section 2</xref> introduces existing entity extraction and active learning methods. <xref ref-type="sec" rid="s3">Section 3</xref> elaborates on the specific algorithmic improvements. In <xref ref-type="sec" rid="s4">Section 4</xref>, the simulation results of the algorithm are given and compared with other methods. Finally, <xref ref-type="sec" rid="s5">Section 5</xref> provides the conclusion, which summarizes the research results.</p>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Work</title>
<p>Expert annotation for network configuration data is often expensive and time-consuming. Active learning offers a solution to reduce labeling expenses and improve data utilization. Additionally, Transformers are renowned for their powerful self-attention mechanism, allowing them to capture long-range dependencies within input sequences. This is highly beneficial for handling complex relationships among entities in network configuration data. Therefore, the method in this article is designed and improved on the basis of Transformers and active learning. During the composition of this section, several literature searches were conducted on Web of Science, Engineering Village, and Google Scholar, employing keywords like &#x201C;entity extraction&#x201D;, &#x201C;active learning&#x201D;, and &#x201C;Transformer&#x201D;. These works encompassed the primary research achievements in this domain. <xref ref-type="table" rid="table-1">Table 1</xref> provides a succinct comparison of related studies. This section will provide a detailed discussion of both entity extraction and active learning.</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>A comparative analysis of the literature</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th></th>
<th>Reference</th>
<th>Year</th>
<th>Models</th>
<th>Statistical-based methods</th>
<th>Deep learning-based methods</th>
<th>Active learning</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Bikel et al. [<xref ref-type="bibr" rid="ref-5">5</xref>]</td>
<td>1999</td>
<td>Hidden Markov</td>
<td>Yes</td>
<td>No</td>
<td>No</td>
</tr>
<tr>
<td>2</td>
<td>Borthwick et al. [<xref ref-type="bibr" rid="ref-6">6</xref>]</td>
<td>1999</td>
<td>MEM</td>
<td>Yes</td>
<td>No</td>
<td>No</td>
</tr>
<tr>
<td>3</td>
<td>Chieu et al. [<xref ref-type="bibr" rid="ref-7">7</xref>]</td>
<td>2002</td>
<td>MEM</td>
<td>Yes</td>
<td>No</td>
<td>No</td>
</tr>
<tr>
<td>4</td>
<td>Mayfield et al. [<xref ref-type="bibr" rid="ref-8">8</xref>]</td>
<td>2003</td>
<td>SVM</td>
<td>Yes</td>
<td>No</td>
<td>No</td>
</tr>
<tr>
<td>5</td>
<td>McCallum et al. [<xref ref-type="bibr" rid="ref-9">9</xref>]</td>
<td>2003</td>
<td>CRF</td>
<td>Yes</td>
<td>No</td>
<td>No</td>
</tr>
<tr>
<td>6</td>
<td>Settles [<xref ref-type="bibr" rid="ref-10">10</xref>]</td>
<td>2004</td>
<td>CRF</td>
<td>Yes</td>
<td>No</td>
<td>No</td>
</tr>
<tr>
<td>7</td>
<td>Lample et al. [<xref ref-type="bibr" rid="ref-11">11</xref>]</td>
<td>2016</td>
<td>LSTM</td>
<td>No</td>
<td>Yes</td>
<td>No</td>
</tr>
<tr>
<td>8</td>
<td>Luo et al. [<xref ref-type="bibr" rid="ref-12">12</xref>]</td>
<td>2018</td>
<td>BILSTM</td>
<td>No</td>
<td>Yes</td>
<td>No</td>
</tr>
<tr>
<td>9</td>
<td>Xu et al. [<xref ref-type="bibr" rid="ref-13">13</xref>]</td>
<td>2019</td>
<td>BILSTM</td>
<td>No</td>
<td>Yes</td>
<td>No</td>
</tr>
<tr>
<td>10</td>
<td>Singh et al. [<xref ref-type="bibr" rid="ref-14">14</xref>]</td>
<td>2022</td>
<td>CNN</td>
<td>No</td>
<td>Yes</td>
<td>No</td>
</tr>
<tr>
<td>11</td>
<td>Parsaeimehr et al. [<xref ref-type="bibr" rid="ref-15">15</xref>]</td>
<td>2023</td>
<td>CNN</td>
<td>No</td>
<td>Yes</td>
<td>No</td>
</tr>
<tr>
<td>12</td>
<td>He et al. [<xref ref-type="bibr" rid="ref-16">16</xref>]</td>
<td>2023</td>
<td>Transformer</td>
<td>No</td>
<td>Yes</td>
<td>No</td>
</tr>
<tr>
<td>13</td>
<td>Alissa et al. [<xref ref-type="bibr" rid="ref-17">17</xref>]</td>
<td>2023</td>
<td>Transformer</td>
<td>No</td>
<td>Yes</td>
<td>No</td>
</tr>
<tr>
<td>14</td>
<td>Jia et al. [<xref ref-type="bibr" rid="ref-18">18</xref>]</td>
<td>2020</td>
<td>Transformer</td>
<td>No</td>
<td>Yes</td>
<td>No</td>
</tr>
<tr>
<td>15</td>
<td>Beygelzimer et al. [<xref ref-type="bibr" rid="ref-19">19</xref>]</td>
<td>2009</td>
<td>Importance weighted</td>
<td>No</td>
<td>Yes</td>
<td>Yes</td>
</tr>
<tr>
<td>16</td>
<td>Mahapatra et al. [<xref ref-type="bibr" rid="ref-20">20</xref>]</td>
<td>2018</td>
<td>GAN</td>
<td>No</td>
<td>Yes</td>
<td>Yes</td>
</tr>
<tr>
<td><bold>17</bold></td>
<td><bold>Our paper</bold></td>
<td><bold>2023</bold></td>
<td><bold>Transformer</bold></td>
<td><bold>No</bold></td>
<td><bold>Yes</bold></td>
<td><bold>Yes</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<sec id="s2_1">
<label>2.1</label>
<title>Entity Extraction</title>
<p>As a highly effective method for managing text data, entity extraction has long been a focus of research in the areas of artificial intelligence and computer science. Its main goal is to automatically recognize particular entities from non-formatted textual data and classify them into predetermined groups, such as names of people, places, and organizations. The current technical methods for entity extraction may be loosely categorized into three groups: methods on the basis of manual customization rules, methods on the basis of traditional machine learning, and deep learning-based approaches.</p>
<p>In the early stages of entity extraction research, the reliance was primarily on hand-crafted rules. Experts and academics from numerous domains built knowledge extraction frameworks manually using their subject knowledge. These frameworks included a variety of strategies, including headwords, demonstrative words, directional words, positional words, punctuation analysis, and keyword detection. Pattern and string matching served as the main strategy, delivering admirable accuracy. However, this approach required extensive human labor to develop language models, because it mainly relied on the development of knowledge repositories and lexicons. As a result, it led to protracted development cycles, slow information updates, and limited portability of the extraction system.</p>
<p>With the advent of machine learning, statistical-based methods have been introduced for entity extraction. These methods leverage statistical machine learning to learn knowledge from a vast amount of labeled corpora, eliminating the need for manually defined rules. They treat Named Entity Recognition (NER) and word segmentation problems as sequence labeling tasks, where the predicted label depends not only on the current predicted sequence label but also on the preceding predicted label, displaying a strong interdependence between labels. Prominent statistical-based methods include the hidden Markov model (HMM) [<xref ref-type="bibr" rid="ref-5">5</xref>], maximum entropy model (MEM) [<xref ref-type="bibr" rid="ref-6">6</xref>,<xref ref-type="bibr" rid="ref-7">7</xref>], support vector machine (SVM) [<xref ref-type="bibr" rid="ref-8">8</xref>], and conditional random field (CRF) [<xref ref-type="bibr" rid="ref-9">9</xref>,<xref ref-type="bibr" rid="ref-10">10</xref>]. While these methods have yielded favorable outcomes in entity extraction, they also present particular challenges. The hidden Markov model relies solely on each state and its corresponding observation without considering the length of the observation sequence or word context. The maximum entropy model&#x2019;s constraint function relationship is tied to the number of samples, leading to extensive calculations in the iterative process and making practical application more challenging. Support vector machines rely on quadratic programming to solve support vectors, involving calculations of m-order matrices that can be difficult to implement for large-scale training samples. Furthermore, the conditional random field model exhibits slow convergence speed, leading to elevated training costs and increased complexity.</p>
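The sequence labeling formulation above assigns one tag per token; the widely used BIO scheme then groups tagged tokens into entity spans. A minimal sketch (pure Python; the tag set and tokens are hypothetical network-configuration examples, not the paper&#x2019;s):

```python
# Decode BIO tags into (entity_text, entity_type) spans: a "B-" tag opens a
# span, matching "I-" tags extend it, and anything else closes it.
def decode_bio(tokens, tags):
    entities, current, etype = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == etype:
            current.append(token)
        else:
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [], None
    if current:
        entities.append((" ".join(current), etype))
    return entities

tokens = ["interface", "GigabitEthernet0/1", "ip", "address", "10.0.0.1"]
tags   = ["O", "B-DEV", "O", "O", "B-IP"]
```

The interdependence between labels that HMMs and CRFs model is visible here: an "I-DEV" tag is only valid after a "B-DEV" or "I-DEV", which is exactly the kind of transition constraint a CRF learns.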
<p>In recent years, deep learning methods like Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN) have gained widespread adoption in the area of natural language processing, demonstrating impressive results across various tasks. Techniques like LSTM-CRF [<xref ref-type="bibr" rid="ref-11">11</xref>], BILSTM-CRF [<xref ref-type="bibr" rid="ref-12">12</xref>,<xref ref-type="bibr" rid="ref-13">13</xref>], BERT-CNN [<xref ref-type="bibr" rid="ref-14">14</xref>], and CNN-CRF [<xref ref-type="bibr" rid="ref-15">15</xref>], among others, have been successfully applied. In comparison to earlier statistical machine learning methods, deep learning approaches offer distinct advantages in automatic feature learning, leveraging deep semantic knowledge, and addressing data sparsity issues. These methods utilize neural networks to automatically learn features and train sequence labeling models, surpassing traditional methods on the basis of handcrafted features, thus positioning them as current research hotspots.</p>
<p>However, RNNs demand sequential processing, rendering both training and inference time-consuming. Conversely, CNNs, originally tailored for visual tasks, excel at extracting local information due to their inherent bias but struggle to capture global context, resulting in suboptimal performance in entity recognition tasks. In recent years, Transformer models, powered by attention mechanisms, have achieved state-of-the-art results in natural language processing and computer vision. Transformers have effectively addressed the constraints of sequential processing in RNNs and the dependency on local features in CNNs, enabling them to adeptly capture global context. In [<xref ref-type="bibr" rid="ref-18">18</xref>], authors leveraged the Transformer for entity extraction and constructed a pre-trained model on the ClueNER dataset, achieving remarkable performance. Nonetheless, the pre-trained models established in [<xref ref-type="bibr" rid="ref-18">18</xref>] are grounded in standard Chinese corpora and exhibit inferior performance when applied to entity extraction tasks within specific network contexts. Furthermore, many researchers have delved into the utilization of Transformers in network-related scenarios. For instance, reference [<xref ref-type="bibr" rid="ref-16">16</xref>] employed the Transformer architecture for anomaly detection, while reference [<xref ref-type="bibr" rid="ref-17">17</xref>] leveraged Transformers for text simplification. Nevertheless, within the domain of entity extraction in network contexts, the exploration of the Transformer architecture has been relatively limited.</p>
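The self-attention that lets Transformers capture global context, in contrast to the local bias of CNNs, can be sketched in a few lines of NumPy (single head, toy dimensions; this is the standard scaled dot-product formulation, not the paper&#x2019;s modified version):

```python
# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V: every token attends to
# every other token in one step, with no sequential recurrence.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # pairwise token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))                       # 4 tokens, d_model = 8
out, attn = scaled_dot_product_attention(X, X, X)
```

Because the attention matrix is computed for all token pairs at once, distance between tokens costs nothing extra, which is why long-range dependencies in configuration files are easier to capture than with an RNN.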
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>Active Learning in Model Training</title>
<p>Deep learning-based methods heavily rely on annotated corpora, and there is a shortage of large-scale, general-purpose corpora available for constructing and evaluating entity extraction systems. This limitation has emerged as a significant obstacle to the widespread adoption of these methods.</p>
<p>Both supervised learning and semi-supervised learning require a certain amount of labeled data, and the effectiveness of the classification model depends on the quality of these labeled samples. However, acquiring training samples is not only time-consuming and labor-intensive but also leads to a considerable number of redundant samples within the training set. To address these challenges and reduce training set size and labeling costs, active learning methods [<xref ref-type="bibr" rid="ref-19">19</xref>,<xref ref-type="bibr" rid="ref-20">20</xref>] have been proposed to optimize classification models. Active learning employs specialized algorithms to select the most informative unlabeled samples, which are then annotated by experts. These selected samples are subsequently integrated into the training process of the classification model, enhancing its accuracy. The key to successful active learning lies in selecting an appropriate query strategy. Two commonly used types of active learning models are stream-based active learning and pool-based active learning. Different situations may require various implementation solutions, and the choice of query strategy can be based on a single machine learning model or multiple models. Currently, widely adopted query strategies include uncertainty sampling, committee-based sampling, model change expectations, query error reduction, variance reduction, and density weighting, among others.</p>
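Two of the uncertainty-sampling query strategies named above, least confidence and entropy, can be sketched directly over a model&#x2019;s class-probability outputs (the pool and probability values below are illustrative, not from the paper&#x2019;s experiments):

```python
# Score each unlabeled sample by predictive uncertainty; the highest-scoring
# sample is queried for expert labeling.
import math

def least_confidence(probs):
    """Higher score = model is less sure of its top prediction."""
    return 1.0 - max(probs)

def entropy_score(probs):
    """Shannon entropy of the predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

pool = {
    "sample_a": [0.95, 0.03, 0.02],  # confident prediction -> low scores
    "sample_b": [0.40, 0.35, 0.25],  # uncertain prediction -> high scores
}
query = max(pool, key=lambda s: entropy_score(pool[s]))
```

Both strategies would pick `sample_b` here; they differ when the runner-up probabilities matter, since entropy uses the whole distribution while least confidence looks only at the top class.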
<p>Compared to traditional supervised methods, active learning demonstrates improved capabilities in handling larger training datasets, distinguishing diverse sample points, and reducing the volume of training data and manual labeling costs. However, traditional active learning may prove insufficient when dealing with challenges such as multiclass classification, outliers, sample redundancy within the training set, and imbalanced data. This paper introduces a pool-based active learning method that incorporates innovative enhancements designed to effectively address the challenges posed by outliers and redundancy in the training set examples.</p>
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>Entity Extraction Method</title>
<p>This section describes the network configuration entity recognition method. <xref ref-type="sec" rid="s3_1">Section 3.1</xref> provides an overview of the overall model flow and structure. <xref ref-type="sec" rid="s3_2">Section 3.2</xref> elaborates on the improved active learning algorithm, EQBMIAL. <xref ref-type="sec" rid="s3_3">Section 3.3</xref> provides a detailed explanation of the specific structure of the entity extraction stage and the AW-MHA designed to enhance the transformer.</p>
<sec id="s3_1">
<label>3.1</label>
<title>Overall Flow and Framework</title>
<p>In this section, an enhanced entity extraction method is proposed, integrating active learning and an attention mechanism [<xref ref-type="bibr" rid="ref-21">21</xref>,<xref ref-type="bibr" rid="ref-22">22</xref>]. Leveraging entropy query-by-bagging (EQB) and maximum influence (MI) active learning strategies, samples with high uncertainty and low redundancy are selected and manually labeled, expanding the labeled sample set. Through iterative expansion and model training, the method improves the model&#x2019;s generalization ability. In addition, an improved adaptive weighting mechanism is introduced into the multi-head self-attention mechanism of the transformer model. This allows the model to achieve more flexible weight allocation by appropriately setting the mean and variance parameters. Moreover, this improvement can weigh information from different modalities to better capture multimodal information. This fusion of multiple attention heads effectively amalgamates information and may also help the model better handle noise and uncertainty in the data, thus improving the robustness and generalization performance of the whole model. The overall flow chart of the algorithm is shown in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>.</p>
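The Laplace-mixture head weighting can be illustrated with a small NumPy sketch. This is not the paper&#x2019;s exact formulation (Section 3.3 gives the details); the parameter names `mus`, `bs`, and the per-position mixing rule are assumptions made for illustration: each head is given a Laplace positional profile, and the normalized profiles blend the heads&#x2019; attention maps.

```python
# Illustrative sketch: blend H heads' attention maps with mixture weights
# derived from Laplace densities over token positions (all names hypothetical).
import numpy as np

def laplace_pdf(x, mu, b):
    return np.exp(-np.abs(x - mu) / b) / (2 * b)

def mix_heads(head_attn, mus, bs, positions):
    """head_attn: (H, T, T) attention maps; returns one blended (T, T) map."""
    profiles = np.stack([laplace_pdf(positions, mu, b) for mu, b in zip(mus, bs)])
    weights = profiles / profiles.sum(axis=0, keepdims=True)  # (H, T), sums to 1 over heads
    # each row t of the output is a convex combination of the heads' rows t
    return np.einsum("ht,htj->tj", weights, head_attn)

H, T = 3, 5
rng = np.random.default_rng(1)
attn = rng.random((H, T, T))
attn /= attn.sum(axis=-1, keepdims=True)                      # valid attention rows
mixed = mix_heads(attn, mus=[0.0, 2.0, 4.0], bs=[1.0, 1.0, 1.0],
                  positions=np.arange(T, dtype=float))
```

Because the mixture weights form a convex combination per position, the blended map remains a valid attention distribution, while the heavy tails of the Laplace density let distant heads retain nonzero influence, which is the flexibility the adaptive weighting aims for.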
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>Entity extraction flow chart</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_45807-fig-1.tif"/>
</fig>
<p>The entity extraction process is as follows:
<list list-type="order">
<list-item>
<p>In the sample extraction stage, samples are selected from the unlabelled sample pool utilizing the entropy query-by-bagging and maximum-influence active learning screening strategies. These selected samples are added to the labeled dataset after expert labeling, while the remaining unselected samples stay in the unlabeled pool.</p></list-item>
<list-item>
<p>During the entity extraction stage, the labeled dataset serves as the training set for the entity extraction model, which consists of an input layer; hidden layers comprising an embedding layer, an improved Transformer with adaptive weighting for the multi-head self-attention mechanism, and a sequence labeling layer; and an output layer.</p></list-item>
<list-item>
<p>The performance of the current model is judged from its output. If the model meets the performance requirements, the final model and labeled samples are obtained and the process ends. Otherwise, steps 1 and 2 are repeated until the performance requirements are satisfied.</p></list-item>
</list></p>
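The three-step loop above can be written as a schematic sketch. The callables `select`, `label`, `train`, and `evaluate` are hypothetical stand-ins for the paper&#x2019;s EQBMIAL selector, expert annotation, AW-MHA model training, and performance check:

```python
# Iterate: check performance, query informative samples, expert-label them,
# retrain on the expanded labeled set, until the target is reached.
def active_learning_loop(unlabeled, labeled, select, label, train, evaluate,
                         target_f1, max_rounds=10):
    model = train(labeled)
    for _ in range(max_rounds):
        if evaluate(model) >= target_f1:
            break                                   # step 3: requirement met
        batch = select(model, unlabeled)            # step 1: query samples
        unlabeled -= batch
        labeled |= {(x, label(x)) for x in batch}   # expert labeling
        model = train(labeled)                      # step 2: retrain
    return model, labeled

# Toy closures simulating the loop: "performance" grows with labeled-set size.
unl, lab = {1, 2, 3, 4}, set()
model, lab = active_learning_loop(
    unl, lab,
    select=lambda m, u: {min(u)},
    label=lambda x: x % 2,
    train=lambda l: len(l),
    evaluate=lambda m: m / 4,
    target_f1=0.5)
```

The loop structure makes the stopping rule explicit: labeling cost is only incurred while the model still falls short of the target, which is how the method achieves comparable performance with fewer labeled samples.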
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Sample Extraction Method on the Basis of Improved Active Learning</title>
<p>Active learning involves selecting a subset of unlabelled samples for manual labeling and iteratively expanding the labeled dataset to enhance the model&#x2019;s generalization capability. In the active learning process, the key challenge is selecting unlabelled samples for labeling and efficiently improving the model&#x2019;s generalization after learning from these selected samples. Current research primarily builds on pool-based sample selection strategies, aiming to develop sample selection strategies suitable for named entity recognition tasks while ensuring the model achieves a certain level of performance and minimizes labeling costs. One such active learning approach is the Entropy Query by Bagging (EQB) algorithm, which utilizes information entropy to measure the uncertainty and information gain of samples. Therefore, it tends to select samples that provide maximum information during model training. In the context of network configuration entity extraction tasks, EQB significantly reduces the manual data labeling workload, thereby improving annotation efficiency. Furthermore, EQB can expedite model convergence by focusing more attention on the critical aspects of entity extraction tasks.</p>
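EQB&#x2019;s committee vote entropy can be sketched concretely: k bagged classifiers each predict a label for a sample, and samples whose vote distribution has high entropy are queried first (the committee size and label names below are hypothetical):

```python
# Entropy of the label distribution voted by a committee of k models; full
# agreement gives zero entropy, disagreement gives positive entropy.
import math
from collections import Counter

def vote_entropy(committee_labels):
    k = len(committee_labels)
    counts = Counter(committee_labels)
    return -sum((c / k) * math.log(c / k) for c in counts.values())

# Hypothetical predictions of a 4-model committee on two samples.
agreed    = vote_entropy(["DEV", "DEV", "DEV", "DEV"])  # full agreement
disagreed = vote_entropy(["DEV", "IP", "SVC", "DEV"])   # disagreement
```

A sample on which the committee agrees carries little new information for training, while a contested sample sits near a decision boundary, which is why EQB ranks it higher for expert labeling.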
<p>This research focuses on selecting samples from the pool for annotation through an active learning sample selection strategy and training the model on the newly labeled samples, so as to rapidly improve the model&#x2019;s generalization ability. The active learning approach proposed in this article, EQBMIAL, therefore builds upon EQB and enhances it with influence maximization to optimize the algorithm. The query strategy in EQB relies primarily on entropy values to measure the informativeness of sample categories, which is valuable, but it leaves two problems unsolved. The first problem concerns isolated points. A sample with a large entropy value may be the only one of its kind in the dataset; choosing such an isolated point does not boost the model&#x2019;s accuracy or other indicators, whereas a sample with a lower entropy value that is not an isolated point may. The second problem is sample redundancy in the training set. If a sample with a large entropy value closely resembles samples already in the training set, adding it does not improve accuracy; conversely, a sample with a small entropy value that is unlike anything in the training set may greatly improve the model. To address these two problems, this paper optimizes the algorithm by combining entropy query-by-bagging with maximum influence: the maximum influence criterion identifies the most influential samples, and its influence index comprises two components, a representativeness measure and a difference (diversity) measure.</p>
<p>The first component is EQB, a query-by-committee method. EQB begins by selecting k training subsets from the initial training set using bagging. These k subsets are used to train k classification models, which form a committee. Each classifier in the committee predicts the category of each sample in the unlabelled sample set, so each sample receives k predicted labels. EQB then uses these labels to calculate the sample&#x2019;s entropy value; the query formula is as follows:
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:mi>f</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msubsup><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>w</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>w</mml:mi><mml:mo>|</mml:mo></mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mi>p</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>w</mml:mi><mml:mo>|</mml:mo></mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">]</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:mrow><mml:mtext>p</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mrow><mml:mtext>y</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:mtext>w</mml:mtext></mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> denotes the probability that the sample <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:msub><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> is predicted to belong to class w by the k trained models. In other words, the predicted probability for the sample <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:msub><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> is the number of votes for class w divided by k, where <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:msub><mml:mrow><mml:mtext>N</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> represents the total number of categories.</p>
<p>When all classifiers in the committee make identical predictions for a sample, f(x) equals 0. This suggests that the category of this sample is highly certain for the current classification model, and adding it to the training set would offer little improvement to the model. On the other hand, when the predictions of the sample&#x2019;s label by the classifiers in the committee vary, f(x) increases, indicating that this sample provides a substantial amount of information that can be beneficial in enhancing the model.</p>
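Under the bagging-committee scheme of Eq. (1), the vote entropy of a sample can be computed directly from the k committee predictions. A minimal sketch, assuming the per-class probability is estimated as the fraction of committee votes (the label strings are hypothetical):

```python
import math
from collections import Counter

def vote_entropy(committee_labels):
    """Vote entropy of one unlabelled sample, as in Eq. (1).

    committee_labels: the list of class labels predicted for this sample
    by the k committee members; p(y*=w|x) is estimated as votes(w) / k."""
    k = len(committee_labels)
    counts = Counter(committee_labels)
    return -sum((c / k) * math.log(c / k) for c in counts.values())

# A unanimous committee yields zero entropy; split votes yield positive
# entropy, marking the sample as informative.
assert vote_entropy(["B-IP"] * 5) == 0.0
assert vote_entropy(["B-IP", "O", "B-IP", "O"]) > 0.0
```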
<p>The second part is the maximization of influence, which consists of two components: the representative model and the difference model.</p>
<p>The representative model is used to solve the isolated point (outlier) problem. It relies mainly on k-means clustering and category prior probabilities. To find representative samples from the labeled sample set, k-means is used to cluster the data: k samples are first selected as the initial cluster centers, the distance from each sample in the dataset to each of the k cluster centers is calculated, and each sample is assigned to the cluster whose center is nearest. Each cluster&#x2019;s center is then recomputed as follows:
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mo>|</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo></mml:mrow></mml:mfrac><mml:msub><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mi>x</mml:mi></mml:math></disp-formula></p>
<p>The calculation is repeated until the termination condition is reached, yielding k representative cluster centers. The cluster posterior probability of each sample and the category prior probability <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:msub><mml:mrow><mml:mtext>b</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>k</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> over the unlabelled sample set are then given as follows:
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:mi>p</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mn>2</mml:mn><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:mfrac><mml:msup><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:mo>|</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo></mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mn>2</mml:mn><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:mfrac><mml:msup><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:mo>|</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo></mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:math></disp-formula>
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>n</mml:mi></mml:mfrac><mml:msubsup><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>According to <xref ref-type="disp-formula" rid="eqn-4">Eq. (4)</xref>, the representative model formula is as follows:
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:mi>w</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msubsup><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msubsup><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mn>2</mml:mn><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:mfrac><mml:msup><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:mo>|</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo></mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>The larger the value <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:mrow><mml:mtext>w</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, the more representative the unlabelled sample <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:msub><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> is.</p>
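A minimal sketch of the representative score of Eq. (5), assuming the cluster centers of Eq. (2) and the category priors b_k of Eq. (4) have already been computed (the sigma default is a placeholder, not a tuned value from the paper):

```python
import numpy as np

def representativeness(x, centers, priors, sigma=1.0):
    """Representative score w(x) of Eq. (5): a prior-weighted sum of
    Gaussian-kernel similarities between x and the k-means cluster centers.

    x: (d,) sample; centers: (K, d) cluster centers a_k from Eq. (2);
    priors: (K,) category prior probabilities b_k from Eq. (4)."""
    sq_dists = np.sum((centers - x) ** 2, axis=1)        # ||x - a_k||^2
    return float(np.sum(priors * np.exp(-sq_dists / (2.0 * sigma ** 2))))

# A sample sitting on a cluster center scores much higher than an
# isolated point far from every center.
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
priors = np.array([0.5, 0.5])
assert representativeness(np.array([0.0, 0.0]), centers, priors) > \
       representativeness(np.array([20.0, 20.0]), centers, priors)
```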
<p>The difference model is used to address the problem of sample redundancy in the training set. It relies mainly on a node similarity measure based on the Jaccard similarity, computed as follows:
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:mrow><mml:mtext>Sim</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>X</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext>Y</mml:mtext></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo>|</mml:mo><mml:mi mathvariant="normal">&#x0393;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>X</mml:mtext></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2229;</mml:mo><mml:mi mathvariant="normal">&#x0393;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>Y</mml:mtext></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mi mathvariant="normal">&#x0393;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>X</mml:mtext></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x222A;</mml:mo><mml:mi mathvariant="normal">&#x0393;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>Y</mml:mtext></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>|</mml:mo></mml:mrow></mml:mfrac></mml:math></disp-formula></p>
<p>The Jaccard distance formula is as follows:
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:msub><mml:mrow><mml:mtext>d</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>J</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>X</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext>Y</mml:mtext></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mtext>Sim</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>X</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext>Y</mml:mtext></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>For an unlabelled sample <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:msub><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula>, the minimum Jaccard distance d(<inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:msub><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula>) between it and all currently labeled samples is used to measure the difference between the samples:
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:mrow><mml:mtext>d</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:munder><mml:mo movablelimits="true" form="prefix">min</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mtext>U</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>j</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mtext>L</mml:mtext></mml:mrow></mml:mrow></mml:munder><mml:msub><mml:mrow><mml:mtext>d</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>J</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>j</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>The larger <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:mrow><mml:mtext>d</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> is, the farther the unlabelled sample is from the currently labeled samples and the greater its difference from them.</p>
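The difference score of Eqs. (6)&#x2013;(8) can be sketched as follows, treating each sample as a set of tokens; the configuration-style token lists are illustrative, not from the paper&#x2019;s dataset:

```python
def jaccard_distance(x_tokens, y_tokens):
    """Jaccard distance of Eq. (7), with each sample viewed as its token set."""
    x, y = set(x_tokens), set(y_tokens)
    union = x | y
    if not union:
        return 0.0
    return 1.0 - len(x & y) / len(union)   # 1 - Sim(X, Y), Eq. (6)

def difference(sample, labeled_samples):
    """Difference score d(x_i) of Eq. (8): the minimum Jaccard distance
    from an unlabelled sample to every currently labeled sample."""
    return min(jaccard_distance(sample, s) for s in labeled_samples)

# A near-duplicate of a labeled sample scores 0 (redundant); a sample
# sharing no tokens with the labeled set scores 1 (maximally different).
labeled = [["ip", "address", "10.0.0.1"], ["vlan", "20"]]
assert difference(["ip", "address", "10.0.0.1"], labeled) == 0.0
assert difference(["ospf", "area", "0"], labeled) == 1.0
```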
<p>The maximum influence model&#x2019;s formula is as follows:
<disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:mrow><mml:mtext>z</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mtext>a</mml:mtext></mml:mrow><mml:mo>&#x2217;</mml:mo><mml:mrow><mml:mtext>w</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mtext>b</mml:mtext></mml:mrow><mml:mo>&#x2217;</mml:mo><mml:mrow><mml:mtext>d</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:mrow><mml:mtext>a</mml:mtext></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:mrow><mml:mtext>b</mml:mtext></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:mrow><mml:mtext>a</mml:mtext></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mtext>b</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>, and a and b are the weights.</p>
<p>Finally, EQBMIAL combines the maximization of influence with EQB, where the maximization of influence consists of the representative model and the difference model. The final EQBMIAL score is g(<inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:msub><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, computed as follows:
<disp-formula id="eqn-10"><label>(10)</label><mml:math id="mml-eqn-10" display="block"><mml:mrow><mml:mtext>g</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mtext>c</mml:mtext></mml:mrow><mml:mo>&#x2217;</mml:mo><mml:mrow><mml:mtext>z</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mtext>d</mml:mtext></mml:mrow><mml:mo>&#x2217;</mml:mo><mml:msup><mml:mrow><mml:mtext>H</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>BAG</mml:mtext></mml:mrow></mml:mrow></mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:mrow><mml:mtext>c</mml:mtext></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:mrow><mml:mtext>d</mml:mtext></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:mrow><mml:mtext>c</mml:mtext></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mtext>d</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>, and c and d are the weights.</p>
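Putting the pieces together, Eqs. (9) and (10) reduce to two weighted sums. A minimal sketch; the default weights below are placeholders, not tuned values from the paper:

```python
def influence(w_score, d_score, a=0.5, b=0.5):
    """Influence z(x_i) of Eq. (9): a weighted sum of the representative
    score w(x_i) and the difference score d(x_i), with a + b = 1."""
    assert abs(a + b - 1.0) < 1e-9
    return a * w_score + b * d_score

def eqbmial_score(w_score, d_score, entropy, c=0.5, d=0.5):
    """Final EQBMIAL score g(x_i) of Eq. (10): a weighted sum of the
    influence term z(x_i) and the EQB vote entropy H^BAG(x_i)."""
    assert abs(c + d - 1.0) < 1e-9
    return c * influence(w_score, d_score) + d * entropy

# z = 0.5*0.4 + 0.5*0.6 = 0.5; g = 0.5*0.5 + 0.5*1.0 = 0.75
assert abs(eqbmial_score(w_score=0.4, d_score=0.6, entropy=1.0) - 0.75) < 1e-12
```

The unlabelled samples with the highest g(x_i) are the ones sent for manual annotation in each round.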
<p>The EQB flow is as follows:</p>
<fig id="fig-9">
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_45807-fig-9.tif"/>
</fig>
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>Transformer with Improved Multi-Head Self-Attention Based on Adaptive Weighting Idea</title>
<p>This section is divided into four subparts. In <xref ref-type="sec" rid="s3_3_1">Section 3.3.1</xref>, the structure of the entity extraction model and the functions of each layer are introduced. <xref ref-type="sec" rid="s3_3_2">Section 3.3.2</xref> explains the operation of the existing multi-head attention mechanism. <xref ref-type="sec" rid="s3_3_3">Section 3.3.3</xref> proposes the Adaptive Weighting Mechanism to enhance the multi-head attention mechanism further. <xref ref-type="sec" rid="s3_3_4">Section 3.3.4</xref> provides a detailed description of how the Adaptive Weighting Mechanism is utilized to enhance the multi-head attention of the transformer.</p>
<sec id="s3_3_1">
<label>3.3.1</label>
<title>Entity Extraction Model</title>
<p>The essence of entity extraction lies in constructing a model; this article adopts a deep learning approach to identify entities. <xref ref-type="fig" rid="fig-2">Fig. 2</xref> below presents the entity extraction method devised in this study.</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>The model structure diagram of entity extraction</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_45807-fig-2.tif"/>
</fig>
<p>Input layer: This layer serves as the entry point and is responsible for receiving incoming text sequences <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:mrow><mml:mtext>O</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mrow><mml:mtext>w</mml:mtext></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mtext>w</mml:mtext></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mtext>w</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>n</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> as the input.</p>
<p>Embedding layer: This layer converts text into a format recognizable and computable by the model, providing enhanced text representation and processing capabilities. It does this by representing each word with a low-dimensional vector, where words with similar meanings occupy nearby positions in the vector space. In this paper, a CNN is used in the embedding layer: the local receptive field and feature extraction capabilities of the CNN transform the text data into meaningful embedding representations, providing richer features for subsequent tasks. After the text sequence <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:mrow><mml:mtext>O</mml:mtext></mml:mrow></mml:math></inline-formula> is entered, each word is converted into a pre-trained word vector, forming a matrix <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:mi>E</mml:mi><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mrow><mml:mtext>z</mml:mtext></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mtext>z</mml:mtext></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mtext>z</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>n</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, with each row representing a word&#x2019;s embedding. Next, the embedded matrix E serves as the input for the CNN, where convolution operations with different filter sizes extract local features from various positions.
These features extracted from each convolution kernel are then pooled to generate the feature matrix <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:mrow><mml:mtext>K</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>n</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>.</p>
<p>Transformer layer: To create a thorough feature representation, the transformer layer is used to capture global context information in the input text sequence. The embedding layer provides the input for this layer <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:mrow><mml:mtext>K</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>n</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>. The multi-layer Transformer encoder uses a multi-head self-attention mechanism and a feedforward neural network for feature extraction and context modeling of text sequences. The decoder then classifies the features at each location, predicting the corresponding named entity category. The multi-head self-attention is a core component of the transformer that helps capture contextual information and entity relationships within the input sequence in the model. The adaptive Weighting mechanism is introduced in this paper to improve the multi-head self-attention mechanism of the transformer. This strategy uses Laplace mixture principles to refine how the distribution is expressed within the attention mechanism. Additionally, it considers the entire text sequence and the significance of every word in the phrase. This enhancement empowers the model to achieve improved performance. 
The output of the transformer layer is <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:mrow><mml:mtext>ST</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mrow><mml:mtext>st</mml:mtext></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mtext>st</mml:mtext></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mtext>st</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>n</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, where <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:msub><mml:mrow><mml:mtext>st</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> stands for the attention vector for the present word.</p>
<p>Sequence Labeling layer: The sequence labeling layer uses a CRF to address the sequence labeling problem by modeling the target label sequence jointly with the observation sequence. Taking <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:mrow><mml:mtext>ST</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mrow><mml:mtext>st</mml:mtext></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mtext>st</mml:mtext></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mtext>st</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>n</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, the output of the transformer layer, as input, it predicts the probability of each character&#x2019;s label from the surrounding context. With training, the CRF model learns to accurately predict BIO labels for word sequences based on context and label dependencies.
The label sequences <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:mrow><mml:mtext>F</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mrow><mml:mtext>y</mml:mtext></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mtext>y</mml:mtext></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mtext>y</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>n</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> are then produced and <inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:msub><mml:mrow><mml:mtext>y</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> denotes the BIO label of the present word.</p>
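For illustration, grouping the BIO labels emitted by the CRF layer back into entity spans can be sketched as follows; the tag names and tokens are hypothetical examples, not the paper&#x2019;s tag set:

```python
def decode_bio(tokens, labels):
    """Group BIO labels into (entity_type, text) spans: a B- tag opens a
    span, matching I- tags extend it, and O (or a mismatched tag) closes it."""
    entities, current = [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if current:
                entities.append(current)
            current = (lab[2:], [tok])
        elif lab.startswith("I-") and current and current[0] == lab[2:]:
            current[1].append(tok)
        else:
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(t, " ".join(ws)) for t, ws in entities]

tokens = ["set", "interface", "ge-0/0/1", "address", "10.0.0.1"]
labels = ["O", "B-INTF", "I-INTF", "O", "B-IP"]
assert decode_bio(tokens, labels) == [("INTF", "interface ge-0/0/1"),
                                      ("IP", "10.0.0.1")]
```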
<p>Output layer: This layer outputs the label sequence produced by the sequence labeling layer.</p>
</sec>
<sec id="s3_3_2">
<label>3.3.2</label>
<title>Multi-Head Self-Attention Mechanism</title>
<p>In this model, the transformer layer relies on a multi-head self-attention mechanism as a crucial component. Because not every word carries equal importance for correctly identifying entities in named entity recognition, the attention mechanism plays a key role in network configuration entity recognition. Attention functions as an addressing process: based on a task-specific query vector Q, it distributes attention over the keys and associates it with the corresponding values to produce the attention value. This can be seen as a form of attention-based neural network that reduces complexity by feeding only task-relevant information into the network, rather than processing all N inputs.</p>
<p>The MHA operation process is shown in the blue box in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>. The formula for the original MHA process has been elaborated in [<xref ref-type="bibr" rid="ref-23">23</xref>], so it will not be reiterated here. To begin, the input sequence undergoes linear transformations using three trainable weight matrices, yielding vector representations for query (Q), key (K), and value (V). Subsequently, for each attention head, the dot product of Q and K across all positions is computed, followed by scaling and applying the softmax function to derive position weight distributions. These position weights are then applied to the V vectors, resulting in a weighted average at each position and producing the output for each head. Lastly, the outputs generated by multiple attention heads are concatenated and undergo an additional linear transformation to yield the final multi-head attention output.</p>
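The four steps above (linear projections, scaled dot product, softmax weighting, concatenation plus output projection) can be sketched in NumPy. This is the standard MHA formulation of [23], not the paper&#x2019;s adaptive-weighted variant:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Standard multi-head self-attention.

    X: (n, d_model) input sequence; Wq/Wk/Wv/Wo: (d_model, d_model)
    trainable matrices. Each head attends over a d_model // n_heads
    slice of Q, K, V."""
    n, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                     # linear projections
    heads = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)   # scaled dot product
        heads.append(softmax(scores) @ V[:, s])          # weighted average of V
    return np.concatenate(heads, axis=1) @ Wo            # concat + output proj.

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W = [rng.normal(size=(8, 8)) for _ in range(4)]
out = multi_head_attention(X, *W, n_heads=2)
assert out.shape == (5, 8)   # one attention vector per input position
```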
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>Adaptive weighted multi-attention mechanism structure diagram</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_45807-fig-3.tif"/>
</fig>
</sec>
<sec id="s3_3_3">
<label>3.3.3</label>
<title>Adaptive Weighting Mechanism</title>
<p>In order to further improve the ability of the multi-head self-attention mechanism to capture context information and the robustness of the model in the transformer, this paper proposes an adaptive weighting method based on the idea of the Laplace mixture model to realize the adaptive weight allocation.</p>
<p>The Laplacian distribution can be viewed as two exponential distributions arranged symmetrically back-to-back, and is therefore often called the double exponential distribution. Its overall shape is similar to that of the normal distribution, but its tails are heavier, so it assigns more probability to outliers and exceptional observations. A single random variable, denoted as <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow></mml:math></inline-formula>, adheres to the Laplace distribution with a mean of <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:mrow><mml:mtext>u</mml:mtext></mml:mrow></mml:math></inline-formula> and a scale parameter of <inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:mrow><mml:mtext>b</mml:mtext></mml:mrow></mml:math></inline-formula>. For a D-dimensional vector <inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow></mml:math></inline-formula> whose individual elements conform to the Laplace distribution with <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:mrow><mml:mtext>u</mml:mtext></mml:mrow></mml:math></inline-formula> as the mean vector, the probability density function can be written as follows:
<disp-formula id="eqn-11"><label>(11)</label><mml:math id="mml-eqn-11" display="block"><mml:mrow><mml:mtext>p</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mtext>u</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext>b</mml:mtext></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mn>2</mml:mn><mml:mrow><mml:mtext>b</mml:mtext></mml:mrow></mml:mrow></mml:mfrac><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mtext>u</mml:mtext></mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mtext>b</mml:mtext></mml:mrow></mml:mfrac><mml:mo>}</mml:mo></mml:mrow></mml:math></disp-formula>where u is the D-dimensional mean vector, and b is the scale parameter of this distribution.</p>
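In the one-dimensional case, Eq. (11) reduces to a one-liner; a minimal sketch:

```python
import math

def laplace_pdf(x, u, b):
    """Univariate Laplace density: p(x | u, b) = exp(-|x - u| / b) / (2 b)."""
    return math.exp(-abs(x - u) / b) / (2 * b)

peak = laplace_pdf(0.0, 0.0, 0.5)  # density at the mean is 1 / (2b) = 1.0
```

The density is symmetric about the mean and decays exponentially in |x - u|, which is the source of the heavy tails noted above.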
<p>The linear superposition of Laplace distributions can fit very complex density functions. Complex continuous density can be fitted by superposing enough Laplace distributions and adjusting their mean, scale parameters, and linear combination coefficients. The Laplace mixture model can be viewed as a model comprised of K individual Laplace models, with these K sub-models representing the mixture model&#x2019;s latent variables or hidden variables. This study considers the linear superposition of K Laplace distributions. The probability density function for this Laplace mixture distribution is as follows:
<disp-formula id="eqn-12"><label>(12)</label><mml:math id="mml-eqn-12" display="block"><mml:mrow><mml:mtext>p</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msubsup><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mrow><mml:mtext>k</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mtext>K</mml:mtext></mml:mrow></mml:mrow></mml:msubsup><mml:msub><mml:mi mathvariant="normal">&#x03A0;</mml:mi><mml:mrow><mml:mrow><mml:mtext>k</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mtext>P</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>k</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mtext>u</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>k</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mtext>b</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>k</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-37"><mml:math id="mml-ieqn-37"><mml:msub><mml:mrow><mml:mtext>p</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>k</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo 
stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mtext>u</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>k</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mtext>b</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>k</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> represents the probability density of the Laplace distribution with the parameter mean <inline-formula id="ieqn-38"><mml:math id="mml-ieqn-38"><mml:msub><mml:mrow><mml:mtext>u</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>k</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> and the scale parameter <inline-formula id="ieqn-39"><mml:math id="mml-ieqn-39"><mml:msub><mml:mrow><mml:mtext>b</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>k</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula>, and <inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:mrow><mml:mtext>p</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2265;</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mtext>p</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>k</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mtext>u</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>k</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mtext>b</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>k</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2265;</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>.</p>
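Eq. (12) can then be sketched as a weighted sum of such components; the component parameters below are arbitrary illustrations, not values from the paper:

```python
import math

def laplace_pdf(x, u, b):
    # single Laplace component: exp(-|x - u| / b) / (2 b)
    return math.exp(-abs(x - u) / b) / (2 * b)

def laplace_mixture_pdf(x, pis, means, scales):
    """Eq. (12): p(x) = sum_k pi_k * Laplace(x | u_k, b_k).
    The mixing coefficients pi_k must be non-negative and sum to 1."""
    assert abs(sum(pis) - 1.0) < 1e-9
    return sum(p * laplace_pdf(x, u, b) for p, u, b in zip(pis, means, scales))

# two components: a sharp peak at 0 and a broad one at 3
density = laplace_mixture_pdf(0.0, [0.6, 0.4], [0.0, 3.0], [0.5, 2.0])
```

Superposing enough components and adjusting their means, scales, and coefficients lets the mixture approximate very complex densities, which is the property the adaptive weighting mechanism exploits.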
</sec>
<sec id="s3_3_4">
<label>3.3.4</label>
<title>AW-MHA</title>
<p>In the Multi-Head Attention (MHA) mechanism, the model is divided into multiple attention heads that form multiple subspaces, allowing it to focus on information from various directions. While this benefits overall model training, increasing the number of attention heads can lead to expression bottlenecks, limiting the model&#x2019;s ability to capture context. To address this, this article draws inspiration from the Laplace mixture distribution and superimposes the attention weight matrices generated by the multiple attention heads to improve the distributional expressiveness of attention. The improved MHA operation process is shown in the red box in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>. After the output of each head is computed, the attention weight matrices generated by the attention heads are combined following the characteristics of a Laplace mixture distribution.</p>
<p>The introduction of adaptive weighting into the transformer&#x2019;s multi-head self-attention mechanism allows the model to prioritize words at different positions, focusing on positions relevant to entities and thereby improving named entity recognition accuracy. Because of the fat-tailed nature of the Laplacian distribution, the adaptive weighting mechanism based on a Laplacian mixture distribution responds more strongly to outliers. This makes the model more resilient and adaptable to variations across samples and contexts.</p>
<p>The weight matrix <inline-formula id="ieqn-41"><mml:math id="mml-ieqn-41"><mml:msub><mml:mrow><mml:mtext>A</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> calculated for each attention head is as follows:
<disp-formula id="eqn-13"><label>(13)</label><mml:math id="mml-eqn-13" display="block"><mml:msub><mml:mrow><mml:mtext>A</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mtext>Att</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mtext>Q</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mtext>K</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mtext>V</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mtext>h</mml:mtext></mml:mrow></mml:math></disp-formula></p>
<p>Combining the idea of the Laplace mixture distribution and taking the weight matrix of each attention head as a base distribution, the AW-MHA formula is as follows:
<disp-formula id="eqn-14"><label>(14)</label><mml:math id="mml-eqn-14" display="block"><mml:msub><mml:mrow><mml:mtext>AW</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>MHA</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>Q</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext>K</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext>V</mml:mtext></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>o</mml:mtext></mml:mrow></mml:mrow></mml:msup><mml:msubsup><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mrow><mml:mtext>k</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mtext>K</mml:mtext></mml:mrow></mml:mrow></mml:msubsup><mml:msub><mml:mi mathvariant="normal">&#x03A0;</mml:mi><mml:mrow><mml:mrow><mml:mtext>k</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mtext>A</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>k</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></disp-formula></p>
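Under one reading of Eqs. (13) and (14), in which each head's output serves as a base distribution, AW-MHA can be sketched as follows. The weights are random and illustrative only; in the actual model the mixture coefficients Π_k would be learned jointly with the rest of the network:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aw_mha(X, Wq, Wk, Wv, Wo, pi_logits, h):
    """AW-MHA sketch: each head's output A_i = Att(Q_i, K_i, V_i) (Eq. 13) is
    superimposed with adaptive mixture coefficients pi_k before the final
    projection W^o (Eq. 14)."""
    n, d = X.shape
    dk = d // h
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for i in range(h):
        q, k, v = (M[:, i * dk:(i + 1) * dk] for M in (Q, K, V))
        att = softmax(q @ k.T / np.sqrt(dk))       # per-head attention weights
        heads.append(att @ v)                      # Eq. (13): A_i
    pi = softmax(np.asarray(pi_logits, dtype=float))   # coefficients sum to 1
    mixed = sum(p * a for p, a in zip(pi, heads))      # mixture superposition
    return mixed @ Wo                              # Eq. (14): W^o * sum_k pi_k A_k

rng = np.random.default_rng(1)
n, d, h = 6, 8, 4
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Wo = rng.normal(size=(d // h, d))   # maps the mixed head space back to d dimensions
out = aw_mha(X, Wq, Wk, Wv, Wo, pi_logits=np.zeros(h), h=h)
```

With zero logits the coefficients are uniform and the mixture reduces to a plain average of the heads; training the logits lets the model emphasize the more informative heads.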
</sec>
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Experiments</title>
<p>This section begins by introducing the dataset, comparative algorithms, and evaluation indices employed in this paper. Subsequently, <xref ref-type="sec" rid="s4_3">Section 4.3</xref> provides a detailed analysis of the performance of the EQBMIAL and AW-MHA algorithms. In <xref ref-type="sec" rid="s4_4">Section 4.4</xref>, statistical tests are conducted on the experimental results, further illustrating the significance of this algorithm. Finally, <xref ref-type="sec" rid="s4_5">Section 4.5</xref> delves into the analysis of the algorithm&#x2019;s time complexity.</p>
<sec id="s4_1">
<label>4.1</label>
<title>Dataset and Comparison Algorithm</title>
<p>To assess the proposed algorithm&#x2019;s performance, this study utilizes two datasets: the fine-grained named entity recognition dataset ClueNER and the network device configuration dataset NetDevConfNER. ClueNER is a Chinese dataset used to identify named entities such as person names, place names, and organization names, with a training-to-test set ratio of 10:1. NetDevConfNER contains network configuration files from two vendors, provided by an internet service provider. These configuration files contain all parameter information that the devices adhere to during runtime. For this dataset, the training-to-test set ratio is approximately 2:1, encompassing 66 different data types.</p>
<p>To verify the improved active learning sample sampling method proposed in this paper, the comparison methods RANDOM, TE [<xref ref-type="bibr" rid="ref-24">24</xref>], MNLP [<xref ref-type="bibr" rid="ref-25">25</xref>], LC [<xref ref-type="bibr" rid="ref-26">26</xref>], and EQB [<xref ref-type="bibr" rid="ref-27">27</xref>] are employed. Among them, RANDOM selects sentences randomly for labeling, TE selects sentences with the highest entropy values, LC selects samples with the lowest confidence, MNLP builds on LC but uses length-normalized log-probability to express uncertainty, correcting the LC score&#x2019;s bias toward long sentences, and EQB trains a committee of classifiers on labeled samples to select the &#x201C;most inconsistent&#x201D; unlabeled samples based on voting entropy.</p>
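The uncertainty scores behind TE, LC, and MNLP can be sketched from per-token label distributions. A greedy per-token approximation of the best label sequence is assumed here for simplicity; real implementations typically score the model's decoded sequence:

```python
import math

def token_entropy(dist):
    # entropy of one token's label distribution
    return -sum(p * math.log(p) for p in dist if p > 0)

def te_score(token_dists):
    """TE: total entropy over the sentence; higher means more uncertain."""
    return sum(token_entropy(d) for d in token_dists)

def lc_score(token_dists):
    """LC: 1 - probability of the most likely labeling (greedy approximation).
    Note the bias: longer sentences multiply more factors and look less confident."""
    best = 1.0
    for d in token_dists:
        best *= max(d)
    return 1.0 - best

def mnlp_score(token_dists):
    """MNLP: log-probability of the best labeling normalized by sentence length,
    removing LC's length bias; lower (more negative) means more uncertain."""
    return sum(math.log(max(d)) for d in token_dists) / len(token_dists)
```

For example, a five-token sentence whose tokens all have confidence 0.9 gets a higher LC score than a one-token sentence with the same confidence, while their MNLP scores are identical.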
<p>To validate the entity extraction method, a comparison is conducted with four existing algorithms: BILSTM-CRF [<xref ref-type="bibr" rid="ref-28">28</xref>], Mutil_Attention-Bilstm-Crf [<xref ref-type="bibr" rid="ref-29">29</xref>], Deep_Neural_Model_NER [<xref ref-type="bibr" rid="ref-30">30</xref>] and BERT_Transformer [<xref ref-type="bibr" rid="ref-18">18</xref>]. BILSTM-CRF uses a BiLSTM network to efficiently capture bidirectional semantic dependencies in contextual sequence labeling tasks, and a conditional random field to account for the constraints and dependencies between neighboring character labels. The Mutil_Attention-Bilstm-Crf algorithm likewise combines a conditional random field with a BiLSTM network; by adding a multi-head attention mechanism, it considerably improves named entity recognition performance and captures multiple semantic features at the character, word, and sentence levels. Deep_Neural_Model_NER also combines a conditional random field with a BiLSTM network and incorporates a CNN to extract local features from the current phrase, enhancing named entity identification performance. BERT_Transformer first applies a semi-supervised entity-enhanced BERT method for pre-training, then integrates entity information into BERT through the CharEntity-Transformer, and finally performs entity classification for Chinese entities. The suggested approach is evaluated against these established algorithms.</p>
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Evaluation Index</title>
<p>In this part, the effectiveness of entity extraction is assessed using several measures. Treating entity extraction as a multi-classification problem, precision, recall, and F1-score are used to evaluate the predictive ability of the presented algorithm. Precision is the proportion of predicted positive instances that are actually positive, while recall is the proportion of actual positive instances that are correctly predicted. The F1-score combines precision and recall, balancing the trade-off between them. The calculation formulas for these indicators are as follows:
<disp-formula id="eqn-15"><label>(15)</label><mml:math id="mml-eqn-15" display="block"><mml:mrow><mml:mtext>Precision</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mtext>TP</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>TP</mml:mtext></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mtext>FP</mml:mtext></mml:mrow></mml:mrow></mml:mfrac></mml:math></disp-formula>
<disp-formula id="eqn-16"><label>(16)</label><mml:math id="mml-eqn-16" display="block"><mml:mrow><mml:mtext>Recall</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mtext>TP</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>TP</mml:mtext></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mtext>FN</mml:mtext></mml:mrow></mml:mrow></mml:mfrac></mml:math></disp-formula>
<disp-formula id="eqn-17"><label>(17)</label><mml:math id="mml-eqn-17" display="block"><mml:mrow><mml:mtext>F</mml:mtext></mml:mrow><mml:mn>1</mml:mn><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:mtext>Precision</mml:mtext></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:mtext>Recall</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mtext>Precision</mml:mtext></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mtext>Recall</mml:mtext></mml:mrow></mml:mrow></mml:mfrac></mml:math></disp-formula></p>
<p>TP denotes true positives, positive instances correctly predicted as positive; TN denotes true negatives, negative instances correctly predicted as negative; FP denotes false positives, negative instances mistakenly predicted as positive; and FN denotes false negatives, positive instances mistakenly predicted as negative. These metrics help assess how accurately the algorithm extracts entities.</p>
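Eqs. (15) to (17) can be checked with a few lines of Python; the counts below are illustrative values, not results from the paper:

```python
def prf1(tp, fp, fn):
    """Eqs. (15)-(17): precision, recall, and F1-score from raw counts."""
    precision = tp / (tp + fp)                            # Eq. (15)
    recall = tp / (tp + fn)                               # Eq. (16)
    f1 = 2 * precision * recall / (precision + recall)    # Eq. (17)
    return precision, recall, f1

# e.g. 80 correct entities, 20 spurious, 20 missed
p, r, f = prf1(tp=80, fp=20, fn=20)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.8 0.8 0.8
```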
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Result Analysis</title>
<p>In this section, the results of the sample training set sampling technique and entity extraction are analyzed. Comparisons are made between the suggested algorithm and various other algorithms to demonstrate its superior performance.</p>
<p>The parameter values for the algorithms in this paper were determined through a combination of empirical knowledge and multiple iterations to select the best-performing parameter configurations. During the process of parameter selection, this paper relied on prior research and domain expertise to initially define the parameter ranges. Subsequently, a series of experiments were conducted, testing various combinations of parameters, and ultimately selecting the parameter settings that exhibited the best performance under the experimental conditions.</p>
<p>During the experiment, a sensitivity analysis was conducted on the critical parameters of the network configuration entity extraction model. Initially, various word embedding dimensions, including 64, 128, and 256, were tested. The results revealed that higher dimensions led to improved performance but also incurred higher computational costs. Consequently, this paper opted for a word embedding dimension of 128. Subsequently, parameters of the CNN, such as kernel size and the number of convolutional layers, were explored. Ultimately, a kernel size of <inline-formula id="ieqn-42"><mml:math id="mml-ieqn-42"><mml:mn>3</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn></mml:math></inline-formula> and the inclusion of 2 convolutional layers yielded the best results. Following that, we examined the transformer&#x2019;s layer count and the number of attention heads. Experiments were conducted with layer counts of 1, 2, and 4, demonstrating that higher layer counts facilitated the model in learning more information, resulting in enhanced performance at the cost of increased training time. Regarding the number of attention heads, we conducted experiments by incrementally increasing it from 4 to 20. The results showed an initial performance improvement followed by a decline. Without enhancements to the transformer, the performance plateaued at around 12 attention heads. However, due to the improvements made to the multi-head attention mechanism in this paper, the model exhibited superior performance and greater stability even when the number of attention heads reached 12, with a smaller decline in performance.</p>
<p>The enhanced active learning sample sampling approach is compared with several benchmark methods, including RANDOM, LC, MNLP, TE, and EQB, to validate it. The AW-MHA model is used in the entity extraction stage. The sampling technique is tested on both the ClueNER and NetDevConfNER datasets. In the plots, the ordinate shows the three assessment metrics (precision, recall, and F1-score), while the abscissa represents the proportion of samples chosen for manual labeling relative to the training set. Each technique is run ten times for each termination condition, and the final value is the average of the outcomes.</p>
<p><xref ref-type="fig" rid="fig-4">Figs. 4</xref> to <xref ref-type="fig" rid="fig-6">6</xref> show the outcomes of these approaches on the ClueNER dataset, while <xref ref-type="table" rid="table-2">Table 2</xref> summarizes the detailed indicator values for the NetDevConfNER dataset. These comparisons demonstrate how well the suggested algorithm performs at entity extraction.</p>
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>ClueNER dataset precision index comparison</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_45807-fig-4.tif"/>
</fig><fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>ClueNER dataset recall index comparison</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_45807-fig-5.tif"/>
</fig><fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>ClueNER dataset F1-score index comparison</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_45807-fig-6.tif"/>
</fig><table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Comparison of evaluation indicators of active learning methods on NetDevConfNER dataset (%)</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th colspan="2">Percentage of training set \ Active learning methods</th>
<th>LC</th>
<th>MNLP</th>
<th>TE</th>
<th>EQB</th>
<th>RANDOM</th>
<th>EQBMIAL</th>
</tr>
</thead>
<tbody>
<tr>
<td/>
<td>Precision</td>
<td>76.04</td>
<td>77.82</td>
<td>78.87</td>
<td>77.80</td>
<td>72.65</td>
<td>79.17</td>
</tr>
<tr>
<td>10%</td>
<td>Recall</td>
<td>79.62</td>
<td>80.14</td>
<td>80.83</td>
<td>79.31</td>
<td>73.67</td>
<td>82.55</td>
</tr>
<tr>
<td/>
<td>F1-score</td>
<td>77.79</td>
<td>78.96</td>
<td>79.84</td>
<td>78.55</td>
<td>73.16</td>
<td>80.82</td>
</tr>
<tr>
<td/>
<td>Precision</td>
<td>79.55</td>
<td>80.17</td>
<td>84.73</td>
<td>84.54</td>
<td>76.07</td>
<td>87.06</td>
</tr>
<tr>
<td>20%</td>
<td>Recall</td>
<td>80.78</td>
<td>81.88</td>
<td>84.21</td>
<td>83.91</td>
<td>78.59</td>
<td>87.56</td>
</tr>
<tr>
<td/>
<td>F1-score</td>
<td>80.16</td>
<td>81.02</td>
<td>84.47</td>
<td>84.22</td>
<td>77.31</td>
<td>87.31</td>
</tr>
<tr>
<td/>
<td>Precision</td>
<td>83.02</td>
<td>84.75</td>
<td>86.03</td>
<td>85.91</td>
<td>81.80</td>
<td>89.14</td>
</tr>
<tr>
<td>30%</td>
<td>Recall</td>
<td>82.97</td>
<td>83.42</td>
<td>86.73</td>
<td>85.55</td>
<td>80.12</td>
<td>90.93</td>
</tr>
<tr>
<td/>
<td>F1-score</td>
<td>82.99</td>
<td>84.08</td>
<td>86.38</td>
<td>85.73</td>
<td>80.95</td>
<td>90.03</td>
</tr>
<tr>
<td/>
<td>Precision</td>
<td>85.07</td>
<td>86.07</td>
<td>90.56</td>
<td>89.12</td>
<td>85.19</td>
<td>93.74</td>
</tr>
<tr>
<td>40%</td>
<td>Recall</td>
<td>84.24</td>
<td>85.56</td>
<td>88.13</td>
<td>87.02</td>
<td>83.09</td>
<td>92.68</td>
</tr>
<tr>
<td/>
<td>F1-score</td>
<td>84.65</td>
<td>85.81</td>
<td>89.33</td>
<td>88.06</td>
<td>84.13</td>
<td>93.21</td>
</tr>
<tr>
<td/>
<td>Precision</td>
<td>90.83</td>
<td>91.06</td>
<td>93.96</td>
<td>92.23</td>
<td>88.80</td>
<td>95.75</td>
</tr>
<tr>
<td>50%</td>
<td>Recall</td>
<td>88.69</td>
<td>90.35</td>
<td>92.55</td>
<td>89.48</td>
<td>86.15</td>
<td>94.91</td>
</tr>
<tr>
<td/>
<td>F1-score</td>
<td>89.75</td>
<td>90.70</td>
<td>93.25</td>
<td>90.83</td>
<td>87.45</td>
<td>95.30</td>
</tr>
</tbody>
</table>
</table-wrap>
<p><xref ref-type="fig" rid="fig-4">Fig. 4</xref> shows that the EQBMIAL method achieves better precision than the other comparison methods. When the percentage of the training set reaches 10%, 20%, 30%, 40%, and 50%, the EQBMIAL method in this paper improves on the optimal TE method by 1.87%, 2.36%, 2.1%, 0.45%, and 0.2%, respectively. Comparing the number of samples required for a given precision, the optimal TE method among the comparison algorithms requires at least 50% of the samples to achieve 75% precision, while the EQBMIAL method proposed in this paper needs only 40% of the samples to achieve it.</p>
<p><xref ref-type="fig" rid="fig-5">Fig. 5</xref> shows that the EQBMIAL method achieves better recall than the other comparison algorithms. When the percentage of the training set reaches 10%, 20%, 30%, 40%, and 50%, the EQBMIAL method proposed in this paper improves on the optimal TE method by 7.14%, 4.08%, 0.34%, 3.5%, and 0.87%, respectively. Comparing the number of samples required for a given recall rate, the optimal TE method among the comparison algorithms requires at least 50% of the samples to achieve a recall rate of 74%, while the EQBMIAL method put forward in this article needs only 40% of the samples to reach 74%.</p>
<p><xref ref-type="fig" rid="fig-6">Fig. 6</xref> demonstrates that on the F1-score index, the EQBMIAL approach is preferable to the other comparison methods. When the percentage of the training set reaches 10%, 20%, 30%, 40%, and 50%, the EQBMIAL method proposed in this paper improves on the optimal TE method by 5.21%, 3.29%, 2.13%, 2.42%, and 0.88%, respectively. Comparing the number of samples required for a given F1-score, the optimal TE method among the comparison algorithms requires at least 50% of the samples to achieve an F1-score of 74%, while the EQBMIAL method proposed in this paper needs only 40% of the samples to achieve a 74% F1-score.</p>
<p>The indicators of the method on the NetDevConfNER dataset are shown in <xref ref-type="table" rid="table-2">Table 2</xref>.</p>

<p>In terms of recall, the EQBMIAL method outperforms the other comparative techniques. When the training set&#x2019;s percentage hits 10%, 20%, 30%, 40%, and 50%, the EQBMIAL method proposed in this paper improves on the optimal TE method by 1.72%, 3.35%, 4.2%, 4.55%, and 2.36%, respectively. Comparing the number of samples required for a given recall rate, the optimal TE method among the comparison algorithms requires at least 50% of the samples to achieve a recall rate of 92%, while the EQBMIAL method proposed in this paper needs only 40% of the samples to reach a 92% recall rate.</p>
<p><xref ref-type="table" rid="table-2">Table 2</xref> shows that the EQBMIAL method outperforms the other comparison methods on the F1-score index. When the percentage of the training set reaches 10%, 20%, 30%, 40%, and 50%, the EQBMIAL method proposed in this paper improves on the optimal TE method by 0.98%, 2.84%, 3.65%, 3.88%, and 2.05%, respectively. Comparing the number of samples required for a given F1-score, the optimal TE method among the comparison algorithms requires at least 50% of the samples to achieve an F1-score of 93%, while the EQBMIAL method proposed in this paper needs only 40% of the samples to achieve a 93% F1-score.</p>

<p>It is evident that, on the ClueNER and NetDevConfNER datasets, the EQBMIAL approach suggested in this research greatly surpasses the other comparison methods. The entity extraction approach suggested in this paper is then further validated through comparison with the baseline algorithms BILSTM-CRF, Mutil_Attention-Bilstm-Crf, Deep_Neural_Model_NER, and BERT_Transformer. The evaluation compares precision, recall, and F1-score on both the ClueNER and NetDevConfNER datasets.</p>
<p>Next, the performance metrics of the algorithm on the ClueNER and NetDevConfNER datasets are summarized. Compared with the chosen benchmark algorithms, these findings attest to the effectiveness and superiority of the proposed entity extraction algorithm.</p>
<p><xref ref-type="fig" rid="fig-7">Figs. 7</xref> and <xref ref-type="fig" rid="fig-8">8</xref> show the comparison of indicators of various entity extraction on the ClueNER dataset and the NetDevConfNER dataset.</p>
<fig id="fig-7">
<label>Figure 7</label>
<caption>
<title>Comparison of algorithm indicators on the ClueNER dataset</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_45807-fig-7.tif"/>
</fig><fig id="fig-8">
<label>Figure 8</label>
<caption>
<title>Comparison of algorithm indicators on the NetDevConfNER dataset</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_45807-fig-8.tif"/>
</fig>
<p>According to <xref ref-type="fig" rid="fig-7">Fig. 7</xref>, on the ClueNER dataset, the precision index values for BILSTM-CRF, Mutil_Attention-Bilstm-Crf, Deep_Neural_Model_NER, and BERT_Transformer are 70.5%, 74.01%, 74.82%, and 74.89%, respectively. In contrast, the precision achieved by the method put forward in this article is 76.21%, surpassing the best comparison algorithm, BERT_Transformer, by 1.32%. Regarding recall, the proposed method reaches 72.25%, outperforming the best comparison algorithm, Mutil_Attention-Bilstm-Crf, by 0.49%. Furthermore, the comparison algorithms BILSTM-CRF, Mutil_Attention-Bilstm-Crf, Deep_Neural_Model_NER, and BERT_Transformer achieve F1-score values of 69.87%, 72.86%, 72.74%, and 72.51%, respectively, whereas the proposed method achieves an F1-score of 74.18%, exceeding the BILSTM-CRF algorithm by 4.31%. These results indicate that the algorithm put forward in this paper achieves superior precision, recall, and F1-score compared with the selected comparison algorithms, making it a more effective solution for entity extraction tasks on the ClueNER dataset.</p>

<p><xref ref-type="fig" rid="fig-8">Fig. 8</xref> illustrates that the overall performance of the algorithms on the NetDevConfNER dataset is notably better than on the ClueNER dataset. The precision indicators vary only slightly, with BILSTM-CRF having the lowest precision at 86.74% and AW-MHA the highest at 98.43%. The suggested approach outperforms the leading Mutil_Attention-Bilstm-Crf algorithm by a small margin of 0.74%. Regarding recall, the proposed algorithm achieves 98.01%, outperforming the Mutil_Attention-Bilstm-Crf algorithm by 0.14%. Regarding the F1-score, the BILSTM-CRF, Mutil_Attention-Bilstm-Crf, Deep_Neural_Model_NER, and BERT_Transformer algorithms respectively achieve 84.21%, 97.76%, 94.55%, and 93.25%, whereas the proposed method achieves 98.22%, exceeding the Mutil_Attention-Bilstm-Crf algorithm by 0.46%. Because the BERT_Transformer algorithm focuses on Chinese entity recognition and is pre-trained with a Chinese character-based BERT model, its performance on the network configuration dataset may be lower. These results indicate that the method suggested in this paper achieves superior precision, recall, and F1-score compared with the selected comparison algorithms, demonstrating its effectiveness for entity extraction tasks on the NetDevConfNER dataset.</p>
</sec>
<sec id="s4_4">
<label>4.4</label>
<title>Statistical Results</title>
<p>In order to validate that the algorithm proposed in this paper exhibits significant differences compared to other algorithms in the task of network configuration entity extraction, the Kruskal-Wallis test is used for verification. Before conducting the statistical test, performance metrics for different algorithms are first grouped according to the algorithms. A significance level of 0.05 is set. In this test, the null hypothesis states that there is no significant difference in performance between the algorithm proposed in this paper and the other algorithms.</p>
<p>First, a Kruskal-Wallis test was conducted on the EQBMIAL algorithm proposed in this paper with a training set percentage of 50%. The calculated <italic>p</italic>-value for the CLUENER dataset was 0.0215, and for the NetDevConfNER dataset, it was 0.0114. On both datasets, the <italic>p</italic>-value is less than 0.05, leading to rejection of the null hypothesis. Therefore, the performance advantage of the EQBMIAL algorithm is statistically significant on these two datasets.</p>
<p>Next, a Kruskal-Wallis test was conducted on the AW-MHA method. For the CLUENER dataset, the calculated <italic>p</italic>-value was 0.1586, indicating that this algorithm does not exhibit a significant difference from the other algorithms. For the NetDevConfNER dataset, however, the calculated <italic>p</italic>-value was 0.0090, leading to rejection of the null hypothesis. Although the algorithm does not significantly outperform the others on the CLUENER dataset, its excellent performance on the network configuration dataset demonstrates that it is significantly better suited to the task of network configuration entity extraction.</p>
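As a sketch of the test procedure, the Kruskal-Wallis H statistic can be computed directly. This simplified version omits the tie correction and assumes distinct values; the score groups below are made-up illustrations, not the paper's measurements:

```python
def kruskal_h(*groups):
    """Kruskal-Wallis H statistic (no tie correction; assumes distinct values):
    H = 12 / (N (N + 1)) * sum_i n_i * (Rbar_i - (N + 1) / 2) ** 2,
    where Rbar_i is the mean rank of group i in the pooled sample."""
    pooled = sorted(x for g in groups for x in g)
    rank = {v: i + 1 for i, v in enumerate(pooled)}
    n = len(pooled)
    h = sum(len(g) * ((sum(rank[x] for x in g) / len(g)) - (n + 1) / 2) ** 2
            for g in groups)
    return 12.0 / (n * (n + 1)) * h

# three well-separated score groups; H exceeds the chi-square critical value
# 5.991 (df = 2, alpha = 0.05), so the null of equal performance is rejected
H = kruskal_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
```

In practice the p-value is obtained by comparing H against the chi-square distribution with (number of groups - 1) degrees of freedom, e.g. via a statistics library.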
</sec>
<sec id="s4_5">
<label>4.5</label>
<title>Time Complexity Analysis</title>
<p>This section analyzes the time complexity of the entity extraction stage. Among the components of the entity extraction stage, the most computationally demanding is the enhanced transformer model, so we focus on its time complexity. In this paper, the enhanced transformer layer incorporates an adaptive weighting mechanism based on the Laplace mixture distribution into the transformer&#x2019;s multi-head attention mechanism. Although this mechanism introduces some additional computational overhead, the overall time complexity remains dominated by the multi-head attention mechanism itself. The time complexity of this part of the model can therefore be expressed as <inline-formula id="ieqn-43"><mml:math id="mml-ieqn-43"><mml:mrow><mml:mtext>O</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mrow><mml:mtext>n</mml:mtext></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mrow><mml:mtext>d</mml:mtext></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mtext>n</mml:mtext></mml:mrow><mml:msup><mml:mrow><mml:mtext>d</mml:mtext></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, where n is the length of the input sequence and d is the dimension of the word embeddings.</p>
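The two terms of this bound can be read directly off a plain scaled dot-product attention computation: the input projections cost O(nd&#x00B2;) and the attention matrix costs O(n&#x00B2;d). The NumPy sketch below illustrates standard single-head attention only; it does not include the Laplace-mixture adaptive weighting proposed in this paper, and all variable names are illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V have shape (n, d).
    # Q @ K.T builds an (n, n) score matrix: O(n^2 d).
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # Row-wise softmax over the (n, n) scores: O(n^2).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # weights @ V maps back to (n, d): O(n^2 d).
    return weights @ V

n, d = 8, 4
rng = np.random.default_rng(0)
x = rng.standard_normal((n, d))
# The (d, d) input projections contribute the O(n d^2) term.
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)  # shape (n, d)
```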
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Conclusion</title>
<p>This paper focuses on entity extraction using active learning and a multi-head self-attention mechanism with adaptive weighting. It starts with a review of relevant literature and introduces the concept of Laplace mixture distribution to address issues in the extraction model. The paper also presents EQBMIAL, an active learning method designed to handle outliers and redundancy in the training set. Additionally, AW-MHA is proposed to overcome challenges arising from the increasing number of attention heads.</p>
<p>Simulation experiments were conducted to compare the proposed algorithm with other models such as RANDOM, LC, MNLP, TE, and EQB. The results demonstrate significant improvements across various evaluation metrics, particularly achieving a precision of 98.32% on the NetDevConfNER dataset. This algorithm outperforms other models in network configuration entity recognition, highlighting its superior performance in current entity recognition tasks.</p>
<p>However, it is crucial to consider adversarial attacks when developing network configuration entity recognition models. These models are vulnerable to intentional manipulation of input data by attackers, leading to incorrect outputs. Therefore, enhancing the model&#x2019;s resilience against such attacks becomes essential. One approach to achieve this is through adversarial training, which can improve the model&#x2019;s robustness.</p>
</sec>
</body>
<back>
<ack>
<p>The authors thank the anonymous reviewers for their careful reading and valuable suggestions. Their professional opinions and suggestions have enabled us to explore and improve the research more comprehensively. In addition, we sincerely thank all the members involved in this paper for their support and valuable comments. Our collaboration has made this research richer and more meaningful.</p>
</ack>
<sec><title>Funding Statement</title>
<p>This work is supported by the National Key R&#x0026;D Program of China (2019YFB2103202).</p>
</sec>
<sec><title>Author Contributions</title>
<p>The authors confirm contribution to the paper as follows: study conception and design: Yang Yang, Zhenying Qu, Zefan Yan; data collection: Zefan Yan, Zhenying Qu, Ti Wang; analysis and interpretation of results: Yang Yang, Zhenying Qu, Zefan Yan, Zhipeng Gao; draft manuscript preparation: Zefan Yan, Zhenying Qu, Yang Yang. All authors reviewed the results and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability"><title>Availability of Data and Materials</title>
<p>The ClueNER dataset can be accessed through &#x201C;<ext-link ext-link-type="uri" xlink:href="https://github.com/CLUEbenchmark/CLUENER2020">https://github.com/CLUEbenchmark/CLUENER2020</ext-link>&#x201D;. The NetDevConfNER dataset cannot be made publicly accessible due to proprietary restrictions.</p>
</sec>
<sec sec-type="COI-statement"><title>Conflicts of Interest</title>
<p>The authors declare that they have no conflicts of interest to report regarding the present study.</p>
</sec>
</back></article>