<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">74138</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2025.074138</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>A Unified Feature Selection Framework Combining Mutual Information and Regression Optimization for Multi-Label Learning</article-title>
<alt-title alt-title-type="left-running-head">A Unified Feature Selection Framework Combining Mutual Information and Regression Optimization for Multi-Label Learning</alt-title>
<alt-title alt-title-type="right-running-head">A Unified Feature Selection Framework Combining Mutual Information and Regression Optimization for Multi-Label Learning</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Lim</surname><given-names>Hyunki</given-names></name><xref rid="cor1" ref-type="corresp">&#x002A;</xref><email>hlim20@kyonggi.ac.kr</email></contrib>
<aff id="aff-1"><institution>Division of AI Computer Science and Engineering, Kyonggi University, Gwanggyosan-Ro, Yeongtong-Gu</institution>, <addr-line>Suwon-Si, 16227, Gyeonggi-Do</addr-line>, <country>Republic of Korea</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Hyunki Lim. Email: <email>hlim20@kyonggi.ac.kr</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2026</year>
</pub-date>
<pub-date date-type="pub" publication-format="electronic">
<day>10</day><month>2</month><year>2026</year>
</pub-date>
<volume>87</volume>
<issue>1</issue>
<elocation-id>51</elocation-id>
<history>
<date date-type="received">
<day>03</day>
<month>10</month>
<year>2025</year>
</date>
<date date-type="accepted">
<day>27</day>
<month>11</month>
<year>2025</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2026 The Author.</copyright-statement>
<copyright-year>2026</copyright-year>
<copyright-holder>Published by Tech Science Press.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_74138.pdf"></self-uri>
<abstract>
<p>High-dimensional data poses difficulties for machine learning because of its long processing times and large memory requirements. In a multi-label environment in particular, the complexity grows further with the number of labels. Moreover, an optimization problem that fully considers all dependencies between features and labels is difficult to solve. In this study, we propose a novel regression-based multi-label feature selection method that integrates mutual information to better exploit the underlying data structure. By incorporating mutual information into the regression formulation, the model captures not only linear relationships but also complex non-linear dependencies. The proposed objective function simultaneously considers three types of relationships: (1) feature redundancy, (2) feature-label relevance, and (3) inter-label dependency. All three quantities are computed using mutual information, allowing the proposed formulation to capture nonlinear dependencies among variables. These relationships are key factors in multi-label feature selection, and our method expresses them within a unified formulation, enabling efficient optimization while simultaneously accounting for all of them. To efficiently solve the proposed optimization problem under non-negativity constraints, we develop a gradient-based optimization algorithm with fast convergence. Experimental results on seven multi-label datasets show that the proposed method outperforms existing multi-label feature selection techniques.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>feature selection</kwd>
<kwd>multi-label learning</kwd>
<kwd>regression model optimization</kwd>
<kwd>mutual information</kwd>
</kwd-group>
<funding-group>
<award-group id="awg1">
<funding-source>Basic Science Research Program through the National Research Foundation of Korea (NRF)</funding-source>
<award-id>RS-2020-NR049579</award-id>
</award-group>
</funding-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>Multi-label learning has attracted significant attention in recent years due to its ability to handle scenarios where each instance may be associated with multiple semantic labels simultaneously. It has been successfully applied across a wide range of domains and applications, including image classification [<xref ref-type="bibr" rid="ref-1">1</xref>], text classification [<xref ref-type="bibr" rid="ref-2">2</xref>], emotion recognition [<xref ref-type="bibr" rid="ref-3">3</xref>], fault diagnosis [<xref ref-type="bibr" rid="ref-4">4</xref>], and privacy protection [<xref ref-type="bibr" rid="ref-5">5</xref>]. These diverse applications demonstrate the growing importance of multi-label learning in addressing real-world problems where data exhibit complex and overlapping label structures.</p>
<p>In the fields of machine learning and pattern recognition, it is often necessary to handle high-dimensional data. Such data may contain redundant or irrelevant information that can interfere with learning. Moreover, high-dimensional datasets typically require excessive processing time and consume large amounts of memory, making them challenging to work with [<xref ref-type="bibr" rid="ref-6">6</xref>]. These issues can degrade the performance of learning algorithms and hinder practical applications. In particular, when the data involves multiple labels per instance, commonly referred to as multi-label data, the complexity of learning grows substantially. In contrast to single-label scenarios, where each sample is associated with a single target, multi-label learning must consider multiple, potentially interdependent targets simultaneously. This increased dimensionality, coupled with inter-label dependency, leads to a combinatorial explosion in both the label space and the associated computational complexity [<xref ref-type="bibr" rid="ref-7">7</xref>]. To address these challenges, many studies have introduced feature selection techniques in multi-label settings. Multi-label feature selection algorithms aim to identify and retain the most relevant features while removing unnecessary ones, based on certain evaluation criteria. The resulting feature subset can improve the accuracy of machine learning models, reduce training time, and enhance the interpretability of the data. Moreover, it helps mitigate risks such as the curse of dimensionality and overfitting [<xref ref-type="bibr" rid="ref-8">8</xref>].</p>
<p>Conventional feature selection approaches based on criteria that evaluate features independently of any specific learning model can be broadly categorized into two main streams: information-theoretic filter methods and regression-based embedded methods. The first stream, exemplified by methods such as mRMR [<xref ref-type="bibr" rid="ref-9">9</xref>], evaluates the redundancy between features and their relevance to the target using mutual information to tackle the computational intractability of exhaustive subset search. These approaches can effectively capture nonlinear dependencies and pairwise relationships. However, they typically rely on greedy selection strategies, which may fail to consider global feature interactions and often lead to suboptimal solutions. The second stream formulates feature selection as a regression optimization problem, typically minimizing an objective of the form ||<italic>XW</italic> &#x2212; <italic>&#x03D2;</italic>||, where <italic>X</italic> and <italic>&#x03D2;</italic> are the input data and label matrices and <italic>W</italic> represents the importance of each feature [<xref ref-type="bibr" rid="ref-10">10</xref>]. This framework allows feature selection to be incorporated into a global optimization procedure, thereby enhancing computational efficiency. However, such approaches inherently rely on the assumption of linear dependencies between features and labels, which constrains their capacity to model more complex relationships.</p>
<p>In this paper, we propose a unified feature selection framework that integrates information-theoretic redundancy measures into a regression-based formulation. By unifying these two complementary approaches, the proposed framework mitigates their respective limitations while leveraging their individual strengths. Specifically, we encode the mutual information relationships between features, and between features and labels into a matrix representation, which is then incorporated into the regression objective. This enables the model to retain the nonlinear dependency awareness of mutual information-based methods while benefiting from the efficient optimization capability of regression-based approaches.</p>
<p>The main contributions of this work are as follows:
<list list-type="bullet">
<list-item>
<p>Novel Objective Function: To overcome the limitations of traditional approaches in multi-label feature selection, we introduce a new objective function that merges an efficient regression-based formulation with mutual information-based criteria. The proposed function is designed to be simple yet capable of capturing the essential information for effective feature selection. Its formulation is described in detail in <xref ref-type="sec" rid="s3_1">Sections 3.1</xref> and <xref ref-type="sec" rid="s3_2">3.2</xref>.</p></list-item>
<list-item>
<p>Efficient Optimization Algorithm: We develop an efficient algorithm based on gradient descent to optimize the proposed objective function. This algorithm emphasizes fast convergence and reduced computational complexity compared to existing methods. The optimization algorithm is presented in <xref ref-type="sec" rid="s3_3">Section 3.3</xref>, the complexity analysis is provided in <xref ref-type="sec" rid="s3_4">Section 3.4</xref>, and the convergence behavior of the algorithm is illustrated in <xref ref-type="sec" rid="s4_3">Section 4.3</xref>.</p></list-item>
<list-item>
<p>Improved Classification Performance: Through experiments on seven multi-label datasets, we confirm that the proposed method achieves superior classification performance compared to conventional feature selection techniques. Comprehensive comparison experiments and statistical analyses are reported in <xref ref-type="sec" rid="s4_2">Section 4.2</xref>, accompanied by extensive figures and tables.</p></list-item>
</list></p>
<p>The remainder of this paper is organized as follows. <xref ref-type="sec" rid="s2">Section 2</xref> reviews background knowledge on multi-label classification and feature selection methods. <xref ref-type="sec" rid="s3">Section 3</xref> provides a detailed description of the proposed methodology, including the integration of mutual information and regression objectives. <xref ref-type="sec" rid="s4">Section 4</xref> presents experimental results using seven benchmark multi-label datasets and compares the performance of the proposed approach with existing methods. Finally, <xref ref-type="sec" rid="s5">Section 5</xref> concludes the study and discusses potential future research directions.</p>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Works</title>
<p>Feature selection has been widely investigated in single-label learning scenarios, where each instance is associated with only one target. In contrast, multi-label learning, which involves multiple potentially interdependent targets, introduces additional challenges due to increased dimensionality and complex label dependencies. Approaches to handling multi-label data can be broadly classified into two categories: those that transform the problem into a single-label format, and those that directly operate on the multi-label structure [<xref ref-type="bibr" rid="ref-11">11</xref>]. Converting multi-label data into a single-label format allows the direct application of traditional single-label feature selection techniques. However, this transformation often leads to information loss. One common method for this transformation is the Label Powerset (LP) approach, which enumerates all possible label combinations and treats each as a distinct class label [<xref ref-type="bibr" rid="ref-12">12</xref>]. While straightforward, LP suffers from severe class imbalance, as the number of resulting classes can grow exponentially with the number of labels. To address this, the Pruned Problem Transformation (PPT) method was introduced, which discards infrequent label combinations to mitigate imbalance [<xref ref-type="bibr" rid="ref-13">13</xref>]. However, this pruning also results in the loss of valuable label relationship information.</p>
<p>In contrast, two main feature selection approaches have been proposed that avoid the need for label transformation. The first is the information-theoretic filter approach. The pairwise multi-label utility (PMU) method is one such approach, which evaluates label correlations using mutual information without transforming the labels [<xref ref-type="bibr" rid="ref-14">14</xref>]. However, PMU tends to find only locally optimal solutions, since its greedy selection approach inherently constrains the search process and prevents the discovery of globally optimal feature subsets. Lee and Kim conducted a theoretical analysis of feature selection based on interaction information, demonstrating that lower-degree interaction information terms notably influence mutual information under an incremental selection scheme [<xref ref-type="bibr" rid="ref-15">15</xref>]. They further derived the upper and lower bounds of these terms to explain why score functions that consider lower-degree interactions can produce more effective feature subsets. The max-dependency and min-redundancy (MDMR) criterion has been proposed for multi-label feature selection, where a candidate feature is considered beneficial if it exhibits strong relevance to all class labels while remaining non-redundant with respect to other selected features across all labels [<xref ref-type="bibr" rid="ref-16">16</xref>]. However, MDMR is likewise prone to locally optimal solutions. The quadratic programming feature selection (QPFS) method reformulates mutual information-based evaluation as a numerical optimization problem to escape local optima [<xref ref-type="bibr" rid="ref-7">7</xref>]. However, QPFS requires the computation of a large number of mutual information terms, which can be computationally intensive. Zhang et al. proposed an approach that allows a feature to account for multiple labels by calculating mutual information across pairs of labels [<xref ref-type="bibr" rid="ref-17">17</xref>].</p>
<p>In another research direction, regression-based embedded methods integrate feature selection into the model training process itself. For example, decision tree classifiers naturally perform feature selection during their construction [<xref ref-type="bibr" rid="ref-18">18</xref>]. In multi-label scenarios, several embedded feature selection methods based on regression analysis have been proposed. Fan et al. utilized ridge regression to construct a selection matrix and employed the <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:msub><mml:mi>l</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>-norm to develop a multi-label feature selection framework [<xref ref-type="bibr" rid="ref-19">19</xref>]. Li et al. introduced a flexible approach that allows feature selection to vary between the <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:msub><mml:mi>l</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula>-norm and <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:msub><mml:mi>l</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula>-norm criteria [<xref ref-type="bibr" rid="ref-20">20</xref>]. Fan et al. incorporated spectral graph theory to capture label correlations while simultaneously addressing feature redundancy [<xref ref-type="bibr" rid="ref-21">21</xref>]. Li et al. designed a matrix that accounts for higher-order label correlations and proposed a multi-label feature selection method that handles feature redundancy using an <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:msub><mml:mi>l</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>-norm regularization term [<xref ref-type="bibr" rid="ref-22">22</xref>]. Hu et al. proposed a method that encodes the mutual information between features and labels into a weight matrix and constructs an objective function accordingly to perform multi-label feature selection [<xref ref-type="bibr" rid="ref-23">23</xref>]. Dai et al. proposed a method that leverages fuzzy mutual information to emphasize strongly related labels and applies conditional mutual information to guide the selection of relevant features [<xref ref-type="bibr" rid="ref-24">24</xref>]. Faraji et al. proposed a method that considers both label correlations and label imbalance by designing an objective function that identifies shared patterns between the feature and label matrices [<xref ref-type="bibr" rid="ref-25">25</xref>]. He et al. proposed a model designed to capture the variation in inter-label relationships by incorporating sample-wise correlations into the label space and applying this information to a sparse linear regression model [<xref ref-type="bibr" rid="ref-26">26</xref>]. Yang et al. proposed a novel embedded method that simultaneously considers inter-label correlations and label-specific features [<xref ref-type="bibr" rid="ref-27">27</xref>]. Their method constrains the correlation search space, learns distinctive features for each label, and incorporates a robust exploration strategy to reduce the impact of noise and outliers.</p>
<p>Another category is the wrapper approach, which identifies an optimal feature subset through iterative evaluation guided by a learning algorithm. Wrapper methods use classification performance as a direct evaluation metric to select the optimal feature subset. Optimization techniques such as boosting and genetic algorithms have been employed to improve selection quality [<xref ref-type="bibr" rid="ref-28">28</xref>]. Genetic-algorithm-based feature selection has been applied in a variety of domains [<xref ref-type="bibr" rid="ref-29">29</xref>,<xref ref-type="bibr" rid="ref-30">30</xref>]. Despite their effectiveness, wrapper methods are computationally expensive due to repeated classifier evaluations. Because they lie beyond the scope of this work, wrapper methods are not considered here.</p>
<p>Recent studies have explored logic mining as an interpretable approach to feature selection and rule extraction, primarily implemented through Discrete Hopfield Neural Networks (DHNNs). A hybrid DHNN framework with Random 2-Satisfiability rules was introduced [<xref ref-type="bibr" rid="ref-31">31</xref>], where hybrid differential evolution and swarm mutation operators were incorporated to enhance the optimization of synaptic weights and diversify neuron states during retrieval, leading to improved transparency in decision-making. Similarly, Romli et al. proposed an optimized logic mining model using higher-order Random 3-Satisfiability representations in DHNNs, designed to prevent overfitting and flexibly induce logical structures that capture the behavioral characteristics of real-world datasets [<xref ref-type="bibr" rid="ref-32">32</xref>]. Beyond logic-mining-oriented formulations, recent work has also advanced the theoretical foundations of discrete Hopfield architectures. In particular, a simplified two-neuron DHNN model was introduced, where bifurcation analysis, hyperchaotic attractor characterization, and field-programmable gate array (FPGA)-based hardware implementation demonstrated that even minimal DHNN structures can exhibit rich dynamical behaviors and robust randomness properties [<xref ref-type="bibr" rid="ref-33">33</xref>]. These findings highlight the broader modeling flexibility and dynamic expressiveness of DHNNs, complementing logic-mining-based approaches by reinforcing the underlying stability and dynamical mechanisms upon which interpretable feature-selection frameworks can be built.</p>
</sec>
<sec id="s3">
<label>3</label>
<title>Proposed Method</title>
<sec id="s3_1">
<label>3.1</label>
<title>Preliminary</title>
<p>Feature selection methods can be broadly categorized into two major streams: mutual information (MI)-based filter approaches and regression-based approaches.</p>
<p>The first stream is represented by classical methods such as mRMR (minimum redundancy maximum relevance) [<xref ref-type="bibr" rid="ref-9">9</xref>], which aim to maximize the relevance between features and labels while minimizing redundancy among selected features. These approaches rely on information-theoretic criteria, particularly mutual information, to measure pairwise dependencies between variables. They are straightforward and model-agnostic, and they can capture nonlinear relationships that linear methods may overlook. Given the full feature set <italic>F</italic> and the subset of features <italic>S</italic> that have already been selected, the next feature <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:msub><mml:mi>f</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:math></inline-formula> is chosen by maximizing the trade-off between its relevance to the target and its redundancy with the selected features, as follows:
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>arg</mml:mi><mml:mo>&#x2061;</mml:mo><mml:munder><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>&#x2208;</mml:mo><mml:mi>F</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>S</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mi>M</mml:mi><mml:mi>I</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>S</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2208;</mml:mo><mml:mi>S</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mi>M</mml:mi><mml:mi>I</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:mrow><mml:mi>M</mml:mi><mml:mi>I</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo>,</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> denotes the mutual information. However, they typically employ greedy search strategies, selecting features sequentially based on local criteria. 
As a result, they may fail to consider global interactions among features and often lead to suboptimal feature subsets.</p>
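<p>As an illustration of the greedy scheme in Eq. (1), the following is a minimal single-label sketch of mRMR-style selection. The function names and the plug-in mutual information estimator are our own illustrative choices rather than any reference implementation, and discrete-valued features are assumed.</p>

```python
import numpy as np

def mi(a, b):
    # empirical mutual information of two discrete arrays (in nats)
    va, ia = np.unique(a, return_inverse=True)
    vb, ib = np.unique(b, return_inverse=True)
    joint = np.zeros((va.size, vb.size))
    np.add.at(joint, (ia, ib), 1.0)          # joint count table
    p = joint / joint.sum()
    pa = p.sum(axis=1, keepdims=True)
    pb = p.sum(axis=0, keepdims=True)
    nz = p > 0                               # 0*log(0) = 0 convention
    return float((p[nz] * np.log(p[nz] / (pa * pb)[nz])).sum())

def mrmr_select(X, y, k):
    # greedy mRMR (Eq. (1)): maximize relevance MI(f_j, y) minus the
    # mean redundancy with the already-selected subset S
    d = X.shape[1]
    relevance = np.array([mi(X[:, j], y) for j in range(d)])
    selected = [int(np.argmax(relevance))]   # seed with the most relevant feature
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(d):
            if j in selected:
                continue
            redundancy = np.mean([mi(X[:, i], X[:, j]) for i in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected
```

<p>Note that each step re-evaluates only local pairwise terms; once a feature enters the selected set it is never reconsidered, which is exactly the source of the suboptimality discussed above.</p>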
<p>The second stream is exemplified by regression-based methods such as efficient and robust feature selection via joint <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:msub><mml:mi>l</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>-norms minimization [<xref ref-type="bibr" rid="ref-10">10</xref>], which formulate feature selection as an optimization problem. The objective function is as follows:
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:munder><mml:mo movablelimits="true" form="prefix">min</mml:mo><mml:mi>W</mml:mi></mml:munder><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>X</mml:mi><mml:mi>W</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>Y</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>&#x03B3;</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>W</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <italic>X</italic> is the data matrix and <italic>&#x03D2;</italic> is the label matrix. These approaches aim to minimize the reconstruction error while imposing structural regularization to induce sparsity across features. Such methods benefit from global optimization and can naturally incorporate feature-label dependencies into a unified objective. However, because they fundamentally rely on a linear regression model, they are often limited in capturing complex or nonlinear relationships that are essential in many real-world datasets.</p>
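<p>For concreteness, a common way to handle a non-smooth l<sub>2,1</sub> term is iterative reweighting. The sketch below is our own illustrative solver for the closely related Frobenius-loss variant min<sub><italic>W</italic></sub> ||<italic>XW</italic> &#x2212; <italic>Y</italic>||<sup>2</sup><sub>F</sub> + &#x03B3;||<italic>W</italic>||<sub>2,1</sub>, not the exact algorithm of [<xref ref-type="bibr" rid="ref-10">10</xref>]; the function name, warm start, and fixed iteration count are assumptions made for illustration. Features are then ranked by the row norms of <italic>W</italic>.</p>

```python
import numpy as np

def l21_feature_select(X, Y, gamma=1.0, k=5, n_iter=50, eps=1e-8):
    # Iteratively reweighted least squares for
    #   min_W ||XW - Y||_F^2 + gamma * ||W||_{2,1}  (Frobenius-loss variant).
    W = np.linalg.lstsq(X, Y, rcond=None)[0]          # warm start
    for _ in range(n_iter):
        row_norms = np.sqrt((W ** 2).sum(axis=1)) + eps
        D = np.diag(1.0 / (2.0 * row_norms))          # subgradient reweighting
        # closed-form update of the reweighted ridge-like problem
        W = np.linalg.solve(X.T @ X + gamma * D, X.T @ Y)
    ranking = np.argsort(-np.sqrt((W ** 2).sum(axis=1)))
    return ranking[:k], W
```

<p>The reweighting drives uninformative rows of <italic>W</italic> toward zero, so the surviving row norms provide a global feature ranking rather than a greedy one.</p>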
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Objective Function</title>
<p>Given a dataset <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:mi>X</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> consisting of <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:mi>n</mml:mi></mml:math></inline-formula> patterns and <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:mi>d</mml:mi></mml:math></inline-formula> features, along with a multi-label set <italic>&#x03D2;</italic> consisting of <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:mi>c</mml:mi></mml:math></inline-formula> labels, the basic objective function of a regression-based feature selection method can be formulated as:
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:mtable columnalign="right left" rowspacing="3pt" columnspacing="0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi></mml:mi><mml:munder><mml:mo movablelimits="true" form="prefix">min</mml:mo><mml:mi>W</mml:mi></mml:munder><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>X</mml:mi><mml:mi>W</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>Y</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msubsup><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>F</mml:mi><mml:mn>2</mml:mn></mml:msubsup></mml:mtd></mml:mtr><mml:mtr><mml:mtd /><mml:mtd><mml:mrow><mml:mtext>s.t.&#xA0;</mml:mtext></mml:mrow><mml:mi>W</mml:mi><mml:mo>&#x2265;</mml:mo><mml:mn>0</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <italic>W</italic> is a weight matrix subject to a non-negativity constraint. The non-negativity constraint ensures that selected features are represented by positive values, while unselected features are represented by zeros. The larger a row's value in <italic>W</italic>, the more important the corresponding feature is considered to be. After solving this optimization problem, features corresponding to rows with larger norms in <italic>W</italic> are selected.</p>
<p>The objective function is designed to find features that minimize the difference between the data projected by <italic>W</italic> and the multi-label targets. To construct this objective function, the Frobenius norm is used. The function is convex with respect to <italic>W</italic>, allowing the optimization problem to be solved efficiently. The top <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:mi>k</mml:mi></mml:math></inline-formula> features are selected based on the magnitude of each row in the optimized <italic>W</italic>.</p>
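<p>A minimal way to realize this procedure is projected gradient descent: take a gradient step on the Frobenius-norm loss of Eq. (3), clip <italic>W</italic> back to the non-negative orthant, and finally rank features by row norms. The sketch below is an illustrative baseline under these assumptions (step size from the Lipschitz constant, fixed iteration budget), not the full optimization algorithm developed later in this paper.</p>

```python
import numpy as np

def nonneg_regression_select(X, Y, k, n_iter=1000):
    # projected gradient descent for  min_W ||XW - Y||_F^2  s.t.  W >= 0
    d, c = X.shape[1], Y.shape[1]
    G, XtY = X.T @ X, X.T @ Y
    lr = 1.0 / (2.0 * np.linalg.norm(G, 2))   # 1/L, L = Lipschitz constant of the gradient
    W = np.full((d, c), 0.01)                 # feasible (non-negative) start
    for _ in range(n_iter):
        grad = 2.0 * (G @ W - XtY)            # gradient of the Frobenius loss
        W = np.maximum(W - lr * grad, 0.0)    # projection onto W >= 0
    scores = np.linalg.norm(W, axis=1)        # row norms rank the features
    return np.argsort(-scores)[:k], W
```

<p>Because the objective is convex and the feasible set is a convex cone, this simple scheme converges to a global minimizer; the top-<inline-formula id="ieqn-ex-1"><mml:math id="mml-ieqn-ex-1"><mml:mi>k</mml:mi></mml:math></inline-formula> rows of the returned <italic>W</italic> give the selected features.</p>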
<p>Although <xref ref-type="disp-formula" rid="eqn-3">Eq. (3)</xref> offers an efficient convex formulation, its modeling capacity is limited because it only captures linear relationships between features and labels. In multi-label learning, however, three important aspects are often overlooked in such regression-based approaches:
<list list-type="simple">
<list-item><label>1.</label><p>Feature redundancy: selecting highly correlated features reduces the generalization power [<xref ref-type="bibr" rid="ref-9">9</xref>].</p></list-item>
<list-item><label>2.</label><p>Label dependencies: multi-label data often contains significant inter-label correlations that should be preserved [<xref ref-type="bibr" rid="ref-34">34</xref>].</p></list-item>
<list-item><label>3.</label><p>Non-linear dependencies: labels may exhibit complex, non-linear relationships with features.</p></list-item>
</list></p>
<p>To address these limitations, we incorporate mutual information into the objective function. Mutual information captures the degree of statistical dependency between random variables and can reflect both linear and non-linear relationships [<xref ref-type="bibr" rid="ref-9">9</xref>,<xref ref-type="bibr" rid="ref-19">19</xref>,<xref ref-type="bibr" rid="ref-24">24</xref>,<xref ref-type="bibr" rid="ref-27">27</xref>,<xref ref-type="bibr" rid="ref-35">35</xref>]. The mutual information between two arbitrary random variables <italic>A</italic> and <italic>B</italic> is defined as follows:
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>M</mml:mi><mml:mi>I</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>A</mml:mi><mml:mo>,</mml:mo><mml:mi>B</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>a</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mi>A</mml:mi></mml:mrow></mml:munder><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>b</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mi>B</mml:mi></mml:mrow></mml:munder><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>a</mml:mi><mml:mo>,</mml:mo><mml:mi>b</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mstyle scriptlevel="0"><mml:mrow><mml:mo maxsize="2.047em" minsize="2.047em">[</mml:mo></mml:mrow></mml:mstyle><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>a</mml:mi><mml:mo>,</mml:mo><mml:mi>b</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>b</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mfrac><mml:mstyle scriptlevel="0"><mml:mrow><mml:mo maxsize="2.047em" minsize="2.047em">]</mml:mo></mml:mrow></mml:mstyle></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>a</mml:mi><mml:mo>,</mml:mo><mml:mi>b</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> represents the joint probability distribution function of <italic>A</italic> and <italic>B</italic>, while <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>b</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> represent their marginal probability distribution functions. In the case of continuous random variables, the summation is replaced by an integral:
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>M</mml:mi><mml:mi>I</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>A</mml:mi><mml:mo>,</mml:mo><mml:mi>B</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mi>B</mml:mi></mml:mrow></mml:msub><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>a</mml:mi><mml:mo>,</mml:mo><mml:mi>b</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mstyle scriptlevel="0"><mml:mrow><mml:mo maxsize="2.047em" minsize="2.047em">[</mml:mo></mml:mrow></mml:mstyle><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>a</mml:mi><mml:mo>,</mml:mo><mml:mi>b</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>b</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mfrac><mml:mstyle scriptlevel="0"><mml:mrow><mml:mo maxsize="2.047em" minsize="2.047em">]</mml:mo></mml:mrow></mml:mstyle><mml:mi>d</mml:mi><mml:mi>a</mml:mi><mml:mtext>&#xA0;</mml:mtext><mml:mi>d</mml:mi><mml:mi>b</mml:mi><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
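For discrete (e.g., binarized) variables, Eq. (4) can be computed directly from empirical frequencies. The following is a minimal Python sketch (the function name and the base-2 convention are illustrative, not part of the paper's implementation):

```python
from collections import Counter
from math import log2

def mutual_information(a, b):
    """Plug-in estimate of MI(A, B) in Eq. (4) from paired samples of two
    discrete variables, using a base-2 logarithm (result in bits)."""
    n = len(a)
    joint = Counter(zip(a, b))        # empirical joint counts of (a, b)
    ca, cb = Counter(a), Counter(b)   # empirical marginal counts
    mi = 0.0
    for (va, vb), c in joint.items():
        p_ab = c / n                        # p(a, b)
        p_a, p_b = ca[va] / n, cb[vb] / n   # p(a), p(b)
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi
```

With this base-2 convention, two identical balanced binary variables give 1 bit, and independent ones give 0, matching the [0, 1] range discussed below for binarized data.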
<p>When the <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:mi>i</mml:mi></mml:math></inline-formula>-th feature vector and the <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:mi>j</mml:mi></mml:math></inline-formula>-th label vector are denoted by <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:msub><mml:mi>y</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:math></inline-formula>, respectively, the relevance and redundancy using mutual information can be expressed as <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:mi>M</mml:mi><mml:mi>I</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and <inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:mi>M</mml:mi><mml:mi>I</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, respectively. In a multi-label environment, the relationships between labels are additionally considered [<xref ref-type="bibr" rid="ref-15">15</xref>].</p>
<p>By calculating these associations, three matrices <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:mi>Q</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:mi>R</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>c</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:mi>S</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>c</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> can be constructed, where each element in these matrices is defined as follows:
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi>Q</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mi>M</mml:mi><mml:mi>I</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mspace width="1em" /><mml:mrow><mml:mtext>(feature redundancy)</mml:mtext></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mi>M</mml:mi><mml:mi>I</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mspace width="1em" /><mml:mrow><mml:mtext>(label dependency)</mml:mtext></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mi>M</mml:mi><mml:mi>I</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mspace width="1em" /><mml:mrow><mml:mtext>(feature-label relevance)</mml:mtext></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <inline-formula id="ieqn-25"><mml:math 
id="mml-ieqn-25"><mml:msub><mml:mi>Q</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> refers to the (<inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:math></inline-formula>)-th element of <italic>Q</italic>. The elements of these matrices represent the associations among features and labels. When the data are binarized for calculating mutual information and a base-2 logarithm is used, each mutual information value is at least 0 and at most 1. Therefore, every element of the matrices <italic>Q</italic>, <italic>R</italic> and <italic>S</italic> lies within the range [0, 1]. The matrix <italic>Q</italic> represents feature redundancy, which should be minimized in order to eliminate redundant or highly correlated features. The matrix <italic>R</italic> captures the relationships between labels; since features associated with strongly correlated labels are desirable, the negative of this term is included in the objective to encourage their selection under a minimization framework. Similarly, <italic>S</italic> quantifies the relevance between features and individual labels. By incorporating mutual information into the regression formulation, the model is able to capture not only simple linear relationships but also complex non-linear dependencies. To promote the selection of highly relevant features, the negative of <italic>S</italic> is also incorporated into the objective function.</p>
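A sketch of how the three matrices of Eq. (6) could be assembled from binarized data <italic>X</italic> (n &#x00D7; d) and labels <italic>Y</italic> (n &#x00D7; c). The helper names are ours, numpy is assumed, and the diagonals of <italic>Q</italic> and <italic>R</italic> are zeroed to match the zero-diagonal property assumed later in the convexity discussion:

```python
import numpy as np

def _mi(a, b):
    # plug-in mutual information (bits) between two binary 0/1 vectors
    mi = 0.0
    for va in (0, 1):
        for vb in (0, 1):
            p_ab = np.mean((a == va) & (b == vb))
            p_a, p_b = np.mean(a == va), np.mean(b == vb)
            if p_ab > 0:
                mi += p_ab * np.log2(p_ab / (p_a * p_b))
    return mi

def build_qrs(X, Y):
    """Construct the matrices of Eq. (6): Q (feature redundancy, d x d),
    R (label dependency, c x c) and S (feature-label relevance, d x c)."""
    d, c = X.shape[1], Y.shape[1]
    Q = np.array([[_mi(X[:, i], X[:, j]) for j in range(d)] for i in range(d)])
    R = np.array([[_mi(Y[:, i], Y[:, j]) for j in range(c)] for i in range(c)])
    S = np.array([[_mi(X[:, i], Y[:, j]) for j in range(c)] for i in range(d)])
    np.fill_diagonal(Q, 0.0)  # drop self-information entries so that the
    np.fill_diagonal(R, 0.0)  # diagonals are zero, as assumed in Section 3.3
    return Q, R, S
```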
<p>By combining these values with the regression <xref ref-type="disp-formula" rid="eqn-3">Eq. (3)</xref>, an objective function can be designed as follows:
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi></mml:mi><mml:munder><mml:mo movablelimits="true" form="prefix">min</mml:mo><mml:mi>W</mml:mi></mml:munder><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>X</mml:mi><mml:mi>W</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>Y</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msubsup><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>F</mml:mi><mml:mn>2</mml:mn></mml:msubsup><mml:mo>+</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mtext>Tr</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>W</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mi>Q</mml:mi><mml:mi>W</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo>&#x2212;</mml:mo></mml:mrow><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mtext>Tr</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>R</mml:mi><mml:msup><mml:mi>W</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mi>W</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo>&#x2212;</mml:mo></mml:mrow><mml:mi>&#x03B3;</mml:mi><mml:mrow><mml:mtext>Tr</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>S</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mi>W</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd /><mml:mtd><mml:mrow><mml:mtext>s.t.&#xA0;</mml:mtext></mml:mrow><mml:mi>W</mml:mi><mml:mo>&#x2265;</mml:mo><mml:mn>0</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:mi>&#x03B1;</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula>, and 
<inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> are weights for redundancy, label correlation, and relevance, respectively, and are all positive. The hyperparameters <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:mi>&#x03B1;</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> control the influence of each regularization term. In <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:mtext>Tr</mml:mtext><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>W</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mi>Q</mml:mi><mml:mi>W</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, the <inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:mi>i</mml:mi></mml:math></inline-formula>-th diagonal term represents the sum of the redundancies of the <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:mi>i</mml:mi></mml:math></inline-formula>-th feature with the other features. 
Similarly, <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:mi>R</mml:mi><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mi>T</mml:mi></mml:msubsup><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:math></inline-formula> represents the sum of the associations between the <inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:mi>i</mml:mi></mml:math></inline-formula>-th feature and the labels, and <inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:msup><mml:mi>S</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:math></inline-formula> represents the sum of the relevance of the <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:mi>i</mml:mi></mml:math></inline-formula>-th feature across all labels. The interpretation of each term is as follows:
<list list-type="bullet">
<list-item>
<p>Regression loss <inline-formula id="ieqn-37"><mml:math id="mml-ieqn-37"><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>X</mml:mi><mml:mi>W</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>Y</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msubsup><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>F</mml:mi><mml:mn>2</mml:mn></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>: encourages accurate label reconstruction via a linear regression model.</p></list-item>
<list-item>
<p>Redundancy penalty <inline-formula id="ieqn-38"><mml:math id="mml-ieqn-38"><mml:mo stretchy="false">(</mml:mo><mml:mtext>Tr</mml:mtext><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>W</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mi>Q</mml:mi><mml:mi>W</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>: discourages selecting mutually redundant features.</p></list-item>
<list-item>
<p>Label dependency penalty <inline-formula id="ieqn-39"><mml:math id="mml-ieqn-39"><mml:mo stretchy="false">(</mml:mo><mml:mtext>Tr</mml:mtext><mml:mo stretchy="false">(</mml:mo><mml:mi>R</mml:mi><mml:msup><mml:mi>W</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mi>W</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>: promotes selecting features that explain correlated labels.</p></list-item>
<list-item>
<p>Relevance reward <inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:mo stretchy="false">(</mml:mo><mml:mtext>Tr</mml:mtext><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>S</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mi>W</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>: favors features with stronger associations to multiple labels.</p></list-item>
</list></p>
<p>Optimizing the proposed objective function thus corresponds to selecting features that not only minimize redundancy and maximize relevance and label association, but also contribute to accurately reconstructing the label space through a linear regression model. In this formulation, the regression-based loss captures the predictive structure in a supervised setting, while the mutual information-based terms guide the selection toward features that exhibit strong statistical dependency with the labels and minimal overlap with other features. This hybrid design enables the model to balance predictive accuracy with information-theoretic feature quality. In the next subsection, we present an efficient optimization framework to solve the proposed objective function under the non-negativity constraint.</p>
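As an illustration, the objective in Eq. (7) can be evaluated for a candidate <italic>W</italic> in a few lines of numpy (a sketch; the function name and toy shapes are ours):

```python
import numpy as np

def objective_eq7(W, X, Y, Q, R, S, alpha, beta, gamma):
    """Value of Eq. (7):
    ||XW - Y||_F^2 + alpha Tr(W^T Q W) - beta Tr(R W^T W) - gamma Tr(S^T W)."""
    fit = np.linalg.norm(X @ W - Y, "fro") ** 2  # regression loss
    red = np.trace(W.T @ Q @ W)                  # feature redundancy penalty
    dep = np.trace(R @ W.T @ W)                  # label dependency reward
    rel = np.trace(S.T @ W)                      # feature-label relevance reward
    return fit + alpha * red - beta * dep - gamma * rel
```

At <italic>W</italic> = 0 the penalty and reward terms vanish and the objective reduces to the squared Frobenius norm of <italic>Y</italic>, which serves as a quick sanity check.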
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>Optimization</title>
<p>The convexity of the proposed objective function is primarily determined by the terms <inline-formula id="ieqn-41"><mml:math id="mml-ieqn-41"><mml:mtext>Tr</mml:mtext><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>W</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mi>Q</mml:mi><mml:mi>W</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and <inline-formula id="ieqn-42"><mml:math id="mml-ieqn-42"><mml:mtext>Tr</mml:mtext><mml:mo stretchy="false">(</mml:mo><mml:mi>R</mml:mi><mml:msup><mml:mi>W</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mi>W</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. When the matrices <italic>Q</italic> and &#x2212;<italic>R</italic> are positive definite, the objective function is convex, which facilitates efficient optimization. However, since <italic>Q</italic> and &#x2212;<italic>R</italic> are not guaranteed to be positive definite, we enforce convexity by shifting the diagonal of each matrix: the absolute value of the smallest eigenvalue of <italic>Q</italic> is added to its diagonal, and the absolute value of the largest eigenvalue of <italic>R</italic> is subtracted from its diagonal. This modification ensures that both <italic>Q</italic> and &#x2212;<italic>R</italic> become positive definite. Importantly, because <italic>Q</italic> and <italic>R</italic> originally have zero diagonal entries, shifting the diagonal by a uniform scalar does not change the mutual-information structure that drives feature relevance. This correction preserves the feature selection behavior while ensuring the definiteness required for convex optimization.</p>
<p>To obtain the definiteness required for convexity, we adjust <italic>Q</italic> by adding the absolute value of its minimum eigenvalue to its diagonal entries, and <italic>R</italic> by subtracting the absolute value of its maximum eigenvalue from its diagonal entries. This yields modified matrices <inline-formula id="ieqn-43"><mml:math id="mml-ieqn-43"><mml:mrow><mml:mover><mml:mi>Q</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-44"><mml:math id="mml-ieqn-44"><mml:mrow><mml:mover><mml:mi>R</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula>, of which the former is guaranteed to be positive definite and the latter negative definite, so that the resulting objective is convex. The transformation is defined as follows:
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:mover><mml:mi>Q</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mo>=</mml:mo><mml:mi>Q</mml:mi><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mrow><mml:mtext>min</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>Q</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mo>+</mml:mo><mml:mi>&#x03B5;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mi>d</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mspace width="1em" /><mml:mtext>&#xA0;</mml:mtext><mml:mrow><mml:mover><mml:mi>R</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mo>=</mml:mo><mml:mi>R</mml:mi><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mrow><mml:mtext>max</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>R</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mi>&#x03B5;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mi>c</mml:mi></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <inline-formula id="ieqn-45"><mml:math id="mml-ieqn-45"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mo movablelimits="true" form="prefix">min</mml:mo></mml:mrow></mml:msub><mml:mo 
stretchy="false">(</mml:mo><mml:mi>Q</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and <inline-formula id="ieqn-46"><mml:math id="mml-ieqn-46"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mo movablelimits="true" form="prefix">max</mml:mo></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>R</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> denote the smallest and largest eigenvalues of <italic>Q</italic> and <italic>R</italic>, respectively, <inline-formula id="ieqn-47"><mml:math id="mml-ieqn-47"><mml:msub><mml:mi>I</mml:mi><mml:mi>d</mml:mi></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-48"><mml:math id="mml-ieqn-48"><mml:msub><mml:mi>I</mml:mi><mml:mi>c</mml:mi></mml:msub></mml:math></inline-formula> are identity matrices of appropriate dimensions matching <italic>Q</italic> and <italic>R</italic>, and <inline-formula id="ieqn-49"><mml:math id="mml-ieqn-49"><mml:mi>&#x03B5;</mml:mi></mml:math></inline-formula> is a small positive value that guarantees strict definiteness of the shifted matrices. This adjustment preserves the relative structure of the feature space while ensuring the convexity of the objective function. The optimization objective can be formulated as follows:
<disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi></mml:mi><mml:munder><mml:mo movablelimits="true" form="prefix">min</mml:mo><mml:mi>W</mml:mi></mml:munder><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>X</mml:mi><mml:mi>W</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>Y</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msubsup><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>F</mml:mi><mml:mn>2</mml:mn></mml:msubsup><mml:mo>+</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mtext>Tr</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>W</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mrow><mml:mover><mml:mi>Q</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>W</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo>&#x2212;</mml:mo></mml:mrow><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mtext>Tr</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mover><mml:mi>R</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:msup><mml:mi>W</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mi>W</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo>&#x2212;</mml:mo></mml:mrow><mml:mi>&#x03B3;</mml:mi><mml:mrow><mml:mtext>Tr</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>S</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mi>W</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd /><mml:mtd><mml:mrow><mml:mtext>s.t.&#xA0;</mml:mtext></mml:mrow><mml:mi>W</mml:mi><mml:mo>&#x2265;</mml:mo><mml:mn>0</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
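A minimal numpy sketch of the diagonal shift in Eq. (8); the function name and the eigvalsh-based eigenvalue computation are our choices, with <inline-formula><mml:math><mml:mi>&#x03B5;</mml:mi></mml:math></inline-formula> as in the text:

```python
import numpy as np

def make_convex(Q, R, eps=1e-6):
    """Diagonal shifts of Eq. (8): Q_hat = Q + (|lambda_min(Q)| + eps) I is
    positive definite, while R_hat = R - (|lambda_max(R)| + eps) I is negative
    definite, so that -R_hat is positive definite and Eq. (9) is convex."""
    lam_min_q = np.linalg.eigvalsh(Q).min()   # smallest eigenvalue of Q
    lam_max_r = np.linalg.eigvalsh(R).max()   # largest eigenvalue of R
    Q_hat = Q + (abs(lam_min_q) + eps) * np.eye(Q.shape[0])
    R_hat = R - (abs(lam_max_r) + eps) * np.eye(R.shape[0])
    return Q_hat, R_hat
```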
<p>To solve the modified convex objective function with a non-negativity constraint, we employ the projected gradient descent (PGD) algorithm. This algorithm performs optimization based on the gradient with respect to <italic>W</italic>, and any negative entries in <italic>W</italic> are projected to zero to enforce the non-negativity constraint. The gradient of the objective function with respect to <italic>W</italic> is given by:
<disp-formula id="eqn-10"><label>(10)</label><mml:math id="mml-eqn-10" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:mi>W</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn>2</mml:mn><mml:msup><mml:mi>X</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi>X</mml:mi><mml:mi>W</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>Y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mn>2</mml:mn><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mover><mml:mi>Q</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>W</mml:mi><mml:mrow><mml:mo>&#x2212;</mml:mo></mml:mrow><mml:mn>2</mml:mn><mml:mi>&#x03B2;</mml:mi><mml:mi>W</mml:mi><mml:mrow><mml:mover><mml:mi>R</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo></mml:mrow><mml:mi>&#x03B3;</mml:mi><mml:mi>S</mml:mi><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
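The gradient in Eq. (10) can be verified numerically by comparing it against a central finite difference of the objective in Eq. (9); below is a sketch under the assumption that the corrected matrices are symmetric (function names are ours):

```python
import numpy as np

def grad_eq10(W, X, Y, Q_hat, R_hat, S, alpha, beta, gamma):
    """Analytic gradient of Eq. (9), as given in Eq. (10)."""
    return (2 * X.T @ (X @ W - Y)
            + 2 * alpha * Q_hat @ W
            - 2 * beta * W @ R_hat
            - gamma * S)

def objective_eq9(W, X, Y, Q_hat, R_hat, S, alpha, beta, gamma):
    """Objective of Eq. (9) with the corrected matrices Q_hat and R_hat."""
    return (np.linalg.norm(X @ W - Y, "fro") ** 2
            + alpha * np.trace(W.T @ Q_hat @ W)
            - beta * np.trace(R_hat @ W.T @ W)
            - gamma * np.trace(S.T @ W))
```

Perturbing one entry of <italic>W</italic> and comparing the resulting difference quotient with the corresponding entry of the analytic gradient is a standard way to catch sign or transpose mistakes.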
<p>Algorithm 1 presents the complete procedure of the proposed method. In Algorithm 1, the step size <inline-formula id="ieqn-50"><mml:math id="mml-ieqn-50"><mml:mi>&#x03B7;</mml:mi></mml:math></inline-formula> is provided as a user-defined input. In practice, a theoretically sound upper bound can be derived from the Lipschitz constant of the gradient. For the proposed objective, the gradient is given by <inline-formula id="ieqn-51"><mml:math id="mml-ieqn-51"><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>W</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mn>2</mml:mn><mml:msup><mml:mi>X</mml:mi><mml:mi mathvariant="normal">&#x22A4;</mml:mi></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi>X</mml:mi><mml:mi>W</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>Y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mn>2</mml:mn><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mover><mml:mi>Q</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>W</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>2</mml:mn><mml:mi>&#x03B2;</mml:mi><mml:mi>W</mml:mi><mml:mrow><mml:mover><mml:mi>R</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B3;</mml:mi><mml:mi>S</mml:mi></mml:math></inline-formula>, consistent with <xref ref-type="disp-formula" rid="eqn-10">Eq. (10)</xref>, and the Lipschitz constant can be estimated as <inline-formula id="ieqn-52"><mml:math id="mml-ieqn-52"><mml:mi>L</mml:mi><mml:mo>=</mml:mo><mml:mn>2</mml:mn><mml:mo fence="false" stretchy="false">&#x2016;</mml:mo><mml:msup><mml:mi>X</mml:mi><mml:mi mathvariant="normal">&#x22A4;</mml:mi></mml:msup><mml:mi>X</mml:mi><mml:msub><mml:mo fence="false" stretchy="false">&#x2016;</mml:mo><mml:mn>2</mml:mn></mml:msub><mml:mo>+</mml:mo><mml:mn>2</mml:mn><mml:mi>&#x03B1;</mml:mi><mml:mo fence="false" stretchy="false">&#x2016;</mml:mo><mml:mrow><mml:mover><mml:mi>Q</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:msub><mml:mo fence="false" stretchy="false">&#x2016;</mml:mo><mml:mn>2</mml:mn></mml:msub><mml:mo>+</mml:mo><mml:mn>2</mml:mn><mml:mi>&#x03B2;</mml:mi><mml:mo fence="false" stretchy="false">&#x2016;</mml:mo><mml:mrow><mml:mover><mml:mi>R</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:msub><mml:mo fence="false" stretchy="false">&#x2016;</mml:mo><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula>. Thus, a stable choice is <inline-formula id="ieqn-53"><mml:math id="mml-ieqn-53"><mml:mi>&#x03B7;</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi>L</mml:mi></mml:math></inline-formula>, which guarantees monotonic descent and convergence under the projection constraint. Since the linear term <inline-formula id="ieqn-54"><mml:math id="mml-ieqn-54"><mml:mi>&#x03B3;</mml:mi><mml:mrow><mml:mi mathvariant="normal">T</mml:mi><mml:mi mathvariant="normal">r</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>S</mml:mi><mml:mi mathvariant="normal">&#x22A4;</mml:mi></mml:msup><mml:mi>W</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> does not affect <italic>L</italic>, this bound remains valid even for large <inline-formula id="ieqn-55"><mml:math id="mml-ieqn-55"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula>. In addition, a backtracking line search with the Armijo condition is used to further ensure stability in early iterations.</p>
<fig id="fig-7">
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_74138-fig-7.tif"/>
</fig>
</sec>
<sec id="s3_4">
<label>3.4</label>
<title>Computational Complexity Analysis</title>
<p>The overall computational complexity of the proposed algorithm consists of three main components: (1) the computation of the mutual information-based matrices <italic>Q</italic>, <italic>R</italic>, and <italic>S</italic>; (2) the eigenvalue correction required to ensure positive definiteness of <italic>Q</italic> and <italic>R</italic>; and (3) the iterative optimization procedure via projected gradient descent (PGD).</p>
<p><inline-formula id="ieqn-70"><mml:math id="mml-ieqn-70"><mml:mi>Q</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> encodes the pairwise mutual information between features, requiring <inline-formula id="ieqn-71"><mml:math id="mml-ieqn-71"><mml:mi>O</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>d</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> time. <inline-formula id="ieqn-72"><mml:math id="mml-ieqn-72"><mml:mi>R</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>c</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> encodes the mutual information between labels, requiring <inline-formula id="ieqn-73"><mml:math id="mml-ieqn-73"><mml:mi>O</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>c</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> time. <inline-formula id="ieqn-74"><mml:math id="mml-ieqn-74"><mml:mi>S</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>c</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> encodes the mutual information between features and labels, requiring <inline-formula id="ieqn-75"><mml:math id="mml-ieqn-75"><mml:mi>O</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>d</mml:mi><mml:mi>c</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> time. 
Thus, the total cost for constructing these matrices is <inline-formula id="ieqn-76"><mml:math id="mml-ieqn-76"><mml:mi>O</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>d</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo>+</mml:mo><mml:msup><mml:mi>c</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo>+</mml:mo><mml:mi>d</mml:mi><mml:mi>c</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>.</p>
<p>To ensure the convexity of the objective function, we modify <italic>Q</italic> and <italic>R</italic> by shifting their diagonals according to their extreme eigenvalues, as in <xref ref-type="disp-formula" rid="eqn-8">Eq. (8)</xref>. Computing an extreme eigenvalue of a symmetric matrix (e.g., via power iteration or Lanczos methods) typically requires <inline-formula id="ieqn-77"><mml:math id="mml-ieqn-77"><mml:mi>O</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>d</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> for <italic>Q</italic> and <inline-formula id="ieqn-78"><mml:math id="mml-ieqn-78"><mml:mi>O</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>c</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> for <italic>R</italic>, assuming a small number of iterations. Therefore, this step adds <inline-formula id="ieqn-79"><mml:math id="mml-ieqn-79"><mml:mi>O</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>d</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo>+</mml:mo><mml:msup><mml:mi>c</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> to the complexity.</p>
<p>In each iteration of PGD, the gradient with respect to <inline-formula id="ieqn-80"><mml:math id="mml-ieqn-80"><mml:mi>W</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>c</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is computed. The complexity of computing each term is as follows: <inline-formula id="ieqn-81"><mml:math id="mml-ieqn-81"><mml:msup><mml:mi>X</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi>X</mml:mi><mml:mi>W</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>: <inline-formula id="ieqn-82"><mml:math id="mml-ieqn-82"><mml:mi>O</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mi>d</mml:mi><mml:mi>c</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, <inline-formula id="ieqn-83"><mml:math id="mml-ieqn-83"><mml:mrow><mml:mover><mml:mi>Q</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>W</mml:mi></mml:math></inline-formula>: <inline-formula id="ieqn-84"><mml:math id="mml-ieqn-84"><mml:mi>O</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>d</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mi>c</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, <inline-formula id="ieqn-85"><mml:math id="mml-ieqn-85"><mml:mi>W</mml:mi><mml:mrow><mml:mover><mml:mi>R</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula>: <inline-formula id="ieqn-86"><mml:math id="mml-ieqn-86"><mml:mi>O</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>d</mml:mi><mml:msup><mml:mi>c</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. 
The term <inline-formula id="ieqn-87"><mml:math id="mml-ieqn-87"><mml:mi>&#x03B3;</mml:mi><mml:mi>S</mml:mi></mml:math></inline-formula> is constant once computed: <inline-formula id="ieqn-88"><mml:math id="mml-ieqn-88"><mml:mi>O</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>d</mml:mi><mml:mi>c</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. Thus, the per-iteration cost of gradient computation is <inline-formula id="ieqn-89"><mml:math id="mml-ieqn-89"><mml:mi>O</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mi>d</mml:mi><mml:mi>c</mml:mi><mml:mo>+</mml:mo><mml:msup><mml:mi>d</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mi>c</mml:mi><mml:mo>+</mml:mo><mml:mi>d</mml:mi><mml:msup><mml:mi>c</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. Assuming the algorithm converges in <italic>T</italic> iterations (typically a small constant or logarithmic in practice), the total optimization cost is <inline-formula id="ieqn-90"><mml:math id="mml-ieqn-90"><mml:mi>O</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mi>d</mml:mi><mml:mi>c</mml:mi><mml:mo>+</mml:mo><mml:msup><mml:mi>d</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mi>c</mml:mi><mml:mo>+</mml:mo><mml:mi>d</mml:mi><mml:msup><mml:mi>c</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>.</p>
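<p>The stated costs depend on the multiplication order. The sketch below uses toy sizes and stand-in matrices for <italic>Q</italic>, <italic>R</italic>, and <italic>S</italic> (the sign and weighting of the terms depend on the objective defined earlier); the key point it illustrates is evaluating the first term as <monospace>X.T @ (X @ W)</monospace>, which costs O(ndc) rather than the O(nd&#x00B2;) cost of forming <monospace>X.T @ X</monospace> first.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, c = 100, 20, 5              # samples, features, labels (toy sizes)
X = rng.standard_normal((n, d))
W = rng.standard_normal((d, c))
Q_hat = np.eye(d)                 # stand-ins for the shifted matrices
R_hat = np.eye(c)
S = rng.standard_normal((d, c))
gamma = 0.1

# Per-iteration gradient terms with the costs stated in the text:
t1 = X.T @ (X @ W)                # O(ndc): form XW first, then X^T(XW)
t2 = Q_hat @ W                    # O(d^2 c)
t3 = W @ R_hat                    # O(d c^2)
grad = t1 + t2 + t3 - gamma * S   # gamma*S is O(dc), constant once computed
```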
<p>Combining all components, the overall computational complexity is <inline-formula id="ieqn-91"><mml:math id="mml-ieqn-91"><mml:mi>O</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mi>d</mml:mi><mml:mi>c</mml:mi><mml:mo>+</mml:mo><mml:msup><mml:mi>d</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mi>c</mml:mi><mml:mo>+</mml:mo><mml:mi>d</mml:mi><mml:msup><mml:mi>c</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo>+</mml:mo><mml:msup><mml:mi>d</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo>+</mml:mo><mml:msup><mml:mi>c</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. The proposed algorithm is thus computationally efficient for moderate values of <inline-formula id="ieqn-92"><mml:math id="mml-ieqn-92"><mml:mi>d</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-93"><mml:math id="mml-ieqn-93"><mml:mi>c</mml:mi></mml:math></inline-formula>, and scalable to larger datasets when <italic>T</italic> is reasonably small.</p>
<p>The proposed method involves three hyperparameters, <inline-formula id="ieqn-94"><mml:math id="mml-ieqn-94"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula>, <inline-formula id="ieqn-95"><mml:math id="mml-ieqn-95"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula>, and <inline-formula id="ieqn-96"><mml:math id="mml-ieqn-96"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula>, which control the contributions of smoothness, redundancy penalization, and label-alignment terms, respectively. In practice, these parameters are tuned over a coarse logarithmic grid, which keeps the computational cost manageable due to the convexity of the objective and the low per-iteration complexity of the solver.</p>
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Experimental Results</title>
<sec id="s4_1">
<label>4.1</label>
<title>Experimental Settings</title>
<p>This section presents the experimental results to evaluate the effectiveness of the proposed method in improving classification performance. We compared classification performance using the Multi-label <inline-formula id="ieqn-97"><mml:math id="mml-ieqn-97"><mml:mi>k</mml:mi></mml:math></inline-formula>-Nearest Neighbors (ML<inline-formula id="ieqn-98"><mml:math id="mml-ieqn-98"><mml:mi>k</mml:mi></mml:math></inline-formula>NN) classifier [<xref ref-type="bibr" rid="ref-36">36</xref>], the Linear Support Vector Machine (LinSVM) classifier [<xref ref-type="bibr" rid="ref-17">17</xref>,<xref ref-type="bibr" rid="ref-20">20</xref>,<xref ref-type="bibr" rid="ref-23">23</xref>], and the Multi-Label Decision Tree (MLDT) classifier [<xref ref-type="bibr" rid="ref-18">18</xref>]. In ML<inline-formula id="ieqn-99"><mml:math id="mml-ieqn-99"><mml:mi>k</mml:mi></mml:math></inline-formula>NN, the value of <inline-formula id="ieqn-100"><mml:math id="mml-ieqn-100"><mml:mi>k</mml:mi></mml:math></inline-formula> was set to 5. ML<inline-formula id="ieqn-101"><mml:math id="mml-ieqn-101"><mml:mi>k</mml:mi></mml:math></inline-formula>NN has been widely used in previous studies to benchmark the performance of multi-label feature selection methods [<xref ref-type="bibr" rid="ref-7">7</xref>,<xref ref-type="bibr" rid="ref-8">8</xref>,<xref ref-type="bibr" rid="ref-19">19</xref>]. The regularization parameter <italic>C</italic> of the LinSVM classifier was determined through validation during the training phase. For each experiment, the training and test data were randomly split in an 8:2 ratio, and the process was repeated 10 times to compute the average performance.</p>
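<p>The evaluation protocol above (random 8:2 train/test splits, repeated 10 times, with averaged results) can be sketched as follows; <monospace>repeated_splits</monospace> is an illustrative helper, not code from the paper, and the dummy score stands in for training and evaluating a classifier.</p>

```python
import numpy as np

def repeated_splits(n, ratio=0.8, repeats=10, seed=0):
    """Yield (train_idx, test_idx) index arrays for repeated random splits."""
    rng = np.random.default_rng(seed)
    cut = int(n * ratio)
    for _ in range(repeats):
        perm = rng.permutation(n)
        yield perm[:cut], perm[cut:]

scores = []
for tr, te in repeated_splits(100):
    # ...train a classifier on tr, evaluate on te; a dummy score here
    scores.append(len(te) / 100)
mean_score = float(np.mean(scores))
```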
<p>Seven datasets were used in the experiments. The cal500 dataset is a multi-label music annotation dataset consisting of 500 Western popular songs, each annotated with multiple tags describing acoustic, emotional, and semantic content [<xref ref-type="bibr" rid="ref-37">37</xref>]. The corel5k dataset is an image annotation dataset comprising 5000 images, each associated with one or more textual labels from a controlled vocabulary of 374 words. The emotions dataset contains 593 music tracks with 72 audio features, each labeled with one or more of six primary emotional categories [<xref ref-type="bibr" rid="ref-38">38</xref>]. The enron dataset is a text corpus derived from company emails [<xref ref-type="bibr" rid="ref-13">13</xref>]. The genbase dataset is a biological dataset for protein function classification, where each instance represents a protein and each label corresponds to a specific function. The medical dataset consists of 978 clinical free-text reports, each labeled with up to 45 disease codes [<xref ref-type="bibr" rid="ref-39">39</xref>]. The slashdot dataset is a relational graph dataset representing a social network of users from the technology news site Slashdot. It includes directional friend/foe relationships among users, making it suitable for multi-label classification and link prediction tasks in networked data [<xref ref-type="bibr" rid="ref-40">40</xref>]. Detailed characteristics of each dataset are summarized in <xref ref-type="table" rid="table-1">Table 1</xref>.</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Information about data sets</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Data</th>
<th>No. of patterns</th>
<th>No. of features</th>
<th>No. of labels</th>
<th>Label cardinality</th>
<th>Label density</th>
<th>Type</th>
</tr>
</thead>
<tbody>
<tr>
<td>cal500</td>
<td>502</td>
<td>68</td>
<td>174</td>
<td>26.0438</td>
<td>0.1497</td>
<td>Music</td>
</tr>
<tr>
<td>corel5k</td>
<td>5000</td>
<td>499</td>
<td>374</td>
<td>3.5220</td>
<td>0.0094</td>
<td>Images</td>
</tr>
<tr>
<td>emotions</td>
<td>593</td>
<td>72</td>
<td>6</td>
<td>1.8685</td>
<td>0.3114</td>
<td>Music</td>
</tr>
<tr>
<td>enron</td>
<td>1702</td>
<td>1001</td>
<td>53</td>
<td>3.3784</td>
<td>0.0637</td>
<td>Text</td>
</tr>
<tr>
<td>genbase</td>
<td>662</td>
<td>1185</td>
<td>27</td>
<td>1.2523</td>
<td>0.0464</td>
<td>Biology</td>
</tr>
<tr>
<td>medical</td>
<td>978</td>
<td>1494</td>
<td>45</td>
<td>1.2454</td>
<td>0.0277</td>
<td>Text</td>
</tr>
<tr>
<td>slashdot</td>
<td>3782</td>
<td>1079</td>
<td>22</td>
<td>1.1809</td>
<td>0.0537</td>
<td>Network</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>For evaluation, we used three metrics: Hamming Loss (hloss), Ranking Loss (rloss), and Multi-label Accuracy (mlacc) [<xref ref-type="bibr" rid="ref-8">8</xref>]. Lower values of Hamming Loss and Ranking Loss indicate better classification performance, while higher values of Multi-label Accuracy reflect better performance. The number of selected features was set to <inline-formula id="ieqn-102"><mml:math id="mml-ieqn-102"><mml:msqrt><mml:mi>n</mml:mi></mml:msqrt></mml:math></inline-formula>, where <inline-formula id="ieqn-103"><mml:math id="mml-ieqn-103"><mml:mi>n</mml:mi></mml:math></inline-formula> denotes the number of patterns in each dataset, following the methodology in [<xref ref-type="bibr" rid="ref-41">41</xref>]. In principle, at least <inline-formula id="ieqn-104"><mml:math id="mml-ieqn-104"><mml:msub><mml:mi>log</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>&#x2061;</mml:mo><mml:mi>n</mml:mi></mml:math></inline-formula> independent binary features are required to uniquely distinguish <inline-formula id="ieqn-105"><mml:math id="mml-ieqn-105"><mml:mi>n</mml:mi></mml:math></inline-formula> samples; for example, three such features suffice to differentiate eight samples. In practice, however, this bound is too small, since real-world features are rarely fully independent. We therefore adopt <inline-formula id="ieqn-106"><mml:math id="mml-ieqn-106"><mml:msqrt><mml:mi>n</mml:mi></mml:msqrt></mml:math></inline-formula> as a practical heuristic, which provides a reasonably small yet representative subset size.</p>
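<p>For concreteness, the two set-based metrics can be written as below. These are the standard definitions, with Multi-label Accuracy taken here to be the instance-wise Jaccard similarity between predicted and true label sets; Ranking loss is omitted since it operates on real-valued label scores rather than binary predictions.</p>

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """Fraction of instance-label pairs predicted incorrectly (lower is better)."""
    return float(np.mean(Y_true != Y_pred))

def multilabel_accuracy(Y_true, Y_pred):
    """Instance-wise Jaccard similarity averaged over samples (higher is better)."""
    inter = np.logical_and(Y_true, Y_pred).sum(axis=1)
    union = np.logical_or(Y_true, Y_pred).sum(axis=1)
    # Empty true and predicted label sets count as a perfect match
    return float(np.mean(np.where(union == 0, 1.0, inter / np.maximum(union, 1))))

Y_true = np.array([[1, 0, 1], [0, 1, 0]])
Y_pred = np.array([[1, 1, 1], [0, 1, 0]])
hl = hamming_loss(Y_true, Y_pred)         # 1 wrong pair out of 6 -> 1/6
ma = multilabel_accuracy(Y_true, Y_pred)  # (2/3 + 1) / 2 -> 5/6
```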
<p>To compare against the proposed method, we selected five existing approaches: AMI [<xref ref-type="bibr" rid="ref-42">42</xref>], MDMR [<xref ref-type="bibr" rid="ref-16">16</xref>], FIMF [<xref ref-type="bibr" rid="ref-8">8</xref>], QPFS [<xref ref-type="bibr" rid="ref-7">7</xref>], and MFSJMI [<xref ref-type="bibr" rid="ref-17">17</xref>]. AMI selects features based on the first-order mutual information between features and labels. MDMR introduces a new feature evaluation function that considers mutual information between features and labels as well as among features. FIMF limits the number of labels considered during evaluation to enable fast mutual information-based feature selection. QPFS reformulates the mutual information problem into a quadratic programming framework to balance feature relevance and redundancy. MFSJMI selects features by considering label distribution and evaluating relevance using joint mutual information. For each method, hyperparameters were tuned by grid search; in particular, the hyperparameters <inline-formula id="ieqn-107"><mml:math id="mml-ieqn-107"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula>, <inline-formula id="ieqn-108"><mml:math id="mml-ieqn-108"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula>, and <inline-formula id="ieqn-109"><mml:math id="mml-ieqn-109"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> of the proposed method were varied over the range <inline-formula id="ieqn-110"><mml:math id="mml-ieqn-110"><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>. The best-performing result from this grid search was reported. 
To enable reliable mutual information estimation, continuous features were discretized using the Label-Attribute Interdependence Maximization (LAIM) method [<xref ref-type="bibr" rid="ref-43">43</xref>], which is specifically designed for multi-label learning.</p>
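<p>The coarse logarithmic grid described above can be enumerated as follows; <monospace>evaluate</monospace> is a hypothetical placeholder for the actual train-and-score procedure, here replaced by a toy score so the sketch is self-contained.</p>

```python
import itertools

grid = [10.0 ** k for k in range(-3, 4)]   # 10^-3, 10^-2, ..., 10^3

def evaluate(alpha, beta, gamma):
    """Placeholder scoring function; in practice this would run feature
    selection with (alpha, beta, gamma) and return, e.g., validation
    multi-label accuracy."""
    return -(alpha - 0.1) ** 2 - (beta - 1.0) ** 2 - (gamma - 10.0) ** 2

# Exhaustive search over all 7^3 = 343 combinations
best = max(itertools.product(grid, repeat=3), key=lambda p: evaluate(*p))
# best is (0.1, 1.0, 10.0), the maximizer of the toy score
```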
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Comparison Results</title>
<p><xref ref-type="table" rid="table-2">Tables 2</xref>&#x2013;<xref ref-type="table" rid="table-4">4</xref> summarize the classification performance of different multi-label feature selection methods evaluated with the ML<inline-formula id="ieqn-111"><mml:math id="mml-ieqn-111"><mml:mi>k</mml:mi></mml:math></inline-formula>NN classifier. The best result for each dataset and metric is highlighted in bold. <xref ref-type="table" rid="table-2">Table 2</xref> presents the Hamming loss results. The proposed method achieves the lowest loss on six out of seven datasets, clearly outperforming the existing approaches. In particular, substantial improvements are observed on the emotions, enron, and medical datasets, demonstrating that the proposed formulation effectively reduces label-wise prediction errors. Although FIMF performs slightly better on corel5k, the proposed method achieves the overall best average performance across datasets. <xref ref-type="table" rid="table-3">Table 3</xref> reports the Ranking loss results. The proposed approach again shows strong superiority, achieving the lowest ranking loss on six datasets. These results indicate that the selected features enable the classifier to preserve the relative ranking of relevant and irrelevant labels more effectively than competing methods. Notably, while AMI achieves the best result on corel5k, the proposed method shows consistent improvements in more complex datasets where label dependencies are pronounced. <xref ref-type="table" rid="table-4">Table 4</xref> presents the Multi-label accuracy results. The proposed method outperforms all baselines on six datasets, achieving particularly large gains on corel5k, enron, and medical. This confirms that the proposed approach can identify highly discriminative and non-redundant features that generalize well across diverse label spaces. 
Although QPFS slightly exceeds the proposed method on slashdot, the overall performance trend demonstrates the robustness and adaptability of the proposed method across different data characteristics.</p>
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Experimental Hamming loss result of the ML<inline-formula id="ieqn-112"><mml:math id="mml-ieqn-112"><mml:mi>k</mml:mi></mml:math></inline-formula>NN classifier</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Data</th>
<th>AMI</th>
<th>MDMR</th>
<th>FIMF</th>
<th>QPFS</th>
<th>MFSJMI</th>
<th>Proposed</th>
</tr>
</thead>
<tbody>
<tr>
<td>cal500</td>
<td>0.1537</td>
<td>0.1539</td>
<td>0.1541</td>
<td>0.1541</td>
<td>0.1545</td>
<td><bold>0.1504</bold></td>
</tr>
<tr>
<td>corel5k</td>
<td>0.0115</td>
<td>0.0115</td>
<td><bold>0.0112</bold></td>
<td>0.0115</td>
<td>0.0113</td>
<td>0.0117</td>
</tr>
<tr>
<td>emotions</td>
<td>0.2171</td>
<td>0.2172</td>
<td>0.2158</td>
<td>0.2154</td>
<td>0.2155</td>
<td><bold>0.2065</bold></td>
</tr>
<tr>
<td>enron</td>
<td>0.0620</td>
<td>0.0570</td>
<td>0.0530</td>
<td>0.0610</td>
<td>0.0585</td>
<td><bold>0.0504</bold></td>
</tr>
<tr>
<td>genbase</td>
<td>0.0040</td>
<td>0.0040</td>
<td>0.0043</td>
<td>0.0039</td>
<td>0.0043</td>
<td><bold>0.0020</bold></td>
</tr>
<tr>
<td>medical</td>
<td>0.0077</td>
<td>0.0076</td>
<td>0.0074</td>
<td>0.0085</td>
<td>0.0077</td>
<td><bold>0.0026</bold></td>
</tr>
<tr>
<td>slashdot</td>
<td>0.0520</td>
<td>0.0519</td>
<td>0.0521</td>
<td>0.0520</td>
<td>0.0490</td>
<td><bold>0.0447</bold></td>
</tr>
</tbody>
</table>
</table-wrap><table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Experimental Ranking loss result of the ML<inline-formula id="ieqn-113"><mml:math id="mml-ieqn-113"><mml:mi>k</mml:mi></mml:math></inline-formula>NN classifier</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Data</th>
<th>AMI</th>
<th>MDMR</th>
<th>FIMF</th>
<th>QPFS</th>
<th>MFSJMI</th>
<th>Proposed</th>
</tr>
</thead>
<tbody>
<tr>
<td>cal500</td>
<td>0.3772</td>
<td>0.3764</td>
<td>0.3789</td>
<td>0.3753</td>
<td>0.3779</td>
<td><bold>0.3691</bold></td>
</tr>
<tr>
<td>corel5k</td>
<td><bold>0.7098</bold></td>
<td>0.7127</td>
<td>0.7257</td>
<td>0.7107</td>
<td>0.7178</td>
<td>0.7210</td>
</tr>
<tr>
<td>emotions</td>
<td>0.2554</td>
<td>0.2517</td>
<td>0.2534</td>
<td>0.2536</td>
<td>0.2402</td>
<td><bold>0.2299</bold></td>
</tr>
<tr>
<td>enron</td>
<td>0.3113</td>
<td>0.2812</td>
<td>0.2831</td>
<td>0.2985</td>
<td>0.2988</td>
<td><bold>0.2612</bold></td>
</tr>
<tr>
<td>genbase</td>
<td>0.0445</td>
<td>0.0445</td>
<td>0.0445</td>
<td>0.0445</td>
<td>0.0466</td>
<td><bold>0.0432</bold></td>
</tr>
<tr>
<td>medical</td>
<td>0.1189</td>
<td>0.1127</td>
<td>0.1111</td>
<td>0.1251</td>
<td>0.1279</td>
<td><bold>0.0550</bold></td>
</tr>
<tr>
<td>slashdot</td>
<td>0.4914</td>
<td>0.4910</td>
<td>0.4841</td>
<td>0.4839</td>
<td>0.5127</td>
<td><bold>0.4565</bold></td>
</tr>
</tbody>
</table>
</table-wrap><table-wrap id="table-4">
<label>Table 4</label>
<caption>
<title>Experimental Multi-label accuracy result of the ML<inline-formula id="ieqn-114"><mml:math id="mml-ieqn-114"><mml:mi>k</mml:mi></mml:math></inline-formula>NN classifier</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Data</th>
<th>AMI</th>
<th>MDMR</th>
<th>FIMF</th>
<th>QPFS</th>
<th>MFSJMI</th>
<th>Proposed</th>
</tr>
</thead>
<tbody>
<tr>
<td>cal500</td>
<td>0.2191</td>
<td>0.2161</td>
<td>0.2186</td>
<td>0.2186</td>
<td>0.2185</td>
<td><bold>0.2269</bold></td>
</tr>
<tr>
<td>corel5k</td>
<td>0.0408</td>
<td>0.0414</td>
<td>0.0358</td>
<td>0.0402</td>
<td>0.0361</td>
<td><bold>0.0588</bold></td>
</tr>
<tr>
<td>emotions</td>
<td>0.5178</td>
<td>0.5131</td>
<td>0.5231</td>
<td>0.5219</td>
<td>0.5270</td>
<td><bold>0.5428</bold></td>
</tr>
<tr>
<td>enron</td>
<td>0.2631</td>
<td>0.2992</td>
<td>0.3445</td>
<td>0.2793</td>
<td>0.2821</td>
<td><bold>0.4051</bold></td>
</tr>
<tr>
<td>genbase</td>
<td>0.9568</td>
<td>0.9564</td>
<td>0.9527</td>
<td>0.9577</td>
<td>0.9542</td>
<td><bold>0.9790</bold></td>
</tr>
<tr>
<td>medical</td>
<td>0.7746</td>
<td>0.7772</td>
<td>0.7816</td>
<td>0.7467</td>
<td>0.7615</td>
<td><bold>0.9352</bold></td>
</tr>
<tr>
<td>slashdot</td>
<td>0.3393</td>
<td>0.3404</td>
<td>0.2831</td>
<td><bold>0.3477</bold></td>
<td>0.2654</td>
<td>0.3469</td>
</tr>
</tbody>
</table>
</table-wrap>
<p><xref ref-type="table" rid="table-5">Tables 5</xref>&#x2013;<xref ref-type="table" rid="table-7">7</xref> present the classification results obtained using the LinSVM classifier. <xref ref-type="table" rid="table-5">Table 5</xref> summarizes the Hamming loss results. The proposed method achieves the lowest or joint-lowest loss on all datasets. These results show that the selected features effectively reduce instance-level prediction errors even when evaluated with a linear classifier. <xref ref-type="table" rid="table-6">Table 6</xref> presents the Ranking loss results. The proposed method achieves the best results on four datasets (cal500, emotions, genbase, and medical), showing its ability to preserve the label ranking order with minimal degradation. While FIMF outperforms the other methods on corel5k and MDMR performs best on enron, the proposed approach remains highly competitive across datasets. The performance gain on emotions and medical demonstrates that the proposed feature selection method can effectively model label dependencies even under a linear decision boundary. <xref ref-type="table" rid="table-7">Table 7</xref> reports the Multi-label accuracy results. The proposed method achieves the highest accuracy on all seven datasets, with particularly large gains on corel5k, enron, and medical. This demonstrates the strong discriminative capability of the selected features and their ability to generalize across datasets with varying label sparsity. FIMF predicted all-zero label vectors in 9 out of 10 runs on corel5k, leading to nearly zero multi-label accuracy.</p>
<table-wrap id="table-5">
<label>Table 5</label>
<caption>
<title>Experimental Hamming loss result of the LinSVM classifier</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Data</th>
<th>AMI</th>
<th>MDMR</th>
<th>FIMF</th>
<th>QPFS</th>
<th>MFSJMI</th>
<th>Proposed</th>
</tr>
</thead>
<tbody>
<tr>
<td>cal500</td>
<td>0.1363</td>
<td>0.1364</td>
<td>0.1362</td>
<td>0.1364</td>
<td>0.1363</td>
<td><bold>0.1360</bold></td>
</tr>
<tr>
<td>corel5k</td>
<td>0.0094</td>
<td>0.0094</td>
<td>0.0094</td>
<td>0.0094</td>
<td>0.0094</td>
<td><bold>0.0094</bold></td>
</tr>
<tr>
<td>emotions</td>
<td>0.2133</td>
<td>0.2109</td>
<td>0.2126</td>
<td>0.2113</td>
<td>0.2117</td>
<td><bold>0.2048</bold></td>
</tr>
<tr>
<td>enron</td>
<td>0.0571</td>
<td>0.0535</td>
<td>0.0511</td>
<td>0.0562</td>
<td>0.0545</td>
<td><bold>0.0478</bold></td>
</tr>
<tr>
<td>genbase</td>
<td>0.0031</td>
<td>0.0031</td>
<td>0.0033</td>
<td>0.0030</td>
<td>0.0034</td>
<td><bold>0.0013</bold></td>
</tr>
<tr>
<td>medical</td>
<td>0.0054</td>
<td>0.0053</td>
<td>0.0050</td>
<td>0.0063</td>
<td>0.0053</td>
<td><bold>0.0015</bold></td>
</tr>
<tr>
<td>slashdot</td>
<td>0.0423</td>
<td>0.0423</td>
<td>0.0445</td>
<td>0.0423</td>
<td>0.0450</td>
<td><bold>0.0423</bold></td>
</tr>
</tbody>
</table>
</table-wrap><table-wrap id="table-6">
<label>Table 6</label>
<caption>
<title>Experimental Ranking loss result of the LinSVM classifier</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Data</th>
<th>AMI</th>
<th>MDMR</th>
<th>FIMF</th>
<th>QPFS</th>
<th>MFSJMI</th>
<th>Proposed</th>
</tr>
</thead>
<tbody>
<tr>
<td>cal500</td>
<td>0.2496</td>
<td>0.2479</td>
<td>0.2485</td>
<td>0.2473</td>
<td>0.2478</td>
<td><bold>0.2415</bold></td>
</tr>
<tr>
<td>corel5k</td>
<td>0.1950</td>
<td>0.1945</td>
<td><bold>0.1849</bold></td>
<td>0.1933</td>
<td>0.1924</td>
<td>0.2400</td>
</tr>
<tr>
<td>emotions</td>
<td>0.1918</td>
<td>0.1874</td>
<td>0.1943</td>
<td>0.1897</td>
<td>0.1844</td>
<td><bold>0.1744</bold></td>
</tr>
<tr>
<td>enron</td>
<td>0.1551</td>
<td><bold>0.1348</bold></td>
<td>0.1377</td>
<td>0.1465</td>
<td>0.1412</td>
<td>0.1350</td>
</tr>
<tr>
<td>genbase</td>
<td>0.0070</td>
<td>0.0067</td>
<td>0.0063</td>
<td>0.0072</td>
<td>0.0082</td>
<td><bold>0.0042</bold></td>
</tr>
<tr>
<td>medical</td>
<td>0.0393</td>
<td>0.0370</td>
<td>0.0365</td>
<td>0.0466</td>
<td>0.0461</td>
<td><bold>0.0104</bold></td>
</tr>
<tr>
<td>slashdot</td>
<td>0.2396</td>
<td>0.2398</td>
<td>0.2436</td>
<td><bold>0.2381</bold></td>
<td>0.2651</td>
<td>0.2385</td>
</tr>
</tbody>
</table>
</table-wrap><table-wrap id="table-7">
<label>Table 7</label>
<caption>
<title>Experimental Multi-label accuracy result of the LinSVM classifier</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Data</th>
<th>AMI</th>
<th>MDMR</th>
<th>FIMF</th>
<th>QPFS</th>
<th>MFSJMI</th>
<th>Proposed</th>
</tr>
</thead>
<tbody>
<tr>
<td>cal500</td>
<td>0.2007</td>
<td>0.2007</td>
<td>0.2013</td>
<td>0.2014</td>
<td>0.2014</td>
<td><bold>0.2039</bold></td>
</tr>
<tr>
<td>corel5k</td>
<td>0.0051</td>
<td>0.0046</td>
<td>0.0002</td>
<td>0.0050</td>
<td>0.0048</td>
<td><bold>0.0233</bold></td>
</tr>
<tr>
<td>emotions</td>
<td>0.4505</td>
<td>0.4548</td>
<td>0.4592</td>
<td>0.4560</td>
<td>0.4681</td>
<td><bold>0.4917</bold></td>
</tr>
<tr>
<td>enron</td>
<td>0.2270</td>
<td>0.2844</td>
<td>0.3411</td>
<td>0.2378</td>
<td>0.2360</td>
<td><bold>0.4248</bold></td>
</tr>
<tr>
<td>genbase</td>
<td>0.9650</td>
<td>0.9650</td>
<td>0.9624</td>
<td>0.9662</td>
<td>0.9631</td>
<td><bold>0.9859</bold></td>
</tr>
<tr>
<td>medical</td>
<td>0.8103</td>
<td>0.8156</td>
<td>0.8304</td>
<td>0.7912</td>
<td>0.8112</td>
<td><bold>0.9552</bold></td>
</tr>
<tr>
<td>slashdot</td>
<td>0.3527</td>
<td>0.3522</td>
<td>0.2868</td>
<td>0.3558</td>
<td>0.2844</td>
<td><bold>0.3561</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<p><xref ref-type="table" rid="table-8">Tables 8</xref>&#x2013;<xref ref-type="table" rid="table-10">10</xref> summarize the experimental results obtained using the MLDT classifier. <xref ref-type="table" rid="table-8">Table 8</xref> reports the Hamming loss results. The proposed method achieves the lowest loss on all seven datasets, showing remarkable consistency and robustness. In particular, the performance improvements on emotions, enron, genbase, and medical are significant, indicating that the proposed feature selection strategy effectively reduces label-wise prediction errors even for complex label spaces. <xref ref-type="table" rid="table-9">Table 9</xref> presents the Ranking loss results. The proposed approach achieves the best performance on five datasets while remaining highly competitive on the others. These results suggest that the proposed method allows the MLDT classifier to better preserve the relative ranking between relevant and irrelevant labels. Notably, the performance gain on corel5k and medical is substantial, confirming that the proposed MI-guided optimization enhances feature selection for both high-dimensional and label-dependent data. <xref ref-type="table" rid="table-10">Table 10</xref> shows the Multi-label accuracy results. The proposed method achieves the highest accuracy on all seven datasets, highlighting its superiority in overall predictive capability. The large gains on emotions, enron, and medical datasets illustrate that the proposed formulation captures label correlations more effectively than existing feature selection methods.</p>
<table-wrap id="table-8">
<label>Table 8</label>
<caption>
<title>Experimental Hamming loss result of the MLDT classifier</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Data</th>
<th>AMI</th>
<th>MDMR</th>
<th>FIMF</th>
<th>QPFS</th>
<th>MFSJMI</th>
<th>Proposed</th>
</tr>
</thead>
<tbody>
<tr>
<td>cal500</td>
<td>0.1964</td>
<td>0.1966</td>
<td>0.1968</td>
<td>0.1961</td>
<td>0.1949</td>
<td><bold>0.1946</bold></td>
</tr>
<tr>
<td>corel5k</td>
<td>0.0106</td>
<td>0.0107</td>
<td>0.0110</td>
<td>0.0106</td>
<td>0.0107</td>
<td><bold>0.0096</bold></td>
</tr>
<tr>
<td>emotions</td>
<td>0.2766</td>
<td>0.2734</td>
<td>0.2684</td>
<td>0.2743</td>
<td>0.2641</td>
<td><bold>0.2561</bold></td>
</tr>
<tr>
<td>enron</td>
<td>0.0619</td>
<td>0.0581</td>
<td>0.0589</td>
<td>0.0609</td>
<td>0.0595</td>
<td><bold>0.0556</bold></td>
</tr>
<tr>
<td>genbase</td>
<td>0.0034</td>
<td>0.0034</td>
<td>0.0036</td>
<td>0.0033</td>
<td>0.0036</td>
<td><bold>0.0014</bold></td>
</tr>
<tr>
<td>medical</td>
<td>0.0059</td>
<td>0.0059</td>
<td>0.0056</td>
<td>0.0068</td>
<td>0.0061</td>
<td><bold>0.0015</bold></td>
</tr>
<tr>
<td>slashdot</td>
<td>0.0442</td>
<td>0.0441</td>
<td>0.0488</td>
<td>0.0440</td>
<td>0.0473</td>
<td><bold>0.0436</bold></td>
</tr>
</tbody>
</table>
</table-wrap><table-wrap id="table-9">
<label>Table 9</label>
<caption>
<title>Experimental Ranking loss result of the MLDT classifier</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Data</th>
<th>AMI</th>
<th>MDMR</th>
<th>FIMF</th>
<th>QPFS</th>
<th>MFSJMI</th>
<th>Proposed</th>
</tr>
</thead>
<tbody>
<tr>
<td>cal500</td>
<td>0.4318</td>
<td>0.4339</td>
<td>0.4300</td>
<td>0.4341</td>
<td>0.4284</td>
<td><bold>0.4236</bold></td>
</tr>
<tr>
<td>corel5k</td>
<td>0.2086</td>
<td>0.2107</td>
<td>0.2248</td>
<td>0.2100</td>
<td>0.2158</td>
<td><bold>0.1714</bold></td>
</tr>
<tr>
<td>emotions</td>
<td>0.3586</td>
<td>0.3534</td>
<td>0.3532</td>
<td>0.3530</td>
<td>0.3299</td>
<td><bold>0.3283</bold></td>
</tr>
<tr>
<td>enron</td>
<td><bold>0.1453</bold></td>
<td>0.1524</td>
<td>0.1844</td>
<td>0.1492</td>
<td>0.1608</td>
<td>0.1567</td>
</tr>
<tr>
<td>genbase</td>
<td>0.0382</td>
<td>0.0382</td>
<td>0.0383</td>
<td>0.0381</td>
<td>0.0399</td>
<td><bold>0.0369</bold></td>
</tr>
<tr>
<td>medical</td>
<td>0.0636</td>
<td>0.0605</td>
<td>0.0564</td>
<td>0.0617</td>
<td>0.0399</td>
<td><bold>0.0358</bold></td>
</tr>
<tr>
<td>slashdot</td>
<td>0.2439</td>
<td>0.2430</td>
<td>0.2838</td>
<td><bold>0.2429</bold></td>
<td>0.2711</td>
<td>0.2460</td>
</tr>
</tbody>
</table>
</table-wrap><table-wrap id="table-10">
<label>Table 10</label>
<caption>
<title>Experimental Multi-label accuracy result of the MLDT classifier</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Data</th>
<th>AMI</th>
<th>MDMR</th>
<th>FIMF</th>
<th>QPFS</th>
<th>MFSJMI</th>
<th>Proposed</th>
</tr>
</thead>
<tbody>
<tr>
<td>cal500</td>
<td>0.2118</td>
<td>0.2092</td>
<td>0.2120</td>
<td>0.2103</td>
<td>0.2130</td>
<td><bold>0.2138</bold></td>
</tr>
<tr>
<td>corel5k</td>
<td>0.0383</td>
<td>0.0372</td>
<td>0.0350</td>
<td>0.0382</td>
<td>0.0363</td>
<td><bold>0.0408</bold></td>
</tr>
<tr>
<td>emotions</td>
<td>0.4181</td>
<td>0.4254</td>
<td>0.4202</td>
<td>0.4177</td>
<td>0.4352</td>
<td><bold>0.4557</bold></td>
</tr>
<tr>
<td>enron</td>
<td>0.2067</td>
<td>0.3380</td>
<td>0.3659</td>
<td>0.2643</td>
<td>0.2906</td>
<td><bold>0.4029</bold></td>
</tr>
<tr>
<td>genbase</td>
<td>0.9625</td>
<td>0.9622</td>
<td>0.9596</td>
<td>0.9636</td>
<td>0.9611</td>
<td><bold>0.9854</bold></td>
</tr>
<tr>
<td>medical</td>
<td>0.8194</td>
<td>0.8208</td>
<td>0.8305</td>
<td>0.8000</td>
<td>0.8117</td>
<td><bold>0.9583</bold></td>
</tr>
<tr>
<td>slashdot</td>
<td>0.3662</td>
<td>0.3668</td>
<td>0.3115</td>
<td>0.3691</td>
<td>0.2953</td>
<td><bold>0.3713</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<p><xref ref-type="fig" rid="fig-1">Fig. 1</xref> illustrates the ML<inline-formula id="ieqn-115"><mml:math id="mml-ieqn-115"><mml:mi>k</mml:mi></mml:math></inline-formula>NN performance variation of the proposed and comparative feature selection methods on the emotions dataset, where the number of selected features was gradually increased from 3 to 30. Three evaluation metrics (Hamming loss, Ranking loss, and Multi-label accuracy) were employed to analyze the behavior of each method under different feature subset sizes. As shown in the figure, all methods generally improve as the number of selected features increases, but the rate and stability of improvement differ substantially. For the Hamming loss, the proposed method consistently achieves the lowest values across all feature subset sizes, indicating robust generalization and effective elimination of redundant or noisy features. While MFSJMI and QPFS also exhibit a decreasing trend, their performance fluctuates more notably, suggesting less stability in feature selection. In terms of Ranking loss, the proposed method again demonstrates superior performance, maintaining lower loss values over the entire range of feature counts. This consistent improvement shows that the proposed formulation preserves the relative order between relevant and irrelevant labels more effectively than other methods, particularly when the feature space is limited. Finally, for Multi-label accuracy, the proposed method achieves the highest accuracy throughout the experiment, with a steady upward trend as the number of features increases. Even with a small number of features (e.g., fewer than 10), the proposed method already outperforms all baselines, highlighting its ability to identify highly informative and label-discriminative features early in the selection process.</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>ML<inline-formula id="ieqn-116"><mml:math id="mml-ieqn-116"><mml:mi>k</mml:mi></mml:math></inline-formula>NN performance variation of the number of selected features on (<bold>a</bold>) Hamming loss, (<bold>b</bold>) Ranking loss, and (<bold>c</bold>) Multi-label accuracy for the emotions dataset</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_74138-fig-1.tif"/>
</fig>
<p>To assess whether the observed MLDT performance differences among the compared feature selection methods are statistically significant across the seven datasets, we performed the Friedman&#x2013;Nemenyi non-parametric statistical test at a significance level of <inline-formula id="ieqn-117"><mml:math id="mml-ieqn-117"><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.05</mml:mn></mml:math></inline-formula>. The Friedman test yielded <inline-formula id="ieqn-118"><mml:math id="mml-ieqn-118"><mml:mi>p</mml:mi></mml:math></inline-formula>-values of 0.0069, 0.0897, and 0.0094 for Hamming loss, Ranking loss, and Multi-label accuracy, respectively. These results indicate that, while Ranking loss exhibits only marginal significance, the differences in Hamming loss and Multi-label accuracy are statistically significant, suggesting that the compared methods yield meaningfully distinct performance under these measures. <xref ref-type="fig" rid="fig-2">Fig. 2</xref> presents the corresponding Critical Distance (CD) diagrams obtained from the Nemenyi post-hoc analysis. The diagrams visualize the average ranks of the six methods and the critical distance at which rank differences become statistically significant [<xref ref-type="bibr" rid="ref-44">44</xref>]. As shown in the figure, the proposed method consistently achieves the best rank for all three evaluation metrics, demonstrating robust and stable performance. Complementing these visual results, <xref ref-type="table" rid="table-11">Table 11</xref> reports the complete average-rank matrix for each method and metric. The proposed method attains the lowest average ranks (1.00, 1.86, and 1.00 for Hamming loss, Ranking loss, and Multi-label accuracy, respectively), confirming that its improvements are systematic rather than dataset-specific. Overall, these findings validate the statistical reliability of the proposed feature selection approach.</p>
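<p>The Friedman&#x2013;Nemenyi procedure can be sketched as follows, assuming SciPy is available. The critical value <monospace>q_alpha = 2.850</monospace> for six methods at <monospace>alpha = 0.05</monospace> is taken from the standard Nemenyi table; the score matrix in the usage example is illustrative, not the data of this article:</p>

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

def friedman_nemenyi(scores, higher_is_better=True, q_alpha=2.850):
    """Friedman test plus Nemenyi critical distance for a
    (n_datasets x k_methods) matrix of one evaluation metric.
    q_alpha = 2.850 is the Studentized-range critical value for
    k = 6 methods at alpha = 0.05 (standard Nemenyi table)."""
    n, k = scores.shape
    # Rank methods within each dataset (rank 1 = best); ties get average ranks.
    ordered = -scores if higher_is_better else scores
    ranks = np.apply_along_axis(rankdata, 1, ordered)
    avg_ranks = ranks.mean(axis=0)
    # Friedman test: each argument is one method's scores across datasets.
    _, p_value = friedmanchisquare(*scores.T)
    # Nemenyi critical distance: rank gaps larger than this are significant.
    cd = q_alpha * np.sqrt(k * (k + 1) / (6.0 * n))
    return avg_ranks, p_value, cd
```

<p>With k = 6 methods and n = 7 datasets as above, the critical distance evaluates to 2.850 &#x00D7; sqrt(6 &#x00B7; 7/(6 &#x00B7; 7)) = 2.850 rank units, so two methods differ significantly when their average ranks differ by more than this value.</p>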
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>Critical Distance diagrams for six compared methods across seven datasets based on (<bold>a</bold>) Hamming loss, (<bold>b</bold>) Ranking loss, and (<bold>c</bold>) Multi-label accuracy. The diagrams are obtained from the Friedman&#x2013;Nemenyi statistical test with <inline-formula id="ieqn-119"><mml:math id="mml-ieqn-119"><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.05</mml:mn></mml:math></inline-formula>. The Friedman test yielded <inline-formula id="ieqn-120"><mml:math id="mml-ieqn-120"><mml:mi>p</mml:mi></mml:math></inline-formula>-values of 0.0069, 0.0897, and 0.0094 for each metric, respectively</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_74138-fig-2.tif"/>
</fig><table-wrap id="table-11">
<label>Table 11</label>
<caption>
<title>Average ranks of compared feature selection methods across seven datasets for three evaluation metrics</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Measure</th>
<th>AMI</th>
<th>MDMR</th>
<th>FIMF</th>
<th>QPFS</th>
<th>MFSJMI</th>
<th>Proposed</th>
</tr>
</thead>
<tbody>
<tr>
<td>Hamming loss</td>
<td>4.00</td>
<td>3.71</td>
<td>4.57</td>
<td>3.71</td>
<td>4.00</td>
<td>1.00</td>
</tr>
<tr>
<td>Ranking loss</td>
<td>3.57</td>
<td>3.71</td>
<td>4.57</td>
<td>3.00</td>
<td>4.29</td>
<td>1.86</td>
</tr>
<tr>
<td>Multi-label accuracy</td>
<td>4.00</td>
<td>3.71</td>
<td>4.00</td>
<td>4.14</td>
<td>4.14</td>
<td>1.00</td>
</tr>
<tr>
<td>Average (Overall)</td>
<td>3.86</td>
<td>3.71</td>
<td>4.38</td>
<td>3.62</td>
<td>4.14</td>
<td>1.29</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The computational efficiency of each feature selection method was evaluated in terms of total execution time, as summarized in <xref ref-type="table" rid="table-12">Table 12</xref>. All experiments were conducted using MATLAB R2021b on a desktop equipped with an Intel Core i7-11700 CPU (2.5 GHz) and 32 GB of RAM. The maximum number of iterations for the proposed method was set to 100. The proposed method demonstrates competitive or superior computational efficiency compared to most baselines. In particular, it consistently outperforms AMI, MDMR, and MFSJMI, which incur substantial computational overhead due to repeated mutual information estimation or graph construction. While the proposed approach is slightly slower than FIMF, whose operations are relatively lightweight, it remains comparable to QPFS and is substantially faster than MFSJMI, especially on larger datasets such as enron, genbase, and medical. These results indicate that the proposed optimization framework maintains high scalability without sacrificing selection quality, balancing effectiveness and computational cost efficiently across diverse multi-label datasets.</p>
<table-wrap id="table-12">
<label>Table 12</label>
<caption>
<title>Comparison of the execution time (in seconds) for different feature selection methods</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Data</th>
<th>AMI</th>
<th>MDMR</th>
<th>FIMF</th>
<th>QPFS</th>
<th>MFSJMI</th>
<th>Proposed</th>
</tr>
</thead>
<tbody>
<tr>
<td>cal500</td>
<td>0.49</td>
<td>2.51</td>
<td>0.01</td>
<td>0.09</td>
<td>14.83</td>
<td>0.21</td>
</tr>
<tr>
<td>emotions</td>
<td>0.57</td>
<td>0.12</td>
<td>0.00</td>
<td>0.03</td>
<td>0.04</td>
<td>0.03</td>
</tr>
<tr>
<td>enron</td>
<td>43.4</td>
<td>62.63</td>
<td>0.02</td>
<td>7.28</td>
<td>48.89</td>
<td>7.95</td>
</tr>
<tr>
<td>genbase</td>
<td>12.23</td>
<td>13.82</td>
<td>0.01</td>
<td>5.17</td>
<td>11.98</td>
<td>5.72</td>
</tr>
<tr>
<td>medical</td>
<td>25.38</td>
<td>42.61</td>
<td>0.01</td>
<td>10.46</td>
<td>42.70</td>
<td>11.62</td>
</tr>
<tr>
<td>slashdot</td>
<td>249.75</td>
<td>104.46</td>
<td>0.03</td>
<td>19.57</td>
<td>39.38</td>
<td>21.62</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Analysis of the Proposed Method</title>
<p><xref ref-type="fig" rid="fig-3">Figs. 3</xref>&#x2013;<xref ref-type="fig" rid="fig-5">5</xref> illustrate the sensitivity of the proposed method to its three hyperparameters, <inline-formula id="ieqn-121"><mml:math id="mml-ieqn-121"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula>, <inline-formula id="ieqn-122"><mml:math id="mml-ieqn-122"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula>, and <inline-formula id="ieqn-123"><mml:math id="mml-ieqn-123"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula>, across the cal500, emotions, and enron datasets, using multi-label accuracy with the ML<inline-formula id="ieqn-124"><mml:math id="mml-ieqn-124"><mml:mi>k</mml:mi></mml:math></inline-formula>NN classifier as the evaluation metric. Each figure contains two subplots: (a) fixes <inline-formula id="ieqn-125"><mml:math id="mml-ieqn-125"><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.1</mml:mn></mml:math></inline-formula> and varies <inline-formula id="ieqn-126"><mml:math id="mml-ieqn-126"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-127"><mml:math id="mml-ieqn-127"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula>; (b) fixes <inline-formula id="ieqn-128"><mml:math id="mml-ieqn-128"><mml:mi>&#x03B2;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.1</mml:mn></mml:math></inline-formula> and varies <inline-formula id="ieqn-129"><mml:math id="mml-ieqn-129"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-130"><mml:math id="mml-ieqn-130"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula>. All surfaces are presented in 3D to visualize the joint effect of the remaining two parameters on classification performance. Overall, the results demonstrate that the proposed method is highly robust to variations in hyperparameters, as long as their values are not excessively large or small. This indicates that the method does not require fine-grained hyperparameter tuning to achieve strong performance. 
In the <xref ref-type="fig" rid="fig-3">Fig. 3</xref>, the accuracy remains consistently stable across all combinations of <inline-formula id="ieqn-131"><mml:math id="mml-ieqn-131"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula>, <inline-formula id="ieqn-132"><mml:math id="mml-ieqn-132"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula>, and <inline-formula id="ieqn-133"><mml:math id="mml-ieqn-133"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula>, suggesting that the method is largely insensitive to hyperparameter changes on this dataset. In the <xref ref-type="fig" rid="fig-4">Fig. 4</xref>, similar to cal500, the classification accuracy is stable across all hyperparameter settings, confirming the method&#x2019;s consistent behavior regardless of parameter choice. In the <xref ref-type="fig" rid="fig-5">Fig. 5</xref>, a minor performance decrease is observed when <inline-formula id="ieqn-134"><mml:math id="mml-ieqn-134"><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mn>3</mml:mn></mml:msup></mml:math></inline-formula>, while the remaining parameter settings yield stable and strong results. This suggests that overly large values of <inline-formula id="ieqn-135"><mml:math id="mml-ieqn-135"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula>&#x2014;which weights the <italic>Q</italic> term&#x2014;may introduce redundancy in this dataset. 
Empirically, we find that effective values typically lie in the ranges <inline-formula id="ieqn-136"><mml:math id="mml-ieqn-136"><mml:mi>&#x03B1;</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> and <inline-formula id="ieqn-137"><mml:math id="mml-ieqn-137"><mml:mi>&#x03B3;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>. We therefore recommend performing grid search in a modest neighborhood around these intervals. For larger-scale problems or time-sensitive applications, alternative search strategies such as random search or Bayesian optimization may be employed to further reduce tuning overhead. These guidelines ensure a balance between computational cost and performance while maintaining robustness across datasets.</p>
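<p>A grid search over the recommended intervals can be sketched as follows. The <monospace>evaluate</monospace> function here is a hypothetical placeholder: in practice it would run the proposed feature selection with one (&#x03B1;, &#x03B2;, &#x03B3;) setting, train ML<italic>k</italic>NN on the selected features, and return validation multi-label accuracy.</p>

```python
import itertools
import numpy as np

# Hypothetical stand-in for the real validation objective; replace with a
# call that trains and scores the model at the given hyperparameters.
def evaluate(alpha, beta, gamma):
    return -(abs(np.log10(alpha)) + abs(np.log10(beta)) + abs(np.log10(gamma)))

# Log-spaced grids covering the empirically effective ranges noted above.
alphas = np.logspace(-3, 1, 5)   # alpha in [1e-3, 1e1]
betas = np.logspace(-3, 1, 5)    # beta  in [1e-3, 1e1]
gammas = np.logspace(-2, 2, 5)   # gamma in [1e-2, 1e2]

# Exhaustive search over the 5 x 5 x 5 grid; for larger grids, random
# search or Bayesian optimization would replace this loop.
best = max(itertools.product(alphas, betas, gammas), key=lambda p: evaluate(*p))
```

<p>Because the grids are log-spaced, a coarse pass followed by a finer search around the best cell keeps the number of training runs modest.</p>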
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>Multi-label accuracy of ML<inline-formula id="ieqn-138"><mml:math id="mml-ieqn-138"><mml:mi>k</mml:mi></mml:math></inline-formula>NN classification under varying hyperparameters on the cal500 dataset</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_74138-fig-3.tif"/>
</fig><fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>Multi-label accuracy of ML<inline-formula id="ieqn-139"><mml:math id="mml-ieqn-139"><mml:mi>k</mml:mi></mml:math></inline-formula>NN classification under varying hyperparameters on the emotions dataset</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_74138-fig-4.tif"/>
</fig><fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Multi-label accuracy of ML<inline-formula id="ieqn-140"><mml:math id="mml-ieqn-140"><mml:mi>k</mml:mi></mml:math></inline-formula>NN classification under varying hyperparameters on the enron dataset</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_74138-fig-5.tif"/>
</fig>
<p><xref ref-type="fig" rid="fig-6">Fig. 6</xref> displays the convergence behavior of the proposed method for all used datasets. The horizontal axis represents the number of iterations of the proposed algorithm, and the vertical axis shows the value of the objective function. The objective function value drops sharply within the first three iterations and appears to converge before the tenth iteration. This indicates that the proposed algorithm operates efficiently.</p>
<fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>Convergence rate of the proposed method</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_74138-fig-6a.tif"/>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_74138-fig-6b.tif"/>
</fig>
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Conclusions</title>
<p>In this study, we proposed a novel regression-based objective function for multi-label feature selection that explicitly incorporates mutual information between features and labels. By integrating a mutual information-aware structure into a convex regression formulation, the proposed method enables efficient optimization via projected gradient descent while preserving important statistical dependencies in the data. Empirical evaluations across multiple benchmark datasets and classifiers demonstrate that our approach consistently achieves superior or competitive performance compared to existing methods, confirming its effectiveness and robustness in various multi-label learning scenarios.</p>
<p>Despite its strong performance, the proposed method has several limitations. First, computing mutual information for all feature&#x2013;feature and label&#x2013;label pairs can be computationally expensive, especially for high-dimensional datasets. Exploring approximation techniques or sparse estimation strategies may significantly reduce this overhead. Second, the method involves several hyperparameters whose tuning can impact performance and requires careful consideration. Future work will focus on developing adaptive or data-driven hyperparameter selection mechanisms to further enhance usability and generalization.</p>
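<p>To make the quadratic cost concrete, a naive all-pairs mutual information computation over discrete columns looks like the following sketch. It performs on the order of d<sup>2</sup> MI estimates for d columns, which is exactly the overhead that approximation or sparse estimation strategies would reduce; the helper names are ours:</p>

```python
import numpy as np

def mutual_info(x, y):
    # MI of two discrete vectors from their empirical joint distribution
    # (natural log). Each call is cheap, but all-pairs usage multiplies it.
    mi = 0.0
    for a in np.unique(x):
        px = np.mean(x == a)
        for b in np.unique(y):
            pxy = np.mean((x == a) & (y == b))
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * np.mean(y == b)))
    return mi

def pairwise_mi(Z):
    # The quadratic-cost step discussed above: MI for every column pair,
    # exploiting symmetry to halve the number of estimates.
    d = Z.shape[1]
    M = np.zeros((d, d))
    for i in range(d):
        for j in range(i, d):
            M[i, j] = M[j, i] = mutual_info(Z[:, i], Z[:, j])
    return M
```

<p>For continuous features, a discretization step or a density-based estimator would precede this computation, adding further cost per pair.</p>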
</sec>
</body>
<back>
<ack>
<p>None.</p>
</ack>
<sec>
<title>Funding Statement</title>
<p>This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (RS-2020-NR049579).</p>
</sec>
<sec sec-type="data-availability">
<title>Availability of Data and Materials</title>
<p>Details of the dataset used in this study are provided in the main text. The dataset is accessible at <ext-link ext-link-type="uri" xlink:href="https://mulan.sourceforge.net/datasets-mlc.html">https://mulan.sourceforge.net/datasets-mlc.html</ext-link> (accessed on 20 July 2025).</p>
</sec>
<sec>
<title>Ethics Approval</title>
<p>Not applicable.</p>
</sec>
<sec sec-type="COI-statement">
<title>Conflicts of Interest</title>
<p>The author declares no conflicts of interest regarding the present study.</p>
</sec>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>T</given-names></string-name>, <string-name><surname>Tang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Li</surname> <given-names>B</given-names></string-name></person-group>. <article-title>Establishing two-dimensional dependencies for multi-label image classification</article-title>. <source>Appl Sci</source>. <year>2025</year>;<volume>15</volume>(<issue>5</issue>):<fpage>2845</fpage>. doi:<pub-id pub-id-type="doi">10.3390/app15052845</pub-id>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Hu</surname> <given-names>W</given-names></string-name>, <string-name><surname>Fan</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Yan</surname> <given-names>H</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>K</given-names></string-name></person-group>. <article-title>A survey of multi-label text classification under few-shot scenarios</article-title>. <source>Appl Sci</source>. <year>2025</year>;<volume>15</volume>(<issue>16</issue>):<fpage>8872</fpage>. doi:<pub-id pub-id-type="doi">10.3390/app15168872</pub-id>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Feng</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wei</surname> <given-names>R</given-names></string-name></person-group>. <article-title>Method of multi-label visual emotion recognition fusing fore-background features</article-title>. <source>Appl Sci</source>. <year>2024</year>;<volume>14</volume>(<issue>18</issue>):<fpage>8564</fpage>. doi:<pub-id pub-id-type="doi">10.21203/rs.3.rs-4752870/v1</pub-id>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Han</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Ma</surname> <given-names>C</given-names></string-name>, <string-name><surname>He</surname> <given-names>J</given-names></string-name>, <string-name><surname>Gao</surname> <given-names>W</given-names></string-name></person-group>. <article-title>Robust multilabel feature selection with label enhancement for fault diagnosis</article-title>. <source>IEEE Transact Syst Man, Cybernet: Syst</source>. <year>2025</year>;<volume>55</volume>(<issue>11</issue>):<fpage>7841</fpage>&#x2013;<lpage>50</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tsmc.2025.3598796</pub-id>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Huang</surname> <given-names>F</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>N</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>H</given-names></string-name>, <string-name><surname>Bao</surname> <given-names>W</given-names></string-name>, <string-name><surname>Yuan</surname> <given-names>D</given-names></string-name></person-group>. <article-title>Distributed online multi-label learning with privacy protection in internet of things</article-title>. <source>Appl Sci</source>. <year>2023</year>;<volume>13</volume>(<issue>4</issue>):<fpage>2713</fpage>. doi:<pub-id pub-id-type="doi">10.3390/app13042713</pub-id>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Elhamifar</surname> <given-names>E</given-names></string-name>, <string-name><surname>Vidal</surname> <given-names>R</given-names></string-name></person-group>. <article-title>Sparse subspace clustering: algorithm, theory, and applications</article-title>. <source>IEEE Transact Pattern Anal Mach Intelle</source>. <year>2013</year>;<volume>35</volume>(<issue>11</issue>):<fpage>2765</fpage>&#x2013;<lpage>81</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tpami.2013.57</pub-id>; <pub-id pub-id-type="pmid">24051734</pub-id></mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Lim</surname> <given-names>H</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>J</given-names></string-name>, <string-name><surname>Kim</surname> <given-names>DW</given-names></string-name></person-group>. <article-title>Optimization approach for feature selection in multi-label classification</article-title>. <source>Pattern Recognition Letters</source>. <year>2017</year>;<volume>89</volume>:<fpage>25</fpage>&#x2013;<lpage>30</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.patrec.2017.02.004</pub-id>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Lee</surname> <given-names>J</given-names></string-name>, <string-name><surname>Kim</surname> <given-names>DW</given-names></string-name></person-group>. <article-title>Fast multi-label feature selection based on information-theoretic feature ranking</article-title>. <source>Pattern Recognit</source>. <year>2015</year>;<volume>48</volume>:<fpage>2761</fpage>&#x2013;<lpage>71</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.patcog.2015.04.009</pub-id>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Peng</surname> <given-names>H</given-names></string-name>, <string-name><surname>Long</surname> <given-names>F</given-names></string-name>, <string-name><surname>Ding</surname> <given-names>C</given-names></string-name></person-group>. <article-title>Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy</article-title>. <source>IEEE Transact Pattern Anal Mach Intell</source>. <year>2005</year>;<volume>27</volume>:<fpage>1226</fpage>&#x2013;<lpage>38</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tpami.2005.159</pub-id>; <pub-id pub-id-type="pmid">16119262</pub-id></mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Nie</surname> <given-names>F</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Cai</surname> <given-names>X</given-names></string-name>, <string-name><surname>Ding</surname> <given-names>C</given-names></string-name></person-group>. <chapter-title>Efficient and robust feature selection via joint <italic>l<sub>2,1</sub></italic>-norms minimization</chapter-title>. In: <source>Advances in neural information processing systems 23 (NIPS 2010)</source>. <publisher-loc>Red Hook, NY, USA</publisher-loc>: <publisher-name>Curran Associates, Inc.</publisher-name>; <year>2010</year>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Pereira</surname> <given-names>RB</given-names></string-name>, <string-name><surname>Plastino</surname> <given-names>A</given-names></string-name>, <string-name><surname>Zadrozny</surname> <given-names>B</given-names></string-name>, <string-name><surname>Merschmann</surname> <given-names>LH</given-names></string-name></person-group>. <article-title>Categorizing feature selection methods for multi-label classification</article-title>. <source>Artificial Intell Rev</source>. <year>2018</year>;<volume>49</volume>(<issue>1</issue>):<fpage>57</fpage>&#x2013;<lpage>78</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s10462-016-9516-4</pub-id>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Tsoumakas</surname> <given-names>G</given-names></string-name>, <string-name><surname>Vlahavas</surname> <given-names>I</given-names></string-name></person-group>. <chapter-title>Random k-labelsets: an ensemble method for multilabel classification</chapter-title>. In: <source>Machine learning: ECML 2007 (ECML 2007)</source>. <publisher-loc>Berlin/Heidelberg, Germany</publisher-loc>: <publisher-name>Springer</publisher-name>; <year>2007</year>. p. <fpage>406</fpage>&#x2013;<lpage>17</lpage> doi: <pub-id pub-id-type="doi">10.1007/978-3-540-74958-5_38</pub-id>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Read</surname> <given-names>J</given-names></string-name></person-group>. <article-title>A pruned problem transformation method for multi-label classification</article-title>. In: <conf-name>Proceedings of the New Zealand Computer Science Research Student Conference; 2008 Apr 14&#x2013;18</conf-name>; <publisher-loc>Christchurch, New Zealand</publisher-loc>. p. <fpage>143</fpage>&#x2013;<lpage>50</lpage>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Lee</surname> <given-names>J</given-names></string-name>, <string-name><surname>Kim</surname> <given-names>DW</given-names></string-name></person-group>. <article-title>Feature selection for multi-label classification using multivariate mutual information</article-title>. <source>Pattern Recognit Letters</source>. <year>2013</year>;<volume>34</volume>:<fpage>349</fpage>&#x2013;<lpage>57</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.patrec.2012.10.005</pub-id>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Lee</surname> <given-names>J</given-names></string-name>, <string-name><surname>Kim</surname> <given-names>DW</given-names></string-name></person-group>. <article-title>Mutual information-based multi-label feature selection using interaction information</article-title>. <source>Expert Syst Applicat</source>. <year>2015</year>;<volume>42</volume>:<fpage>2013</fpage>&#x2013;<lpage>25</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.eswa.2014.09.063</pub-id>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Lin</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Hu</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Duan</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Multi-label feature selection based on max-dependency and min-redundancy</article-title>. <source>Neurocomputing</source>. <year>2015</year>;<volume>168</volume>:<fpage>92</fpage>&#x2013;<lpage>103</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.neucom.2015.06.010</pub-id>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>P</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>G</given-names></string-name>, <string-name><surname>Song</surname> <given-names>J</given-names></string-name></person-group>. <article-title>MFSJMI: multi-label feature selection considering join mutual information and interaction weight</article-title>. <source>Pattern Recognit</source>. <year>2023</year>;<volume>138</volume>:<fpage>109378</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.patcog.2023.109378</pub-id>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Clare</surname> <given-names>A</given-names></string-name>, <string-name><surname>King</surname> <given-names>RD</given-names></string-name></person-group>. <chapter-title>Knowledge discovery in multi-label phenotype data</chapter-title>. In: <source>Principles of data mining and knowledge discovery (PKDD 2001)</source>. <publisher-loc>Berlin/Heidelberg, Germany</publisher-loc>: <publisher-name>Springer</publisher-name>; <year>2001</year>. p. <fpage>42</fpage>&#x2013;<lpage>53</lpage> doi: <pub-id pub-id-type="doi">10.1007/3-540-44794-6_4</pub-id>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Fan</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>B</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Weng</surname> <given-names>W</given-names></string-name>, <string-name><surname>Lan</surname> <given-names>W</given-names></string-name></person-group>. <article-title>Multi-label feature selection based on label correlations and feature redundancy</article-title>. <source>Knowl Based Syst</source>. <year>2022</year>;<volume>241</volume>:<fpage>108256</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.knosys.2022.108256</pub-id>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Hu</surname> <given-names>L</given-names></string-name>, <string-name><surname>Gao</surname> <given-names>W</given-names></string-name></person-group>. <article-title>Multi-label feature selection via robust flexible sparse regularization</article-title>. <source>Pattern Recognit</source>. <year>2023</year>;<volume>134</volume>:<fpage>109074</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.patcog.2022.109074</pub-id>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Fan</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Tang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>P</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Du</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>Learning correlation information for multi-label feature selection</article-title>. <source>Pattern Recognit</source>. <year>2024</year>;<volume>145</volume>:<fpage>109899</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.patcog.2023.109899</pub-id>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Hu</surname> <given-names>L</given-names></string-name>, <string-name><surname>Gao</surname> <given-names>W</given-names></string-name></person-group>. <article-title>Label correlations variation for robust multi-label feature selection</article-title>. <source>Informat Sci</source>. <year>2022</year>;<volume>609</volume>:<fpage>1075</fpage>&#x2013;<lpage>97</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.ins.2022.07.154</pub-id>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Hu</surname> <given-names>L</given-names></string-name>, <string-name><surname>Gao</surname> <given-names>L</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>P</given-names></string-name>, <string-name><surname>Gao</surname> <given-names>W</given-names></string-name></person-group>. <article-title>Feature-specific mutual information variation for multi-label feature selection</article-title>. <source>Inf Sci</source>. <year>2022</year>;<volume>593</volume>:<fpage>449</fpage>&#x2013;<lpage>71</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.ins.2022.02.024</pub-id>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Dai</surname> <given-names>J</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>C</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Multi-label feature selection by strongly relevant label gain and label mutual aid</article-title>. <source>Pattern Recognit</source>. <year>2024</year>;<volume>145</volume>:<fpage>109945</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.patcog.2023.109945</pub-id>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Faraji</surname> <given-names>M</given-names></string-name>, <string-name><surname>Seyedi</surname> <given-names>SA</given-names></string-name>, <string-name><surname>Tab</surname> <given-names>FA</given-names></string-name>, <string-name><surname>Mahmoodi</surname> <given-names>R</given-names></string-name></person-group>. <article-title>Multi-label feature selection with global and local label correlation</article-title>. <source>Expert Syst Appl</source>. <year>2024</year>;<volume>246</volume>:<fpage>123198</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.eswa.2024.123198</pub-id>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>He</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>C</given-names></string-name>, <string-name><surname>Guo</surname> <given-names>L</given-names></string-name>, <string-name><surname>Ding</surname> <given-names>W</given-names></string-name></person-group>. <article-title>Multi-label feature selection based on correlation label enhancement</article-title>. <source>Inf Sci</source>. <year>2023</year>;<volume>647</volume>:<fpage>119526</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.ins.2023.119526</pub-id>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Yang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>H</given-names></string-name>, <string-name><surname>Mi</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Luo</surname> <given-names>C</given-names></string-name>, <string-name><surname>Horng</surname> <given-names>SJ</given-names></string-name>, <string-name><surname>Li</surname> <given-names>T</given-names></string-name></person-group>. <article-title>Multi-label feature selection based on stable label relevance and label-specific features</article-title>. <source>Inf Sci</source>. <year>2023</year>;<volume>648</volume>:<fpage>119525</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.ins.2023.119525</pub-id>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>ML</given-names></string-name>, <string-name><surname>Pe&#x00F1;a</surname> <given-names>JM</given-names></string-name>, <string-name><surname>Robles</surname> <given-names>V</given-names></string-name></person-group>. <article-title>Feature selection for multi-label naive Bayes classification</article-title>. <source>Inf Sci</source>. <year>2009</year>;<volume>179</volume>:<fpage>3218</fpage>&#x2013;<lpage>29</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.ins.2009.06.010</pub-id>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Cho</surname> <given-names>DH</given-names></string-name>, <string-name><surname>Moon</surname> <given-names>SH</given-names></string-name>, <string-name><surname>Kim</surname> <given-names>YH</given-names></string-name></person-group>. <article-title>Genetic feature selection applied to KOSPI and cryptocurrency price prediction</article-title>. <source>Mathematics</source>. <year>2021</year>;<volume>9</volume>(<issue>20</issue>):<fpage>2574</fpage>. doi:<pub-id pub-id-type="doi">10.3390/math9202574</pub-id>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Das</surname> <given-names>H</given-names></string-name>, <string-name><surname>Prajapati</surname> <given-names>S</given-names></string-name>, <string-name><surname>Gourisaria</surname> <given-names>MK</given-names></string-name>, <string-name><surname>Pattanayak</surname> <given-names>RM</given-names></string-name>, <string-name><surname>Alameen</surname> <given-names>A</given-names></string-name>, <string-name><surname>Kolhar</surname> <given-names>M</given-names></string-name></person-group>. <article-title>Feature selection using golden jackal optimization for software fault prediction</article-title>. <source>Mathematics</source>. <year>2023</year>;<volume>11</volume>(<issue>11</issue>):<fpage>2438</fpage>. doi:<pub-id pub-id-type="doi">10.3390/math11112438</pub-id>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Guo</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Kasihmuddin</surname> <given-names>MSM</given-names></string-name>, <string-name><surname>Zamri</surname> <given-names>NE</given-names></string-name>, <string-name><surname>Li</surname> <given-names>J</given-names></string-name>, <string-name><surname>Romli</surname> <given-names>NA</given-names></string-name>, <string-name><surname>Mansor</surname> <given-names>MA</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Logic mining method via hybrid discrete Hopfield neural network</article-title>. <source>Comput Ind Eng</source>. <year>2025</year>;<volume>206</volume>:<fpage>111200</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.cie.2025.111200</pub-id>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Romli</surname> <given-names>NA</given-names></string-name>, <string-name><surname>Zulkepli</surname> <given-names>NFS</given-names></string-name>, <string-name><surname>Kasihmuddin</surname> <given-names>MSM</given-names></string-name>, <string-name><surname>Karim</surname> <given-names>SA</given-names></string-name>, <string-name><surname>Jamaludin</surname> <given-names>SZM</given-names></string-name>, <string-name><surname>Rusdi</surname> <given-names>N</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>An optimized logic mining method for data processing through higher-order satisfiability representation in discrete Hopfield neural network</article-title>. <source>Appl Soft Comput</source>. <year>2025</year>;<volume>184</volume>(<issue>B</issue>):<fpage>113759</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.asoc.2025.113759</pub-id>.</mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Bao</surname> <given-names>B</given-names></string-name>, <string-name><surname>Tang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Bao</surname> <given-names>H</given-names></string-name>, <string-name><surname>Hua</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>M</given-names></string-name></person-group>. <article-title>Simplified discrete two-neuron Hopfield neural network and FPGA implementation</article-title>. <source>IEEE Trans Ind Electron</source>. <year>2025</year>;<volume>72</volume>(<issue>4</issue>):<fpage>4105</fpage>&#x2013;<lpage>15</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tie.2024.3451052</pub-id>.</mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Huang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Hu</surname> <given-names>W</given-names></string-name>, <string-name><surname>Lu</surname> <given-names>B</given-names></string-name>, <string-name><surname>Fan</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>X</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Application of label correlation in multi-label classification: a survey</article-title>. <source>Appl Sci</source>. <year>2024</year>;<volume>14</volume>(<issue>19</issue>):<fpage>9034</fpage>. doi:<pub-id pub-id-type="doi">10.3390/app14199034</pub-id>.</mixed-citation></ref>
<ref id="ref-35"><label>[35]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Klonecki</surname> <given-names>T</given-names></string-name>, <string-name><surname>Teisseyre</surname> <given-names>P</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Cost-constrained feature selection in multilabel classification using an information-theoretic approach</article-title>. <source>Pattern Recognit</source>. <year>2023</year>;<volume>141</volume>(<issue>1</issue>):<fpage>1</fpage>&#x2013;<lpage>18</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.patcog.2023.109605</pub-id>.</mixed-citation></ref>
<ref id="ref-36"><label>[36]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>ML</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>ZH</given-names></string-name></person-group>. <article-title>ML-KNN: a lazy learning approach to multi-label learning</article-title>. <source>Pattern Recognit</source>. <year>2007</year>;<volume>40</volume>:<fpage>2038</fpage>&#x2013;<lpage>48</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.patcog.2006.12.019</pub-id>.</mixed-citation></ref>
<ref id="ref-37"><label>[37]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Turnbull</surname> <given-names>D</given-names></string-name>, <string-name><surname>Barrington</surname> <given-names>L</given-names></string-name>, <string-name><surname>Torres</surname> <given-names>D</given-names></string-name>, <string-name><surname>Lanckriet</surname> <given-names>G</given-names></string-name></person-group>. <article-title>Semantic annotation and retrieval of music and sound effects</article-title>. <source>IEEE Trans Audio Speech Lang Process</source>. <year>2008</year>;<volume>16</volume>(<issue>2</issue>):<fpage>467</fpage>&#x2013;<lpage>76</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tasl.2007.913750</pub-id>.</mixed-citation></ref>
<ref id="ref-38"><label>[38]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Trohidis</surname> <given-names>K</given-names></string-name>, <string-name><surname>Tsoumakas</surname> <given-names>G</given-names></string-name>, <string-name><surname>Kalliris</surname> <given-names>G</given-names></string-name>, <string-name><surname>Vlahavas</surname> <given-names>IP</given-names></string-name></person-group>. <article-title>Multilabel classification of music into emotions</article-title>. In: <conf-name>International Conference on Music Information Retrieval</conf-name>. <publisher-loc>Philadelphia, PA, USA</publisher-loc>: <publisher-name>ISMIR Society</publisher-name>; <year>2008</year>. p. <fpage>325</fpage>&#x2013;<lpage>30</lpage>.</mixed-citation></ref>
<ref id="ref-39"><label>[39]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Pestian</surname> <given-names>J</given-names></string-name>, <string-name><surname>Brew</surname> <given-names>C</given-names></string-name>, <string-name><surname>Matykiewicz</surname> <given-names>P</given-names></string-name>, <string-name><surname>Hovermale</surname> <given-names>DJ</given-names></string-name>, <string-name><surname>Johnson</surname> <given-names>N</given-names></string-name>, <string-name><surname>Cohen</surname> <given-names>KB</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>A shared task involving multi-label classification of clinical free text</article-title>. In: BioNLP &#x2019;07: <conf-name>Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing</conf-name>. <publisher-loc>Stroudsburg, PA, USA</publisher-loc>: <publisher-name>ACL</publisher-name>; <year>2007</year>. p. <fpage>97</fpage>&#x2013;<lpage>104</lpage>.</mixed-citation></ref>
<ref id="ref-40"><label>[40]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Leskovec</surname> <given-names>J</given-names></string-name>, <string-name><surname>Huttenlocher</surname> <given-names>D</given-names></string-name>, <string-name><surname>Kleinberg</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Signed networks in social media</article-title>. In: CHI &#x2019;10: <conf-name>Proceedings of the SIGCHI Conference on Human Factors in Computing Systems</conf-name>. <publisher-loc>New York, NY, USA</publisher-loc>: <publisher-name>ACM</publisher-name>; <year>2010</year>. p. <fpage>1361</fpage>&#x2013;<lpage>70</lpage>.</mixed-citation></ref>
<ref id="ref-41"><label>[41]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Lee</surname> <given-names>J</given-names></string-name>, <string-name><surname>Kim</surname> <given-names>DW</given-names></string-name></person-group>. <article-title>SCLS: multi-label feature selection based on scalable criterion for large label set</article-title>. <source>Pattern Recognit</source>. <year>2017</year>;<volume>66</volume>:<fpage>342</fpage>&#x2013;<lpage>52</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.patcog.2017.01.014</pub-id>.</mixed-citation></ref>
<ref id="ref-42"><label>[42]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Lee</surname> <given-names>J</given-names></string-name>, <string-name><surname>Lim</surname> <given-names>H</given-names></string-name>, <string-name><surname>Kim</surname> <given-names>DW</given-names></string-name></person-group>. <article-title>Approximating mutual information for multi-label feature selection</article-title>. <source>Electron Lett</source>. <year>2012</year>;<volume>48</volume>(<issue>15</issue>):<fpage>929</fpage>&#x2013;<lpage>30</lpage>. doi:<pub-id pub-id-type="doi">10.1049/el.2012.1600</pub-id>.</mixed-citation></ref>
<ref id="ref-43"><label>[43]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Cano</surname> <given-names>A</given-names></string-name>, <string-name><surname>Luna</surname> <given-names>JM</given-names></string-name>, <string-name><surname>Gibaja</surname> <given-names>EL</given-names></string-name>, <string-name><surname>Ventura</surname> <given-names>S</given-names></string-name></person-group>. <article-title>LAIM discretization for multi-label data</article-title>. <source>Inf Sci</source>. <year>2016</year>;<volume>330</volume>:<fpage>370</fpage>&#x2013;<lpage>84</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.ins.2015.10.032</pub-id>.</mixed-citation></ref>
<ref id="ref-44"><label>[44]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Dem&#x0161;ar</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Statistical comparisons of classifiers over multiple data sets</article-title>. <source>J Mach Learn Res</source>. <year>2006</year>;<volume>7</volume>:<fpage>1</fpage>&#x2013;<lpage>30</lpage>.</mixed-citation></ref>
</ref-list>
</back></article>