<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">63465</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2025.063465</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Neighbor Displacement-Based Enhanced Synthetic Oversampling for Multiclass Imbalanced Data</article-title>
<alt-title alt-title-type="left-running-head">Neighbor Displacement-Based Enhanced Synthetic Oversampling for Multiclass Imbalanced Data</alt-title>
<alt-title alt-title-type="right-running-head">Neighbor Displacement-Based Enhanced Synthetic Oversampling for Multiclass Imbalanced Data</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Putrama</surname><given-names>I Made</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><xref ref-type="aff" rid="aff-2">2</xref><xref rid="cor1" ref-type="corresp">&#x002A;</xref><email>putrama.imade@edu.bme.hu</email></contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>Martinek</surname><given-names>P&#x00E9;ter</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<aff id="aff-1"><label>1</label><institution>Department of Electronics Technology, Faculty of Electrical Engineering and Informatics, Budapest University of Technology and Economics</institution>, <addr-line>Budapest, 1111</addr-line>, <country>Hungary</country></aff>
<aff id="aff-2"><label>2</label><institution>Department of Informatics, Faculty of Engineering and Vocational, Universitas Pendidikan Ganesha</institution>, <addr-line>Singaraja, 81116</addr-line>, <country>Indonesia</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: I Made Putrama. Email: <email>putrama.imade@edu.bme.hu</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2025</year>
</pub-date>
<pub-date date-type="pub" publication-format="electronic">
<day>19</day><month>05</month><year>2025</year>
</pub-date>
<volume>83</volume>
<issue>3</issue>
<fpage>5699</fpage>
<lpage>5727</lpage>
<history>
<date date-type="received">
<day>15</day>
<month>1</month>
<year>2025</year>
</date>
<date date-type="accepted">
<day>09</day>
<month>4</month>
<year>2025</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2025 The Authors.</copyright-statement>
<copyright-year>2025</copyright-year>
<copyright-holder>Published by Tech Science Press.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_63465.pdf"></self-uri>
<abstract>
<p>Imbalanced multiclass datasets pose challenges for machine learning algorithms. They often contain minority classes that are important for accurate predictions, but when such data are sparsely distributed and overlap with data points from other classes, noise is introduced. As a result, existing resampling methods may fail to preserve the original data patterns, further degrading data quality and model performance. This paper introduces Neighbor Displacement-based Enhanced Synthetic Oversampling (NDESO), a hybrid method that integrates a data displacement strategy with a resampling technique to achieve data balance. It first computes the average distance of noisy data points to their neighbors and adjusts their positions toward the class center before applying random oversampling. Extensive evaluations compare NDESO with 14 alternatives using nine classifiers on synthetic and 20 real-world datasets with varying imbalance ratios. The evaluation was structured into two test groups. The first examines the effects of k-neighbor variations and distance metrics, compares the resampled data distributions against those of the alternatives, and determines the most suitable oversampling technique for data balancing. The second assesses the overall performance of the NDESO algorithm, focusing on G-mean and statistical significance. The results demonstrate that our method is robust to a wide range of parameter variations and achieves an average G-mean score of 0.90, among the highest of all methods tested. It also attains the lowest mean rank of 2.88, indicating statistically significant improvements over existing approaches. This advantage underscores its potential for effectively handling data imbalance in practical scenarios.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Neighbor</kwd>
<kwd>displacement</kwd>
<kwd>synthetic</kwd>
<kwd>oversampling</kwd>
<kwd>multiclass</kwd>
<kwd>imbalanced data</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>Imbalanced data remains a persistent challenge across enterprise domains, especially in classification tasks [<xref ref-type="bibr" rid="ref-1">1</xref>&#x2013;<xref ref-type="bibr" rid="ref-3">3</xref>]. In an imbalanced dataset, the majority class contains far more samples than the one or more heavily underrepresented minority classes. This imbalance makes it substantially harder for standard machine learning algorithms to identify the minority classes accurately, even though correctly classifying these instances is often critical in many applications [<xref ref-type="bibr" rid="ref-4">4</xref>]. In the financial sector, imbalanced data classification poses significant obstacles, especially in fraud detection. Fraudulent transactions occur far less frequently than legitimate ones, resulting in highly skewed datasets, and failure to detect fraudulent activity accurately can cause significant financial losses to organizations [<xref ref-type="bibr" rid="ref-5">5</xref>]. Similarly, in manufacturing, as automation advances and production complexity increases, monitoring equipment performance becomes essential to prevent failures and ensure safety. Collecting equipment data and detecting process patterns and anomalies to support maintenance strategies is therefore crucial [<xref ref-type="bibr" rid="ref-6">6</xref>]. In practice, however, equipment failures occur far less frequently than normal operation, again yielding imbalanced datasets [<xref ref-type="bibr" rid="ref-7">7</xref>,<xref ref-type="bibr" rid="ref-8">8</xref>]. When classifying such data, relying solely on accuracy can be misleading. For example, on a dataset with a 99% majority class and a 1% minority class, a model that classifies every instance as the majority class achieves 99% accuracy while failing to detect a single critical minority-class instance [<xref ref-type="bibr" rid="ref-9">9</xref>].</p>
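<p>The accuracy pitfall described above can be made concrete with a short, illustrative sketch (ours, not taken from the cited studies): on a 99%/1% split, a degenerate model that always predicts the majority class scores 99% accuracy, yet its G-mean, the geometric mean of per-class recalls reported later in this paper, collapses to zero.</p>

```python
# Illustrative only (not from the paper's experiments): a 99%/1% split
# where a degenerate majority-class predictor looks excellent by accuracy.
import math

y_true = [0] * 990 + [1] * 10   # 990 majority samples, 10 minority samples
y_pred = [0] * 1000             # model always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(cls):
    """Fraction of true members of `cls` that the model recovers."""
    idx = [i for i, t in enumerate(y_true) if t == cls]
    return sum(y_pred[i] == cls for i in idx) / len(idx)

# G-mean: geometric mean of per-class recalls; zero recall on any
# single class drives the whole score to zero.
g_mean = math.sqrt(recall(0) * recall(1))

print(accuracy)  # 0.99 -- looks excellent
print(g_mean)    # 0.0  -- exposes total failure on the minority class
```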
<p>To deal with imbalanced data, various strategies have been developed for classification tasks, with methods categorized as data-level, algorithm-level, or a combination of both [<xref ref-type="bibr" rid="ref-10">10</xref>&#x2013;<xref ref-type="bibr" rid="ref-12">12</xref>]. These approaches aim to produce accurate, intelligent diagnostic results in enterprise applications. In industrial scenarios, however, the complexity of data acquisition under the conditions described above often leads to small-sample problems, making it challenging to establish highly accurate models when insufficient data is available [<xref ref-type="bibr" rid="ref-13">13</xref>]. Additionally, the data may contain errors or anomalies, and noise in imbalanced data significantly degrades the performance of diagnostic algorithms [<xref ref-type="bibr" rid="ref-14">14</xref>,<xref ref-type="bibr" rid="ref-15">15</xref>]. Moreover, while existing oversampling, undersampling, and hybrid methods have proven effective for imbalanced data, they do not always produce synthetic samples that are free of noise or class overlap. This is especially prominent for data that naturally contains many overlapping points. Under such conditions, methods that generate new samples from the unclean original distribution tend to produce increasingly noisy samples, causing the generated data to deviate significantly from the original pattern and ultimately degrading the accuracy and effectiveness of the model.</p>
<p>In this paper, we introduce a hybrid resampling method called Neighbor Displacement-based Enhanced Synthetic Oversampling (NDESO), which corrects noisy data points within each class by moving them closer to their class centroids before oversampling is performed to balance the data distribution. The method identifies data points whose <italic>k</italic>-neighborhoods are dominated by different classes. The displacement is performed by analyzing the distances from such a data point to its neighbors and then moving the point closer to the corresponding class centroid based on those distances. This adjustment aligns each displaced point more closely with the characteristics of its class while preserving its original class label. With the noisy data repositioned, class balancing is performed by oversampling the minority classes, improving outcomes while maintaining the patterns of the original data.</p>
<p>This study adopts a methodology similar to that proposed by [<xref ref-type="bibr" rid="ref-16">16</xref>], with the workflow illustrated in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>. The process starts with data exploration, followed by data preprocessing. Since the datasets originate from a widely used public repository, they are already cleaned and thus require minimal additional preprocessing; the main adjustments involve encoding categorical columns and rescaling the data. A resampling method is then applied to address the imbalance in the target variable, followed by the application of a classifier. Several metrics are measured, with the G-mean score reported in this study. The proposed method is evaluated through a comparative analysis against 14 different resampling methods using nine classifiers on 20 real-world datasets. Statistical tests indicate that our proposed method achieves a significantly lower mean rank in the critical-difference analysis, demonstrating a notable performance improvement over the other methods.</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>Research methodology</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_63465-fig-1.tif"/>
</fig>
<p>The remainder of this manuscript is structured as follows: <xref ref-type="sec" rid="s2">Section 2</xref> explores existing research related to the field. <xref ref-type="sec" rid="s3">Section 3</xref> outlines the objectives of our study. <xref ref-type="sec" rid="s4">Section 4</xref> describes the method used for our proposed framework. Experiments covering data collection, system specifications, testing procedures, and results are described in <xref ref-type="sec" rid="s5">Section 5</xref>, which thoroughly demonstrates the efficacy of our approach. <xref ref-type="sec" rid="s6">Section 6</xref> provides a summary of our findings and offers suggestions for future research directions. Finally, <xref ref-type="sec" rid="s7">Section 7</xref> provides the conclusion of our work.</p>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Works</title>
<p>Real-world classification datasets often have imbalanced class distributions, with many samples in one class and few in others. Many datasets are multiclass or multilabel, involving more than two classes or allowing instances to belong to multiple classes. To address this imbalance problem, existing approaches can be grouped into data-level, algorithm-level, and mixed methods that combine both. In this study, our approach is oriented towards data-level methods.</p>
<p>Data-level methods use a resampling strategy: undersampling, oversampling, or hybrid sampling. Random undersampling (RUS) and random oversampling (ROS) remove or add samples, respectively, until the classes are balanced [<xref ref-type="bibr" rid="ref-17">17</xref>,<xref ref-type="bibr" rid="ref-18">18</xref>]. RUS is the most straightforward undersampling technique, in which majority-class samples are removed until the class distribution is balanced. Another widely used undersampling technique is NearMiss, introduced by [<xref ref-type="bibr" rid="ref-19">19</xref>], which addresses class imbalance by selectively removing samples from the majority class: when it detects proximity between instances of different classes, it eliminates majority-class instances to enhance their separation. However, undersampling may inadvertently discard valuable class instances and thus lose information [<xref ref-type="bibr" rid="ref-8">8</xref>]. Consequently, various methods have been proposed to improve data cleaning by identifying and removing redundant patterns and noise in the dataset, thereby enhancing classifier performance. Edited Nearest Neighbors (ENN), Condensed Nearest Neighbors (CNN), and Tomek links are methods based on this technique [<xref ref-type="bibr" rid="ref-20">20</xref>]. They remove instances near the decision boundary between classes, enhancing class separation and improving classifier performance.</p>
<p>Unlike RUS, which removes samples, ROS randomly replicates minority samples until the class distribution is balanced. This method has been criticized for adding data without contributing new information, which can lead to overfitting. To handle this problem, the well-known Synthetic Minority Over-sampling TEchnique (SMOTE) was proposed [<xref ref-type="bibr" rid="ref-21">21</xref>]. SMOTE creates synthetic samples rather than directly copying minority-class instances. It generates additional instances through interpolation within the <italic>k</italic>-nearest neighbors of a minority-class sample [<xref ref-type="bibr" rid="ref-22">22</xref>]. First, a sample (<italic>x</italic>) is selected from the minority class, followed by another sample (<italic>y</italic>) from its <italic>k</italic>-nearest neighbors within the same class. The new sample <italic>x&#x2019;</italic> is then created through linear interpolation using the following formula:
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:msup><mml:mi>x</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi>x</mml:mi><mml:mo>+</mml:mo><mml:mrow><mml:mtext>rand</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2217;</mml:mo><mml:mrow><mml:mo>|</mml:mo><mml:mi>x</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>y</mml:mi><mml:mo>|</mml:mo></mml:mrow></mml:math></disp-formula>
The iterative process of creating new synthetic samples is repeated until the desired amount of oversampling is achieved. Despite its straightforward and innovative approach, SMOTE has a key shortcoming known as overgeneralization: it ignores within-class imbalance, which tends to increase class overlap. Furthermore, generating new synthetic points to address the sparsity of the minority class may inadvertently introduce uninformative and noisy instances that do not accurately represent the underlying data patterns [<xref ref-type="bibr" rid="ref-23">23</xref>]. These extraneous patterns can contribute to overfitting, reducing the model&#x2019;s generalizability. Since then, other variations of the SMOTE approach have been proposed, including Borderline-SMOTE, KMeans-SMOTE, and SMOTE with Support Vector Machine (SVM-SMOTE) [<xref ref-type="bibr" rid="ref-24">24</xref>]; the borderline-focused variants concentrate on samples within the boundary area between classes, creating synthetic samples only near the dividing line between two classes to avoid overfitting. Other variants combine SMOTE with noise-removal techniques, such as SMOTE-Tomek and SMOTE with Edited Nearest Neighbors (SMOTE-ENN), which oversample minority classes while also cleaning noisy instances [<xref ref-type="bibr" rid="ref-8">8</xref>]. Another recent hybrid variant, proposed by [<xref ref-type="bibr" rid="ref-23">23</xref>], uses SMOTE for oversampling and Edited Centroid Displacement-based <italic>k</italic>-Nearest Neighbors (ECDNN) for undersampling, and is known as SMOTE-CDNN. It aims to minimize the within-cluster distance problem by using centroid displacement for class probability estimation. 
Other techniques similar to SMOTE, including Adaptive Synthetic Sampling (ADASYN) and stacking-based algorithms, have been proposed to address multiclass imbalance problems, outperforming earlier techniques in accuracy and sensitivity. ADASYN, introduced by [<xref ref-type="bibr" rid="ref-25">25</xref>], generates new samples by prioritizing instances according to the density distribution of the minority class: it creates more samples in regions where the minority-class density is low and fewer where the density is higher.</p>
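<p>As an illustrative sketch (not a reference implementation from any library), the SMOTE interpolation step of Eq. (1) can be written as follows. Note that common implementations interpolate along the direction (y &#x2212; x), so that the synthetic point lies on the segment between the two samples; neighbor selection is omitted here for brevity.</p>

```python
# Illustrative sketch of SMOTE-style interpolation (cf. Eq. (1)); not a
# reference implementation. Common implementations interpolate along
# (y - x), placing the synthetic point on the segment between minority
# sample x and its same-class neighbor y.
import random

random.seed(42)  # reproducibility for the illustration

def smote_sample(x, y):
    """Generate one synthetic point between minority sample x and
    one of its k-nearest same-class neighbors y."""
    gap = random.random()  # rand(0, 1)
    return [xi + gap * (yi - xi) for xi, yi in zip(x, y)]

x = [1.0, 2.0]   # a minority-class sample
y = [3.0, 4.0]   # one of its same-class nearest neighbors
x_new = smote_sample(x, y)
# Every coordinate of x_new lies between the corresponding
# coordinates of x and y.
```

Repeating this step over randomly chosen minority samples and neighbors yields the desired amount of oversampling.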
<p>Despite the efficacy demonstrated by the earlier methods, our observations indicate that these techniques remain vulnerable to sparse, overlapping data points in multiclass imbalanced datasets. This leads to issues such as the <italic>&#x2018;within-class imbalance&#x2019;</italic> and <italic>&#x2018;small disjunct&#x2019;</italic> problems, which have also been emphasized by [<xref ref-type="bibr" rid="ref-10">10</xref>,<xref ref-type="bibr" rid="ref-22">22</xref>,<xref ref-type="bibr" rid="ref-26">26</xref>]. When the minority classes have few samples and oversampling is performed within clusters, the majority class overwhelmingly dominates the dataset, leaving the remaining classes with only sparse instances. Furthermore, an approach like SMOTE, which generates synthetic instances by interpolating between existing minority-class samples, may not be effective on sparse datasets, as shown in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>: the minority-class samples are scattered and limited yet contain instances that overlap with other classes, so the synthetic samples may not represent realistic or plausible minority-class instances, resulting in a model that performs poorly on new, unseen data. Moreover, we have observed that resampling techniques such as ADASYN and KMeans-SMOTE do not consistently succeed in identifying enough samples that form clusters, and they often struggle to determine the minimum number of neighbors required for some datasets, which affects their efficiency. These challenges motivated our study to explore better strategies for dealing with sparse datasets.</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>Visualization of resampling on a sparse multiclass dataset: (<bold>a</bold>) Original dataset; (<bold>b</bold>) Noisy resampled dataset using SMOTE</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_63465-fig-2.tif"/>
</fig>
<p>Our method differs from the ECDNN approach proposed by [<xref ref-type="bibr" rid="ref-27">27</xref>], which determines a test data point&#x2019;s label based on the minimum displacement of the nearest class centroid of its <italic>k</italic>-neighbors after incorporating the test instance into the set. As illustrated in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>, a test example (yellow diamond) can be assigned to class B since it produces the lowest displacement of the class centroid compared to other classes. The approach is employed for undersampling to eliminate noisy data points when their labels are predicted differently. It is integrated with SMOTE to balance the dataset, forming a technique known as SMOTE-CDNN [<xref ref-type="bibr" rid="ref-23">23</xref>]. Although this approach successfully balances the dataset, in cases involving sparse multiclass imbalanced data where minority samples hold helpful information, altering their labels or removing them when their predicted class differs from their original class could unintentionally eliminate valuable instances, potentially affecting classification performance in real-world scenarios. Therefore, instead of discarding these noisy data points, our method employs a unique strategy that adjusts their positions by analyzing the distances between a data point and its <italic>k</italic>-neighbors. It displaces the position only if the data point belongs to the minority class within its neighborhood (i.e., if the majority of its neighbors belong to a different class). We hypothesize that this adjustment will preserve their class labels and characteristics while mitigating their influence as noise. This refinement is expected to enhance class separation and promote a more balanced representation, distinguishing our approach from existing methodologies.</p>
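<p>For intuition, the centroid-displacement rule underlying CDNN can be sketched as follows. This is a simplified, hypothetical illustration: it measures the displacement over each full class rather than over the <italic>k</italic>-nearest neighbors only, as the published ECDNN does, and the data are invented for the example.</p>

```python
# Simplified, hypothetical sketch of the centroid-displacement idea:
# assign a test point to the class whose centroid moves least when the
# point is tentatively added. The published ECDNN restricts this to the
# k-nearest neighbors; here each full class is used for clarity.
def centroid(points):
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def displacement(points, q):
    """Euclidean shift of the class centroid caused by adding q."""
    before = centroid(points)
    after = centroid(points + [q])
    return sum((a - b) ** 2 for a, b in zip(after, before)) ** 0.5

classes = {
    "A": [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
    "B": [[5.0, 5.0], [5.0, 6.0], [6.0, 5.0]],
}
q = [5.2, 5.4]  # test point (the "yellow diamond" in Fig. 3)
label = min(classes, key=lambda c: displacement(classes[c], q))
print(label)  # "B": its centroid is perturbed least by absorbing q
```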
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>Visual illustration of the CDNN algorithm [<xref ref-type="bibr" rid="ref-27">27</xref>]</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_63465-fig-3.tif"/>
</fig>
</sec>
<sec id="s3">
<label>3</label>
<title>Objectives</title>
<p>The primary objective of this study is to evaluate the effectiveness of displacing noisy data points towards the center of their respective classes, as depicted in <xref ref-type="fig" rid="fig-4">Fig. 4</xref> on the left, resulting in a more precise separation illustrated on the right before additional samples are generated to balance the dataset. This approach aims to yield a more accurate representation of the underlying patterns within the data. While our approach utilizes random oversampling as a straightforward technique to achieve optimal performance, using cleaner, centroid-aligned data points as a base will mitigate the common issue of overfitting associated with random oversampling.</p>
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>Visualization of class separation: (<bold>a</bold>) The original dataset with classes 1, 2, and 3, showing significant overlap among data points; (<bold>b</bold>) Improved class separation achieved through a displacement operation</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_63465-fig-4.tif"/>
</fig>
<p>Specifically, we seek to achieve the following objectives:
<list list-type="order">
<list-item>
<p>Create a resampling method that refines imbalanced datasets by shifting noisy data points towards their corresponding centroids, thus improving the dataset&#x2019;s quality.</p></list-item>
<list-item>
<p>Integrate random oversampling to achieve class balance using the cleaned data points, thereby reducing computational overhead while enhancing model performance and mitigating overfitting.</p></list-item>
<list-item>
<p>Conduct a comprehensive comparison of the proposed method with baseline resampling methods using multiple classifiers to evaluate the effectiveness and robustness of the approach.</p></list-item>
<list-item>
<p>Assess the performance of the resampling methods using various metrics to ensure balanced classification performance across imbalanced datasets, and apply the Friedman-Nemenyi non-parametric statistical tests to validate the results.</p>
</list-item>
</list></p>
</sec>
<sec id="s4">
<label>4</label>
<title>Method</title>
<p>This section outlines the principle of our approach to perform the <italic>displacement of the noisy data points</italic> (NDE in <xref ref-type="sec" rid="s4_1">Section 4.1</xref>) before performing <italic>random oversampling</italic> (NDESO in <xref ref-type="sec" rid="s4_2">Section 4.2</xref>).</p>
<sec id="s4_1">
<label>4.1</label>
<title>Neighbor-Based Displacement Enhancement (NDE)</title>
<p>The foundational version of our algorithm, Neighbor-based Displacement Enhancement (NDE), addresses noisy class data points by strategically repositioning them closer to their respective class centroids to enhance class separation. It operates as follows.</p>
<p>Given a set of data points <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:mi>X</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x003A;</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>n</mml:mtext></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:mi>C</mml:mi></mml:math></inline-formula>, where <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is an <italic>n</italic>-dimensional vector and <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is its corresponding class label. Suppose there are data points <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> associated with different class labels <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> that overlap each other, as depicted in <xref ref-type="fig" rid="fig-5">Fig. 5</xref>. To reposition the overlapping data points, the process follows these steps, with the corresponding notations provided in <xref ref-type="table" rid="table-1">Table 1</xref>:</p>
<p><list list-type="simple">
<list-item><label>1.</label><p>A pairwise distance matrix <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:mi>D</mml:mi><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, is computed for each pair of data points <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mi>X</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>X</mml:mi></mml:math></inline-formula>.</p></list-item>
<list-item><label>2.</label><p>Based on the distance matrix <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:mi>D</mml:mi></mml:math></inline-formula>, a set of indices <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:mi>I</mml:mi></mml:math></inline-formula> is obtained for each data point, containing the indexes of its <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:mi>k</mml:mi></mml:math></inline-formula>-nearest neighbors.</p></list-item>
<list-item><label>3.</label><p>Iterate over the dataset <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:mi>X</mml:mi></mml:math></inline-formula> to identify a set of displaceable data points <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:mi>M</mml:mi></mml:math></inline-formula>, where a data point <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is deemed a candidate for displacement if it satisfies the following condition:<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:mi>A</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x1D4A9;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4C0;</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>}</mml:mo></mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>B</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x1D4A9;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi>&#x1D4C0;</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mo 
stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2260;</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>}</mml:mo></mml:mrow><mml:mo>|</mml:mo></mml:mrow></mml:math></disp-formula>
<disp-formula id="ueqn-3"><mml:math id="mml-ueqn-3" display="block"><mml:mi>M</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:mi>X</mml:mi><mml:mo>&#x2223;</mml:mo><mml:mi>A</mml:mi><mml:mo>&#x003C;</mml:mo><mml:mi>B</mml:mi><mml:mo>}</mml:mo></mml:mrow></mml:math></disp-formula></p></list-item>
</list></p>
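<p>Steps 1&#x2013;3 above can be sketched in a few lines of Python. This is a simplified illustration using the notation of <xref ref-type="table" rid="table-1">Table 1</xref>, not the authors&#x2019; released code; it returns the index set of displaceable points M whose same-class neighbor count A is smaller than the different-class count B.</p>

```python
# Simplified, illustrative sketch of Steps 1-3 (pairwise distances,
# k-nearest neighbors, and the A < B displaceability test in Eq. (2)).
# Variable names mirror Table 1; this is not the authors' released code.
import math

def nde_candidates(X, c, k=3):
    """Return indices of displaceable points: those whose k nearest
    neighbors mostly belong to a different class (A < B)."""
    M = []
    for i, xi in enumerate(X):
        # Steps 1-2: distances from x_i to every other point, then the
        # indices of its k nearest neighbors.
        dists = sorted(
            (math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i
        )
        neighbors = [j for _, j in dists[:k]]
        # Step 3: same-class (A) vs. different-class (B) neighbor counts.
        A = sum(c[j] == c[i] for j in neighbors)
        B = k - A
        if A < B:
            M.append(i)
    return M

# Two tight clusters plus one class-0 point stranded inside cluster 1.
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5], [5.5, 5.5]]
c = [0, 0, 0, 1, 1, 1, 0]
print(nde_candidates(X, c))  # [6]: only the stranded point is displaceable
```

In NDE, the points in M are subsequently shifted toward their class centroids rather than relabeled or removed.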

<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Visual illustration of displace-able data point identification</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_63465-fig-5.tif"/>
</fig><table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Notations used for representing data point displacement modeling</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th align="center">Notations</th>
<th align="center">Description</th>
<th align="center">Notations</th>
<th align="center">Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><inline-formula id="ieqn-47"><mml:math id="mml-ieqn-47"><mml:mi>X</mml:mi></mml:math></inline-formula></td>
<td>Set of data points (dataset)</td>
<td><inline-formula id="ieqn-48"><mml:math id="mml-ieqn-48"><mml:mi>R</mml:mi></mml:math></inline-formula></td>
<td>Set of centroids</td>
</tr>
<tr>
<td><italic>C</italic></td>
<td>Set of the class labels</td>
<td><inline-formula id="ieqn-49"><mml:math id="mml-ieqn-49"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula></td>
<td>Centroid of class <inline-formula id="ieqn-50"><mml:math id="mml-ieqn-50"><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula></td>
</tr>
<tr>
<td><inline-formula id="ieqn-51"><mml:math id="mml-ieqn-51"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula></td>
<td>A data point in the dataset <inline-formula id="ieqn-52"><mml:math id="mml-ieqn-52"><mml:mi>X</mml:mi></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-53"><mml:math id="mml-ieqn-53"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula></td>
<td>Set of data points belonging to class <inline-formula id="ieqn-54"><mml:math id="mml-ieqn-54"><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula></td>
</tr>
<tr>
<td><inline-formula id="ieqn-55"><mml:math id="mml-ieqn-55"><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula></td>
<td>Class label of the data point <inline-formula id="ieqn-56"><mml:math id="mml-ieqn-56"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-57"><mml:math id="mml-ieqn-57"><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula></td>
<td>Euclidean distance (magnitude) between <inline-formula id="ieqn-58"><mml:math id="mml-ieqn-58"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-59"><mml:math id="mml-ieqn-59"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula></td>
</tr>
<tr>
<td><italic>D</italic></td>
<td>Pairwise distance matrix</td>
<td><inline-formula id="ieqn-60"><mml:math id="mml-ieqn-60"><mml:mrow><mml:mover><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">&#x2192;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula></td>
<td>Unit vector pointing from <inline-formula id="ieqn-61"><mml:math id="mml-ieqn-61"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> to <inline-formula id="ieqn-62"><mml:math id="mml-ieqn-62"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula></td>
</tr>
<tr>
<td><italic>M</italic></td>
<td>Set of displaceable data points</td>
<td><inline-formula id="ieqn-63"><mml:math id="mml-ieqn-63"><mml:msub><mml:mrow><mml:mi mathvariant="normal">&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula></td>
<td>Displacement distance of the data point <inline-formula id="ieqn-64"><mml:math id="mml-ieqn-64"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula></td>
</tr>
<tr>
<td><inline-formula id="ieqn-65"><mml:math id="mml-ieqn-65"><mml:msub><mml:mrow><mml:mi>&#x1D4A9;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula></td>
<td>Set of <italic>k</italic>-nearest neighbors of <inline-formula id="ieqn-66"><mml:math id="mml-ieqn-66"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-67"><mml:math id="mml-ieqn-67"><mml:msubsup><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula></td>
<td>The displaced version of the data point <inline-formula id="ieqn-68"><mml:math id="mml-ieqn-68"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The set of displaceable data points <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:mi>M</mml:mi></mml:math></inline-formula> consists of those points in the dataset <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:mi>X</mml:mi></mml:math></inline-formula> whose number of neighboring points from the same class (the count <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:mi>A</mml:mi></mml:math></inline-formula>) is smaller than the number belonging to different classes (the count <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:mi>B</mml:mi></mml:math></inline-formula>).
<list list-type="simple">
<list-item><label>4.</label><p>Compute the set of centroids <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:mi>R</mml:mi></mml:math></inline-formula>:<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:mi>R</mml:mi><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub><mml:mo>&#x003A;</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mo>|</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub><mml:mo>|</mml:mo></mml:mrow></mml:mfrac><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:munder><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> is the centroid of class <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-21"><mml:math 
id="mml-ieqn-21"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> is the set of data points belonging to class <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. In other words, <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:mi>R</mml:mi></mml:math></inline-formula> is the set of centroids for all classes, where each centroid <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> represents the center of the data points in class <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> defined as <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula>.</p></list-item>
</list>
<list list-type="simple">
<list-item><label>5.</label><p>Iterate through the set of displaceable data points <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:mi>M</mml:mi></mml:math></inline-formula> to perform the following:
<list list-type="simple">
<list-item><label>(a)</label><p>Get the centroid <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> for a displaceable data point <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>.</p></list-item>
<list-item><label>(b)</label><p>Compute the normalized direction vector from <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> to the centroid <inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula>.<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mtext>&#x00A0;</mml:mtext><mml:mrow><mml:mover><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">&#x2192;</mml:mo></mml:mover></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mfrac></mml:math></disp-formula></p></list-item>
</list>
<inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> represents the absolute distance (or magnitude) between a data point <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and the centroid of its cluster <inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula>, while <inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:mrow><mml:mover><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">&#x2192;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> represents the normalized direction vector (a unit length) from <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> to <inline-formula id="ieqn-37"><mml:math id="mml-ieqn-37"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula>.
<list list-type="simple">
<list-item><label>(c)</label><p>Compute the displacement distance <inline-formula id="ieqn-38"><mml:math id="mml-ieqn-38"><mml:msub><mml:mrow><mml:mi mathvariant="normal">&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> for the data point <inline-formula id="ieqn-39"><mml:math id="mml-ieqn-39"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>:<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:msub><mml:mrow><mml:mi mathvariant="normal">&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>k</mml:mi></mml:mfrac><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x1D4A9;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:munder><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></disp-formula><inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:msub><mml:mrow><mml:mi mathvariant="normal">&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> denotes the mean distance between the data point <inline-formula id="ieqn-41"><mml:math id="mml-ieqn-41"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and its neighboring points.</p></list-item>
</list>
<list list-type="simple">
<list-item><label>(d)</label><p>Displace <inline-formula id="ieqn-42"><mml:math id="mml-ieqn-42"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> closer to its centroid as <inline-formula id="ieqn-43"><mml:math id="mml-ieqn-43"><mml:msubsup><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>:<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:msubsup><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mover><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">&#x2192;</mml:mo></mml:mover></mml:mrow><mml:mo>&#x2217;</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="normal">&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></disp-formula><inline-formula id="ieqn-44"><mml:math id="mml-ieqn-44"><mml:msubsup><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> is the displaced version of <inline-formula id="ieqn-45"><mml:math id="mml-ieqn-45"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, adjusted to be closer to its centroid by a distance of <inline-formula id="ieqn-46"><mml:math id="mml-ieqn-46"><mml:msub><mml:mrow><mml:mi mathvariant="normal">&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>.</p></list-item>
</list></p></list-item>
</list></p>
<p>This NDE algorithm uses a default of <italic>k</italic> &#x003D; 5 neighbors and the Euclidean distance metric.</p>
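The displacement steps above, Eqs. (3) through (6), can be sketched end to end as follows. This is an illustrative sketch under the stated defaults (k = 5 neighbors, Euclidean distance); the helper names are ours, and it assumes the indices of displaceable points M have already been identified:

```python
import numpy as np

def displace(X, y, M, k=5):
    """Move each candidate point in M toward its class centroid.
    Distances are computed once on the original data."""
    X = np.array(X, dtype=float)
    y = np.asarray(y)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Eq. (3): per-class centroids
    R = {c: X[y == c].mean(axis=0) for c in np.unique(y)}
    for i in M:
        r = R[y[i]]
        S_i = np.linalg.norm(r - X[i])     # Eq. (4): distance to centroid
        S_v = (r - X[i]) / S_i             # Eq. (4): unit direction vector
        order = np.argsort(D[i])
        neighbors = [j for j in order if j != i][:k]
        phi = D[i, neighbors].mean()       # Eq. (5): mean neighbor distance
        X[i] = r - S_v * phi               # Eq. (6): displaced point
    return X
```

The displaced point lands on the segment between the original point and the centroid, at a distance of phi from the centroid, since S_v has unit length.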
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Neighbor-Based Displacement Enhanced Synthetic Oversampling (NDESO)</title>
<p>To address the class imbalance, random oversampling is applied to the dataset after removing noisy data points. This improved version of the NDE algorithm is called Neighbor-based Displacement Enhanced Synthetic Oversampling (NDESO). This oversampling process utilizes the <italic>RandomOverSampler</italic> method from the library <italic>imblearn</italic><xref ref-type="fn" rid="fn-1"><sup>1</sup></xref><fn id="fn-1"><label>1</label><p><ext-link ext-link-type="uri" xlink:href="https://imbalanced-learn.org">https://imbalanced-learn.org</ext-link></p></fn>. In this technique, new data points are generated by randomly duplicating samples from the minority class. Suppose <inline-formula id="ieqn-69"><mml:math id="mml-ieqn-69"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> represents the set of data points from the minority class. The oversampling process creates additional data points by randomly selecting from <inline-formula id="ieqn-70"><mml:math id="mml-ieqn-70"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> until the number of minority class points is equal to that of the majority class, denoted as <inline-formula id="ieqn-71"><mml:math id="mml-ieqn-71"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>j</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, as given in the following formula:
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mi>e</mml:mi><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mtext>random</mml:mtext></mml:mrow><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mrow><mml:mtext>sample&#xA0;</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mrow><mml:mtext>minority</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mrow><mml:mo>|</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mrow><mml:mtext>majority</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>|</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mo>|</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mrow><mml:mtext>minority</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>|</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>
<disp-formula id="ueqn-9"><mml:math id="mml-ueqn-9" display="block"><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mrow><mml:mtext>minority</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mrow><mml:mtext>minority</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>&#x222A;</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mi>e</mml:mi><mml:mi>w</mml:mi></mml:mrow></mml:msub></mml:math></disp-formula></p>
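The oversampling step in Eq. (7) can be sketched without the library as follows; this mirrors what imblearn's RandomOverSampler does under its default strategy, but the helper below is our own illustrative version, not the library code:

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate random minority-class samples until every class reaches
    the majority class size, following Eq. (7)."""
    rng = random.Random(seed)
    counts = Counter(y)
    n_major = max(counts.values())
    X_out, y_out = list(X), list(y)
    for c, n in counts.items():
        pool = [x for x, label in zip(X, y) if label == c]
        # Eq. (7): draw |X_majority| - |X_minority| random duplicates
        X_new = [rng.choice(pool) for _ in range(n_major - n)]
        X_out.extend(X_new)
        y_out.extend([c] * len(X_new))
    return X_out, y_out
```

In the paper's pipeline this step runs on the displaced (cleaned) data, so the duplicates are drawn from better-positioned minority samples.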
<p>By cleaning the dataset in advance, this approach seeks to improve the quality of the resampled data points, refining the oversampling process with minimal additional overhead and yielding a more effective class balance. The complete procedure of our proposed approach is given in Algorithm 1.</p>
<fig id="fig-14">
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_63465-fig-14.tif"/>
</fig>
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Performance Metrics</title>
<sec id="s4_3_1">
<label>4.3.1</label>
<title>G-Mean</title>
<p>We calculate precision, recall, F1-score, and G-mean as evaluation metrics during the experiments. However, due to space constraints, we report only G-mean in this paper. The complete testing output is available in the GitHub repository linked at the end of this manuscript.</p>
<p>The geometric mean (G-mean) is a widely recognized performance metric for imbalanced classification tasks, as it provides a balanced assessment by accounting for performance on both the positive and negative classes. Mathematically, G-mean is defined as follows [<xref ref-type="bibr" rid="ref-28">28</xref>]:
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:mrow><mml:mtext>G</mml:mtext></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mtext>mean</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:msqrt><mml:mi>S</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>v</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>S</mml:mi><mml:mi>p</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>f</mml:mi><mml:mi>i</mml:mi><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi></mml:msqrt></mml:math></disp-formula>where:
<list list-type="bullet">
<list-item>
<p><inline-formula id="ieqn-118"><mml:math id="mml-ieqn-118"><mml:mi>S</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>v</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>Recall</mml:mtext></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>: The ratio of the actual positive instances that are correctly classified by the classifier. <inline-formula id="ieqn-119"><mml:math id="mml-ieqn-119"><mml:mi>S</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>v</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi></mml:math></inline-formula> is often denoted as <inline-formula id="ieqn-120"><mml:math id="mml-ieqn-120"><mml:mi>T</mml:mi><mml:mi>r</mml:mi><mml:mi>u</mml:mi><mml:mi>e</mml:mi><mml:mi>P</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>v</mml:mi><mml:mi>e</mml:mi><mml:mi>R</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mi>R</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>.<disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mi>R</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>r</mml:mi><mml:mi>u</mml:mi><mml:mi>e</mml:mi><mml:mspace width="thinmathspace" /><mml:mi>P</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>v</mml:mi><mml:mi>e</mml:mi><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>r</mml:mi><mml:mi>u</mml:mi><mml:mi>e</mml:mi><mml:mspace width="thinmathspace" 
/><mml:mi>P</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>v</mml:mi><mml:mi>e</mml:mi><mml:mi>s</mml:mi><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mspace width="thinmathspace" /><mml:mi>N</mml:mi><mml:mi>e</mml:mi><mml:mi>g</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>v</mml:mi><mml:mi>e</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:mfrac></mml:math></disp-formula></p></list-item>
</list>
<list list-type="bullet">
<list-item>
<p><inline-formula id="ieqn-121"><mml:math id="mml-ieqn-121"><mml:mi>S</mml:mi><mml:mi>p</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>f</mml:mi><mml:mi>i</mml:mi><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi></mml:math></inline-formula>: The ratio of the actual negative instances that are correctly classified by the classifier. Specificity is often denoted as <inline-formula id="ieqn-122"><mml:math id="mml-ieqn-122"><mml:mi>T</mml:mi><mml:mi>r</mml:mi><mml:mi>u</mml:mi><mml:mi>e</mml:mi><mml:mspace width="thinmathspace" /><mml:mi>N</mml:mi><mml:mi>e</mml:mi><mml:mi>g</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>v</mml:mi><mml:mi>e</mml:mi><mml:mspace width="thinmathspace" /><mml:mi>R</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>T</mml:mi><mml:mi>N</mml:mi><mml:mi>R</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>.<disp-formula id="eqn-10"><label>(10)</label><mml:math id="mml-eqn-10" display="block"><mml:mi>T</mml:mi><mml:mi>N</mml:mi><mml:mi>R</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>r</mml:mi><mml:mi>u</mml:mi><mml:mi>e</mml:mi><mml:mspace width="thinmathspace" /><mml:mi>N</mml:mi><mml:mi>e</mml:mi><mml:mi>g</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>v</mml:mi><mml:mi>e</mml:mi><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>r</mml:mi><mml:mi>u</mml:mi><mml:mi>e</mml:mi><mml:mspace width="thinmathspace" /><mml:mi>N</mml:mi><mml:mi>e</mml:mi><mml:mi>g</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>v</mml:mi><mml:mi>e</mml:mi><mml:mi>s</mml:mi><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mspace width="thinmathspace" 
/><mml:mi>P</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>v</mml:mi><mml:mi>e</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:mfrac></mml:math></disp-formula></p></list-item>
</list></p>
<p>The G-mean score typically ranges from 0 to 1. A score of 1 indicates a perfect balance between <inline-formula id="ieqn-123"><mml:math id="mml-ieqn-123"><mml:mi>S</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>v</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi><mml:mspace width="thinmathspace" /><mml:mrow><mml:mo>(</mml:mo><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mi>R</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-124"><mml:math id="mml-ieqn-124"><mml:mi>S</mml:mi><mml:mi>p</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>f</mml:mi><mml:mi>i</mml:mi><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi><mml:mspace width="thinmathspace" /><mml:mo stretchy="false">(</mml:mo><mml:mi>T</mml:mi><mml:mi>N</mml:mi><mml:mi>R</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, meaning the model performs equally well on both positive and negative classes. A score closer to 0 indicates poorer performance in achieving this balance. In practice, the G-mean value varies for imbalanced classification tasks. However, values closer to 1 are targeted to indicate a better trade-off between <inline-formula id="ieqn-125"><mml:math id="mml-ieqn-125"><mml:mi>S</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>v</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-126"><mml:math id="mml-ieqn-126"><mml:mi>S</mml:mi><mml:mi>p</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>f</mml:mi><mml:mi>i</mml:mi><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi></mml:math></inline-formula>.</p>
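For reference, a minimal binary G-mean computation following Eqs. (8) through (10) might look like the sketch below; it assumes 0/1 labels, and the multiclass setting used in the experiments generalizes this across classes:

```python
def g_mean(y_true, y_pred):
    """Binary G-mean (Eq. 8) from sensitivity (TPR, Eq. 9) and
    specificity (TNR, Eq. 10); labels are assumed to be 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # TPR
    specificity = tn / (tn + fp)  # TNR
    return (sensitivity * specificity) ** 0.5
```

Because it is a geometric mean, the score collapses to 0 if the classifier misses either class entirely, which is exactly why it suits imbalanced data.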
</sec>
<sec id="s4_3_2">
<label>4.3.2</label>
<title>Friedman and Post-Hoc Nemenyi Tests</title>
<p>In this study, we compared our resampling method with several baseline techniques. To evaluate performance differences while guarding against randomness in the experimental analysis, we used statistical tests, in particular the Friedman and post-hoc Nemenyi tests, which are recommended and widely used for identifying statistically significant differences between approaches.</p>
<p>The Friedman and Nemenyi tests are nonparametric statistical methods used to assess the performance differences among multiple resampling methods in classification experiments. These tests do not assume any specific data distribution. The <italic>null</italic> hypothesis states that all methods perform similarly across the tested datasets. Rejecting this hypothesis (at a significance level of &#x03B1; = 0.05) indicates significant performance differences among the resampling techniques [<xref ref-type="bibr" rid="ref-29">29</xref>].</p>
<p>For each dataset <inline-formula id="ieqn-127"><mml:math id="mml-ieqn-127"><mml:mi>i</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mi>D</mml:mi></mml:math></inline-formula>, the resampling methods are ranked from best to worst as 1 to <italic>k</italic>, where <italic>k</italic> denotes the number of resampling methods, including our proposed approach. The mean rank <inline-formula id="ieqn-128"><mml:math id="mml-ieqn-128"><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> for the <italic>j</italic>-th resampling method is defined by:
<disp-formula id="eqn-11"><label>(11)</label><mml:math id="mml-eqn-11" display="block"><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>N</mml:mi></mml:mfrac><mml:munder><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mi>D</mml:mi></mml:mrow></mml:munder><mml:msubsup><mml:mi>r</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msubsup></mml:math></disp-formula>where <inline-formula id="ieqn-129"><mml:math id="mml-ieqn-129"><mml:msubsup><mml:mi>r</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> represents the rank of the <italic>j</italic>-th resampling method on the <italic>i</italic>-th dataset, and <inline-formula id="ieqn-130"><mml:math id="mml-ieqn-130"><mml:mi>N</mml:mi></mml:math></inline-formula> is the total number of datasets <inline-formula id="ieqn-131"><mml:math id="mml-ieqn-131"><mml:mi>D</mml:mi></mml:math></inline-formula>. The Friedman test is then defined as:
<disp-formula id="eqn-12"><label>(12)</label><mml:math id="mml-eqn-12" display="block"><mml:msubsup><mml:mi>&#x03C7;</mml:mi><mml:mrow><mml:mi>F</mml:mi><mml:mi>r</mml:mi><mml:mi>i</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>12</mml:mn><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x22C5;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:mrow><mml:mo>[</mml:mo><mml:munderover><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:munderover><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>R</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x22C5;</mml:mo><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow><mml:mn>4</mml:mn></mml:mfrac><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula></p>
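<p>As a numerical check of Eqs. (11) and (12), the mean ranks and the Friedman statistic can be computed directly and cross-validated against SciPy's implementation; the score matrix below is illustrative, not taken from our experiments:</p>

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Illustrative G-mean scores: rows = datasets (N), columns = resampling methods (k).
scores = np.array([
    [0.96, 0.68, 0.60],
    [0.97, 0.92, 0.94],
    [0.88, 0.57, 0.55],
    [0.99, 0.98, 0.97],
])
N, k = scores.shape

# Rank methods per dataset (rank 1 = best, so rank the negated scores); Eq. (11).
ranks = rankdata(-scores, axis=1)
mean_ranks = ranks.mean(axis=0)

# Friedman statistic per Eq. (12).
chi2 = 12 * N / (k * (k + 1)) * (np.sum(mean_ranks ** 2) - k * (k + 1) ** 2 / 4)

# SciPy's implementation performs the same test (with a tie correction,
# which is a no-op here since no scores are tied within a row).
stat, p = friedmanchisquare(*scores.T)
```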
<p>If the Friedman test rejects the null hypothesis, we proceed with the post hoc Nemenyi test for pairwise comparisons among the resampling methods based on a critical difference (CD), which is defined as:
<disp-formula id="eqn-13"><label>(13)</label><mml:math id="mml-eqn-13" display="block"><mml:mi>C</mml:mi><mml:mi>D</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>q</mml:mi><mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:mrow></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:msqrt><mml:mfrac><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x22C5;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mn>6</mml:mn><mml:mo>&#x22C5;</mml:mo><mml:mi>N</mml:mi></mml:mrow></mml:mfrac></mml:msqrt></mml:math></disp-formula>where <inline-formula id="ieqn-132"><mml:math id="mml-ieqn-132"><mml:msub><mml:mi>q</mml:mi><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x03B1;</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> is the critical value from the Studentized range distribution for a given significance level <inline-formula id="ieqn-133"><mml:math id="mml-ieqn-133"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula>. Resampling methods whose mean ranks differ by at least the <italic>CD</italic> value are considered significantly different. Conversely, if the difference in mean ranks between two resampling methods is smaller than the <italic>CD</italic> value, the performance difference is not statistically significant; in other words, the methods are statistically comparable.</p>
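<p>Eq. (13) can be sketched as follows. Here <italic>q<sub>&#x03B1;</sub></italic> is taken as the Studentized range critical value divided by the square root of 2 (the common Dem&#x0161;ar convention, an assumption on our part), and a large degrees-of-freedom value approximates the infinite-df table entry:</p>

```python
import numpy as np
from scipy.stats import studentized_range

def critical_difference(k, N, alpha=0.05):
    """Nemenyi critical difference per Eq. (13).

    q_alpha is the Studentized range critical value divided by sqrt(2)
    (Demsar's convention); df=1000 approximates the infinite-df value.
    """
    q_alpha = studentized_range.ppf(1 - alpha, k, 1000) / np.sqrt(2)
    return q_alpha * np.sqrt(k * (k + 1) / (6 * N))

# Scale of this study: 15 resampling methods over 20 datasets.
cd = critical_difference(k=15, N=20)
```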
</sec>
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Experiment</title>
<sec id="s5_1">
<label>5.1</label>
<title>Datasets and Environment</title>
<sec id="s5_1_1">
<label>5.1.1</label>
<title>Synthetic Datasets</title>
<p>We utilize a synthetic dataset to provide a clear visualization of the data before and after resampling, using our algorithm and comparing it with others. This dataset consists of three distinct classes (class 0, class 1, class 2) generated from normal distributions with class sizes in a 50:500:100 ratio. To introduce overlap between the classes, random noise drawn from a normal distribution and scaled by a factor of 0.75 was added to the data points.</p>
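<p>A minimal sketch of this generation procedure (the cluster centers and unit base scale below are illustrative assumptions; the 50:500:100 class sizes and the 0.75 noise factor follow the text):</p>

```python
import numpy as np

rng = np.random.default_rng(42)

# Class sizes in the 50:500:100 ratio described above.
sizes = {0: 50, 1: 500, 2: 100}
# Illustrative cluster centers (assumed; not specified in the text).
centers = {0: (-2.0, 0.0), 1: (0.0, 2.0), 2: (2.0, 0.0)}

X_parts, y_parts = [], []
for label, n in sizes.items():
    # Base points drawn from a normal distribution around each class center.
    pts = rng.normal(loc=centers[label], scale=1.0, size=(n, 2))
    # Random noise scaled by 0.75 to introduce overlap between classes.
    pts += 0.75 * rng.normal(size=(n, 2))
    X_parts.append(pts)
    y_parts.append(np.full(n, label))

X = np.vstack(X_parts)
y = np.concatenate(y_parts)
```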
</sec>
<sec id="s5_1_2">
<label>5.1.2</label>
<title>Real-World Datasets</title>
<p>The real-world datasets are sourced primarily from the Knowledge Extraction based on Evolutionary Learning (KEEL) repository<xref ref-type="fn" rid="fn-2"><sup>2</sup></xref><fn id="fn-2"><label>2</label><p><ext-link ext-link-type="uri" xlink:href="https://sci2s.ugr.es/keel/imbalanced.php">https://sci2s.ugr.es/keel/imbalanced.php</ext-link></p></fn> and OpenML<xref ref-type="fn" rid="fn-3"><sup>3</sup></xref><fn id="fn-3"><label>3</label><p><ext-link ext-link-type="uri" xlink:href="https://www.openml.org/search?type=data">https://www.openml.org/search?type=data</ext-link></p></fn>. Details of the real-world multiclass datasets used in the experiments are presented in <xref ref-type="table" rid="table-2">Table 2</xref>, with the number of classes ranging from 3 to 11 and the imbalance ratio ranging from 1.0 to 853.0.</p>
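<p>For reference, the imbalance ratio reported in Table 2 is the ratio of the largest class size to the smallest; e.g., Shuttle gives 1706/2 = 853. A minimal sketch using class counts taken from the table:</p>

```python
def imbalance_ratio(class_counts):
    """Ratio of the largest to the smallest class size."""
    return max(class_counts) / min(class_counts)

# Class sizes from Table 2 (majority count followed by minority counts).
shuttle = [1706, 338, 123, 6, 2]
ecoli = [139, 77, 52, 35, 20, 5, 4, 2, 2]
thyroid = [666, 37, 17]
```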
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Datasets description</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th align="center">Dataset</th>
<th align="center">Description</th>
<th align="center">Features</th>
<th align="center">Instances</th>
<th align="center">Classes</th>
<th align="center">Majors</th>
<th align="center">Minors</th>
<th align="center">Imbalance ratio</th>
</tr>
</thead>
<tbody>
<tr>
<td>Autos<inline-formula id="ieqn-134"><mml:math id="mml-ieqn-134"><mml:mrow><mml:mi mathvariant="normal">&#x2663;</mml:mi></mml:mrow></mml:math></inline-formula></td>
<td>Automobile</td>
<td>25</td>
<td>159</td>
<td>6</td>
<td>48</td>
<td>[46, 29, 20, 13, 3]</td>
<td>16.00</td>
</tr>
<tr>
<td>Balance</td>
<td>Psychological experimental</td>
<td>4</td>
<td>625</td>
<td>3</td>
<td>288</td>
<td>[288, 49]</td>
<td>5.88</td>
</tr>
<tr>
<td>Contraceptive</td>
<td>Contraceptive method choice</td>
<td>9</td>
<td>1473</td>
<td>3</td>
<td>629</td>
<td>[511, 333]</td>
<td>1.89</td>
</tr>
<tr>
<td>Dermatology</td>
<td>Dermatology patients</td>
<td>34</td>
<td>366</td>
<td>6</td>
<td>112</td>
<td>[72, 61, 52, 49, 20]</td>
<td>5.60</td>
</tr>
<tr>
<td>Ecoli<inline-formula id="ieqn-135"><mml:math id="mml-ieqn-135"><mml:mrow><mml:mi mathvariant="normal">&#x2663;</mml:mi></mml:mrow></mml:math></inline-formula></td>
<td>Protein localization</td>
<td>7</td>
<td>336</td>
<td>9</td>
<td>139</td>
<td>[77, 52, 35, 20, 5, 4, 2, 2]</td>
<td>69.50</td>
</tr>
<tr>
<td>Glass<inline-formula id="ieqn-136"><mml:math id="mml-ieqn-136"><mml:mrow><mml:mi mathvariant="normal">&#x2663;</mml:mi></mml:mrow></mml:math></inline-formula></td>
<td>Glass chemical properties</td>
<td>9</td>
<td>214</td>
<td>6</td>
<td>76</td>
<td>[70, 29, 17, 13, 9]</td>
<td>8.44</td>
</tr>
<tr>
<td>Hayes-Roth</td>
<td>People characteristics</td>
<td>4</td>
<td>132</td>
<td>3</td>
<td>51</td>
<td>[51, 30]</td>
<td>1.70</td>
</tr>
<tr>
<td>Lymphography<inline-formula id="ieqn-137"><mml:math id="mml-ieqn-137"><mml:mrow><mml:mi mathvariant="normal">&#x2663;</mml:mi></mml:mrow></mml:math></inline-formula></td>
<td>Patient&#x2019;s radiological examination</td>
<td>18</td>
<td>148</td>
<td>4</td>
<td>81</td>
<td>[61, 4, 2]</td>
<td>40.50</td>
</tr>
<tr>
<td>New-Thyroid</td>
<td>Thyroid diseases detection</td>
<td>5</td>
<td>215</td>
<td>3</td>
<td>150</td>
<td>[35, 30]</td>
<td>5.00</td>
</tr>
<tr>
<td>Pageblocks<inline-formula id="ieqn-138"><mml:math id="mml-ieqn-138"><mml:mrow><mml:mi mathvariant="normal">&#x2663;</mml:mi></mml:mrow></mml:math></inline-formula></td>
<td>Document page layout blocks</td>
<td>10</td>
<td>548</td>
<td>5</td>
<td>492</td>
<td>[33, 12, 8, 3]</td>
<td>164.00</td>
</tr>
<tr>
<td>Penbased</td>
<td>Pen-Based recognition of handwritten digits</td>
<td>16</td>
<td>1100</td>
<td>10</td>
<td>115</td>
<td>[115, 114, 114, 114, 106, 106, 106, 105, 105]</td>
<td>1.10</td>
</tr>
<tr>
<td>Segment</td>
<td>Image segmentation</td>
<td>18</td>
<td>2310</td>
<td>7</td>
<td>330</td>
<td>[330, 330, 330, 330, 330, 330]</td>
<td>1.00</td>
</tr>
<tr>
<td>Shuttle<inline-formula id="ieqn-139"><mml:math id="mml-ieqn-139"><mml:mrow><mml:mi mathvariant="normal">&#x2663;</mml:mi></mml:mrow></mml:math></inline-formula></td>
<td>Space shuttle dataset of the Statlog project</td>
<td>9</td>
<td>2175</td>
<td>5</td>
<td>1706</td>
<td>[338, 123, 6, 2]</td>
<td>853.00</td>
</tr>
<tr>
<td>Svmguide2</td>
<td>Benchmark dataset</td>
<td>20</td>
<td>391</td>
<td>3</td>
<td>221</td>
<td>[117, 53]</td>
<td>4.17</td>
</tr>
<tr>
<td>Svmguide4</td>
<td>Benchmark dataset</td>
<td>10</td>
<td>300</td>
<td>6</td>
<td>56</td>
<td>[56, 53, 47, 44, 44]</td>
<td>1.27</td>
</tr>
<tr>
<td>Thyroid<inline-formula id="ieqn-140"><mml:math id="mml-ieqn-140"><mml:mrow><mml:mi mathvariant="normal">&#x2663;</mml:mi></mml:mrow></mml:math></inline-formula></td>
<td>Thyroid diseases detection</td>
<td>21</td>
<td>720</td>
<td>3</td>
<td>666</td>
<td>[37, 17]</td>
<td>39.18</td>
</tr>
<tr>
<td>Vehicle</td>
<td>Vehicle object detection</td>
<td>18</td>
<td>846</td>
<td>4</td>
<td>218</td>
<td>[217, 212, 199]</td>
<td>1.10</td>
</tr>
<tr>
<td>Vowel</td>
<td>Vowel recognition of British English</td>
<td>10</td>
<td>528</td>
<td>11</td>
<td>48</td>
<td>[48, 48, 48, 48, 48, 48, 48, 48, 48, 48]</td>
<td>1.00</td>
</tr>
<tr>
<td>Wine</td>
<td>Chemical characteristics of wines</td>
<td>13</td>
<td>178</td>
<td>3</td>
<td>71</td>
<td>[59, 48]</td>
<td>1.48</td>
</tr>
<tr>
<td>Yeast<inline-formula id="ieqn-141"><mml:math id="mml-ieqn-141"><mml:mrow><mml:mi mathvariant="normal">&#x2663;</mml:mi></mml:mrow></mml:math></inline-formula></td>
<td>Protein localization sites in yeast cells</td>
<td>8</td>
<td>1484</td>
<td>10</td>
<td>463</td>
<td>[429, 244, 163, 51, 44, 35, 30, 20, 5]</td>
<td>92.60</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="table-2fn1" fn-type="other">
<p>Note: <inline-formula id="ieqn-142"><mml:math id="mml-ieqn-142"><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mi mathvariant="normal">&#x2663;</mml:mi></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> indicates datasets used in NDE testing.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</sec>
<sec id="s5_1_3">
<label>5.1.3</label>
<title>System Environment and Libraries</title>
<p>Throughout the experiments, Python was used on a virtualized Linux server hosted on hardware featuring 32 GB of RAM, a 16-core Intel i9 processor, and a 25 GB SSD. We evaluated our proposed method against several baseline resampling techniques across various classifiers. The samplers and classifiers were implemented using the scikit-learn library<xref ref-type="fn" rid="fn-4"><sup>4</sup></xref><fn id="fn-4"><label>4</label><p><ext-link ext-link-type="uri" xlink:href="https://scikit-learn.org">https://scikit-learn.org</ext-link></p></fn>, with some utilizing the imbalanced-learn library<xref ref-type="fn" rid="fn-5"><sup>5</sup></xref><fn id="fn-5"><label>5</label><p><ext-link ext-link-type="uri" xlink:href="https://imbalanced-learn.org">https://imbalanced-learn.org</ext-link></p></fn>. Those resamplers are detailed in <xref ref-type="sec" rid="s2">Section 2</xref>, while the classifiers include Support Vector Machine (SVC), <italic>k</italic>-Nearest Neighbors (<italic>k</italic>-NN), Decision Tree, Random Forest, Multi-layer Perceptron (MLP), Easy Ensemble, RUSBoost, Balanced Bagging, and Balanced Random Forest classifiers, each offering a distinct learning paradigm and handling imbalanced datasets differently. SVC and <italic>k</italic>-NN are instance-based methods sensitive to the data distribution, making them suitable for assessing changes in class separation. Decision Tree and Random Forest are tree-based models that excel at capturing non-linear relationships, with Random Forest adding ensemble learning capabilities. MLP, a neural network model, tests the performance of algorithms requiring balanced data for gradient-based optimization. Easy Ensemble, RUSBoost, Balanced Bagging, and Balanced Random Forest, as ensemble techniques designed for imbalanced data, help evaluate the ability of our resampling technique to mitigate class imbalance across varying classifier characteristics.</p>
</sec>
</sec>
<sec id="s5_2">
<label>5.2</label>
<title>Testing Procedure</title>
<p>The dataset is initially processed using a resampling method to balance class distributions, a crucial step given the sparsity and class overlap in the data. This is performed before training the classifiers to address class imbalance, with the goal of improving model performance. After resampling, the dataset is partitioned into 80% training and 20% testing subsets using stratified <italic>k</italic>-fold cross-validation. The number of cross-validation folds, <inline-formula id="ieqn-143"><mml:math id="mml-ieqn-143"><mml:msub><mml:mi>n</mml:mi><mml:mrow><mml:mrow><mml:mtext>splits</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula>, is computed as:
<disp-formula id="eqn-14"><label>(14)</label><mml:math id="mml-eqn-14" display="block"><mml:msub><mml:mi>n</mml:mi><mml:mrow><mml:mrow><mml:mtext>splits</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo movablelimits="true" form="prefix">min</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>5</mml:mn><mml:mo>,</mml:mo><mml:mo movablelimits="true" form="prefix">min</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula></p>
<p>The <inline-formula id="ieqn-144"><mml:math id="mml-ieqn-144"><mml:mo movablelimits="true" form="prefix">min</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> term denotes the minimum class size in the training data. The classifier is trained on the training portion and evaluated on the test subset. To assess performance, the G-mean metric is averaged across folds and reported in this study.</p>
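<p>Eq. (14) and the fold-level metric can be sketched as follows; the G-mean here is taken as the geometric mean of per-class recalls, its usual multiclass definition:</p>

```python
from collections import Counter
from math import prod

def n_splits(y_train, cap=5):
    """Eq. (14): at most `cap` folds, bounded by the smallest class size."""
    return min(cap, min(Counter(y_train).values()))

def g_mean(y_true, y_pred):
    """Geometric mean of per-class recalls over the classes in y_true."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(1 for i in idx if y_pred[i] == c) / len(idx))
    return prod(recalls) ** (1 / len(recalls))
```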
</sec>
<sec id="s5_3">
<label>5.3</label>
<title>Results</title>
<p>We conducted a series of tests to assess the effectiveness of the proposed methods. These tests were divided into two main groups as follows.</p>
<sec id="s5_3_1">
<label>5.3.1</label>
<title>NDE Evaluation</title>
<p>The first group focused on evaluating the performance of our base algorithm (NDE) using either a single synthetic dataset or eight real-world datasets (indicated by <inline-formula id="ieqn-145"><mml:math id="mml-ieqn-145"><mml:mrow><mml:mi mathvariant="normal">&#x2663;</mml:mi></mml:mrow></mml:math></inline-formula> in <xref ref-type="table" rid="table-2">Table 2</xref>), along with three classifiers, to illustrate the results visually. This group includes tests evaluating the effects of varying <italic>k</italic>-neighbors, varying distance metrics, resampled data distributions, and algorithm performance when paired with another sampler for oversampling.
<list list-type="simple">
<list-item><label>1.</label><p>Varying k-neighbors</p>
</list-item>
</list></p>
<p>In this experiment, we evaluated our algorithm on the top 8 datasets with the highest imbalance ratios among the 20 datasets. We utilized three classifiers&#x2014;DecisionTree, <italic>k</italic>-NN, and Balanced Bagging&#x2014;while varying the <italic>k</italic> parameter from 2 to 25. The results presented in <xref ref-type="fig" rid="fig-6">Fig. 6</xref> demonstrate that our algorithm consistently achieves excellent performance in most cases. Notably, it maintains a G-mean value exceeding 0.90 for <italic>k</italic> &#x2265; 5. These findings underscore the robustness and effectiveness of our approach across varying <italic>k</italic> values, highlighting its suitability for diverse scenarios.</p>
<fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>G-mean scores of our NDE algorithm evaluated across three classifiers for various <italic>k</italic>-nearest neighbors (2&#x2013;25) and datasets: (<bold>a</bold>) Autos; (<bold>b</bold>) Ecoli; (<bold>c</bold>) Glass; (<bold>d</bold>) Lymphography; (<bold>e</bold>) Pageblocks; (<bold>f</bold>) Shuttle; (<bold>g</bold>) Thyroid; (<bold>h</bold>) Yeast</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_63465-fig-6.tif"/>
</fig>
<p><list list-type="simple">
<list-item><label>2.</label><p>Varying distance metric selection</p></list-item>
</list></p>
<p>To assess the robustness of our algorithm, we evaluated its performance across commonly used distance metrics, including <italic>Euclidean</italic>, <italic>Cityblock</italic>, <italic>Minkowski</italic>, <italic>Cosine</italic>, and <italic>Hamming</italic>. While numerous other metrics could be tested, these were selected due to their prevalence in data processing tasks. For this test, the same 8 datasets and 3 classifiers were utilized with the default parameters of our algorithm. As shown in <xref ref-type="fig" rid="fig-7">Fig. 7</xref>, the findings indicate that our algorithm consistently achieves high G-mean values across various datasets and distance metrics. While a slight performance fluctuation is observed with the Hamming distance on the <italic>Autos</italic> dataset, the overall stability across metrics underscores the robustness and versatility of our approach in adapting to different metric selections.</p>
<fig id="fig-7">
<label>Figure 7</label>
<caption>
<title>G-mean scores of our NDE algorithm evaluated across three classifiers for various distance metrics and datasets: (<bold>a</bold>) Autos; (<bold>b</bold>) Ecoli; (<bold>c</bold>) Glass; (<bold>d</bold>) Lymphography; (<bold>e</bold>) Pageblocks; (<bold>f</bold>) Shuttle; (<bold>g</bold>) Thyroid; (<bold>h</bold>) Yeast</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_63465-fig-7.tif"/>
</fig>
<p><list list-type="simple">
<list-item><label>3.</label><p>Resampled data distribution</p></list-item>
</list></p>
<p>In this test, we compared the original dataset distribution with the results after applying our algorithm and baseline samplers, including ROS, RUS, SMOTE, ENN, NearMiss, TomekLinks, and ECDNN. Advanced variants such as SVM-SMOTE and SMOTE-Tomek were excluded to focus on the foundational methods. For this test, the synthetic dataset was used with the default parameters of our algorithm. As shown in <xref ref-type="fig" rid="fig-8">Fig. 8</xref>, our algorithm best separates data points into isolated clusters, demonstrating its strong performance. While ENN and ECDNN also produce relatively good distributions, they leave more overlapping points than our method. The other samplers result in significant overlap, introducing more noise that can negatively affect classification performance. These results confirm our method&#x2019;s superiority in producing well-separated and cleaner resampled datasets.</p>
<fig id="fig-8">
<label>Figure 8</label>
<caption>
<title>Scatter plots comparing our NDE algorithm with other methods, highlighting its clear class separation with minimal overlap, unlike others that exhibit significant overlap. The distributions are presented for (<bold>a</bold>) The original dataset and those resampled using: (<bold>b</bold>) NDE; (<bold>c</bold>) ROS; (<bold>d</bold>) RUS; (<bold>e</bold>) SMOTE; (<bold>f</bold>) ENN; (<bold>g</bold>) ADASYN; (<bold>h</bold>) NearMiss; (<bold>i</bold>) TomekLinks; (<bold>j</bold>) ECDNN</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_63465-fig-8.tif"/>
</fig>
<p><list list-type="simple">
<list-item><label>4.</label><p>Algorithm in combination with another sampler</p></list-item>
</list></p>
<p>To determine the optimal combination of our algorithm with other samplers, we evaluated the performance of pairing our method with ROS, RUS, SMOTE, ENN, NearMiss, and TomekLinks, excluding ECDNN due to its similar displacement-based approach to ours. The goal was to identify the best sampler for oversampling the data after displacement or removing noisy points. As depicted in <xref ref-type="fig" rid="fig-9">Fig. 9</xref>, our NDE algorithm pairs effectively with all methods except RUS and NearMiss. As further illustrated in <xref ref-type="fig" rid="fig-8">Fig. 8</xref>, these two methods tend to discard a significant number of data points, which does not align with our objective of applying oversampling after displacing the noisy data points. In this study, we propose combining NDE with ROS, as it is the most straightforward approach and delivers excellent results, which we report in the second group of tests.</p>
<fig id="fig-9">
<label>Figure 9</label>
<caption>
<title>G-mean scores of our NDE algorithm evaluated across three classifiers for various samplers and datasets: (<bold>a</bold>) Autos; (<bold>b</bold>) Ecoli; (<bold>c</bold>) Glass; (<bold>d</bold>) Lymphography; (<bold>e</bold>) Pageblocks; (<bold>f</bold>) Shuttle; (<bold>g</bold>) Thyroid; (<bold>h</bold>) Yeast</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_63465-fig-9.tif"/>
</fig>
</sec>
<sec id="s5_3_2">
<label>5.3.2</label>
<title>NDESO Evaluation</title>
<p>The second group assesses the performance of our NDESO (NDE &#x002B; ROS) resampling algorithm by comparing it to 14 baseline resampling methods across 20 real-world datasets and nine classifiers, with a focus on G-mean performance and statistical significance. The baseline methods include advanced versions of SMOTE, ADASYN, and SMOTE-CDNN, a hybrid approach combining CDNN with SMOTE oversampling.</p>
<p>Initially, we compared our method with all other resampling techniques, using their default parameters for a fair evaluation. However, while our method runs without error for any <italic>k</italic>-neighbor selection, some existing algorithms fail due to the sparse distribution of minority class data points in many datasets. These failures were primarily caused by the inability of the algorithms to handle situations where the number of neighbors is lower than the required minimum. A summary of such results is shown in <xref ref-type="table" rid="table-3">Table 3</xref>. As observed, KMeans-SMOTE is the most unstable when using default parameters, failing to execute on nine datasets. It is followed by NearMiss, ADASYN, Borderline-SMOTE, SMOTE, SMOTE-ENN, SMOTE-Tomek, SMOTE-CDNN, and SVM-SMOTE, all of which expect the minority class to have at least five neighbors. SMOTE-ENN, ENN, and ECDNN failed on several datasets where a class had only a single member. RandomUnder and ENN also encountered failures during internal cross-validation when applied to small subsets of datasets. The only datasets on which these methods executed successfully with default parameters are <italic>Dermatology</italic>, <italic>Hayes-Roth</italic>, <italic>New-Thyroid</italic>, <italic>Segment</italic>, <italic>Svmguide2</italic>, and <italic>Wine</italic>.</p>
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Error messages encountered during resampling using default parameters across methods</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th align="center">Failing methods</th>
<th align="center">Dataset</th>
<th align="center">Error messages</th>
</tr>
</thead>
<tbody>
<tr>
<td>NearMiss, ADASYN, Borderline-SMOTE, SMOTE, SMOTE-ENN, SMOTE-Tomek, SMOTE-CDNN, SVM-SMOTE</td>
<td>Autos, Ecoli, Lymphography, Pageblocks, Shuttle, Yeast (6)</td>
<td><italic>&#x201C;Expected n_neighbors &#x003C;&#x003D; n_samples_fit...&#x201D;</italic></td>
</tr>
<tr>
<td>SMOTE-ENN, ENN, ECDNN</td>
<td>Autos, Lymphography, Svmguide4, Thyroid, Vowel,<break/>Yeast (6)</td>
<td><italic>&#x201C;The least populated class in y has only 1 member, which is too few...&#x201D;</italic></td>
</tr>
<tr>
<td>Kmeans-SMOTE</td>
<td>Autos, Balance, Ecoli, Glass, Lymphography, Pageblocks, Shuttle, Thyroid, Yeast (9)</td>
<td><italic>&#x201C;No clusters found with sufficient samples of class...&#x201D;</italic></td>
</tr>
<tr>
<td>SVM-SMOTE</td>
<td>Autos (1)</td>
<td><italic>&#x201C;All support vectors are considered as noise. SVM-SMOTE is not adapted to your dataset...&#x201D;</italic></td>
</tr>
<tr>
<td>ADASYN</td>
<td>Contraceptive, Glass, Penbased, Svmguide4, Vehicle (5)</td>
<td><italic>&#x201C;No samples will be generated with the provided ratio settings.&#x201D;</italic></td>
</tr>
<tr>
<td>RandomUnder, ENN</td>
<td>Ecoli, Lymphography, Shuttle (3)</td>
<td><italic>&#x201C;k-fold cross-validation requires at least one train/test split...&#x201D;</italic></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>To continue testing, we adjusted the <italic>k</italic>-neighbors parameter for methods like ADASYN, Borderline-SMOTE, SMOTE, SMOTE-ENN, SMOTE-Tomek, ENN, ECDNN, SVM-SMOTE, and Kmeans-SMOTE. If resampling failed on the initial attempt, the number of neighbors was progressively reduced until the sampler succeeded or <italic>k</italic> reached its lowest possible value. However, some samplers, such as ADASYN, SVM-SMOTE, and Kmeans-SMOTE, were still unable to process some datasets even after reducing <italic>k</italic> to its lowest value.</p>
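<p>This fallback can be sketched as a simple retry loop; the sampler interface below is a hypothetical stand-in mirroring the imbalanced-learn <italic>fit_resample</italic> convention:</p>

```python
def resample_with_fallback(make_sampler, X, y, k_start=5, k_min=1):
    """Try resampling with k neighbors, reducing k until it succeeds.

    `make_sampler(k)` is assumed to build a sampler exposing fit_resample,
    mirroring the imbalanced-learn API (a hypothetical stand-in here).
    """
    for k in range(k_start, k_min - 1, -1):
        try:
            return make_sampler(k).fit_resample(X, y)
        except ValueError:
            continue  # Too few neighbors for this k; try a smaller one.
    raise RuntimeError("Sampler failed even at the smallest k.")

# Toy sampler that fails unless k <= 2, to exercise the fallback path.
class ToySampler:
    def __init__(self, k):
        self.k = k

    def fit_resample(self, X, y):
        if self.k > 2:
            raise ValueError("Expected n_neighbors <= n_samples_fit")
        return X, y

Xr, yr = resample_with_fallback(ToySampler, [[0], [1]], [0, 1])
```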
<p>The final evaluation results, summarized in <xref ref-type="table" rid="table-4">Table 4</xref>, highlight the performance of our approach. While the resampling was tested across various classifiers, yielding consistent overall superiority (as further detailed in subsequent analyses), this table explicitly showcases the best performance observed when using the MLP Classifier. As indicated in the table, our NDESO algorithm achieved the highest average G-mean of 0.9362, surpassing other algorithms in most cases. It exhibited only slightly lower performance on 5 out of 20 datasets, namely <italic>Dermatology</italic>, <italic>Hayes-Roth</italic>, <italic>Lymphography</italic>, <italic>Svmguide2</italic>, and <italic>Wine</italic>. Furthermore, our proposed method demonstrates significant improvements, particularly when applied to the <italic>Autos</italic>, <italic>Contraceptive</italic>, <italic>Svmguide4</italic>, <italic>Vowel</italic>, and <italic>Yeast</italic> datasets. On these datasets, other methods yield average results between 0.40 and 0.70, whereas our approach consistently achieves averages of 0.90 or higher. This indicates that our algorithm is more effective in reducing data overlap and performs better resampling to generate a more balanced and representative dataset for each class.</p>
<table-wrap id="table-4">
<label>Table 4</label>
<caption>
<title>G-mean scores of resampling methods with MLP classifier across datasets</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th align="center"></th>
<th align="center">[1]</th>
<th align="center">[2]</th>
<th align="center">[3]</th>
<th align="center">[4]</th>
<th align="center">[5]</th>
<th align="center">[6]</th>
<th align="center">[7]</th>
<th align="center">[8]</th>
<th align="center">[9]</th>
<th align="center">[10]</th>
<th align="center">[11]</th>
<th align="center">[12]</th>
<th align="center">[13]</th>
<th align="center">[14]</th>
<th align="center">[15]</th>
</tr>
</thead>
<tbody>
<tr>
<td>Autos</td>
<td><bold>0.9613</bold></td>
<td>0.6792</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>0.6033</td>
<td>0.6875</td>
<td>0.9500</td>
<td>0.6921</td>
<td>&#x2013;</td>
<td>0.4274</td>
<td>0.6294</td>
<td>0.8141</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>Balance</td>
<td><bold>0.9667</bold></td>
<td>0.9160</td>
<td>0.8512</td>
<td>0.8393</td>
<td>0.9351</td>
<td>0.9420</td>
<td>0.9363</td>
<td>0.9505</td>
<td>0.9362</td>
<td>0.8917</td>
<td>0.6434</td>
<td>0.6458</td>
<td>0.9481</td>
<td>0.9343</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>Contraceptive</td>
<td><bold>0.8761</bold></td>
<td>0.5672</td>
<td>0.5434</td>
<td>0.4942</td>
<td>&#x2013;</td>
<td>0.5660</td>
<td>0.5693</td>
<td>0.7466</td>
<td>0.5748</td>
<td>0.5745</td>
<td>0.5433</td>
<td>0.6688</td>
<td>0.7186</td>
<td>0.5785</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>Dermatology</td>
<td>0.9962</td>
<td>0.9758</td>
<td>0.9778</td>
<td>0.9500</td>
<td>0.9815</td>
<td>0.9795</td>
<td>0.9794</td>
<td>0.9927</td>
<td>0.9889</td>
<td><bold>1.0000</bold></td>
<td>0.9709</td>
<td>0.9889</td>
<td>0.9879</td>
<td>0.9943</td>
<td>0.9852</td>
</tr>
<tr>
<td>Ecoli</td>
<td><bold>0.9065</bold></td>
<td>0.8075</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>0.6567</td>
<td>0.8538</td>
<td>0.8764</td>
<td>0.8520</td>
<td>0.5702</td>
<td>0.3868</td>
<td>0.5451</td>
<td>0.8841</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>Glass</td>
<td><bold>0.8259</bold></td>
<td>0.7244</td>
<td>0.5667</td>
<td>0.6667</td>
<td>&#x2013;</td>
<td>0.7075</td>
<td>0.7487</td>
<td>0.6707</td>
<td>0.7423</td>
<td>0.6056</td>
<td>0.3744</td>
<td>0.4861</td>
<td>0.6925</td>
<td>0.6462</td>
<td>0.7605</td>
</tr>
<tr>
<td>Hayes-Roth</td>
<td>0.7380</td>
<td>0.6565</td>
<td>0.6167</td>
<td>0.6633</td>
<td>0.6220</td>
<td>0.6889</td>
<td>0.6741</td>
<td><bold>0.8611</bold></td>
<td>0.6750</td>
<td>0.7044</td>
<td>0.6521</td>
<td>0.8542</td>
<td>0.7659</td>
<td>0.6333</td>
<td>0.6870</td>
</tr>
<tr>
<td>Lymphography</td>
<td>0.9179</td>
<td>0.9186</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>0.8990</td>
<td>0.9064</td>
<td>0.9521</td>
<td>0.9051</td>
<td>&#x2013;</td>
<td>0.3868</td>
<td>&#x2013;</td>
<td><bold>0.9525</bold></td>
<td>0.9162</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>New-Thyroid</td>
<td><bold>1.0000</bold></td>
<td>0.9528</td>
<td>0.9200</td>
<td>0.9600</td>
<td>0.9806</td>
<td>0.9889</td>
<td>0.9500</td>
<td>0.9629</td>
<td>0.9639</td>
<td>0.7211</td>
<td>0.7122</td>
<td>0.7467</td>
<td>0.9690</td>
<td>0.9528</td>
<td>0.9781</td>
</tr>
<tr>
<td>Pageblocks</td>
<td><bold>0.9980</bold></td>
<td>0.9777</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>0.9715</td>
<td>0.9827</td>
<td>0.9837</td>
<td>0.9892</td>
<td>0.9807</td>
<td>0.5714</td>
<td>0.3500</td>
<td>0.3911</td>
<td>0.9884</td>
<td>0.9778</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>Penbased</td>
<td><bold>0.9653</bold></td>
<td>0.9458</td>
<td>0.9346</td>
<td>0.9561</td>
<td>&#x2013;</td>
<td>0.9563</td>
<td>0.9502</td>
<td>0.9615</td>
<td>0.9562</td>
<td>0.9621</td>
<td>0.9421</td>
<td>0.9482</td>
<td>0.9526</td>
<td>0.9511</td>
<td>0.9506</td>
</tr>
<tr>
<td>Segment</td>
<td><bold>0.9773</bold></td>
<td>0.9356</td>
<td>0.9356</td>
<td>0.9340</td>
<td>0.9356</td>
<td>0.9361</td>
<td>0.9367</td>
<td>0.9611</td>
<td>0.9355</td>
<td>0.9603</td>
<td>0.9361</td>
<td>0.9414</td>
<td>0.9494</td>
<td>0.9351</td>
<td>0.9367</td>
</tr>
<tr>
<td>Shuttle</td>
<td><bold>0.9918</bold></td>
<td>0.9881</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>0.9811</td>
<td>0.9827</td>
<td>0.9865</td>
<td>0.9950</td>
<td>0.9867</td>
<td>0.8402</td>
<td>0.5632</td>
<td>0.5757</td>
<td>0.9950</td>
<td>0.9817</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>Svmguide2</td>
<td>0.8455</td>
<td>0.8193</td>
<td>0.7120</td>
<td>0.6231</td>
<td>0.7506</td>
<td>0.8437</td>
<td>0.8075</td>
<td>0.8203</td>
<td>0.8364</td>
<td>0.8065</td>
<td>0.7628</td>
<td>0.7915</td>
<td>0.8625</td>
<td>0.8511</td>
<td><bold>0.8814</bold></td>
</tr>
<tr>
<td>Svmguide4</td>
<td><bold>0.9366</bold></td>
<td>0.5106</td>
<td>0.5304</td>
<td>0.5726</td>
<td>&#x2013;</td>
<td>0.5718</td>
<td>0.4981</td>
<td>0.8250</td>
<td>0.5248</td>
<td>&#x2013;</td>
<td>0.4762</td>
<td>0.5132</td>
<td>0.5733</td>
<td>0.4889</td>
<td>0.5663</td>
</tr>
<tr>
<td>Thyroid</td>
<td><bold>0.9806</bold></td>
<td>0.9199</td>
<td>0.5889</td>
<td>0.6444</td>
<td>0.9203</td>
<td>0.9355</td>
<td>0.9161</td>
<td>0.9175</td>
<td>0.9148</td>
<td>&#x2013;</td>
<td>0.4160</td>
<td>&#x2013;</td>
<td>0.9211</td>
<td>0.9011</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>Vehicle</td>
<td><bold>0.9355</bold></td>
<td>0.7637</td>
<td>0.7609</td>
<td>0.7419</td>
<td>&#x2013;</td>
<td>0.7608</td>
<td>0.7536</td>
<td>0.8334</td>
<td>0.7705</td>
<td>0.8045</td>
<td>0.7656</td>
<td>0.7585</td>
<td>0.7613</td>
<td>0.7608</td>
<td>0.7763</td>
</tr>
<tr>
<td>Vowel</td>
<td><bold>0.9828</bold></td>
<td>0.6156</td>
<td>0.6114</td>
<td>0.5838</td>
<td>0.6315</td>
<td>0.6140</td>
<td>0.6231</td>
<td>&#x2013;</td>
<td>0.4681</td>
<td>&#x2013;</td>
<td>0.4598</td>
<td>0.4383</td>
<td>0.5527</td>
<td>0.6019</td>
<td>0.6058</td>
</tr>
<tr>
<td>Wine</td>
<td>0.9879</td>
<td>0.9758</td>
<td>0.9655</td>
<td>0.9738</td>
<td>0.9889</td>
<td>0.9818</td>
<td>0.9818</td>
<td>0.9889</td>
<td>0.9739</td>
<td>0.9889</td>
<td>0.9793</td>
<td><bold>0.9926</bold></td>
<td>0.9917</td>
<td>0.9758</td>
<td>0.9828</td>
</tr>
<tr>
<td>Yeast</td>
<td><bold>0.9349</bold></td>
<td>0.6144</td>
<td>0.5750</td>
<td>0.3750</td>
<td>&#x2013;</td>
<td>0.6611</td>
<td>0.6362</td>
<td>0.6716</td>
<td>0.6238</td>
<td>&#x2013;</td>
<td>0.5356</td>
<td>0.7252</td>
<td>0.6583</td>
<td>0.7000</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>Average G-mean</td>
<td><bold>0.9362</bold></td>
<td>0.8132</td>
<td>0.7393</td>
<td>0.7319</td>
<td>0.8817</td>
<td>0.8129</td>
<td>0.8190</td>
<td>0.8909</td>
<td>0.8151</td>
<td>0.7858</td>
<td>0.6142</td>
<td>0.7023</td>
<td>0.8469</td>
<td>0.8212</td>
<td>0.8283</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="table-4fn1" fn-type="other">
<p>Note: [1] NDESO; [2] RandomOver; [3] RandomUnder; [4] NearMiss; [5] ADASYN; [6] Borderline-SMOTE; [7] SMOTE; [8] SMOTE-ENN; [9] SMOTE-Tomek; [10] ENN; [11] TomekLinks; [12] ECDNN; [13] SMOTE-CDNN; [14] SVM-SMOTE; [15] Kmeans-SMOTE; [&#x2013;] Resampling error. The best score is highlighted in bold.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>Our algorithm maintains this performance trend consistently across all datasets and classifiers, as summarized in <xref ref-type="table" rid="table-5">Tables 5</xref> and <xref ref-type="table" rid="table-6">6</xref>. It achieved the highest average G-mean of 0.9051 across all classifiers and, at 2.88, the lowest average mean rank among all methods, further confirming that it outperforms the other algorithms. We also conducted tests to demonstrate the robustness of our algorithm across different choices of <italic>k</italic>-neighbors &#x003D; [2, 5, 11, 15, 21, 25], as shown in <xref ref-type="table" rid="table-7">Table 7</xref>. The results indicate that our algorithm performs consistently well across different values of <italic>k</italic>, with <italic>k</italic> &#x003D; 5 (our algorithm&#x2019;s default parameter) achieving the best average G-mean of 0.9036. This suggests that the approach remains effective even as <italic>k</italic> varies across datasets and classifiers.</p>
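The G-mean reported throughout these tables is the geometric mean of per-class recall. A minimal illustration (not the paper's evaluation code; the function name is ours) of how it can be computed for a multiclass problem:

```python
import math

def g_mean(y_true, y_pred):
    """Geometric mean of per-class recall (multiclass G-mean)."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        # Recall for class c: correctly predicted c's over all true c's
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        support = sum(1 for t in y_true if t == c)
        recalls.append(tp / support)
    return math.prod(recalls) ** (1.0 / len(recalls))

print(g_mean([0, 0, 1, 1], [0, 0, 1, 0]))  # sqrt(1.0 * 0.5) ~= 0.7071
```

Because it multiplies recalls, a single class with near-zero recall drags the whole score toward zero, which is why the G-mean is a common choice for imbalanced multiclass evaluation.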
<table-wrap id="table-5">
<label>Table 5</label>
<caption>
<title>G-mean scores of resampling methods across various classifiers</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th align="center">Dataset</th>
<th>[1]</th>
<th>[2]</th>
<th>[3]</th>
<th>[4]</th>
<th>[5]</th>
<th>[6]</th>
<th>[7]</th>
<th>[8]</th>
<th>[9]</th>
<th>[10]</th>
<th>[11]</th>
<th>[12]</th>
<th>[13]</th>
<th>[14]</th>
<th>[15]</th>
</tr>
</thead>
<tbody>
<tr>
<td>SVC</td>
<td><bold>0.9389</bold></td>
<td>0.8320</td>
<td>0.7506</td>
<td>0.7455</td>
<td>0.9072</td>
<td>0.8256</td>
<td>0.8420</td>
<td>0.8962</td>
<td>0.8385</td>
<td>0.8152</td>
<td>0.6468</td>
<td>0.7381</td>
<td>0.8585</td>
<td>0.8418</td>
<td>0.8652</td>
</tr>
<tr>
<td><italic>k</italic>-NN</td>
<td><bold>0.9324</bold></td>
<td>0.8200</td>
<td>0.6781</td>
<td>0.6759</td>
<td>0.8675</td>
<td>0.8086</td>
<td>0.8285</td>
<td>0.9000</td>
<td>0.8203</td>
<td>0.7884</td>
<td>0.6258</td>
<td>0.6939</td>
<td>0.8543</td>
<td>0.8267</td>
<td>0.8221</td>
</tr>
<tr>
<td>DecisionTree</td>
<td><bold>0.9497</bold></td>
<td>0.8811</td>
<td>0.7192</td>
<td>0.6812</td>
<td>0.9035</td>
<td>0.8452</td>
<td>0.8577</td>
<td>0.9171</td>
<td>0.8603</td>
<td>0.8571</td>
<td>0.7189</td>
<td>0.7853</td>
<td>0.8929</td>
<td>0.8528</td>
<td>0.8581</td>
</tr>
<tr>
<td>RandomForest</td>
<td><bold>0.9706</bold></td>
<td>0.9204</td>
<td>0.7914</td>
<td>0.7560</td>
<td>0.9484</td>
<td>0.8965</td>
<td>0.9119</td>
<td>0.9666</td>
<td>0.9087</td>
<td>0.8761</td>
<td>0.7495</td>
<td>0.8229</td>
<td>0.9332</td>
<td>0.9058</td>
<td>0.9160</td>
</tr>
<tr>
<td>MLP</td>
<td><bold>0.9362</bold></td>
<td>0.8132</td>
<td>0.7393</td>
<td>0.7319</td>
<td>0.8817</td>
<td>0.8129</td>
<td>0.8190</td>
<td>0.8909</td>
<td>0.8151</td>
<td>0.7858</td>
<td>0.6142</td>
<td>0.7023</td>
<td>0.8469</td>
<td>0.8212</td>
<td>0.8283</td>
</tr>
<tr>
<td>EasyEnsemble</td>
<td><bold>0.8183</bold></td>
<td>0.6942</td>
<td>0.6333</td>
<td>0.6134</td>
<td>0.7794</td>
<td>0.6975</td>
<td>0.7110</td>
<td>0.8028</td>
<td>0.7283</td>
<td>0.7787</td>
<td>0.6688</td>
<td>0.7189</td>
<td>0.7638</td>
<td>0.7483</td>
<td>0.7018</td>
</tr>
<tr>
<td>RUSBoost</td>
<td><bold>0.6690</bold></td>
<td>0.5509</td>
<td>0.5744</td>
<td>0.5498</td>
<td>0.6658</td>
<td>0.6116</td>
<td>0.5746</td>
<td>0.7269</td>
<td>0.6104</td>
<td>0.7625</td>
<td>0.6340</td>
<td>0.6658</td>
<td>0.6961</td>
<td>0.6618</td>
<td>0.5648</td>
</tr>
<tr>
<td>Balanced Bagging</td>
<td><bold>0.9592</bold></td>
<td>0.8954</td>
<td>0.7546</td>
<td>0.7327</td>
<td>0.9201</td>
<td>0.8590</td>
<td>0.8886</td>
<td>0.9238</td>
<td>0.8885</td>
<td>0.8561</td>
<td>0.7416</td>
<td>0.8091</td>
<td>0.9084</td>
<td>0.8770</td>
<td>0.8860</td>
</tr>
<tr>
<td>Balanced RandomForest</td>
<td><bold>0.9714</bold></td>
<td>0.9191</td>
<td>0.7987</td>
<td>0.7545</td>
<td>0.9479</td>
<td>0.8746</td>
<td>0.9109</td>
<td>0.9376</td>
<td>0.9079</td>
<td>0.8998</td>
<td>0.7737</td>
<td>0.8220</td>
<td>0.9339</td>
<td>0.9033</td>
<td>0.9159</td>
</tr>
<tr>
<td>Average G-mean</td>
<td><bold>0.9051</bold></td>
<td>0.8140</td>
<td>0.7155</td>
<td>0.6934</td>
<td>0.8691</td>
<td>0.8035</td>
<td>0.8160</td>
<td>0.8847</td>
<td>0.8198</td>
<td>0.8244</td>
<td>0.6859</td>
<td>0.7509</td>
<td>0.8542</td>
<td>0.8265</td>
<td>0.8176</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="table-5fn1" fn-type="other">
<p>Note: [1] NDESO; [2] RandomOver; [3] RandomUnder; [4] NearMiss; [5] ADASYN; [6] Borderline-SMOTE; [7] SMOTE; [8] SMOTE-ENN; [9] SMOTE-Tomek; [10] ENN; [11] TomekLinks; [12] ECDNN; [13] SMOTE-CDNN; [14] SVM-SMOTE; [15] Kmeans-SMOTE. The best score is highlighted in bold.</p>
</fn>
</table-wrap-foot>
</table-wrap><table-wrap id="table-6">
<label>Table 6</label>
<caption>
<title>Mean rank results of resampling methods across various classifiers</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th align="center">Dataset</th>
<th align="center">[1]</th>
<th align="center">[2]</th>
<th align="center">[3]</th>
<th align="center">[4]</th>
<th align="center">[5]</th>
<th align="center">[6]</th>
<th align="center">[7]</th>
<th align="center">[8]</th>
<th align="center">[9]</th>
<th align="center">[10]</th>
<th align="center">[11]</th>
<th align="center">[12]</th>
<th align="center">[13]</th>
<th align="center">[14]</th>
<th align="center">[15]</th>
</tr>
</thead>
<tbody>
<tr>
<td>SVC</td>
<td><bold>2.88</bold></td>
<td>7.75</td>
<td>12.10</td>
<td>11.50</td>
<td>10.30</td>
<td>7.58</td>
<td>7.00</td>
<td>3.28</td>
<td>6.88</td>
<td>8.95</td>
<td>11.30</td>
<td>8.18</td>
<td>4.30</td>
<td>8.03</td>
<td>10.00</td>
</tr>
<tr>
<td><italic>k</italic>-NN</td>
<td><bold>2.10</bold></td>
<td>7.55</td>
<td>11.85</td>
<td>11.45</td>
<td>10.68</td>
<td>7.03</td>
<td>7.18</td>
<td>3.63</td>
<td>7.58</td>
<td>8.53</td>
<td>11.45</td>
<td>8.88</td>
<td>3.90</td>
<td>8.05</td>
<td>10.18</td>
</tr>
<tr>
<td>DecisionTree</td>
<td><bold>2.35</bold></td>
<td>4.73</td>
<td>11.38</td>
<td>12.03</td>
<td>11.03</td>
<td>7.73</td>
<td>7.68</td>
<td>4.88</td>
<td>7.43</td>
<td>8.38</td>
<td>10.40</td>
<td>9.33</td>
<td>4.15</td>
<td>8.45</td>
<td>10.10</td>
</tr>
<tr>
<td>RandomForest</td>
<td><bold>2.23</bold></td>
<td>5.75</td>
<td>12.20</td>
<td>12.65</td>
<td>10.03</td>
<td>7.60</td>
<td>6.90</td>
<td>3.73</td>
<td>7.00</td>
<td>8.20</td>
<td>11.30</td>
<td>9.03</td>
<td>4.53</td>
<td>8.40</td>
<td>10.48</td>
</tr>
<tr>
<td>MLP</td>
<td><bold>1.85</bold></td>
<td>8.05</td>
<td>11.85</td>
<td>11.60</td>
<td>10.50</td>
<td>6.28</td>
<td>7.35</td>
<td>3.98</td>
<td>6.98</td>
<td>9.00</td>
<td>11.55</td>
<td>8.95</td>
<td>4.15</td>
<td>8.68</td>
<td>9.25</td>
</tr>
<tr>
<td>EasyEnsemble</td>
<td><bold>3.80</bold></td>
<td>8.58</td>
<td>11.50</td>
<td>11.18</td>
<td>10.70</td>
<td>7.70</td>
<td>8.15</td>
<td>5.53</td>
<td>6.65</td>
<td>8.73</td>
<td>8.85</td>
<td>6.63</td>
<td>4.00</td>
<td>7.85</td>
<td>10.18</td>
</tr>
<tr>
<td>RUSBoost</td>
<td><bold>6.18</bold></td>
<td>9.98</td>
<td>10.78</td>
<td>11.28</td>
<td>10.83</td>
<td>7.50</td>
<td>8.78</td>
<td>5.33</td>
<td>7.08</td>
<td>6.55</td>
<td>6.85</td>
<td>5.73</td>
<td>4.10</td>
<td>7.90</td>
<td>11.18</td>
</tr>
<tr>
<td>Balanced Bagging</td>
<td><bold>2.40</bold></td>
<td>4.98</td>
<td>12.08</td>
<td>12.00</td>
<td>10.93</td>
<td>7.30</td>
<td>6.83</td>
<td>4.88</td>
<td>6.30</td>
<td>9.73</td>
<td>10.45</td>
<td>8.83</td>
<td>4.70</td>
<td>8.88</td>
<td>9.75</td>
</tr>
<tr>
<td>Balanced RandomForest</td>
<td><bold>2.18</bold></td>
<td>5.73</td>
<td>11.35</td>
<td>12.00</td>
<td>11.23</td>
<td>7.95</td>
<td>7.55</td>
<td>3.60</td>
<td>7.58</td>
<td>8.95</td>
<td>10.95</td>
<td>8.58</td>
<td>3.98</td>
<td>7.95</td>
<td>10.45</td>
</tr>
<tr>
<td>Average Mean Rank</td>
<td><bold>2.88</bold></td>
<td>7.01</td>
<td>11.68</td>
<td>11.74</td>
<td>10.69</td>
<td>7.41</td>
<td>7.49</td>
<td>4.31</td>
<td>7.05</td>
<td>8.56</td>
<td>10.34</td>
<td>8.23</td>
<td>4.20</td>
<td>8.24</td>
<td>10.17</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="table-6fn1" fn-type="other">
<p>Note: [1] NDESO; [2] RandomOver; [3] RandomUnder; [4] NearMiss; [5] ADASYN; [6] Borderline-SMOTE; [7] SMOTE; [8] SMOTE-ENN; [9] SMOTE-Tomek; [10] ENN; [11] TomekLinks; [12] ECDNN; [13] SMOTE-CDNN; [14] SVM-SMOTE; [15] Kmeans-SMOTE. The best score is highlighted in bold.</p>
</fn>
</table-wrap-foot>
</table-wrap><table-wrap id="table-7">
<label>Table 7</label>
<caption>
<title>The effect of varying <italic>k</italic>-neighbors on our NDESO algorithm tested across datasets and classifiers</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Dataset</th>
<th align="center">NDESO (<italic>k</italic> <bold>&#x003D;</bold> 2)</th>
<th align="center">NDESO (<italic>k</italic> <bold>&#x003D;</bold> 5)</th>
<th align="center">NDESO (<italic>k</italic> <bold>&#x003D;</bold> 11)</th>
<th align="center">NDESO (<italic>k</italic> <bold>&#x003D;</bold> 15)</th>
<th align="center">NDESO (<italic>k</italic> <bold>&#x003D;</bold> 21)</th>
<th align="center">NDESO (<italic>k</italic> <bold>&#x003D;</bold> 25)</th>
</tr>
</thead>
<tbody>
<tr>
<td>SVC</td>
<td>0.9032</td>
<td>0.9389</td>
<td><bold>0.9398</bold></td>
<td>0.9253</td>
<td>0.9231</td>
<td>0.9204</td>
</tr>
<tr>
<td><italic>k</italic>-NN</td>
<td>0.8818</td>
<td>0.9324</td>
<td><bold>0.9375</bold></td>
<td>0.9294</td>
<td>0.9279</td>
<td>0.9247</td>
</tr>
<tr>
<td>DecisionTree</td>
<td>0.9166</td>
<td>0.9487</td>
<td><bold>0.9518</bold></td>
<td>0.9406</td>
<td>0.9429</td>
<td>0.9473</td>
</tr>
<tr>
<td>RandomForest</td>
<td>0.9511</td>
<td><bold>0.9706</bold></td>
<td>0.9694</td>
<td>0.9589</td>
<td>0.9628</td>
<td>0.9626</td>
</tr>
<tr>
<td>MLP</td>
<td>0.8888</td>
<td><bold>0.9332</bold></td>
<td>0.9283</td>
<td>0.9136</td>
<td>0.9120</td>
<td>0.9111</td>
</tr>
<tr>
<td>EasyEnsemble</td>
<td>0.7710</td>
<td><bold>0.8157</bold></td>
<td>0.8080</td>
<td>0.7944</td>
<td>0.7811</td>
<td>0.7906</td>
</tr>
<tr>
<td>RUSBoost</td>
<td>0.6046</td>
<td><bold>0.6645</bold></td>
<td>0.6541</td>
<td>0.6635</td>
<td>0.6466</td>
<td>0.6527</td>
</tr>
<tr>
<td>Balanced Bagging</td>
<td>0.9313</td>
<td><bold>0.9576</bold></td>
<td>0.9568</td>
<td>0.9471</td>
<td>0.9493</td>
<td>0.9530</td>
</tr>
<tr>
<td>Balanced RandomForest</td>
<td>0.9490</td>
<td><bold>0.9711</bold></td>
<td>0.9707</td>
<td>0.9585</td>
<td>0.9644</td>
<td>0.9632</td>
</tr>
<tr>
<td>AVERAGE</td>
<td>0.8664</td>
<td><bold>0.9036</bold></td>
<td>0.9018</td>
<td>0.8924</td>
<td>0.8900</td>
<td>0.8917</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="table-7n1" fn-type="other">
<p>Note: The best score is highlighted in bold.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>The radar plot presented in <xref ref-type="fig" rid="fig-10">Fig. 10</xref> demonstrates the consistently superior performance of our NDESO algorithm compared to other resampling methods across various datasets and classifiers. As illustrated in the plot, the line representing our algorithm sits at the outermost edge for most classifiers, indicating that it outperforms all other methods in terms of G-mean across the tested scenarios. This visually reinforces the effectiveness and robustness of NDESO. However, the RUSBoost classifier appears to be less effective for sparse imbalanced data, especially at imbalance ratios where the majority classes exceed 54%&#x2013;56% [<xref ref-type="bibr" rid="ref-30">30</xref>]. This leads to poor performance for nearly all resamplers across datasets, as evidenced in both radar plots. On this classifier, our algorithm was unable to achieve the best performance, primarily because the classifier relies on random undersampling, which may discard important minority-class data points. This contradicts the objective of our algorithm, which generates additional data points after displacing noisy ones.</p>
<fig id="fig-10">
<label>Figure 10</label>
<caption>
<title>Comparison of the average G-mean scores of resampling methods across various classifiers and datasets, evaluating NDESO against: (<bold>a</bold>) ROS, RUS, NearMiss, ADASYN, Borderline-SMOTE, SMOTE, SMOTE-ENN; (<bold>b</bold>) SMOTE-Tomek, ENN, TomekLinks, Edited-CDNN, SMOTE-CDNN, SVM-SMOTE, KMeans-SMOTE</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_63465-fig-10.tif"/>
</fig>
<p>The results of the Friedman test and subsequent Nemenyi post-hoc analysis for each classifier are presented in <xref ref-type="fig" rid="fig-11">Fig. 11</xref>. Statistically, our approach demonstrates strong performance: the Friedman test yields a <italic>p</italic>-value well below the 0.05 threshold, leading to the rejection of the <italic>null</italic> hypothesis. The Nemenyi post-hoc test further reveals that our method significantly outperforms most of the other methods, reflected in its consistently lowest mean rank on every classifier except RUSBoost. Moreover, only a few other methods fall within the same critical difference threshold as our method, underlining that the performance improvement indicated by the obtained G-mean scores is statistically significant compared to the other alternatives.</p>
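The Friedman test used here operates on per-dataset ranks of the competing methods. A minimal sketch of the test statistic (the function name is ours; the chi-square p-value lookup with k − 1 degrees of freedom, and the F-distribution refinement often used in practice, are omitted):

```python
def friedman_statistic(rank_rows):
    """Friedman chi-square statistic from per-dataset ranks.

    rank_rows: one list of method ranks per dataset (N rows, k columns).
    """
    n = len(rank_rows)        # number of datasets
    k = len(rank_rows[0])     # number of methods
    # Average rank of each method across datasets
    mean_ranks = [sum(row[j] for row in rank_rows) / n for j in range(k)]
    # chi^2_F = 12N / (k(k+1)) * (sum_j Rbar_j^2 - k(k+1)^2 / 4)
    return (12.0 * n / (k * (k + 1))) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4.0
    )
```

With two datasets ranking three methods identically, `friedman_statistic([[1, 2, 3], [1, 2, 3]])` reaches its maximum of 4.0, while perfectly contradictory rankings give 0.0.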
<fig id="fig-11">
<label>Figure 11</label>
<caption>
<title>Statistical comparison of resampling methods across classifiers and datasets using the Friedman-Nemenyi post-test. The plots show the mean rank of each resampler (left) and the Critical Difference (CD) diagram (right) for the following classifiers: (<bold>a</bold>) SVC; (<bold>b</bold>) k-NN; (<bold>c</bold>) DecisionTree; (<bold>d</bold>) RandomForest; (<bold>e</bold>) MLP; (<bold>f</bold>) EasyEnsemble; (<bold>g</bold>) RUSBoost; (<bold>h</bold>) BalancedBagging; (<bold>i</bold>) BalancedRandomForest</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_63465-fig-11a.tif"/>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_63465-fig-11b.tif"/>
</fig>
<p>For example, the test results for the MLP classifier, shown in <xref ref-type="fig" rid="fig-11">Fig. 11</xref> and corresponding to the tabulated results in <xref ref-type="table" rid="table-4">Table 4</xref>, yield a <italic>p</italic>-value of 6.08e&#x2212;20. This strongly rejects the <italic>null</italic> hypothesis, confirming that the observed differences are highly unlikely to have occurred by chance. The Critical Difference (CD) &#x003D; 4.796 from the Nemenyi test indicates that methods whose mean ranks differ by more than this value exhibit statistically significant performance differences, whereas methods whose mean rank differences fall within this value do not. For instance, the following groups did not show significant performance differences, as their mean rank differences fall within the CD value (indicated by crossing the same bold line in the diagram):
<list list-type="bullet">
<list-item>
<p>RandomUnder, NearMiss, TomekLinks, ADASYN, KMeans-SMOTE, ENN, ECDNN, SVM-SMOTE, RandomOver, and SMOTE</p>
</list-item>
<list-item>
<p>NearMiss, TomekLinks, ADASYN, KMeans-SMOTE, ENN, ECDNN, SVM-SMOTE, RandomOver, SMOTE, and SMOTE-Tomek</p></list-item>
<list-item>
<p>ADASYN, KMeans-SMOTE, ENN, ECDNN, SVM-SMOTE, RandomOver, SMOTE, SMOTE-Tomek, and Borderline-SMOTE</p></list-item>
<list-item>
<p>SVM-SMOTE, RandomOver, SMOTE, SMOTE-Tomek, Borderline-SMOTE, SMOTE-CDNN, and SMOTE-ENN</p></list-item>
<list-item>
<p>Borderline-SMOTE, SMOTE-CDNN, SMOTE-ENN, and NDESO</p></list-item>
</list></p>
<p>This indicates that our method, NDESO, shows a statistically significant performance difference compared to all other methods except Borderline-SMOTE, SMOTE-CDNN, and SMOTE-ENN. Even so, the superiority of our method is supported by its lowest mean rank of 1.85, indicating that it outperforms all other methods. In addition, our algorithm achieved an average G-mean score of 0.9362, higher than Borderline-SMOTE&#x2019;s 0.8129, SMOTE-ENN&#x2019;s 0.8909 (with one failure on the <italic>vowel</italic> dataset), and SMOTE-CDNN&#x2019;s 0.8469, further underscoring its advantage.</p>
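The critical difference follows the standard Nemenyi formula, CD = q<sub>&#x03B1;</sub>&#x00B7;&#x221A;(k(k+1)/(6N)). A hedged sketch under our assumptions (the q value is taken from the published Studentized-range-based table for &#x03B1; = 0.05 and k = 15 methods; N = 20 datasets matches this study's benchmark):

```python
import math

def nemenyi_cd(q_alpha, n_methods, n_datasets):
    """Nemenyi critical difference: CD = q_alpha * sqrt(k(k+1) / (6N))."""
    k, n = n_methods, n_datasets
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))

# 15 resamplers over 20 datasets; q_0.05 ~= 3.391 for k = 15
cd = nemenyi_cd(3.391, 15, 20)
print(round(cd, 3))  # 4.796
```

Any pair of methods whose mean ranks (Table 6) differ by more than this CD can be declared significantly different at the 0.05 level.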
<p>We present the scatter plots and confusion matrices for the <italic>yeast</italic> dataset&#x2014;one of the sparsest and most imbalanced datasets, with ten classes&#x2014;in <xref ref-type="fig" rid="fig-12">Figs. 12</xref> and <xref ref-type="fig" rid="fig-13">13</xref>, respectively. The t-SNE plots reveal that, similar to the patterns observed with the NDE algorithm in <xref ref-type="fig" rid="fig-8">Fig. 8</xref>, our NDESO algorithm generates data points with less overlap than other resamplers. In contrast, ECDNN produces relatively clean and separated data points but still exhibits significant overlap. Moreover, the total number of classes is reduced from ten to nine, indicating that ECDNN may have inadvertently removed critical data points. The confusion matrices further highlight that our algorithm achieved the best performance, as evidenced by a more apparent dark blue diagonal, reflecting strong alignment between predicted and actual values. This confirms the accuracy and reliability of NDESO in correctly classifying the data. ADASYN, ENN, and KMeans-SMOTE were unsuccessful in resampling this dataset, as also seen in <xref ref-type="table" rid="table-4">Table 4</xref>, encountering the same error shown in <xref ref-type="table" rid="table-3">Table 3</xref>.</p>
<fig id="fig-12">
<label>Figure 12</label>
<caption>
<title>Scatter plots compare our NDESO algorithm with other methods on the <italic>Yeast</italic> dataset, showing distributions for (<bold>a</bold>) the original dataset alongside those generated by: (<bold>b</bold>) NDESO; (<bold>c</bold>) SMOTE; (<bold>d</bold>) ROS; (<bold>e</bold>) RUS; (<bold>f</bold>) NearMiss; (<bold>g</bold>) Borderline-SMOTE; (<bold>h</bold>) SMOTE-ENN; (<bold>i</bold>) SMOTE-Tomek; (<bold>j</bold>) TomekLinks; (<bold>k</bold>) ECDNN; (<bold>l</bold>) SMOTE-CDNN; (<bold>m</bold>) SVM-SMOTE</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_63465-fig-12.tif"/>
</fig><fig id="fig-13">
<label>Figure 13</label>
<caption>
<title>Confusion matrices, assessed using G-mean scores on the <italic>Yeast</italic> dataset, demonstrate the effectiveness of different resampling methods. NDESO achieved the highest accuracy in aligning predicted labels with true labels. The matrices are presented for: (<bold>a</bold>) NDESO; (<bold>b</bold>) SMOTE; (<bold>c</bold>) ROS; (<bold>d</bold>) RUS; (<bold>e</bold>) NearMiss; (<bold>f</bold>) Borderline-SMOTE; (<bold>g</bold>) SMOTE-ENN; (<bold>h</bold>) SMOTE-Tomek; (<bold>i</bold>) TomekLinks; (<bold>j</bold>) ECDNN; (<bold>k</bold>) SMOTE-CDNN; (<bold>l</bold>) SVM-SMOTE</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_63465-fig-13.tif"/>
</fig>
<p>While our method shows promising results, there is one notable caveat. Our algorithm calculates the average distance between a data point and its <italic>k</italic>-neighbors and then moves the data point closer to the centroid while maintaining that distance. This distance calculation can be computationally expensive, particularly for large datasets. As a result, as shown in <xref ref-type="table" rid="table-8">Table 8</xref>, our method does not achieve the best execution time. The recorded time reflects only the duration of the resampling process applied to the initial data before the classification task, not the total execution time. Simpler methods such as RandomUnder, RandomOver, NearMiss, and SMOTE, along with some of their variants, are more time-efficient. These methods rely on less expensive sampling techniques; RandomUnder and NearMiss in particular can substantially reduce the majority class to balance the dataset. However, this comes at a cost: they produce fewer representative samples, resulting in lower test performance. On the other hand, while our method is not the fastest, it is not the slowest either. Specifically, it outperforms SMOTE-CDNN, a more recent SMOTE variant with relatively good resampling results, and is also faster than several other methods, such as ECDNN, SVM-SMOTE, and KMeans-SMOTE.</p>
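The displacement step described above can be sketched as follows. This is a simplified illustration, not the paper's implementation: the function name and the exact placement rule (repositioning the point on the segment toward its class centroid, at its average <italic>k</italic>-neighbor distance from the centroid) are our assumptions.

```python
import math

def displace_toward_centroid(point, centroid, avg_knn_dist):
    """Reposition a noisy point on the line toward its class centroid,
    placing it avg_knn_dist away from the centroid (illustrative only).
    The point moves closer whenever avg_knn_dist is smaller than its
    current distance to the centroid."""
    d = math.dist(point, centroid)
    if d == 0.0:                      # already at the centroid; nothing to move
        return tuple(point)
    scale = avg_knn_dist / d
    return tuple(c + (p - c) * scale for p, c in zip(point, centroid))

# A point 4 units from its centroid, with an average k-NN distance of 1,
# is pulled to 1 unit from the centroid along the same direction.
print(displace_toward_centroid((4.0, 0.0), (0.0, 0.0), 1.0))  # (1.0, 0.0)
```

The per-point cost is dominated by the <italic>k</italic>-nearest-neighbor distance queries that produce `avg_knn_dist`, which is consistent with the execution-time behavior reported in Table 8.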
<table-wrap id="table-8">
<label>Table 8</label>
<caption>
<title>Execution time (in seconds) of resampling methods with MLP classifier across datasets</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Dataset</th>
<th>[1]</th>
<th>[2]</th>
<th>[3]</th>
<th>[4]</th>
<th>[5]</th>
<th>[6]</th>
<th>[7]</th>
<th>[8]</th>
<th>[9]</th>
<th>[10]</th>
<th>[11]</th>
<th>[12]</th>
<th>[13]</th>
<th>[14]</th>
<th>[15]</th>
</tr>
</thead>
<tbody>
<tr>
<td>Autos</td>
<td>0.0041</td>
<td><bold>0.0024</bold></td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>0.0045</td>
<td>0.0062</td>
<td>0.0853</td>
<td>0.0514</td>
<td>&#x2013;</td>
<td>0.0024</td>
<td>0.0049</td>
<td>0.0926</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>Balance</td>
<td>0.0206</td>
<td><bold>0.0012</bold></td>
<td>0.0023</td>
<td>0.0039</td>
<td>0.0026</td>
<td>0.0024</td>
<td>0.0019</td>
<td>0.0044</td>
<td>0.0039</td>
<td>0.0028</td>
<td>0.0033</td>
<td>0.0157</td>
<td>0.0284</td>
<td>0.0054</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>Contraceptive</td>
<td>0.0964</td>
<td>0.0017</td>
<td><bold>0.0015</bold></td>
<td>0.0042</td>
<td>&#x2013;</td>
<td>0.0077</td>
<td>0.0039</td>
<td>0.0108</td>
<td>0.0093</td>
<td>0.0051</td>
<td>0.0049</td>
<td>0.1101</td>
<td>0.1692</td>
<td>0.0841</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>Dermatology</td>
<td>0.0080</td>
<td>0.0041</td>
<td><bold>0.0021</bold></td>
<td>0.0037</td>
<td>0.0173</td>
<td>0.0144</td>
<td>0.0051</td>
<td>0.0176</td>
<td>0.0141</td>
<td>0.0053</td>
<td>0.0114</td>
<td>0.0120</td>
<td>0.0276</td>
<td>0.0326</td>
<td>0.6554</td>
</tr>
<tr>
<td>Ecoli</td>
<td>0.0125</td>
<td><bold>0.0020</bold></td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>0.0123</td>
<td>0.0053</td>
<td>0.0125</td>
<td>0.0088</td>
<td>0.0038</td>
<td>0.0024</td>
<td>0.0082</td>
<td>0.0636</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>Glass</td>
<td>0.0044</td>
<td>0.0018</td>
<td><bold>0.0017</bold></td>
<td>0.0034</td>
<td>&#x2013;</td>
<td>0.0091</td>
<td>0.0045</td>
<td>0.0060</td>
<td>0.0045</td>
<td>0.0033</td>
<td>0.0021</td>
<td>0.0062</td>
<td>0.0124</td>
<td>0.0144</td>
<td>0.0176</td>
</tr>
<tr>
<td>Hayes-Roth</td>
<td>0.0027</td>
<td><bold>0.0014</bold></td>
<td>0.0017</td>
<td>0.0024</td>
<td>0.0022</td>
<td>0.0025</td>
<td>0.0018</td>
<td>0.0047</td>
<td>0.0027</td>
<td>0.0028</td>
<td>0.0023</td>
<td>0.0032</td>
<td>0.0051</td>
<td>0.0051</td>
<td>0.0055</td>
</tr>
<tr>
<td>Lymphography</td>
<td>0.0071</td>
<td>0.0024</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>0.0042</td>
<td>0.0051</td>
<td>0.1013</td>
<td>0.0770</td>
<td>&#x2013;</td>
<td><bold>0.0021</bold></td>
<td>&#x2013;</td>
<td>0.0174</td>
<td>0.0840</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>New-Thyroid</td>
<td>0.0039</td>
<td>0.0020</td>
<td>0.0018</td>
<td>0.0033</td>
<td>0.0029</td>
<td>0.0029</td>
<td>0.0022</td>
<td>0.0037</td>
<td>0.0035</td>
<td>0.0021</td>
<td><bold>0.0017</bold></td>
<td>0.0085</td>
<td>0.0107</td>
<td>0.0093</td>
<td>0.0078</td>
</tr>
<tr>
<td>Pageblocks</td>
<td>0.0153</td>
<td><bold>0.0016</bold></td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>0.0084</td>
<td>0.0550</td>
<td>0.0330</td>
<td>0.0493</td>
<td>0.0350</td>
<td>0.0032</td>
<td>0.0030</td>
<td>0.0141</td>
<td>0.2976</td>
<td>0.0920</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>Penbased</td>
<td>0.0450</td>
<td>0.0024</td>
<td><bold>0.0016</bold></td>
<td>0.0938</td>
<td>&#x2013;</td>
<td>0.0158</td>
<td>0.0070</td>
<td>0.0866</td>
<td>0.0806</td>
<td>0.0447</td>
<td>0.0749</td>
<td>0.0498</td>
<td>0.0828</td>
<td>0.1453</td>
<td>0.8824</td>
</tr>
<tr>
<td>Segment</td>
<td>0.2421</td>
<td><bold>0.0023</bold></td>
<td>0.0029</td>
<td>0.0533</td>
<td>0.0023</td>
<td>0.0033</td>
<td>0.0021</td>
<td>0.0273</td>
<td>0.0061</td>
<td>0.0151</td>
<td>0.0072</td>
<td>0.2694</td>
<td>0.2516</td>
<td>0.0030</td>
<td>0.0020</td>
</tr>
<tr>
<td>Shuttle</td>
<td>0.2376</td>
<td><bold>0.0020</bold></td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>0.0104</td>
<td>0.0128</td>
<td>0.0071</td>
<td>0.0734</td>
<td>0.0680</td>
<td>0.0170</td>
<td>0.0210</td>
<td>0.1977</td>
<td>3.0294</td>
<td>0.0651</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>Svmguide2</td>
<td>0.0086</td>
<td>0.0039</td>
<td><bold>0.0025</bold></td>
<td>0.0044</td>
<td>0.0864</td>
<td>0.0754</td>
<td>0.0749</td>
<td>0.0849</td>
<td>0.0799</td>
<td>0.0730</td>
<td>0.0743</td>
<td>0.0099</td>
<td>0.0373</td>
<td>0.0914</td>
<td>0.1012</td>
</tr>
<tr>
<td>Svmguide4</td>
<td>0.0069</td>
<td>0.0025</td>
<td><bold>0.0014</bold></td>
<td>0.0034</td>
<td>&#x2013;</td>
<td>0.0053</td>
<td>0.0041</td>
<td>0.0057</td>
<td>0.0052</td>
<td>&#x2013;</td>
<td>0.0031</td>
<td>0.0099</td>
<td>0.0136</td>
<td>0.0283</td>
<td>0.1271</td>
</tr>
<tr>
<td>Thyroid</td>
<td>0.0206</td>
<td>0.0029</td>
<td><bold>0.0017</bold></td>
<td>0.0914</td>
<td>0.0732</td>
<td>0.0862</td>
<td>0.0048</td>
<td>0.0077</td>
<td>0.0076</td>
<td>&#x2013;</td>
<td>0.0099</td>
<td>&#x2013;</td>
<td>0.2731</td>
<td>0.1068</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>Vehicle</td>
<td>0.0337</td>
<td>0.0019</td>
<td><bold>0.0017</bold></td>
<td>0.0060</td>
<td>&#x2013;</td>
<td>0.0060</td>
<td>0.0822</td>
<td>0.0873</td>
<td>0.0805</td>
<td>0.0766</td>
<td>0.0615</td>
<td>0.0332</td>
<td>0.1361</td>
<td>0.1210</td>
<td>0.3299</td>
</tr>
<tr>
<td>Vowel</td>
<td>0.0138</td>
<td>0.0020</td>
<td><bold>0.0015</bold></td>
<td>0.0047</td>
<td>0.0014</td>
<td>0.0014</td>
<td>0.0016</td>
<td>&#x2013;</td>
<td>0.0054</td>
<td>&#x2013;</td>
<td>0.0028</td>
<td>0.0293</td>
<td>0.0176</td>
<td>0.0016</td>
<td>0.0016</td>
</tr>
<tr>
<td>Wine</td>
<td>0.0047</td>
<td>0.0017</td>
<td><bold>0.0015</bold></td>
<td>0.0027</td>
<td>0.0040</td>
<td>0.0040</td>
<td>0.0040</td>
<td>0.0064</td>
<td>0.0039</td>
<td>0.0024</td>
<td>0.0038</td>
<td>0.0072</td>
<td>0.0072</td>
<td>0.0062</td>
<td>0.0138</td>
</tr>
<tr>
<td>Yeast</td>
<td>0.1244</td>
<td>0.0019</td>
<td><bold>0.0015</bold></td>
<td>0.0719</td>
<td>&#x2013;</td>
<td>0.0572</td>
<td>0.0162</td>
<td>0.0882</td>
<td>0.0303</td>
<td>&#x2013;</td>
<td>0.0075</td>
<td>0.0905</td>
<td>1.0080</td>
<td>0.4161</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>Average time</td>
<td>0.0456</td>
<td>0.0022</td>
<td><bold>0.0018</bold></td>
<td>0.0235</td>
<td>0.0192</td>
<td>0.0191</td>
<td>0.0137</td>
<td>0.0402</td>
<td>0.0289</td>
<td>0.0184</td>
<td>0.0151</td>
<td>0.0489</td>
<td>0.2791</td>
<td>0.0729</td>
<td>0.1949</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="table-8fn1" fn-type="other">
<p>Note: [1] NDESO; [2] RandomOver; [3] RandomUnder; [4] NearMiss; [5] ADASYN; [6] Borderline-SMOTE; [7] SMOTE; [8] SMOTE-ENN; [9] SMOTE-Tomek; [10] ENN; [11] TomekLinks; [12] ECDNN; [13] SMOTE-CDNN; [14] SVM-SMOTE; [15] Kmeans-SMOTE. The best score is highlighted in bold.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</sec>
</sec>
</sec>
<sec id="s6">
<label>6</label>
<title>Discussion</title>
<p>The SMOTE method and its variants have been widely adopted and proven effective at addressing class imbalance in most scenarios. However, while these methods generate quantitatively balanced data, the synthetic data patterns often deviate from the original distribution, introducing noise into the resampled dataset, especially for sparse data with overlapping data points. Recently, Wang et al. [<xref ref-type="bibr" rid="ref-23">23</xref>] proposed a hybrid method called SMOTE-CDNN that combines undersampling and oversampling techniques. Its undersampling step removes data points whose class prediction does not match under centroid displacement. However, this process may unintentionally discard important information, causing the subsequent oversampling to replicate less essential data points and potentially increasing the number of non-representative samples.</p>
<p>In this study, we propose an alternative approach that preserves these noisy data points by shifting them closer to the center of their class before performing random oversampling to balance the dataset. This approach offers several benefits:
<list list-type="order">
<list-item>
<p>Relative positioning</p>
<p>Moving these noisy data points keeps their relative positioning within the cluster intact, preserving their contribution to their class&#x2019;s data distribution.</p></list-item>
<list-item>
<p>Noise reduction</p>
<p>Noisy data points are outliers that can distort the representation of their class. Moving them closer to their centroid aligns them with the center of their class, reducing their impact as noise.</p></list-item>
<list-item>
<p>Class Boundaries</p>
<p>Moving them away from overlapping data points sharpens the boundaries between classes.</p></list-item>
<list-item>
<p>Statistical properties</p>
<p>By moving these noisy data points toward the center, the dataset&#x2019;s statistical properties (e.g., mean, variance) are largely preserved.</p></list-item>
<list-item>
<p>Oversampling</p>
<p>Performing oversampling after these adjustments makes the generated synthetic data points more likely to align with the class&#x2019;s data characteristics.</p></list-item>
</list></p>
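<p>The displacement-then-oversample idea above can be sketched in code. This is an illustrative approximation only, not the authors&#x2019; implementation; the function names (<monospace>shift_noisy_points</monospace>, <monospace>random_oversample</monospace>) and the majority-vote noise criterion are assumptions made for the sketch.</p>

```python
import numpy as np
from collections import Counter
from sklearn.neighbors import NearestNeighbors

def shift_noisy_points(X, y, k=5):
    """Move points whose k-neighborhood is dominated by other classes
    toward their own class centroid, using the average neighbor distance
    as the new distance from the centroid (illustrative criterion)."""
    X = X.astype(float).copy()
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nn.kneighbors(X)
    for c in np.unique(y):
        centroid = X[y == c].mean(axis=0)
        for i in np.where(y == c)[0]:
            neigh = idx[i, 1:]                    # skip the point itself
            if (y[neigh] != c).sum() > k // 2:    # mostly foreign neighbors -> noisy
                avg_d = dist[i, 1:].mean()
                direction = centroid - X[i]
                norm = np.linalg.norm(direction)
                if norm > 0:
                    # reposition on the segment toward the centroid,
                    # never moving past the original position
                    X[i] = centroid - direction / norm * min(avg_d, norm)
    return X

def random_oversample(X, y, rng=None):
    """Duplicate minority samples until all classes match the majority count."""
    rng = np.random.default_rng(rng)
    counts = Counter(y)
    n_max = max(counts.values())
    Xs, ys = [X], [y]
    for c, n in counts.items():
        if n < n_max:
            extra = rng.choice(np.where(y == c)[0], size=n_max - n, replace=True)
            Xs.append(X[extra])
            ys.append(y[extra])
    return np.vstack(Xs), np.concatenate(ys)
```

<p>Applying <monospace>shift_noisy_points</monospace> first and then <monospace>random_oversample</monospace> mirrors the two stages discussed above: repositioning preserves every sample while cleaning the class geometry, and the subsequent duplication therefore draws from an already-improved pattern.</p>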
<p>We have validated this hypothesis through various tests, which show that our method consistently outperforms most other resampling methods. It successfully handles datasets with a wide range of imbalance levels, from ratios of 1:1, 1:94, and 1:164 up to an extreme ratio of 1:853, far exceeding the maximum ratio of 1:130 tested in [<xref ref-type="bibr" rid="ref-23">23</xref>]. In addition, the resulting data distributions show a much better ability to overcome overlap and noise problems than SMOTE-CDNN and other SMOTE variants.</p>
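<p>The imbalance ratios quoted above follow directly from class counts (majority count divided by minority count); a minimal illustration, with toy labels invented for this example:</p>

```python
from collections import Counter

def imbalance_ratio(y):
    """Ratio of the largest class count to the smallest, i.e., the IR in 1:IR."""
    counts = Counter(y)
    return max(counts.values()) / min(counts.values())

# Invented toy labels: 853 majority samples vs. 1 minority sample
y = [0] * 853 + [1]
print(imbalance_ratio(y))  # 853.0
```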
<p>The differences between our method and existing approaches are twofold. First, instead of removing data points, as in NearMiss, ENN, or TomekLinks [<xref ref-type="bibr" rid="ref-19">19</xref>,<xref ref-type="bibr" rid="ref-20">20</xref>], our method repositions them, preserving their characteristics while refining their locations so they no longer contribute as noise. This can be interpreted as recreating better data points while maintaining similar data characteristics. Second, our method remains computationally efficient: the repositioning step is straightforward and is followed by synthetic data generation based on the improved pattern. This balance between simplicity and effectiveness keeps it faster than SMOTE-CDNN (which also uses a displacement approach) while achieving a more representative data distribution and, consequently, better performance.</p>
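<p>To make the contrast with removal-based cleaning concrete, the sketch below detects Tomek links (mutual nearest neighbors from different classes), the pairs that methods such as TomekLinks delete. It is a self-contained illustration, not the library implementation:</p>

```python
import numpy as np

def tomek_links(X, y):
    """Return index pairs (i, j) forming Tomek links: mutual nearest
    neighbors that belong to different classes. Removal-based cleaners
    delete one or both points of each link; a repositioning approach
    would keep them and shift them toward their class centroid instead."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)          # ignore self-distances
    nn = D.argmin(axis=1)                # each point's nearest neighbor
    links = []
    for i, j in enumerate(nn):
        if nn[j] == i and y[i] != y[j] and i < j:   # mutual and cross-class
            links.append((i, j))
    return links
```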
<p>The experimental results show that our proposed approach is effective in addressing the challenges posed by sparse and imbalanced data. However, as highlighted in the results analysis, our method does have certain limitations. To generate more representative resampling outcomes, it corrects noisy data points prior to oversampling, which introduces additional processing compared to several baseline methods. As a result, our method is not the fastest, although it remains faster than several other approaches, mainly SMOTE extensions such as SMOTE-CDNN, SVM-SMOTE, and K-Means SMOTE [<xref ref-type="bibr" rid="ref-24">24</xref>]. Additionally, the testing conducted in this study used public datasets with at most 2310 instances and up to 34 features. While these datasets are sparse and exhibit extreme imbalance, with an imbalance ratio reaching 853, they are not sufficient on their own to fully assess the method&#x2019;s robustness. More extensive testing is needed, particularly with large-scale and high-dimensional data, to better understand the consistency of our method&#x2019;s performance on larger datasets. Despite these limitations, our method holds potential for addressing similar challenges in practical applications, especially in domains where critical data samples are sparse compared to less important data. Examples include fraud detection in the financial sector, machine-failure prediction, and medical diagnostics, where positive samples are scarce because such events occur infrequently [<xref ref-type="bibr" rid="ref-31">31</xref>,<xref ref-type="bibr" rid="ref-32">32</xref>]. In such cases, our approach can mitigate the imbalance in the minority class, enabling better representation of the critical samples and ultimately improving model performance.</p>
<p>For further exploration, validating our approach through testing and comparison with solutions designed for big data and distributed computing would further highlight its potential, demonstrating how it can be scaled to handle broader and more complex case studies. Big data processing often requires more complex and resource-intensive workflows, especially when dealing with distributed systems and network constraints, as highlighted in various studies on resampling techniques for big data [<xref ref-type="bibr" rid="ref-33">33</xref>&#x2013;<xref ref-type="bibr" rid="ref-35">35</xref>]. Additionally, exploring its application to binary big data formats, including images, spatial datasets, and graph-based representations, presents an interesting direction for further investigation. This could open up new opportunities in geospatial analysis, network science, and high-dimensional data processing, paving the way for broader adoption and adaptation of the proposed approach in various real-world scenarios.</p>
</sec>
<sec id="s7">
<label>7</label>
<title>Conclusion</title>
<p>This paper proposes a hybrid resampling method that combines a noisy-data-point displacement approach with random oversampling to handle imbalanced multiclass data. Our approach displaces a noisy data point by taking its average distance to its <italic>k</italic> nearest neighbors and repositioning it at that distance closer to the class centroid. This procedure is repeated across overlapping data points, resulting in cleaner class separation ready for oversampling. The method then balances the data distribution through random oversampling. The approach was validated against 14 baseline resamplers with nine classifiers on 20 real-world datasets. Various parameter settings of the method were also evaluated to demonstrate its robustness. Extensive testing, confirmed by statistical tests, shows that our approach outperforms most baselines, highlighting its suitability for a variety of real-world imbalanced classification tasks. Further research on the effectiveness of our strategy for resampling in big data environments and with other types of datasets remains an open topic for investigation.</p>
</sec>
</body>
<back>
<ack>
<p>The authors thank all editors and anonymous reviewers for their comments and suggestions.</p>
</ack>
<sec>
<title>Funding Statement</title>
<p>The authors received no specific funding for this study.</p>
</sec>
<sec>
<title>Author Contributions</title>
<p>The authors confirm their contribution to the paper as follows: study conception and design: I Made Putrama; data collection: I Made Putrama; analysis and interpretation of results: I Made Putrama; draft manuscript preparation: I Made Putrama; resources, validation, and supervision: P&#x00E9;ter Martinek. All authors reviewed the results and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability">
<title>Availability of Data and Materials</title>
<p>The data that support the findings of this study are available on GitHub at <ext-link ext-link-type="uri" xlink:href="https://github.com/goshlive/imbalanced-ndeso">https://github.com/goshlive/imbalanced-ndeso</ext-link> (accessed on 14 January 2025).</p>
</sec>
<sec>
<title>Ethics Approval</title>
<p>Not applicable.</p>
</sec>
<sec sec-type="COI-statement">
<title>Conflicts of Interest</title>
<p>The authors declare no conflicts of interest to report regarding the present study.</p>
</sec>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Arafa</surname> <given-names>A</given-names></string-name>, <string-name><surname>El-Fishawy</surname> <given-names>N</given-names></string-name>, <string-name><surname>Badawy</surname> <given-names>M</given-names></string-name>, <string-name><surname>Radad</surname> <given-names>M</given-names></string-name></person-group>. <article-title>RN-SMOTE: reduced noise SMOTE based on DBSCAN for enhancing imbalanced data classification</article-title>. <source>J King Saud Univ&#x2014;Comput Inf Sci</source>. <year>2022</year>;<volume>34</volume>(<issue>8</issue>):<fpage>5059</fpage>&#x2013;<lpage>74</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.jksuci.2022.06.005</pub-id>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Ren</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Cheung</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Gao</surname> <given-names>XZ</given-names></string-name>, <string-name><surname>Guo</surname> <given-names>X</given-names></string-name></person-group>. <article-title>Grouping-based oversampling in kernel space for imbalanced data classification</article-title>. <source>Pattern Recognit</source>. <year>2023</year>;<volume>133</volume>:<fpage>108992</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.patcog.2022.108992</pub-id>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Madkour</surname> <given-names>AH</given-names></string-name>, <string-name><surname>Abdelkader</surname> <given-names>HM</given-names></string-name>, <string-name><surname>Mohammed</surname> <given-names>AM</given-names></string-name></person-group>. <article-title>Dynamic classification ensembles for handling imbalanced multiclass drifted data streams</article-title>. <source>Inf Sci</source>. <year>2024</year>;<volume>670</volume>:<fpage>120555</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.ins.2024.120555</pub-id>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Lango</surname> <given-names>M</given-names></string-name>, <string-name><surname>Stefanowski</surname> <given-names>J</given-names></string-name></person-group>. <article-title>What makes multi-class imbalanced problems difficult? An experimental study</article-title>. <source>Expert Syst Appl</source>. <year>2022</year>;<volume>199</volume>:<fpage>116962</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.eswa.2022.116962</pub-id>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Yu</surname> <given-names>T</given-names></string-name>, <string-name><surname>Huo</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>Classification of imbalanced data set in financial field based on combined algorithm</article-title>. <source>Mob Inf Syst</source>. <year>2022</year>;<volume>2022</volume>:<fpage>1</fpage>&#x2013;<lpage>7</lpage>. doi:<pub-id pub-id-type="doi">10.1155/2022/1839204</pub-id>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Han</surname> <given-names>T</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>C</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Jiang</surname> <given-names>D</given-names></string-name></person-group>. <article-title>Deep transfer network with joint distribution adaptation: a new intelligent fault diagnosis framework for industry application</article-title>. <source>ISA Trans</source>. <year>2020</year>;<volume>97</volume>:<fpage>269</fpage>&#x2013;<lpage>81</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.isatra.2019.08.012</pub-id>; <pub-id pub-id-type="pmid">31420125</pub-id></mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Fan</surname> <given-names>S</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Song</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>Imbalanced sample selection with deep reinforcement learning for fault diagnosis</article-title>. <source>IEEE Trans Ind Inform</source>. <year>2022</year>;<volume>18</volume>(<issue>4</issue>):<fpage>2518</fpage>&#x2013;<lpage>27</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TII.2021.3100284</pub-id>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Giorgio</surname> <given-names>A</given-names></string-name>, <string-name><surname>Cola</surname> <given-names>G</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>L</given-names></string-name></person-group>. <article-title>Systematic review of class imbalance problems in manufacturing</article-title>. <source>J Manuf Syst</source>. <year>2023</year>;<volume>71</volume>:<fpage>620</fpage>&#x2013;<lpage>44</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.jmsy.2023.10.014</pub-id>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Rezvani</surname> <given-names>S</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>X</given-names></string-name></person-group>. <article-title>A broad review on class imbalance learning techniques</article-title>. <source>Appl Soft Comput</source>. <year>2023</year>;<volume>143</volume>(<issue>9</issue>):<fpage>110415</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.asoc.2023.110415</pub-id>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>M</given-names></string-name>, <string-name><surname>Dong</surname> <given-names>M</given-names></string-name>, <string-name><surname>Jing</surname> <given-names>C</given-names></string-name></person-group>. <article-title>A modified real-value negative selection detector-based oversampling approach for multiclass imbalance problems</article-title>. <source>Inf Sci</source>. <year>2021</year>;<volume>556</volume>:<fpage>160</fpage>&#x2013;<lpage>76</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.ins.2020.12.058</pub-id>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Niaz</surname> <given-names>NU</given-names></string-name>, <string-name><surname>Shahariar</surname> <given-names>KMN</given-names></string-name>, <string-name><surname>Patwary</surname> <given-names>MJA</given-names></string-name></person-group>. <article-title>Class imbalance problems in machine learning: a review of methods and future challenges</article-title>. In: <conf-name>Proceedings of the 2nd International Conference on Computing Advancements</conf-name>; <year>2022 Mar 10&#x2013;12</year>; <publisher-loc>Dhaka, Bangladesh</publisher-loc>. p. <fpage>485</fpage>&#x2013;<lpage>90</lpage>. doi:<pub-id pub-id-type="doi">10.1145/3542954.3543024</pub-id>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Ding</surname> <given-names>H</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>N</given-names></string-name>, <string-name><surname>Shen</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Cui</surname> <given-names>X</given-names></string-name></person-group>. <article-title>RGAN-EL: a GAN and ensemble learning-based hybrid approach for imbalanced data classification</article-title>. <source>Inf Process Manag</source>. <year>2023</year>;<volume>60</volume>(<issue>2</issue>):<fpage>1</fpage>&#x2013;<lpage>20</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.ipm.2022.103235</pub-id>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Pan</surname> <given-names>T</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>T</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>S</given-names></string-name>, <string-name><surname>He</surname> <given-names>S</given-names></string-name>, <string-name><surname>Lv</surname> <given-names>H</given-names></string-name></person-group>. <article-title>Generative adversarial network in mechanical fault diagnosis under small sample: a systematic review on applications and future perspectives</article-title>. <source>ISA Trans</source>. <year>2022</year>;<volume>128</volume>:<fpage>1</fpage>&#x2013;<lpage>10</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.isatra.2021.11.040</pub-id>; <pub-id pub-id-type="pmid">34953580</pub-id></mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Patange</surname> <given-names>AD</given-names></string-name>, <string-name><surname>Pardeshi</surname> <given-names>SS</given-names></string-name>, <string-name><surname>Jegadeeshwaran</surname> <given-names>R</given-names></string-name>, <string-name><surname>Zarkar</surname> <given-names>A</given-names></string-name>, <string-name><surname>Verma</surname> <given-names>K</given-names></string-name></person-group>. <article-title>Augmentation of decision tree model through hyper-parameters tuning for monitoring of cutting tool faults based on vibration signatures</article-title>. <source>J Vib Eng Technol</source>. <year>2023</year>;<volume>11</volume>(<issue>8</issue>):<fpage>3759</fpage>&#x2013;<lpage>77</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s42417-022-00781-9</pub-id>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Pancaldi</surname> <given-names>F</given-names></string-name>, <string-name><surname>Dibiase</surname> <given-names>L</given-names></string-name>, <string-name><surname>Cocconcelli</surname> <given-names>M</given-names></string-name></person-group>. <article-title>Impact of noise model on the performance of algorithms for fault diagnosis in rolling bearings</article-title>. <source>Mech Syst Signal Process</source>. <year>2023</year>;<volume>188</volume>:<fpage>109975</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.ymssp.2022.109975</pub-id>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Nasir</surname> <given-names>F</given-names></string-name>, <string-name><surname>Ahmed</surname> <given-names>AA</given-names></string-name>, <string-name><surname>SabirKiraz</surname> <given-names>M</given-names></string-name>, <string-name><surname>Yevseyeva</surname> <given-names>I</given-names></string-name>, <string-name><surname>Saif</surname> <given-names>M</given-names></string-name></person-group>. <article-title>Data-driven decision-making for bank targetmarketing using supervised learning classifiers on imbalanced big data</article-title>. <source>Comput Mater Contin</source>. <year>2024</year>;<volume>81</volume>(<issue>1</issue>):<fpage>1703</fpage>&#x2013;<lpage>28</lpage>. doi:<pub-id pub-id-type="doi">10.32604/cmc.2024.055192</pub-id>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>N</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>ZL</given-names></string-name>, <string-name><surname>Luo</surname> <given-names>XG</given-names></string-name></person-group>. <article-title>Iterative minority oversampling and its ensemble for ordinal imbalanced datasets</article-title>. <source>Eng Appl Artif Intell</source>. <year>2024</year>;<volume>127</volume>(<issue>Pt A</issue>):<fpage>107211</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.engappai.2023.107211</pub-id>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Vairetti</surname> <given-names>C</given-names></string-name>, <string-name><surname>Assadi</surname> <given-names>JL</given-names></string-name>, <string-name><surname>Maldonado</surname> <given-names>S</given-names></string-name></person-group>. <article-title>Efficient hybrid oversampling and intelligent undersampling for imbalanced big data classification</article-title>. <source>Expert Syst Appl</source>. <year>2024</year>;<volume>246</volume>:<fpage>123149</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.eswa.2024.123149</pub-id>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Mani</surname> <given-names>I</given-names></string-name></person-group>. <article-title>KNN approach to unbalanced data distributions: a case study involving information extraction</article-title>. In: <conf-name>Proceedings of the International Conference on Machine Learning (ICML 2003)</conf-name>; <year>2003 Aug 21&#x2013;24</year>; <publisher-loc>Washington, DC, USA</publisher-loc>. p. <fpage>1</fpage>&#x2013;<lpage>7</lpage>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Kulkarni</surname> <given-names>A</given-names></string-name>, <string-name><surname>Chong</surname> <given-names>D</given-names></string-name>, <string-name><surname>Batarseh</surname> <given-names>FA</given-names></string-name></person-group>. <article-title>Foundations of data imbalance and solutions for a data democracy</article-title>. <source>Data Democr Nexus Artif Intell Softw Dev Knowl Eng</source>. <year>2020</year>;<fpage>83</fpage>&#x2013;<lpage>106</lpage>. doi:<pub-id pub-id-type="doi">10.1016/B978-0-12-818366-3.00005-8</pub-id>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chawla</surname> <given-names>NV</given-names></string-name>, <string-name><surname>Bowyer</surname> <given-names>KW</given-names></string-name>, <string-name><surname>Hall</surname> <given-names>LO</given-names></string-name>, <string-name><surname>Kegelmeyer</surname> <given-names>WP</given-names></string-name></person-group>. <article-title>SMOTE: synthetic minority over-sampling technique</article-title>. <source>J Artif Intell Res</source>. <year>2002</year>;<volume>16</volume>:<fpage>321</fpage>&#x2013;<lpage>57</lpage>. doi:<pub-id pub-id-type="doi">10.1613/jair.953</pub-id>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Yuan</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wei</surname> <given-names>J</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Jiao</surname> <given-names>W</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>H</given-names></string-name></person-group>. <article-title>Review of resampling techniques for the treatment of imbalanced industrial data classification in equipment condition monitoring</article-title>. <source>Eng Appl Artif Intell</source>. <year>2023</year>;<volume>126</volume>(<issue>Pt B</issue>):<fpage>106911</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.engappai.2023.106911</pub-id>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>AX</given-names></string-name>, <string-name><surname>Chukova</surname> <given-names>SS</given-names></string-name>, <string-name><surname>Nguyen</surname> <given-names>BP</given-names></string-name></person-group>. <article-title>Synthetic minority oversampling using edited displacement-based k-nearest neighbors</article-title>. <source>Appl Soft Comput</source>. <year>2023</year>;<volume>148</volume>:<fpage>110895</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.asoc.2023.110895</pub-id>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Ahsan</surname> <given-names>MM</given-names></string-name>, <string-name><surname>Ali</surname> <given-names>MS</given-names></string-name>, <string-name><surname>Siddique</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>Enhancing and improving the performance of imbalanced class data using novel GBO and SSG: a comparative analysis</article-title>. <source>Neural Netw</source>. <year>2024</year>;<volume>173</volume>:<fpage>106157</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.neunet.2024.106157</pub-id>; <pub-id pub-id-type="pmid">38335796</pub-id></mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>He</surname> <given-names>H</given-names></string-name>, <string-name><surname>Bai</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Garcia</surname> <given-names>EA</given-names></string-name>, <string-name><surname>Li</surname> <given-names>S</given-names></string-name></person-group>. <article-title>ADASYN: adaptive synthetic sampling approach for imbalanced learning</article-title>. In: <conf-name>Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence)</conf-name>; <year>2008 Jun 1&#x2013;8</year>; <publisher-loc>Hong Kong, China</publisher-loc>. p. <fpage>132</fpage>&#x2013;<lpage>8</lpage>. doi:<pub-id pub-id-type="doi">10.1109/IJCNN.2008.4633969</pub-id>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Islam</surname> <given-names>A</given-names></string-name>, <string-name><surname>Belhaouari</surname> <given-names>SB</given-names></string-name>, <string-name><surname>Rehman</surname> <given-names>AU</given-names></string-name>, <string-name><surname>Bensmail</surname> <given-names>H</given-names></string-name></person-group>. <article-title>KNNOR: an oversampling technique for imbalanced datasets</article-title>. <source>Appl Soft Comput</source>. <year>2022</year>;<volume>115</volume>:<fpage>108288</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.asoc.2021.108288</pub-id>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>AX</given-names></string-name>, <string-name><surname>Chukova</surname> <given-names>SS</given-names></string-name>, <string-name><surname>Nguyen</surname> <given-names>BP</given-names></string-name></person-group>. <chapter-title>Implementation and analysis of centroid displacement-based k-nearest neighbors</chapter-title>. In: <person-group person-group-type="editor"><string-name><surname>Chen</surname> <given-names>WT</given-names></string-name>, <string-name><surname>Yao</surname> <given-names>LN</given-names></string-name>, <string-name><surname>Cai</surname> <given-names>TT</given-names></string-name>, <string-name><surname>Pan</surname> <given-names>SR</given-names></string-name>, <string-name><surname>Shen</surname> <given-names>T</given-names></string-name>, <string-name><surname>Li</surname> <given-names>X</given-names></string-name></person-group>, editors. <source>Advanced Data Mining and Applications. International Conference on Advanced Data Mining and Applications; 2022 Nov 28&#x2013;30</source>; <publisher-loc>Brisbane, QLD, Australia. Berlin/Heidelberg, Germany</publisher-loc>: <publisher-name>Springer</publisher-name>; <year>2022</year>. p. <fpage>431</fpage>&#x2013;<lpage>43</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-031-22064-7_31</pub-id>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Jia</surname> <given-names>L</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>P</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>R-WDLS: an efficient security region oversampling technique based on data distribution</article-title>. <source>Appl Soft Comput</source>. <year>2024</year>;<volume>154</volume>:<fpage>111376</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.asoc.2024.111376</pub-id>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Dem&#x0161;ar</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Statistical comparisons of classifiers over multiple data sets</article-title>. <source>J Mach Learn Res</source>. <year>2006</year>;<volume>7</volume>:<fpage>1</fpage>&#x2013;<lpage>30</lpage>. doi:<pub-id pub-id-type="doi">10.5555/1248547.1248548</pub-id>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Mujeeb</surname> <given-names>S</given-names></string-name>, <string-name><surname>Javaid</surname> <given-names>N</given-names></string-name>, <string-name><surname>Ahmed</surname> <given-names>A</given-names></string-name>, <string-name><surname>Gulfam</surname> <given-names>SM</given-names></string-name>, <string-name><surname>Qasim</surname> <given-names>U</given-names></string-name>, <string-name><surname>Shafiq</surname> <given-names>M</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Electricity theft detection with automatic labeling and enhanced RUSBoost classification using differential evolution and jaya algorithm</article-title>. <source>IEEE Access</source>. <year>2021</year>;<volume>9</volume>:<fpage>128521</fpage>&#x2013;<lpage>39</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ACCESS.2021.3102643</pub-id>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Huang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>B</given-names></string-name>, <string-name><surname>Xue</surname> <given-names>X</given-names></string-name>, <string-name><surname>Cao</surname> <given-names>J</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>X</given-names></string-name></person-group>. <article-title>Imbalanced credit card fraud detection data: a solution based on hybrid neural network and clustering-based undersampling technique</article-title>. <source>Appl Soft Comput</source>. <year>2024</year>;<volume>154</volume>:<fpage>111368</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.asoc.2024.111368</pub-id>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Rithani</surname> <given-names>M</given-names></string-name>, <string-name><surname>Kumar</surname> <given-names>RP</given-names></string-name>, <string-name><surname>Ali</surname> <given-names>A</given-names></string-name></person-group>. <article-title>A dynamic ensemble learning based data mining framework for medical imbalanced big data</article-title>. <source>Knowl Based Syst</source>. <year>2025</year>;<volume>310</volume>:<fpage>112947</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.knosys.2024.112947</pub-id>.</mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Bagui</surname> <given-names>S</given-names></string-name>, <string-name><surname>Li</surname> <given-names>K</given-names></string-name></person-group>. <article-title>Resampling imbalanced data for network intrusion detection datasets</article-title>. <source>J Big Data</source>. <year>2021</year>;<volume>8</volume>(<issue>1</issue>):<fpage>6</fpage>. doi:<pub-id pub-id-type="doi">10.1186/s40537-020-00390-x</pub-id>.</mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Singh</surname> <given-names>T</given-names></string-name>, <string-name><surname>Khanna</surname> <given-names>R</given-names></string-name>, <string-name><surname>Satakshi</surname></string-name>, <string-name><surname>Kumar</surname> <given-names>M</given-names></string-name></person-group>. <article-title>Improved multi-class classification approach for imbalanced big data on spark</article-title>. <source>J Supercomput</source>. <year>2023</year>;<volume>79</volume>(<issue>6</issue>):<fpage>6583</fpage>&#x2013;<lpage>611</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s11227-022-04908-3</pub-id>.</mixed-citation></ref>
<ref id="ref-35"><label>[35]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Xiao</surname> <given-names>M</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>C</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Distributed classification for imbalanced big data in distributed environments</article-title>. <source>Wirel Netw</source>. <year>2024</year>;<volume>30</volume>(<issue>5</issue>):<fpage>3657</fpage>&#x2013;<lpage>68</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s11276-021-02552-y</pub-id>.</mixed-citation></ref>
</ref-list>
</back></article>