<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">60090</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2024.060090</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Coordinate Descent K-means Algorithm Based on Split-Merge</article-title>
<alt-title alt-title-type="left-running-head">Coordinate Descent K-means Algorithm Based on Split-Merge</alt-title>
<alt-title alt-title-type="right-running-head">Coordinate Descent K-means Algorithm Based on Split-Merge</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Qu</surname><given-names>Fuheng</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>Shi</surname><given-names>Yuhang</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-3" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Yang</surname><given-names>Yong</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><email>yy@cust.edu.cn</email></contrib>
<contrib id="author-4" contrib-type="author">
<name name-style="western"><surname>Hu</surname><given-names>Yating</given-names></name><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-5" contrib-type="author">
<name name-style="western"><surname>Liu</surname><given-names>Yuyao</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<aff id="aff-1"><label>1</label><institution>College of Computer Science and Technology, Changchun University of Science and Technology</institution>, <addr-line>Changchun, 130022</addr-line>, <country>China</country></aff>
<aff id="aff-2"><label>2</label><institution>College of Computer Science and Technology, Jilin Agricultural University</institution>, <addr-line>Changchun, 130118</addr-line>, <country>China</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Yong Yang. Email: <email>yy@cust.edu.cn</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2024</year></pub-date>
<pub-date date-type="pub" publication-format="electronic">
<day>19</day>
<month>12</month>
<year>2024</year>
</pub-date>
<volume>81</volume>
<issue>3</issue>
<fpage>4875</fpage>
<lpage>4893</lpage>
<history>
<date date-type="received">
<day>23</day>
<month>10</month>
<year>2024</year>
</date>
<date date-type="accepted">
<day>20</day>
<month>11</month>
<year>2024</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2024 The Authors.</copyright-statement>
<copyright-year>2024</copyright-year>
<copyright-holder>Published by Tech Science Press.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_60090.pdf"></self-uri>
<abstract>
<p>The Coordinate Descent Method for K-means (CDKM) is an improved K-means algorithm: it identifies better locally optimal solutions than the original K-means algorithm, i.e., solutions with smaller objective function values. However, CDKM remains sensitive to initialization, so its objective function values are often still far from optimal. Since selecting suitable initial centers is not always possible, this paper proposes a novel algorithm that modifies the iterative process of CDKM. The proposed algorithm first obtains the partition matrix by CDKM and then optimizes the partition matrix with a split-merge criterion designed to reduce the objective function value further. The split-merge criterion minimizes the objective function value as much as possible while keeping the number of clusters unchanged. The algorithm avoids the distance calculations of the traditional K-means algorithm because all operations are performed directly on the partition matrix. Experiments on ten UCI datasets show that the solution accuracy of the proposed algorithm, measured by the <italic>E</italic> value, improves on CDKM by 11.29% while retaining CDKM&#x2019;s efficiency advantage on high dimensional datasets. The proposed algorithm finds better locally optimal solutions than the other tested improved K-means algorithms in less run time.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Cluster analysis</kwd>
<kwd>K-means</kwd>
<kwd>coordinate descent K-means</kwd>
<kwd>split-merge</kwd>
</kwd-group>
<funding-group>
<award-group id="awg1">
<funding-source>National Defense Basic Research Program</funding-source>
<award-id>JCKY2019411B001</award-id>
</award-group>
<award-group id="awg2">
<funding-source>National Key Research and Development Program</funding-source>
<award-id>2022YFC3601305</award-id>
</award-group>
<award-group id="awg3">
<funding-source>Key R&#x0026;D Projects of Jilin Provincial Science and Technology Department</funding-source>
<award-id>20210203218SF</award-id>
</award-group>
</funding-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>Clustering is an unsupervised machine-learning method [<xref ref-type="bibr" rid="ref-1">1</xref>] that does not depend on data labels. It has a wide range of applications in many fields [<xref ref-type="bibr" rid="ref-2">2</xref>], such as image segmentation [<xref ref-type="bibr" rid="ref-3">3</xref>], target recognition [<xref ref-type="bibr" rid="ref-4">4</xref>] and feature extraction [<xref ref-type="bibr" rid="ref-5">5</xref>]. Among these applications, the K-means algorithm is one of the most commonly used clustering algorithms due to its simplicity and interpretability [<xref ref-type="bibr" rid="ref-6">6</xref>,<xref ref-type="bibr" rid="ref-7">7</xref>].</p>
<p>The K-means clustering problem is non-deterministic polynomial-time hard (NP-hard) [<xref ref-type="bibr" rid="ref-8">8</xref>]. The traditional K-means clustering algorithm is a greedy method for optimizing the K-means clustering model, but its sensitivity to the choice of the initial centroids makes it prone to falling into poor locally optimal solutions.</p>
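<p>The greedy iteration referred to here is the classic alternating procedure: assign each point to its nearest centroid, then recompute each centroid as its cluster mean, and repeat until nothing changes. The following minimal NumPy sketch is illustrative only (the function name is ours, and samples are stored as rows for brevity):</p>

```python
import numpy as np

def lloyd_kmeans(X, centers, iters=100):
    """Plain greedy K-means iteration: assign each sample to its nearest
    center, then recompute centers as cluster means. Illustrative sketch;
    samples are rows of X."""
    labels = None
    for _ in range(iters):
        # squared distance from every sample to every center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)           # assignment step
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centers[j] for j in range(len(centers))])
        if np.allclose(new, centers):        # converged to a local optimum
            break
        centers = new
    return centers, labels
```

The sensitivity to initialization is visible here: the loop only ever descends from the given `centers`, so a poor starting point yields a poor local optimum.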
<p>To address this problem, researchers have proposed many improved methods [<xref ref-type="bibr" rid="ref-9">9</xref>&#x2013;<xref ref-type="bibr" rid="ref-10">10</xref>] for enhancing <bold><italic>the solution accuracy</italic></bold><xref ref-type="fn" rid="fn1"><sup>1</sup></xref>
<fn id="fn1"><label>1</label>
<p>In this paper, <bold><italic>when we say that the solution accuracy of solution</italic></bold> <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:msup><mml:mi>C</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> <bold><italic>is higher than that of solution</italic></bold> <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:msup><mml:mi>C</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula><bold><italic>, we mean that the objective function value of solution</italic></bold> <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:msup><mml:mi>C</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> <bold><italic>is smaller than that of solution</italic></bold> <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:msup><mml:mi>C</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>.</p></fn> of the K-means clustering algorithm. Gul et al. proposed the R-K-means algorithm, which uses a two-step process to select the initial centroids and improves the solution accuracy of K-means on large-scale, high dimensional datasets [<xref ref-type="bibr" rid="ref-11">11</xref>]. Biswas et al. used computational geometry for cluster center initialization so that the cluster centers are uniformly distributed [<xref ref-type="bibr" rid="ref-12">12</xref>]. Layeb et al. proposed two deterministic initialization methods for K-means clustering based on modified crowding distances, which select more uniform initial centers [<xref ref-type="bibr" rid="ref-13">13</xref>]. Arthur et al. proposed the K-means&#x002B;&#x002B; algorithm, which uses a random seeding technique to make the initial centroids as dispersed as possible [<xref ref-type="bibr" rid="ref-14">14</xref>]. Lattanzi et al. improved K-means&#x002B;&#x002B; by adding a local search strategy that optimizes the centers of badly placed clusters [<xref ref-type="bibr" rid="ref-15">15</xref>]. &#x015E;enol proposed a method that finds optimal initial centers using kernel density estimation, so that the initial centroids lie in regions with a high density of data points [<xref ref-type="bibr" rid="ref-16">16</xref>]. Reddy et al. selected the initial cluster centers by constructing a Voronoi diagram over the data points, which reduces the K-means algorithm&#x2019;s over-dependence on the initial centroids and shortens the convergence time of the subsequent iterations [<xref ref-type="bibr" rid="ref-17">17</xref>].</p>
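<p>Of the initialization strategies surveyed above, the K-means&#x002B;&#x002B; seeding technique is the most widely used; it picks each new center randomly with probability proportional to the squared distance from the nearest center chosen so far. A minimal NumPy sketch (our own function name, not the reference implementation; samples are stored as rows):</p>

```python
import numpy as np

def kmeanspp_seed(X, k, rng=None):
    """D^2 seeding in the spirit of K-means++: each new center is drawn
    with probability proportional to its squared distance to the nearest
    already-chosen center."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]          # first center: uniform at random
    for _ in range(k - 1):
        # squared distance of every sample to its nearest chosen center
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()               # far-away points are more likely
        centers.append(X[rng.choice(n, p=probs)])
    return np.array(centers)
```

Already-chosen points have zero distance and hence zero probability, which is what keeps the seeds dispersed.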
<p>In addition to the above methods, a large number of initialization improvements have been proposed in the literature [<xref ref-type="bibr" rid="ref-18">18</xref>]. The final result of K-means clustering is determined by both initialization and iteration. Like initialization, the iterative process can also improve solution accuracy, yet it has received far less research attention. Recently, Nie et al. improved the iterative process of the K-means algorithm: they rewrote the objective function of K-means and introduced the Coordinate Descent Method [<xref ref-type="bibr" rid="ref-19">19</xref>,<xref ref-type="bibr" rid="ref-20">20</xref>] to optimize the K-means clustering model. The resulting algorithm is called the Coordinate Descent Method for K-means (CDKM). Experimental results show that, under the same initialization conditions, CDKM achieves a significantly better solution accuracy than the K-means algorithm and runs more efficiently on high dimensional datasets [<xref ref-type="bibr" rid="ref-21">21</xref>].</p>
<p>Although CDKM finds smaller locally optimal solutions than K-means, it is still sensitive to initialization, and there is considerable room for improving its solution accuracy. Since selecting appropriate initial centers is not always feasible [<xref ref-type="bibr" rid="ref-21">21</xref>], we attempt to improve its iterative process instead. Specifically, we introduce the split-merge criterion into CDKM. The split-merge criterion was proposed by Kaukoranta et al. in the iterative split-and-merge algorithm in 1998 [<xref ref-type="bibr" rid="ref-22">22</xref>], where it is used to optimize codebook generation; the criterion can also significantly enhance the solution accuracy of K-means. Many improved algorithms build on it, such as the iterative split-and-merge algorithm [<xref ref-type="bibr" rid="ref-22">22</xref>], the split algorithm [<xref ref-type="bibr" rid="ref-23">23</xref>], the random swap algorithm [<xref ref-type="bibr" rid="ref-24">24</xref>] and the I-K-means-&#x002B; algorithm [<xref ref-type="bibr" rid="ref-25">25</xref>]. However, the split-merge criterion primarily operates on the original K-means objective function and does not directly apply to the CDKM objective function. One of the key strengths of CDKM is its efficiency on high dimensional data, which arises from its objective function eliminating the need for distance calculations during the search process. This feature is not available in the traditional K-means algorithm or its improved variants, such as the algorithms [<xref ref-type="bibr" rid="ref-11">11</xref>&#x2013;<xref ref-type="bibr" rid="ref-14">14</xref>] and the split-merge algorithms [<xref ref-type="bibr" rid="ref-22">22</xref>&#x2013;<xref ref-type="bibr" rid="ref-25">25</xref>] analyzed above.</p>
<p>The challenge is thus to apply the split-merge criterion, which is defined on the original K-means objective function, to the CDKM algorithm, which is based on the CDKM objective function. We propose the Coordinate Descent K-means algorithm based on Split-Merge (CDKMSM), which improves the iterative process of CDKM. First, an existing split-merge-based algorithm, the I-K-means-&#x002B; algorithm, is modified with the aim of obtaining a high-quality solution. Then, the proposed split-merge criterion is converted into a partition matrix operation so that it applies to the CDKM clustering model. To retain the efficiency advantage of the original CDKM on high dimensional data, the proposed algorithm avoids distance computation entirely; its whole computation is accomplished using only the partition matrix.</p>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Work</title>
<sec id="s2_1">
<label>2.1</label>
<title>CDKM Algorithm</title>
<p>Suppose <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>}</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is a data set of <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:mi>N</mml:mi></mml:math></inline-formula> elements. The goal of the K-means clustering model is to divide the data set into disjoint clusters <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:mi>C</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>. The K-means algorithm uses the within-cluster sum of squared errors (SSE) as its measure of clustering effectiveness.
For a given solution <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:mi>C</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, its SSE value is:
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:mrow><mml:mtext>SSE</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>C</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:munderover><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:munderover><mml:mrow><mml:mtext>SSE</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mspace width="negativethinmathspace" /><mml:mo>,</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:mrow><mml:mtext>SSE</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, the SSE of the <italic>i</italic>-th cluster, is:
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:mrow><mml:mtext>SSE</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:munder><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munder><mml:msup><mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo symmetric="true">&#x2016;</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mspace width="negativethinmathspace" /><mml:mo>,</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the center of the cluster <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. The center <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is computed as:
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:munder><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munder><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>.</mml:mo></mml:math></disp-formula></p>
<p>The clustering model of K-means can therefore be described as the following minimization problem:
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:munder><mml:mo movablelimits="true" form="prefix">min</mml:mo><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mtext>SSE</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:munder><mml:mo movablelimits="true" form="prefix">min</mml:mo><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:munder><mml:munderover><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:munderover><mml:munder><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munder><mml:msubsup><mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo symmetric="true">&#x2016;</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>.</mml:mo></mml:math></disp-formula></p>
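<p>As a concrete check of Eqs. (1)&#x2013;(4), the objective can be evaluated directly from a labeling (a minimal NumPy sketch under our own naming; samples are stored as rows rather than columns for brevity):</p>

```python
import numpy as np

def sse(X, labels, k):
    """Within-cluster sum of squared errors, Eqs. (1)-(3):
    each cluster contributes the squared distances to its own mean."""
    total = 0.0
    for i in range(k):
        Ci = X[labels == i]
        if len(Ci) == 0:
            continue
        mi = Ci.mean(axis=0)                 # cluster center, Eq. (3)
        total += ((Ci - mi) ** 2).sum()      # SSE(C_i), Eq. (2)
    return total
```

Minimizing this quantity over all assignments is exactly the problem in Eq. (4).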
<p>CDKM rewrites the K-means clustering problem as <xref ref-type="disp-formula" rid="eqn-5">Eq. (5)</xref> by using the partition matrix.
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:munder><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mrow><mml:mi>e</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo>}</mml:mo></mml:mrow></mml:mrow></mml:munder><mml:mi>o</mml:mi><mml:mi>b</mml:mi><mml:mi>j</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">F</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>e</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:munder><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mrow><mml:mi>e</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo>}</mml:mo></mml:mrow></mml:mrow></mml:munder><mml:munderover><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>l</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:munderover><mml:mfrac><mml:mrow><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>e</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:msubsup><mml:mi 
mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>e</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>e</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:msubsup><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>e</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac></mml:math></disp-formula></p>
<p>In <xref ref-type="disp-formula" rid="eqn-5">Eq. (5)</xref>, <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:msup><mml:mi mathvariant="bold-italic">F</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>e</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mi>e</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> is a membership matrix, <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:mi>e</mml:mi></mml:math></inline-formula> indexes the candidate position of the updated element when <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:msup><mml:mi mathvariant="bold-italic">F</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>e</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula> is updated row by row, <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:msubsup><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>e</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup></mml:math></inline-formula> is the <italic>l</italic>-th column of <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:msup><mml:mi mathvariant="bold-italic">F</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>e</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula>, and <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:mi>o</mml:mi><mml:mi>b</mml:mi><mml:mi>j</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">F</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>e</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> is the objective function value of the CDKM algorithm. Note that, since both the form of the solution and its objective function value in the rewritten problem differ from those of the original K-means, different mathematical symbols are used here to distinguish them.</p>
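<p>The rewritten objective in Eq. (5) depends only on the partition matrix and the Gram matrix of the data; a minimal sketch of evaluating it (our own helper name, not CDKM&#x2019;s published code; here <bold><italic>X</italic></bold> is d &#x00D7; n with samples as columns, as in the text):</p>

```python
import numpy as np

def cdkm_obj(X, F):
    """obj(F) = sum_l (f_l^T X^T X f_l) / (f_l^T f_l), as in Eq. (5).
    X is d x n (samples as columns); F is an n x k 0/1 partition matrix."""
    G = X.T @ X                              # Gram matrix, computed once
    val = 0.0
    for l in range(F.shape[1]):
        f = F[:, l]
        nl = f @ f                           # cluster size |C_l|
        if nl > 0:
            val += (f @ G @ f) / nl
    return val
```

For a hard partition, SSE(C) = tr(X&#x1D40;X) &#x2212; obj(F), which is why maximizing Eq. (5) is equivalent to minimizing the SSE of Eq. (4); this identity can be verified numerically against the definitions above.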
<p>To reduce the amount of computation, CDKM defines <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:mi>&#x03C8;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>e</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> as follows:
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:mi>&#x03C8;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>e</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mrow><mml:msubsup><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>e</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msubsup><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>e</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>&#x2212;</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mrow><mml:msubsup><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mn>2</mml:mn><mml:msubsup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msubsup><mml:mi 
mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msubsup><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac></mml:mstyle><mml:mi>e</mml:mi><mml:mo>=</mml:mo><mml:mi>p</mml:mi></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mrow><mml:msubsup><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mn>2</mml:mn><mml:msubsup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msubsup><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:msub><mml:mi 
mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>&#x2212;</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mrow><mml:msubsup><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>e</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msubsup><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>e</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:mstyle><mml:mi>e</mml:mi><mml:mo>&#x2260;</mml:mo><mml:mi>p</mml:mi></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:math></disp-formula>where, <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:mi>i</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>N</mml:mi><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula> represents the row number of the currently processed row when updating the partition matrix <inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:msup><mml:mi mathvariant="bold-italic">F</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>e</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula>, <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:mi>p</mml:mi></mml:math></inline-formula> represents the column number of the element with the 
value of 1 in the row <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:mi>i</mml:mi></mml:math></inline-formula> when the row is updated. The calculation of <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:mi>&#x03C8;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>e</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> is divided into two cases: <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:mi>e</mml:mi><mml:mo>=</mml:mo><mml:mi>p</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:mi>e</mml:mi><mml:mo>&#x2260;</mml:mo><mml:mi>p</mml:mi></mml:math></inline-formula>.</p>
<p>According to the coordinate descent method and the property of the partition matrix <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:msup><mml:mi mathvariant="bold-italic">F</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>e</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula>, the <italic>i</italic>-th row of <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:msup><mml:mi mathvariant="bold-italic">F</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>e</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula> is updated as follows:
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>q</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left center" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mn>1</mml:mn></mml:mtd><mml:mtd><mml:msub><mml:mrow><mml:mtext>arg max</mml:mtext></mml:mrow><mml:mrow><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mi>&#x03C8;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>e</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mi>o</mml:mi><mml:mi>t</mml:mi><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>r</mml:mi><mml:mi>w</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:math></disp-formula></p>
<p>After the <italic>i</italic>-th row of <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:msup><mml:mi mathvariant="bold-italic">F</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>e</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula> is updated, the variables affected by this change are updated as follows:
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:mtable columnalign="left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mi mathvariant="bold-italic">X</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">X</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>;</mml:mo><mml:mi mathvariant="bold-italic">X</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>q</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">X</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>q</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>;</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msubsup><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>;</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>q</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msubsup><mml:mi 
mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>q</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mn>1.</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:mtable columnalign="left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:msubsup><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mn>2</mml:mn><mml:msubsup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>;</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msubsup><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:msub><mml:mi 
mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>q</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>q</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mn>2</mml:mn><mml:msubsup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>q</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
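To make the row update of Eqs. (7)&#x2013;(9) concrete, the following sketch runs one coordinate-descent sweep over the partition matrix. For each cluster e it caches the size f_e^T f_e, the sum vector X f_e, and the quadratic term f_e^T X^T X f_e (which equals ||X f_e||^2 because f_e is a 0/1 indicator). The function and variable names (cdkm_pass, psi, q2) are ours, not from the paper, and the guard that keeps singleton clusters from being emptied is our assumption.

```python
# Sketch of one CDKM coordinate-descent pass, assuming the cached quantities
#   n[e]  = f_e^T f_e        (cluster size)
#   s[e]  = X f_e            (vector sum of the cluster's points)
#   q2[e] = f_e^T X^T X f_e  (= ||X f_e||^2 for a 0/1 indicator column)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cdkm_pass(X, labels, k):
    """One sweep over rows i = 1..N; reassigns each point by argmax psi(e)."""
    n = [0] * k
    s = [[0.0] * len(X[0]) for _ in range(k)]
    for xi, p in zip(X, labels):
        n[p] += 1
        for d, v in enumerate(xi):
            s[p][d] += v
    q2 = [dot(se, se) for se in s]

    for i, xi in enumerate(X):
        p, xx = labels[i], dot(xi, xi)
        best_e, best_psi = p, float("-inf")
        for e in range(k):
            if e == p:
                if n[e] <= 1:          # assumption: never empty a cluster
                    psi = float("inf")
                else:                  # leave-one-out value, case e = p
                    psi = q2[e] / n[e] - (q2[e] - 2 * dot(xi, s[e]) + xx) / (n[e] - 1)
            else:                      # add-one value, case e != p
                psi = (q2[e] + 2 * dot(xi, s[e]) + xx) / (n[e] + 1) - q2[e] / n[e]
            if psi > best_psi:
                best_psi, best_e = psi, e
        q = best_e
        if q != p:                     # incremental updates of Eqs. (8)-(9)
            q2[p] += -2 * dot(xi, s[p]) + xx
            q2[q] += 2 * dot(xi, s[q]) + xx
            for d, v in enumerate(xi):
                s[p][d] -= v
                s[q][d] += v
            n[p] -= 1
            n[q] += 1
            labels[i] = q
    return labels
```

On two well-separated groups, a sweep or two is enough for the labels to settle into the natural partition; each reassignment costs only a few dot products rather than a full recomputation of the objective.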
<p>Note: CDKM reformulates the objective function and replaces the original K-means iteration with a coordinate descent procedure. The formulas above are included because the objective function and iterative process of CDKM form the basis of the newly proposed algorithm (Algorithm 1).</p>
<fig id="fig-8">
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_60090-fig-8.tif"/>
</fig>
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>The Problems of CDKM Algorithm</title>
<p>When the initial center positions are not ideal (as shown in the left part of <xref ref-type="fig" rid="fig-1">Fig. 1</xref>), some regions may receive too few initial centers and others too many. As shown in the right part of <xref ref-type="fig" rid="fig-1">Fig. 1</xref>, CDKM clustering may then become trapped in a local optimum: clusters in regions with too few centers end up too sparse, while clusters in regions with too many centers end up too dense. The resulting cluster divisions are not accurate enough, which limits the solution accuracy.</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>The left part is the initial center distribution. The right part is the division of the datasets after CDKM clustering</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_60090-fig-1.tif"/>
</fig>
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>Design of Coordinate Descent K-Means Algorithm Based on Split-Merge</title>
<p>To address the problem described in <xref ref-type="sec" rid="s2_2">Section 2.2</xref> and improve the solution accuracy of the CDKM algorithm, we introduce the split-merge criterion and adapt it to the CDKM clustering model. The split-merge criterion was already used in the iterative split-and-merge algorithm of 1998 [<xref ref-type="bibr" rid="ref-22">22</xref>], which begins with an initial codebook and enhances it through merge and split operations: merging small neighboring clusters frees up additional code vectors, which can then be reallocated by splitting larger clusters. We review the split-merge criterion for the traditional K-means algorithm in <xref ref-type="sec" rid="s3_1">Section 3.1</xref>. In <xref ref-type="sec" rid="s3_2">Section 3.2</xref>, we analyze the shortcomings of this criterion and introduce our modified criterion. Since both the traditional criterion and the modified criterion of <xref ref-type="sec" rid="s3_2">Section 3.2</xref> optimize the objective function value <inline-formula id="ieqn-44"><mml:math id="mml-ieqn-44"><mml:mrow><mml:mtext>SSE</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>C</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> of the original K-means model, they cannot be applied directly to the objective function value <inline-formula id="ieqn-45"><mml:math id="mml-ieqn-45"><mml:mrow><mml:mtext>obj</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">F</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>e</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> of the CDKM clustering model. Moreover, the original split-merge process involves a large number of distance calculations. 
In <xref ref-type="sec" rid="s3_3">Section 3.3</xref>, we therefore transform the formulas and operations of the improved split-merge criterion of <xref ref-type="sec" rid="s3_2">Section 3.2</xref> so that they can be applied to the optimization of the CDKM model. The transformed method relies solely on the partition matrix and avoids distance calculations, which improves the time efficiency of the algorithm on high-dimensional datasets.</p>
<sec id="s3_1">
<label>3.1</label>
<title>Split-Merge Criterion for the Traditional K-Means Algorithm</title>
<p>The idea of the split-merge criterion has been used to improve K-means, resulting in many proposed algorithms [<xref ref-type="bibr" rid="ref-22">22</xref>&#x2212;<xref ref-type="bibr" rid="ref-25">25</xref>]. These algorithms share a common approach: they use a split-merge criterion to enhance the original K-means algorithm. In this paper, we build upon one of the more recent split-merge algorithms, the I-K-means-&#x002B; algorithm proposed in 2018. It improves the solution accuracy of K-means by merging two clusters, splitting another cluster, and regrouping in each iteration, gradually approaching the global optimal solution. The cluster <inline-formula id="ieqn-46"><mml:math id="mml-ieqn-46"><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> with the smallest cost is selected for deletion when merging, and the cluster <inline-formula id="ieqn-47"><mml:math id="mml-ieqn-47"><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> with the largest gain is selected for division when splitting. The cost and gain are calculated as follows:
<disp-formula id="eqn-10"><label>(10)</label><mml:math id="mml-eqn-10" display="block"><mml:mrow><mml:mtext>cost</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mtext>SSE</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:munder><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi mathvariant="normal">&#x2200;</mml:mi><mml:mi>p</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munder><mml:mrow><mml:mtext>dis</mml:mtext></mml:mrow><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mi>p</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>Z</mml:mi><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>,</mml:mo></mml:math></disp-formula>
<disp-formula id="eqn-11"><label>(11)</label><mml:math id="mml-eqn-11" display="block"><mml:mrow><mml:mtext>gain</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2248;</mml:mo><mml:mfrac><mml:mn>3</mml:mn><mml:mn>4</mml:mn></mml:mfrac><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:mtext>SSE</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mspace width="negativethinmathspace" /><mml:mo>,</mml:mo></mml:math></disp-formula>where, <inline-formula id="ieqn-48"><mml:math id="mml-ieqn-48"><mml:msub><mml:mi>Z</mml:mi><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> in <xref ref-type="disp-formula" rid="eqn-10">Eq. (10)</xref> is the sub-proximal (second-nearest) centroid of the data point <inline-formula id="ieqn-49"><mml:math id="mml-ieqn-49"><mml:mi>p</mml:mi></mml:math></inline-formula>.</p>
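The two selection quantities of Eqs. (10) and (11) can be sketched directly. The helper names (merge_cost, approx_gain, sq_dist, sse) are ours, and we take the sub-proximal centroid Z_p to be the nearest centroid other than the cluster's own, which is one common reading of the definition.

```python
# Sketch of the I-K-means-+ selection quantities, following Eqs. (10)-(11)
# as given: cost(C_v) = SSE(C_v) - sum_p dis(p, Z_p)^2, gain ~ (3/4) SSE.

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def sse(points, center):
    return sum(sq_dist(p, center) for p in points)

def merge_cost(points_v, centers, v):
    """cost(C_v) of Eq. (10); Z_p taken as the nearest centroid other than v."""
    total = 0.0
    for p in points_v:
        total += min(sq_dist(p, c) for j, c in enumerate(centers) if j != v)
    return sse(points_v, centers[v]) - total

def approx_gain(points_j, center_j):
    """gain(C_j) ~ (3/4) * SSE(C_j), the approximation of Eq. (11)."""
    return 0.75 * sse(points_j, center_j)
```

For a cluster whose points sit close to a rival centroid, the reassignment term is small and the cost of deleting the cluster is mild; for an isolated cluster the term dominates and deletion is expensive.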
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Split-Merge Criterion in This Paper</title>
<p>Since the gain calculation of the I-K-means-&#x002B; algorithm (<xref ref-type="disp-formula" rid="eqn-11">Eq. (11)</xref>) is only approximate, it may select the wrong cluster when splitting and merging. In this paper, the split-merge criterion of the I-K-means-&#x002B; algorithm is therefore modified to obtain solutions with higher accuracy: we replace the approximate gain calculation with an exact one, and we also improve the cost calculation of the I-K-means-&#x002B; algorithm.</p>
<p>The merging strategy in this paper aims to find two clusters that are close to each other and whose datasets are too densely populated, and to merge them. The cost to be minimized is calculated as follows [<xref ref-type="bibr" rid="ref-22">22</xref>]:
<disp-formula id="eqn-12"><label>(12)</label><mml:math id="mml-eqn-12" display="block"><mml:msub><mml:mrow><mml:mtext>cost</mml:mtext></mml:mrow><mml:mrow><mml:mi>A</mml:mi><mml:mo>,</mml:mo><mml:mi>B</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mtext>SSE</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>A</mml:mi><mml:mo>,</mml:mo><mml:mi>B</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>SSE</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mtext>SSE</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>B</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mspace width="negativethinmathspace" /><mml:mo>,</mml:mo></mml:math></disp-formula>where, <inline-formula id="ieqn-50"><mml:math id="mml-ieqn-50"><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>A</mml:mi><mml:mo>,</mml:mo><mml:mi>B</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> denotes the new cluster after merging two clusters of <inline-formula id="ieqn-51"><mml:math id="mml-ieqn-51"><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-52"><mml:math id="mml-ieqn-52"><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>B</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. The merging operation is shown in <xref ref-type="fig" rid="fig-2">Fig. 
2</xref>. Assuming there are <inline-formula id="ieqn-53"><mml:math id="mml-ieqn-53"><mml:mi>k</mml:mi></mml:math></inline-formula> clusters, we calculate the SSE values of all clusters, find the <inline-formula id="ieqn-54"><mml:math id="mml-ieqn-54"><mml:mrow><mml:mo>&#x230A;</mml:mo><mml:mfrac><mml:mi>k</mml:mi><mml:mn>2</mml:mn></mml:mfrac><mml:mo>&#x230B;</mml:mo></mml:mrow></mml:math></inline-formula> clusters with the smallest SSE values, and calculate the cost of merging each of these <inline-formula id="ieqn-55"><mml:math id="mml-ieqn-55"><mml:mrow><mml:mo>&#x230A;</mml:mo><mml:mfrac><mml:mi>k</mml:mi><mml:mn>2</mml:mn></mml:mfrac><mml:mo>&#x230B;</mml:mo></mml:mrow></mml:math></inline-formula> clusters with every other cluster. The two clusters <italic>A</italic> and <italic>B</italic> with the smallest cost value are then merged, i.e., the data points in <italic>A</italic> are assigned to <italic>B</italic>, and the center <italic>a</italic> of the original cluster <italic>A</italic> is deleted.</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>The left part is the division results corresponding to the original solution <inline-formula id="ieqn-63"><mml:math id="mml-ieqn-63"><mml:msup><mml:mi>C</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mi>l</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> with SSE values. The right part is the result after merging and splitting</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_60090-fig-2.tif"/>
</fig>
<p>The splitting criterion presented in this paper aims to identify clusters characterized by sparse datasets; splitting such a cluster into two distinct clusters yields a significant reduction in the SSE value. <xref ref-type="disp-formula" rid="eqn-13">Eq. (13)</xref> is employed in this study to calculate the gain resulting from splitting exactly [<xref ref-type="bibr" rid="ref-22">22</xref>].
<disp-formula id="eqn-13"><label>(13)</label><mml:math id="mml-eqn-13" display="block"><mml:mrow><mml:mtext>gain</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mtext>SSE</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>SSE</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>C</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mtext>SSE</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>C</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where, <inline-formula id="ieqn-56"><mml:math id="mml-ieqn-56"><mml:mrow><mml:mo>{</mml:mo><mml:msubsup><mml:mi>C</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>C</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula> represents the two clusters produced by each cluster <inline-formula id="ieqn-57"><mml:math id="mml-ieqn-57"><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> division. The splitting operation is shown in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>. In the splitting operation, a random point from the selected cluster is chosen as a new center. Use this new center and the original cluster center to perform clustering on the cluster using the K-means method. Traverse all clusters. Calculate the gain of all clusters after splitting. 
Then, find the cluster with the largest gain <inline-formula id="ieqn-58"><mml:math id="mml-ieqn-58"><mml:mi>H</mml:mi></mml:math></inline-formula>. Finally, a random data point in <inline-formula id="ieqn-59"><mml:math id="mml-ieqn-59"><mml:mi>H</mml:mi></mml:math></inline-formula> is used as the second center so that <inline-formula id="ieqn-60"><mml:math id="mml-ieqn-60"><mml:mi>H</mml:mi></mml:math></inline-formula> splits into two clusters <inline-formula id="ieqn-61"><mml:math id="mml-ieqn-61"><mml:msup><mml:mi>H</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-62"><mml:math id="mml-ieqn-62"><mml:msup><mml:mi>H</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>.</p>
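The splitting step above can be sketched as a small 2-means refinement scored by Eq. (13). Here the "random" seed for the second center is replaced by the point farthest from the centroid, a deterministic stand-in for illustration; all names are ours.

```python
# Sketch of scoring one cluster's split: seed a second center with a point
# of the cluster, refine with plain 2-means, and compute
#   gain = SSE(C_s) - (SSE(C_s^1) + SSE(C_s^2))     (Eq. (13)).

def centroid(pts):
    return [sum(p[i] for p in pts) / len(pts) for i in range(len(pts[0]))]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def sse(pts):
    c = centroid(pts)
    return sum(sq_dist(p, c) for p in pts)

def split_gain(pts, iters=10):
    """Return (gain, part1, part2) for splitting one cluster into two."""
    c1 = centroid(pts)
    c2 = max(pts, key=lambda p: sq_dist(p, c1))   # deterministic seed (assumption)
    g1, g2 = pts, []
    for _ in range(iters):                        # plain 2-means refinement
        g1 = [p for p in pts if sq_dist(p, c1) <= sq_dist(p, c2)]
        g2 = [p for p in pts if sq_dist(p, c1) > sq_dist(p, c2)]
        if not g1 or not g2:
            return 0.0, pts, []                   # degenerate split: no gain
        c1, c2 = centroid(g1), centroid(g2)
    gain = sse(pts) - (sse(g1) + sse(g2))
    return gain, g1, g2
```

Evaluating split_gain for every cluster and splitting the maximizer H into H&#x00B9; and H&#x00B2; reproduces the traversal described above.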
<p>The advantage of this split-merge criterion is that it avoids overly dense or overly sparse cluster divisions and thus improves the solution accuracy.</p>
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>Split-Merge Criterion for CDKM Model Optimization</title>
<p><xref ref-type="sec" rid="s3_2">Section 3.2</xref> introduces the improved split-merge criterion. However, this criterion is based on the original K-means model and is not directly applicable to the CDKM model, since the form of the solution and the objective function value of CDKM differ from those of the original K-means. Moreover, the criterion must adjust the existing centers and redistribute the data points during the split-merge operation, which involves many distance calculations; the more the centers change, the more data points must be recalculated and the higher the time complexity, a problem that is especially prominent on high-dimensional datasets. To address these problems, this paper uses the partition matrix to design a new split-merge criterion that is applicable to the CDKM model while reducing the time complexity.
<list list-type="simple">
<list-item><label>1)</label><p>Merging operation</p></list-item>
</list></p>
<p>We perform the merging operation by integrating the two columns of the partition matrix. The split-merge criterion in the previous section used the objective function values of the K-means algorithm for calculating the cost gain, which is modified here to use the CDKM form of the objective function values. The values of the objective function before and after merging are calculated using <xref ref-type="disp-formula" rid="eqn-14">Eqs. (14)</xref> and <xref ref-type="disp-formula" rid="eqn-15">(15)</xref>, respectively, and the cost after merging is calculated as clusters using <xref ref-type="disp-formula" rid="eqn-16">Eq. (16)</xref>. Where, <inline-formula id="ieqn-64"><mml:math id="mml-ieqn-64"><mml:mrow><mml:mtext>obj</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> represents the sum of the CDKM objective function values of the columns <inline-formula id="ieqn-65"><mml:math id="mml-ieqn-65"><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-66"><mml:math id="mml-ieqn-66"><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> corresponding to the clusters before merging, <inline-formula id="ieqn-67"><mml:math id="mml-ieqn-67"><mml:mrow><mml:mtext>obj</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mo>,</mml:mo><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> represents the sum of the CDKM objective function values of the new column <inline-formula 
id="ieqn-68"><mml:math id="mml-ieqn-68"><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mo>,</mml:mo><mml:mi>w</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> after the merging of these two clusters, and <inline-formula id="ieqn-69"><mml:math id="mml-ieqn-69"><mml:mrow><mml:mtext>cost</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mo>,</mml:mo><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> represents the cost after the merging of these two clusters. Using <xref ref-type="disp-formula" rid="eqn-5">Eq. (5)</xref>, calculate and find the <inline-formula id="ieqn-70"><mml:math id="mml-ieqn-70"><mml:mrow><mml:mo>&#x230A;</mml:mo><mml:mfrac><mml:mi>k</mml:mi><mml:mn>2</mml:mn></mml:mfrac><mml:mo>&#x230B;</mml:mo></mml:mrow></mml:math></inline-formula> clusters with the smallest <inline-formula id="ieqn-71"><mml:math id="mml-ieqn-71"><mml:mrow><mml:mtext>obj</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">F</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>e</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> value, and calculate the cost after merging this <inline-formula id="ieqn-72"><mml:math id="mml-ieqn-72"><mml:mrow><mml:mo>&#x230A;</mml:mo><mml:mfrac><mml:mi>k</mml:mi><mml:mn>2</mml:mn></mml:mfrac><mml:mo>&#x230B;</mml:mo></mml:mrow></mml:math></inline-formula> cluster with the other clusters separately. 
The two clusters with the smallest cost after merging are merged, assuming that these two clusters correspond to columns <inline-formula id="ieqn-73"><mml:math id="mml-ieqn-73"><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-74"><mml:math id="mml-ieqn-74"><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> in the partition matrix <inline-formula id="ieqn-75"><mml:math id="mml-ieqn-75"><mml:mi mathvariant="bold-italic">F</mml:mi></mml:math></inline-formula>. This is shown in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>. Find the labels of the rows in <inline-formula id="ieqn-76"><mml:math id="mml-ieqn-76"><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> with value 1, and set the values of these rows in <inline-formula id="ieqn-77"><mml:math id="mml-ieqn-77"><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> to 1. Delete <inline-formula id="ieqn-78"><mml:math id="mml-ieqn-78"><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> after the operation, and complete the merging of the two clusters by the above steps.</p>
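The column-based merge test can be sketched with only the cached per-cluster sums. Because each column f of the partition matrix is a 0/1 indicator, f^T X^T X f equals ||X f||^2 and f^T f is the cluster size, so the cost of Eqs. (14)&#x2013;(16) needs no per-point distances. The function names are ours, and the cost follows Eq. (16) exactly as stated.

```python
# Sketch of the partition-matrix merge of Eqs. (14)-(16):
# obj of two columns kept apart vs. the column obtained by OR-ing them.

def merge_cost_cols(s_o, n_o, s_w, n_w):
    """cost(f_{o,w}) = obj(f_{o,w}) - obj(f_o, f_w), Eqs. (14)-(16).

    s_o = X f_o (sum of the points in cluster o), n_o = f_o^T f_o; likewise w."""
    sq = lambda v: sum(x * x for x in v)
    obj_sep = sq(s_o) / n_o + sq(s_w) / n_w        # Eq. (14)
    s_m = [a + b for a, b in zip(s_o, s_w)]        # X f_{o,w}
    obj_merged = sq(s_m) / (n_o + n_w)             # Eq. (15)
    return obj_merged - obj_sep                    # Eq. (16)

def merge_columns(F, o, w):
    """Set f_w := f_o OR f_w, then delete column o (each row keeps one 1)."""
    return [[max(row[o], row[w]) if j == w else row[j]
             for j in range(len(row)) if j != o] for row in F]
```

Merging two distant clusters produces a large drop in the CDKM objective, while merging two close clusters changes it only slightly, which is what the pair search exploits.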
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>Use membership matrix to merge two clusters</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_60090-fig-3.tif"/>
</fig>
<p><disp-formula id="eqn-14"><label>(14)</label><mml:math id="mml-eqn-14" display="block"><mml:mrow><mml:mtext>obj</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac><mml:mo>+</mml:mo><mml:mfrac><mml:mrow><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi 
mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:math></disp-formula>
<disp-formula id="eqn-15"><label>(15)</label><mml:math id="mml-eqn-15" display="block"><mml:mrow><mml:mtext>obj</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mo>,</mml:mo><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mo>,</mml:mo><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mo>,</mml:mo><mml:mi>w</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mo>,</mml:mo><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mo>,</mml:mo><mml:mi>w</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:math></disp-formula>
<disp-formula id="eqn-16"><label>(16)</label><mml:math id="mml-eqn-16" display="block"><mml:mrow><mml:mtext>cost</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mo>,</mml:mo><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mtext>obj</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mo>,</mml:mo><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mtext>obj</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:math></disp-formula>
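As an illustration, Eqs. (14)&#x2013;(16) can be sketched numerically as follows. This is a minimal numpy sketch under the convention used above (the columns of <inline-formula id="ieqn-x1"><mml:math id="mml-ieqn-x1"><mml:mi mathvariant="bold-italic">X</mml:mi></mml:math></inline-formula> are data points and each partition column is a binary indicator); the function names are ours, not the authors&#x2019; implementation.
<preformat>
```python
import numpy as np

def obj_single(X, f):
    # obj contribution of one cluster: (f^T X^T X f) / (f^T f),
    # i.e. ||X f||^2 divided by the cluster size, cf. Eq. (17).
    Xf = X @ f
    return float(Xf @ Xf / (f @ f))

def merge_cost(X, f_o, f_w):
    # cost(f_{o,w}) = obj(f_{o,w}) - obj(f_o, f_w)   (Eqs. (14)-(16)).
    before = obj_single(X, f_o) + obj_single(X, f_w)   # Eq. (14)
    merged = np.clip(f_o + f_w, 0, 1)                  # union column f_{o,w}
    after = obj_single(X, merged)                      # Eq. (15)
    return after - before                              # Eq. (16)
```
</preformat>
Since merging two clusters can only decrease the summed objective, the cost is non-positive; a cost near zero indicates a nearly free merge (for example, two co-located clusters have cost exactly zero).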
<list list-type="simple">
<list-item><label>2)</label><p>Splitting operation</p></list-item>
</list></p>
<p>First, the gain of each cluster is calculated. Then, the cluster splitting operation is accomplished by splitting the columns of the partition matrix. The specific method is as follows. For each cluster <inline-formula id="ieqn-79"><mml:math id="mml-ieqn-79"><mml:msub><mml:mi mathvariant="bold-italic">C</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> before the splitting operation, find its corresponding column <inline-formula id="ieqn-80"><mml:math id="mml-ieqn-80"><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> in the partition matrix <inline-formula id="ieqn-81"><mml:math id="mml-ieqn-81"><mml:mi mathvariant="bold-italic">F</mml:mi></mml:math></inline-formula>. As shown in <xref ref-type="fig" rid="fig-4">Fig. 4</xref>, we create a matrix <inline-formula id="ieqn-82"><mml:math id="mml-ieqn-82"><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mo mathvariant="bold">=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, where <inline-formula id="ieqn-83"><mml:math id="mml-ieqn-83"><mml:msub><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-84"><mml:math id="mml-ieqn-84"><mml:msub><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> is 0 for all elements except row u, which is 1. 
u is the row index of a non-zero element randomly selected in <inline-formula id="ieqn-85"><mml:math id="mml-ieqn-85"><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. Running the CDKM algorithm on <inline-formula id="ieqn-86"><mml:math id="mml-ieqn-86"><mml:mi mathvariant="bold-italic">g</mml:mi></mml:math></inline-formula> to convergence yields the membership matrix <inline-formula id="ieqn-87"><mml:math id="mml-ieqn-87"><mml:msup><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> of the two clusters after splitting. The objective function values before and after splitting are calculated using <xref ref-type="disp-formula" rid="eqn-17">Eqs. (17)</xref> and <xref ref-type="disp-formula" rid="eqn-18">(18)</xref>, respectively, and the gain of the cluster splitting is calculated using <xref ref-type="disp-formula" rid="eqn-19">Eq. (19)</xref>, where <inline-formula id="ieqn-88"><mml:math id="mml-ieqn-88"><mml:mrow><mml:mtext>obj</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> represents the CDKM objective function value of the cluster <inline-formula id="ieqn-89"><mml:math id="mml-ieqn-89"><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> before splitting, <inline-formula id="ieqn-90"><mml:math id="mml-ieqn-90"><mml:mrow><mml:mtext>obj</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> represents the CDKM objective function value after the cluster <inline-formula id="ieqn-91"><mml:math id="mml-ieqn-91"><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> corresponding to that column is split into two clusters, and <inline-formula id="ieqn-92"><mml:math id="mml-ieqn-92"><mml:mrow><mml:mtext>gain</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> represents the gain of the column <inline-formula id="ieqn-93"><mml:math id="mml-ieqn-93"><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> after splitting its corresponding cluster.</p>
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>Using partition matrix to split clusters</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_60090-fig-4.tif"/>
</fig>
<p><disp-formula id="eqn-17"><label>(17)</label><mml:math id="mml-eqn-17" display="block"><mml:mrow><mml:mtext>obj</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:math></disp-formula>
<disp-formula id="eqn-18"><label>(18)</label><mml:math id="mml-eqn-18" display="block"><mml:mrow><mml:mtext>obj</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:msubsup><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:msubsup><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac><mml:mo>+</mml:mo><mml:mfrac><mml:mrow><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mi 
mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mi mathvariant="bold-italic">X</mml:mi><mml:msubsup><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:msubsup><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:math></disp-formula>
<disp-formula id="eqn-19"><label>(19)</label><mml:math id="mml-eqn-19" display="block"><mml:mrow><mml:mtext>gain</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mtext>obj</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">g</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mtext>obj</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:math></disp-formula></p>
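<p>As an illustration of Eqs. (17)&#x2013;(19), the gain of splitting one cluster can be sketched as follows. This is a minimal numpy sketch, not the authors&#x2019; implementation: a plain two-center Lloyd refinement stands in for running CDKM on <bold>g</bold>, and the function names are ours.</p>
<preformat>
```python
import numpy as np

def obj(X, f):
    # obj(f) = (f^T X^T X f) / (f^T f) = ||X f||^2 / |C|   (Eq. (17));
    # X has shape (d, n) with data points as columns, f is a 0/1 indicator.
    Xf = X @ f
    return float(Xf @ Xf / (f @ f))

def split_gain(X, f_i, rng):
    # Gain of splitting the cluster indicated by f_i (Eqs. (17)-(19)).
    idx = np.flatnonzero(f_i)
    pts = X[:, idx].T                        # (m, d) members of the cluster
    u = rng.choice(len(idx))                 # g2 is seeded by one random member
    centres = np.stack([pts.mean(axis=0), pts[u].astype(float)])
    for _ in range(50):                      # Lloyd refinement to convergence
        d2 = ((pts[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        lab = d2.argmin(axis=1)
        new = np.stack([pts[lab == j].mean(axis=0) if np.any(lab == j)
                        else centres[j] for j in (0, 1)])
        if np.allclose(new, centres):
            break
        centres = new
    g1 = np.zeros_like(f_i); g1[idx[lab == 0]] = 1   # columns of g'
    g2 = np.zeros_like(f_i); g2[idx[lab == 1]] = 1
    after = obj(X, g1) + obj(X, g2)                  # Eq. (18)
    return after - obj(X, f_i)                       # Eq. (19)
```
</preformat>
Because splitting can only increase the summed objective, the gain is non-negative, and the cluster with the largest gain is the most profitable to split.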
<p>The following describes how the cluster with the largest gain, i.e., column <inline-formula id="ieqn-94"><mml:math id="mml-ieqn-94"><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> of the matrix <inline-formula id="ieqn-95"><mml:math id="mml-ieqn-95"><mml:mi mathvariant="bold-italic">F</mml:mi></mml:math></inline-formula>, is split. As shown in <xref ref-type="fig" rid="fig-5">Fig. 5</xref>, a new column <inline-formula id="ieqn-96"><mml:math id="mml-ieqn-96"><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is appended at the end of the membership matrix <inline-formula id="ieqn-97"><mml:math id="mml-ieqn-97"><mml:mi mathvariant="bold-italic">F</mml:mi></mml:math></inline-formula>. All elements of the new column are initialized to 0; then an element with a value of 1 is randomly selected in <inline-formula id="ieqn-98"><mml:math id="mml-ieqn-98"><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and moved to the corresponding row of <inline-formula id="ieqn-99"><mml:math id="mml-ieqn-99"><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. Clustering is then performed again using CDKM to assign the data points to the nearest centers, thus completing the splitting operation via the partition matrix.</p>
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Splitting the cluster with the largest gain via the partition matrix</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_60090-fig-5.tif"/>
</fig>
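<p>The column operation of Fig. 5 can be sketched as follows; a minimal numpy sketch with hypothetical function names, under the assumption that the randomly selected 1 in the chosen column is moved into the appended column so that each row still sums to 1.</p>
<preformat>
```python
import numpy as np

def split_column(F, m, rng):
    # Append a new all-zero column f_r to the partition matrix F and move
    # one randomly chosen member of column f_m into it (cf. Fig. 5).
    n, k = F.shape
    f_r = np.zeros((n, 1), dtype=F.dtype)
    members = np.flatnonzero(F[:, m])
    u = rng.choice(members)                  # random member of cluster m
    f_r[u, 0] = 1
    F_new = np.hstack([F, f_r])              # F now has k + 1 columns
    F_new[u, m] = 0                          # row u moves to the new cluster
    return F_new
```
</preformat>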
</sec>
<sec id="s3_4">
<label>3.4</label>
<title>Algorithm Description</title>
<p>Algorithm 2 presents the proposed algorithm. Its workflow proceeds from start to finish as follows: CDKM clustering is performed first, the split-merge criterion is then applied, and the algorithm terminates once this process is complete.</p>
<fig id="fig-9">
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_60090-fig-9.tif"/>
</fig>
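<p>The overall workflow can be sketched as follows. This is a schematic numpy sketch, not the authors&#x2019; implementation: plain Lloyd iterations stand in for CDKM, and simple heuristics (closest centers to merge, largest SSE contribution to split) stand in for the cost and gain of Eqs. (16) and (19). All function names are ours.</p>
<preformat>
```python
import numpy as np

def lloyd(P, centres, iters=100):
    # Plain Lloyd iterations stand in for CDKM here; P is (n, d).
    for _ in range(iters):
        d2 = ((P[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        lab = d2.argmin(axis=1)
        new = np.stack([P[lab == j].mean(axis=0) if np.any(lab == j)
                        else centres[j] for j in range(len(centres))])
        if np.allclose(new, centres):
            break
        centres = new
    return lab, centres

def sse(P, lab, centres):
    # Sum of squared errors of the current partition (Eq. (1)).
    return float(((P - centres[lab]) ** 2).sum())

def cdkmsm_sketch(P, k, rng, rounds=3):
    # Cluster, merge the cheapest pair, split the most promising cluster,
    # re-cluster, and keep the result only when the SSE improves.
    centres = P[rng.choice(len(P), size=k, replace=False)].astype(float)
    lab, centres = lloyd(P, centres)
    best = sse(P, lab, centres)
    for _ in range(rounds):
        # merge: fuse the two closest centres (heuristic for Eq. (16))
        d2 = ((centres[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        d2[np.diag_indices(k)] = np.inf
        o, w = np.unravel_index(d2.argmin(), d2.shape)
        sizes = np.bincount(lab, minlength=k).astype(float)
        fused = (sizes[o] * centres[o] + sizes[w] * centres[w]) \
                / max(sizes[o] + sizes[w], 1.0)
        merged = np.delete(centres, w, axis=0)
        merged[o - int(o > w)] = fused
        # split: seed a new centre inside the cluster with the largest
        # SSE contribution (heuristic for the gain of Eq. (19))
        contrib = np.array([((P[lab == j] - centres[j]) ** 2).sum()
                            for j in range(k)])
        members = np.flatnonzero(lab == contrib.argmax())
        if len(members) == 0:
            continue
        seed = P[rng.choice(members)].astype(float)
        cand = np.vstack([merged, seed])      # back to k centres
        lab2, cent2 = lloyd(P, cand)
        s = sse(P, lab2, cent2)
        if best - s > 0:                      # accept only improvements
            lab, centres, best = lab2, cent2, s
    return lab, centres, best
```
</preformat>
The accept-only-if-improved step mirrors the design above: the split-merge round perturbs a converged solution and the re-clustering pass can therefore only lower, never raise, the reported SSE.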
</sec>
<sec id="s3_5">
<label>3.5</label>
<title>Computational Complexity Analysis</title>
<p>Our proposed algorithm consists mainly of four parts. Let <inline-formula id="ieqn-122"><mml:math id="mml-ieqn-122"><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> denote the numbers of CDKM iterations in the first, third and fourth parts, respectively. The first part clusters the data using CDKM; its time complexity is <inline-formula id="ieqn-123"><mml:math id="mml-ieqn-123"><mml:mi>O</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>n</mml:mi><mml:mi>d</mml:mi><mml:mi>k</mml:mi><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>. The second part finds suitable clusters to merge, in which sorting the clusters by objective function value to find the smallest <inline-formula id="ieqn-124"><mml:math id="mml-ieqn-124"><mml:mrow><mml:mo>&#x230A;</mml:mo><mml:mfrac><mml:mi>k</mml:mi><mml:mn>2</mml:mn></mml:mfrac><mml:mo>&#x230B;</mml:mo></mml:mrow></mml:math></inline-formula> clusters has time complexity <inline-formula id="ieqn-125"><mml:math id="mml-ieqn-125"><mml:mi>O</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>k</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>. 
The time complexity of calculating the cost and merging the two clusters corresponding to the minimum value is <inline-formula id="ieqn-126"><mml:math id="mml-ieqn-126"><mml:mi>O</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>k</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mi>d</mml:mi><mml:mo>+</mml:mo><mml:mi>n</mml:mi><mml:mi>k</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>. The total time complexity of this part is <inline-formula id="ieqn-127"><mml:math id="mml-ieqn-127"><mml:mi>O</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>k</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mi>d</mml:mi><mml:mo>+</mml:mo><mml:mi>n</mml:mi><mml:mi>k</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>. The third part is to find the cluster with the largest gain for splitting, and the time complexity is <inline-formula id="ieqn-128"><mml:math id="mml-ieqn-128"><mml:mi>O</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>n</mml:mi><mml:mi>d</mml:mi><mml:mi>k</mml:mi><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>. The fourth part is the re-clustering using CDKM with time complexity <inline-formula id="ieqn-129"><mml:math id="mml-ieqn-129"><mml:mi>O</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>n</mml:mi><mml:mi>d</mml:mi><mml:mi>k</mml:mi><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>. 
So the time complexity of this algorithm is <inline-formula id="ieqn-130"><mml:math id="mml-ieqn-130"><mml:mi>O</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>n</mml:mi><mml:mi>d</mml:mi><mml:mi>k</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:msup><mml:mi>k</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mi>d</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>.</p>
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Experiment</title>
<sec id="s4_1">
<label>4.1</label>
<title>Experimental Environment and Experimental Dataset</title>
<p>All algorithms were run on a hardware platform with an Intel (R) Core (TM) i7-9750H CPU @ 2.60 GHz processor. The results of the split algorithm and the random swap algorithm were obtained with the C&#x002B;&#x002B; software available at <bold><ext-link ext-link-type="uri" xlink:href="https://cs.uef.fi/ml/software/(data=May.26.2010)">https://cs.uef.fi/ml/software/(data=May.26.2010)</ext-link></bold> (accessed on 19 November 2024) and executed in a Linux environment. The remaining algorithms were implemented in C&#x002B;&#x002B; and executed in a Windows environment. Ten UCI (University of California, Irvine) data sets, downloadable from <bold><ext-link ext-link-type="uri" xlink:href="https://archive.ics.uci.edu/(data=September.30.1985)">https://archive.ics.uci.edu/(data=September.30.1985)</ext-link></bold> (accessed on 19 November 2024), are used in the experiments. The specific data set information is shown in <xref ref-type="table" rid="table-1">Table 1</xref>.</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Information of ten UCI datasets</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th></th>
<th>Data set</th>
<th>Data amount</th>
<th>Feature dimension</th>
</tr>
</thead>
<tbody>
<tr>
<td>High-dimensional data sets</td>
<td>Tox</td>
<td>171</td>
<td>1203</td>
</tr>
<tr>
<td/>
<td>DARWIN</td>
<td>174</td>
<td>450</td>
</tr>
<tr>
<td/>
<td>Driv</td>
<td>606</td>
<td>6400</td>
</tr>
<tr>
<td/>
<td>arcene</td>
<td>900</td>
<td>10,000</td>
</tr>
<tr>
<td/>
<td>USPS</td>
<td>1854</td>
<td>256</td>
</tr>
<tr>
<td/>
<td>TUAN</td>
<td>4464</td>
<td>241</td>
</tr>
<tr>
<td/>
<td>isolet</td>
<td>7797</td>
<td>617</td>
</tr>
<tr>
<td>Low-dimensional data sets</td>
<td>Liver</td>
<td>345</td>
<td>6</td>
</tr>
<tr>
<td/>
<td>Ionosphere</td>
<td>351</td>
<td>34</td>
</tr>
<tr>
<td/>
<td>Page</td>
<td>5473</td>
<td>10</td>
</tr>
</tbody>
</table>
<table-wrap-foot><fn><p>Note: <bold>Abbreviations: </bold>Tox &#x003D; Toxicity; Driv &#x003D; DrivfaceD; USPS &#x003D; USPSdata_20; TUAN &#x003D; TUANDROMD; Liver &#x003D; Liver_Disorders; Page &#x003D; Page_Blocks.</p></fn></table-wrap-foot>
</table-wrap>
<p>To objectively evaluate the clustering performance of each algorithm, SSE is adopted as the evaluation index; the SSE value of each algorithm is calculated by <xref ref-type="disp-formula" rid="eqn-1">Eq. (1)</xref> and compared. For the algorithms that rewrite the objective function, including CDKM and CDKMSM, we first run the algorithm to obtain its solution and then compute the SSE of the corresponding K-means objective function from that solution.</p>
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Experimental Result</title>
<p>The experiment compares the results of the K-means algorithm (KM), the CDKM algorithm [<xref ref-type="bibr" rid="ref-21">21</xref>], the split algorithm [<xref ref-type="bibr" rid="ref-23">23</xref>], the random swap algorithm (RS) [<xref ref-type="bibr" rid="ref-24">24</xref>], the I-K-means-&#x002B; algorithm (IKM) [<xref ref-type="bibr" rid="ref-25">25</xref>], the K-means with new formulation algorithm (KWNF) [<xref ref-type="bibr" rid="ref-26">26</xref>] and the proposed CDKMSM algorithm. KM is the original clustering algorithm, and the split algorithm is an early splitting-based method. RS is a strong variant built on the split-merge criterion, and IKM is a modified K-means algorithm based on the same criterion. CDKM incorporates the coordinate descent method into the iterative process of K-means clustering and is the algorithm improved in this paper. KWNF is a recent and effective improvement of the K-means algorithm.</p>
<p>To verify the effect of the algorithm for different numbers of clusters <italic>k</italic>, five values of <italic>k</italic> are chosen in this paper: 4, 6, 8, 10 and 12, and the corresponding clustering results are recorded. For each value of <italic>k</italic> on each data set, every algorithm is run 50 times with randomly initialized cluster centers, and the results are averaged for comparison.</p>
<p>The SSE value is affected by factors such as the dataset size and dimensionality, so SSE values differ greatly across datasets and values of <italic>k</italic>. Defining a quantity that relates the SSE values of two algorithms lets us compare their clustering performance directly. Therefore, in addition to the SSE value (shown in <xref ref-type="disp-formula" rid="eqn-1">Eq. (1)</xref>), we define the following <italic>E</italic> value, which measures the relative improvement in solution accuracy of one algorithm, named here &#x201C;alg1&#x201D;, with respect to another algorithm &#x201C;alg2&#x201D;.
<disp-formula id="eqn-20"><label>(20)</label><mml:math id="mml-eqn-20" display="block"><mml:mi>E</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mtext>SSE</mml:mtext></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>lg</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mtext>SSE</mml:mtext></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>lg</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:msub><mml:mrow><mml:mtext>SSE</mml:mtext></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>lg</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mfrac><mml:mo>&#x00D7;</mml:mo><mml:mn>100</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi><mml:mo>.</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-131"><mml:math id="mml-ieqn-131"><mml:msub><mml:mrow><mml:mtext>SSE</mml:mtext></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>lg</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> denotes the SSE value of the algorithm &#x201C;alg1&#x201D; to be measured, and <inline-formula id="ieqn-132"><mml:math id="mml-ieqn-132"><mml:msub><mml:mrow><mml:mtext>SSE</mml:mtext></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>lg</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> denotes the SSE value of the algorithm &#x201C;alg2&#x201D; to be compared.</p>
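<p>Eq. (20) amounts to the following helper; the function name is ours, for illustration only.</p>
<preformat>
```python
def relative_improvement(sse_alg1, sse_alg2):
    # E = (SSE_alg2 - SSE_alg1) / SSE_alg2 * 100   (Eq. (20)):
    # positive when alg1 attains a lower SSE than alg2.
    return (sse_alg2 - sse_alg1) / sse_alg2 * 100.0
```
</preformat>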
<p>Similarly, to compare running times directly, we define the following <italic>T</italic> value, which quantifies the percentage speedup of the running time <inline-formula id="ieqn-133"><mml:math id="mml-ieqn-133"><mml:msub><mml:mi>T</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> of our proposed algorithm relative to the running time <inline-formula id="ieqn-134"><mml:math id="mml-ieqn-134"><mml:msub><mml:mi>T</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> of each of the other algorithms.
<disp-formula id="eqn-21"><label>(21)</label><mml:math id="mml-eqn-21" display="block"><mml:mi>T</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mi>T</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>T</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:msub><mml:mi>T</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mfrac><mml:mo>&#x00D7;</mml:mo><mml:mn>100</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi><mml:mo>.</mml:mo></mml:math></disp-formula></p>
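<p>Likewise, Eq. (21) can be computed with the following helper; the function name is ours, for illustration only.</p>
<preformat>
```python
def speedup_percent(t1_other, t2_ours):
    # T = (T1 - T2) / T1 * 100   (Eq. (21)): the share of the competing
    # algorithm's running time T1 saved by our running time T2.
    return (t1_other - t2_ours) / t1_other * 100.0
```
</preformat>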
<p>The SSE results are shown in <xref ref-type="table" rid="table-2">Table 2</xref>, with the optimal results shown in bold and the sub-optimal results in italics. The <italic>E</italic> values of the remaining algorithms, measured against the solution accuracy of the K-means algorithm, are shown in <xref ref-type="table" rid="table-3">Table 3</xref>.</p>
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Comparison for SSE value</title>
</caption>
<table frame="hsides">
<colgroup>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>k</th>
<th>Dataset</th>
<th align="center" colspan="8">Algorithm SSE</th>
</tr>
<tr>
<th></th>
<th></th>
<th>KM</th>
<th>Split [<xref ref-type="bibr" rid="ref-23">23</xref>]</th>
<th>RS [<xref ref-type="bibr" rid="ref-24">24</xref>]</th>
<th>IKM [<xref ref-type="bibr" rid="ref-25">25</xref>]</th>
<th>CDKM [<xref ref-type="bibr" rid="ref-21">21</xref>]</th>
<th>KWNF [<xref ref-type="bibr" rid="ref-26">26</xref>]</th>
<th>CDKMSM</th>
<th>Scaling value<sup>2</sup></th>
</tr>
</thead>
<tbody>
<tr>
<td>4</td>
<td>Tox</td>
<td>14.87</td>
<td>18.42</td>
<td><bold>6.37</bold></td>
<td>13.62</td>
<td><italic>7.88</italic></td>
<td>15.21</td>
<td><bold>6.37</bold></td>
<td><bold>&#x002A;</bold>10^14</td>
</tr>
<tr>
<td/>
<td>DARWIN</td>
<td>5.29</td>
<td>13.24</td>
<td><bold>4.45</bold></td>
<td><italic>5.28</italic></td>
<td>5.32</td>
<td>5.30</td>
<td><bold>4.45</bold></td>
<td><bold>&#x002A;</bold>10^13</td>
</tr>
<tr>
<td/>
<td>Driv</td>
<td>6.13</td>
<td>8.93</td>
<td>6.51</td>
<td>6.11</td>
<td><italic>6.10</italic></td>
<td>6.18</td>
<td><bold>5.82</bold></td>
<td><bold>&#x002A;</bold>10^04</td>
</tr>
<tr>
<td/>
<td>arcene</td>
<td>2.07</td>
<td>3.61</td>
<td>1.97</td>
<td><italic>1.92</italic></td>
<td>2.04</td>
<td>2.08</td>
<td><bold>1.84</bold></td>
<td><bold>&#x002A;</bold>10^10</td>
</tr>
<tr>
<td></td>
<td>USPS</td>
<td><italic>8.74</italic></td>
<td>9.42</td>
<td><bold>8.72</bold></td>
<td><italic>8.74</italic></td>
<td><italic>8.74</italic></td>
<td>8.74</td>
<td><italic>8.73</italic></td>
<td><bold>&#x002A;</bold>10^04</td>
</tr>
<tr>
<td/>
<td>TUAN</td>
<td>2.58</td>
<td>3.17</td>
<td><bold>2.30</bold></td>
<td>2.59</td>
<td>2.67</td>
<td>2.60</td>
<td><italic>2.41</italic></td>
<td><bold>&#x002A;</bold>10^04</td>
</tr>
<tr>
<td/>
<td>isolet</td>
<td>6.46</td>
<td>6.72</td>
<td><bold>6.42</bold></td>
<td><italic>6.43</italic></td>
<td>6.46</td>
<td>6.46</td>
<td><bold>6.42</bold></td>
<td><bold>&#x002A;</bold>10^05</td>
</tr>
<tr>
<td/>
<td>Liver</td>
<td>2.64</td>
<td>2.95</td>
<td><bold>2.61</bold></td>
<td>2.63</td>
<td>2.63</td>
<td>2.66</td>
<td><italic>2.62</italic></td>
<td><bold>&#x002A;</bold>10^05</td>
</tr>
<tr>
<td/>
<td>Ionosphere</td>
<td>2.12</td>
<td>2.33</td>
<td><bold>2.00</bold></td>
<td>2.10</td>
<td><italic>2.05</italic></td>
<td>2.12</td>
<td><bold>2.00</bold></td>
<td><bold>&#x002A;</bold>10^03</td>
</tr>
<tr>
<td/>
<td>Page</td>
<td><bold>2.13</bold></td>
<td>2.68</td>
<td><bold>2.13</bold></td>
<td><bold>2.13</bold></td>
<td><bold>2.13</bold></td>
<td><bold>2.13</bold></td>
<td><bold>2.13</bold></td>
<td><bold>&#x002A;</bold>10^10</td>
</tr>
<tr>
<td>6</td>
<td>Tox</td>
<td>5.14</td>
<td>13.79</td>
<td><bold>2.30</bold></td>
<td>3.52</td>
<td>3.05</td>
<td>4.62</td>
<td><bold>2.30</bold></td>
<td><bold>&#x002A;</bold>10^14</td>
</tr>
<tr>
<td/>
<td>DARWIN</td>
<td>4.57</td>
<td>13.20</td>
<td><bold>3.39</bold></td>
<td>4.55</td>
<td>3.96</td>
<td>4.58</td>
<td><italic>3.53</italic></td>
<td><bold>&#x002A;</bold>10^13</td>
</tr>
<tr>
<td/>
<td>Driv</td>
<td>4.97</td>
<td>7.92</td>
<td>5.44</td>
<td><italic>4.77</italic></td>
<td>4.93</td>
<td>5.02</td>
<td><bold>4.72</bold></td>
<td><bold>&#x002A;</bold>10^04</td>
</tr>
<tr>
<td/>
<td>arcene</td>
<td>1.78</td>
<td>3.52</td>
<td>1.68</td>
<td><bold>1.65</bold></td>
<td>1.77</td>
<td>1.78</td>
<td><italic>1.66</italic></td>
<td><bold>&#x002A;</bold>10^10</td>
</tr>
<tr>
<td></td>
<td>USPS</td>
<td>7.92</td>
<td>8.90</td>
<td><bold>7.88</bold></td>
<td>7.92</td>
<td>7.91</td>
<td>7.92</td>
<td><italic>7.90</italic></td>
<td><bold>&#x002A;</bold>10^04</td>
</tr>
<tr>
<td/>
<td>TUAN</td>
<td>2.06</td>
<td>3.17</td>
<td><bold>1.72</bold></td>
<td>1.96</td>
<td>2.24</td>
<td>2.10</td>
<td><italic>1.95</italic></td>
<td><bold>&#x002A;</bold>10^04</td>
</tr>
<tr>
<td/>
<td>isolet</td>
<td>5.94</td>
<td>6.57</td>
<td><bold>5.89</bold></td>
<td><italic>5.91</italic></td>
<td>5.93</td>
<td>5.93</td>
<td><bold>5.89</bold></td>
<td><bold>&#x002A;</bold>10^05</td>
</tr>
<tr>
<td/>
<td>Liver</td>
<td>2.10</td>
<td>2.44</td>
<td><bold>1.87</bold></td>
<td>2.02</td>
<td>2.06</td>
<td>2.10</td>
<td><italic>1.92</italic></td>
<td><bold>&#x002A;</bold>10^05</td>
</tr>
<tr>
<td/>
<td>Ionosphere</td>
<td>1.94</td>
<td>2.26</td>
<td><bold>1.81</bold></td>
<td>1.92</td>
<td>1.85</td>
<td>1.94</td>
<td><italic>1.83</italic></td>
<td><bold>&#x002A;</bold>10^03</td>
</tr>
<tr>
<td/>
<td>Page</td>
<td>1.64</td>
<td>1.36</td>
<td><bold>1.02</bold></td>
<td>1.05</td>
<td>1.63</td>
<td>1.64</td>
<td><italic>1.03</italic></td>
<td><bold>&#x002A;</bold>10^10</td>
</tr>
<tr>
<td>8</td>
<td>Tox</td>
<td>46.68</td>
<td>137.79</td>
<td><bold>3.56</bold></td>
<td>16.77</td>
<td>22.99</td>
<td>46.17</td>
<td><italic>8.18</italic></td>
<td><bold>&#x002A;</bold>10^13</td>
</tr>
<tr>
<td/>
<td>DARWIN</td>
<td>4.04</td>
<td>13.19</td>
<td><bold>2.89</bold></td>
<td>3.91</td>
<td>3.51</td>
<td>4.03</td>
<td><italic>3.02</italic></td>
<td>&#x002A;10^13</td>
</tr>
<tr>
<td/>
<td>Driv</td>
<td>4.31</td>
<td>6.96</td>
<td>5.05</td>
<td><italic>4.14</italic></td>
<td>4.28</td>
<td>4.31</td>
<td><bold>4.10</bold></td>
<td>&#x002A;10^04</td>
</tr>
<tr>
<td/>
<td>arcene</td>
<td>1.63</td>
<td>3.51</td>
<td>1.63</td>
<td><bold>1.54</bold></td>
<td>1.62</td>
<td>1.63</td>
<td><italic>1.55</italic></td>
<td>&#x002A;10^10</td>
</tr>
<tr>
<td></td>
<td>USPS</td>
<td>7.29</td>
<td>8.51</td>
<td><bold>7.25</bold></td>
<td>7.28</td>
<td>7.28</td>
<td>7.29</td>
<td><italic>7.27</italic></td>
<td>&#x002A;10^04</td>
</tr>
<tr>
<td/>
<td>TUAN</td>
<td>1.74</td>
<td>3.15</td>
<td><bold>1.41</bold></td>
<td><italic>1.56</italic></td>
<td>1.85</td>
<td>1.71</td>
<td>1.62</td>
<td>&#x002A;10^04</td>
</tr>
<tr>
<td/>
<td>isolet</td>
<td>5.60</td>
<td>6.38</td>
<td><bold>5.56</bold></td>
<td><italic>5.57</italic></td>
<td>5.59</td>
<td><italic>5.59</italic></td>
<td><italic>5.57</italic></td>
<td>&#x002A;10^05</td>
</tr>
<tr>
<td/>
<td>Liver</td>
<td>1.77</td>
<td>2.35</td>
<td><bold>1.48</bold></td>
<td><italic>1.57</italic></td>
<td>1.69</td>
<td>1.74</td>
<td><bold>1.48</bold></td>
<td>&#x002A;10^05</td>
</tr>
<tr>
<td/>
<td>Ionosphere</td>
<td>1.76</td>
<td>2.23</td>
<td><bold>1.67</bold></td>
<td>1.75</td>
<td>1.70</td>
<td>1.76</td>
<td><italic>1.69</italic></td>
<td>&#x002A;10^03</td>
</tr>
<tr>
<td/>
<td>Page</td>
<td>15.44</td>
<td>8.65</td>
<td><italic>6.49</italic></td>
<td>7.40</td>
<td>7.60</td>
<td>15.44</td>
<td><bold>6.47</bold></td>
<td>&#x002A;10^09</td>
</tr>
<tr>
<td>10</td>
<td>Tox</td>
<td>48.41</td>
<td>109.20</td>
<td><bold>1.23</bold></td>
<td>15.24</td>
<td>22.99</td>
<td>47.88</td>
<td><italic>8.18</italic></td>
<td>&#x002A;10^13</td>
</tr>
<tr>
<td/>
<td>DARWIN</td>
<td>3.66</td>
<td>13.19</td>
<td><bold>2.60</bold></td>
<td>3.25</td>
<td>3.19</td>
<td>3.67</td>
<td><italic>2.68</italic></td>
<td>&#x002A;10^13</td>
</tr>
<tr>
<td/>
<td>Driv</td>
<td>3.95</td>
<td>6.72</td>
<td>4.64</td>
<td><italic>3.83</italic></td>
<td>3.94</td>
<td>3.97</td>
<td><bold>3.82</bold></td>
<td>&#x002A;10^04</td>
</tr>
<tr>
<td/>
<td>arcene</td>
<td>1.55</td>
<td>3.50</td>
<td>1.51</td>
<td><bold>1.46</bold></td>
<td>1.55</td>
<td>1.55</td>
<td><italic>1.49</italic></td>
<td>&#x002A;10^10</td>
</tr>
<tr>
<td></td>
<td>USPS</td>
<td>6.79</td>
<td>8.43</td>
<td><bold>6.74</bold></td>
<td><italic>6.77</italic></td>
<td><italic>6.79</italic></td>
<td><italic>6.79</italic></td>
<td><italic>6.77</italic></td>
<td>&#x002A;10^04</td>
</tr>
<tr>
<td/>
<td>TUAN</td>
<td>1.52</td>
<td>3.15</td>
<td><bold>1.25</bold></td>
<td><italic>1.39</italic></td>
<td>1.65</td>
<td>1.53</td>
<td>1.46</td>
<td>&#x002A;10^04</td>
</tr>
<tr>
<td/>
<td>isolet</td>
<td>5.36</td>
<td>6.36</td>
<td><bold>5.30</bold></td>
<td><italic>5.32</italic></td>
<td>5.35</td>
<td>5.36</td>
<td><italic>5.32</italic></td>
<td>&#x002A;10^05</td>
</tr>
<tr>
<td/>
<td>Liver</td>
<td>1.47</td>
<td>2.28</td>
<td><bold>1.28</bold></td>
<td>1.34</td>
<td>1.37</td>
<td>1.44</td>
<td><italic>1.30</italic></td>
<td>&#x002A;10^05</td>
</tr>
<tr>
<td/>
<td>Ionosphere</td>
<td>1.67</td>
<td>2.21</td>
<td><bold>1.55</bold></td>
<td>1.63</td>
<td>1.60</td>
<td>1.68</td>
<td><italic>1.57</italic></td>
<td>&#x002A;10^03</td>
</tr>
<tr>
<td/>
<td>Page</td>
<td>14.33</td>
<td>5.39</td>
<td><bold>4.59</bold></td>
<td>5.95</td>
<td>6.29</td>
<td>14.33</td>
<td><italic>4.68</italic></td>
<td>&#x002A;10^09</td>
</tr>
<tr>
<td>12</td>
<td>Tox</td>
<td>4928.50</td>
<td>399.50</td>
<td><bold>1.35</bold></td>
<td>1115.30</td>
<td>2289.20</td>
<td>4826.00</td>
<td><italic>808.68</italic></td>
<td>&#x002A;10^11</td>
</tr>
<tr>
<td/>
<td>DARWIN</td>
<td>3.39</td>
<td>4.99</td>
<td><bold>2.40</bold></td>
<td>2.86</td>
<td>2.96</td>
<td>3.39</td>
<td><italic>2.42</italic></td>
<td>&#x002A;10^13</td>
</tr>
<tr>
<td/>
<td>Driv</td>
<td>3.70</td>
<td>6.72</td>
<td>4.28</td>
<td><bold>3.58</bold></td>
<td>3.66</td>
<td>3.70</td>
<td><italic>3.60</italic></td>
<td>&#x002A;10^04</td>
</tr>
<tr>
<td/>
<td>arcene</td>
<td>1.46</td>
<td>3.50</td>
<td>1.45</td>
<td><bold>1.40</bold></td>
<td>1.44</td>
<td>1.46</td>
<td><italic>1.41</italic></td>
<td>&#x002A;10^10</td>
</tr>
<tr>
<td></td>
<td>USPS</td>
<td>6.47</td>
<td>8.42</td>
<td><bold>6.41</bold></td>
<td>6.44</td>
<td>6.46</td>
<td>6.47</td>
<td><italic>6.43</italic></td>
<td>&#x002A;10^04</td>
</tr>
<tr>
<td/>
<td>TUAN</td>
<td>1.45</td>
<td>3.15</td>
<td><bold>1.12</bold></td>
<td><italic>1.28</italic></td>
<td>1.57</td>
<td>1.43</td>
<td>1.36</td>
<td>&#x002A;10^04</td>
</tr>
<tr>
<td/>
<td>isolet</td>
<td>5.17</td>
<td>6.27</td>
<td><bold>5.09</bold></td>
<td><italic>5.12</italic></td>
<td>5.16</td>
<td>5.16</td>
<td>5.13</td>
<td>&#x002A;10^05</td>
</tr>
<tr>
<td/>
<td>Liver</td>
<td>1.26</td>
<td>2.25</td>
<td><bold>1.13</bold></td>
<td><italic>1.16</italic></td>
<td>1.19</td>
<td>1.26</td>
<td><italic>1.14</italic></td>
<td>&#x002A;10^05</td>
</tr>
<tr>
<td/>
<td>Ionosphere</td>
<td>1.60</td>
<td>2.11</td>
<td><bold>1.48</bold></td>
<td>1.56</td>
<td><italic>1.51</italic></td>
<td>1.59</td>
<td><bold>1.48</bold></td>
<td>&#x002A;10^03</td>
</tr>
<tr>
<td/>
<td>Page</td>
<td>13.87</td>
<td>4.04</td>
<td><bold>3.49</bold></td>
<td>4.82</td>
<td>5.86</td>
<td>13.88</td>
<td><italic>4.14</italic></td>
<td>&#x002A;10^09</td>
</tr>
</tbody>
</table>
<table-wrap-foot><fn><p>Note: <sup>2</sup>&#x201C;Scaling value&#x201D; means that, for example, 10^14 indicates that the values in that row are to be multiplied by 10^14.</p>
</fn>
</table-wrap-foot>
</table-wrap><table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Comparison of the <italic>E</italic> value between various optimization algorithms and K-means algorithm</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th><italic>k</italic></th>
<th>Split [<xref ref-type="bibr" rid="ref-23">23</xref>]</th>
<th>RS [<xref ref-type="bibr" rid="ref-24">24</xref>]</th>
<th>IKM [<xref ref-type="bibr" rid="ref-25">25</xref>]</th>
<th>CDKM [<xref ref-type="bibr" rid="ref-21">21</xref>]</th>
<th>KWNF [<xref ref-type="bibr" rid="ref-26">26</xref>]</th>
<th>CDKMSM</th>
</tr>
</thead>
<tbody>
<tr>
<td>4</td>
<td>&#x2212;37.70%</td>
<td>9.01%</td>
<td>1.75%</td>
<td>4.88%</td>
<td>&#x2212;0.53%</td>
<td>10.25%</td>
</tr>
<tr>
<td>6</td>
<td>&#x2212;60.79%</td>
<td>15.05%</td>
<td>8.95%</td>
<td>5.35%</td>
<td>0.70%</td>
<td>14.75%</td>
</tr>
<tr>
<td>8</td>
<td>&#x2212;72.56%</td>
<td>20.35%</td>
<td>15.14%</td>
<td>11.72%</td>
<td>0.45%</td>
<td>20.31%</td>
</tr>
<tr>
<td>10</td>
<td>&#x2212;75.73%</td>
<td>21.95%</td>
<td>16.72%</td>
<td>12.42%</td>
<td>0.14%</td>
<td>20.67%</td>
</tr>
<tr>
<td>12</td>
<td>&#x2212;38.66%</td>
<td>23.19%</td>
<td>18.88%</td>
<td>12.88%</td>
<td>0.29%</td>
<td>21.27%</td>
</tr>
<tr>
<td>Mean</td>
<td>&#x2212;57.09%</td>
<td>17.91%</td>
<td>12.29%</td>
<td>9.45%</td>
<td>0.21%</td>
<td>17.45%</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>In <xref ref-type="table" rid="table-2">Table 2</xref>, as the number of clusters increases, the sum of squared errors (SSE) typically decreases because more clusters can better capture the data characteristics, reducing the distance between data points and their cluster centers. <xref ref-type="fig" rid="fig-6">Fig. 6</xref> compares the <italic>E</italic> value of each optimization algorithm against the K-means algorithm. Because the split algorithm has markedly lower solution accuracy than the other algorithms, it was excluded from the comparison in <xref ref-type="fig" rid="fig-6">Fig. 6</xref>. The <italic>E</italic> value of the CDKM algorithm is 9.45%, while those of the split, I-K-means-&#x002B;, and KWNF algorithms are &#x2212;57.09%, 12.29%, and 0.21%, respectively; the <italic>E</italic> value of the proposed algorithm is 17.45%. Relative to the CDKM algorithm, the <italic>E</italic> value of the proposed algorithm is 11.29%, and relative to the split, I-K-means-&#x002B;, and KWNF algorithms it is 35.61%, 7.87%, and 17.39%, respectively. As the number of clusters increases, the <italic>E</italic> value of the proposed algorithm relative to the K-means algorithm gradually increases, indicating that its improvement in SSE over K-means and the other tested algorithms, apart from the random swap algorithm, becomes increasingly significant. The proposed algorithm thus achieves higher solution accuracy than the K-means algorithm and the other tested K-means optimization algorithms, apart from the random swap algorithm. Compared to the K-means algorithm, the <italic>E</italic> value of the random swap algorithm is 17.91%, slightly higher than the 17.45% of the proposed algorithm.</p>
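As a concrete illustration of the comparison index, a minimal sketch follows; it assumes <italic>E</italic> is the percentage SSE reduction of an algorithm relative to the K-means baseline (the averaging over datasets and runs follows the experimental protocol, and the function names here are illustrative, not from the paper):

```python
import numpy as np

def sse(X, labels, k):
    """Sum of squared errors: squared distance of each point to its cluster mean."""
    total = 0.0
    for c in range(k):
        pts = X[labels == c]
        if len(pts):
            total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total

def e_value(sse_kmeans, sse_algo):
    """Percentage SSE reduction relative to the K-means baseline.

    Positive means lower (better) SSE than K-means; negative means worse.
    """
    return 100.0 * (sse_kmeans - sse_algo) / sse_kmeans
```

For example, under this assumed definition, an algorithm reaching an SSE of 1.92 &#x00D7; 10^05 against a K-means SSE of 2.10 &#x00D7; 10^05 would score e_value(2.10e5, 1.92e5) &#x2248; 8.6%.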
<fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>Comparison of the <italic>E</italic> value between each optimization algorithm and K-means algorithm</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_60090-fig-6.tif"/>
</fig>
<p>The runtime results are shown in <xref ref-type="table" rid="table-4">Table 4</xref>, with the optimal and sub-optimal results shown in bold and italics, respectively. Since the split and random swap algorithms were run on a Linux system, their execution times were not compared. As the number of clusters increases, the runtime gradually increases. <xref ref-type="fig" rid="fig-7">Fig. 7</xref> shows the percentage speedup of the proposed algorithm over the other algorithms on the higher-dimensional datasets (the first seven datasets) at <italic>k</italic> &#x003D; 4, 6, 8, 10, 12. The <italic>T</italic> values of the proposed algorithm compared to the I-K-means-&#x002B; algorithm, another split-merge based K-means improvement, are 9.16%, 32.69%, 44.65%, 50.96%, and 50.51%, respectively. Compared to the I-K-means-&#x002B; algorithm, the proposed algorithm therefore operates more efficiently as <italic>k</italic> increases. The <italic>T</italic> values relative to the K-means algorithm and the K-means with new formulation (KWNF) algorithm are 11.60% and 89.23%, respectively. As the number of clusters increases, the <italic>T</italic> value of the proposed algorithm relative to the other tested algorithms becomes increasingly larger. It can be concluded that the proposed algorithm retains the efficiency advantage of the CDKM algorithm on high dimensional data.</p>
<table-wrap id="table-4">
<label>Table 4</label>
<caption>
<title>Comparison for time</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th><italic>k</italic></th>
<th>Dataset</th>
<th align="center" colspan="6">Algorithm time</th>
</tr>
<tr>
<th></th>
<th></th>
<th>KM</th>
<th>IKM [<xref ref-type="bibr" rid="ref-25">25</xref>]</th>
<th>CDKM [<xref ref-type="bibr" rid="ref-21">21</xref>]</th>
<th>KWNF [<xref ref-type="bibr" rid="ref-26">26</xref>]</th>
<th>CDKMSM</th>
<th>Scaling value</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="10">4</td>
<td>Tox</td>
<td><italic>6.76</italic></td>
<td>8.38</td>
<td><bold>5.66</bold></td>
<td>49.74</td>
<td>8.14</td>
<td>&#x002A;10^00</td>
</tr>
<tr>
<td>DARWIN</td>
<td><italic>3.00</italic></td>
<td>3.48</td>
<td><bold>2.18</bold></td>
<td>14.10</td>
<td>3.66</td>
<td>&#x002A;10^00</td>
</tr>
<tr>
<td>Driv</td>
<td><italic>1.70</italic></td>
<td>1.93</td>
<td><bold>1.27</bold></td>
<td>16.06</td>
<td>2.02</td>
<td>&#x002A;10^02</td>
</tr>
<tr>
<td>arcene</td>
<td><italic>4.29</italic></td>
<td>5.48</td>
<td><bold>3.16</bold></td>
<td>30.41</td>
<td>4.58</td>
<td>&#x002A;10^02</td>
</tr>
<tr>
<td>USPS</td>
<td>4.44</td>
<td>4.67</td>
<td><bold>2.62</bold></td>
<td>49.04</td>
<td><italic>3.79</italic></td>
<td>&#x002A;10^01</td>
</tr>
<tr>
<td>TUAN</td>
<td><italic>5.70</italic></td>
<td>7.72</td>
<td><bold>4.22</bold></td>
<td>71.85</td>
<td>8.58</td>
<td>&#x002A;10^01</td>
</tr>
<tr>
<td>isolet</td>
<td>6.94</td>
<td>8.86</td>
<td><bold>2.69</bold></td>
<td>75.71</td>
<td><italic>4.72</italic></td>
<td>&#x002A;10^02</td>
</tr>
<tr>
<td>Liver</td>
<td><bold>1.40</bold></td>
<td><italic>2.80</italic></td>
<td>16.40</td>
<td>25.60</td>
<td>27.00</td>
<td>&#x002A;10^&#x2212;01</td>
</tr>
<tr>
<td>Ionosphere</td>
<td><bold>4.80</bold></td>
<td><italic>6.40</italic></td>
<td>12.00</td>
<td>34.80</td>
<td>25.40</td>
<td>&#x002A;10^&#x2212;01</td>
</tr>
<tr>
<td>Page</td>
<td><bold>3.88</bold></td>
<td><italic>4.52</italic></td>
<td>25.02</td>
<td>69.18</td>
<td>50.48</td>
<td>&#x002A;10^00</td>
</tr>
<tr>
<td rowspan="10">6</td>
<td>Tox</td>
<td>15.00</td>
<td>18.36</td>
<td><bold>8.04</bold></td>
<td>127.56</td>
<td><italic>12.14</italic></td>
<td>&#x002A;10^00</td>
</tr>
<tr>
<td>DARWIN</td>
<td><italic>5.82</italic></td>
<td>6.96</td>
<td><bold>3.56</bold></td>
<td>29.32</td>
<td>6.18</td>
<td>&#x002A;10^00</td>
</tr>
<tr>
<td>Driv</td>
<td>2.85</td>
<td>3.78</td>
<td><bold>1.70</bold></td>
<td>25.45</td>
<td><italic>2.75</italic></td>
<td>&#x002A;10^02</td>
</tr>
<tr>
<td>arcene</td>
<td>8.01</td>
<td>14.99</td>
<td><bold>4.34</bold></td>
<td>58.69</td>
<td><italic>7.25</italic></td>
<td>&#x002A;10^02</td>
</tr>
<tr>
<td>USPS</td>
<td>8.22</td>
<td>8.80</td>
<td><bold>3.55</bold></td>
<td>63.08</td>
<td><italic>5.42</italic></td>
<td>&#x002A;10^01</td>
</tr>
<tr>
<td>TUAN</td>
<td><italic>8.18</italic></td>
<td>11.18</td>
<td><bold>5.23</bold></td>
<td>11.95</td>
<td>10.42</td>
<td>&#x002A;10^01</td>
</tr>
<tr>
<td>isolet</td>
<td>9.38</td>
<td>14.73</td>
<td><bold>3.53</bold></td>
<td>90.06</td>
<td><italic>5.97</italic></td>
<td>&#x002A;10^02</td>
</tr>
<tr>
<td>Liver</td>
<td><bold>2.80</bold></td>
<td><italic>3.80</italic></td>
<td>22.20</td>
<td>39.80</td>
<td>43.40</td>
<td>&#x002A;10^&#x2212;01</td>
</tr>
<tr>
<td>Ionosphere</td>
<td><bold>9.20</bold></td>
<td><italic>12.20</italic></td>
<td>19.20</td>
<td>66.00</td>
<td>34.20</td>
<td>&#x002A;10^&#x2212;01</td>
</tr>
<tr>
<td>Page</td>
<td><bold>9.72</bold></td>
<td><italic>16.36</italic></td>
<td>67.66</td>
<td>203.16</td>
<td>136.88</td>
<td>&#x002A;10^00</td>
</tr>
<tr>
<td rowspan="10">8</td>
<td>Tox</td>
<td>3.13</td>
<td>4.47</td>
<td><bold>1.38</bold></td>
<td>27.99</td>
<td><italic>2.05</italic></td>
<td>&#x002A;10^01</td>
</tr>
<tr>
<td>DARWIN</td>
<td>8.18</td>
<td>9.72</td>
<td><bold>4.52</bold></td>
<td>40.48</td>
<td><italic>7.70</italic></td>
<td>&#x002A;10^00</td>
</tr>
<tr>
<td>Driv</td>
<td><italic>3.31</italic></td>
<td>5.38</td>
<td><bold>2.14</bold></td>
<td>28.97</td>
<td>3.50</td>
<td>&#x002A;10^02</td>
</tr>
<tr>
<td>arcene</td>
<td>11.79</td>
<td>23.04</td>
<td><bold>5.43</bold></td>
<td>89.00</td>
<td><italic>9.15</italic></td>
<td>&#x002A;10^02</td>
</tr>
<tr>
<td>USPS</td>
<td>11.44</td>
<td>12.66</td>
<td><bold>5.27</bold></td>
<td>82.45</td>
<td><italic>7.40</italic></td>
<td>&#x002A;10^01</td>
</tr>
<tr>
<td>TUAN</td>
<td><italic>12.06</italic></td>
<td>17.78</td>
<td><bold>6.77</bold></td>
<td>192.50</td>
<td>12.31</td>
<td>&#x002A;10^01</td>
</tr>
<tr>
<td>isolet</td>
<td>13.14</td>
<td>26.22</td>
<td><bold>5.22</bold></td>
<td>132.27</td>
<td><italic>7.81</italic></td>
<td>&#x002A;10^02</td>
</tr>
<tr>
<td>Liver</td>
<td><bold>3.40</bold></td>
<td><italic>7.60</italic></td>
<td>29.40</td>
<td>66.60</td>
<td>59.80</td>
<td>&#x002A;10^&#x2212;01</td>
</tr>
<tr>
<td>Ionosphere</td>
<td><bold>1.18</bold></td>
<td><italic>2.06</italic></td>
<td>2.50</td>
<td>8.66</td>
<td>4.46</td>
<td>&#x002A;10^00</td>
</tr>
<tr>
<td>Page</td>
<td><bold>1.95</bold></td>
<td><italic>2.99</italic></td>
<td>18.24</td>
<td>42.35</td>
<td>30.76</td>
<td>&#x002A;10^01</td>
</tr>
<tr>
<td rowspan="10">10</td>
<td>Tox</td>
<td>2.82</td>
<td>4.95</td>
<td><bold>1.33</bold></td>
<td>26.39</td>
<td><italic>2.22</italic></td>
<td>&#x002A;10^01</td>
</tr>
<tr>
<td>DARWIN</td>
<td>10.30</td>
<td>13.30</td>
<td><bold>5.46</bold></td>
<td>52.26</td>
<td><italic>9.22</italic></td>
<td>&#x002A;10^00</td>
</tr>
<tr>
<td>Driv</td>
<td>5.03</td>
<td>8.72</td>
<td><bold>2.69</bold></td>
<td>48.49</td>
<td><italic>4.23</italic></td>
<td>&#x002A;10^02</td>
</tr>
<tr>
<td>arcene</td>
<td>14.34</td>
<td>32.16</td>
<td><bold>6.63</bold></td>
<td>129.26</td>
<td><italic>11.25</italic></td>
<td>&#x002A;10^02</td>
</tr>
<tr>
<td>USPS</td>
<td>15.56</td>
<td>18.90</td>
<td><bold>7.08</bold></td>
<td>121.81</td>
<td><italic>10.08</italic></td>
<td>&#x002A;10^01</td>
</tr>
<tr>
<td>TUAN</td>
<td>15.49</td>
<td>24.56</td>
<td><bold>8.63</bold></td>
<td>338.22</td>
<td><italic>15.43</italic></td>
<td>&#x002A;10^01</td>
</tr>
<tr>
<td>isolet</td>
<td>14.83</td>
<td>34.47</td>
<td><bold>5.89</bold></td>
<td>161.62</td>
<td><italic>10.16</italic></td>
<td>&#x002A;10^02</td>
</tr>
<tr>
<td>Liver</td>
<td><bold>5.00</bold></td>
<td><italic>9.20</italic></td>
<td>34.80</td>
<td>77.60</td>
<td>64.20</td>
<td>&#x002A;10^&#x2212;01</td>
</tr>
<tr>
<td>Ionosphere</td>
<td><bold>1.30</bold></td>
<td><italic>2.90</italic></td>
<td>3.00</td>
<td>9.78</td>
<td>5.74</td>
<td>&#x002A;10^00</td>
</tr>
<tr>
<td>Page</td>
<td><bold>3.28</bold></td>
<td><italic>4.69</italic></td>
<td>27.60</td>
<td>76.59</td>
<td>44.42</td>
<td>&#x002A;10^01</td>
</tr>
<tr>
<td rowspan="10">12</td>
<td>Tox</td>
<td>3.09</td>
<td>5.59</td>
<td><bold>1.57</bold></td>
<td>25.54</td>
<td><italic>2.54</italic></td>
<td>&#x002A;10^01</td>
</tr>
<tr>
<td>DARWIN</td>
<td><italic>10.66</italic></td>
<td>14.86</td>
<td><bold>6.30</bold></td>
<td>51.76</td>
<td>10.92</td>
<td>&#x002A;10^00</td>
</tr>
<tr>
<td>Driv</td>
<td>5.80</td>
<td>10.67</td>
<td><bold>3.02</bold></td>
<td>45.26</td>
<td><italic>4.72</italic></td>
<td>&#x002A;10^02</td>
</tr>
<tr>
<td>arcene</td>
<td>18.43</td>
<td>35.97</td>
<td><bold>8.44</bold></td>
<td>158.30</td>
<td><italic>13.77</italic></td>
<td>&#x002A;10^02</td>
</tr>
<tr>
<td>USPS</td>
<td>16.51</td>
<td>21.19</td>
<td><bold>7.12</bold></td>
<td>109.16</td>
<td><italic>11.18</italic></td>
<td>&#x002A;10^01</td>
</tr>
<tr>
<td>TUAN</td>
<td>18.67</td>
<td>27.45</td>
<td><bold>9.97</bold></td>
<td>394.83</td>
<td><italic>17.46</italic></td>
<td>&#x002A;10^01</td>
</tr>
<tr>
<td>isolet</td>
<td>20.07</td>
<td>39.58</td>
<td><bold>7.02</bold></td>
<td>192.58</td>
<td><italic>11.28</italic></td>
<td>&#x002A;10^02</td>
</tr>
<tr>
<td>Liver</td>
<td><bold>5.20</bold></td>
<td><italic>12.00</italic></td>
<td>41.60</td>
<td>88.60</td>
<td>81.00</td>
<td>&#x002A;10^&#x2212;01</td>
</tr>
<tr>
<td>Ionosphere</td>
<td><bold>1.78</bold></td>
<td><italic>3.30</italic></td>
<td>3.42</td>
<td>11.26</td>
<td>6.72</td>
<td>&#x002A;10^00</td>
</tr>
<tr>
<td>Page</td>
<td><bold>5.56</bold></td>
<td><italic>7.62</italic></td>
<td>33.50</td>
<td>145.30</td>
<td>63.64</td>
<td>&#x002A;10^01</td>
</tr>
</tbody>
</table>
</table-wrap><fig id="fig-7">
<label>Figure 7</label>
<caption>
<title>Comparison of <italic>T</italic> value between the proposed algorithm and other algorithms under high dimensional data</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_60090-fig-7.tif"/>
</fig>
<p>The experimental results show that the solution accuracy of the proposed method is higher than that of the K-means algorithm and of the two split-merge based algorithms, split and I-K-means-&#x002B;. By operating on the partition matrix, the proposed algorithm improves the solution accuracy of CDKM on high dimensional data while retaining its computational efficiency advantage. It also outperforms the recently proposed KWNF algorithm. However, its solution accuracy is slightly lower than that of the random swap algorithm, a variant of the split-merge criterion that incorporates a swap operation. This suggests that the swap operation can effectively improve solution accuracy, which is a key focus of our future research.</p>
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Conclusions</title>
<p>The CDKM algorithm employs a coordinate descent method to optimize the K-means model and improves the solution accuracy of the original K-means algorithm. However, it is sensitive to the initial centers, which limits its solution accuracy. In this paper, the iterative process of CDKM is modified, and a coordinate descent K-means algorithm based on split-merge is proposed. The proposed algorithm first obtains the partition matrix by CDKM; the partition matrix is then optimized by the proposed split-merge criterion to improve the solution accuracy. Because the introduced split-merge operations are performed entirely on the partition matrix, the distance calculations required by traditional K-means and related improved algorithms are avoided; thus, like CDKM, the proposed algorithm has an efficiency advantage over other algorithms on high dimensional datasets.</p>
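The split-then-merge refinement described above can be sketched generically as follows. This is an illustrative outline operating on a label vector (one view of the partition matrix), not the paper's exact criterion: the variance-based split of the worst cluster and the minimum-SSE-increase merge are assumptions made for illustration.

```python
import numpy as np

def cluster_sse(pts):
    """SSE of one cluster: squared distances to its mean (0 for an empty cluster)."""
    if len(pts) == 0:
        return 0.0
    return ((pts - pts.mean(axis=0)) ** 2).sum()

def split_merge_step(X, labels, k):
    """One generic split-merge refinement keeping k clusters.

    Split the highest-SSE cluster along its highest-variance dimension,
    then merge the pair of clusters whose union increases SSE the least.
    """
    labels = labels.copy()
    # --- split: pick the cluster with the largest SSE
    sses = np.array([cluster_sse(X[labels == c]) for c in range(k)])
    worst = int(np.argmax(sses))
    idx = np.where(labels == worst)[0]
    pts = X[idx]
    d = int(np.argmax(pts.var(axis=0)))          # highest-variance dimension
    upper = pts[:, d] > np.median(pts[:, d])      # split at the median
    labels[idx[upper]] = k                        # temporary (k+1)-th cluster
    # --- merge: find the pair whose merged SSE grows the least
    best_pair, best_cost = None, np.inf
    for a in range(k + 1):
        for b in range(a + 1, k + 1):
            pa, pb = X[labels == a], X[labels == b]
            if len(pa) == 0 or len(pb) == 0:
                continue
            cost = cluster_sse(np.vstack([pa, pb])) - cluster_sse(pa) - cluster_sse(pb)
            if cost < best_cost:
                best_pair, best_cost = (a, b), cost
    a, b = best_pair
    labels[labels == b] = a
    # relabel compactly to 0..k-1
    _, labels = np.unique(labels, return_inverse=True)
    return labels
```

On a toy 1-line dataset with three well-separated pairs of points, a single such step repairs a partition that wrongly merges two groups and wrongly splits a third, illustrating why alternating split and merge can escape poor local minima of the assignment.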
<p>In terms of solution accuracy, the experimental results show that the <italic>E</italic> value of the proposed algorithm relative to K-means is 17.45%, higher than that of the CDKM algorithm (9.45%) and also higher than those of the I-K-means-&#x002B;, split, and KWNF algorithms. This indicates that incorporating the basic split-merge criterion into the CDKM algorithm achieves better solution accuracy than the traditional split-merge algorithms. We also observed that the random swap algorithm, a variant of the split-merge criterion, has a slight <italic>E</italic> value advantage (17.91%) over the proposed algorithm (17.45%); both are significantly higher than the I-K-means-&#x002B; and split algorithms, which are based on the basic split-merge criterion. This suggests that the swap operation plays a significant role in further enhancing the solution accuracy of the split-merge criterion. As the number of clusters increases, the proposed algorithm shows increasingly greater improvements in the SSE index compared to K-means and the other tested algorithms, apart from the random swap algorithm. In terms of computational efficiency, the proposed algorithm is more efficient on high dimensional data than the other split-merge based algorithm, I-K-means-&#x002B;, as well as the remaining tested algorithms. The percentage speedup of the proposed algorithm relative to I-K-means-&#x002B;, K-means, and the latest KWNF algorithm is 37.59%, 11.60%, and 89.23%, respectively. As the number of clusters increases, the <italic>T</italic> value of the proposed algorithm relative to the other algorithms gradually increases, indicating progressively better time efficiency.</p>
<p>The split-merge criterion can significantly improve the solution accuracy of the K-means algorithm. However, because the CDKM model is fundamentally different from the K-means model, direct application of the split-merge criterion is not feasible. This paper therefore explores integrating the split-merge criterion, traditionally used with the K-means model, into the CDKM clustering model to enhance its solution accuracy. In traditional K-means, there are several split-merge methods and subsequent improvements, such as random swap clustering [<xref ref-type="bibr" rid="ref-24">24</xref>], that effectively enhance the solution accuracy of the original K-means model; these represent further advances built on the foundation of split-merge. In addition, several effective indexes can measure clustering performance. In future work, we plan to explore integrating these advanced methods and indexes into the CDKM model.</p>
</sec>
</body>
<back>
<ack>
<p>The authors would like to acknowledge the valuable feedback provided by the reviewers.</p>
</ack>
<sec><title>Funding Statement</title>
<p>This research was funded by National Defense Basic Research Program, grant number JCKY2019411B001. This research was funded by National Key Research and Development Program, grant number 2022YFC3601305. This research was funded by Key R&#x0026;D Projects of Jilin Provincial Science and Technology Department, grant number 20210203218SF.</p>
</sec>
<sec><title>Author Contributions</title>
<p>The authors confirm contribution to the paper as follows: study conception and design: Fuheng Qu, Yuhang Shi; data collection: Yuhang Shi; analysis and interpretation of results: Yuhang Shi, Yong Yang, Yating Hu; draft manuscript preparation: Yuhang Shi, Yuyao Liu. All authors reviewed the results and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability"><title>Availability of Data and Materials</title>
<p>All data used in this study are freely available and accessible. The sources of the data utilized in this research are thoroughly explained in the main manuscript.</p>
</sec>
<sec><title>Ethics Approval</title>
<p>Not applicable.</p>
</sec>
<sec sec-type="COI-statement"><title>Conflicts of Interest</title>
<p>The authors declare no conflicts of interest to report regarding the present study.</p>
</sec>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Jha</surname></string-name>, <string-name><given-names>G. P.</given-names> <surname>Joshi</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Nkenyereya</surname></string-name>, <string-name><given-names>D. W.</given-names> <surname>Kim</surname></string-name>, and <string-name><given-names>F.</given-names> <surname>Smarandache</surname></string-name></person-group>, &#x201C;<article-title>A direct data-cluster analysis method based on neutrosophic set implication</article-title>,&#x201D; <source>Comput. Mater. Contin.</source>, vol. <volume>65</volume>, no. <issue>2</issue>, pp. <fpage>1203</fpage>&#x2013;<lpage>1220</lpage>, <year>2020</year>. doi: <pub-id pub-id-type="doi">10.32604/cmc.2020.011618</pub-id>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>G. J.</given-names> <surname>Oyewole</surname></string-name> and <string-name><given-names>G. A.</given-names> <surname>Thopil</surname></string-name></person-group>, &#x201C;<article-title>Data clustering: Application and trends</article-title>,&#x201D; <source>Artif. Intell. Rev.</source>, vol. <volume>56</volume>, no. <issue>7</issue>, pp. <fpage>6439</fpage>&#x2013;<lpage>6475</lpage>, <year>2023</year>. doi: <pub-id pub-id-type="doi">10.1007/s10462-022-10325-y</pub-id>; <pub-id pub-id-type="pmid">36466764</pub-id></mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S. K.</given-names> <surname>Dubey</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Vijay</surname></string-name>, and <string-name><given-names>A.</given-names> <surname>Pratibha</surname></string-name></person-group>, &#x201C;<article-title>A review of image segmentation using clustering methods</article-title>,&#x201D; <source>Int. J. Appl. Eng. Res.</source>, vol. <volume>13</volume>, no. <issue>5</issue>, pp. <fpage>2484</fpage>&#x2013;<lpage>2489</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Jamjoom</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Elhadad</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Abulkasim</surname></string-name>, and <string-name><given-names>S.</given-names> <surname>Abbas</surname></string-name></person-group>, &#x201C;<article-title>Plant leaf diseases classification using improved K-means clustering and SVM algorithm for segmentation</article-title>,&#x201D; <source>Comput. Mater. Contin.</source>, vol. <volume>76</volume>, no. <issue>1</issue>, pp. <fpage>367</fpage>&#x2013;<lpage>382</lpage>, <year>2023</year>. doi: <pub-id pub-id-type="doi">10.32604/cmc.2023.037310</pub-id>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>R.</given-names> <surname>Aarthi</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Divya</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Komala</surname></string-name>, and <string-name><given-names>S.</given-names> <surname>Kavitha</surname></string-name></person-group>, &#x201C;<article-title>Application of feature extraction and clustering in mammogram classification using support vector machine</article-title>,&#x201D; in <conf-name>2011 Third Int. Conf. Adv. Comput.</conf-name>, <publisher-name>IEEE</publisher-name>, <year>2011</year>, pp. <fpage>62</fpage>&#x2013;<lpage>67</lpage>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Ahmed</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Seraj</surname></string-name>, and <string-name><given-names>S. M. S.</given-names> <surname>Islam</surname></string-name></person-group>, &#x201C;<article-title>The <italic>k-means</italic> algorithm: A comprehensive survey and performance evaluation</article-title>,&#x201D; <source>Electronics</source>, vol. <volume>9</volume>, no. <issue>8</issue>, <year>2020</year>, <comment>Art. no. 1295</comment>. doi: <pub-id pub-id-type="doi">10.3390/electronics9081295</pub-id>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A. M.</given-names> <surname>Ikotun</surname></string-name>, <string-name><given-names>A. E.</given-names> <surname>Ezugwu</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Abualigah</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Abuhaija</surname></string-name>, and <string-name><given-names>J.</given-names> <surname>Heming</surname></string-name></person-group>, &#x201C;<article-title>K-means clustering algorithms: A comprehensive review, variants analysis, and advances in the era of big data</article-title>,&#x201D; <source>Inf. Sci.</source>, vol. <volume>622</volume>, pp. <fpage>178</fpage>&#x2013;<lpage>210</lpage>, <year>2023</year>. doi: <pub-id pub-id-type="doi">10.1016/j.ins.2022.11.139</pub-id>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A. K.</given-names> <surname>Jain</surname></string-name></person-group>, &#x201C;<article-title>Data clustering: 50 years beyond K-means</article-title>,&#x201D; <source>Pattern Recognit. Lett.</source>, vol. <volume>31</volume>, no. <issue>8</issue>, pp. <fpage>651</fpage>&#x2013;<lpage>666</lpage>, <year>2010</year>. doi: <pub-id pub-id-type="doi">10.1016/j.patrec.2009.09.011</pub-id>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S. S.</given-names> <surname>Khan</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Ahmad</surname></string-name></person-group>, &#x201C;<article-title>Cluster center initialization algorithm for K-means clustering</article-title>,&#x201D; <source>Pattern Recognit. Lett.</source>, vol. <volume>25</volume>, no. <issue>11</issue>, pp. <fpage>1293</fpage>&#x2013;<lpage>1302</lpage>, <year>2004</year>. doi: <pub-id pub-id-type="doi">10.1016/j.patrec.2004.04.007</pub-id>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M. E.</given-names> <surname>Celebi</surname></string-name>, <string-name><given-names>H. A.</given-names> <surname>Kingravi</surname></string-name>, and <string-name><given-names>P. A.</given-names> <surname>Vela</surname></string-name></person-group>, &#x201C;<article-title>A comparative study of efficient initialization methods for the k-means clustering algorithm</article-title>,&#x201D; <source>Expert Syst. Appl.</source>, vol. <volume>40</volume>, no. <issue>1</issue>, pp. <fpage>200</fpage>&#x2013;<lpage>210</lpage>, <year>2013</year>. doi: <pub-id pub-id-type="doi">10.1016/j.eswa.2012.07.021</pub-id>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Gul</surname></string-name> and <string-name><given-names>M. A.</given-names> <surname>Rehman</surname></string-name></person-group>, &#x201C;<article-title>Big data: An optimized approach for cluster initialization</article-title>,&#x201D; <source>J. Big Data</source>, vol. <volume>10</volume>, no. <issue>1</issue>, <year>2023</year>, <comment>Art. no. 120</comment>. doi: <pub-id pub-id-type="doi">10.1186/s40537-023-00798-1</pub-id>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>T. K.</given-names> <surname>Biswas</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Giri</surname></string-name>, and <string-name><given-names>S.</given-names> <surname>Roy</surname></string-name></person-group>, &#x201C;<article-title>ECKM: An improved K-means clustering based on computational geometry</article-title>,&#x201D; <source>Expert Syst. Appl.</source>, vol. <volume>212</volume>, <year>2023</year>, <comment>Art. no. 118862</comment>. doi: <pub-id pub-id-type="doi">10.1016/j.eswa.2022.118862</pub-id>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Layeb</surname></string-name></person-group>, &#x201C;<article-title><italic>ck</italic>-means and <italic>fck</italic>-means: Two deterministic initialization procedures for K-means algorithm using a modified crowding distance</article-title>,&#x201D; <source>Acta Inform. Prag.</source>, vol. <volume>12</volume>, no. <issue>2</issue>, pp. <fpage>379</fpage>&#x2013;<lpage>399</lpage>, <year>2023</year>. doi: <pub-id pub-id-type="doi">10.18267/j.aip.223</pub-id>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>D.</given-names> <surname>Arthur</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Vassilvitskii</surname></string-name></person-group>, &#x201C;<article-title>k-means&#x002B;&#x002B;: The advantages of careful seeding</article-title>,&#x201D; in <conf-name>Proc. 18th Annu. ACM-SIAM Symp. Discrete Algorithms (SODA)</conf-name>, <year>2007</year>, pp. <fpage>1027</fpage>&#x2013;<lpage>1035</lpage>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Lattanzi</surname></string-name> and <string-name><given-names>C.</given-names> <surname>Sohler</surname></string-name></person-group>, &#x201C;<article-title>A better k-means&#x002B;&#x002B; algorithm via local search</article-title>,&#x201D; in <conf-name>Int. Conf. Mach. Learn.</conf-name>, <publisher-name>PMLR</publisher-name>, <year>2019</year>, pp. <fpage>3662</fpage>&#x2013;<lpage>3671</lpage>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>&#x015E;enol</surname></string-name></person-group>, &#x201C;<article-title>ImpKmeans: An improved version of the K-means algorithm, by determining optimum initial centroids, based on multivariate kernel density estimation and Kd-tree</article-title>,&#x201D; <source>Acta Polytechnica Hungarica</source>, vol. <volume>21</volume>, no. <issue>2</issue>, pp. <fpage>111</fpage>&#x2013;<lpage>131</lpage>, <year>2024</year>. doi: <pub-id pub-id-type="doi">10.12700/APH.21.2.2024.2.6</pub-id>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>D.</given-names> <surname>Reddy</surname></string-name>, <string-name><given-names>P. K.</given-names> <surname>Jana</surname></string-name>, and <string-name><given-names>I. S.</given-names> <surname>Member</surname></string-name></person-group>, &#x201C;<article-title>Initialization for K-means clustering using Voronoi diagram</article-title>,&#x201D; <source>Procedia Technol.</source>, vol. <volume>4</volume>, pp. <fpage>395</fpage>&#x2013;<lpage>400</lpage>, <year>2012</year>. doi: <pub-id pub-id-type="doi">10.1016/j.protcy.2012.05.061</pub-id>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>E.</given-names> <surname>Baratalipour</surname></string-name>, <string-name><given-names>S. J.</given-names> <surname>Kabudian</surname></string-name>, and <string-name><given-names>Z.</given-names> <surname>Fathi</surname></string-name></person-group>, &#x201C;<article-title>A new initialization method for k-means clustering</article-title>,&#x201D; in <conf-name>2024 20th CSI Int. Symp. Artif. Intell. Signal Process. (AISP)</conf-name>, <publisher-name>IEEE</publisher-name>, <year>2024</year>, pp. <fpage>1</fpage>&#x2013;<lpage>5</lpage>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S. J.</given-names> <surname>Wright</surname></string-name></person-group>, &#x201C;<article-title>Coordinate descent algorithms</article-title>,&#x201D; <source>Math. Program.</source>, vol. <volume>151</volume>, no. <issue>1</issue>, pp. <fpage>3</fpage>&#x2013;<lpage>34</lpage>, <year>2015</year>. doi: <pub-id pub-id-type="doi">10.1007/s10107-015-0892-3</pub-id>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>H. -J. M.</given-names> <surname>Shi</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Tu</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Xu</surname></string-name>, and <string-name><given-names>W.</given-names> <surname>Yin</surname></string-name></person-group>, &#x201C;<article-title>A primer on coordinate descent algorithms</article-title>,&#x201D; <year>2016</year>, <comment><italic>arXiv:1610.00040</italic></comment>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>F.</given-names> <surname>Nie</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Xue</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Wu</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Li</surname></string-name>, and <string-name><given-names>X.</given-names> <surname>Li</surname></string-name></person-group>, &#x201C;<article-title>Coordinate descent method for k-means</article-title>,&#x201D; <source>IEEE Trans. Pattern Anal. Mach. Intell.</source>, vol. <volume>44</volume>, no. <issue>5</issue>, pp. <fpage>2371</fpage>&#x2013;<lpage>2385</lpage>, <year>2021</year>. doi: <pub-id pub-id-type="doi">10.1109/TPAMI.2021.3085739</pub-id>; <pub-id pub-id-type="pmid">34061737</pub-id>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>T.</given-names> <surname>Kaukoranta</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Fr&#x00E4;nti</surname></string-name>, and <string-name><given-names>O.</given-names> <surname>Nevalainen</surname></string-name></person-group>, &#x201C;<article-title>Iterative split-and-merge algorithm for VQ codebook generation</article-title>,&#x201D; <source>Opt. Eng.</source>, vol. <volume>37</volume>, no. <issue>10</issue>, pp. <fpage>2726</fpage>&#x2013;<lpage>2732</lpage>, <year>Oct. 1998</year>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>P.</given-names> <surname>Fr&#x00E4;nti</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Kaukoranta</surname></string-name>, and <string-name><given-names>O.</given-names> <surname>Nevalainen</surname></string-name></person-group>, &#x201C;<article-title>On the splitting method for VQ codebook generation</article-title>,&#x201D; <source>Opt. Eng.</source>, vol. <volume>36</volume>, no. <issue>11</issue>, pp. <fpage>3043</fpage>&#x2013;<lpage>3051</lpage>, <year>Nov. 1997</year>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>P.</given-names> <surname>Fr&#x00E4;nti</surname></string-name></person-group>, &#x201C;<article-title>Efficiency of random swap clustering</article-title>,&#x201D; <source>J. Big Data</source>, vol. <volume>5</volume>, pp. <fpage>1</fpage>&#x2013;<lpage>29</lpage>, <year>2018</year>, <comment>Art. no. 13</comment>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Ismkhan</surname></string-name></person-group>, &#x201C;<article-title>Ik-means&#x2212;&#x002B;: An iterative clustering algorithm based on an enhanced version of the <italic>k</italic>-means</article-title>,&#x201D; <source>Pattern Recognit.</source>, vol. <volume>79</volume>, pp. <fpage>402</fpage>&#x2013;<lpage>413</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>F.</given-names> <surname>Nie</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Wang</surname></string-name>, and <string-name><given-names>X.</given-names> <surname>Li</surname></string-name></person-group>, &#x201C;<article-title>An effective and efficient algorithm for k-means clustering with new formulation</article-title>,&#x201D; <source>IEEE Trans. Knowl. Data Eng.</source>, vol. <volume>35</volume>, no. <issue>4</issue>, pp. <fpage>3433</fpage>&#x2013;<lpage>3443</lpage>, <year>2023</year>. doi: <pub-id pub-id-type="doi">10.1109/TKDE.2022.3155450</pub-id>.</mixed-citation></ref>
</ref-list>
</back></article>