<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">19776</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2022.019776</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Using Link-Based Consensus Clustering for Mixed-Type Data Analysis</article-title>
<alt-title alt-title-type="left-running-head">Using Link-Based Consensus Clustering for Mixed-Type Data Analysis</alt-title>
<alt-title alt-title-type="right-running-head">Using Link-Based Consensus Clustering for Mixed-Type Data Analysis</alt-title>
</title-group>
<contrib-group content-type="authors">
<contrib id="author-1" contrib-type="author"><name name-style="western"><surname>Boongoen</surname><given-names>Tossapon</given-names></name><xref ref-type="aff" rid="aff-1"/>
</contrib>
<contrib id="author-2" contrib-type="author" corresp="yes"><name name-style="western"><surname>Iam-On</surname><given-names>Natthakan</given-names></name><xref ref-type="aff" rid="aff-1"/><email>natthakan@mfu.ac.th</email>
</contrib>
<aff id="aff-1"><institution>Center of Excellence in Artificial Intelligence and Emerging Technologies, School of Information Technology, Mae Fah Luang University</institution>, <addr-line>Chiang Rai, 57100</addr-line>, <country>Thailand</country></aff>
</contrib-group>
<author-notes><corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Natthakan Iam-On. Email: <email>natthakan@mfu.ac.th</email></corresp>
</author-notes>
<pub-date pub-type="epub" date-type="pub" iso-8601-date="2021-08-30"><day>30</day><month>08</month><year>2021</year>
</pub-date>
<volume>70</volume>
<issue>1</issue>
<fpage>1993</fpage>
<lpage>2011</lpage>
<history>
<date date-type="received"><day>25</day><month>4</month><year>2021</year></date>
<date date-type="accepted"><day>06</day><month>6</month><year>2021</year></date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2022 Boongoen and Iam-On</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Boongoen and Iam-On</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_19776.pdf"></self-uri>
<abstract>
<p>A mix of numerical and nominal data types is commonly present in many modern-age data collections. Examples include banking data, sales histories and healthcare records, where continuous attributes like age and nominal ones like blood type jointly characterize account details, business transactions or individuals. However, only a few standard clustering techniques and consensus clustering methods have been provided to examine such data thus far. Given this insight, the paper introduces novel extensions of the link-based cluster ensemble, <inline-formula id="ieqn-201"><mml:math id="mml-ieqn-201"><mml:mrow><mml:mtext>LC</mml:mtext></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mtext>E</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>WCT</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-202"><mml:math id="mml-ieqn-202"><mml:mrow><mml:mtext>LC</mml:mtext></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mtext>E</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>WTQ</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>, that are accurate for analyzing mixed-type data. They promote diversity within an ensemble through different initializations of the k-prototypes algorithm as base clusterings and then refine the summarized data using a link-based approach. Based on the evaluation metric of NMI (Normalized Mutual Information), averaged across different combinations of benchmark datasets and experimental settings, these new models reach an improved level of 0.34, while the best model found in the literature obtains only around 0.24. In addition, the parameter analysis included herein helps enhance their performance even further, by relating clustering quality to algorithmic variables specific to the underlying link-based models. Moreover, ensemble size, another significant factor, is examined so as to justify the tradeoff between complexity and accuracy.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Cluster analysis</kwd>
<kwd>mixed-type data</kwd>
<kwd>consensus clustering</kwd>
<kwd>link analysis</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1"><label>1</label><title>Introduction</title>
<p>Cluster analysis has been widely used to explore the structure of a given dataset. This analytical tool is usually employed in the initial stage of data interpretation, especially for a new problem where prior knowledge is limited. The goal of acquiring knowledge from data sources has been a major driving force, making cluster analysis one of the most active research subjects. Over several decades, different clustering techniques have been devised and applied to a variety of problem domains, such as biological study [<xref ref-type="bibr" rid="ref-1">1</xref>], customer relationship management [<xref ref-type="bibr" rid="ref-2">2</xref>], information retrieval [<xref ref-type="bibr" rid="ref-3">3</xref>], image processing and machine vision [<xref ref-type="bibr" rid="ref-4">4</xref>], medicine and health care [<xref ref-type="bibr" rid="ref-5">5</xref>], pattern recognition [<xref ref-type="bibr" rid="ref-6">6</xref>], psychology [<xref ref-type="bibr" rid="ref-7">7</xref>] and recommender systems [<xref ref-type="bibr" rid="ref-8">8</xref>]. In addition, the recent development of clustering approaches for cancer gene expression data has attracted considerable interest among computer scientists and biological and clinical researchers [<xref ref-type="bibr" rid="ref-9">9</xref>,<xref ref-type="bibr" rid="ref-10">10</xref>].</p>
<p>Principally, the objective of cluster analysis is to divide data objects (or instances) into groups (or clusters) such that objects in the same cluster are more similar to each other than to those belonging to different clusters [<xref ref-type="bibr" rid="ref-11">11</xref>]. Objects under examination are normally described in terms of object-specific measurements (e.g., attribute values) or relative ones (e.g., pairwise dissimilarity). Unlike supervised learning, clustering is &#x2018;unsupervised&#x2019; and does not require class information, which is typically obtained through manual tagging of category labels on data objects by domain expert(s). While many supervised models inherently fail in the absence of data labels, data clustering has proven effective under this constraint. Given its potential, a large number of research studies focus on several aspects of cluster analysis: for instance, dissimilarity (or distance) metrics [<xref ref-type="bibr" rid="ref-12">12</xref>], the optimal number of clusters [<xref ref-type="bibr" rid="ref-13">13</xref>], the relevance of data attributes per cluster [<xref ref-type="bibr" rid="ref-14">14</xref>], evaluation of clustering results [<xref ref-type="bibr" rid="ref-15">15</xref>], cluster ensembles or consensus clustering [<xref ref-type="bibr" rid="ref-9">9</xref>], and clustering algorithms and extensions for particular types of data [<xref ref-type="bibr" rid="ref-16">16</xref>]. Specific to the last of these, to which this research belongs, only a few studies have concentrated on clustering mixed-type (numerical and nominal) data, compared to the numeric-only and nominal-only cases.</p>
<p>At present, the data mining community faces a challenge from large collections of mixed-type data, such as those gathered in the banking and health sectors: web/service access records and biological-clinical data. In the domain of health care, for instance, microarray expressions and clinical details are available for cancer diagnosis [<xref ref-type="bibr" rid="ref-17">17</xref>]. In response, a few clustering techniques have been introduced in the literature for this problem. Some simply transform the underlying mixed-type data to either a numeric-only or a nominal-only format, with which conventional clustering algorithms can be reused. From this perspective, k-means [<xref ref-type="bibr" rid="ref-18">18</xref>] is a typical alternative for the numerical domain, while dSqueezer [<xref ref-type="bibr" rid="ref-19">19</xref>], an extension of Squeezer [<xref ref-type="bibr" rid="ref-20">20</xref>], has been investigated for the other. Other attempts focus on defining a distance metric that is effective for evaluating dissimilarity amongst data objects in a mixed-type dimensional space. These include the k-means extensions k-prototypes [<xref ref-type="bibr" rid="ref-21">21</xref>] and k-centers [<xref ref-type="bibr" rid="ref-22">22</xref>].</p>
<p>Similar to most clustering methods, the aforementioned models are parameterized; thus, achieving optimal performance across diverse data collections may not be possible. At large, there are two major challenges inherent to mixed-type clustering algorithms. First, different techniques discover different structures (e.g., cluster size and shape) from the same set of data [<xref ref-type="bibr" rid="ref-23">23</xref>&#x2013;<xref ref-type="bibr" rid="ref-25">25</xref>]. For example, the extensions of k-means are suitable for spherical-shaped clusters. This is because each individual algorithm is designed to optimize a specific criterion. Second, a single clustering algorithm with different parameter settings can also reveal various structures on the same dataset. A specific setting may work well on a few datasets but prove less accurate on others.</p>
<p>A solution to this dilemma is to combine different clusterings into a single consensus clustering. This process, known as consensus clustering or cluster ensemble, has been reported to provide more robust and stable solutions across different problem domains and datasets [<xref ref-type="bibr" rid="ref-9">9</xref>,<xref ref-type="bibr" rid="ref-24">24</xref>]. Among state-of-the-art approaches, the link-based cluster ensemble, or LCE [<xref ref-type="bibr" rid="ref-26">26</xref>,<xref ref-type="bibr" rid="ref-27">27</xref>], usually delivers accurate clustering results in both numerical and nominal domains. Given this insight, the paper introduces the extension of LCE to mixed-type data clustering, with its contributions summarized as follows. Firstly, a new extension of LCE that makes use of k-prototypes as base clusterings is proposed. The resulting models have been assessed on benchmark datasets and compared to both basic and ensemble clustering techniques. Experimental results point out that the proposed extension usually outperforms the methods included in this empirical study. Secondly, a parameter analysis with respect to the algorithmic variables of LCE is conducted and emphasized as a guideline for further studies and applications. The rest of this paper is organized as follows. To set the scene for this work, Section 2 presents existing methods for mixed-type data clustering. Following that, Section 3 introduces the proposed extension of LCE, including ensemble generation and the estimation of link-based similarity. To assess its performance, the empirical evaluation in Section 4 is conducted on benchmark datasets with a rich collection of compared techniques. The paper is concluded in Section 5 with directions for future research.</p>
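As background to how base clusterings can be combined, the sketch below builds a pairwise co-association matrix from an ensemble of labelings. This is a generic consensus-clustering ingredient shown for illustration only; the function name is assumed, and this is not the link-based refinement proposed in this paper.

```python
import numpy as np

def coassociation_matrix(labelings):
    """Given M base clusterings over N objects (each a length-N label array),
    return an N x N matrix whose (i, j) entry is the fraction of base
    clusterings that place objects i and j in the same cluster."""
    labelings = np.asarray(labelings)            # shape (M, N)
    m, n = labelings.shape
    sim = np.zeros((n, n))
    for labels in labelings:
        # Boolean co-membership matrix for one base clustering.
        sim += (labels[:, None] == labels[None, :])
    return sim / m
```

A consensus partition can then be obtained by clustering this matrix, e.g., via hierarchical agglomeration on 1 minus the co-association values.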
</sec>
<sec id="s2"><label>2</label><title>Mixed-Type Data Clustering Methods</title>
<p>Following the success in numerical and nominal domains, a line of research has emerged with a focus on clustering mixed-type data. One of the initial attempts is the model of k-prototypes, which extends the classical k-means to clustering mixed numeric and categorical data [<xref ref-type="bibr" rid="ref-21">21</xref>]. It makes use of a heterogeneous proximity function to assess the dissimilarity between data objects and cluster prototypes (i.e., cluster centroids). While the Euclidean distance is exploited for the numerical case, the nominal dissimilarity can be derived directly from the number of mismatches between nominal values. This distance function for mixed-type data requires different weights for the contributions of numerical <italic>vs.</italic> nominal attributes, to avoid favoring either type of attribute. Let <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:mi>X</mml:mi><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>N</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula> be a set of <italic>N</italic> data objects and each <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mi>X</mml:mi></mml:math></inline-formula> is described by <italic>D</italic> attributes, where <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:mi>D</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:msub><mml:mi>D</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:msub><mml:mi>D</mml:mi><mml:mi>c</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>, i.e., the total number of numerical (<inline-formula id="ieqn-4"><mml:math
id="mml-ieqn-4"><mml:mrow><mml:msub><mml:mi>D</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>) and nominal (<inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:mrow><mml:msub><mml:mi>D</mml:mi><mml:mi>c</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>) attributes. The distance between an object <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mi>X</mml:mi></mml:math></inline-formula> and a cluster prototype <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:mover><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mi>p</mml:mi></mml:msub></mml:mrow><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover></mml:math></inline-formula> is estimated by the following equation.
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:mi>d</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mover><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mi>p</mml:mi></mml:msub></mml:mrow><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munderover><mml:mrow><mml:mo movablelimits="false">&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:msub><mml:mi>D</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:munderover><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mover><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover></mml:mrow><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:mrow><mml:mo>+</mml:mo><mml:mi>&#x03B3;</mml:mi><mml:munderover><mml:mrow><mml:mo movablelimits="false">&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mi>g</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:msub><mml:mi>D</mml:mi><mml:mi>c</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:munderover><mml:mo>&#x2061;</mml:mo><mml:mi>&#x03B4;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mover><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover></mml:mrow><mml:mo 
stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:math></disp-formula>
where <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:mi>&#x03B4;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>y</mml:mi><mml:mo>,</mml:mo><mml:mi>z</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula> if <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mi>z</mml:mi></mml:math></inline-formula> and 1 otherwise. In addition, <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> is a weight for nominal attributes. A large <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> suggests that the clustering process favors the nominal attributes, while a small value of <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> indicates that numerical attributes are emphasized.</p>
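As a minimal illustration of Eq. (1), the following Python sketch computes the k-prototypes dissimilarity between an object and a cluster prototype, each split into its numerical and nominal parts. The function name and the default value of the weight gamma are assumptions for illustration, not part of the original algorithm.

```python
def kprototypes_distance(x_num, x_cat, c_num, c_cat, gamma=1.0):
    """Mixed-type dissimilarity of Eq. (1): squared Euclidean distance over
    the numerical attributes plus a gamma-weighted count of mismatches over
    the nominal attributes (the delta term)."""
    numeric_part = sum((xj - cj) ** 2 for xj, cj in zip(x_num, c_num))
    nominal_part = sum(1 for xg, cg in zip(x_cat, c_cat) if xg != cg)
    return numeric_part + gamma * nominal_part
```

For example, with two numerical and two nominal attributes, one mismatching nominal value contributes exactly gamma to the total distance.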
<p>Besides the aforementioned, k-centers [<xref ref-type="bibr" rid="ref-22">22</xref>] is an extension of the k-prototypes algorithm. It focuses on the effect of attribute values with different frequencies on clustering accuracy. Unlike k-prototypes, which selects the most frequently occurring nominal attribute values as centroids, k-centers also takes low-frequency attribute values into account when forming centroids. Based on this idea, a new dissimilarity measure is defined. Specifically, the Euclidean distance is used for numerical attributes, while the nominal dissimilarity is derived from the similarity between corresponding nominal attributes. Let <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mi>X</mml:mi></mml:math></inline-formula> be a data object described by <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:mrow><mml:msub><mml:mi>D</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> numerical attributes and <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:mrow><mml:msub><mml:mi>D</mml:mi><mml:mi>c</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> nominal attributes.
The domain of nominal attribute <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:mrow><mml:msub><mml:mi>A</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> is denoted by <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>, where <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> is the number of attribute values of <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:mrow><mml:msub><mml:mi>A</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>. The definition of the distance between data object <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> and centroid <inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:mover><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mi>p</mml:mi></mml:msub></mml:mrow><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover></mml:math></inline-formula> is defined as follows.
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:mi>d</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mover><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mi>p</mml:mi></mml:msub></mml:mrow><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:munderover><mml:mrow><mml:mo movablelimits="false">&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:msub><mml:mi>D</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:munderover><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mover><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover></mml:mrow><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:mrow><mml:mo>+</mml:mo><mml:mi>&#x03B3;</mml:mi><mml:munderover><mml:mrow><mml:mo movablelimits="false">&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mi>g</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:msub><mml:mi>D</mml:mi><mml:mi>c</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:munderover><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>f</mml:mi><mml:mo 
stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mover><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:mrow><mml:mo>,</mml:mo></mml:math></disp-formula>
where <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mover><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>. The weight parameters <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> are for numerical and nominal attributes, respectively. 
According to [<xref ref-type="bibr" rid="ref-22">22</xref>], <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula> is set to 1, while a greater value is given to <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> if nominal attributes are to be emphasized, or a smaller value of <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> otherwise. A new definition of centroids is also introduced. For numerical attributes, a centroid is represented by the mean of the attribute values. For nominal attribute <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:mrow><mml:msub><mml:mi>A</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:mi>g</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:msub><mml:mi>D</mml:mi><mml:mi>c</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>, centroid <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:mover><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover></mml:math></inline-formula> is an <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> dimensional vector denoted as <inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo
stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, where <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> can be defined by the next equation.
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:mfrac><mml:mo>+</mml:mo><mml:munder><mml:mrow><mml:mo movablelimits="false">&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:msub><mml:mi>A</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:munder><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:mfrac><mml:mspace width="thickmathspace" /><mml:mo>&#x2212;</mml:mo><mml:mspace width="thickmathspace" /><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:mfrac></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:munder><mml:mrow><mml:mo movablelimits="false">&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:msub><mml:mi>A</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:munder><mml:mo>&#x2061;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mi>g</mml:mi><mml:mo 
stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:mfrac></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:math></disp-formula>
</p>
<p>where <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> denotes the number of data objects in the <italic>p</italic>th cluster with attribute value <inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:mrow><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>. Note that if attribute value <inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:mrow><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> does not exist in the <italic>p</italic>th cluster, <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>. The problem of selecting an appropriate clustering algorithm or parameter setting of any potential alternative has proven difficult, especially with a new set of data. In such a case where prior knowledge is generally minimal, the performance of any particular method is inherently uncertain. To obtain a more robust and accurate outcome, consensus clustering has been put forward and extensively investigated in the past decade. 
However, while a large number of cluster ensemble techniques for numerical data have been developed [<xref ref-type="bibr" rid="ref-24">24</xref>,<xref ref-type="bibr" rid="ref-26">26</xref>,<xref ref-type="bibr" rid="ref-28">28</xref>&#x2013;<xref ref-type="bibr" rid="ref-35">35</xref>], very few studies have extended such a methodology to mixed-type data clustering. Specific to this subject, the cluster ensemble framework of [<xref ref-type="bibr" rid="ref-36">36</xref>] uses the pairwise similarity concept [<xref ref-type="bibr" rid="ref-24">24</xref>], which was originally designed for continuous data. Though this research area has received little attention thus far, it is crucial to explore the true potential of cluster ensembles for such a problem. This motivates the present research, with the link-based framework being developed and evaluated herein.</p>
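To make the representation of Eq. (3) concrete, the following sketch computes the weight of one categorical value within a cluster from the per-value frequency counts. The function and argument names are illustrative, not from the paper; they assume the counts <inline-formula><mml:math><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> have already been tallied for the cluster.

```python
def categorical_weight(counts, value):
    """Weight c_pg(r) of one categorical value for one cluster (Eq. (3)).

    `counts` maps each value a_g(t) of attribute g observed in the p-th
    cluster to its frequency n_pg(t); `value` is the value a_g(r) whose
    weight is sought. Returns 0 when the value is absent from the cluster.
    """
    if value not in counts:
        return 0.0
    inv_r = 1.0 / counts[value]                      # 1 / n_pg(r)
    inv_sum = sum(1.0 / n for n in counts.values())  # sum over t of 1 / n_pg(t)
    # numerator of Eq. (3): 1/n_pg(r) + sum_t (1/n_pg(t) - 1/n_pg(r))
    numerator = inv_r + (inv_sum - len(counts) * inv_r)
    return numerator / inv_sum

# Two equally frequent values share the weight evenly:
categorical_weight({'a': 2, 'b': 2}, 'a')   # 0.5
```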
</sec>
<sec id="s3"><label>3</label><title>Link-Based Consensus Clustering for Mixed-Type Data</title>
<p>This section presents the proposed LCE framework for mixed-type data. It details the conceptual model, the ensemble generation strategies, the link-based similarity measures, and the consensus function used to create the final clustering result.</p>
<sec id="s3_1"><label>3.1</label><title>Problem Definition</title>
<p>The LCE approach was initially introduced for gene expression data analysis [<xref ref-type="bibr" rid="ref-9">9</xref>]. Unlike other methods, it explicitly models base clustering results as a link network from which the relations between and within these partitions can be obtained. In the current research, this consensus-clustering model is uniquely extended for the problem of clustering mixed-type data, which can be formulated as follows. Let <inline-formula id="ieqn-37"><mml:math id="mml-ieqn-37"><mml:mo>&#x220F;</mml:mo><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>&#x03C0;</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>&#x03C0;</mml:mi><mml:mi>M</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula> be a cluster ensemble with <italic>M</italic> base clusterings, each of which returns a set of clusters <inline-formula id="ieqn-38"><mml:math id="mml-ieqn-38"><mml:mrow><mml:msub><mml:mi>&#x03C0;</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mrow><mml:msubsup><mml:mi>C</mml:mi><mml:mn>1</mml:mn><mml:mi>g</mml:mi></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>C</mml:mi><mml:mn>2</mml:mn><mml:mi>g</mml:mi></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mi>C</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>k</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mi>g</mml:mi></mml:msubsup></mml:mrow><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>, such that <inline-formula id="ieqn-39"><mml:math 
id="mml-ieqn-39"><mml:munderover><mml:mrow><mml:mi>U</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:msub><mml:mi>k</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:munderover><mml:mo>&#x2061;</mml:mo><mml:msubsup><mml:mi>C</mml:mi><mml:mi>t</mml:mi><mml:mi>g</mml:mi></mml:msubsup></mml:math></inline-formula>, where <inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:mrow><mml:msub><mml:mi>k</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> is the number of clusters in the <italic>g</italic>th clustering. For each <inline-formula id="ieqn-41"><mml:math id="mml-ieqn-41"><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mi>X</mml:mi><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:mrow><mml:msup><mml:mi>C</mml:mi><mml:mi>g</mml:mi></mml:msup></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> denotes the cluster label in the <italic>g</italic>th base clustering to which data object <inline-formula id="ieqn-42"><mml:math id="mml-ieqn-42"><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> belongs, i.e., <inline-formula id="ieqn-43"><mml:math id="mml-ieqn-43"><mml:mrow><mml:msup><mml:mi>C</mml:mi><mml:mi>g</mml:mi></mml:msup></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msup><mml:mspace width="thickmathspace" /><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:msup><mml:mi>t</mml:mi><mml:mrow><mml:mi 
mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> if <inline-formula id="ieqn-44"><mml:math id="mml-ieqn-44"><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msubsup><mml:mi>C</mml:mi><mml:mi>t</mml:mi><mml:mi>g</mml:mi></mml:msubsup></mml:math></inline-formula>. The problem is to find a new partition <inline-formula id="ieqn-45"><mml:math id="mml-ieqn-45"><mml:mrow><mml:msup><mml:mi>&#x03C0;</mml:mi><mml:mo>&#x2217;</mml:mo></mml:msup></mml:mrow><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mrow><mml:msubsup><mml:mi>C</mml:mi><mml:mn>1</mml:mn><mml:mo>&#x2217;</mml:mo></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mi>C</mml:mi><mml:mi>K</mml:mi><mml:mo>&#x2217;</mml:mo></mml:msubsup></mml:mrow><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>, where <italic>K</italic> denotes the number of clusters in the final clustering result, of a data set <italic>X</italic> that summarizes the information from the cluster ensemble <inline-formula id="ieqn-46"><mml:math id="mml-ieqn-46"><mml:mrow><mml:mo>&#x220F;</mml:mo></mml:mrow></mml:math></inline-formula>.</p>
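The notation above can be illustrated with a minimal data structure: storing each base clustering as a list of labels makes <inline-formula><mml:math><mml:mrow><mml:msup><mml:mi>C</mml:mi><mml:mi>g</mml:mi></mml:msup></mml:math></inline-formula> a simple index lookup, and the cluster sets can be recovered by grouping object indices. The variable names are illustrative only.

```python
# An ensemble Pi of M = 2 base clusterings over N = 5 objects; each
# clustering is a label list, so C^g(x_i) is simply ensemble[g][i].
ensemble = [
    [0, 0, 1, 1, 2],   # pi_1 with k_1 = 3 clusters
    [0, 1, 1, 0, 1],   # pi_2 with k_2 = 2 clusters
]

def clusters_of(labels):
    """Recover the cluster sets {C_1^g, ..., C_kg^g} from a label list;
    their union is the whole data set X (here, the object indices)."""
    groups = {}
    for i, lab in enumerate(labels):
        groups.setdefault(lab, set()).add(i)
    return groups

pi_1 = clusters_of(ensemble[0])
# k_1 = 3 clusters, which together partition X = {0, ..., 4}
assert len(pi_1) == 3 and set().union(*pi_1.values()) == {0, 1, 2, 3, 4}
```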
</sec>
<sec id="s3_2"><label>3.2</label><title>LCE Framework for Mixed-Type Data Clustering</title>
<p>The extended LCE framework for the clustering of mixed-type data involves three steps: (i) creating a cluster ensemble <inline-formula id="ieqn-47"><mml:math id="mml-ieqn-47"><mml:mo>&#x220F;</mml:mo></mml:math></inline-formula>, (ii) aggregating base clustering results, <inline-formula id="ieqn-48"><mml:math id="mml-ieqn-48"><mml:mrow><mml:msub><mml:mi>&#x03C0;</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mo>&#x220F;</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:mi>g</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2026;</mml:mo><mml:mi>M</mml:mi></mml:math></inline-formula>, into a meta-level data matrix <inline-formula id="ieqn-49"><mml:math id="mml-ieqn-49"><mml:mi>R</mml:mi><mml:mrow><mml:msub><mml:mi>A</mml:mi><mml:mi>l</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> (with <italic>l</italic> being the link-based similarity measure used to deliver the matrix), and (iii) generating the final data partition <inline-formula id="ieqn-50"><mml:math id="mml-ieqn-50"><mml:mrow><mml:msup><mml:mi>&#x03C0;</mml:mi><mml:mo>&#x2217;</mml:mo></mml:msup></mml:mrow></mml:math></inline-formula> using the spectral graph partitioning (SPEC) algorithm. See <xref ref-type="fig" rid="fig-1">Fig. 1</xref> for the illustration of this framework.</p>
<fig id="fig-1"><label>Figure 1</label><caption><title>Framework of LCE extension to mixed-type data clustering</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_19776-fig-1.png"/></fig>
<sec id="s3_2_1"><label>3.2.1</label><title>Generating Cluster Ensemble</title>
<p>The proposed framework is generalized such that it can be coupled with several different ensemble generation methods. As for the present study, the following four types of ensembles are investigated. Unlike the original work in which the classical k-means is used to form base clusterings, the extended LCE obtains an ensemble by applying k-prototypes to mixed-type data (see <xref ref-type="fig" rid="fig-1">Fig. 1</xref> for details). Each base clustering is initialized with a random set of cluster prototypes. Also, the variable <inline-formula id="ieqn-51"><mml:math id="mml-ieqn-51"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> of k-prototypes is arbitrarily selected from the set of <inline-formula id="ieqn-52"><mml:math id="mml-ieqn-52"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mrow><mml:mn>0.1</mml:mn><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:mn>0.2</mml:mn><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:mn>0.3</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mn>5</mml:mn></mml:mrow><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>.</p>
<p><bold>Full-space &#x002B; Fixed-k</bold>: Each <inline-formula id="ieqn-53"><mml:math id="mml-ieqn-53"><mml:mrow><mml:msub><mml:mi>&#x03C0;</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mo>&#x220F;</mml:mo></mml:math></inline-formula>, is formed using data set <inline-formula id="ieqn-54"><mml:math id="mml-ieqn-54"><mml:mi>X</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi mathvariant="script">R</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>D</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> with all <italic>D</italic> attributes. The number of clusters in each base clustering is fixed to <inline-formula id="ieqn-55"><mml:math id="mml-ieqn-55"><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">&#x2308;</mml:mo><mml:mrow><mml:msqrt><mml:mi>N</mml:mi></mml:msqrt></mml:mrow><mml:mo fence="false" stretchy="false">&#x2309;</mml:mo></mml:math></inline-formula>. Intuitively, to obtain a meaningful partition, k becomes 50 if <inline-formula id="ieqn-56"><mml:math id="mml-ieqn-56"><mml:mo fence="false" stretchy="false">&#x2308;</mml:mo><mml:msqrt><mml:mi>N</mml:mi></mml:msqrt><mml:mo fence="false" stretchy="false">&#x2309;</mml:mo><mml:mo>&#x003E;</mml:mo><mml:mn>50</mml:mn></mml:math></inline-formula>.</p>
<p><bold>Full-space &#x002B; Random-k</bold>: Each <inline-formula id="ieqn-57"><mml:math id="mml-ieqn-57"><mml:mrow><mml:msub><mml:mi>&#x03C0;</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> is obtained using the data set with all attributes, and the number of clusters is randomly selected from the set <inline-formula id="ieqn-58"><mml:math id="mml-ieqn-58"><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mo fence="false" stretchy="false">&#x2308;</mml:mo><mml:msqrt><mml:mi>N</mml:mi></mml:msqrt><mml:mo fence="false" stretchy="false">&#x2309;</mml:mo></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>. Note that both &#x2018;Fixed-k&#x2019; and &#x2018;Random-k&#x2019; generation strategies were initially introduced in [<xref ref-type="bibr" rid="ref-30">30</xref>].</p>
<p><bold>Subspace &#x002B; Fixed-k</bold>: Each <inline-formula id="ieqn-59"><mml:math id="mml-ieqn-59"><mml:mrow><mml:msub><mml:mi>&#x03C0;</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> is created using the data set with a subset of original attributes, and the number of clusters is fixed to <inline-formula id="ieqn-60"><mml:math id="mml-ieqn-60"><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">&#x2308;</mml:mo><mml:msqrt><mml:mi>N</mml:mi></mml:msqrt><mml:mo fence="false" stretchy="false">&#x2309;</mml:mo></mml:math></inline-formula>. Following the study of [<xref ref-type="bibr" rid="ref-37">37</xref>] and [<xref ref-type="bibr" rid="ref-38">38</xref>], a data subspace <inline-formula id="ieqn-61"><mml:math id="mml-ieqn-61"><mml:msup><mml:mi>X</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="script">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:msup><mml:mi>D</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>is selected from the original data <inline-formula id="ieqn-62"><mml:math id="mml-ieqn-62"><mml:mi>X</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="script">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>D</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>, where <italic>D</italic> is the number of original attributes and <inline-formula id="ieqn-63"><mml:math id="mml-ieqn-63"><mml:msup><mml:mi>D</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>&#x003C;</mml:mo><mml:mi>D</mml:mi></mml:math></inline-formula>. 
In particular, <inline-formula id="ieqn-64"><mml:math id="mml-ieqn-64"><mml:msup><mml:mi>D</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is randomly chosen by the following.
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:msup><mml:mi>D</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:msub><mml:mi>D</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>+</mml:mo><mml:mo fence="false" stretchy="false">&#x230A;</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>D</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:msub><mml:mi>D</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo fence="false" stretchy="false">&#x230B;</mml:mo><mml:mo>,</mml:mo></mml:math></disp-formula>
where <inline-formula id="ieqn-65"><mml:math id="mml-ieqn-65"><mml:mi>&#x03B1;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> is a uniform random variable. In addition, <inline-formula id="ieqn-66"><mml:math id="mml-ieqn-66"><mml:mrow><mml:msub><mml:mi>D</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-67"><mml:math id="mml-ieqn-67"><mml:mrow><mml:msub><mml:mi>D</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> are user-specified parameters, which have the default values of <inline-formula id="ieqn-68"><mml:math id="mml-ieqn-68"><mml:mn>0.75</mml:mn><mml:mspace width="thickmathspace" /><mml:mi>D</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-69"><mml:math id="mml-ieqn-69"><mml:mn>0.85</mml:mn><mml:mspace width="thickmathspace" /><mml:mi>D</mml:mi></mml:math></inline-formula>, respectively.</p>
<p><bold>Subspace &#x002B; Random-k</bold>: Each <inline-formula id="ieqn-70"><mml:math id="mml-ieqn-70"><mml:mrow><mml:msub><mml:mi>&#x03C0;</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> is generated using the dataset with a subset of attributes, and the number of clusters is randomly selected from the set <inline-formula id="ieqn-71"><mml:math id="mml-ieqn-71"><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mo fence="false" stretchy="false">&#x2308;</mml:mo><mml:msqrt><mml:mi>N</mml:mi></mml:msqrt><mml:mo fence="false" stretchy="false">&#x2309;</mml:mo></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>.</p>
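The four generation strategies above differ only in how they pick the number of clusters and the subspace size; the sketch below draws those parameters for one base clustering, including the <inline-formula><mml:math><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> value of k-prototypes and the subspace size of Eq. (4). The function, its argument names, and the assumption that <inline-formula><mml:math><mml:mrow><mml:msub><mml:mi>D</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> defaults to 0.75<italic>D</italic> are illustrative, not the paper's implementation.

```python
import math
import random

def generation_params(N, D, strategy, d_min=None, d_max=None, rng=random):
    """Draw (k, D', gamma) for one base clustering under the four
    strategies described above. Names here are illustrative; the
    defaults d_min = 0.75*D and d_max = 0.85*D follow the text."""
    k_max = math.ceil(math.sqrt(N))
    # Fixed-k: ceil(sqrt(N)), capped at 50; Random-k: draw from {2, ..., ceil(sqrt(N))}
    k = min(k_max, 50) if 'Fixed' in strategy else rng.randint(2, k_max)
    if 'Subspace' in strategy:
        lo = d_min if d_min is not None else 0.75 * D
        hi = d_max if d_max is not None else 0.85 * D
        alpha = rng.random()                               # uniform alpha in [0, 1)
        d_prime = int(lo) + math.floor(alpha * (hi - lo))  # Eq. (4)
    else:
        d_prime = D                                        # full attribute space
    # gamma of k-prototypes is drawn at random from {0.1, 0.2, ..., 5.0}
    gamma = rng.choice([round(0.1 * i, 1) for i in range(1, 51)])
    return k, d_prime, gamma
```

Seeding `rng` with a `random.Random` instance makes each base clustering's configuration reproducible.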
</sec>
<sec id="s3_2_2"><label>3.2.2</label><title>Summarizing Multiple Clustering Results</title>
<p>Having obtained the ensemble <inline-formula id="ieqn-72"><mml:math id="mml-ieqn-72"><mml:mo>&#x220F;</mml:mo></mml:math></inline-formula>, the corresponding base clustering results are summarized into an information matrix <inline-formula id="ieqn-73"><mml:math id="mml-ieqn-73"><mml:mi>R</mml:mi><mml:mrow><mml:msub><mml:mi>A</mml:mi><mml:mi>l</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>P</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>, from which the final data partition <inline-formula id="ieqn-74"><mml:math id="mml-ieqn-74"><mml:mrow><mml:msup><mml:mi>&#x03C0;</mml:mi><mml:mo>&#x2217;</mml:mo></mml:msup></mml:mrow></mml:math></inline-formula> can be created. Note that <italic>P</italic> denotes the total number of clusters in the ensemble under examination. 
For each clustering <inline-formula id="ieqn-75"><mml:math id="mml-ieqn-75"><mml:mrow><mml:msub><mml:mi>&#x03C0;</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mo>&#x220F;</mml:mo></mml:mrow></mml:math></inline-formula> and their corresponding clusters <inline-formula id="ieqn-76"><mml:math id="mml-ieqn-76"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mrow><mml:msubsup><mml:mi>C</mml:mi><mml:mn>1</mml:mn><mml:mi>g</mml:mi></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mi>C</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>k</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mi>g</mml:mi></mml:msubsup></mml:mrow><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>, a matrix entry <inline-formula id="ieqn-77"><mml:math id="mml-ieqn-77"><mml:mi>R</mml:mi><mml:mrow><mml:msub><mml:mi>A</mml:mi><mml:mi>l</mml:mi></mml:msub></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mi>l</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> represents the association degree that data object <inline-formula id="ieqn-78"><mml:math id="mml-ieqn-78"><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mi>X</mml:mi></mml:math></inline-formula> has with each cluster <inline-formula id="ieqn-79"><mml:math id="mml-ieqn-79"><mml:mi>c</mml:mi><mml:mi>l</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mrow><mml:msubsup><mml:mi>C</mml:mi><mml:mn>1</mml:mn><mml:mi>g</mml:mi></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mi>C</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>k</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mi>g</mml:mi></mml:msubsup></mml:mrow><mml:mo 
fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>, which can be calculated by the next equation.
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:mi>R</mml:mi><mml:mrow><mml:msub><mml:mi>A</mml:mi><mml:mi>l</mml:mi></mml:msub></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mi>l</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mtable columnalign="left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mn>1</mml:mn></mml:mtd><mml:mtd><mml:mspace width="1em" /><mml:mrow><mml:mi>i</mml:mi><mml:mi>f</mml:mi><mml:mspace width="thickmathspace" /><mml:mi>c</mml:mi><mml:mi>l</mml:mi><mml:mo>=</mml:mo><mml:msubsup><mml:mi>C</mml:mi><mml:mo>&#x2217;</mml:mo><mml:mi>g</mml:mi></mml:msubsup><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>m</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>c</mml:mi><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:msubsup><mml:mi>C</mml:mi><mml:mo>&#x2217;</mml:mo><mml:mi>g</mml:mi></mml:msubsup><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mspace width="1em" /><mml:mrow><mml:mi>o</mml:mi><mml:mi>t</mml:mi><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>r</mml:mi><mml:mi>w</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:math></disp-formula>
where <inline-formula id="ieqn-80"><mml:math id="mml-ieqn-80"><mml:msubsup><mml:mi>C</mml:mi><mml:mo>&#x2217;</mml:mo><mml:mi>g</mml:mi></mml:msubsup><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is a cluster label to which sample <inline-formula id="ieqn-81"><mml:math id="mml-ieqn-81"><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> has been assigned. In addition, <inline-formula id="ieqn-82"><mml:math id="mml-ieqn-82"><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>m</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>x</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>y</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> denotes the similarity between any two clusters <inline-formula id="ieqn-83"><mml:math id="mml-ieqn-83"><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>x</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>y</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:msub><mml:mi>&#x03C0;</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>, which can be discovered using the link-based algorithm <italic>l</italic> presented next.</p>
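A small sketch of Eq. (5): for one base clustering, the block of the refined association matrix assigns 1 to each object's own cluster and the link-based similarity to every other cluster. The `sim` callable stands in for the WCT/WTQ measures defined next; the function name is illustrative.

```python
def refined_association(labels, sim):
    """One base clustering's block of the refined matrix RA_l (Eq. (5)).

    `labels[i]` is C^g(x_i); `sim(cl, own)` supplies the link-based
    cluster-to-cluster similarity for the non-membership entries."""
    clusters = sorted(set(labels))
    return [[1.0 if cl == own else sim(cl, own) for cl in clusters]
            for own in labels]

# With a constant stand-in similarity of 0.4:
refined_association([0, 0, 1], lambda a, b: 0.4)
# each row has a single 1.0 in the object's own-cluster column
```

Concatenating these blocks over all <italic>M</italic> base clusterings yields the full <inline-formula><mml:math><mml:mi>N</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>P</mml:mi></mml:math></inline-formula> matrix.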
<p><bold>Weighted Connected-Triple (WCT) Algorithm</bold>: This algorithm has been developed to evaluate the similarity between any pair of clusters <inline-formula id="ieqn-84"><mml:math id="mml-ieqn-84"><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>x</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>y</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mo>&#x220F;</mml:mo></mml:math></inline-formula>. At the outset, the ensemble <inline-formula id="ieqn-85"><mml:math id="mml-ieqn-85"><mml:mo>&#x220F;</mml:mo></mml:math></inline-formula> is represented as a weighted graph <inline-formula id="ieqn-86"><mml:math id="mml-ieqn-86"><mml:mi>G</mml:mi><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>V</mml:mi><mml:mo>,</mml:mo><mml:mi>W</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, where <italic>V</italic> is the set of vertices each representing a cluster in <inline-formula id="ieqn-87"><mml:math id="mml-ieqn-87"><mml:mo>&#x220F;</mml:mo></mml:math></inline-formula> and <italic>W</italic> is a set of weighted edges between clusters. 
The weight <inline-formula id="ieqn-88"><mml:math id="mml-ieqn-88"><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>x</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> assigned to the edge <inline-formula id="ieqn-89"><mml:math id="mml-ieqn-89"><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>x</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mi>W</mml:mi></mml:math></inline-formula> between <inline-formula id="ieqn-90"><mml:math id="mml-ieqn-90"><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>x</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>y</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mi>V</mml:mi></mml:math></inline-formula>, is estimated by the next equation.
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>x</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mrow><mml:msub><mml:mi>L</mml:mi><mml:mi>x</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mo>&#x2229;</mml:mo></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:msub><mml:mi>L</mml:mi><mml:mi>y</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mrow><mml:msub><mml:mi>L</mml:mi><mml:mi>x</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mo>&#x222A;</mml:mo></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:msub><mml:mi>L</mml:mi><mml:mi>y</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:math></disp-formula>
where <inline-formula id="ieqn-91"><mml:math id="mml-ieqn-91"><mml:mrow><mml:msub><mml:mi>L</mml:mi><mml:mi>z</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2282;</mml:mo><mml:mi>X</mml:mi></mml:math></inline-formula> denotes the set of data objects belonging to cluster <inline-formula id="ieqn-92"><mml:math id="mml-ieqn-92"><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>z</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mo>&#x220F;</mml:mo></mml:math></inline-formula>. Note that <italic>G</italic> is an undirected graph such that <inline-formula id="ieqn-93"><mml:math id="mml-ieqn-93"><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>x</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:math></inline-formula> is equivalent to <inline-formula id="ieqn-94"><mml:math id="mml-ieqn-94"><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>y</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mspace width="thickmathspace" /></mml:mrow><mml:mi mathvariant="normal">&#x2200;</mml:mi><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>x</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>y</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mtext>V</mml:mtext></mml:mrow></mml:math></inline-formula>. The WCT algorithm is summarized in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>. 
Following that, the similarity between clusters <inline-formula id="ieqn-95"><mml:math id="mml-ieqn-95"><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>x</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-96"><mml:math id="mml-ieqn-96"><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>y</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> can be estimated by the next equation.
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>m</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>x</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>y</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>W</mml:mi><mml:mi>C</mml:mi><mml:mrow><mml:msub><mml:mi>T</mml:mi><mml:mrow><mml:mi>x</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mrow><mml:mi>W</mml:mi><mml:mi>C</mml:mi><mml:mrow><mml:msub><mml:mi>T</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:mfrac><mml:mo>&#x00D7;</mml:mo><mml:mi>D</mml:mi><mml:mi>C</mml:mi><mml:mo>,</mml:mo></mml:math></disp-formula>
where <inline-formula id="ieqn-97"><mml:math id="mml-ieqn-97"><mml:mi>W</mml:mi><mml:mi>C</mml:mi><mml:mrow><mml:msub><mml:mi>T</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> is the maximum <inline-formula id="ieqn-98"><mml:math id="mml-ieqn-98"><mml:mi>W</mml:mi><mml:mi>C</mml:mi><mml:mrow><mml:msub><mml:mi>T</mml:mi><mml:mrow><mml:msup><mml:mi>x</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:msup><mml:mi>y</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> value of any two clusters <inline-formula id="ieqn-99"><mml:math id="mml-ieqn-99"><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:msup><mml:mi>x</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:msup><mml:mi>y</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mi>V</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-100"><mml:math id="mml-ieqn-100"><mml:mi>D</mml:mi><mml:mi>C</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> is a constant decay factor (i.e., confidence level of accepting two non-identical clusters as being similar). 
With this link-based similarity metric, <inline-formula id="ieqn-101"><mml:math id="mml-ieqn-101"><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>m</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>x</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>y</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> with <inline-formula id="ieqn-102"><mml:math id="mml-ieqn-102"><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>m</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>x</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>x</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi mathvariant="normal">&#x2200;</mml:mi><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>x</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mi>V</mml:mi></mml:math></inline-formula>. 
It is also symmetric such that <inline-formula id="ieqn-103"><mml:math id="mml-ieqn-103"><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>m</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>x</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>y</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>m</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>y</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>x</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>.</p>
<fig id="fig-2"><label>Figure 2</label><caption><title>The summarization of WCT algorithm</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_19776-fig-2.png"/></fig>
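As an illustrative sketch (not part of the original method description), the WCT-based normalization above can be implemented as follows, assuming a precomputed pairwise WCT matrix; the function name and matrix layout are assumptions for illustration.

```python
import numpy as np

def link_similarity(wct, dc=0.9):
    """Normalize a pairwise WCT matrix into sim(Cx, Cy) = WCT_xy / WCT_max * DC,
    with sim(Cx, Cx) = 1 on the diagonal."""
    wct = np.asarray(wct, dtype=float)
    off = ~np.eye(len(wct), dtype=bool)           # ignore the diagonal when taking the max
    wct_max = wct[off].max()
    sim = np.where(off, dc * wct / wct_max, 1.0)  # DC in [0, 1] is the decay factor
    return sim
```

The decay factor DC simply scales every off-diagonal similarity, so only the diagonal entries can reach 1.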
<p><bold>Weighted Triple-Quality (WTQ) Algorithm</bold>: WTQ is inspired by the initial measure of [<xref ref-type="bibr" rid="ref-39">39</xref>], as it discriminates the quality of shared triples between a pair of vertices in question. Specifically, the quality of each vertex is determined by the rarity of links connecting itself to other vertices in a network. With a weighted graph <inline-formula id="ieqn-104"><mml:math id="mml-ieqn-104"><mml:mi>G</mml:mi><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>V</mml:mi><mml:mo>,</mml:mo><mml:mi>W</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, the WTQ measure of vertices <inline-formula id="ieqn-105"><mml:math id="mml-ieqn-105"><mml:mrow><mml:msub><mml:mi>v</mml:mi><mml:mi>x</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>v</mml:mi><mml:mi>y</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mi>V</mml:mi></mml:math></inline-formula> with respect to each centre of a triple <inline-formula id="ieqn-106"><mml:math id="mml-ieqn-106"><mml:mrow><mml:msub><mml:mi>v</mml:mi><mml:mi>z</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mi>V</mml:mi></mml:math></inline-formula>, is estimated by
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:mi>W</mml:mi><mml:mi>T</mml:mi><mml:msubsup><mml:mi>Q</mml:mi><mml:mrow><mml:mi>x</mml:mi><mml:mi>y</mml:mi></mml:mrow><mml:mi>z</mml:mi></mml:msubsup><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mrow><mml:msub><mml:mi>W</mml:mi><mml:mi>z</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:mfrac><mml:mspace width="thickmathspace" /><mml:mo>,</mml:mo></mml:math></disp-formula>
provided that
<disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:mrow><mml:msub><mml:mi>W</mml:mi><mml:mi>z</mml:mi></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:munder><mml:mrow><mml:mo movablelimits="false">&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2200;</mml:mi><mml:mrow><mml:msub><mml:mi>v</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mspace width="thickmathspace" /><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>z</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:munder><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>z</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:math></disp-formula>
where <inline-formula id="ieqn-107"><mml:math id="mml-ieqn-107"><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>z</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2282;</mml:mo><mml:mi>V</mml:mi></mml:math></inline-formula> denotes the set of vertices that are directly linked to the vertex <inline-formula id="ieqn-108"><mml:math id="mml-ieqn-108"><mml:mrow><mml:msub><mml:mi>v</mml:mi><mml:mi>z</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>, such that <inline-formula id="ieqn-109"><mml:math id="mml-ieqn-109"><mml:mi mathvariant="normal">&#x2200;</mml:mi><mml:mrow><mml:msub><mml:mi>v</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>z</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>z</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mi>W</mml:mi></mml:math></inline-formula>. Pseudocode for the WTQ measure is given in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>. Following that, the similarity between clusters <inline-formula id="ieqn-110"><mml:math id="mml-ieqn-110"><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>x</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-111"><mml:math id="mml-ieqn-111"><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>y</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> can be estimated by
<disp-formula id="eqn-10"><label>(10)</label><mml:math id="mml-eqn-10" display="block"><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>m</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>x</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>y</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>W</mml:mi><mml:mi>T</mml:mi><mml:mrow><mml:msub><mml:mi>Q</mml:mi><mml:mrow><mml:mi>x</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mrow><mml:mi>W</mml:mi><mml:mi>T</mml:mi><mml:mrow><mml:msub><mml:mi>Q</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:mfrac><mml:mo>&#x00D7;</mml:mo><mml:mi>D</mml:mi><mml:mi>C</mml:mi><mml:mo>,</mml:mo></mml:math></disp-formula>
where <inline-formula id="ieqn-112"><mml:math id="mml-ieqn-112"><mml:mi>W</mml:mi><mml:mi>T</mml:mi><mml:mrow><mml:msub><mml:mi>Q</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> is the maximum <inline-formula id="ieqn-113"><mml:math id="mml-ieqn-113"><mml:mi>W</mml:mi><mml:mi>T</mml:mi><mml:mrow><mml:msub><mml:mi>Q</mml:mi><mml:mrow><mml:msup><mml:mi>x</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:msup><mml:mi>y</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> value of any two clusters and <inline-formula id="ieqn-114"><mml:math id="mml-ieqn-114"><mml:mi>D</mml:mi><mml:mi>C</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> is a decay factor.</p>
<fig id="fig-3"><label>Figure 3</label><caption><title>The summarization of WTQ algorithm</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_19776-fig-3.png"/></fig>
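A minimal sketch of Eqs. (8)-(10), assuming (as suggested by Eq. (10)) that the pairwise quantity WTQ_xy accumulates the per-centre values of Eq. (8) over every triple centre shared by the two clusters; function names are illustrative, not from the paper.

```python
import numpy as np

def wtq(W, x, y):
    """Sum 1 / W_z over every shared triple centre z, where W_z is the total
    absolute edge weight incident to z (Eqs. (8)-(9))."""
    W = np.asarray(W, dtype=float)
    total = 0.0
    for z in range(len(W)):
        if z in (x, y):
            continue
        if W[x, z] > 0 and W[y, z] > 0:        # z is linked to both x and y
            total += 1.0 / np.abs(W[z]).sum()  # rare links make the centre more informative
    return total

def wtq_similarity(W, dc=0.9):
    """sim(Cx, Cy) = WTQ_xy / WTQ_max * DC, as in Eq. (10)."""
    n = len(W)
    q = np.array([[wtq(W, x, y) for y in range(n)] for x in range(n)])
    off = ~np.eye(n, dtype=bool)
    return np.where(off, dc * q / q[off].max(), 1.0)
```

A triple centre connected to many clusters contributes little (large W_z), mirroring the rarity argument in the text.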
</sec>
<sec id="s3_2_3"><label>3.2.3</label><title>Creating Final Data Partition</title>
<p>Having acquired <inline-formula id="ieqn-115"><mml:math id="mml-ieqn-115"><mml:mi>R</mml:mi><mml:mrow><mml:msub><mml:mi>A</mml:mi><mml:mi>l</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>, the spectral graph-partitioning (SPEC) algorithm [<xref ref-type="bibr" rid="ref-40">40</xref>] is used to create the final data partition. This technique was first introduced by [<xref ref-type="bibr" rid="ref-28">28</xref>] as part of the Hybrid Bipartite Graph Formation (HBGF) framework. In particular, SPEC is exploited to divide a bipartite graph, which is transformed from the matrix <inline-formula id="ieqn-116"><mml:math id="mml-ieqn-116"><mml:mi>B</mml:mi><mml:mi>A</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:msup><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>P</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> (a crisp variation of <inline-formula id="ieqn-117"><mml:math id="mml-ieqn-117"><mml:mi>R</mml:mi><mml:mrow><mml:msub><mml:mi>A</mml:mi><mml:mi>l</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>), into <italic>K</italic> clusters. Given this insight, HBGF can be considered the baseline model of LCE. The process of generating the final data partition <inline-formula id="ieqn-118"><mml:math id="mml-ieqn-118"><mml:mrow><mml:msup><mml:mi>&#x03C0;</mml:mi><mml:mo>&#x2217;</mml:mo></mml:msup></mml:mrow></mml:math></inline-formula> from this <inline-formula id="ieqn-119"><mml:math id="mml-ieqn-119"><mml:mi>R</mml:mi><mml:mrow><mml:msub><mml:mi>A</mml:mi><mml:mi>l</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> matrix is summarized as follows. 
At first, a weighted bipartite graph <inline-formula id="ieqn-120"><mml:math id="mml-ieqn-120"><mml:msup><mml:mi>G</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mi>V</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mi>W</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is constructed from the matrix <inline-formula id="ieqn-121"><mml:math id="mml-ieqn-121"><mml:mi>R</mml:mi><mml:mrow><mml:msub><mml:mi>A</mml:mi><mml:mi>l</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>, where <inline-formula id="ieqn-122"><mml:math id="mml-ieqn-122"><mml:mrow><mml:msup><mml:mi>V</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:msup><mml:mi>V</mml:mi><mml:mi>X</mml:mi></mml:msup></mml:mrow><mml:mrow><mml:mo>&#x222A;</mml:mo><mml:mspace width="thickmathspace" /></mml:mrow><mml:mrow><mml:msup><mml:mi>V</mml:mi><mml:mi>C</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula> is a set of vertices representing both data objects <inline-formula id="ieqn-123"><mml:math id="mml-ieqn-123"><mml:mrow><mml:msup><mml:mi>V</mml:mi><mml:mi>X</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula> and clusters <inline-formula id="ieqn-124"><mml:math id="mml-ieqn-124"><mml:mrow><mml:msup><mml:mi>V</mml:mi><mml:mi>C</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula>, and <inline-formula id="ieqn-125"><mml:math id="mml-ieqn-125"><mml:msup><mml:mi>W</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> denotes a set of weighted edges. 
The weight <inline-formula id="ieqn-126"><mml:math id="mml-ieqn-126"><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:math></inline-formula> of edge <inline-formula id="ieqn-127"><mml:math id="mml-ieqn-127"><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> connecting vertices <inline-formula id="ieqn-128"><mml:math id="mml-ieqn-128"><mml:mrow><mml:msub><mml:mi>v</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>v</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>V</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, can be defined by
<list list-type="bullet">
<list-item><p><inline-formula id="ieqn-129"><mml:math id="mml-ieqn-129"><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula> when <inline-formula id="ieqn-130"><mml:math id="mml-ieqn-130"><mml:mrow><mml:msub><mml:mi>v</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>v</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:msup><mml:mi>V</mml:mi><mml:mi>X</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula> or <inline-formula id="ieqn-131"><mml:math id="mml-ieqn-131"><mml:mrow><mml:msub><mml:mi>v</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>v</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:msup><mml:mi>V</mml:mi><mml:mi>C</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula>.</p></list-item>
<list-item><p>Otherwise, <inline-formula id="ieqn-132"><mml:math id="mml-ieqn-132"><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>R</mml:mi><mml:mrow><mml:msub><mml:mi>A</mml:mi><mml:mi>l</mml:mi></mml:msub></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>v</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>v</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> when <inline-formula id="ieqn-133"><mml:math id="mml-ieqn-133"><mml:mrow><mml:msub><mml:mi>v</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:msup><mml:mi>V</mml:mi><mml:mi>X</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-134"><mml:math id="mml-ieqn-134"><mml:mrow><mml:msub><mml:mi>v</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:msup><mml:mi>V</mml:mi><mml:mi>C</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula>. 
Note that <inline-formula id="ieqn-135"><mml:math id="mml-ieqn-135"><mml:msup><mml:mi>G</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is bi-directional such that <inline-formula id="ieqn-136"><mml:math id="mml-ieqn-136"><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>j</mml:mi><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:math></inline-formula>. In other words, <inline-formula id="ieqn-137"><mml:math id="mml-ieqn-137"><mml:msup><mml:mi>W</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>N</mml:mi><mml:mo>+</mml:mo><mml:mi>P</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>N</mml:mi><mml:mo>+</mml:mo><mml:mi>P</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> can also be specified as
<disp-formula id="eqn-11"><label>(11)</label><mml:math id="mml-eqn-11" display="block"><mml:msup><mml:mi>W</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mtable columnalign="left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mspace width="1em" /><mml:mrow><mml:mi>R</mml:mi><mml:mrow><mml:msub><mml:mi>A</mml:mi><mml:mi>l</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mi>R</mml:mi><mml:msubsup><mml:mi>A</mml:mi><mml:mi>l</mml:mi><mml:mi>T</mml:mi></mml:msubsup></mml:mrow></mml:mtd><mml:mtd><mml:mspace width="1em" /><mml:mn>0</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula>
</p></list-item>
</list>
After that, the <italic>K</italic> largest eigenvectors <inline-formula id="ieqn-138"><mml:math id="mml-ieqn-138"><mml:mrow><mml:msub><mml:mi>u</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>u</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>u</mml:mi><mml:mi>K</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> of <inline-formula id="ieqn-139"><mml:math id="mml-ieqn-139"><mml:msup><mml:mi>W</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> are used to produce the matrix <inline-formula id="ieqn-140"><mml:math id="mml-ieqn-140"><mml:mi>U</mml:mi><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>u</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow><mml:mspace width="thickmathspace" /><mml:mrow><mml:msub><mml:mi>u</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow><mml:mo>&#x2026;</mml:mo><mml:mspace width="thickmathspace" /><mml:mrow><mml:msub><mml:mi>u</mml:mi><mml:mi>K</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, in which the eigenvectors are stacked in columns. Then, another matrix <inline-formula id="ieqn-141"><mml:math id="mml-ieqn-141"><mml:mrow><mml:msup><mml:mi>U</mml:mi><mml:mo>&#x2217;</mml:mo></mml:msup></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>N</mml:mi><mml:mo>+</mml:mo><mml:mi>P</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x00D7;</mml:mo><mml:mi>K</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> is formed by normalizing each row of <italic>U</italic> to have a unit length. 
By considering each row of <inline-formula id="ieqn-142"><mml:math id="mml-ieqn-142"><mml:mrow><mml:msup><mml:mi>U</mml:mi><mml:mo>&#x2217;</mml:mo></mml:msup></mml:mrow></mml:math></inline-formula> as <inline-formula id="ieqn-143"><mml:math id="mml-ieqn-143"><mml:mi>K</mml:mi></mml:math></inline-formula>-dimensional embedding of a graph vertex or a sample in <inline-formula id="ieqn-144"><mml:math id="mml-ieqn-144"><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mi>K</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula>, k-means is finally used to generate the final partition <inline-formula id="ieqn-145"><mml:math id="mml-ieqn-145"><mml:mrow><mml:msup><mml:mi>&#x03C0;</mml:mi><mml:mo>&#x2217;</mml:mo></mml:msup></mml:mrow><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mrow><mml:msubsup><mml:mi>C</mml:mi><mml:mn>1</mml:mn><mml:mo>&#x2217;</mml:mo></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mi>C</mml:mi><mml:mi>K</mml:mi><mml:mo>&#x2217;</mml:mo></mml:msubsup></mml:mrow><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula> of <italic>K</italic> clusters.</p>
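The whole SPEC step above can be sketched as follows; the eigen-decomposition and row normalization follow the text, while the k-means stage is written as a plain Lloyd loop with farthest-point seeding, which is an implementation choice of this sketch rather than the paper's.

```python
import numpy as np

def spec_partition(RA, K, iters=50):
    """Sketch of the SPEC step of HBGF: embed the bipartite association graph
    with the K largest eigenvectors of W' and cluster the rows with k-means."""
    N, P = RA.shape
    W = np.zeros((N + P, N + P))
    W[:N, N:] = RA                                  # W' = [[0, RA], [RA^T, 0]]
    W[N:, :N] = RA.T
    vals, vecs = np.linalg.eigh(W)                  # W' is symmetric
    U = vecs[:, np.argsort(vals)[-K:]]              # K largest eigenvectors as columns
    norms = np.linalg.norm(U, axis=1, keepdims=True)
    U = U / np.where(norms == 0, 1.0, norms)        # row-normalize to unit length
    centres = [U[0]]                                # farthest-point seeding
    for _ in range(1, K):
        d = np.min([((U - c) ** 2).sum(1) for c in centres], axis=0)
        centres.append(U[int(np.argmax(d))])
    centres = np.array(centres)
    for _ in range(iters):                          # Lloyd iterations
        labels = np.argmin(((U[:, None] - centres) ** 2).sum(-1), axis=1)
        for k in range(K):
            if np.any(labels == k):
                centres[k] = U[labels == k].mean(0)
    return labels[:N]                               # labels of the data-object vertices
```

Both data-object and cluster vertices are embedded and clustered together; only the first N labels (the data objects) form the final partition.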
</sec>
</sec>
</sec>
<sec id="s4"><label>4</label><title>Performance Evaluation</title>
<p>To obtain a rigorous assessment of LCE for mixed-type data clustering, this section presents the evaluation framework that has been systematically designed and employed.</p>
<sec id="s4_1"><label>4.1</label><title>Investigated Datasets</title>
<p>Five benchmark datasets obtained from the UCI repository
  [<xref ref-type="bibr" rid="ref-41">41</xref>] are included in this investigation, with <xref ref-type="table" rid="table-1">Tab. 1</xref> giving their details. <italic>Abalone</italic> consists of 4,177 instances, where eight physical measurements are used to divide these data into 28 age groups of abalone. There is only one categorical attribute, while the rest are continuous. <italic>Acute Inflammations</italic> was originally created by a medical expert to assess a decision support system that performs the presumptive diagnosis of two diseases of the urinary system: acute inflammation of the urinary bladder and acute nephritis [<xref ref-type="bibr" rid="ref-42">42</xref>]. There are 120 instances, each representing a potential patient with six symptom attributes (1 numerical and 5 categorical). <italic>Heart Disease</italic> contains 303 records of patients collected from the Cleveland Clinic Foundation. Each record is described by 13 attributes (5 numerical and 8 nominal) regarding heart disease diagnosis. This dataset is divided into two classes referring to the presence and absence of heart disease in the examined patients. <italic>Horse Colic</italic> has 368 data records of injured horses, each of which is described by 27 attributes (7 numerical and 19 nominal). These instances are categorized into two classes: &#x2018;Yes&#x2019;, indicating that the lesion is surgical, and &#x2018;No&#x2019; otherwise. About 30&#x0025; of the original attribute values are missing. For simplicity, missing nominal values in this dataset are all treated as an additional nominal value. For missing numerical values, the mean of the corresponding attribute is used. <italic>Mammographic Masses</italic> contains mammogram data of 961 patient records collected at the Institute of Radiology of the University Erlangen-Nuremberg between 2003 and 2006. The five attributes describing each record are the BI-RADS assessment, patient age and three BI-RADS attributes. 
This dataset possesses two class labels referring to the severity of a mammographic mass lesion: benign (516 instances) and malignant (445 instances).</p>
<table-wrap id="table-1"><label>Table 1</label><caption><title>Description of datasets: number of data points (<inline-formula id="ieqn-185"><mml:math id="mml-ieqn-185"><mml:mrow><mml:mi mathvariant="bold-italic">N</mml:mi></mml:mrow></mml:math></inline-formula>), attributes (<inline-formula id="ieqn-186"><mml:math id="mml-ieqn-186"><mml:mrow><mml:mi mathvariant="bold-italic">D</mml:mi></mml:mrow></mml:math></inline-formula>) and number of classes (<inline-formula id="ieqn-187"><mml:math id="mml-ieqn-187"><mml:mrow><mml:mi mathvariant="bold-italic">K</mml:mi></mml:mrow></mml:math></inline-formula>)</title></caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Dataset</th>
<th>Data points (<inline-formula id="ieqn-188"><mml:math id="mml-ieqn-188"><mml:mrow><mml:mi mathvariant="bold-italic">N</mml:mi></mml:mrow></mml:math></inline-formula>)</th>
<th>Attributes (<inline-formula id="ieqn-189"><mml:math id="mml-ieqn-189"><mml:mrow><mml:mi mathvariant="bold-italic">D</mml:mi></mml:mrow></mml:math></inline-formula>)</th>
<th>Classes (<inline-formula id="ieqn-190"><mml:math id="mml-ieqn-190"><mml:mrow><mml:mi mathvariant="bold-italic">K</mml:mi></mml:mrow></mml:math></inline-formula>)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Abalone</td>
<td>4,177</td>
<td>8</td>
<td>28</td>
</tr>
<tr>
<td>Acute inflammations</td>
<td>120</td>
<td>6</td>
<td>2</td>
</tr>
<tr>
<td>Heart disease</td>
<td>303</td>
<td>13</td>
<td>2</td>
</tr>
<tr>
<td>Horse colic</td>
<td>368</td>
<td>27</td>
<td>2</td>
</tr>
<tr>
<td>Mammographic masses</td>
<td>961</td>
<td>5</td>
<td>2</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4_2"><label>4.2</label><title>Experimental Design</title>
<p>This experiment aims to examine the quality of the LCE<sub>WCT</sub> and LCE<sub>WTQ</sub> extensions of LCE for clustering mixed numeric and nominal data. For these extended models where k-prototypes is used for creating a cluster ensemble, the parameter <inline-formula id="ieqn-146"><mml:math id="mml-ieqn-146"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> of this base clustering algorithm is randomly selected from <inline-formula id="ieqn-147"><mml:math id="mml-ieqn-147"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mrow><mml:mn>0.1</mml:mn><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:mn>0.2</mml:mn><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:mn>5</mml:mn></mml:mrow><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>. The results with LCE models are compared against a large number of standard clustering techniques and advanced cluster ensemble approaches. First, this includes four standard clustering algorithms: k-prototypes, k-centers, k-means (KM) and dSqueezer. In particular, the weight parameter <inline-formula id="ieqn-148"><mml:math id="mml-ieqn-148"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> is randomly selected from <inline-formula id="ieqn-149"><mml:math id="mml-ieqn-149"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mrow><mml:mn>0.1</mml:mn><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:mn>0.2</mml:mn><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:mn>5</mml:mn></mml:mrow><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula> for each run of k-prototypes and k-centers. 
To apply k-means, a mixed-type dataset needs to be pre-processed such that each nominal attribute is transformed into <inline-formula id="ieqn-150"><mml:math id="mml-ieqn-150"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula> new binary-value features, where <inline-formula id="ieqn-151"><mml:math id="mml-ieqn-151"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula> is the number of distinct values of that nominal attribute. For the case of dSqueezer, each numerical data attribute has to be mapped to the corresponding categorical domain using the discretisation method explained by [<xref ref-type="bibr" rid="ref-19">19</xref>]. The set of compared methods also contains twelve different cluster ensemble techniques that have been reported in the literature for their effectiveness in combining clustering results: four graph-based methods of HBGF [<xref ref-type="bibr" rid="ref-28">28</xref>], CSPA [<xref ref-type="bibr" rid="ref-32">32</xref>], HGPA [<xref ref-type="bibr" rid="ref-32">32</xref>] and MCLA [<xref ref-type="bibr" rid="ref-32">32</xref>]; two pairwise-similarity based methods [<xref ref-type="bibr" rid="ref-24">24</xref>] of EAC-SL and EAC-AL; and six feature-based methods of IVC [<xref ref-type="bibr" rid="ref-43">43</xref>], MM [<xref ref-type="bibr" rid="ref-33">33</xref>], QMI [<xref ref-type="bibr" rid="ref-33">33</xref>], AGG<sub>F</sub> [<xref ref-type="bibr" rid="ref-29">29</xref>], AGG<sub>LSF</sub> [<xref ref-type="bibr" rid="ref-29">29</xref>] and AGG<sub>LSR</sub> [<xref ref-type="bibr" rid="ref-29">29</xref>]. The experimental settings employed in this evaluation are listed below. Note that the performance of standard clustering algorithms is always assessed over the original data, without using any information of cluster ensembles.
<list list-type="bullet">
<list-item><p>Cluster ensemble methods are investigated using four different ensemble types: Full-space &#x002B; Fixed-k, Full-space &#x002B; Random-k, Subspace &#x002B; Fixed-k, and Sub-space &#x002B; Random-k.</p></list-item>
<list-item><p>An ensemble size (<inline-formula id="ieqn-152"><mml:math id="mml-ieqn-152"><mml:mi>M</mml:mi></mml:math></inline-formula>) of 10 base clusterings is examined.</p></list-item>
<list-item><p>As in [<xref ref-type="bibr" rid="ref-24">24</xref>,<xref ref-type="bibr" rid="ref-28">28</xref>,<xref ref-type="bibr" rid="ref-29">29</xref>], each method divides data points into a partition of <italic>K</italic> (the number of true classes for each dataset) clusters, which is then evaluated against the corresponding true partition. Note that, true classes are known for all datasets <italic>but are not explicitly used by the cluster ensemble process</italic>. They are only used to evaluate the quality of the clustering results.</p></list-item>
<list-item><p>The quality of each cluster ensemble method with respect to a specific ensemble setting is summarized as the average over 50 runs. By the central limit theorem (CLT), the statistics observed in such a controlled experiment can be assumed to approximate a normal distribution [<xref ref-type="bibr" rid="ref-43">43</xref>].</p></list-item>
<list-item><p>A constant decay factor (<inline-formula id="ieqn-153"><mml:math id="mml-ieqn-153"><mml:mi>D</mml:mi><mml:mi>C</mml:mi></mml:math></inline-formula>) of 0.9 is used with the WCT and WTQ algorithms.</p></list-item>
</list></p>
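The binary-feature pre-processing described above for k-means (each nominal attribute expanded into &#x03B2; binary features) can be sketched as follows; the function name, row layout and column-type sets are assumptions of this sketch, not artifacts of the original implementation.

```python
import numpy as np

def one_hot_mixed(rows, nominal_cols):
    """Expand each nominal attribute into beta binary features (one per
    distinct nominal value), keeping numeric attributes unchanged."""
    cols = []
    for j in range(len(rows[0])):
        col = [r[j] for r in rows]
        if j in nominal_cols:
            for v in sorted(set(col)):                 # beta distinct values -> beta columns
                cols.append([1.0 if x == v else 0.0 for x in col])
        else:
            cols.append([float(x) for x in col])       # numeric column passed through
    return np.array(cols).T                            # one row per data object
```

For example, a single nominal attribute with values {'a', 'b'} becomes two 0/1 columns, after which plain k-means can operate on the all-numeric matrix.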
</sec>
<sec id="s4_3"><label>4.3</label><title>Performance Measurements and Comparison</title>
<p>Since external class labels are available for all investigated datasets, the final clustering results are evaluated using the validity index of Normalized Mutual Information (<inline-formula id="ieqn-154"><mml:math id="mml-ieqn-154"><mml:mi>N</mml:mi><mml:mi>M</mml:mi><mml:mi>I</mml:mi></mml:math></inline-formula>) introduced by [<xref ref-type="bibr" rid="ref-32">32</xref>]. Other quality measures such as Classification Accuracy (<inline-formula id="ieqn-155"><mml:math id="mml-ieqn-155"><mml:mi>C</mml:mi><mml:mi>A</mml:mi></mml:math></inline-formula>; [<xref ref-type="bibr" rid="ref-44">44</xref>]) and Adjusted Rand Index (<inline-formula id="ieqn-156"><mml:math id="mml-ieqn-156"><mml:mi>A</mml:mi><mml:mi>R</mml:mi></mml:math></inline-formula>; [<xref ref-type="bibr" rid="ref-45">45</xref>]) can be similarly used. However, unlike these criteria, <inline-formula id="ieqn-157"><mml:math id="mml-ieqn-157"><mml:mi>N</mml:mi><mml:mi>M</mml:mi><mml:mi>I</mml:mi></mml:math></inline-formula> is not biased by a large number of clusters and thus provides a more reliable conclusion. It also keeps the magnitude of the evaluation results simple to interpret. This quality index measures the average mutual information (i.e., the degree of agreement) between two data partitions. One is obtained from a clustering algorithm (<inline-formula id="ieqn-158"><mml:math id="mml-ieqn-158"><mml:mrow><mml:msup><mml:mi>&#x03C0;</mml:mi><mml:mo>&#x2217;</mml:mo></mml:msup></mml:mrow></mml:math></inline-formula>) while the other is taken from a priori information, i.e., known class labels (<inline-formula id="ieqn-159"><mml:math id="mml-ieqn-159"><mml:mrow><mml:mover><mml:mo>&#x220F;</mml:mo><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:mover></mml:mrow></mml:math></inline-formula>). 
With <inline-formula id="ieqn-160"><mml:math id="mml-ieqn-160"><mml:mi>N</mml:mi><mml:mi>M</mml:mi><mml:mi>I</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mspace width="thickmathspace" /><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, the maximum value indicates that the clustering result and the original classes completely match. Given the two data partitions of <italic>K</italic> clusters and <inline-formula id="ieqn-161"><mml:math id="mml-ieqn-161"><mml:msup><mml:mi>K</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> classes, <inline-formula id="ieqn-162"><mml:math id="mml-ieqn-162"><mml:mi>N</mml:mi><mml:mi>M</mml:mi><mml:mi>I</mml:mi></mml:math></inline-formula> is computed by the following equation.
<disp-formula id="eqn-12"><label>(12)</label><mml:math id="mml-eqn-12" display="block"><mml:mi>N</mml:mi><mml:mi>M</mml:mi><mml:mi>I</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msup><mml:mi>&#x03C0;</mml:mi><mml:mo>&#x2217;</mml:mo></mml:msup></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mover><mml:mo>&#x220F;</mml:mo><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:mover></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:munderover><mml:mrow><mml:mo movablelimits="false">&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>K</mml:mi></mml:munderover><mml:mo>&#x2061;</mml:mo><mml:munderover><mml:mrow><mml:mo movablelimits="false">&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msup><mml:mi>K</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:munderover><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:mfrac></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:msqrt><mml:munderover><mml:mrow><mml:mo 
movablelimits="false">&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>K</mml:mi></mml:munderover><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mi>N</mml:mi></mml:mfrac></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mspace width="thickmathspace" /><mml:munderover><mml:mrow><mml:mo movablelimits="false">&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msup><mml:mi>K</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:munderover><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mi>N</mml:mi></mml:mfrac></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:msqrt></mml:mrow></mml:mfrac><mml:mspace width="thickmathspace" /><mml:mo>,</mml:mo></mml:math></disp-formula>
</p>
<p>where <inline-formula id="ieqn-163"><mml:math id="mml-ieqn-163"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> is the number of data objects shared by cluster <italic>i</italic> and class <italic>j</italic>, <inline-formula id="ieqn-164"><mml:math id="mml-ieqn-164"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> is the number of data objects in cluster <italic>i</italic>, <inline-formula id="ieqn-165"><mml:math id="mml-ieqn-165"><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> is the number of data objects in class <italic>j</italic>, and <italic>N</italic> is the total number of data objects. To compare the performance of different cluster ensemble methods, the overall quality measure for a specific experimental setting (i.e., dataset and ensemble type) is obtained as the average of <inline-formula id="ieqn-166"><mml:math id="mml-ieqn-166"><mml:mi>N</mml:mi><mml:mi>M</mml:mi><mml:mi>I</mml:mi></mml:math></inline-formula> values across 50 trials. These method-specific means, however, support the comparison only to a limited extent. For a more reliable assessment, the number of times (or frequency) that one technique is &#x2018;significantly better&#x2019; or &#x2018;significantly worse&#x2019; (at the 95&#x0025; confidence level) than the others is also considered here. This comparison method has been successfully exploited by [<xref ref-type="bibr" rid="ref-9">9</xref>] and [<xref ref-type="bibr" rid="ref-46">46</xref>] to draw trustworthy conclusions from the results generated by different cluster ensemble approaches. 
Based on these counts, it is useful to compare the frequencies of better (<inline-formula id="ieqn-167"><mml:math id="mml-ieqn-167"><mml:mi>B</mml:mi></mml:math></inline-formula>) and worse (<inline-formula id="ieqn-168"><mml:math id="mml-ieqn-168"><mml:mi>W</mml:mi></mml:math></inline-formula>) performance between methods. The overall measure (<inline-formula id="ieqn-169"><mml:math id="mml-ieqn-169"><mml:mi>B</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>W</mml:mi></mml:math></inline-formula>) is also used as a summary statistic.</p>
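The computation in Eq. (12) follows directly from these definitions. The following is a minimal sketch of that computation; the function name <monospace>nmi</monospace> and the label-list representation of a partition are our own illustrative choices, not part of the paper's implementation:

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized Mutual Information between two partitions, as in Eq. (12).

    labels_a / labels_b: per-object cluster (or class) labels of equal length.
    """
    n = len(labels_a)
    assert n == len(labels_b)
    joint = Counter(zip(labels_a, labels_b))   # n_{i,j}: objects shared by cluster i and class j
    size_a = Counter(labels_a)                 # n_i: objects in cluster i
    size_b = Counter(labels_b)                 # m_j: objects in class j

    # Numerator: sum over i,j of n_{i,j} * log( (n_{i,j} * N) / (n_i * m_j) )
    num = sum(nij * math.log(nij * n / (size_a[i] * size_b[j]))
              for (i, j), nij in joint.items())

    # Denominator: sqrt( sum_i n_i log(n_i/N) * sum_j m_j log(m_j/N) )
    den_a = sum(ni * math.log(ni / n) for ni in size_a.values())
    den_b = sum(mj * math.log(mj / n) for mj in size_b.values())
    den = math.sqrt(den_a * den_b)
    return num / den if den > 0 else 0.0
```

A perfectly matching pair of partitions yields 1, while statistically independent partitions yield 0, consistent with the range stated above.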
</sec>
<sec id="s4_4"><label>4.4</label><title>Experimental Results</title>
<p><xref ref-type="fig" rid="fig-4">Fig. 4</xref> shows the overall performance of different clustering methods, as the average <inline-formula id="ieqn-170"><mml:math id="mml-ieqn-170"><mml:mi>N</mml:mi><mml:mi>M</mml:mi><mml:mi>I</mml:mi></mml:math></inline-formula> measure across all investigated datasets and ensemble types. Based on this, LCE<sub>WCT</sub> and LCE<sub>WTQ</sub> are similarly more effective than their baseline model (i.e., HBGF), whilst significantly improving the quality of the data partitions acquired by the base clusterings, i.e., k-prototypes. Their performance levels are also better than those of the other cluster ensemble methods and standard clustering algorithms included in this evaluation. Note that CSPA and k-means are the most accurate within these two groups of compared methods. In addition, feature-based approaches such as QMI and IVC fail to enhance the accuracy of the base clustering results. Dataset-specific results are given in Tabs. A to E of <italic>Supplementary</italic> (<uri xlink:href="https://drive.google.com/file/d/1I62X5LTDQ_u6feFx57tW9oqwDLtfu4eH/view?usp=sharing">https://drive.google.com/file/d/1I62X5LTDQ_u6feFx57tW9oqwDLtfu4eH/view?usp&#x003D;sharing</uri>).</p>
<fig id="fig-4"><label>Figure 4</label><caption><title>Performance of different clustering methods, averaged across five datasets and four ensemble types. Note that each error bar represents the standard deviation of the corresponding average</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_19776-fig-4.png"/></fig>
<p>To further evaluate the quality of identified techniques, the number of times (or frequency) that one method is significantly better and worse (of 95&#x0025; confidence level) than the others are assessed across all experimented datasets and ensemble types. <xref ref-type="table" rid="table-2">Tabs. 2</xref> and <xref ref-type="table" rid="table-3">3</xref> present for each method the frequencies of significant better (<inline-formula id="ieqn-171"><mml:math id="mml-ieqn-171"><mml:mi>B</mml:mi></mml:math></inline-formula>) and significant worse (<inline-formula id="ieqn-172"><mml:math id="mml-ieqn-172"><mml:mi>W</mml:mi></mml:math></inline-formula>) performance, respectively. According to the frequencies shown in <xref ref-type="table" rid="table-2">Tab. 2</xref>, LCE<sub>WCT</sub> and LCE<sub>WTQ</sub> perform equally well on most of the examined datasets. EAC-AL is exceptionally effective on &#x2018;Abalone&#x2019; data, while the three graph-based approaches of CSPA, HGPA and MCLA are of good quality with &#x2018;Heart Disease&#x2019; and &#x2018;Horse Colic&#x2019;. Note that k-means and k-prototypes are the best amongst basic clustering techniques. It is also interesting to see that the better-performance statistics of feature-based approaches are usually lower than those of standard clusterings considered here. These findings can be similarly observed in <xref ref-type="table" rid="table-3">Tab. 3</xref>, which illustrates the frequencies of worse performance (<inline-formula id="ieqn-173"><mml:math id="mml-ieqn-173"><mml:mi>W</mml:mi></mml:math></inline-formula>). In this specific evaluation context, k-means is notably effective for most datasets and outperforms many graph-based and pairwise-similarity based cluster ensemble methods.</p>
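The better/worse frequencies above can be tallied from the per-trial NMI scores of each method pair. The sketch below illustrates one way to do this; it assumes an unpaired two-sample comparison using Welch's t statistic with a normal approximation to the 95% threshold, which may differ from the exact significance test used in the experiments:

```python
import math

def mean_var(xs):
    """Sample mean and (unbiased) sample variance of a score list."""
    n = len(xs)
    m = sum(xs) / n
    v = sum((x - m) ** 2 for x in xs) / (n - 1)
    return m, v, n

def significantly_better(a, b, z_crit=1.96):
    """True if method A's scores are significantly higher than method B's
    (Welch's t statistic, normal approximation, ~95% confidence level)."""
    ma, va, na = mean_var(a)
    mb, vb, nb = mean_var(b)
    se = math.sqrt(va / na + vb / nb)
    return se > 0 and (ma - mb) / se > z_crit

def better_worse_counts(scores):
    """scores: dict mapping method name -> list of NMI values over trials.
    Returns dict mapping method name -> (B, W): how often it is significantly
    better / worse than each other method."""
    out = {}
    for m, s in scores.items():
        b = sum(significantly_better(s, s2) for m2, s2 in scores.items() if m2 != m)
        w = sum(significantly_better(s2, s) for m2, s2 in scores.items() if m2 != m)
        out[m] = (b, w)
    return out
```

Summing these counts over all datasets and ensemble types yields tables of the same shape as Tabs. 2 and 3, and B &#x2212; W gives the overall summary measure.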
<table-wrap id="table-2"><label>Table 2</label><caption><title>Number of times that one method performs <italic>significantly better</italic> than others, summarized across five datasets and four types of ensemble. The best two per dataset are highlighted in <bold>boldface</bold></title></caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Method</th>
<th>Abalone</th>
<th>Acute inflammations</th>
<th>Heart disease</th>
<th>Horse colic</th>
<th>Mammographic</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>LCE<sub>WCT</sub></td>
<td><bold>52</bold></td>
<td><bold>47</bold></td>
<td><bold>58</bold></td>
<td><bold>61</bold></td>
<td><bold>57</bold></td>
<td><bold>275</bold></td>
</tr>
<tr><td>LCE<sub>WTQ</sub></td>
<td>45</td>
<td><bold>51</bold></td>
<td><bold>56</bold></td>
<td><bold>53</bold></td>
<td><bold>49</bold></td>
<td><bold>254</bold></td>
</tr>
<tr><td>HBGF</td>
<td>37</td>
<td>21</td>
<td>49</td>
<td>1</td>
<td>40</td>
<td>148</td>
</tr>
<tr><td>CSPA</td>
<td>20</td>
<td>17</td>
<td>32</td>
<td>29</td>
<td>28</td>
<td>126</td>
</tr>
<tr><td>HGPA</td>
<td>10</td>
<td>8</td>
<td>38</td>
<td>41</td>
<td>16</td>
<td>113</td>
</tr>
<tr><td>MCLA</td>
<td>19</td>
<td>14</td>
<td>29</td>
<td>37</td>
<td>27</td>
<td>126</td>
</tr>
<tr><td>EAC-SL</td>
<td>12</td>
<td>31</td>
<td>2</td>
<td>4</td>
<td>0</td>
<td>49</td>
</tr>
<tr><td>EAC-AL</td>
<td><bold>46</bold></td>
<td>28</td>
<td>23</td>
<td>6</td>
<td>32</td>
<td>135</td>
</tr>
<tr><td>QMI</td>
<td>13</td>
<td>6</td>
<td>17</td>
<td>14</td>
<td>9</td>
<td>59</td>
</tr>
<tr><td>AGG<sub>F</sub></td>
<td>35</td>
<td>6</td>
<td>2</td>
<td>0</td>
<td>22</td>
<td>65</td>
</tr>
<tr><td>AGG<sub>LSF</sub></td>
<td>23</td>
<td>3</td>
<td>9</td>
<td>13</td>
<td>22</td>
<td>70</td>
</tr>
<tr><td>AGG<sub>LSR</sub></td>
<td>1</td>
<td>3</td>
<td>9</td>
<td>15</td>
<td>4</td>
<td>32</td>
</tr>
<tr><td>IVC</td>
<td>13</td>
<td>13</td>
<td>11</td>
<td>16</td>
<td>12</td>
<td>65</td>
</tr>
<tr><td>MM</td>
<td>9</td>
<td>4</td>
<td>13</td>
<td>17</td>
<td>4</td>
<td>47</td>
</tr>
<tr><td>k-prototypes</td>
<td>42</td>
<td>5</td>
<td>24</td>
<td>19</td>
<td>45</td>
<td>135</td>
</tr>
<tr><td>k-centers</td>
<td>39</td>
<td>9</td>
<td>7</td>
<td>22</td>
<td>21</td>
<td>98</td>
</tr>
<tr><td>KM</td>
<td><bold>46</bold></td>
<td>7</td>
<td>35</td>
<td>28</td>
<td>35</td>
<td>151</td>
</tr>
<tr><td>dSqueezer</td>
<td>24</td>
<td>0</td>
<td>35</td>
<td>12</td>
<td>10</td>
<td>81</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="table-3"><label>Table 3</label><caption><title>Number of times that one method performs <italic>significantly worse</italic> than others, summarized across five datasets and four types of ensemble. The best two per dataset are highlighted in <bold>boldface</bold></title></caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Method</th>
<th>Abalone</th>
<th>Acute inflammations</th>
<th>Heart disease</th>
<th>Horse colic</th>
<th>Mammographic</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>LCE<sub>WCT</sub></td>
<td><bold>1</bold></td>
<td><bold>0</bold></td>
<td><bold>0</bold></td>
<td><bold>0</bold></td>
<td><bold>0</bold></td>
<td><bold>1</bold></td>
</tr>
<tr>
<td>LCE<sub>WTQ</sub></td>
<td>4</td>
<td><bold>0</bold></td>
<td><bold>0</bold></td>
<td><bold>0</bold></td>
<td><bold>0</bold></td>
<td><bold>4</bold></td>
</tr>
<tr>
<td>HBGF</td>
<td>15</td>
<td>4</td>
<td>6</td>
<td>54</td>
<td>6</td>
<td>85</td>
</tr>
<tr>
<td>CSPA</td>
<td>34</td>
<td>4</td>
<td>12</td>
<td>7</td>
<td>18</td>
<td>75</td>
</tr>
<tr>
<td>HGPA</td>
<td>50</td>
<td>12</td>
<td>6</td>
<td>6</td>
<td>32</td>
<td>106</td>
</tr>
<tr>
<td>MCLA</td>
<td>39</td>
<td>10</td>
<td>16</td>
<td>5</td>
<td>11</td>
<td>81</td>
</tr>
<tr>
<td>EAC-SL</td>
<td>49</td>
<td>2</td>
<td>66</td>
<td>58</td>
<td>66</td>
<td>241</td>
</tr>
<tr>
<td>EAC-AL</td>
<td>4</td>
<td>3</td>
<td>22</td>
<td>28</td>
<td>14</td>
<td>71</td>
</tr>
<tr>
<td>QMI</td>
<td>41</td>
<td>17</td>
<td>23</td>
<td>12</td>
<td>34</td>
<td>127</td>
</tr>
<tr>
<td>AGG<sub>F</sub></td>
<td>13</td>
<td>20</td>
<td>61</td>
<td>60</td>
<td>26</td>
<td>180</td>
</tr>
<tr>
<td>AGG<sub>LSF</sub></td>
<td>31</td>
<td>21</td>
<td>39</td>
<td>32</td>
<td>21</td>
<td>144</td>
</tr>
<tr>
<td>AGG<sub>LSR</sub></td>
<td>64</td>
<td>39</td>
<td>45</td>
<td>24</td>
<td>44</td>
<td>216</td>
</tr>
<tr>
<td>IVC</td>
<td>41</td>
<td>15</td>
<td>32</td>
<td>15</td>
<td>34</td>
<td>137</td>
</tr>
<tr>
<td>MM</td>
<td>55</td>
<td>17</td>
<td>26</td>
<td>11</td>
<td>38</td>
<td>147</td>
</tr>
<tr>
<td>k-prototypes</td>
<td><bold>3</bold></td>
<td>24</td>
<td>26</td>
<td>17</td>
<td>6</td>
<td>76</td>
</tr>
<tr>
<td>k-centers</td>
<td><bold>3</bold></td>
<td>12</td>
<td>52</td>
<td>11</td>
<td>35</td>
<td>113</td>
</tr>
<tr>
<td>KM</td>
<td><bold>3</bold></td>
<td>8</td>
<td><bold>0</bold></td>
<td>4</td>
<td><bold>0</bold></td>
<td>15</td>
</tr>
<tr>
<td>dSqueezer</td>
<td>36</td>
<td>65</td>
<td>17</td>
<td>44</td>
<td>48</td>
<td>210</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>In addition, the relation between the performance of the examined cluster ensemble methods and the ensemble type is also investigated: Full-space &#x002B; Fixed-k, Full-space &#x002B; Random-k, Subspace &#x002B; Fixed-k, and Subspace &#x002B; Random-k. Specifically, <xref ref-type="fig" rid="fig-5">Fig. 5</xref> shows the average <inline-formula id="ieqn-174"><mml:math id="mml-ieqn-174"><mml:mi>N</mml:mi><mml:mi>M</mml:mi><mml:mi>I</mml:mi></mml:math></inline-formula> measures of different approaches across datasets. As this illustration shows, LCE<sub>WCT</sub> and LCE<sub>WTQ</sub> are more effective than the other techniques across different ensemble types, with their best performance being obtained with &#x2018;Subspace &#x002B; Fixed-k&#x2019;. HBGF and the three graph-based approaches (CSPA, HGPA and MCLA) are also more effective on Subspace ensemble types, as compared to the Full-space alternatives. While both &#x2018;Fixed-k&#x2019; and &#x2018;Random-k&#x2019; strategies lead to equally good performance of link-based techniques, feature-based and pairwise-similarity based methods perform better using the latter.</p>
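As an illustration of the ensemble types above, base clusterings of the &#x2018;Subspace &#x002B; Random-k&#x2019; type can be generated roughly as follows. This is a sketch built on a minimal Lloyd's k-means; the function names, the subspace fraction and the k range are our own assumptions, not the paper's exact generation procedure:

```python
import numpy as np

def kmeans(X, k, iters=50, rng=None):
    """Minimal Lloyd's k-means returning a label vector for X (n x f, float)."""
    rng = rng or np.random.default_rng(0)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(iters):
        # squared distances of every point to every center -> nearest-center labels
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():                          # skip empty clusters
                centers[j] = X[labels == j].mean(0)
    return labels

def subspace_random_k_ensemble(X, M=10, k_range=(2, 5), frac=0.5, seed=0):
    """Generate M base clusterings of the 'Subspace + Random-k' type:
    each run clusters a random feature subset with a randomly chosen k."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    partitions = []
    for _ in range(M):
        feats = rng.choice(d, size=max(1, int(frac * d)), replace=False)
        k = int(rng.integers(k_range[0], k_range[1] + 1))
        partitions.append(kmeans(X[:, feats].astype(float), k, rng=rng))
    return partitions
```

The other three ensemble types follow by fixing k and/or using all features in every run; for mixed-type data, k-prototypes would replace k-means as the base algorithm.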
<fig id="fig-5"><label>Figure 5</label><caption><title>Performance of clustering methods, categorized by four ensemble types</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_19776-fig-5.png"/></fig>
<p>The quality of LCE<sub>WCT</sub> and LCE<sub>WTQ</sub> with respect to the perturbation of the <inline-formula id="ieqn-175"><mml:math id="mml-ieqn-175"><mml:mi>D</mml:mi><mml:mi>C</mml:mi></mml:math></inline-formula> and <italic>M</italic> parameters is also studied for the clustering of mixed-type data. <xref ref-type="fig" rid="fig-6">Fig. 6</xref> presents the relation between different values of <inline-formula id="ieqn-176"><mml:math id="mml-ieqn-176"><mml:mi>D</mml:mi><mml:mi>C</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mrow><mml:mn>0.1</mml:mn><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:mn>0.9</mml:mn></mml:mrow><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula> and the quality of data partitions generated by both LCE methods &#x2013; the average <inline-formula id="ieqn-177"><mml:math id="mml-ieqn-177"><mml:mi>N</mml:mi><mml:mi>M</mml:mi><mml:mi>I</mml:mi></mml:math></inline-formula> measure across all ensemble types, where <italic>M</italic> is fixed to 10 for simplicity of comparison. In general, the performance of LCE<sub>WCT</sub> and LCE<sub>WTQ</sub> gradually improves as the value of <inline-formula id="ieqn-178"><mml:math id="mml-ieqn-178"><mml:mi>D</mml:mi><mml:mi>C</mml:mi></mml:math></inline-formula> increases. Another parameter to be assessed is the ensemble size (<inline-formula id="ieqn-179"><mml:math id="mml-ieqn-179"><mml:mi>M</mml:mi></mml:math></inline-formula>). <xref ref-type="fig" rid="fig-7">Fig. 7</xref> shows the association between the performance of various techniques and different values of <inline-formula id="ieqn-180"><mml:math id="mml-ieqn-180"><mml:mi>M</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mrow><mml:mn>10</mml:mn><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:mn>20</mml:mn><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:mn>100</mml:mn></mml:mrow><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>. Both LCE methods perform consistently better than their baseline model competitors across different ensemble sizes, where the decay factor (<inline-formula id="ieqn-181"><mml:math id="mml-ieqn-181"><mml:mi>D</mml:mi><mml:mi>C</mml:mi></mml:math></inline-formula>) is fixed to 0.9 for simplicity. Their performance also improves as the ensemble size increases.</p>
<fig id="fig-6"><label>Figure 6</label><caption><title>Relations between <inline-formula id="ieqn-182"><mml:math id="mml-ieqn-182"><mml:mi>D</mml:mi><mml:mi>C</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mrow><mml:mn mathvariant="bold">0</mml:mn></mml:mrow><mml:mo>.</mml:mo><mml:mrow><mml:mn mathvariant="bold">1</mml:mn></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mn mathvariant="bold">0</mml:mn></mml:mrow><mml:mo>.</mml:mo><mml:mrow><mml:mn mathvariant="bold">2</mml:mn></mml:mrow><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mn mathvariant="bold">0</mml:mn></mml:mrow><mml:mo>.</mml:mo><mml:mrow><mml:mn mathvariant="bold">9</mml:mn></mml:mrow><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula> and performance of LCE methods (averages of <inline-formula id="ieqn-183"><mml:math id="mml-ieqn-183"><mml:mrow><mml:mi mathvariant="bold-italic">N</mml:mi><mml:mi>M</mml:mi><mml:mi>I</mml:mi></mml:mrow></mml:math></inline-formula> over four ensemble types for each dataset). The measure of HBGF is also included for comparison</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_19776-fig-6.png"/></fig>
<fig id="fig-7"><label>Figure 7</label><caption><title>Relations between <italic>M</italic> &#x2208; {10, 20, &#x2026;, 100} and performance of LCE methods (presented as the averages of <inline-formula id="ieqn-184"><mml:math id="mml-ieqn-184"><mml:mi>N</mml:mi><mml:mi>M</mml:mi><mml:mi>I</mml:mi></mml:math></inline-formula> over four ensemble types for each dataset)</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_19776-fig-7.png"/></fig>
</sec>
</sec>
<sec id="s5"><label>5</label><title>Conclusion</title>
<p>This paper has presented a novel extension of link-based consensus clustering to mixed-type data analysis. The resulting models have been rigorously evaluated on benchmark datasets, using several ensemble types. The comparison results against different standard clustering algorithms and a large set of well-known cluster ensemble methods show that the link-based techniques usually provide solutions of higher quality than those obtained by competitors. Furthermore, the investigation of their behavior with respect to the perturbation of algorithmic parameters also suggests robust performance. Such a characteristic makes link-based cluster ensembles highly useful for the exploration and analysis of a new set of mixed-type data, where prior knowledge is minimal. Because of its scope, there are many possibilities for extending the current research. Firstly, other link-based similarity measures may be explored. As more information within a link network is exploited, link-based cluster ensembles are likely to be more accurate (see the relevant findings in the initial work [<xref ref-type="bibr" rid="ref-30">30</xref>,<xref ref-type="bibr" rid="ref-31">31</xref>], where the use of SimRank and its variants is examined). However, it is important to note that such a modification is more resource-intensive and less accurate in a noisy environment than the present setting. Secondly, the performance of link-based cluster ensembles may be further improved using an adaptive decay factor (DC), determined from the dataset under examination.</p>
<p>The diversity of cluster ensembles has a positive effect on the performance of the link-based approach. It is therefore interesting to observe the behavior of the proposed models under new ensemble generation strategies, e.g., the random forest method for clustering [<xref ref-type="bibr" rid="ref-47">47</xref>], which may impose a higher diversity amongst base clusterings. Another non-trivial topic is the determination of ensemble components&#x2019; significance. This discrimination or selection process usually leads to a better outcome, and the coupling of such a mechanism with link-based cluster ensembles is to be further studied. Despite its performance, the consensus function of spectral graph partitioning (SPEC) can be inefficient with a large RA matrix. This can be overcome through the approximation of the eigenvectors required by SPEC. As a result, the time complexity becomes linear in the matrix size, but with possible information loss. A better alternative has been introduced by [<xref ref-type="bibr" rid="ref-48">48</xref>] via the notion of Power Iteration Clustering (PIC). It does not actually find eigenvectors but discovers interesting instances of their combinations. As a result, it is very fast and has proven more effective than conventional SPEC. The application of PIC as a consensus function of link-based cluster ensembles is a crucial step towards making the proposed approach truly effective in terms of both run-time and quality. Other possible future works include the use of the proposed method to support accurate clustering for fuzzy reasoning [<xref ref-type="bibr" rid="ref-49">49</xref>], handling of data with missing values [<xref ref-type="bibr" rid="ref-50">50</xref>] and data discretization [<xref ref-type="bibr" rid="ref-51">51</xref>].</p>
</sec>
</body>
<back>
<ack>
<p>This research work is partly supported by Mae Fah Luang University and Newton Institutional Links 2020-21 project (British Council and National Research Council of Thailand).</p>
</ack>
<fn-group>
<fn fn-type="other"><p><bold>Funding Statement:</bold> This work is funded by Newton Institutional Links 2020&#x2013;21 project: 623718881, jointly by British Council and National Research Council of Thailand (<uri xlink:href="https://www.britishcouncil.org">www.britishcouncil.org</uri>). The first author is the project PI with the other participating as a Co-I.</p></fn>
<fn fn-type="conflict"><p><bold>Conflicts of Interest:</bold> There is no conflict of interest to report regarding the present study.</p></fn>
</fn-group>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>D.</given-names> <surname>Jiang</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Tang</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Zhang</surname></string-name></person-group>, &#x201C;<article-title>Cluster analysis for gene expression data: A survey</article-title>,&#x201D; <source>IEEE Transactions on Knowledge and Data Engineering</source>, vol. <volume>16</volume>, no. <issue>11</issue>, pp. <fpage>1370</fpage>&#x2013;<lpage>1386</lpage>, <year>2004</year>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>R.</given-names> <surname>Wu</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Chang</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Chen</surname></string-name></person-group>, &#x201C;<article-title>Data mining application in customer relationship management of credit card business</article-title>,&#x201D; in <conf-name>Proc. of Int. Conf. on Computer Software and Applications</conf-name>, <conf-loc>Edinburgh, UK</conf-loc>, pp. <fpage>39</fpage>&#x2013;<lpage>40</lpage>, <year>2005</year>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Mostafa</surname></string-name> and <string-name><given-names>H.</given-names> <surname>Tripathy</surname></string-name></person-group>, &#x201C;<article-title>Information retrieval by semantic analysis and visualization of the concept space of D-lib magazine</article-title>,&#x201D; <source>D-Lib Magazine</source>, vol. <volume>8</volume>, no. <issue>10</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>8</lpage>, <year>2002</year>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Costa</surname></string-name> and <string-name><given-names>M.</given-names> <surname>Netto</surname></string-name></person-group>, &#x201C;<article-title>Cluster analysis using self-organizing maps and image processing techniques</article-title>,&#x201D; in <conf-name>Proc. of IEEE Int. Conf. on systems, Man, and Cybernetics</conf-name>, vol. <volume>5</volume>, pp. <fpage>367</fpage>&#x2013;<lpage>372</lpage>, <year>1999</year>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Q.</given-names> <surname>He</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Tang</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Zhang</surname></string-name></person-group>, &#x201C;<article-title>Cluster analysis on symptoms and signs of traditional Chinese medicine in 815 patients with unstable angina</article-title>,&#x201D; in <conf-name>Proc. of Int. Conf. on Fuzzy Systems and Knowledge Discovery</conf-name>, <conf-loc>Tianjin, China</conf-loc>, pp. <fpage>435</fpage>&#x2013;<lpage>439</lpage>, <year>2009</year>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A. K.</given-names> <surname>Jain</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Duin</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Mao</surname></string-name></person-group>, &#x201C;<article-title>Statistical pattern recognition: A review</article-title>,&#x201D; <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>, vol. <volume>22</volume>, no. <issue>1</issue>, pp. <fpage>4</fpage>&#x2013;<lpage>37</lpage>, <year>2000</year>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>D. B.</given-names> <surname>Henry</surname></string-name>, <string-name><given-names>P. H.</given-names> <surname>Tolan</surname></string-name> and <string-name><given-names>D.</given-names> <surname>Gorman-Smith</surname></string-name></person-group>, &#x201C;<article-title>Cluster analysis in family psychology research</article-title>,&#x201D; <source>Journal of Family Psychology</source>, vol. <volume>19</volume>, no. <issue>1</issue>, pp. <fpage>121</fpage>&#x2013;<lpage>132</lpage>, <year>2005</year>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>K.</given-names> <surname>Kim</surname></string-name> and <string-name><given-names>H.</given-names> <surname>Ahn</surname></string-name></person-group>, &#x201C;<article-title>A recommender system using GA K-means clustering in an online shopping market</article-title>,&#x201D; <source>Expert Systems with Applications</source>, vol. <volume>34</volume>, pp. <fpage>1200</fpage>&#x2013;<lpage>1209</lpage>, <year>2008</year>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>N.</given-names> <surname>Iam-On</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Boongoen</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Garrett</surname></string-name></person-group>, &#x201C;<article-title>LCE: A link-based cluster ensemble method for improved gene expression data analysis</article-title>,&#x201D; <source>Bioinformatics</source>, vol. <volume>26</volume>, no. <issue>12</issue>, pp. <fpage>1513</fpage>&#x2013;<lpage>1519</lpage>, <year>2010</year>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>E.</given-names> <surname>Kim</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Kim</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Ashlock</surname></string-name> and <string-name><given-names>D.</given-names> <surname>Nam</surname></string-name></person-group>, &#x201C;<article-title>MULTI-K: Accurate classification of microarray subtypes using ensemble k-means clustering</article-title>,&#x201D; <source>BMC Bioinformatics</source>, vol. <volume>10</volume>, no. <issue>260</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>12</lpage>, <year>2009</year>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A. K.</given-names> <surname>Jain</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Murty</surname></string-name> and <string-name><given-names>P.</given-names> <surname>Flynn</surname></string-name></person-group>, &#x201C;<article-title>Data clustering: A review</article-title>,&#x201D; <source>ACM Computing Survey</source>, vol. <volume>31</volume>, no. <issue>3</issue>, pp. <fpage>264</fpage>&#x2013;<lpage>323</lpage>, <year>1999</year>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Z.</given-names> <surname>Huang</surname></string-name></person-group>, &#x201C;<article-title>Clustering large data sets with mixed numeric and categorical values</article-title>,&#x201D; in <conf-name>Proc. of Pacific Asia Conf. on Knowledge Discovery and Data Mining</conf-name>, <conf-loc>Singapore</conf-loc>, pp. <fpage>21</fpage>&#x2013;<lpage>34</lpage>, <year>1997</year>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Dudoit</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Fridyand</surname></string-name></person-group>, &#x201C;<article-title>A prediction-based resampling method for estimating the number of clusters in a dataset</article-title>,&#x201D; <source>Genome Biology</source>, vol. <volume>3</volume>, no. <issue>7</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>21</lpage>, <year>2002</year>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>T.</given-names> <surname>Boongoen</surname></string-name> and <string-name><given-names>Q.</given-names> <surname>Shen</surname></string-name></person-group>, &#x201C;<article-title>Nearest-neighbor guided evaluation of data reliability and its applications</article-title>,&#x201D; <source>IEEE Transactions on Systems, Man and Cybernetics, Part B</source>, vol. <volume>40</volume>, no. <issue>6</issue>, pp. <fpage>1622</fpage>&#x2013;<lpage>1633</lpage>, <year>2010</year>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>W. M.</given-names> <surname>Rand</surname></string-name></person-group>, &#x201C;<article-title>Objective criteria for the evaluation of clustering methods</article-title>,&#x201D; <source>Journal of the American Statistical Association</source>, vol. <volume>66</volume>, pp. <fpage>846</fpage>&#x2013;<lpage>850</lpage>, <year>1971</year>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Ahmad</surname></string-name> and <string-name><given-names>L.</given-names> <surname>Dey</surname></string-name></person-group>, &#x201C;<article-title>A k-mean clustering algorithm for mixed numeric and categorical data</article-title>,&#x201D; <source>Data and Knowledge Engineering</source>, vol. <volume>63</volume>, no. <issue>2</issue>, pp. <fpage>503</fpage>&#x2013;<lpage>527</lpage>, <year>2007</year>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>N.</given-names> <surname>Iam-On</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Boongoen</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Garrett</surname></string-name> and <string-name><given-names>C.</given-names> <surname>Price</surname></string-name></person-group>, &#x201C;<article-title>Link-based cluster ensembles for heterogeneous biological data analysis</article-title>,&#x201D; in <conf-name>Proc. of IEEE Int. Conf. on Bioinformatics and Biomedicine</conf-name>, pp. <fpage>573</fpage>&#x2013;<lpage>578</lpage>, <year>2010</year>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>H. A.</given-names> <surname>Ralambondrainy</surname></string-name></person-group>, &#x201C;<article-title>Conceptual version of the k-means algorithm</article-title>,&#x201D; <source>Pattern Recognition Letters</source>, vol. <volume>16</volume>, pp. <fpage>1147</fpage>&#x2013;<lpage>1157</lpage>, <year>1995</year>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Z.</given-names> <surname>He</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Xu</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Deng</surname></string-name></person-group>, &#x201C;<article-title>Scalable algorithms for clustering large datasets with mixed type attributes</article-title>,&#x201D; <source>International Journal of Intelligent Systems</source>, vol. <volume>20</volume>, pp. <fpage>1077</fpage>&#x2013;<lpage>1089</lpage>, <year>2005</year>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Z.</given-names> <surname>He</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Xu</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Deng</surname></string-name></person-group>, &#x201C;<article-title>Squeezer: An efficient algorithm for clustering categorical data</article-title>,&#x201D; <source>Journal of Computer Science and Technology</source>, vol. <volume>17</volume>, no. <issue>5</issue>, pp. <fpage>611</fpage>&#x2013;<lpage>624</lpage>, <year>2002</year>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Z.</given-names> <surname>Huang</surname></string-name></person-group>, &#x201C;<article-title>Extensions to the k-means algorithm for clustering large data sets with categorical values</article-title>,&#x201D; <source>Data Mining and Knowledge Discovery</source>, vol. <volume>2</volume>, pp. <fpage>283</fpage>&#x2013;<lpage>304</lpage>, <year>1998</year>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>W.</given-names> <surname>Zhao</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Dai</surname></string-name> and <string-name><given-names>C.</given-names> <surname>Tang</surname></string-name></person-group>, &#x201C;<article-title>K-centers algorithm for clustering mixed type data</article-title>,&#x201D; in <conf-name>Proc. of Pacific-Asia Conf. on Advances in Knowledge Discovery and Data Mining</conf-name>, <conf-loc>Nanjing, China,</conf-loc> pp. <fpage>1140</fpage>&#x2013;<lpage>1147</lpage>, <year>2007</year>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><given-names>R.</given-names> <surname>Duda</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Hart</surname></string-name> and <string-name><given-names>D.</given-names> <surname>Stork</surname></string-name></person-group>, &#x201C;<chapter-title>Unsupervised learning and clustering</chapter-title>,&#x201D; <source>Pattern Classification</source>, <edition>2</edition><sup>nd</sup> ed., <publisher-loc>Singapore</publisher-loc>: <publisher-name>Wiley-Interscience</publisher-name>, <year>2000</year>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Fred</surname></string-name> and <string-name><given-names>A. K.</given-names> <surname>Jain</surname></string-name></person-group>, &#x201C;<article-title>Combining multiple clusterings using evidence accumulation</article-title>,&#x201D; <source>IEEE Transaction on Pattern Analysis and Machine Intelligence</source>, vol. <volume>27</volume>, no. <issue>6</issue>, pp. <fpage>835</fpage>&#x2013;<lpage>850</lpage>, <year>2005</year>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Xue</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Chen</surname></string-name> and <string-name><given-names>Q.</given-names> <surname>Yang</surname></string-name></person-group>, &#x201C;<article-title>Discriminatively regularized least-squares classification</article-title>,&#x201D; <source>Pattern Recognition</source>, vol. <volume>42</volume>, no. <issue>1</issue>, pp. <fpage>93</fpage>&#x2013;<lpage>104</lpage>, <year>2009</year>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>N.</given-names> <surname>Iam-On</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Boongoen</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Garrett</surname></string-name> and <string-name><given-names>C.</given-names> <surname>Price</surname></string-name></person-group>, &#x201C;<article-title>A link-based approach to the cluster ensemble problem</article-title>,&#x201D; <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>, vol. <volume>33</volume>, no. <issue>12</issue>, pp. <fpage>2396</fpage>&#x2013;<lpage>2409</lpage>, <year>2011</year>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>N.</given-names> <surname>Iam-On</surname></string-name> and <string-name><given-names>T.</given-names> <surname>Boongoen</surname></string-name></person-group>, &#x201C;<article-title>Pairwise similarity for cluster ensemble problem: Link-based and approximate approaches</article-title>,&#x201D; <source>Springer Transactions on Large-Scale Data and Knowledge-Centered Systems</source>, vol. <volume>9</volume>, pp. <fpage>95</fpage>&#x2013;<lpage>122</lpage>, <year>2013</year>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>X. Z.</given-names> <surname>Fern</surname></string-name> and <string-name><given-names>C. E.</given-names> <surname>Brodley</surname></string-name></person-group>, &#x201C;<article-title>Solving cluster ensemble problems by bipartite graph partitioning</article-title>,&#x201D; in <conf-name>Proc. of Int. Conf. on Machine Learning</conf-name>, <conf-loc>Louisville, Kentucky, USA</conf-loc>, pp. <fpage>36</fpage>&#x2013;<lpage>43</lpage>, <year>2004</year>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Gionis</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Mannila</surname></string-name> and <string-name><given-names>P.</given-names> <surname>Tsaparas</surname></string-name></person-group>, &#x201C;<article-title>Clustering aggregation</article-title>,&#x201D; <source>ACM Transactions on Knowledge Discovery from Data</source>, vol. <volume>1</volume>, no. <issue>1</issue>, pp. <fpage>4</fpage>&#x2013;<lpage>ex</lpage>, <year>2007</year>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>N.</given-names> <surname>Iam-On</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Boongoen</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Garrett</surname></string-name></person-group>, &#x201C;<article-title>Refining pairwise similarity matrix for cluster ensemble problem with cluster relations</article-title>,&#x201D; in <conf-name>Proc. of Int. Conf. on Discovery Science</conf-name>, <conf-loc> Budapest, Hungary</conf-loc>, pp. <fpage>222</fpage>&#x2013;<lpage>233</lpage>, <year>2008</year>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>N.</given-names> <surname>Iam-On</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Garrett</surname></string-name></person-group>, &#x201C;<article-title>Linkclue: A MATLAB package for link-based cluster ensembles</article-title>,&#x201D; <source>Journal of Statistical Software</source>, vol. <volume>36</volume>, no. <issue>9</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>36</lpage>, <year>2010</year>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Strehl</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Ghosh</surname></string-name></person-group>, &#x201C;<article-title>Cluster ensembles: A knowledge reuse framework for combining multiple partitions</article-title>,&#x201D; <source>Journal of Machine Learning Research</source>, vol. <volume>3</volume>, pp. <fpage>583</fpage>&#x2013;<lpage>617</lpage>, <year>2002</year>.</mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Topchy</surname></string-name>, <string-name><given-names>A. K.</given-names> <surname>Jain</surname></string-name> and <string-name><given-names>W.</given-names> <surname>Punch</surname></string-name></person-group>, &#x201C;<article-title>Clustering ensembles: Models of consensus and weak partitions</article-title>,&#x201D; <source>IEEE Transaction on Pattern Analysis and Machine Intelligence</source>, vol. <volume>27</volume>, no. <issue>12</issue>, pp. <fpage>1866</fpage>&#x2013;<lpage>1881</lpage>, <year>2005</year>.</mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>N.</given-names> <surname>Iam-On</surname></string-name> and <string-name><given-names>T.</given-names> <surname>Boongoen</surname></string-name></person-group>, &#x201C;<article-title>Diversity-driven generation of link-based cluster ensemble and application to data classification</article-title>,&#x201D; <source>Expert Systems with Applications</source>, vol. <volume>42</volume>, no. <issue>21</issue>, pp. <fpage>8259</fpage>&#x2013;<lpage>8273</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-35"><label>[35]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>P.</given-names> <surname>Panwong</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Boongoen</surname></string-name> and <string-name><given-names>N.</given-names> <surname>Iam-On</surname></string-name></person-group>, &#x201C;<article-title>Improving consensus clustering with noise-induced ensemble generation</article-title>,&#x201D; <source>Expert Systems with Applications</source>, vol. <volume>146</volume>, pp. <fpage>113</fpage>&#x2013;<lpage>138</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-36"><label>[36]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Luo</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Kong</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Li</surname></string-name></person-group>, &#x201C;<article-title>Clustering mixed data based on evidence accumulation</article-title>,&#x201D; in <conf-name>Proc. of Int. Conf. on Advanced Data Mining and Applications</conf-name>, <conf-loc>Xian, China</conf-loc>, pp. <fpage>348</fpage>&#x2013;<lpage>355</lpage>, <year>2006</year>.</mixed-citation></ref>
<ref id="ref-37"><label>[37]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Smolkin</surname></string-name> and <string-name><given-names>D.</given-names> <surname>Ghosh</surname></string-name></person-group>, &#x201C;<article-title>Cluster stability scores for microarray data in cancer studies</article-title>,&#x201D; <source>BMC Bioinformatics</source>, vol. <volume>21</volume>, no. <issue>9</issue>, pp. <fpage>1927</fpage>&#x2013;<lpage>1934</lpage>, <year>2003</year>.</mixed-citation></ref>
<ref id="ref-38"><label>[38]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Z.</given-names> <surname>Yu</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Wong</surname></string-name> and <string-name><given-names>H.</given-names> <surname>Wang</surname></string-name></person-group>, &#x201C;<article-title>Graph-based consensus clustering for class discovery from gene expression data</article-title>,&#x201D; <source>Bioinformatics</source>, vol. <volume>23</volume>, no. <issue>21</issue>, pp. <fpage>2888</fpage>&#x2013;<lpage>2896</lpage>, <year>2007</year>.</mixed-citation></ref>
<ref id="ref-39"><label>[39]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Adamic</surname></string-name> and <string-name><given-names>E.</given-names> <surname>Adar</surname></string-name></person-group>, &#x201C;<article-title>Friends &#x0026; neighbors on the web</article-title>,&#x201D; <source>Social Networks</source>, vol. <volume>25</volume>, no. <issue>3</issue>, pp. <fpage>211</fpage>&#x2013;<lpage>230</lpage>, <year>2003</year>.</mixed-citation></ref>
<ref id="ref-40"><label>[40]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Ng</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Jordan</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Weiss</surname></string-name></person-group>, &#x201C;<article-title>On spectral clustering: Analysis and an algorithm</article-title>,&#x201D; <source>Advances in Neural Information Processing Systems</source>, vol. <volume>14</volume>, pp. <fpage>849</fpage>&#x2013;<lpage>856</lpage>, <year>2001</year>.</mixed-citation></ref>
<ref id="ref-41"><label>[41]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Asuncion</surname></string-name> and <string-name><given-names>D.</given-names> <surname>Newman</surname></string-name></person-group>, &#x0022;<article-title>UCI machine learning repository</article-title>,&#x0022; <uri xlink:href="https://archive.ics.uci.edu">https://archive.ics.uci.edu</uri>, <year>2007</year>.</mixed-citation></ref>
<ref id="ref-42"><label>[42]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Czerniak</surname></string-name> and <string-name><given-names>H.</given-names> <surname>Zarzycki</surname></string-name></person-group>, &#x201C;<article-title>Application of rough sets in the presumptive diagnosis of urinary system diseases</article-title>,&#x201D; in <conf-name>Proc. of Int. Conf. on AI and Security in Computing Systems</conf-name>, <conf-loc>Miedzyzdroje, Poland</conf-loc>, pp. <fpage>41</fpage>&#x2013;<lpage>51</lpage>, <year>2003</year>.</mixed-citation></ref>
<ref id="ref-43"><label>[43]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Tijms</surname></string-name></person-group>, &#x201C;<source>Understanding Probability: Chance Rules in Everyday Life</source>,&#x201D; <publisher-loc>Cambridge, UK</publisher-loc>: <publisher-name>Cambridge University Press</publisher-name>, <year>2004</year>.</mixed-citation></ref>
<ref id="ref-44"><label>[44]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>N.</given-names> <surname>Nguyen</surname></string-name> and <string-name><given-names>R.</given-names> <surname>Caruana</surname></string-name></person-group>, &#x201C;<article-title>Consensus clusterings</article-title>,&#x201D; in <conf-name>Proc. of IEEE Int. Conf. on Data Mining</conf-name>, <conf-loc>Omaha, Nebraska, USA</conf-loc>, pp. <fpage>607</fpage>&#x2013;<lpage>612</lpage>, <year>2007</year>.</mixed-citation></ref>
<ref id="ref-45"><label>[45]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Hubert</surname></string-name> and <string-name><given-names>P.</given-names> <surname>Arabie</surname></string-name></person-group>, &#x201C;<article-title>Comparing partitions</article-title>,&#x201D; <source>Journal of Classification</source>, vol. <volume>2</volume>, no. <issue>1</issue>, pp. <fpage>193</fpage>&#x2013;<lpage>218</lpage>, <year>1985</year>.</mixed-citation></ref>
<ref id="ref-46"><label>[46]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>L. I.</given-names> <surname>Kuncheva</surname></string-name></person-group>, &#x201C;<article-title>Experimental comparison of cluster ensemble methods</article-title>,&#x201D; in <conf-name>Proc. of Int. Conf. on Fusion</conf-name>, <conf-loc>Florence, Italy</conf-loc>, pp. <fpage>105</fpage>&#x2013;<lpage>115</lpage>, <year>2006</year>. </mixed-citation></ref>
<ref id="ref-47"><label>[47]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>T.</given-names> <surname>Shi</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Horvath</surname></string-name></person-group>, &#x201C;<article-title>Unsupervised learning with random forest predictors</article-title>,&#x201D; <source>Journal of Computational and Graphical Statistics</source>, vol. <volume>15</volume>, no. <issue>1</issue>, pp. <fpage>118</fpage>&#x2013;<lpage>138</lpage>, <year>2006</year>.</mixed-citation></ref>
<ref id="ref-48"><label>[48]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>F.</given-names> <surname>Lin</surname></string-name> and <string-name><given-names>W.</given-names> <surname>Cohen</surname></string-name></person-group>, &#x201C;<article-title>Power iteration clustering</article-title>,&#x201D; in <conf-name>Proc. of Int. Conf. on Machine Learning</conf-name>, <conf-loc>Haifa, Israel</conf-loc>, pp. <fpage>655</fpage>&#x2013;<lpage>662</lpage>, <year>2010</year>.</mixed-citation></ref>
<ref id="ref-49"><label>[49]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>X.</given-names> <surname>Fu</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Boongoen</surname></string-name> and <string-name><given-names>Q.</given-names> <surname>Shen</surname></string-name></person-group>, &#x201C;<article-title>Evidence directed generation of plausible crime scenarios with identity resolution</article-title>,&#x201D; <source>Applied Artificial Intelligence</source>, vol. <volume>24</volume>, no. <issue>4</issue>, pp. <fpage>253</fpage>&#x2013;<lpage>276</lpage>, <year>2010</year>.</mixed-citation></ref>
<ref id="ref-50"><label>[50]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Pattanodom</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Iam-On</surname></string-name> and <string-name><given-names>T.</given-names> <surname>Boongoen</surname></string-name></person-group>, &#x201C;<article-title>Clustering data with the presence of missing values by ensemble approach</article-title>,&#x201D; in <conf-name>Proc. of Asian Conf. on Defence Technology</conf-name>, <conf-loc> Chiang Mai, Thailand</conf-loc>, pp. <fpage>151</fpage>&#x2013;<lpage>156</lpage>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-51"><label>[51]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>K.</given-names> <surname>Sriwanna</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Boongoen</surname></string-name> and <string-name><given-names>N.</given-names> <surname>Iam-On</surname></string-name></person-group>, &#x201C;<article-title>Graph clustering-based discretization of splitting and merging methods</article-title>,&#x201D; <source>Human-centric Computing and Information Sciences</source>, vol. <volume>7</volume>, no. <issue>1</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>39</lpage>, <year>2017</year>.</mixed-citation></ref>
</ref-list>
</back>
</article>