<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">IASC</journal-id>
<journal-id journal-id-type="nlm-ta">IASC</journal-id>
<journal-id journal-id-type="publisher-id">IASC</journal-id>
<journal-title-group>
<journal-title>Intelligent Automation &#x0026; Soft Computing</journal-title>
</journal-title-group>
<issn pub-type="epub">2326-005X</issn>
<issn pub-type="ppub">1079-8587</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">20164</article-id>
<article-id pub-id-type="doi">10.32604/iasc.2022.020164</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Machine Learning Privacy Aware Anonymization Using MapReduce Based Neural Network</article-title><alt-title alt-title-type="left-running-head">Machine Learning Privacy Aware Anonymization Using MapReduce Based Neural Network</alt-title><alt-title alt-title-type="right-running-head">Machine Learning Privacy Aware Anonymization Using MapReduce Based Neural Network</alt-title>
</title-group>
<contrib-group content-type="authors">
<contrib id="author-1" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Selvi</surname><given-names>U.</given-names></name><email>slvunnikrishnan@gmail.com</email>
</contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>Pushpa</surname><given-names>S.</given-names></name>
</contrib><aff><institution>Department of Computer Science and Engineering, St. Peter&#x2019;s Institute of Higher Education and Research</institution>, <addr-line>Chennai</addr-line>, <country>India</country></aff>
</contrib-group><author-notes><corresp id="cor1"><label>&#x002A;</label>Corresponding Author: U. Selvi. Email: <email>slvunnikrishnan@gmail.com</email></corresp></author-notes>
<pub-date pub-type="epub" date-type="pub" iso-8601-date="2021-09-06"><day>06</day><month>9</month><year>2021</year></pub-date>
<volume>31</volume>
<issue>2</issue>
<fpage>1185</fpage>
<lpage>1196</lpage>
<history>
<date date-type="received"><day>12</day><month>5</month><year>2021</year></date>
<date date-type="accepted"><day>22</day><month>6</month><year>2021</year></date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2022 Selvi and Pushpa</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Selvi and Pushpa</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_IASC_20164.pdf"></self-uri>
<abstract>
<p>Due to recent advances in technology, huge amounts of data are generated in which individuals&#x2019; private information must be preserved. A proper anonymization algorithm with high data utility is required to protect individual privacy. However, preserving the privacy of individuals while processing huge amounts of data is a challenging task, as the data contains sensitive information. Moreover, existing frameworks face scalability issues when handling large datasets. Many anonymization algorithms for Big Data have been developed and are under research. We propose a method of applying machine learning techniques to protect and preserve the personal identities of individuals in a BigData framework, which we term BigData Privacy Aware Machine Learning. To address a large volume of data, MapReduce-based neural network parallelism is employed together with classification of the data volume. We also propose applying human contextual knowledge through collaborative machine learning. Our experimental results show that combining human knowledge with a neural network parallelized by the MapReduce framework can yield better, measurable classification results for large-scale applications.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Privacy aware machine learning</kwd>
<kwd>anonymization</kwd>
<kwd>k-anonymity</kwd>
<kwd>bigdata</kwd>
<kwd>mapreduce</kwd>
<kwd>back-propagation neural network</kwd>
<kwd>machine learning</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>BigData, as the name implies, refers to massive amounts of data, generated at high speed in diverse data types, that cannot be processed by available database management tools and that are difficult to capture, collect, explore, share, examine and visualize [<xref ref-type="bibr" rid="ref-1">1</xref>].</p>
<p>BigData is characterized by the 4 V&#x2019;s:<list list-type="roman-lower"><list-item>
<p>Volume denotes the large quantity of data generated and collected;</p></list-item><list-item>
<p>Velocity denotes the timeliness of the data arriving at the data pool for analysis;</p></list-item><list-item>
<p>Variety denotes the forms of data: unstructured, structured and semi-structured; and</p></list-item><list-item>
<p>Value denotes the information concealed within the data.</p></list-item></list></p>
<p>MapReduce, the customary computation model, is used for handling BigData applications [<xref ref-type="bibr" rid="ref-1">1</xref>,<xref ref-type="bibr" rid="ref-2">2</xref>]. It is a framework for processing large datasets that remains consistent, fault-tolerant, accessible and self-balancing as the dataset size increases. The MapReduce framework [<xref ref-type="bibr" rid="ref-3">3</xref>] handles large datasets using Map and Reduce functions. A map function processes the data and produces &#x3008;key, value&#x3009; pairs as intermediate results. A reduce function sorts and merges the &#x3008;key, value&#x3009; pairs collected from multiple mappers and applies secondary processing to them. Finally, the reducer generates the results from the input collected from the outputs of the mappers.</p>
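<p>The map and reduce roles described above can be sketched in plain Python (an in-memory toy model of the framework, not the Hadoop API; the word-count records are only illustrative):</p>

```python
from itertools import groupby
from operator import itemgetter

def map_fn(record):
    # Map: emit intermediate <key, value> pairs for one input record.
    for word in record.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Reduce: merge all values collected for one key.
    return (key, sum(values))

def map_reduce(records):
    # Shuffle: sort and group intermediate pairs by key, as the
    # framework does between the map and reduce phases.
    intermediate = sorted(
        (pair for r in records for pair in map_fn(r)),
        key=itemgetter(0),
    )
    return [
        reduce_fn(key, (v for _, v in group))
        for key, group in groupby(intermediate, key=itemgetter(0))
    ]

print(map_reduce(["a b a", "b c"]))  # [('a', 2), ('b', 2), ('c', 1)]
```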
<p>Data Anonymization is the technique of protecting sensitive information from disclosure and preserving the privacy of the users of the application. In our paper, the standard k-Anonymity algorithm is chosen to preserve privacy.</p>
<p>Artificial Neural Networks (ANNs) [<xref ref-type="bibr" rid="ref-4">4</xref>] are capable of modelling and processing non-linear relationships between inputs and outputs in parallel and have been widely used in various research scenarios. One implementation of an ANN is the Back-Propagation Neural Network (BPNN), which has proved efficient in terms of approximation capability. A BPNN consists of an input layer, an output layer and &#x2018;n&#x2019; hidden layers, with each layer containing neurons. A BPNN uses an error back-propagation mechanism to train on data and employs a feed-forward network to produce the required output.</p>
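<p>A minimal pure-Python sketch of a BPNN with one hidden layer, trained by error back-propagation on XOR data; the network size, learning rate and training data are illustrative choices, not the paper&#x2019;s configuration:</p>

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class BPNN:
    """Feed-forward network, one hidden layer, trained by
    propagating the output error backwards (illustrative sketch)."""
    def __init__(self, n_in, n_hidden, lr=0.5):
        self.w1 = [[random.uniform(-1, 1) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]                # +1 bias weight
        self.w2 = [random.uniform(-1, 1) for _ in range(n_hidden + 1)]
        self.lr = lr

    def forward(self, x):
        xb = x + [1.0]                                      # bias input
        self.h = [sigmoid(sum(w * v for w, v in zip(ws, xb)))
                  for ws in self.w1]
        hb = self.h + [1.0]
        self.y = sigmoid(sum(w * v for w, v in zip(self.w2, hb)))
        return self.y

    def train(self, x, target):
        y = self.forward(x)
        # Output-layer error term, then propagate it backwards.
        delta_o = (y - target) * y * (1 - y)
        hb = self.h + [1.0]
        delta_h = [delta_o * self.w2[j] * hb[j] * (1 - hb[j])
                   for j in range(len(self.h))]
        for j in range(len(hb)):
            self.w2[j] -= self.lr * delta_o * hb[j]
        xb = x + [1.0]
        for j in range(len(self.h)):
            for i in range(len(xb)):
                self.w1[j][i] -= self.lr * delta_h[j] * xb[i]
        return (y - target) ** 2

data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
        ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]
net = BPNN(2, 4)
first = sum(net.train(x, t) for x, t in data)
for _ in range(5000):
    last = sum(net.train(x, t) for x, t in data)
print(f"SSE: {first:.3f} -> {last:.3f}")  # training error decreases
```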
<p>This paper focuses on a MapReduce approach that implements a Back-Propagation Neural Network (MRBPNN) by considering classification of the data volume, i.e., we aim to implement a machine learning algorithm in MapReduce for parallel processing.</p>
<p>Malle et al. [<xref ref-type="bibr" rid="ref-5">5</xref>] aim to report information loss and quasi-identifier distributions for a user-defined k-factor. However, their algorithm is not interactive and does not achieve the expected results during the learning phase: with the Cornell Anonymization Toolkit (Cat) [<xref ref-type="bibr" rid="ref-6">6</xref>], the user must check whether the expected result has been achieved and decide only after an anonymization run completes. Our methodology instead adjusts algorithmic factors upon each (batch of) manual interventions, so that the algorithm adapts in real time.</p>
<p>Xu et al. [<xref ref-type="bibr" rid="ref-7">7</xref>] propose constructing generalization hierarchies by allowing human intervention to set constraints on attributes during anonymization.</p>
<p><bold><italic>Our Contributions</italic></bold></p>
<p>We propose a Hadoop MapReduce framework that implements a Back-Propagation Neural Network to achieve k-anonymity for large-scale applications.</p>
<p>Our contributions are summarized as follows:<list list-type="roman-lower"><list-item>
<p>A Fast Correlation-Based Feature Selection algorithm in MapReduce is used to pre-process the dataset and select the relevant features.</p></list-item><list-item>
<p>The pre-processed data is then fed to the MapReduce framework. Each mapper holds a Back-Propagation Neural Network that maps the data into equivalence groups, which form clusters. The algorithm keeps selecting the next candidate for merging until the cluster reaches size k. The k-anonymity criterion is satisfied by combining all data points into clusters for the given dataset. The intermediate result from each mapper is fed as input to the reducer function.</p></list-item><list-item>
<p>The BPNN uses error propagation to tune the network parameters until k-anonymity is satisfied.</p></list-item></list></p>
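<p>The cluster-merging loop in step (ii) can be sketched as follows; squared Euclidean distance stands in here for the GIL-based merge cost, and the data points are illustrative:</p>

```python
def dist(a, b):
    # Stand-in merge cost: squared Euclidean distance to the seed.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def greedy_k_clusters(points, k):
    """Group points into clusters of size >= k: pick a seed, then
    repeatedly absorb the cheapest remaining candidate until the
    cluster holds k rows, and start a new cluster from a fresh seed."""
    remaining = list(points)
    clusters = []
    while len(remaining) >= k:
        seed = remaining.pop(0)
        cluster = [seed]
        while len(cluster) < k and remaining:
            best = min(remaining, key=lambda p: dist(seed, p))
            remaining.remove(best)
            cluster.append(best)
        clusters.append(cluster)
    if remaining:                       # leftovers join the last cluster,
        clusters[-1].extend(remaining)  # so every cluster keeps size >= k
    return clusters

groups = greedy_k_clusters([(0, 0), (0, 1), (5, 5), (5, 6), (9, 9)], k=2)
print([len(g) for g in groups])  # [2, 3]
```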
<p>The paper is organized as follows:</p>
<p>Section 2 provides background on the MapReduce-based Back-Propagation Neural Network and the architecture of the BPNN. Section 3 explains k-anonymity, and Section 4 deals with BigData MapReduce. Section 5 describes the iterative machine learning used to achieve anonymity. Section 6 presents discussion and analysis of the empirical studies. Section 7 concludes the paper.</p>
</sec>
<sec id="s2">
<label>2</label>
<title>Paralleling Neural Network</title>
<p>This section explains how neural networks with back-propagation are parallelized.</p>
<sec id="s2_1">
<label>2.1</label>
<title>Back-Propagation Neural Networks (BPNN)</title>
<p>A Back-Propagation Neural Network trains on data by propagating errors backwards through the network. It is a multi-layered, feed-forward network structure. Input-output mappings using a BPNN can be performed on a large volume of data without adequate knowledge of the mathematical equations involved. The BPNN tunes the network parameters to achieve k-anonymity during error propagation. <xref ref-type="fig" rid="fig-1">Fig. 1</xref> depicts the BPNN, which has a number of inputs and outputs in a multi-layered network structure. A BPNN has three explicit layers: (i) the input layer, (ii) the output layer and (iii) the hidden layers. It is the commonly accepted network structure for fitting a mathematical equation and mapping the relationships between inputs and outputs [<xref ref-type="bibr" rid="ref-8">8</xref>].</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>Neural network with back-propagation</title></caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="IASC_20164-fig-1.png"/>
</fig>
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>The Design of MapReduce Back-Propagation Neural Network</title>
<p>Consider a testing instance j &#x003D; {<inline-formula id="ieqn-1">
<mml:math id="mml-ieqn-1"><mml:mrow><mml:msub><mml:mi>b</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow></mml:math>
</inline-formula>, <inline-formula id="ieqn-2">
<mml:math id="mml-ieqn-2"><mml:mrow><mml:msub><mml:mi>b</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:math>
</inline-formula>, <inline-formula id="ieqn-3">
<mml:math id="mml-ieqn-3"><mml:mrow><mml:msub><mml:mi>b</mml:mi><mml:mn>3</mml:mn></mml:msub></mml:mrow></mml:math>
</inline-formula>, . . . , <inline-formula id="ieqn-4">
<mml:math id="mml-ieqn-4"><mml:mrow><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mi>j</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math>
</inline-formula>}, <inline-formula id="ieqn-5">
<mml:math id="mml-ieqn-5"><mml:mrow><mml:msub><mml:mi>q</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:math>
</inline-formula> &#x2208; Q, where<list list-type="roman-lower"><list-item>
<p>Data instance is denoted by <inline-formula id="ieqn-6">
<mml:math id="mml-ieqn-6"><mml:mrow><mml:msub><mml:mi>q</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:math>
</inline-formula> ;</p></list-item><list-item>
<p>Dataset is denoted by Q;</p></list-item><list-item>
<p>The dimension of <inline-formula id="ieqn-7">
<mml:math id="mml-ieqn-7"><mml:mrow><mml:msub><mml:mi>q</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:math>
</inline-formula>, the input quantity of neural network, denoted by jk;</p></list-item><list-item>
<p>The inputs are represented as &#x3008;<inline-formula id="ieqn-8">
<mml:math id="mml-ieqn-8"><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mrow><mml:msub><mml:mi>e</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:mrow></mml:math>
</inline-formula>, <inline-formula id="ieqn-9">
<mml:math id="mml-ieqn-9"><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mi>g</mml:mi><mml:mi>e</mml:mi><mml:mrow><mml:msub><mml:mi>t</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:mrow></mml:math>
</inline-formula>, type&#x3009;;</p></list-item><list-item>
<p>Neural Network (NN) input <inline-formula id="ieqn-10">
<mml:math id="mml-ieqn-10"><mml:mspace width="thickmathspace" /><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mrow><mml:msub><mml:mi>e</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:mrow></mml:math>
</inline-formula> signified by <inline-formula id="ieqn-11">
<mml:math id="mml-ieqn-11"><mml:mrow><mml:msub><mml:mi>q</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:math>
</inline-formula>;</p></list-item><list-item>
<p><inline-formula id="ieqn-12">
<mml:math id="mml-ieqn-12"><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mi>g</mml:mi><mml:mi>e</mml:mi><mml:mrow><mml:msub><mml:mi>t</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo></mml:math>
</inline-formula> signifies the anticipated yield, if <inline-formula id="ieqn-13">
<mml:math id="mml-ieqn-13"><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mrow><mml:msub><mml:mi>e</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:mrow></mml:math>
</inline-formula> is a training instance;</p></list-item><list-item>
<p>The field type takes two values, &#x2018;trainset&#x2019; and &#x2018;testset&#x2019;, marked according to the category of <inline-formula id="ieqn-14">
<mml:math id="mml-ieqn-14"><mml:mspace width="thickmathspace" /><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mrow><mml:msub><mml:mi>e</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:mrow></mml:math>
</inline-formula>; if &#x2018;testset&#x2019; value is fixed, <inline-formula id="ieqn-15">
<mml:math id="mml-ieqn-15"><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mi>g</mml:mi><mml:mi>e</mml:mi><mml:mrow><mml:msub><mml:mi>t</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:mrow></mml:math>
</inline-formula> field is shown blank.</p></list-item></list></p>
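<p>The paper specifies only the &#x3008;instance, target, type&#x3009; triple; a sketch of a record parser, assuming a hypothetical pipe-separated layout for the HDFS lines, could look like:</p>

```python
def parse_record(line):
    """Parse one record of the assumed form
    'b1,b2,...,bjk|target|type' into (instance, target, type).
    For a 'testset' row the target field is left blank."""
    features, target, kind = line.strip().split("|")
    instance = [float(v) for v in features.split(",")]
    return instance, (float(target) if kind == "trainset" else None), kind

inst, tgt, kind = parse_record("0.5,1.0,2.5|1.0|trainset")
print(inst, tgt, kind)  # [0.5, 1.0, 2.5] 1.0 trainset
```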
<p>Initially, records containing instances are stored in HDFS. Each record holds training and testing instances. Consequently, the number of records &#x03B7; determines the number of mappers used. Each mapper receives a data chunk of the training data as input. <xref ref-type="fig" rid="fig-2">Fig. 2</xref> shows the architecture of the MapReduce-based Back-Propagation Neural Network (BPNN).</p>
<p>The algorithm begins by initializing a neural network within each mapper function. As a consequence, the cluster holds n neural networks with exactly the same structure and parameters. As the training data is fed into the mappers, each mapper reads its data and picks a first cluster, randomly or pre-defined, from the data rows. The process then continues by selecting the finest candidates for integration, reducing GIL, until the cluster reaches size k. When the cluster reaches size k, the next cluster is started with a new data point as initiator; this process is repeated to form multiple clusters from the data points, satisfying k-anonymity for the given dataset. The error-propagation mechanism of the neural network helps maintain minimal information loss as measured by GIL.</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>Map Reduce based BPNN</title></caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="IASC_20164-fig-2.png"/>
</fig>
<p>Finally, the reducer produces the final output from the data collected from the mappers. Thus MapReduce solves the scalability problem of BigData, and interactive machine learning in the neural network satisfies k-anonymity.</p>
<fig id="fig-8">
<graphic mimetype="image" mime-subtype="png" xlink:href="IASC_20164-fig-8.png"/>
</fig>
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>Data Anonymization</title>
<p>Data Anonymization [<xref ref-type="bibr" rid="ref-9">9</xref>] is the process of shielding the identity and sensitive records of the owners of information records. The main objective of Data Anonymization [<xref ref-type="bibr" rid="ref-10">10</xref>] is to preserve privacy while exposing aggregate information to data scientists for analytics and mining. Data released for mining and analysis should strike a proper balance between utility and privacy. The benchmark algorithm for anonymization is k-anonymity, which uses generalization of attributes and suppression of tuples or attribute values.</p>
<p><bold><italic>K-anonymity</italic></bold></p>
<p>The classification of a dataset includes personal identifiers, sensitive information and quasi-identifiers:<list list-type="bullet"><list-item>
<p>A personal identifier is an attribute that directly identifies an individual without any further analysis or cross-referencing. Examples are an individual&#x2019;s email ID or Social Security Number (SSN). This category of data is particularly dangerous, as it reveals the individual&#x2019;s identity, and needs to be removed.</p></list-item><list-item>
<p>Sensitive information is the vital information used for research and mining purposes. Examples include disease classification, medication records or the salary of an individual. Such data are required for analysis and should be preserved in the anonymized dataset, so they cannot undergo generalization or suppression.</p></list-item><list-item>
<p>The remaining attributes are quasi-identifiers (QIs), which do not directly identify the individual. However, when their values are combined, individuals can be re-identified from them.</p></list-item></list></p>
<p>For illustration, a report in 2002 showed that individuals in a certain region could be re-identified via the zip code, gender and birth-date attributes. From this it can be inferred that quasi-identifiers carry information valuable for research analysis, and [<xref ref-type="bibr" rid="ref-11">11</xref>] they can be generalized or suppressed based on a trade-off between privacy and data utility that limits information loss.</p>
<p>In k-anonymity [<xref ref-type="bibr" rid="ref-12">12</xref>], an individual&#x2019;s information cannot be singled out in the released data: every released record shares its information with a minimum of k &#x2013; 1 other people in the cluster, i.e., at least k records have the same quasi-identifier. For instance, if the birth date and gender attributes form the QID of a released table, then to achieve k-anonymity k people must share the same birth date and gender in the given dataset. In a <italic>k</italic>-anonymous table there is no unique record; k &#x2013; 1 other records have the same QID values. Generalization and suppression are the key mechanisms for achieving k-anonymity. To anonymize a data structure, the algorithm is guided by the General Information Loss (GIL) incurred through anonymization. The GIL is the amount of information lost through generalization of attributes, as given in <xref ref-type="disp-formula" rid="eqn-1">Eq. (1)</xref>.</p>
<p>The General Information Loss (GIL) is defined as:</p>
<p><disp-formula id="eqn-1"><label>(1)</label>
<mml:math id="mml-eqn-1" display="block"><mml:mrow><mml:mi mathvariant="normal">G</mml:mi><mml:mi mathvariant="normal">I</mml:mi><mml:mi mathvariant="normal">L</mml:mi><mml:mspace width="thickmathspace" /></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="normal">c</mml:mi><mml:mi mathvariant="normal">l</mml:mi></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mspace width="thickmathspace" /></mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="normal">c</mml:mi><mml:mi mathvariant="normal">l</mml:mi></mml:mrow></mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mo>.</mml:mo><mml:munderover><mml:mrow><mml:mo movablelimits="false">&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>s</mml:mi></mml:munderover><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mrow><mml:mfrac><mml:mrow><mml:mrow><mml:mi mathvariant="normal">s</mml:mi><mml:mi mathvariant="normal">i</mml:mi><mml:mi mathvariant="normal">z</mml:mi><mml:mi mathvariant="normal">e</mml:mi></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="normal">g</mml:mi><mml:mi mathvariant="normal">e</mml:mi><mml:mi mathvariant="normal">n</mml:mi></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="normal">c</mml:mi><mml:mi mathvariant="normal">l</mml:mi></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="normal">N</mml:mi><mml:mi mathvariant="normal">j</mml:mi><mml:mspace width="thickmathspace" /></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="normal">s</mml:mi><mml:mi mathvariant="normal">i</mml:mi><mml:mi mathvariant="normal">z</mml:mi><mml:mi mathvariant="normal">e</mml:mi></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:mi 
mathvariant="normal">m</mml:mi><mml:mi mathvariant="normal">i</mml:mi><mml:mi mathvariant="normal">n</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="normal">x</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">N</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="normal">X</mml:mi></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="normal">N</mml:mi><mml:mi mathvariant="normal">j</mml:mi><mml:mspace width="thickmathspace" /></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi mathvariant="normal">m</mml:mi><mml:mi mathvariant="normal">a</mml:mi><mml:mi mathvariant="normal">x</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="normal">x</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">N</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="normal">X</mml:mi></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="normal">N</mml:mi><mml:mi mathvariant="normal">j</mml:mi><mml:mspace width="thickmathspace" /></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:mrow><mml:mspace width="thickmathspace" /><mml:mo>+</mml:mo><mml:mspace width="thickmathspace" /><mml:mspace width="thickmathspace" /><mml:munderover><mml:mrow><mml:mo movablelimits="false">&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>t</mml:mi></mml:munderover><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mrow><mml:mfrac><mml:mrow><mml:mrow><mml:mi mathvariant="normal">h</mml:mi><mml:mi mathvariant="normal">e</mml:mi><mml:mi mathvariant="normal">i</mml:mi><mml:mi mathvariant="normal">g</mml:mi><mml:mi mathvariant="normal">h</mml:mi><mml:mi 
mathvariant="normal">t</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2227;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="normal">g</mml:mi><mml:mi mathvariant="normal">e</mml:mi><mml:mi mathvariant="normal">n</mml:mi></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="normal">c</mml:mi><mml:mi mathvariant="normal">l</mml:mi></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="normal">N</mml:mi><mml:mi mathvariant="normal">j</mml:mi><mml:mspace width="thickmathspace" /></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>i</mml:mi><mml:mi>g</mml:mi><mml:mi>h</mml:mi><mml:mi>t</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:mrow></mml:mstyle></mml:mstyle></mml:math>
</disp-formula></p>
<p>where:<list list-type="bullet"><list-item>
<p>|cl| signifies the cluster cl&#x2019;s cardinal;</p></list-item><list-item>
<p>size([i1, i2]) signifies the size of the interval (i2 &#x2212; i1);</p></list-item><list-item>
<p><inline-formula id="ieqn-80">
<mml:math id="mml-ieqn-80"><mml:mo>&#x2227;</mml:mo></mml:math>
</inline-formula> (<inline-formula id="ieqn-81">
<mml:math id="mml-ieqn-81"><mml:mi>&#x03C9;</mml:mi></mml:math>
</inline-formula>),<inline-formula id="ieqn-82">
<mml:math id="mml-ieqn-82"><mml:mspace width="thickmathspace" /><mml:mi>&#x03C9;</mml:mi></mml:math>
</inline-formula> &#x2208;<inline-formula id="ieqn-83">
<mml:math id="mml-ieqn-83"><mml:mrow><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:msub></mml:mrow></mml:math>
</inline-formula> signifies sub-hierarchy of <inline-formula id="ieqn-84">
<mml:math id="mml-ieqn-84"><mml:mrow><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:msub></mml:mrow></mml:math>
</inline-formula> embedded in <inline-formula id="ieqn-85">
<mml:math id="mml-ieqn-85"><mml:mi>&#x03C9;</mml:mi></mml:math>
</inline-formula>;</p></list-item><list-item>
<p><inline-formula id="ieqn-86">
<mml:math id="mml-ieqn-86"><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>i</mml:mi><mml:mi>g</mml:mi><mml:mi>h</mml:mi><mml:mi>t</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math>
</inline-formula> signifies the altitude of the tree Hierarchy <inline-formula id="ieqn-87">
<mml:math id="mml-ieqn-87"><mml:mrow><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:msub></mml:mrow></mml:math>
</inline-formula> ;</p></list-item></list></p>
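<p>The numeric-attribute term of Eq. (1) can be computed directly; the following sketch omits the hierarchical (categorical) term and assumes purely numeric quasi-identifiers, with illustrative values:</p>

```python
def gil_numeric(cluster, full_ranges):
    """Numeric part of Eq. (1): for each attribute Nj, the width of the
    generalized interval inside the cluster divided by the width of that
    attribute's full range over the dataset X, summed over attributes
    and scaled by the cluster's cardinality |cl|."""
    loss = 0.0
    for j, (lo, hi) in enumerate(full_ranges):
        vals = [row[j] for row in cluster]
        gen_width = max(vals) - min(vals)   # size(gen(cl)[Nj])
        loss += gen_width / (hi - lo)       # size(min_X[Nj]..max_X[Nj])
    return len(cluster) * loss

cluster = [(25, 60000), (30, 65000)]        # (age, salary) rows in one cluster
ranges = [(20, 60), (30000, 90000)]         # dataset-wide min/max per attribute
print(round(gil_numeric(cluster, ranges), 3))  # 0.417
```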
<p>The training data is fed into the mappers; each mapper reads its data and picks a first cluster, randomly or pre-defined, from the data rows. The process then continues by selecting the finest attributes for integration, reducing the General Information Loss (GIL), and repeats to retain the cluster size k. When the cluster reaches size k, the next cluster is started with a fresh data point as originator; for the given dataset, this procedure iterates until all data points have been combined into new clusters satisfying the anonymity algorithm.</p>
<p>In a k-anonymous dataset [<xref ref-type="bibr" rid="ref-13">13</xref>], every similarity class with respect to the quasi-identifier has size &#x2018;k&#x2019; or more. Generalization and suppression do not consider the data utility for classification after anonymization. Beyond k-anonymity, l-diversity [<xref ref-type="bibr" rid="ref-14">14</xref>] (every cluster should maintain l diverse sensitive values), t-closeness (the local distribution over sensitive data must not deviate from its global distribution by more than a threshold t), m-invariance and differential privacy (noise is injected into the dataset so that sensitive information can be released securely) have been proposed.</p>
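<p>The similarity-class condition above can be checked directly on a released table (a minimal sketch; the column names and rows are illustrative):</p>

```python
from collections import Counter

def is_k_anonymous(rows, qid_cols, k):
    """True when every combination of quasi-identifier values
    appears in at least k rows of the released table."""
    counts = Counter(tuple(row[c] for c in qid_cols) for row in rows)
    return all(n >= k for n in counts.values())

table = [
    {"birth": "1980", "gender": "F", "disease": "flu"},
    {"birth": "1980", "gender": "F", "disease": "cold"},
    {"birth": "1975", "gender": "M", "disease": "flu"},
]
print(is_k_anonymous(table, ["birth", "gender"], k=2))  # False: one (1975, M) row
```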
</sec>
<sec id="s4">
<label>4</label>
<title>BigData: MapReduce, Hadoop</title>
<p>This section explains the two main concepts of BigData processing. The following sub-sections cover the MapReduce programming model, data pre-processing and FCBF in the MapReduce framework.</p>
<sec id="s4_1">
<label>4.1</label>
<title>The MapReduce: Computing Model</title>
<p>In the BigData revolution, MapReduce [<xref ref-type="bibr" rid="ref-15">15</xref>] is the customary computing model for processing large datasets on a cluster of commodity computers. Hadoop [<xref ref-type="bibr" rid="ref-16">16</xref>,<xref ref-type="bibr" rid="ref-17">17</xref>], an open-source framework, is the most popular implementation of the MapReduce model.</p>
<p>HDFS, used for data management, and MapReduce [<xref ref-type="bibr" rid="ref-18">18</xref>] are the two main components of the Hadoop framework. To run jobs and process data, a Hadoop cluster has a Namenode and Datanodes. The Namenode is accountable for the cluster&#x2019;s metadata, while the Datanodes are the actual processing nodes running the Map and Reduce functions. When a job is submitted to Hadoop, the input data is split into a number of small chunks of equal size and stored in HDFS. To preserve data reliability, every data chunk can have one or more replicas, as per the Hadoop cluster configuration. Mappers are replicated and receive data based on data locality. Finally, HDFS holds the final response, which is sorted, combined and produced by the reducers.</p>
<p>The scalable, fault-tolerant MapReduce framework [<xref ref-type="bibr" rid="ref-19">19</xref>] was developed to process and handle large datasets in parallel. MapReduce has two basic tasks: the Map and Reduce functions. A mapper, given an input key-value pair, yields intermediate key-value pairs; in our design, each mapper hosts a neural network based on the back-propagation algorithm. The intermediate key-value pairs are grouped by key and communicated to the reducer, which merges the values into the reduced output.</p>
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Pre-Processing Data</title>
<p>In our previous work [<xref ref-type="bibr" rid="ref-20">20</xref>], we elaborated a measure of the goodness of a feature for classification. In terms of correlation analysis [<xref ref-type="bibr" rid="ref-21">21</xref>], a good feature is highly correlated with the class yet not correlated with any other features.</p>
<p>The Fast Correlation Feature Selection (FCFS) [<xref ref-type="bibr" rid="ref-22">22</xref>] algorithm explores the search space using the best-first search algorithm. The search begins with an empty set of features; at each iteration, all possible single-feature expansions of the current subset are generated. The new subsets are evaluated and added to a priority queue ordered by improvement. In the subsequent iteration, the best subset in the queue is selected for expansion, in the same way as the initial empty subset. If the best subset fails to produce an improvement, the next best subset is selected from the queue. After five successive failures (the stopping criterion), the FCFS algorithm terminates.</p>
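<p>A minimal sketch of the best-first subset search with the five-failure stopping criterion might look as follows (the merit function is a stand-in supplied by the caller; FCFS would use a correlation-based measure):</p>

```python
import heapq

def best_first_search(features, merit, max_failures=5):
    """Best-first search over feature subsets, stopping after
    `max_failures` consecutive non-improving expansions."""
    # Priority queue of (-merit, subset); start from the empty subset.
    queue = [(-merit(frozenset()), frozenset())]
    best_subset, best_merit = frozenset(), merit(frozenset())
    failures = 0
    while queue and failures < max_failures:
        _, subset = heapq.heappop(queue)
        improved = False
        # Generate every single-feature expansion of the popped subset.
        for f in features - subset:
            candidate = frozenset(subset | {f})
            m = merit(candidate)
            heapq.heappush(queue, (-m, candidate))
            if m > best_merit:
                best_subset, best_merit = candidate, m
                improved = True
        failures = 0 if improved else failures + 1
    return best_subset, best_merit
```

With a toy merit that rewards features a and b and slightly penalises c, the search settles on the subset {a, b}.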
<p>The final CFS [<xref ref-type="bibr" rid="ref-23">23</xref>] step is optional. The FCFS algorithm selects feature subsets with low redundancy and high correlation with the class. However, in certain cases there may exist additional features that are locally predictive in a small region of the instance space, and these can be exploited by some classifiers. To admit such features into the subset after the search, FCFS can apply a heuristic that considers all remaining features: a feature is included when its correlation with the class is higher than its correlation with any of the features already selected.</p>
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>FCBF in MapReduce Framework</title>
<p>As shown in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>, the large volume of collected data is divided into chunks and each chunk is fed to a mapper. At each mapper, the Fast Correlation Based Feature Selection algorithm selects the optimum subset of features and removes redundant features while preserving data utility. Finally, the k-anonymity algorithm in the reducer enforces the privacy of individuals. Hence, the outcome of the MapReduce job is an anonymized dataset that satisfies k-anonymity. Upon applying the FCFS algorithm to the large dataset, we found that execution time decreases with data volume, as shown in <xref ref-type="fig" rid="fig-4">Fig. 4</xref>. The FCFS algorithm is therefore selected for the pre-processing stage, and the pre-processed data is fed as input to the MapReduce anonymization stage.</p>
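<p>As an illustration of the reducer-side privacy step, the following sketch enforces k-anonymity by suppression, dropping any record whose quasi-identifier combination occurs fewer than k times. The paper&#x2019;s algorithm uses generalization hierarchies, so this is only a simplified stand-in; the attribute names are assumptions:</p>

```python
from collections import Counter

def reduce_k_anonymize(records, quasi_identifiers, k):
    """Reducer-side step: suppress any record whose quasi-identifier
    combination appears fewer than k times, so every remaining
    combination is shared by at least k records (k-anonymity)."""
    key = lambda r: tuple(r[q] for q in quasi_identifiers)
    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] >= k]
```

A generalization-based reducer would instead coarsen values (e.g. age 34 to the range 30&#x2013;40) until every equivalence class reaches size k, which loses less information than suppression.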
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Interactive Machine Learning</title>
<p>Interactive machine learning algorithms [<xref ref-type="bibr" rid="ref-24">24</xref>] receive positive or negative reinforcement through the interaction of a human, acting as an external oracle, with their inner working mechanism [<xref ref-type="bibr" rid="ref-25">25</xref>]. Our methodology alters the algorithmic parameters upon each batch of manual interventions, permitting human decisions to be incorporated in real-time applications [<xref ref-type="bibr" rid="ref-26">26</xref>].</p>
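<p>One simple form such parameter adjustment could take, assuming per-attribute generalization weights and +1/-1 human feedback signals (both assumptions made for illustration, not the paper&#x2019;s exact update rule), is:</p>

```python
def apply_human_feedback(weights, feedback, rate=0.1):
    """Adjust per-attribute generalization weights after each batch of
    human interactions: +1 reinforces an attribute, -1 penalizes it.
    Weights are clipped so they stay non-negative."""
    return {attr: max(0.0, w + rate * feedback.get(attr, 0))
            for attr, w in weights.items()}
```

Attributes the human oracle flags as important thus gain weight over successive batches, steering which columns the anonymizer generalizes least.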
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>K-anonymity based FCBF in Map-Reduce</title></caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="IASC_20164-fig-3.png"/>
</fig>
<p>This is incorporated into the anonymization process by permitting users to set limits on instance generalization; in addition, generalization hierarchies were constructed from domain-specific ontologies.</p>
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>Execution time decreases as data volume increases with the use of the FCFS algorithm</title></caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="IASC_20164-fig-4.png"/>
</fig>
</sec>
<sec id="s6">
<label>6</label>
<title>Performance Evaluation</title>
<p>To test our work, parallel BPNNs were implemented on Hadoop using the MapReduce computing model. The multivariate Adult dataset is used, with both categorical and integer attributes (14 attributes) and 48,842 instances, some of which contain missing values. The accuracy of the algorithms was calculated by varying the value of k from 10 up to 1000. The computational efficiency was estimated by varying the dataset size from 1 MB to 1 GB.</p>
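<p>The k actually achieved by an anonymized dataset can be verified as the size of its smallest equivalence class over the quasi-identifier attributes; a minimal check (the attribute names are assumptions for illustration):</p>

```python
from collections import Counter

def anonymity_level(records, quasi_identifiers):
    """The k actually achieved by a dataset: the size of its smallest
    equivalence class over the quasi-identifier attributes."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers)
                     for r in records)
    return min(counts.values())
```

A dataset satisfies k-anonymity for a target value of k exactly when `anonymity_level(...)` is greater than or equal to that k.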
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Human bias overtakes both the equal-weights and human-interaction parameters when marital status is taken as the target attribute</title></caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="IASC_20164-fig-5.png"/>
</fig>
<fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>Human bias performs marginally better than the equal-weights / iML parameters across different values of k, though not as markedly as in the previous case, when education is taken as the target attribute</title></caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="IASC_20164-fig-6.png"/>
</fig>
<p>Four different classification algorithms were applied along with iML [<xref ref-type="bibr" rid="ref-27">27</xref>,<xref ref-type="bibr" rid="ref-28">28</xref>] in the neural network for large-scale applications, on three target attributes, to generate anonymized datasets.</p>
<p>The following processing pipeline is applied:<list list-type="roman-lower"><list-item>
<p>The original datasets are pre-processed using the Fast Correlation Feature Selection algorithm, followed by the application of the k-anonymity algorithm with the value of k taken as [5, 10, 20, 50, 100, 200] and 129 different weight configurations (iML, bias, equal) to create the anonymized datasets.</p></list-item><list-item>
<p>We executed four classification algorithms on all of the datasets and compared the resulting F1 scores; the rationale behind choosing several algorithms was to discover whether anonymization would produce different performance across different mathematical approaches to classification. The algorithms used were linear support vector machines, logistic regression, gradient boosting (ensemble, boosting) and random forest (ensemble, bagging). With education as the classification target, fourteen different education levels exist in the Adult dataset. Broadly, these can be grouped into four classes: &#x2019;advanced studies&#x2019;, &#x2019;&#x003C;&#x003D;bachelors&#x2019;, &#x2019;high school&#x2019; and &#x2019;pre high school&#x2019;.</p></list-item><list-item>
<p>For every combination of classification target (education, marital status, income) and weight configuration (iML, bias, equal), we averaged the corresponding outcomes. Results are reported per target, as this permits a better comparison among the different classifiers. <xref ref-type="fig" rid="fig-5">Figs. 5</xref>&#x2013;<xref ref-type="fig" rid="fig-7">7</xref> show the results of applying the different classifiers, based on the attributes selected for anonymization.</p></list-item></list></p>
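<p>The grouping of the Adult dataset&#x2019;s education levels into the four classes named above can be expressed as a simple lookup; the exact assignment of levels to classes is our assumption for illustration:</p>

```python
# Illustrative grouping of the fourteen Adult-dataset education levels
# into four broader classes; the exact assignment is an assumption.
EDUCATION_GROUPS = {
    'Masters': 'advanced studies', 'Doctorate': 'advanced studies',
    'Prof-school': 'advanced studies',
    'Bachelors': '<=bachelors', 'Some-college': '<=bachelors',
    'Assoc-acdm': '<=bachelors', 'Assoc-voc': '<=bachelors',
    'HS-grad': 'high school', '12th': 'high school',
    '11th': 'pre high school', '10th': 'pre high school',
    '9th': 'pre high school', '7th-8th': 'pre high school',
    '5th-6th': 'pre high school',
}

def group_education(level):
    """Map a raw education level to one of the four coarse classes."""
    return EDUCATION_GROUPS.get(level, 'pre high school')
```

Coarsening a fourteen-way target to four classes both eases the classification task and acts as a generalization step over the education attribute.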
<fig id="fig-7">
<label>Figure 7</label>
<caption>
<title>iML-based results generally outperform bias, except for linear SVC; however, they are unable to overtake the strictly equal setting when income is taken as the target attribute</title></caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="IASC_20164-fig-7.png"/>
</fig>
</sec>
<sec id="s7">
<label>7</label>
<title>Conclusion</title>
<p>We have outlined parallel neural networks that use manual interventions to bear on the task of anonymization through iML. The approach is built on the Hadoop MapReduce programming framework in combination with classification datasets. We devised an experiment concerning the clustering of data points, optionally guided by manual preferences for the conservation of attributes, and verified the resulting constraints on the classification of the anonymized data into classes of income, education and marital status. The outcomes demonstrate that human bias with MapReduce in a neural network can contribute positively in common application areas, whereas more difficult applications require trained professionals or better data preparation. Further research is required on privacy preservation when data needs to be analyzed, shared and mined. In subsequent work, this approach can be extended to a deep-learning-based neural network to achieve k-anonymity for large-data applications.</p>
</sec>
</body>
<back><fn-group>
<fn fn-type="other">
<p><bold>Funding Statement:</bold> The authors received no specific funding for this study.</p>
</fn>
<fn fn-type="conflict">
<p><bold>Conflicts of Interest:</bold> The authors declare that they have no conflicts of interest to report regarding the present study.</p>
</fn>
</fn-group>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Garcia</surname></string-name>, <string-name><given-names>S. R.</given-names> <surname>Gallego</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Luengo</surname></string-name>, <string-name><given-names>J. M.</given-names> <surname>Benitez</surname></string-name> and <string-name><given-names>F.</given-names> <surname>Herrera</surname></string-name></person-group>, &#x201C;<article-title>Big data pre-processing: Methods and prospects&#x201D;</article-title>,&#x201D; <source>Big Data Analytics</source>, vol. <volume>42</volume>, no. <issue>6</issue>, pp. <fpage>1911</fpage>&#x2013;<lpage>1920</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>chen</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Mao</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Lin</surname></string-name></person-group>, &#x201C;<article-title>Big Data: A survey</article-title>,&#x201D; <source>Mobile Networks and Applications</source>, vol. <volume>19</volume>, no. <issue>2</issue>, pp. <fpage>171</fpage>&#x2013;<lpage>209</lpage>, <year>2014</year>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>B.</given-names> <surname>He</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Fang</surname></string-name>, <string-name><given-names>Q.</given-names> <surname>Luo</surname></string-name>, <string-name><given-names>N. K.</given-names> <surname>Govindaraju</surname></string-name> and <string-name><given-names>T.</given-names> <surname>Wang</surname></string-name></person-group>, &#x201C;<article-title>Mars: A map-reduce framework on graphics processors</article-title>,&#x201D; in <conf-name>Proc. of the 17th Int. Conf. on Parallel Architectures Computational Intelligence and Neuroscience and Compilation Techniques (PACT &#x2019;08)</conf-name>, <conf-loc>Toronto Ontario Canada</conf-loc>, pp. <fpage>260</fpage>&#x2013;<lpage>269</lpage>, <year>2008</year>. </mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Huang</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Xu</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Lie</surname></string-name></person-group>, &#x201C;<article-title>Map-reduce based parallel neural network enabling large scale machine learning</article-title>,&#x201D; <source>Hindawi</source>, vol. <volume>2015</volume>, pp. <fpage>1</fpage>&#x2013;<lpage>13</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>B.</given-names> <surname>Malle</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Kieseberg</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Holzinger</surname></string-name></person-group>, &#x201C;<article-title>Interactive anonymization for privacy aware machine learning</article-title>,&#x201D; in <conf-name>European Conf. on Machine Learning and Knowledge Discovery ECML-PKDD</conf-name>, <publisher-loc>Skopje, Macedonia</publisher-loc>, <publisher-name>The Former Yugoslav Republic of Macedonia</publisher-name>, pp. <fpage>15</fpage>&#x2013;<lpage>26</lpage>, <year>2017</year>. </mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>X.</given-names> <surname>Xiao</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Wang</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Gehrke</surname></string-name></person-group>, &#x201C;<article-title>Interactive anonymization of sensitive data</article-title>,&#x201D; in <conf-name>Proc. of the 35th SIGMOD Int. Conf. on Management of Data - SIGMOD &#x2019;09</conf-name>, Association for Computing Machinery, <conf-loc>New York</conf-loc>, pp. <fpage>1051</fpage>&#x2013;<lpage>1054</lpage>, <year>2009</year>. </mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>K.</given-names> <surname>Xu</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Yue</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Guo</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Guo</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Fang</surname></string-name></person-group>, &#x201C;<article-title>Privacy-preserving machine learning algorithms for big data Systems</article-title>,&#x201D; in <conf-name>Int. Conf. on Distributed Computing Systems</conf-name>, <publisher-name>IEEE</publisher-name>, <comment>Columbus</comment>, <conf-loc>USA</conf-loc>, <year>2015</year>. </mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Campan</surname></string-name> and <string-name><given-names>T. M.</given-names> <surname>Truta</surname></string-name></person-group>, &#x201C;<chapter-title>Data and structural k-anonymity in social networks</chapter-title>,&#x201D; in <source>Privacy, Security, and Trust in KDD</source>. vol. <volume>5456</volume>, <publisher-loc>United States</publisher-loc>: <publisher-name>Springer</publisher-name>, pp. <fpage>33</fpage>&#x2013;<lpage>54</lpage>, <year>2009</year>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Zheng</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Yue</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Pan</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Wu</surname></string-name> and <string-name><given-names>F.</given-names> <surname>Yang</surname></string-name></person-group>, &#x201C;<article-title>K-anonymity location privacy algorithm based on clustering</article-title>,&#x201D; <source>IEEE Access</source>, vol. <volume>6</volume>, pp. <fpage>28328</fpage>&#x2013;<lpage>28338</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Miche</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Ren</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Oliver</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Holtmanns</surname></string-name></person-group>, &#x201C;<article-title>A framework for privacy quantification: measuring the impact of privacy techniques through mutual information, distance mapping, and machine learning</article-title>,&#x201D; <source>Springer Cognitive Computation</source>, vol. <volume>11</volume>, no. <issue>2</issue>, pp. <fpage>241</fpage>&#x2013;<lpage>261</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Sweeney</surname></string-name></person-group>, &#x201C;<article-title>K-anonymity: A model for protecting privacy</article-title>,&#x201D; <source>International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems</source>, vol. <volume>10</volume>, no. <issue>5</issue>, pp. <fpage>557</fpage>&#x2013;<lpage>570</lpage>, <year>2012</year>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>K. L.</given-names> <surname>Du</surname></string-name></person-group>, &#x201C;<article-title>Clustering: A neural network approach</article-title>,&#x201D; <source>Neural Networks Elsevier</source>, vol. <volume>23</volume>, no. <issue>1</issue>, pp. <fpage>89</fpage>&#x2013;<lpage>107</lpage>, <year>2010</year>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>B.</given-names> <surname>Zhou</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Pei</surname></string-name></person-group>, &#x201C;<article-title>The k-anonymity and l-diversity approaches for privacy preservation in social networks against neighborhood attacks</article-title>,&#x201D; <source>Knowledge and Information Systems</source>, vol. <volume>28</volume>, no. <issue>1</issue>, pp. <fpage>47</fpage>&#x2013;<lpage>77</lpage>, <year>2011</year>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Sweeney</surname></string-name></person-group>, &#x201C;<article-title>k-anonymity: A model for protecting privacy</article-title>,&#x201D; <source>International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems</source>, vol. <volume>10</volume>, no. <issue>5</issue>, pp. <fpage>557</fpage>&#x2013;<lpage>570</lpage>, <year>2012</year>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>C. L. P.</given-names> <surname>Chen</surname></string-name> and <string-name><given-names>C. Y.</given-names> <surname>Zhang</surname></string-name></person-group>, &#x201C;<article-title>Data-intensive applications, challenges, techniques and technologies: A survey on Big Data</article-title>,&#x201D; <source>Information Sciences</source>, vol. <volume>275</volume>, no. <issue>4</issue>, pp. <fpage>314</fpage>&#x2013;<lpage>347</lpage>, <year>2014</year>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>Apache</given-names> <surname>Hadoop</surname></string-name></person-group> <year>2015</year>. [Online]. Available: <uri>http://hadoop.apache.org</uri>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Venner</surname></string-name></person-group>, <source>Pro Hadoop</source>. <publisher-loc>New York, NY, USA</publisher-loc>: <publisher-name>Springer</publisher-name>, <year>2009</year>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><given-names>X.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Nepal</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Dou</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> <source>A map-reduce based approach of scalable multidimentional anonymization for big data privacy preservation on cloud</source>. <publisher-name>IEEE Third International Conference on Cloud and Green Computing</publisher-name>, <conf-loc>USA</conf-loc>, <year>2013</year>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>U.</given-names> <surname>Selvi</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Pushpa</surname></string-name></person-group>, &#x201C;<article-title>A review of big data and anonymization algorithms</article-title>,&#x201D; <source>International Journal of Applied Engineering Research</source>, vol. <volume>10</volume>, no. <issue>17</issue>, pp. <fpage>13125</fpage>&#x2013;<lpage>13130</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>U.</given-names> <surname>Selvi</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Pushpa</surname></string-name></person-group>, &#x201C;<article-title>Big data feature selection to achieve anonymization, Invention Communication and computational technologies</article-title>,&#x201D; <source>Spinger</source>, vol. <volume>637</volume>, pp. <fpage>59</fpage>&#x2013;<lpage>67</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>B.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Mohammed</surname></string-name>, <string-name><given-names>V.</given-names> <surname>Dave</surname></string-name> and <string-name><given-names>M. A.</given-names> <surname>Hasan</surname></string-name></person-group>, &#x201C;<article-title>Feature selection for classification under anonymity constraint</article-title>,&#x201D; <source>ACM</source>, vol. <volume>10</volume>, no. <issue>1</issue>, pp. <fpage>61</fpage>&#x2013;<lpage>81</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Yu</surname></string-name> and <string-name><given-names>H.</given-names> <surname>Liu</surname></string-name></person-group>, &#x201C;<article-title>Feature selection for high-dimensional data: A fast correlation-based filter solution</article-title>,&#x201D; in <conf-name>Proc. of the Twentieth Int. Conf. on Machine Learning</conf-name>, <conf-loc>Washington, DC</conf-loc>, <year>2003</year>. </mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>D.</given-names> <surname>Peralta</surname></string-name>, <string-name><given-names>S. Del</given-names> <surname>Rio</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Ramirez-Gallego</surname></string-name></person-group>, &#x201C;<article-title>Evolutionary feature selection for big data classification: A map reduce approach</article-title>,&#x201D; <source>Hindawi</source>, vol. <volume>2015</volume>, pp. <fpage>1</fpage>&#x2013;<lpage>12</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><given-names>B.</given-names> <surname>Malle</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Kieseberg</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Holzinger</surname></string-name></person-group>, &#x201C;<chapter-title>Do not disturb? classifier behavior on perturbed datasets</chapter-title>,&#x201D; in <source>Machine Learning and Knowledge Extraction, IFIP CD-MAKE</source>, <series>Lecture Notes in Computer Science LNCS</series>, vol. <volume>10410</volume>. <publisher-loc>Cham</publisher-loc>: <publisher-name>Springer</publisher-name>, pp. <fpage>155</fpage>&#x2013;<lpage>173</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Holzinger</surname></string-name></person-group>, &#x201C;<article-title>Interactive machine learning for health informatics: When do we need the human-in-the-loop?</article-title>,&#x201D; <source>Springer Brain Informatics (BRIN)</source>, vol. <volume>3</volume>, no. <issue>2</issue>, pp. <fpage>119</fpage>&#x2013;<lpage>131</lpage>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>C.</given-names> <surname>Moque</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Pomares</surname></string-name> and <string-name><given-names>R.</given-names> <surname>Gonzalez</surname></string-name></person-group>, &#x201C;<article-title>A proposal for interactive anonymization of electronic medical records</article-title>,&#x201D; <source>Procedia Technology</source>, vol. <volume>5</volume>, pp. <fpage>743</fpage>&#x2013;<lpage>752</lpage>, <year>2012</year>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>B.</given-names> <surname>Malle</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Kieseberg</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Weippl</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Holzinger</surname></string-name></person-group>, &#x201C;<article-title>The right to be forgotten: towards machine learning on perturbed knowledge bases</article-title>,&#x201D; in <conf-name>Int. Conf. on Availability, Reliability, and Security</conf-name>, <publisher-name>Springer</publisher-name>, <publisher-loc>Salzburg, Austria</publisher-loc>, pp. <fpage>251</fpage>&#x2013;<lpage>266</lpage>, <year>2016</year>. </mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Holzinger</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Plass</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Holzinger</surname></string-name>, <string-name><given-names>G. C.</given-names> <surname>Crisan</surname></string-name>, <string-name><given-names>C. M.</given-names> <surname>Pintea</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Towards interactive machine learning (iml): Applying ant colony algorithms to solve the traveling salesman problem with the human-in-the-loop approach</article-title>,&#x201D; in <conf-name>IFIP Int. Cross Domain Conf. and Workshop (CD-ARES)</conf-name>, <publisher-loc>Heidelberg, Berlin, New York</publisher-loc>, <publisher-name>Springer</publisher-name>, pp. <fpage>81</fpage>&#x2013;<lpage>95</lpage>, <year>2016</year>. </mixed-citation></ref>
</ref-list>
</back>
</article>