<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">29420</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2023.029420</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Data Augmentation and Random Multi-Model Deep Learning for Data Classification</article-title>
<alt-title alt-title-type="left-running-head">Data Augmentation and Random Multi-Model Deep Learning for Data Classification</alt-title>
<alt-title alt-title-type="right-running-head">Data Augmentation and Random Multi-Model Deep Learning for Data Classification</alt-title>
</title-group>
<contrib-group content-type="authors">
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Harby</surname><given-names>Fatma</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>Thaljaoui</surname><given-names>Adel</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Nayab</surname><given-names>Durre</given-names></name><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-4" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Aladhadh</surname><given-names>Suliman</given-names></name><xref ref-type="aff" rid="aff-3">3</xref><email>s.aladhadh@qu.edu.sa</email></contrib>
<contrib id="author-5" contrib-type="author">
<name name-style="western"><surname>Khediri</surname><given-names>Salim EL</given-names></name><xref ref-type="aff" rid="aff-3">3</xref>
<xref ref-type="aff" rid="aff-4">4</xref></contrib>
<contrib id="author-6" contrib-type="author">
<name name-style="western"><surname>Khan</surname><given-names>Rehan Ullah</given-names></name><xref ref-type="aff" rid="aff-3">3</xref></contrib>
<aff id="aff-1"><label>1</label><institution>Computer Science Department, Future Academy-Higher Future Institute for Specialized Technological Studies</institution>, <country>Egypt</country></aff>
<aff id="aff-2"><label>2</label><institution>Department of Computer Systems Engineering, Faculty of Electrical and Computer Engineering, University of Engineering and Technology</institution>, <addr-line>Peshawar, 25120</addr-line>, <country>Pakistan</country></aff>
<aff id="aff-3"><label>3</label><institution>Department of Information Technology, College of Computer, Qassim University</institution>, <addr-line>Buraydah</addr-line>, <country>Saudi Arabia</country></aff>
<aff id="aff-4"><label>4</label><institution>Department of Computer Sciences, Faculty of Sciences of Gafsa, University of Gafsa</institution>, <addr-line>Gafsa</addr-line>, <country>Tunisia</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Suliman Aladhadh. Email: <email>s.aladhadh@qu.edu.sa</email></corresp>
</author-notes>
<pub-date publication-format="print" date-type="pub" iso-8601-date="2022-12-15"><day>15</day>
<month>12</month>
<year>2022</year></pub-date>
<volume>74</volume>
<issue>3</issue>
<fpage>5191</fpage>
<lpage>5207</lpage>
<history>
<date date-type="received"><day>03</day><month>3</month><year>2022</year></date>
<date date-type="accepted"><day>06</day><month>6</month><year>2022</year></date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2023 Harby et al.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Harby et al.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_29420.pdf"></self-uri>
<abstract>
<p>In the machine learning (ML) paradigm, data augmentation serves as a regularization approach when building ML models. Increasing the diversity of training samples improves generalization, which in turn enhances the prediction performance of classifiers on unseen examples. Deep learning (DL) models have many parameters and therefore frequently overfit. Data augmentation plays a major role in the latest improvements in DL precisely because it mitigates overfitting. Nevertheless, reliable data collection remains a major limiting factor. This problem is commonly addressed by combining data augmentation, transfer learning, dropout, and batch normalization. In this paper, we introduce the application of data augmentation to image classification using Random Multi-model Deep Learning (RMDL), which combines multiple DL architectures to yield random models for classification. We present a methodology that uses Generative Adversarial Networks (GANs) to generate images for data augmentation. Through experiments, we find that feeding GAN-generated samples into RMDL improves both accuracy and model efficiency. Experiments on the MNIST and CIFAR-10 datasets show that the error rate decreases with the proposed approach across different random models.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Data augmentation</kwd>
<kwd>generative adversarial networks</kwd>
<kwd>classification</kwd>
<kwd>machine learning</kwd>
<kwd>random multi-model deep learning</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1"><label>1</label><title>Introduction</title>
<p>In the data science community, classification and categorization of complex data such as images, video, and documents are crucial challenges. In recent years, there has been a growing interest in applying DL structures and architectures to such problems.</p>
<p>However, popular deep architectures are designed for a specific data type or domain. Therefore, there is an essential need to further improve information-handling approaches for categorization and classification across a wide variety of data types.</p>
<p>Even though DL has been successfully utilized to solve classification problems [<xref ref-type="bibr" rid="ref-1">1</xref>], the key issue is deciding which DL architecture, such as a Deep Neural Network (DNN), Convolutional Neural Network (CNN) or Recurrent Neural Network (RNN), to use for a given task. The numbers of nodes (units) and hidden layers that work well also vary with the data type and application structure. Hence, in practice this issue is resolved by trial and error for each application or dataset.</p>
<p>The difficulty of deploying ensembles of deep architectures is addressed in this paper with a method known as RMDL [<xref ref-type="bibr" rid="ref-2">2</xref>]. In short, RMDL is a method that incorporates three DL architectures: CNN, DNN and RNN. Experiments with numerous sorts of data have shown that this method is accurate, dependable and efficient.</p>
<p>For their input layers, the three basic DL designs use various feature space techniques. DNN, for example, extracts features from text using term frequency-inverse document frequency (TF-IDF) [<xref ref-type="bibr" rid="ref-3">3</xref>]. RMDL uses randomly generated hyper-parameters to set the number of hidden layers and the number of nodes (density) in each DNN hidden layer. In CNNs, RMDL selects hyper-parameters by using random feature maps and random numbers of hidden layers.</p>
<p>The CNN structures used by RMDL are a 1-dimensional CNN (Conv1D), whose kernel moves along a single axis, so it is used for text; a 2-dimensional CNN (Conv2D), which applies kernels that stride over a 2-dimensional space, so it is used for pictures; and a 3-dimensional CNN (Conv3D), whose kernels move in three dimensions, making it suitable for video processing [<xref ref-type="bibr" rid="ref-1">1</xref>]. Text classification is primarily accomplished using RNN architectures. The RMDL model employs two RNN structures: Gated Recurrent Units (GRUs) and Long Short-Term Memory (LSTM). The RMDL&#x2019;s numbers of GRU or LSTM units and of hidden layers are determined through a search over randomly generated hyper-parameters [<xref ref-type="bibr" rid="ref-1">1</xref>].</p>
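<p>As an illustration (a sketch we add here, not part of RMDL itself), the only difference between Conv1D, Conv2D, and Conv3D is the number of axes along which the kernel slides. A minimal NumPy cross-correlation makes this concrete:</p>

```python
import numpy as np

def convolve(x, k):
    """Valid-mode N-D cross-correlation: slide kernel k over x along every axis."""
    out_shape = tuple(xs - ks + 1 for xs, ks in zip(x.shape, k.shape))
    out = np.zeros(out_shape)
    for idx in np.ndindex(*out_shape):
        window = x[tuple(slice(i, i + ks) for i, ks in zip(idx, k.shape))]
        out[idx] = np.sum(window * k)
    return out

# Conv1D: the kernel moves along a single axis (e.g., a text sequence)
print(convolve(np.ones(10), np.ones(3)).shape)             # (8,)
# Conv2D: the kernel strides over a 2-dimensional space (e.g., an image)
print(convolve(np.ones((28, 28)), np.ones((5, 5))).shape)  # (24, 24)
# Conv3D: the kernel moves in three dimensions (e.g., a video clip)
print(convolve(np.ones((16, 28, 28)), np.ones((3, 5, 5))).shape)  # (14, 24, 24)
```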
<p>The rest of the paper is organized as follows. Section 2 reviews related work. The proposed approach for data augmentation using GANs is described in Section 3. Section 4 discusses an approach for text augmentation. Section 5 presents an overview of the RMDL used for classification. The experimental results are elaborated in Section 6. Finally, Section 7 concludes the paper.</p>
</sec>
<sec id="s2"><label>2</label><title>Related Works</title>
<p>Generally, a model trained on a small set of samples does not generalize well to the validation and test sets; such models may suffer from the overfitting problem. Numerous approaches have been proposed to reduce overfitting [<xref ref-type="bibr" rid="ref-4">4</xref>]. The simplest method is to add a regularization term on the norm of the weights. Another common technique is dropout.</p>
<p>Dropout is a probabilistic workaround that also lowers computation: neurons (units) are dropped randomly during training. Because units are dropped at random, neurons are encouraged to be independent. At test time, this has the effect of averaging the predictions over several networks [<xref ref-type="bibr" rid="ref-5">5</xref>,<xref ref-type="bibr" rid="ref-6">6</xref>]. In [<xref ref-type="bibr" rid="ref-7">7</xref>], the authors show that dropout attains superior results on various test datasets.</p>
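<p>A minimal sketch of inverted dropout (our illustration; the function name and the rescaling convention are implementation choices, not taken from the cited works):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p_drop, train=True):
    """Inverted dropout: randomly zero units with probability p_drop during
    training and rescale the survivors so the expected activation is unchanged;
    at test time the layer is the identity (the 'averaging' effect)."""
    if not train or p_drop == 0.0:
        return x
    mask = rng.random(x.shape) >= p_drop
    return x * mask / (1.0 - p_drop)

a = np.ones(100000)
out = dropout(a, 0.5)
print(out.mean())  # close to 1.0 in expectation
```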
<p>Batch normalization is another common technique to avoid overfitting; it normalizes layer inputs and permits training of the normalization weights. Batch normalization can be applied to any layer within the network and works effectively, particularly when utilized in GANs [<xref ref-type="bibr" rid="ref-8">8</xref>] such as CycleGAN [<xref ref-type="bibr" rid="ref-9">9</xref>]. Furthermore, transfer learning solves a problem efficiently by initializing a neural net with weights pre-trained on relevant or more general data and suitable parameters.</p>
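<p>The normalization step can be sketched in a few lines of NumPy (an illustrative sketch; gamma and beta stand for the trainable normalization weights mentioned above):</p>

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature over the batch to zero mean / unit variance,
    then apply the trainable scale (gamma) and shift (beta)."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.default_rng(1).normal(5.0, 3.0, size=(64, 4))
y = batch_norm(x)
print(y.mean(axis=0))  # ~0 per feature
print(y.std(axis=0))   # ~1 per feature
```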
<p>Previous research to systematically understand the benefits and limitations of data augmentation demonstrate that data augmentation can act as a regularizer in preventing overfitting in neural networks [<xref ref-type="bibr" rid="ref-10">10</xref>].</p>
<p>Data augmentation is not a novel field, and several data augmentation methods have been applied to specific problems. Data augmentation is the procedure of enlarging the training dataset by creating more samples from the existing training data.</p>
<p>There are several approaches to augment data. Geometric transformations and color augmentation are famous approaches when using images to increase dataset size. Smart Augmentation (SA) [<xref ref-type="bibr" rid="ref-11">11</xref>] and Neural Augmentation (NA) [<xref ref-type="bibr" rid="ref-12">12</xref>] have been proposed in recent years for similar tasks. During the training of the target network, smart augmentation creates a network that is learned to generate augmented data such that the loss of the target networks is minimized.</p>
<p>NA behaves like SA in that both use separate networks to perform data augmentation that improves the classifier. However, NA takes as input a pair of images randomly selected from the same class. Manual data augmentation and data augmentation through training neural networks are the two main types of data augmentation approaches; in the latter, different neural network topologies can be employed. The focus of this paper is on GANs.</p>
<p>Augmentation in image classification methods artificially generates training images by altering available images [<xref ref-type="bibr" rid="ref-13">13</xref>]. Classification tasks benefit from augmenting many images: the structure of the images is varied to increase the number of samples available to the ML algorithm while adding flexibility to the final model [<xref ref-type="bibr" rid="ref-14">14</xref>].</p>
<p>As such, data augmentation techniques belong to the category of data warping, a family of methods that directly augment the input data in data space. The idea of augmentation was demonstrated on the MNIST dataset in [<xref ref-type="bibr" rid="ref-15">15</xref>]. A very common and conventional way to augment image data is to apply geometric and color transformations, such as reflecting, translating, and cropping the image and changing its color palette, as in [<xref ref-type="bibr" rid="ref-13">13</xref>,<xref ref-type="bibr" rid="ref-16">16</xref>]. The performance of classifiers on the MNIST database was further improved by the introduction of elastic deformations in [<xref ref-type="bibr" rid="ref-17">17</xref>], in addition to the existing affine transformations.</p>
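<p>Typical geometric augmentations of this kind (reflection, translation, cropping) can be sketched as follows; this is an illustrative NumPy sketch, not code from the cited works:</p>

```python
import numpy as np

def hflip(img):
    """Reflect the image horizontally."""
    return img[:, ::-1]

def translate(img, dy, dx):
    """Shift the image by (dy, dx), padding the vacated region with zeros."""
    out = np.zeros_like(img)
    h, w = img.shape[:2]
    out[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
        img[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
    return out

def random_crop(img, size, rng):
    """Cut a size x size patch at a random position."""
    h, w = img.shape[:2]
    y = rng.integers(0, h - size + 1)
    x = rng.integers(0, w - size + 1)
    return img[y:y + size, x:x + size]

rng = np.random.default_rng(0)
img = np.arange(28 * 28, dtype=float).reshape(28, 28)
print(hflip(img).shape, translate(img, 2, -3).shape, random_crop(img, 24, rng).shape)
```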
<p>In [<xref ref-type="bibr" rid="ref-18">18</xref>], a method called neural augmentation is proposed to allow a neural net to learn augmentations that improve its ability to correctly classify images. The authors proposed two different approaches to data augmentation. The first generates augmented data before training the classifier by applying GANs and basic transformations to create a larger dataset. The second attempts to learn augmentation through a prepended neural net. AutoAugment [<xref ref-type="bibr" rid="ref-19">19</xref>], developed by Cubuk et al., takes a much different approach to meta-learning than Neural Augmentation or Smart Augmentation: it is a Reinforcement Learning algorithm that searches for an optimal augmentation policy among a constrained set of geometric transformations with miscellaneous levels of distortion.</p>
<p>Data augmentation techniques generally fall under two categories: Firstly, data augmentation by manual approaches, and secondly, using neural networks. In this paper, we particularly emphasize using GANs. In this approach, new data samples are generated from the distribution, which is learned from already available data. Therefore, we believe that it is a good alternative compared to manual augmentation.</p>
</sec>
<sec id="s3"><label>3</label><title>Generative Adversarial Networks (GANs)</title>
<p>In fields like Computer Vision (CV), GANs have been extensively employed to generate new images for training. Moreover, GANs have been effective even with relatively small datasets by learning techniques as in [<xref ref-type="bibr" rid="ref-20">20</xref>]. One neural net produces counterfeit examples from the original data distribution, using a min-max strategy to trick the other net; subsequently, the other net is trained to better differentiate the counterfeits. For style transfer in CycleGAN, GANs are used to transfer images from one setting to another. Furthermore, GANs have proven very effective at augmenting datasets, for example by increasing the resolution of input images [<xref ref-type="bibr" rid="ref-21">21</xref>]. In many instances, GANs have accomplished a great deal of success.</p>
<p>The framework in [<xref ref-type="bibr" rid="ref-22">22</xref>] attains good overall performance on the MNIST [<xref ref-type="bibr" rid="ref-23">23</xref>] and CIFAR-10 [<xref ref-type="bibr" rid="ref-24">24</xref>] datasets. GANs have also proven effective in instance paragraph generation [<xref ref-type="bibr" rid="ref-25">25</xref>]. The proposed system in [<xref ref-type="bibr" rid="ref-26">26</xref>] achieved good results using CNNs, which can effectively extract and learn a huge number of features. The superior performance of CNNs is related to the sparse connections between neurons, which keep the number of variables in these networks low.</p>
<p>GANs commonly comprise a generator model G and a discriminator model D. The generator G learns a distribution that models the dataset. The discriminator D determines whether a sample comes from the true data distribution or from the generator&#x2019;s distribution. For images, the discriminator D distinguishes between actual and synthetic images, while the generator G aids in the creation of natural-looking images. While the generator attempts to deceive the discriminator, the discriminator attempts to avoid being deceived by the generator. GANs have been found to be particularly unstable to train, resulting in generators that produce insufficient outputs.</p>
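<p>The min-max objective can be made concrete with the standard binary cross-entropy losses for D and G (a sketch of the usual GAN objective under common conventions, not the exact training code used in this work):</p>

```python
import numpy as np

def d_loss(d_real, d_fake):
    """Discriminator's binary cross-entropy: push D(x) -> 1 on real samples
    and D(G(z)) -> 0 on counterfeits."""
    return -np.mean(np.log(d_real) + np.log(1.0 - d_fake))

def g_loss(d_fake):
    """Non-saturating generator loss: push D(G(z)) -> 1 to fool D."""
    return -np.mean(np.log(d_fake))

# A confident discriminator (D(x)=0.9 on real, D(G(z))=0.1 on fake) has a
# low loss; the generator's loss falls as its fakes become more convincing.
print(d_loss(np.array([0.9]), np.array([0.1])))
print(g_loss(np.array([0.1])) > g_loss(np.array([0.9])))
```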
<p>To overcome this problem, in this paper we introduce the advantage of using Deep Convolutional Generative Adversarial Networks (DCGAN). Based on these generative models, a hierarchy of abstract features can be learned from parts of objects in the scene. As such, the learned features can characterize the distribution of the underlying data, and samples generated from these features resemble real-world scenarios. Significantly, DCGAN achieves good results mostly because of the stability of its architecture in training, which affects the quality of the samples. Both the generator and discriminator models in DCGAN are CNNs rather than multilayer perceptrons.</p>
<p>In this paper, we describe a method for augmenting data using DCGAN. The proposed framework trains GANs across a range of datasets, which leads to stable training. Our framework relies on three modifications to the CNN architecture. Traditionally, in a CNN a pooling layer follows each convolutional layer; applying pooling down-samples the previous input. Nonetheless, it was later validated that making the network all-convolutional and letting it learn its own spatial downsampling improves CNN performance. This is the first interesting contribution of the proposed architecture.</p>
<p>Generally, a neural network architecture is built from a block of CNN layers followed by stacked fully connected layers. The last layer is fully connected to each node in the previous layer and has a Softmax activation function specifying the probability that the image belongs to a specific class. Such fully connected layers have abundant parameters, resulting in overfitting; thus, dropout is used to avoid overfitting.</p>
<p>Recently, global average pooling has been proposed to reduce overfitting by reducing the number of parameters in the network. Global average pooling reduces spatial dimensions in a similar manner to max pooling, but it reduces the dimensions far more efficiently: simply by taking the average of the WH spatial values, a <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:mi>W</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:math></inline-formula> tensor is converted to <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:math></inline-formula>. To calculate the class probabilities on this layer, the Softmax activation function is applied. This is the second contribution of our architecture.</p>
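<p>Global average pooling itself is a one-line operation; an illustrative NumPy sketch:</p>

```python
import numpy as np

def global_average_pool(x):
    """Average over the spatial axes: a (W, H, d) tensor becomes (1, 1, d)."""
    return x.mean(axis=(0, 1), keepdims=True)

x = np.random.default_rng(2).random((8, 8, 16))  # W x H x d
print(global_average_pool(x).shape)  # (1, 1, 16)
```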
<p>Nevertheless, global average pooling increases the stability of the model at the cost of convergence speed. In [<xref ref-type="bibr" rid="ref-22">22</xref>], the authors proposed connecting the maximum features to the generator&#x2019;s input and the discriminator&#x2019;s output, respectively. Commonly, during GAN training the generator collapses to generating samples from the same point. Hence, to avoid this problem, we suggest using batch normalization. The batch normalization approach normalizes the input of each unit to zero mean and unit variance. This helps with poor initialization and likewise alleviates vanishing gradients, exploding gradients, and further gradient-descent problems. This is the third contribution of our architecture.</p>
<p>As mentioned already, generating high-resolution images with GANs is an unstable procedure. Consequently, selecting the same generator and discriminator architecture for diverse datasets with varying image resolutions is illogical. Therefore, our architecture is modeled such that the number of CNN layers in the generator and discriminator depends on the image resolution. The number of convolutional layers in the generator is calculated as:
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:mi>n</mml:mi><mml:mi>u</mml:mi><mml:mi>m</mml:mi><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mrow><mml:mtext mathvariant="italic">layers</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext mathvariant="italic">height</mml:mtext></mml:mrow><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mrow><mml:mtext mathvariant="italic">image</mml:mtext></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>2</mml:mn></mml:math></disp-formula></p>
<p>Conversely, the number of convolutional layers in the discriminator is calculated as <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:mrow><mml:mo>(</mml:mo><mml:mi>n</mml:mi><mml:mi>u</mml:mi><mml:mi>m</mml:mi><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mrow><mml:mtext mathvariant="italic">layers</mml:mtext></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>. A vector drawn from a normal or uniform distribution is used as the generator input. Hence, the initial layer in the generator architecture can be a fully connected layer implementing a matrix multiplication, whose output is reshaped into a four-dimensional tensor. Batch normalization is then applied to this 4-dimensional tensor. The resulting tensor serves as the start of the convolution stack and has size <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>4</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>4</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mi>c</mml:mi></mml:math></inline-formula>, where <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:mi>m</mml:mi></mml:math></inline-formula> represents the number of images in the batch and <italic>c</italic> is the number of convolutional features, which is computed as:
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:mi>c</mml:mi><mml:mo>=</mml:mo><mml:mn>128</mml:mn><mml:mo>&#x2217;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mi>n</mml:mi><mml:mi>u</mml:mi><mml:mi>m</mml:mi><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mrow><mml:mtext mathvariant="italic">layers</mml:mtext></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>It is expected for each convolutional layer to take a tensor of the following form for the last layer in the stack:
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>4</mml:mn><mml:mo>&#x2217;</mml:mo><mml:mrow><mml:mtext mathvariant="italic">layer</mml:mtext></mml:mrow><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mrow><mml:mtext mathvariant="italic">number</mml:mtext></mml:mrow><mml:mo stretchy="false">]</mml:mo><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>4</mml:mn><mml:mo>&#x2217;</mml:mo><mml:mrow><mml:mtext mathvariant="italic">layer</mml:mtext></mml:mrow><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mrow><mml:mtext mathvariant="italic">number</mml:mtext></mml:mrow><mml:mo stretchy="false">]</mml:mo><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mi>c</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn><mml:mrow><mml:mover><mml:mrow /><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext mathvariant="italic">layer</mml:mtext></mml:mrow><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mrow><mml:mtext mathvariant="italic">number</mml:mtext></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">]</mml:mo></mml:math></disp-formula>where the tensors&#x2019; input and outputs are of the form:
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>4</mml:mn><mml:mo>&#x2217;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext mathvariant="italic">layer</mml:mtext></mml:mrow><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mrow><mml:mtext mathvariant="italic">number</mml:mtext></mml:mrow><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">]</mml:mo><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>4</mml:mn><mml:mo>&#x2217;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext mathvariant="italic">layer</mml:mtext></mml:mrow><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mrow><mml:mtext mathvariant="italic">number</mml:mtext></mml:mrow><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">]</mml:mo><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mi>f</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn><mml:mrow><mml:mover><mml:mrow /><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext mathvariant="italic">layer</mml:mtext></mml:mrow><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mrow><mml:mtext mathvariant="italic">number</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">]</mml:mo></mml:math></disp-formula>where: <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:mrow><mml:mtext mathvariant="italic">layer</mml:mtext></mml:mrow><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mrow><mml:mtext mathvariant="italic">number</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2026;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mi>n</mml:mi><mml:mi>u</mml:mi><mml:mi>m</mml:mi><mml:mi 
mathvariant="normal">&#x005F;</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mrow><mml:mtext mathvariant="italic">layers</mml:mtext></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>. Noticeably, in each layer the height and width are up-sampled by a factor of two while the number of features is down-sampled by a factor of two. Each up-sampling is performed by a transposed convolution layer with a 5-by-5 kernel and a stride of two in both the vertical and horizontal directions.</p>
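<p>Assuming the doubling rule just described (start from an m &#x00D7; 4 &#x00D7; 4 &#x00D7; c tensor with c = 128 &#x2217; (num_of_layers &#x2212; 1); each layer doubles the height and width and halves the feature count), the generator&#x2019;s tensor shapes can be traced with a short sketch (illustrative only; the batch size m = 32 is an arbitrary choice):</p>

```python
def generator_shapes(num_of_layers, m=32):
    """Trace tensor shapes through the generator stack: spatial dimensions
    double and feature counts halve at every transposed-convolution layer."""
    c = 128 * (num_of_layers - 1)   # Eq. (2)
    shapes = [(m, 4, 4, c)]         # reshaped fully connected output
    h = w = 4
    while len(shapes) <= num_of_layers:
        h, w, c = h * 2, w * 2, c // 2
        shapes.append((m, h, w, c))
    return shapes

for s in generator_shapes(3):
    print(s)
```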
<p>Subsequently, batch normalization must be applied to the output of each convolutional layer before it is taken as the input of the following CNN layer. The last layer in the convolutional stack yields an output of size <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>s</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>s</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>f</mml:mi></mml:math></inline-formula>, where <italic>f</italic> is the number of channels in the image and <italic>s</italic> is the image size. Batch normalization is not applied to the final layer so that oscillation of the samples is avoided and the stability of the model is maintained. Lastly, the tanh activation function is applied to the output, which is treated as the generated samples.</p>
<p>The discriminator takes as input an image of size <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>s</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>s</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:math></inline-formula>. Additionally, the discriminator has only one more convolutional layer than the number of layers in the generator. In the proposed architecture, the first layer is a convolutional layer; instead of decreasing the dimensions of its input, this layer produces further feature maps. More precisely, the output of this layer has size <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>s</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>s</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>64</mml:mn></mml:math></inline-formula>.</p>
<p>To avoid overfitting, dropout is applied to this layer output. Finally, the remaining convolutional layers are stacked taking input of the form: <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mi>s</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo>&#x2217;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext mathvariant="italic">layer</mml:mtext></mml:mrow><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mrow><mml:mtext mathvariant="italic">number</mml:mtext></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">]</mml:mo><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mi>s</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn><mml:mo>&#x2217;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext mathvariant="italic">layer</mml:mtext></mml:mrow><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mrow><mml:mtext mathvariant="italic">number</mml:mtext></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">]</mml:mo><mml:mo>&#x00D7;</mml:mo><mml:mn>64</mml:mn><mml:mo>&#x2217;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext mathvariant="italic">layer</mml:mtext></mml:mrow><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mrow><mml:mtext mathvariant="italic">number</mml:mtext></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula></p>
<p>where input and outputs yield a tensor of the form:
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mi>s</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext mathvariant="italic">layer</mml:mtext></mml:mrow><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mrow><mml:mtext mathvariant="italic">number</mml:mtext></mml:mrow><mml:mo>&#x2217;</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">]</mml:mo><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mi>s</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo>&#x2217;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext mathvariant="italic">layer</mml:mtext></mml:mrow><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mrow><mml:mtext mathvariant="italic">number</mml:mtext></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">]</mml:mo><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mtext mathvariant="italic">layer</mml:mtext></mml:mrow><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mrow><mml:mtext mathvariant="italic">number</mml:mtext></mml:mrow><mml:mo>&#x2217;</mml:mo><mml:mn>64</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></disp-formula></p>
<p><italic>where</italic> <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:mrow><mml:mtext mathvariant="italic">layer</mml:mtext></mml:mrow><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mrow><mml:mtext mathvariant="italic">number</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mn>2</mml:mn><mml:mo>&#x2026;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mi>u</mml:mi><mml:mi>m</mml:mi><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mrow><mml:mtext mathvariant="italic">layers</mml:mtext></mml:mrow><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. In each generator layer, the height and width are up-sampled by a factor of two and the number of feature maps is down-sampled by a factor of two. Conversely, in each discriminator layer, the height and width are down-scaled by a factor of two and the number of feature maps is up-scaled by a factor of two.</p>
<p>The proposed architecture is arranged so that the largest numbers of convolutional feature maps sit at the generator&#x2019;s input and the discriminator&#x2019;s output, respectively. From the second convolutional layer onward, convolution is applied in the vertical and horizontal directions to downscale, using a 3 &#x00D7; 3 kernel with a stride of two. In the generator, batch normalization is applied to the output of each convolutional layer. Moreover, dropout is applied to the output of each convolutional layer before it is passed as input to the next layer. The last layer in the convolutional stack yields an output of size <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:mn>64</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>4</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>4</mml:mn><mml:mo>&#x2217;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mi>n</mml:mi><mml:mi>u</mml:mi><mml:mi>m</mml:mi><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mrow><mml:mtext mathvariant="italic">layers</mml:mtext></mml:mrow><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>.</p>
<p>Global average pooling is then applied to the output of the last convolutional layer, yielding a tensor of size <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>64</mml:mn><mml:mo>&#x2217;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mi>n</mml:mi><mml:mi>u</mml:mi><mml:mi>m</mml:mi><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mrow><mml:mtext mathvariant="italic">layers</mml:mtext></mml:mrow><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>. The final layer of the proposed architecture is fully connected, and its output is a matrix of size <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>. Finally, a softmax activation function is applied to this output, estimating the probability that an image belongs to each of the <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> classes. Algorithm 1 describes the proposed data augmentation using GANs.</p>
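<p>As an illustration, the per-layer shape bookkeeping described above can be sketched in Python. This is a minimal sketch of the stated halving/doubling rule, not the authors&#x2019; implementation; the function name is ours, and we assume <italic>s</italic> is divisible by the relevant powers of two:</p>

```python
def discriminator_shapes(m, s, f, num_of_layers):
    """Tensor shapes through the discriminator stack: the first convolution
    keeps the spatial size and emits 64 feature maps; each later layer
    halves the height/width and doubles the feature maps, per the
    scaling rule stated in the text."""
    shapes = [(m, s, s, f), (m, s, s, 64)]   # input image, then first conv
    side, features = s, 64
    for _ in range(num_of_layers):           # one extra layer vs. generator
        side //= 2
        features *= 2
        shapes.append((m, side, side, features))
    return shapes
```

For example, a 32 &#x00D7; 32 RGB batch of 8 images with three further layers ends in an 8 &#x00D7; 4 &#x00D7; 4 &#x00D7; 512 tensor before pooling.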
<statement id="st1" content-type="algorithm">
<label>Algorithm 1:</label>
<title>Data Augmentation using GANs</title>
<p><bold>Input:</bold> <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:mi>Z</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>&#x2026;</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula> with <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula></p>
<p>&#x2002;&#x2002;1. Initialization of the weights in both generator and discriminator</p>
<p>&#x2002;&#x2002;2. <bold>for</bold> <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:mi>j</mml:mi><mml:mo stretchy="false">&#x2190;</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula> <bold>to</bold> <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:mi>L</mml:mi></mml:math></inline-formula> <bold>do</bold></p>
<p>&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;Update the weights of discriminator using back propagation technique, in order to minimize loss of the discriminator</p>
<p>&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;Update generators&#x2019; weights using back propagation technique, in order to reduce the loss of generator</p>
<p>&#x2002;&#x2002;&#x2002;&#x2002;<bold>end for</bold></p>
<p>&#x2002;&#x2002;3. <bold>for</bold> <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:mi>j</mml:mi><mml:mo stretchy="false">&#x2190;</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula> <bold>to</bold> <italic>C</italic> <bold>do</bold></p>
<p>&#x2002;&#x2002;4. &#x2002;&#x2002;<bold>foreach</bold> image <italic>y</italic> in <inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:msup><mml:mi>j</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> class <bold>do</bold></p>
<p>&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;Learn inverse mapping z of image <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:mi>y</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mtext mathvariant="italic">using</mml:mtext></mml:mrow><mml:mrow><mml:mtext mathvariant="italic">gradient</mml:mtext></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mtext mathvariant="italic">based</mml:mtext></mml:mrow><mml:mrow><mml:mtext mathvariant="italic">model</mml:mtext></mml:mrow></mml:math></inline-formula></p>
<p>&#x2002;&#x2002;5. &#x2002;&#x2002;&#x2002;&#x2002;<bold>for</bold> <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:mi>k</mml:mi><mml:mo stretchy="false">&#x2190;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> <bold>to</bold> <italic>A</italic> <bold>do</bold></p>
<p>&#x2002;&#x2002;6. &#x2002;&#x2002;&#x2002;&#x2002;Add random noise <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>z</mml:mi></mml:math></inline-formula> to <italic>z</italic> for producing <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:msup><mml:mi>z</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula></p>
<p>&#x2002;&#x2002;7.&#x2002;&#x2002;&#x2002;&#x2002;Create image <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:msup><mml:mi>z</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> as <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>q</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>z</mml:mi><mml:mo>+</mml:mo><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>z</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula></p>
<p>&#x2002;&#x2002;8.&#x2002;&#x2002;&#x2002;&#x2002;Add <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>z</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> to <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:msup><mml:mi>Z</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula></p>
<p>&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;<bold>end for</bold></p>
<p>&#x2002;&#x2002;&#x2002;&#x2002;<bold>end for</bold></p>
<p>&#x2002;&#x2002;<bold>end for</bold></p>
<p><bold>Output:</bold> <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:msup><mml:mi>Z</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo>{</mml:mo><mml:msubsup><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2026;</mml:mo><mml:mo>&#x2026;</mml:mo><mml:msubsup><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula> where <inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:mrow><mml:mo>(</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x003E;</mml:mo><mml:mi>m</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula></p>
</statement>
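<p>The latent-perturbation loop of Algorithm 1 (steps 4&#x2013;8) can be sketched as follows. This is a hedged NumPy illustration, not the authors&#x2019; code: <monospace>generator</monospace> stands in for the trained GAN generator <italic>p</italic><sub><italic>q</italic></sub>, and the inverse mappings <italic>z</italic> are assumed to be already learned by the gradient-based model:</p>

```python
import numpy as np

def augment_latents(Z, generator, A=3, noise_scale=0.05, seed=0):
    """Steps 4-8 of Algorithm 1: for each learned latent code z, add random
    noise dz and decode z' = z + dz through the generator, A times,
    keeping the class label j of the source image."""
    rng = np.random.default_rng(seed)
    Z_aug = []
    for z, label in Z:
        for _ in range(A):
            dz = noise_scale * rng.standard_normal(z.shape)
            Z_aug.append((generator(z + dz), label))
    return Z_aug
```

Because every source image yields <italic>A</italic> perturbed copies, the augmented set is strictly larger than the original, matching the output condition <italic>n</italic> &gt; <italic>m</italic>.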
<p>At training time, a training batch is generated from a batch of images and fed into the RMDL for classification. Subsequently, two images are randomly sampled from the same class and served as input to the augmentation network to generate the next augmented image.</p>
</sec>
<sec id="s4"><label>4</label><title>Text Augmentation</title>
<p>Text augmentation approaches involve different factors in creating robust models. Some techniques depend heavily on the details of the language [<xref ref-type="bibr" rid="ref-27">27</xref>], while others require little linguistic detail provided a language-based model is available [<xref ref-type="bibr" rid="ref-28">28</xref>].</p>
<p>Models of semantic data have improved over the past few years with the aid of distributed word-embedding representations [<xref ref-type="bibr" rid="ref-29">29</xref>]. New approaches to text understanding have been developed, such as the text classification approaches deployed in this research. Text augmentation techniques can improve classification accuracy and yield robust models, allowing classification models to be trained with less labeled data.</p>
<p>Data labeling is costly in both time and effort. In scenarios such as fake news identification [<xref ref-type="bibr" rid="ref-30">30</xref>], interpreting the context of political news, gauging the public response to administrative services [<xref ref-type="bibr" rid="ref-31">31</xref>], or critical situations such as emergencies where better coordination is required, aptly labeled data is very difficult to attain. ML techniques require a huge amount of data to perform well and give optimal results. Such big data, along with the resources to label it, are available and accessible to big organizations, but small organizations lack such resources or access to data.</p>
<p>Changes in the input data affect the performance of a classifier; a robust classifier must be able to handle such changes and respond well to shifts in the input distribution. Reasons for data changes include the evolution of language and changes in geographical aspects. Another use is semi-supervised learning, in which the few available labels are used to train a noisy classifier that labels more unlabeled data before feeding it back to train another classifier.</p>
<p>This work introduces a text augmentation system that replaces terms that are used similarly from a global perspective rather than a context-specific one. That is, rather than focusing on the ideal word to replace in a particular sentence, we consider how similar words are used across texts. We further assess the proposed technique on various classification datasets.</p>
<p>Limited data is an issue when it comes to acquiring well-labeled data for supervised learning tasks [<xref ref-type="bibr" rid="ref-32">32</xref>], and it is even more of a problem for low-resource languages [<xref ref-type="bibr" rid="ref-33">33</xref>]. The effect of augmentation on learning in DNN models is demonstrated here.</p>
<p>We employ pre-trained word vector representations from GloVe [<xref ref-type="bibr" rid="ref-34">34</xref>] for the DNN. For the AG News dataset, the pre-trained Wikipedia model is utilized, and for the social media datasets, the pre-trained Twitter model is used. In this experiment we augment the data five times across all the datasets. We also incorporate the original dataset, resulting in a six-fold enlarged dataset. When no augmentation is used, we simply rerun the original dataset six times to make the findings comparable.</p>
<p>Word2vec-based augmentation (learned semantic similarity): Word2vec is a strong augmentation method that uses a word embedding model [<xref ref-type="bibr" rid="ref-29">29</xref>] trained on a public dataset to locate the words most similar to a given input word. We use a pre-trained Wikipedia Word2vec model for formal text. For social media data, we employ Gensim [<xref ref-type="bibr" rid="ref-35">35</xref>] to convert a GloVe model pre-trained on Twitter data to Word2vec format. The proposed strategy uses these models to augment data by randomly selecting a word in a sentence and using cosine similarity over the distributed representations of words and phrases. When choosing a replacement, the cosine similarity serves as a relative weight for selecting the term that replaces the input word.</p>
<p>Algorithm 2 takes a string and an integer, where the string is the input data and the integer is the number of times to augment it. Word2vec has the advantage of producing more contextually connected vectors, meaning that words with similar meanings are represented similarly.</p>
<statement id="st2" content-type="algorithm">
<label>Algorithm 2:</label>
<title>Text Augmentation</title>
<p>1.&#x2002;&#x2002;Input: s (sentence), <italic>a</italic> (number)</p>
<p>2.&#x2002;&#x2002;Let <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:mi>V</mml:mi></mml:math></inline-formula> be a vocabulary</p>
<p>3.&#x2002;&#x2002;for <italic>i</italic> in range <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:mrow><mml:mo>(</mml:mo><mml:mi>a</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula></p>
<p>&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;&#x02022; randomly select a word from <inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:mi>s</mml:mi></mml:math></inline-formula></p>
<p>&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;&#x02022; find similar words to randomly selected word</p>
<p>&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;&#x02022; randomly select a word given weights as distance</p>
<p>&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;&#x2002;&#x02022; replace with similar word randomly selected before</p>
<p>&#x2002;&#x2002;&#x2002;&#x2002;end for</p>
<p>Output: a sentence with words replaced</p>
</statement>
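<p>Algorithm 2 can be sketched in Python as follows. This is a hedged illustration, not the authors&#x2019; implementation: the <monospace>SIMILAR</monospace> table is a toy stand-in for the pre-trained Word2vec/GloVe neighbours, with its values playing the role of the cosine similarities used as sampling weights:</p>

```python
import random

# Toy similarity table standing in for a pre-trained Word2vec/GloVe model;
# the numbers play the role of cosine similarities used as sampling weights.
SIMILAR = {
    "good": [("great", 0.9), ("fine", 0.7)],
    "movie": [("film", 0.95), ("picture", 0.6)],
}

def augment_sentence(sentence, a, seed=0):
    """Algorithm 2: produce `a` augmented copies of the sentence, each with
    one randomly chosen word replaced by a similarity-weighted neighbour."""
    rng = random.Random(seed)
    out = []
    for _ in range(a):
        words = sentence.split()
        # Positions whose word has known neighbours in the embedding model.
        idxs = [i for i, w in enumerate(words) if w in SIMILAR]
        if idxs:
            i = rng.choice(idxs)
            neighbours, weights = zip(*SIMILAR[words[i]])
            # Cosine similarity acts as a relative sampling weight.
            words[i] = rng.choices(neighbours, weights=weights)[0]
        out.append(" ".join(words))
    return out
```

In practice the lookup would come from a Gensim <monospace>KeyedVectors</monospace> model rather than a hand-written table; the control flow is unchanged.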
<p>The augmented data (image or text) is then fed into the RMDL, which performs the classification.</p>
</sec>
<sec id="s5"><label>5</label><title>Random Multi Model Deep Learning</title>
<p>Multiple random DL models [<xref ref-type="bibr" rid="ref-2">2</xref>], including DNN, RNN (or GRU), and CNN techniques, are used for text and image categorization. We first describe RMDL, followed by the three DL architectures (DNN, RNN, and CNN). The use of multiple optimizer methods in the various random models is then examined.</p>
<p>Multi-model random DL is a one-of-a-kind technique that trains a large number of DNNs, deep CNNs, and deep RNNs at the same time. The numbers of layers and nodes are generated randomly for all of these deep-learning multi-models (for instance, nine random models in RMDL consist of three CNNs, three RNNs, and three DNNs, all of them unique owing to the random creation).</p>
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:mi>M</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac></mml:mstyle><mml:mo>+</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mrow><mml:mo>&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msubsup><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac></mml:mstyle></mml:mrow><mml:mi>n</mml:mi></mml:mfrac></mml:mstyle><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula>
<p>where <italic>n</italic> is the number of random models, and <inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the output prediction of model <italic>j</italic> for data point <italic>i</italic>. The output space employs a majority vote for the final <inline-formula id="ieqn-55"><mml:math id="mml-ieqn-55"><mml:msub><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. Hence, <inline-formula id="ieqn-57"><mml:math id="mml-ieqn-57"><mml:msub><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> can be expressed as follows:</p>
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:msub><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mn>1</mml:mn><mml:mo>.</mml:mo></mml:mrow></mml:msub><mml:mo>&#x2026;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2026;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula>
<p>where <italic>n</italic> is the number of random models, and <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:msub><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> denotes the predicted label for the document or data point <inline-formula id="ieqn-37"><mml:math id="mml-ieqn-37"><mml:msub><mml:mi>D</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mi>&#x03F5;</mml:mi><mml:mrow><mml:mo>{</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula> in model <italic>j</italic>, and <inline-formula id="ieqn-38"><mml:math id="mml-ieqn-38"><mml:msub><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is defined as follows:</p>
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:msub><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mtext mathvariant="italic">softmax</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula>
<p>After all of RMDL&#x2019;s random models have been trained, the final prediction is made using the majority vote of these models.</p>
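<p>The majority-vote ensemble of Eqs. (7)&#x2013;(8) can be sketched as follows; this is a minimal NumPy illustration (our function name), where each model&#x2019;s raw scores are turned into labels via softmax and argmax, and the most frequent label per sample becomes the final prediction:</p>

```python
import numpy as np

def rmdl_predict(model_scores):
    """Majority vote over n random models (Eqs. (7)-(8)).

    model_scores: array of shape (n_models, n_samples, n_classes) holding
    each model's raw scores y_ij."""
    scores = np.asarray(model_scores, dtype=float)
    # Softmax over the class axis (Eq. (8)), shifted for stability.
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)
    labels = probs.argmax(axis=-1)              # (n_models, n_samples)
    n_classes = scores.shape[-1]
    # Count the votes each class receives per sample, then take the mode.
    votes = np.apply_along_axis(np.bincount, 0, labels, None, n_classes)
    return votes.argmax(axis=0)
```

For instance, if two of three models assign a sample to class 1, the ensemble output for that sample is class 1 regardless of the third model.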
<sec id="s5_1"><label>5.1</label><title>Deep Learning in RMDL</title>
<p>The RMDL model structure contains three basic DL techniques in parallel. In what follows, each individual model is described separately. The final model consists of <italic>d</italic> random DNN, RNN, and CNN models.</p>
</sec>
<sec id="s5_2"><label>5.2</label><title>Deep Neural Network</title>
<p>Multilayer connections are used to learn the structure of DNNs: in the hidden layers, each layer receives connections only from the preceding layer and grants connections only to the subsequent layer. The input layer links the feature space to the first hidden layer of every random model. The output layer has one node per class for multi-class classification, and a single output for binary classification.</p>
<p>In this paper, DNNs are trained several times for different purposes. In our proposed technique, multi-class DNN models are generated randomly; for instance, the number of nodes in each layer, and indeed the number of layers, are assigned entirely at random.</p>
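<p>The random generation of model configurations can be sketched as follows. This is an illustrative stand-in, not the authors&#x2019; code; the bounds on layers and nodes are assumptions for the example:</p>

```python
import random

def random_dnn_specs(d, max_layers=6, max_nodes=512, seed=42):
    """Generate d random DNN configurations: both the number of hidden
    layers and the nodes per layer are drawn at random, so every model in
    the ensemble is (almost surely) unique."""
    rng = random.Random(seed)
    specs = []
    for _ in range(d):
        n_layers = rng.randint(1, max_layers)
        specs.append([rng.randint(32, max_nodes) for _ in range(n_layers)])
    return specs
```

Each returned list gives the width of every hidden layer of one random DNN; nine such specs would correspond to the nine-model RMDL example above.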
<p>In our approach, DNNs are discriminatively trained models that use sigmoid <xref ref-type="disp-formula" rid="eqn-9">Eq. (9)</xref> and ReLU <xref ref-type="disp-formula" rid="eqn-10">Eq. (10)</xref> [<xref ref-type="bibr" rid="ref-36">36</xref>] as activation functions in a typical back-propagation algorithm. Finally, softmax <xref ref-type="disp-formula" rid="eqn-11">Eq. (11)</xref> is used in the output layer for multi-class classification.</p>
<disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:mi>f</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi>x</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:mfrac></mml:mstyle><mml:mi>&#x03F5;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>
<disp-formula id="eqn-10"><label>(10)</label><mml:math id="mml-eqn-10" display="block"><mml:mi>f</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>
<disp-formula id="eqn-11"><label>(11)</label><mml:math id="mml-eqn-11" display="block"><mml:mi>&#x03C3;</mml:mi><mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>z</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msup><mml:mrow><mml:msubsup><mml:mrow><mml:mo>&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msubsup><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:mfrac><mml:mi>j</mml:mi><mml:mi>&#x03F5;</mml:mi><mml:mrow><mml:mo>{</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>K</mml:mi><mml:mo>}</mml:mo></mml:mrow></mml:math></disp-formula>
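<p>Eqs. (9)&#x2013;(11) translate directly into NumPy; the stability shift in the softmax is a standard numerical precaution rather than part of the equations:</p>

```python
import numpy as np

def sigmoid(x):
    """Eq. (9): f(x) = 1 / (1 + e^{-x}), with values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """Eq. (10): f(x) = max(0, x)."""
    return np.maximum(0.0, x)

def softmax(z):
    """Eq. (11): class probabilities over K logits."""
    e = np.exp(z - np.max(z))   # shift by max(z) for numerical stability
    return e / e.sum()
```
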
<p>Mainly, the goal is to learn from a set of example pairs <italic>(x, y), x&#x03F5;X, y&#x03F5;Y,</italic> and target spaces using hidden layers. For text classification, on the other hand, the input is a string generated by text vectorization.</p>
</sec>
<sec id="s5_3"><label>5.3</label><title>Recurrent Neural Networks</title>
<p>Another neural network architecture that contributes to the RMDL model is the RNN. In practice, RNNs assign extra weight to earlier data points in a sequence, which makes them well suited to classifying text, strings, and sequential data; in this work the procedure is also used for image classification. An RNN considers information from previous nodes, which allows improved semantic analysis of dataset structures. At step <italic>t</italic>, this notion is formulated in <xref ref-type="disp-formula" rid="eqn-12">Eq. (12)</xref>, where <inline-formula id="ieqn-39"><mml:math id="mml-ieqn-39"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the state at time <italic>t</italic> and <inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:msub><mml:mi>u</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> denotes the input.</p>
<disp-formula id="eqn-12"><label>(12)</label><mml:math id="mml-eqn-12" display="block"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>F</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>
<p>More explicitly, weights can be used to formulate <xref ref-type="disp-formula" rid="eqn-12">Eq. (12)</xref> with specified parameters in <xref ref-type="disp-formula" rid="eqn-13">Eq. (13)</xref>:</p>
<disp-formula id="eqn-13"><label>(13)</label><mml:math id="mml-eqn-13" display="block"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>u</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>b</mml:mi></mml:math></disp-formula>
<p>where <inline-formula id="ieqn-41"><mml:math id="mml-ieqn-41"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> indicates the recurrent weight matrix, <inline-formula id="ieqn-42"><mml:math id="mml-ieqn-42"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> refers to the input weights, <italic>b</italic> is the bias, and <inline-formula id="ieqn-43"><mml:math id="mml-ieqn-43"><mml:mi>&#x03C3;</mml:mi></mml:math></inline-formula> denotes an element-wise function. Once more, the basic architecture of the RMDL model is modified. Despite the benefits of RNNs, they suffer from two major problems that arise when the gradient-descent error is back-propagated through the network: vanishing gradients and exploding gradients [<xref ref-type="bibr" rid="ref-37">37</xref>].</p>
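<p>One recurrence of Eq. (13) can be written out directly; in this sketch we take the element-wise function &#x03C3; to be tanh, which is our assumption since the paper only requires it to be element-wise:</p>

```python
import numpy as np

def rnn_step(x_prev, u_t, W_rec, W_in, b):
    """Eq. (13): x_t = W_rec sigma(x_{t-1}) + W_in u_t + b,
    with sigma chosen here as tanh."""
    return W_rec @ np.tanh(x_prev) + W_in @ u_t + b
```

Iterating this step over a sequence of inputs <italic>u</italic><sub><italic>t</italic></sub> unrolls the recurrent network.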
<p>In this paper, we use the gated hidden unit proposed by Cho et al. (2014a) for the RNN activation function <inline-formula id="ieqn-44"><mml:math id="mml-ieqn-44"><mml:msub><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mi>g</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. The gated hidden unit is an alternative to ordinary units such as an element-wise tanh. It is similar to the long short-term memory (LSTM) unit of Hochreiter and Schmidhuber (1997), which can learn and model long-term dependencies more effectively. This is achieved by creating computation paths within the unfolded RNN for which the product of derivatives is close to one; these paths permit gradients to flow backward easily without excessively suffering from the vanishing effect (Pascanu et al., 2013a). Consequently, LSTM units can be used as an alternative to the gated hidden unit described here, with some modification to the new state. The new RNN state <inline-formula id="ieqn-45"><mml:math id="mml-ieqn-45"><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> over <italic>n</italic> gated hidden units is calculated as follows:</p>
<disp-formula id="eqn-14"><label>(14)</label><mml:math id="mml-eqn-14" display="block"><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mi>g</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2218;</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2218;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>s</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></disp-formula>
<p>where <inline-formula id="ieqn-46"><mml:math id="mml-ieqn-46"><mml:mo>&#x2218;</mml:mo></mml:math></inline-formula> denotes element-wise multiplication and <inline-formula id="ieqn-47"><mml:math id="mml-ieqn-47"><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the output of the update gate. The candidate state <inline-formula id="ieqn-48"><mml:math id="mml-ieqn-48"><mml:msub><mml:mrow><mml:mover><mml:mi>s</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is computed by:</p>
<disp-formula id="eqn-15"><label>(15)</label><mml:math id="mml-eqn-15" display="block"><mml:msub><mml:mrow><mml:mover><mml:mi>s</mml:mi><mml:mo>&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>tanh</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>W</mml:mi><mml:mi>e</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mi>U</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2218;</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>]</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mi>C</mml:mi><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>
<p>where <inline-formula id="ieqn-49"><mml:math id="mml-ieqn-49"><mml:mi>e</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is an <italic>n</italic>-dimensional embedding of the word <inline-formula id="ieqn-50"><mml:math id="mml-ieqn-50"><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, and <inline-formula id="ieqn-51"><mml:math id="mml-ieqn-51"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the output of the reset gate. Since <inline-formula id="ieqn-52"><mml:math id="mml-ieqn-52"><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is represented as a <inline-formula id="ieqn-53"><mml:math id="mml-ieqn-53"><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>M</mml:mi></mml:math></inline-formula> vector, <inline-formula id="ieqn-54"><mml:math id="mml-ieqn-54"><mml:mi>e</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> is simply a column of the embedding matrix <inline-formula id="ieqn-55a"><mml:math id="mml-ieqn-55a"><mml:mi>E</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>M</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>. Bias terms are omitted to keep the equations uncluttered. 
The update gate <inline-formula id="ieqn-56"><mml:math id="mml-ieqn-56"><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> allows each hidden unit to maintain or overwrite its previous activation, while the reset gate <inline-formula id="ieqn-57a"><mml:math id="mml-ieqn-57a"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> controls how much information from the previous state should be reset. These gates are computed as follows:</p>
<disp-formula id="eqn-16"><label>(16)</label><mml:math id="mml-eqn-16" display="block"><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:msub><mml:mi>U</mml:mi><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>
<disp-formula id="eqn-17"><label>(17)</label><mml:math id="mml-eqn-17" display="block"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:msub><mml:mi>U</mml:mi><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>
<p>where <inline-formula id="ieqn-58"><mml:math id="mml-ieqn-58"><mml:mi>W</mml:mi><mml:mo>,</mml:mo><mml:mi>U</mml:mi></mml:math></inline-formula> represent parameter matrices and <inline-formula id="ieqn-59"><mml:math id="mml-ieqn-59"><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>.</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is a logistic sigmoid function.</p>
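<p>As a concrete illustration, Eqs. (14)&#x2013;(17) can be sketched as a single gated-unit step in NumPy. This is a minimal sketch: all weight matrices, the context vector, and the toy dimensions are random placeholders, not the trained parameters of the model.</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_step(s_prev, e_y_prev, c, W, U, Cc, Wz, Uz, Cz, Wr, Ur, Cr):
    """One gated-hidden-unit update following Eqs. (14)-(17)."""
    z = sigmoid(Wz @ e_y_prev + Uz @ s_prev + Cz @ c)            # update gate, Eq. (16)
    r = sigmoid(Wr @ e_y_prev + Ur @ s_prev + Cr @ c)            # reset gate, Eq. (17)
    s_tilde = np.tanh(W @ e_y_prev + U @ (r * s_prev) + Cc @ c)  # candidate state, Eq. (15)
    return (1.0 - z) * s_prev + z * s_tilde                      # new state, Eq. (14)

# toy sizes: n hidden units, m-dimensional embedding, p-dimensional context
rng = np.random.default_rng(0)
n, m, p = 4, 3, 2
mats = [rng.standard_normal(shape) for shape in [(n, m), (n, n), (n, p)] * 3]
s_new = gated_step(rng.standard_normal(n), rng.standard_normal(m),
                   rng.standard_normal(p), *mats)
```

<p>Because the new state is a convex combination of the previous state and a tanh-bounded candidate, its magnitude stays controlled, which is what keeps the derivative products along the unfolded paths close to one.</p>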
</sec>
<sec id="s5_4"><label>5.4</label><title>Convolutional Neural Networks (CNN)</title>
<p>The final DL technique contributing to RMDL is the CNN, developed for hierarchical document or image classification. A basic CNN convolves the image tensor with a set of kernels of size <italic>d &#x00D7; d</italic>. Stacking convolutional layers produces feature maps that apply different filters to the input. Pooling is used in CNNs to decrease computational complexity by reducing the output size from one layer to the next. We employ global average pooling, which reduces overfitting by lowering the number of parameters in the network. The stacked feature maps are flattened into columns so that their output can be fed into the next layer.</p>
<p>The final layers of a CNN are usually fully connected. During the backpropagation step of a CNN, not only the weights but also the feature-detector filters are adjusted. For text, however, the number of channels, which reflects the size of the feature space, is a major challenge: in practice, the CNN dimensionality for text is very high.</p>
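<p>The convolution and global-average-pooling steps described above can be sketched in plain NumPy; the 6 &#x00D7; 6 image and the two <italic>d &#x00D7; d</italic> filters below are illustrative stand-ins, not the network's learned kernels.</p>

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2-D convolution of a single-channel image with a d x d kernel."""
    d = kernel.shape[0]
    h, w = image.shape
    out = np.empty((h - d + 1, w - d + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + d, j:j + d] * kernel)
    return out

def global_average_pool(feature_maps):
    """Collapse each feature map to a single scalar, cutting the parameter count."""
    return np.array([fm.mean() for fm in feature_maps])

img = np.arange(36, dtype=float).reshape(6, 6)      # toy 6 x 6 "image"
kernels = [np.ones((3, 3)) / 9.0, np.eye(3) / 3.0]  # two toy 3 x 3 filters
maps = [conv2d_valid(img, k) for k in kernels]      # stacked feature maps
pooled = global_average_pool(maps)                  # one value per filter
```

<p>Each filter yields one feature map, and global average pooling replaces the whole map by its mean, so the next layer sees a fixed-size vector regardless of the map's spatial extent.</p>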
<sec id="s5_4_1"><label>5.4.1</label><title>Optimization</title>
<p>In this paper, we used two stochastic gradient optimizers in the implementation of the NNs: Adagrad and Adadelta. Adagrad [<xref ref-type="bibr" rid="ref-38">38</xref>] is a sub-gradient method that dynamically incorporates knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. In iteration <italic>k,</italic> define:
<disp-formula id="eqn-18"><label>(18)</label><mml:math id="mml-eqn-18" display="block"><mml:msup><mml:mi>G</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>k</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>a</mml:mi><mml:mi>g</mml:mi><mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:msubsup><mml:mrow><mml:mo>&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:msup><mml:mi>g</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>g</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mo>]</mml:mo></mml:mrow><mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac></mml:mrow></mml:msup></mml:math></disp-formula></p>
<p>Diagonal matrix:
<disp-formula id="eqn-19"><label>(19)</label><mml:math id="mml-eqn-19" display="block"><mml:msubsup><mml:mrow><mml:mi>G</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>k</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msqrt><mml:msubsup><mml:mrow><mml:mo>&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mrow><mml:msub><mml:mi>g</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:msqrt></mml:math></disp-formula></p>
<p>Update rule:
<disp-formula id="eqn-20"><label>(20)</label><mml:math id="mml-eqn-20" display="block"><mml:msup><mml:mi>x</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mi>g</mml:mi><mml:mi>m</mml:mi><mml:mi>i</mml:mi><mml:msub><mml:mi>n</mml:mi><mml:mrow><mml:mi>x</mml:mi><mml:mi>&#x03F5;</mml:mi><mml:mi>X</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>{</mml:mo><mml:mo>&lt;</mml:mo><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:mi>f</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>k</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mi>x</mml:mi><mml:mo>&gt;</mml:mo><mml:mo>+</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mn>2</mml:mn><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:mstyle><mml:mo fence="false" stretchy="false">&#x2016;</mml:mo><mml:mi>x</mml:mi><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>k</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:msubsup><mml:mrow><mml:mo fence="false" stretchy="false">&#x2016;</mml:mo></mml:mrow><mml:mrow><mml:msup><mml:mi>G</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>k</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>}</mml:mo></mml:mrow></mml:mstyle></mml:math></disp-formula>
<disp-formula id="eqn-21"><label>(21)</label><mml:math id="mml-eqn-21" display="block"><mml:mo>=</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>k</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:msup><mml:mi>B</mml:mi><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:mi>f</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>k</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mi>f</mml:mi><mml:mi>X</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>Adadelta was introduced by M. D. Zeiler [<xref ref-type="bibr" rid="ref-39">39</xref>] as an updated version of Adagrad. This method uses an exponentially decaying average of <italic>g<sub>t</sub></italic> as the second moment of the gradient, so it relies on first-order information only. For Adadelta, the update rule is as follows:
<disp-formula id="eqn-22"><label>(22)</label><mml:math id="mml-eqn-22" display="block"><mml:msub><mml:mi>g</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03B3;</mml:mi><mml:msub><mml:mi>g</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B3;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:mi>L</mml:mi><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></disp-formula>
<disp-formula id="eqn-23"><label>(23)</label><mml:math id="mml-eqn-23" display="block"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03B3;</mml:mi><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B3;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:msubsup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:math></disp-formula>
<disp-formula id="eqn-24"><label>(24)</label><mml:math id="mml-eqn-24" display="block"><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:msqrt><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>&#x03F5;</mml:mi></mml:msqrt><mml:msqrt><mml:msub><mml:mi>g</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>&#x03F5;</mml:mi></mml:msqrt></mml:mfrac></mml:mstyle><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:mi>L</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
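<p>To make the two update rules concrete, the following sketch applies Adagrad (Eqs. (18)&#x2013;(21)) and Adadelta (Eqs. (22)&#x2013;(24)) to a toy quadratic objective. The learning rate, decay factor &#x03B3;, and step counts here are illustrative choices, not the paper's training settings.</p>

```python
import numpy as np

def adagrad(grad_fn, x0, lr=0.5, eps=1e-8, steps=100):
    """Adagrad: scale each coordinate by the root of its accumulated squared gradients."""
    x, acc = x0.astype(float), np.zeros_like(x0, dtype=float)
    for _ in range(steps):
        g = grad_fn(x)
        acc += g * g                         # diagonal entries of G^(k), Eq. (19)
        x -= lr * g / (np.sqrt(acc) + eps)   # preconditioned step, cf. Eq. (21)
    return x

def adadelta(grad_fn, x0, gamma=0.95, eps=1e-6, steps=500):
    """Adadelta: first-order method with decaying averages of squared gradients/updates."""
    x = x0.astype(float)
    g_avg = np.zeros_like(x)   # decaying average of squared gradients, Eq. (22)
    x_avg = np.zeros_like(x)   # decaying average of squared updates, Eq. (23)
    for _ in range(steps):
        g = grad_fn(x)
        g_avg = gamma * g_avg + (1 - gamma) * g * g
        v = -np.sqrt(x_avg + eps) / np.sqrt(g_avg + eps) * g   # Eq. (24)
        x_avg = gamma * x_avg + (1 - gamma) * v * v
        x += v
    return x

# minimize f(x) = ||x||^2, whose gradient is 2x and whose minimum is the origin
grad = lambda x: 2.0 * x
x_adagrad = adagrad(grad, np.array([3.0, -2.0]))
x_adadelta = adadelta(grad, np.array([3.0, -2.0]))
```

<p>Both trajectories move toward the origin; note that Adadelta needs no hand-tuned global learning rate, which is one reason it can serve as a drop-in alternative optimizer within the random models.</p>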
</sec>
<sec id="s5_4_2"><label>5.4.2</label><title>Multi Optimization Rule</title>
<p>In practice, if a single optimizer does not fit a given dataset well, RMDL's multi-model design compensates: among its <italic>n</italic> random models, some may use different optimizers, so the <italic>k</italic> ineffective models can be disregarded as long as <italic>n&#x2009;&#x003E;&#x2009;k.</italic> In this paper, we used only two optimizers (Adagrad and Adadelta) for the model evaluation; however, the RMDL model can use any other optimizer.</p>
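<p>The intuition that the <italic>k</italic> poorly fitted models are outvoted when <italic>n &#x003E; k</italic> can be sketched with a toy majority vote; the model counts and label values below are hypothetical, chosen only for illustration.</p>

```python
import random
from collections import Counter

def majority_vote(predictions):
    """Per-sample majority vote across the n random models' predictions."""
    n_samples = len(predictions[0])
    return [Counter(model[i] for model in predictions).most_common(1)[0][0]
            for i in range(n_samples)]

random.seed(0)
truth = [i % 3 for i in range(30)]                              # 30 samples, 3 classes
good = [list(truth) for _ in range(5)]                          # 5 well-fitted models
bad = [[random.randrange(3) for _ in truth] for _ in range(2)]  # 2 ineffective models
voted = majority_vote(good + bad)                               # n = 7 > k = 2
```

<p>Since five of the seven models agree on every sample, the two ineffective models are always outvoted and the ensemble prediction matches the ground truth.</p>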
</sec>
</sec>
</sec>
<sec id="s6"><label>6</label><title>Experimental Results</title>
<p>This section provides comprehensive experiments on the efficacy of classification using data augmentation techniques. The results indicate the benefit of neural augmentation, more precisely augmentation using GANs. The GANs are trained on the MNIST [<xref ref-type="bibr" rid="ref-23">23</xref>] and CIFAR-10 [<xref ref-type="bibr" rid="ref-24">24</xref>] datasets; on CIFAR-10 and MNIST, they were trained for 200&#x2005;k and 140&#x2005;k iterations, respectively. At each iteration, the generator and discriminator losses are tracked. For training, the Adagrad and Adadelta optimizers are employed, which incorporate moving parameter averages, allowing for larger effective steps and hence faster convergence.</p>
<p>To improve performance and reduce training time, the learning rate should be reduced as training progresses. Accordingly, the learning rate follows an exponential decay, calculated as:</p>
<disp-formula id="eqn-25"><label>(25)</label><mml:math id="mml-eqn-25" display="block"><mml:mrow><mml:mtext mathvariant="italic">decayed_learning_rate</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mtext mathvariant="italic">learning_rate</mml:mtext></mml:mrow><mml:mo>&#x2217;</mml:mo><mml:msup><mml:mrow><mml:mtext mathvariant="italic">decay_rate</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mrow><mml:mtext mathvariant="italic">global_step</mml:mtext></mml:mrow><mml:mrow><mml:mtext mathvariant="italic">decay_steps</mml:mtext></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>)</mml:mo></mml:mrow></mml:msup></mml:math></disp-formula>
<p>where <italic>learning_rate</italic> is the initial learning rate, <italic>decay_rate</italic> is the rate of decay, <italic>global_step</italic> is the current iteration number, and <italic>decay_steps</italic> is the number of steps over which one decay period is applied.</p>
<p>In our model, the initial learning rate is set to 0.01, the number of decay steps to 20000, and the decay rate to 0.1. In both the generator and the discriminator, the batch size used is 128. To avoid overfitting, dropout is applied in the discriminator architecture; the dropout probability employed in our experiments is 0.05. The input to the generator is a latent-space vector of dimension 100.</p>
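<p>A minimal sketch of the decay schedule with these settings (initial rate 0.01, decay rate 0.1, 20000 decay steps); this assumes the standard exponential form in which the decay rate is raised to the power <italic>global_step/decay_steps</italic>.</p>

```python
def decayed_learning_rate(learning_rate, decay_rate, global_step, decay_steps):
    """Exponentially decayed learning rate as a function of the training step."""
    return learning_rate * decay_rate ** (global_step / decay_steps)

lr_start = decayed_learning_rate(0.01, 0.1, 0, 20000)      # start of training
lr_later = decayed_learning_rate(0.01, 0.1, 20000, 20000)  # after one decay period
```

<p>At step 0 the rate equals the initial 0.01, and after 20000 steps it has shrunk by a factor of 10.</p>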
<sec id="s6_1"><label>6.1</label><title>Evaluation</title>
<p>In this paper, the micro-averaged precision, recall, and F1-score are given as follows:</p>
<disp-formula id="eqn-26"><label>(26)</label><mml:math id="mml-eqn-26" display="block"><mml:mrow><mml:mtext mathvariant="italic">Precisio</mml:mtext></mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mrow><mml:mrow><mml:mtext mathvariant="italic">micro</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mrow><mml:msubsup><mml:mrow><mml:mo>&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>L</mml:mi></mml:mrow></mml:msubsup><mml:mi>T</mml:mi><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msubsup><mml:mrow><mml:mo>&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>L</mml:mi></mml:mrow></mml:msubsup><mml:mi>T</mml:mi><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:mstyle></mml:math></disp-formula>
<disp-formula id="eqn-27"><label>(27)</label><mml:math id="mml-eqn-27" display="block"><mml:mrow><mml:mtext mathvariant="italic">Recal</mml:mtext></mml:mrow><mml:msub><mml:mi>l</mml:mi><mml:mrow><mml:mrow><mml:mtext mathvariant="italic">micro</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mrow><mml:msubsup><mml:mrow><mml:mo>&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>L</mml:mi></mml:mrow></mml:msubsup><mml:mi>T</mml:mi><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msubsup><mml:mrow><mml:mo>&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>L</mml:mi></mml:mrow></mml:msubsup><mml:mi>T</mml:mi><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:msub><mml:mi>N</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:mstyle></mml:math></disp-formula>
<disp-formula id="eqn-28"><label>(28)</label><mml:math id="mml-eqn-28" display="block"><mml:mi>F</mml:mi><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>S</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mrow><mml:mtext mathvariant="italic">micro</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msubsup><mml:mrow><mml:mo>&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>L</mml:mi></mml:mrow></mml:msubsup><mml:mn>2</mml:mn><mml:mi>T</mml:mi><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msubsup><mml:mrow><mml:mo>&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>L</mml:mi></mml:mrow></mml:msubsup><mml:mn>2</mml:mn><mml:mi>T</mml:mi><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:msub><mml:mi>N</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:math></disp-formula>
<p>The performance of the proposed model is assessed in terms of F1-score and error rate. Formally, given a set of indices <inline-formula id="ieqn-60"><mml:math id="mml-ieqn-60"><mml:mi>I</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>, we define the <inline-formula id="ieqn-61"><mml:math id="mml-ieqn-61"><mml:msup><mml:mi>i</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> class as <inline-formula id="ieqn-62"><mml:math id="mml-ieqn-62"><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. We denote by <inline-formula id="ieqn-63"><mml:math id="mml-ieqn-63"><mml:mi>T</mml:mi><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-64"><mml:math id="mml-ieqn-64"><mml:mi>F</mml:mi><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-65"><mml:math id="mml-ieqn-65"><mml:mi>F</mml:mi><mml:msub><mml:mi>N</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, and <inline-formula id="ieqn-66"><mml:math id="mml-ieqn-66"><mml:mi>T</mml:mi><mml:msub><mml:mi>N</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> the true-positive, false-positive, false-negative, and true-negative counts of each class, respectively.</p>
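<p>Given per-class counts, the micro-averaged scores of Eqs. (26)&#x2013;(28) can be computed as follows; the three-class counts below are hypothetical values used only to exercise the formulas.</p>

```python
def micro_scores(tp, fp, fn):
    """Micro-averaged precision, recall, and F1 from per-class counts (Eqs. 26-28)."""
    TP, FP, FN = sum(tp), sum(fp), sum(fn)
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    f1 = 2 * TP / (2 * TP + FP + FN)
    return precision, recall, f1

# hypothetical counts for a 3-class problem
p, r, f1 = micro_scores(tp=[8, 6, 6], fp=[2, 1, 1], fn=[1, 2, 1])
```

<p>Micro-averaging pools the counts over all classes before forming the ratios, so classes with many samples weigh more heavily than they would under macro-averaging.</p>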
</sec>
<sec id="s6_2"><label>6.2</label><title>Experimental Setup</title>
<p>For testing and evaluating the performance of our approach, two types of datasets, one for text and the other for images, have been used. Nevertheless, the model is able to solve classification problems over different data types, such as video, images, and text.</p>
<sec id="s6_2_1"><label>6.2.1</label><title>Image Datasets</title>
<p>For all these experiments, we used the MNIST [<xref ref-type="bibr" rid="ref-23">23</xref>] and CIFAR-10 [<xref ref-type="bibr" rid="ref-24">24</xref>] datasets. We briefly describe the characteristics of the datasets before proceeding to the experiments. CIFAR-10 is a labeled subset of the 80 Million Tiny Images dataset. The data collection contains images from ten different classes: Airplane, Automobile, Bird, Cat, Deer, Dog, Frog, Horse, Ship, and Truck. The classes are mutually exclusive.</p>
<p>Each class has 6000 images in the CIFAR-10 dataset, which contains 60000 <italic>32 &#x00D7; 32</italic> color images. In practice, this dataset is split into two parts: training data with 50000 images and testing data with 10000 images. The MNIST dataset, by contrast, comprises images of handwritten digits; the MNIST database contains 70000 of them. The images are <italic>28 &#x00D7; 28</italic> pixels in size and are divided into ten categories; 60,000 images form the training data and 10,000 the testing data. Because all the images in both datasets are the same size, data pre-processing was minimal, apart from normalizing the images.</p>
</sec>
<sec id="s6_2_2"><label>6.2.2</label><title>Text Datasets</title>
<p>In practice, four different datasets have been used for text classification, namely WOS, Reuters, IMDB, and 20 Newsgroups. The Web of Science (WOS) [<xref ref-type="bibr" rid="ref-40">40</xref>] dataset, a collection of academic-article abstracts, consists of three corpora (5736, 11967, and 46985 documents) covering 11, 34, and 134 topics, respectively. The Reuters-21578 news dataset includes 10,788 documents, split into 7,769 documents for training and 3,019 for validation, across 90 classes in total.</p>
<p>The IMDB dataset comprises 50,000 highly polarized movie reviews, divided into a set of 25,000 reviews used for training and a set of 25,000 for validation. The 20 Newsgroups dataset contains 19,997 documents, whose maximum length is 1,000 words; 15,997 are used for training and 4,000 for testing.</p>
</sec>
<sec id="s6_2_3"><label>6.2.3</label><title>Framework</title>
<p>The suggested framework is implemented in Python using the parallel computing platform and Application Programming Interface (API) developed by Nvidia, known as the Compute Unified Device Architecture (CUDA). The <italic>TensorFlow</italic> and <italic>Keras</italic> libraries have been used for building the neural networks [<xref ref-type="bibr" rid="ref-41">41</xref>].</p>
</sec>
</sec>
<sec id="s6_3"><label>6.3</label><title>Empirical Results</title>
<p>In this paper, all reported results were executed on Central Processing Units (CPUs) and Graphical Processing Units (GPUs); correspondingly, RMDL can run on a GPU, a CPU, or both combined. The processor used in these experiments is an Intel <italic>Xeon E5-2640</italic> (<italic>2.6&#x2005;GHz</italic>) with <italic>12 cores</italic> and 64 GB of memory. Moreover, three graphics cards have been used: two <italic>NVidia GeForce GTX 1080 Ti</italic> cards and an <italic>NVidia Tesla K20c</italic>.</p>
<p>Before proceeding to the experimental results, training batches must be generated. In practice, we used only a subset of each dataset as training data. More specifically, the training data for CIFAR-10 and MNIST contains 5000 images, 500 belonging to each of the 10 classes, while the testing data of 10,000 images remains unchanged. The experimental results of RMDL are reported for image classification and text categorization.</p>
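<p>The class-balanced subsampling described above (e.g. 500 images from each of the 10 classes) can be sketched as follows; the label array and per-class count here are toy stand-ins for the actual datasets.</p>

```python
import numpy as np

def balanced_subset(labels, per_class, seed=0):
    """Return indices of a subset containing `per_class` samples from every class."""
    rng = np.random.default_rng(seed)
    picked = [rng.choice(np.flatnonzero(labels == c), size=per_class, replace=False)
              for c in np.unique(labels)]
    return np.concatenate(picked)

labels = np.repeat(np.arange(10), 100)       # toy labels: 100 samples per class
sel = balanced_subset(labels, per_class=30)  # draw 30 from each of the 10 classes
```

<p>Sampling without replacement per class guarantees that every class contributes the same number of training examples, mirroring the 500-per-class split used for CIFAR-10 and MNIST.</p>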
<sec id="s6_3_1"><label>6.3.1</label><title>Image Classification</title>
<p>The baselines used for image classification are Deep L2-SVM [<xref ref-type="bibr" rid="ref-42">42</xref>], PCANet-1 [<xref ref-type="bibr" rid="ref-43">43</xref>], gcForest [<xref ref-type="bibr" rid="ref-44">44</xref>] and RMDL without augmentation. <xref ref-type="table" rid="table-1">Table 1</xref> shows the RMDL error rate for image classification with neural augmentation. Comparing RMDL with augmentation against the baselines and against plain RMDL shows that the augmented error rate on the MNIST and CIFAR-10 datasets decreases for the different numbers of random models.</p>
<table-wrap id="table-1"><label>Table 1</label><caption><title>Comparison of error rate for image classification (MNIST and CIFAR-10 datasets)</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="center" colspan="2">Methods</th>
<th align="left">MNIST</th>
<th align="left">CIFAR-10</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Baseline</td>
<td align="left">Deep L2-SVM [<xref ref-type="bibr" rid="ref-42">42</xref>]</td>
<td align="left">0.86</td>
<td align="left">11.8</td>
</tr>
<tr><td/>
<td align="left">PCANet-1 [<xref ref-type="bibr" rid="ref-43">43</xref>]</td>
<td align="left">0.67</td>
<td align="left">21.31</td>
</tr>
<tr><td/>
<td align="left">gcForest [<xref ref-type="bibr" rid="ref-44">44</xref>]</td>
<td align="left">0.73</td>
<td align="left">30.00</td>
</tr>
<tr>
<td align="left">RMDL [<xref ref-type="bibr" rid="ref-2">2</xref>]</td>
<td align="left">3 RDLS</td>
<td align="left">0.56</td>
<td align="left">9.88</td>
</tr>
<tr><td/>
<td align="left">9 RDLS</td>
<td align="left">0.44</td>
<td align="left">9.0</td>
</tr>
<tr><td/>
<td align="left">15 RDLS</td>
<td align="left">0.21</td>
<td align="left">8.74</td>
</tr>
<tr><td/>
<td align="left">30 RDLS</td>
<td align="left">0.18</td>
<td align="left">8.79</td>
</tr>
<tr>
<td align="left">RMDL with augmentation</td>
<td align="left">3 RDLS</td>
<td align="left"><bold>0.45</bold></td>
<td align="left"><bold>8.21</bold></td>
</tr>
<tr><td/>
<td align="left">9 RDLS</td>
<td align="left"><bold>0.38</bold></td>
<td align="left"><bold>7.85</bold></td>
</tr>
<tr><td/>
<td align="left">15 RDLS</td>
<td align="left"><bold>0.18</bold></td>
<td align="left"><bold>7.23</bold></td>
</tr>
<tr><td/>
<td align="left">30 RDLS</td>
<td align="left"><bold>0.10</bold></td>
<td align="left"><bold>7.51</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s6_3_2"><label>6.3.2</label><title>Text Categorization</title>
<p>As <xref ref-type="table" rid="table-2">Table 2</xref> shows, RMDL with augmentation improves accuracy compared with the baselines. The empirical results in <xref ref-type="table" rid="table-2">Table 2</xref> are evaluated using four RMDL configurations (3, 9, 15, and 30 RDLs). The accuracy for Web of Science (WOS-5,736) increases to 92.43, 93.00, 94.56 and 94.98 respectively; for Web of Science (WOS-11,967) it improves to 93.85, 94.88, 93.18 and 94.45 respectively; and for Web of Science (WOS-46,985) it improves to 82.87, 85.62, 85.35 and 86.56 respectively. The accuracy on Reuters-21578 is 91.25, 93.56, 93.95 and 92.77 respectively. In short, accuracy has been improved on all datasets.</p>
<table-wrap id="table-2"><label>Table 2</label><caption><title>Text classification accuracy comparison, where W.1 stands for WOS-5,736, W.2 for WOS-11,967, W.3 for WOS-46,985, and R for Reuters-21578</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="center" colspan="2">Methods</th>
<th align="left">W.1</th>
<th align="left">W.2</th>
<th align="left">W.3</th>
<th align="left">R</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Baseline</td>
<td align="left">DNN</td>
<td align="left">86.14</td>
<td align="left">80.00</td>
<td align="left">66.89</td>
<td align="left">85.1</td>
</tr>
<tr><td/>
<td align="left">CNN</td>
<td align="left">88.66</td>
<td align="left">83.20</td>
<td align="left">70.45</td>
<td align="left">86.1</td>
</tr>
<tr><td/>
<td align="left">RNN</td>
<td align="left">88.98</td>
<td align="left">83.88</td>
<td align="left">72.11</td>
<td align="left">88.23</td>
</tr>
<tr>
<td align="left">RMDL [<xref ref-type="bibr" rid="ref-2">2</xref>]</td>
<td align="left">3 RDLS</td>
<td align="left">90.85</td>
<td align="left">87.32</td>
<td align="left">78.35</td>
<td align="left">88.98</td>
</tr>
<tr><td/>
<td align="left">9 RDLS</td>
<td align="left">92.59</td>
<td align="left">90.54</td>
<td align="left">81.91</td>
<td align="left">90.35</td>
</tr>
<tr><td/>
<td align="left">15 RDLS</td>
<td align="left">92.64</td>
<td align="left">90.59</td>
<td align="left">81.78</td>
<td align="left">89.88</td>
</tr>
<tr><td/>
<td align="left">30 RDLS</td>
<td align="left">92.55</td>
<td align="left">90.52</td>
<td align="left">82.68</td>
<td align="left">90.65</td>
</tr>
<tr>
<td align="left">RMDL with augmentation</td>
<td align="left">3 RDLS</td>
<td align="left"><bold>92.43</bold></td>
<td align="left"><bold>93.85</bold></td>
<td align="left"><bold>82.87</bold></td>
<td align="left"><bold>91.25</bold></td>
</tr>
<tr><td/>
<td align="left">9 RDLS</td>
<td align="left"><bold>93.00</bold></td>
<td align="left"><bold>94.88</bold></td>
<td align="left"><bold>85.62</bold></td>
<td align="left"><bold>93.56</bold></td>
</tr>
<tr><td/>
<td align="left">15 RDLS</td>
<td align="left"><bold>94.56</bold></td>
<td align="left"><bold>93.18</bold></td>
<td align="left"><bold>85.35</bold></td>
<td align="left"><bold>93.95</bold></td>
</tr>
<tr><td/>
<td align="left">30 RDLS</td>
<td align="left"><bold>94.98</bold></td>
<td align="left"><bold>94.45</bold></td>
<td align="left"><bold>86.56</bold></td>
<td align="left"><bold>92.77</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Results are also reported for two further datasets, the Large Movie Review Dataset (IMDB) and 20 NewsGroups. <xref ref-type="table" rid="table-3">Table 3</xref> shows how RMDL with augmentation improves accuracy on these two ground-truth datasets.</p>
<table-wrap id="table-3"><label>Table 3</label><caption><title>Comparison of accuracy for text classification on IMDB and 20 newsgroup datasets</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left" colspan="2">Methods</th>
<th align="left">IMDB</th>
<th align="left">20 NewsGroup</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Baseline</td>
<td align="left">DNN</td>
<td align="left">87.59</td>
<td align="left">86.42</td>
</tr>
<tr><td/>
<td align="left">CNN [<xref ref-type="bibr" rid="ref-45">45</xref>]</td>
<td align="left">87.32</td>
<td align="left">82.89</td>
</tr>
<tr><td/>
<td align="left">RNN [<xref ref-type="bibr" rid="ref-45">45</xref>]</td>
<td align="left">88.52</td>
<td align="left">83.72</td>
</tr>
<tr>
<td align="left">RMDL [<xref ref-type="bibr" rid="ref-2">2</xref>]</td>
<td align="left">3 RDLS</td>
<td align="left">88.90</td>
<td align="left">86.72</td>
</tr>
<tr><td/>
<td align="left">9 RDLS</td>
<td align="left">91.11</td>
<td align="left">87.61</td>
</tr>
<tr><td/>
<td align="left">15 RDLS</td>
<td align="left">90.89</td>
<td align="left">87.90</td>
</tr>
<tr>
<td align="left">RMDL with augmentation</td>
<td align="left">3 RDLS</td>
<td align="left"><bold>92.35</bold></td>
<td align="left"><bold>90.89</bold></td>
</tr>
<tr><td/>
<td align="left">9 RDLS</td>
<td align="left"><bold>94.21</bold></td>
<td align="left"><bold>92.75</bold></td>
</tr>
<tr><td/>
<td align="left">15 RDLS</td>
<td align="left"><bold>93.87</bold></td>
<td align="left"><bold>92.45</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The accuracy on the IMDB dataset is 92.35&#x0025;, 94.21&#x0025; and 93.87&#x0025; for 3, 9, and 15 RDLs respectively, whereas the accuracy of DNN is 87.59&#x0025;, CNN [<xref ref-type="bibr" rid="ref-45">45</xref>] is 87.32&#x0025;, and RNN [<xref ref-type="bibr" rid="ref-45">45</xref>] is 88.52&#x0025;. The accuracy on the 20 NewsGroup dataset is 90.89&#x0025;, 92.75&#x0025; and 92.45&#x0025; for 3, 9, and 15 random models respectively, whereas the accuracy of DNN is 86.42&#x0025;, CNN [<xref ref-type="bibr" rid="ref-45">45</xref>] is 82.89&#x0025;, and RNN [<xref ref-type="bibr" rid="ref-45">45</xref>] is 83.72&#x0025;.</p>
</sec>
</sec>
</sec>
<sec id="s7"><label>7</label><title>Conclusion and Discussion</title>
<p>In ML, to generalize the findings of a classification task we normally need to increase the amount of training data; when that data is scarce, the model tends to overfit and hence performs poorly on the test set. In this work, an alternative means of data augmentation using GANs has been introduced to overcome this challenge. In particular, the proposed framework exploits the convolutional architecture&#x2019;s ability to learn low-level features that capture the data. With these architectures, more meaningful and accurate data can be generated. For classification, we used the RMDL model, which addresses the problem of choosing the best DL technique. The results show that combining RMDL with a neural augmentation network improves classification of both images and text. The proposed approach is capable of improving model efficiency and accuracy and can also be applied to a wide variety of data types.</p>
</sec>
</body>
<back>
<fn-group>
<fn fn-type="other"><p><bold>Funding Statement:</bold> The researchers would like to thank the Deanship of Scientific Research, Qassim University for funding the publication of this project.</p></fn>
<fn fn-type="conflict"><p><bold>Conflicts of Interest:</bold> The authors declare that they have no conflicts of interest to report regarding the present study.</p></fn>
</fn-group>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Chung</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Gulcehre</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Cho</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Bengio</surname></string-name></person-group>, &#x201C;<article-title>Empirical evaluation of gated recurrent neural networks on sequence modeling</article-title>,&#x201D; arXiv preprint arXiv:1412.3555, <year>2014</year>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>K.</given-names> <surname>Kowsari</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Heidarysafa</surname></string-name>, <string-name><given-names>D. E.</given-names> <surname>Brown</surname></string-name>, <string-name><given-names>K. J.</given-names> <surname>Meimandi</surname></string-name> and <string-name><given-names>L. E.</given-names> <surname>Barnes</surname></string-name></person-group>, &#x201C;<article-title>RMDL: Random multimodel deep learning for classification</article-title>,&#x201D; in <conf-name>Proc. of the 2nd Int. Conf. on Information System and Data Mining</conf-name>, <conf-loc>NY, United States</conf-loc>, pp. <fpage>19</fpage>&#x2013;<lpage>28</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Robertson</surname></string-name></person-group>, &#x201C;<article-title>Understanding inverse document frequency: On theoretical arguments for IDF</article-title>,&#x201D; <source>Journal of Documentation</source>, vol. <volume>60</volume>, no. <issue>5</issue>, pp. <fpage>503</fpage>&#x2013;<lpage>520</lpage>, <year>2004</year>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>B.</given-names> <surname>Wang</surname></string-name> and <string-name><given-names>D.</given-names> <surname>Klabjan</surname></string-name></person-group>, &#x201C;<article-title>Regularization for unsupervised deep neural nets</article-title>,&#x201D; in <conf-name>Thirty-First AAAI Conf. on Artificial Intelligence</conf-name>, <conf-loc>San Francisco, California, USA</conf-loc>, pp. <fpage>2681</fpage>&#x2013;<lpage>2687</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Kubo</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Tucker</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Wiesler</surname></string-name></person-group>, &#x201C;<article-title>Compacting neural network classifiers via dropout training</article-title>,&#x201D; arXiv preprint arXiv:1611.06148, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Gal</surname></string-name> and <string-name><given-names>Z.</given-names> <surname>Ghahramani</surname></string-name></person-group>, &#x201C;<chapter-title>A theoretically grounded application of dropout in recurrent neural networks</chapter-title>,&#x201D; in <source>Proc. of the Advances in Neural Information Processing Systems</source>, <publisher-loc>California, USA</publisher-loc>, pp. <fpage>1019</fpage>&#x2013;<lpage>1027</lpage>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>N.</given-names> <surname>Srivastava</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Hinton</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Krizhevsky</surname></string-name>, <string-name><given-names>I.</given-names> <surname>Sutskever</surname></string-name> and <string-name><given-names>R.</given-names> <surname>Salakhutdinov</surname></string-name></person-group>, &#x201C;<article-title>Dropout: A simple way to prevent neural networks from overfitting</article-title>,&#x201D; <source>The Journal of Machine Learning Research</source>, vol. <volume>15</volume>, no. <issue>1</issue>, pp. <fpage>1929</fpage>&#x2013;<lpage>1958</lpage>, <year>2014</year>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Ioffe</surname></string-name> and <string-name><given-names>C.</given-names> <surname>Szegedy</surname></string-name></person-group>, &#x201C;<article-title>Batch normalization: Accelerating deep network training by reducing internal covariate shift</article-title>,&#x201D; in <conf-name>Int. Conf. on Machine Learning, PMLR</conf-name>, <conf-loc>Lillie, France</conf-loc>, pp. <fpage>448</fpage>&#x2013;<lpage>456</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Xiang</surname></string-name> and <string-name><given-names>H.</given-names> <surname>Li</surname></string-name></person-group>, &#x201C;<article-title>On the effects of batch and weight normalization in generative adversarial networks</article-title>,&#x201D; arXiv preprint arXiv:1704.03971, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>P. Y.</given-names> <surname>Simard</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Steinkras</surname></string-name> and <string-name><given-names>J. C.</given-names> <surname>Platt</surname></string-name></person-group>, &#x201C;<article-title>Best practices for convolutional neural networks applied to visual document analysis</article-title>,&#x201D; in <conf-name>Proc. of the Seventh Int. Conf. on Document Analysis and Recognition</conf-name>, <conf-loc>Edinburgh, UK</conf-loc>, pp. <fpage>958</fpage>&#x2013;<lpage>962</lpage>, <year>2003</year>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Lemley</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Bazrafkan</surname></string-name> and <string-name><given-names>P.</given-names> <surname>Corcoran</surname></string-name></person-group>, &#x201C;<article-title>Smart augmentation learning an optimal data augmentation strategy</article-title>,&#x201D; <source>IEEE Access</source>, vol. <volume>5</volume>, no.<issue>1</issue>, pp. <fpage>5858</fpage>&#x2013;<lpage>5869</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Wang</surname></string-name> and <string-name><given-names>L.</given-names> <surname>Perez</surname></string-name></person-group>, &#x201C;<article-title>The effectiveness of data augmentation in image classification using deep learning</article-title>,&#x201D; <source>Convolutional Neural Networks Vis. Recognit.</source>, vol. <volume>11</volume>, no. <issue>1</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>8</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>S. C.</given-names> <surname>Wong</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Gatt</surname></string-name>, <string-name><given-names>V.</given-names> <surname>Stamatescu</surname></string-name> and <string-name><given-names>M. D.</given-names> <surname>McDonnell</surname></string-name></person-group>, &#x201C;<article-title>Understanding data augmentation for classification: When to warp?</article-title>,&#x201D; in <conf-name>2016 Int. Conf. on Digital Image Computing: Techniques and Applications (DICTA)</conf-name>, <conf-loc>Gold Coast, Australia</conf-loc>, <publisher-name>IEEE</publisher-name>, pp. <fpage>1</fpage>&#x2013;<lpage>6</lpage>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>C.</given-names> <surname>Shorten</surname></string-name> and <string-name><given-names>T. M.</given-names> <surname>Khoshgoftaar</surname></string-name></person-group>, &#x201C;<article-title>A survey on Image Data Augmentation for Deep Learning</article-title>,&#x201D; <source>Journal of Big Data</source>, vol. <volume>6</volume>, pp. <fpage>1</fpage>&#x2013;<lpage>48</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>G.</given-names> <surname>Nagy</surname></string-name></person-group>, &#x201C;<article-title>Twenty years of document image analysis in pami</article-title>,&#x201D; <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>, vol. <volume>22</volume>, no. <issue>1</issue>, pp. <fpage>38</fpage>&#x2013;<lpage>62</lpage>, <year>2000</year>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>D. C.</given-names> <surname>Cire&#x015F;an</surname></string-name>, <string-name><given-names>U.</given-names> <surname>Meier</surname></string-name>, <string-name><given-names>L. M.</given-names> <surname>Gambardella</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Schmidhuber</surname></string-name></person-group>, &#x201C;<article-title>Deep, big, simple neural nets for handwritten digit recognition</article-title>,&#x201D; <source>Neural Computation</source>, vol. <volume>22</volume>, no. <issue>12</issue>, pp. <fpage>3207</fpage>&#x2013;<lpage>3220</lpage>, <year>2010</year>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>G.</given-names> <surname>Loosli</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Canu</surname></string-name> and <string-name><given-names>L.</given-names> <surname>Bottou</surname></string-name></person-group>, &#x201C;<article-title>Training invariant support vector machines using selective sampling</article-title>,&#x201D; <source>Large Scale Kernel Machines</source>, vol. <volume>2</volume>, no. <issue>1</issue>, pp. <fpage>301</fpage>&#x2013;<lpage>320</lpage>, <year>2007</year>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Perez</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Wang</surname></string-name></person-group>, &#x201C;<article-title>The effectiveness of data augmentation in image classification using deep learning</article-title>,&#x201D; arXiv preprint arXiv:1712.04621, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>E. D.</given-names> <surname>Cubuk</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Zoph</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Mane</surname></string-name>, <string-name><given-names>V.</given-names> <surname>Vasudevan</surname></string-name> and <string-name><given-names>Q. V.</given-names> <surname>Le</surname></string-name></person-group>, &#x201C;<article-title>Autoaugment: Learning augmentation policies from data</article-title>,&#x201D; arXiv preprint arXiv:1805.09501, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Gurumurthy</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Kiran Sarvadevabhatla</surname></string-name> and <string-name><given-names>R.</given-names> <surname>Venkatesh Babu</surname></string-name></person-group>, &#x201C;<article-title>Deligan: Generative adversarial networks for diverse and limited data</article-title>,&#x201D; in <conf-name>Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition</conf-name>, <conf-loc>Honolulu, HI, USA</conf-loc>, pp. <fpage>166</fpage>&#x2013;<lpage>174</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Marchesi</surname></string-name></person-group>, &#x201C;<article-title>Megapixel size image creation using generative adversarial networks</article-title>,&#x201D; arXiv preprint arXiv:1706.00082, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>T.</given-names> <surname>Salimans</surname></string-name>, <string-name><given-names>I.</given-names> <surname>Goodfellow</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Zaremba</surname></string-name>, <string-name><given-names>V.</given-names> <surname>Cheung</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Radford</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Improved techniques for training GANs</article-title>,&#x201D; <source>Advances in Neural Information Processing Systems</source>, vol. <volume>29</volume>, no. <issue>1</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>10</lpage>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Deng</surname></string-name></person-group>, &#x201C;<article-title>The mnist database of handwritten digit images for machine learning research [best of the web]</article-title>,&#x201D; <source>IEEE Signal Processing Magazine</source>, vol. <volume>29</volume>, no. <issue>6</issue>, pp. <fpage>141</fpage>&#x2013;<lpage>142</lpage>, <year>2012</year>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Krizhevsky</surname></string-name>, <string-name><given-names>V.</given-names> <surname>Nair</surname></string-name> and <string-name><given-names>G.</given-names> <surname>Hinton</surname></string-name></person-group>, &#x201C;<article-title>The CIFAR-10 dataset</article-title>,&#x201D; <year>2014</year>. [Online]. Available: <uri xlink:href="http://www.cs.toronto.edu/kriz/cifar.Html">http://www.cs.toronto.edu/kriz/cifar.Html</uri>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>X.</given-names> <surname>Liang</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Hu</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Gan</surname></string-name> and <string-name><given-names>E. P.</given-names> <surname>Xing</surname></string-name></person-group>, &#x201C;<article-title>Recurrent topic-transition gan for visual paragraph generation</article-title>,&#x201D; in <conf-name>Proc. of the IEEE Int. Conf. on Computer Vision</conf-name>, <conf-loc>Venice, Italy</conf-loc>, pp. <fpage>3362</fpage>&#x2013;<lpage>3371</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Yu</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Gong</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Zhong</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Shan</surname></string-name></person-group>, &#x201C;<article-title>Unsupervised representation learning with deep convolutional neural network for remote sensing images</article-title>,&#x201D; in <conf-name>Int. Conf. on Image and Graphics</conf-name>, <conf-loc>Shanghai, China</conf-loc>, <publisher-name>Springer</publisher-name>, pp. <fpage>97</fpage>&#x2013;<lpage>108</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Cohn</surname></string-name> and <string-name><given-names>T.</given-names> <surname>Baldwin</surname></string-name></person-group>, &#x201C;<article-title>Robust training under linguistic adversity</article-title>,&#x201D; in <conf-name>Proc. of the 15th Conf. of the European Chapter of the Association for Computational Linguistics, Short Papers</conf-name>, <conf-loc>Valencia, Spain</conf-loc>, vol. <volume>2</volume>, pp. <fpage>21</fpage>&#x2013;<lpage>27</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Kobayashi</surname></string-name></person-group>, &#x201C;<article-title>Contextual augmentation: Data augmentation by words with paradigmatic relations</article-title>,&#x201D; arXiv preprint arXiv:1805.06201, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>T.</given-names> <surname>Mikolov</surname></string-name>, <string-name><given-names>I.</given-names> <surname>Sutskever</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>G. S.</given-names> <surname>Corrado</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Dean</surname></string-name></person-group>, &#x201C;<article-title>Distributed representations of words and phrases and their compositionality</article-title>,&#x201D; <source>Advances in Neural Information Processing Systems</source>, vol. <volume>26</volume>, no. <issue>1</issue>, pp. <fpage>3111</fpage>&#x2013;<lpage>3119</lpage>, <year>2013</year>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Krishnan</surname></string-name> and <string-name><given-names>M.</given-names> <surname>Chen</surname></string-name></person-group>, &#x201C;<article-title>Identifying tweets with fake news</article-title>,&#x201D; in <conf-name>2018 IEEE Int. Conf. on Information Reuse and Integration (IRI)</conf-name>, <conf-loc>Utah, USA</conf-loc>, pp. <fpage>460</fpage>&#x2013;<lpage>464</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Sano</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Yamaguchi</surname></string-name> and <string-name><given-names>T.</given-names> <surname>Mine</surname></string-name></person-group>, &#x201C;<article-title>Automatic classification of complaint reports about city park</article-title>,&#x201D; <source>Information Engineering Express</source>, vol. <volume>1</volume>, no. <issue>4</issue>, pp. <fpage>119</fpage>&#x2013;<lpage>130</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Dundar</surname></string-name>, <string-name><given-names>Q.</given-names> <surname>Kou</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>He</surname></string-name> and <string-name><given-names>B.</given-names> <surname>Rajwa</surname></string-name></person-group>, &#x201C;<article-title>Simplicity of kmeans versus deepness of deep learning: A case of unsupervised feature learning with limited data</article-title>,&#x201D; in <conf-name>2015 IEEE 14th Int. Conf. on Machine Learning and Applications (ICMLA)</conf-name>, <conf-loc>Miami, Florida, USA</conf-loc>, pp. <fpage>883</fpage>&#x2013;<lpage>888</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Das</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Ganguly</surname></string-name> and <string-name><given-names>U.</given-names> <surname>Garain</surname></string-name></person-group>, &#x201C;<article-title>Named entity recognition with word embeddings and wikipedia categories for a low-resource language</article-title>,&#x201D; <source>ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP)</source>, vol. <volume>16</volume>, no. <issue>3</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>19</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Pennington</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Socher</surname></string-name> and <string-name><given-names>C. D.</given-names> <surname>Manning</surname></string-name></person-group>, &#x201C;<article-title>Glove: Global vectors for word representation</article-title>,&#x201D; in <conf-name>Proc. of the 2014 Conf. on Empirical Methods in Natural Language Processing (EMNLP)</conf-name>, <conf-loc>Doha, Qatar</conf-loc>, pp. <fpage>1532</fpage>&#x2013;<lpage>1543</lpage>, <year>2014</year>.</mixed-citation></ref>
<ref id="ref-35"><label>[35]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>R.</given-names> <surname>Rehurek</surname></string-name> and <string-name><given-names>P.</given-names> <surname>Sojka</surname></string-name></person-group>, &#x201C;<article-title>Software framework for topic modelling with large corpora</article-title>,&#x201D; in <conf-name>Proc. of the LREC 2010 Workshop on New Challenges for NLP Frameworks, Citeseer</conf-name>, <conf-loc>Valetta, Malta</conf-loc>, pp. <fpage>45</fpage>&#x2013;<lpage>50</lpage>, <year>2010</year>.</mixed-citation></ref>
<ref id="ref-36"><label>[36]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>V.</given-names> <surname>Nair</surname></string-name> and <string-name><given-names>G. E.</given-names> <surname>Hinton</surname></string-name></person-group>, &#x201C;<article-title>Rectified linear units improve restricted Boltzmann machines</article-title>,&#x201D; in <conf-name>Int. Conf. on Machine Learning</conf-name>, <conf-loc>Haifa, Israel</conf-loc>, pp. <fpage>807</fpage>&#x2013;<lpage>814</lpage>, <year>2010</year>.</mixed-citation></ref>
<ref id="ref-37"><label>[37]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Bengio</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Simard</surname></string-name> and <string-name><given-names>P.</given-names> <surname>Frasconi</surname></string-name></person-group>, &#x201C;<article-title>Learning long-term dependencies with gradient descent is difficult</article-title>,&#x201D; <source>IEEE Transactions on Neural Networks</source>, vol. <volume>5</volume>, no. <issue>2</issue>, pp. <fpage>157</fpage>&#x2013;<lpage>166</lpage>, <year>1994</year>.</mixed-citation></ref>
<ref id="ref-38"><label>[38]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Duchi</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Hazan</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Singer</surname></string-name></person-group>, &#x201C;<article-title>Adaptive subgradient methods for online learning and stochastic optimization</article-title>,&#x201D; <source>Journal of Machine Learning Research</source>, vol. <volume>12</volume>, no. <issue>7</issue>, pp. <fpage>2121</fpage>&#x2013;<lpage>2159</lpage>, <year>2011</year>.</mixed-citation></ref>
<ref id="ref-39"><label>[39]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>M. D.</given-names> <surname>Zeiler</surname></string-name></person-group>, &#x201C;<article-title>ADADELTA: An adaptive learning rate method</article-title>,&#x201D; arXiv preprint arXiv:1212.5701, <year>2012</year>.</mixed-citation></ref>
<ref id="ref-40"><label>[40]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>K.</given-names> <surname>Kowsari</surname></string-name>, <string-name><given-names>D. E.</given-names> <surname>Brown</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Heidarysafa</surname></string-name>, <string-name><given-names>K. J.</given-names> <surname>Meimandi</surname></string-name>, <string-name><given-names>M. S.</given-names> <surname>Gerber</surname></string-name> <etal>et al.,</etal></person-group> <source>Web of Science Dataset</source>, <year>2018</year>. <uri xlink:href="https://doi.org/10.17632/9rw3vkcfy4.6">https://doi.org/10.17632/9rw3vkcfy4.6</uri>.</mixed-citation></ref>
<ref id="ref-41"><label>[41]</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><given-names>F.</given-names> <surname>Chollet</surname></string-name></person-group>, &#x201C;<article-title>Keras: Deep learning library for Theano and TensorFlow</article-title>,&#x201D; <year>2015</year>. URL: <uri xlink:href="https://keras.io/k">https://keras.io/k</uri>.</mixed-citation></ref>
<ref id="ref-42"><label>[42]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Tang</surname></string-name></person-group>, &#x201C;<article-title>Deep learning using linear support vector machines</article-title>,&#x201D; arXiv preprint arXiv:1306.0239, <year>2013</year>.</mixed-citation></ref>
<ref id="ref-43"><label>[43]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>T. H.</given-names> <surname>Chan</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Jia</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Gao</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Lu</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Zeng</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>PCANet: A simple deep learning baseline for image classification?</article-title>,&#x201D; <source>IEEE Transactions on Image Processing</source>, vol. <volume>24</volume>, no. <issue>12</issue>, pp. <fpage>5017</fpage>&#x2013;<lpage>5032</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-44"><label>[44]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Z. H.</given-names> <surname>Zhou</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Feng</surname></string-name></person-group>, &#x201C;<article-title>Deep forest: Towards an alternative to deep neural networks</article-title>,&#x201D; in <conf-name>Proc. of the Twenty-Sixth Int. Joint Conf. on Artificial Intelligence</conf-name>, <conf-loc>Melbourne, Australia</conf-loc>, pp. <fpage>3553</fpage>&#x2013;<lpage>3559</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-45"><label>[45]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Z.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Dyer</surname></string-name>, <string-name><given-names>X.</given-names> <surname>He</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Smola</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Hierarchical attention networks for document classification</article-title>,&#x201D; in <conf-name>Proc. of the 2016 Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</conf-name>, <conf-loc>San Diego, California, USA</conf-loc>, pp. <fpage>1480</fpage>&#x2013;<lpage>1489</lpage>, <year>2016</year>.</mixed-citation></ref>
</ref-list>
</back>
</article>


