Multi-Tier Sentiment Analysis of Social Media Text Using Supervised Machine Learning


Introduction
In the last few decades, the number of users on social media has grown exponentially. Facebook has 1.28 billion active users and Twitter has 241 million active users [1]. These platforms serve as a hub for people to express their opinions. Thus, social media platforms provide a massive amount of data that reflects the behavior and sentiment of their users. Sentiment Analysis (SA) of this data can be beneficial in many domains. SA has been a focus of many researchers owing to its vast applications in industry [2]. It offers an automated way to mine the opinions and sentiments of a target audience. Organizations and companies can gain financial benefits by analyzing the sentiments of consumers. Governments can understand the public's perceptions and sentiments regarding their policies [3].
Common supervised ML models, namely Decision Tree, Support Vector Machine (SVM), and Naïve Bayes, have been implemented widely for different tasks in multiple domains, such as data privacy and healthcare security assessment [4,5].
The simplest approach to SA is to classify text into two classes, i.e., positive and negative. Binary ML classifiers yield high accuracy [6]. However, multi-class classification is more beneficial because it provides deeper insight into the sentiments, so decision-makers are better equipped to make accurate decisions. Aspect-based SA is another important tool that reveals the users' sentiment regarding a specific aspect of a particular entity. The problem arises when baseline ML algorithms fail to reach satisfactory accuracy on multi-class classification problems. Naïve Bayes and SVM are the most commonly used algorithms for SA [7,8], but they do not perform well on social media text. Performance degrades because social media text frequently contains misspellings, domain-specific slang, spam reviews, and negation. These issues make SA of social media a challenging task. Many researchers have worked to enhance the performance of multi-class classifiers, and various architectures have been proposed. However, owing to the complexity of social media text, the data embodies high-level features and is not linearly separable, so baseline ML models cannot fully capture and learn these features. More complex deep learning (DL) models, such as the Recurrent Neural Network (RNN), the Convolutional Neural Network (CNN), and Bidirectional Encoder Representations from Transformers (BERT), can perform better [9,10].
Besides model selection and the design of a suitable architecture, pre-processing of the textual data plays a key role in improving overall performance [11]. Thus, effective pre-processing of social media text has been a focus of many researchers. Baseline pre-processing of textual data involves several crucial steps: 1) stop-word elimination, 2) tokenization, 3) word normalization, and 4) vectorization [8]. The choice of language model, i.e., unigram, bigram, or trigram, that best fits the NLP task at hand also falls under the umbrella of pre-processing techniques.
In this study, a multi-tier model is proposed to improve the performance of multi-class sentiment classification. The proposed model has been implemented on the Sentiment Analysis on Movie Reviews data set [12]. Movie reviews are classified into 5 classes, i.e., strong positive, positive, neutral, negative, and strong negative. Before the training and evaluation of the model, pre-processing techniques are applied for the SA task. The pre-processing is implemented with the NLP library NLTK (Natural Language Toolkit). The TreeBank word tokenizer tokenizes the text data. The WordNet Lemmatizer performs text normalization. For the removal of stop words, the built-in list of stop words is used. Lastly, term frequency-inverse document frequency (TF-IDF) vectorization is performed to represent the textual data in a numerical vector space that can be used to train the ML model. The multi-tier model decomposes the multi-class classification into binary and ternary classification sub-tasks, on which ML models perform better. However, challenges such as negation, spam reviews, class imbalance, and spelling errors still degrade accuracy. The performance metrics of the multi-tier model are compared with those of the single-tier model. The performance metrics of the binary and ternary classifiers that constitute the multi-tier model are also analyzed. It can be noted that performance degrades significantly as the number of classes increases. The aim of this study is to identify whether a multi-tier architecture is useful for sentiment classification. If it is, it can be used for other sentiment classification tasks; otherwise, we identify its shortcomings and possible solutions. This study addresses the following research questions:
1. Does the multi-tier model gain more useful information than the single-tier model?
2. What are the shortcomings of the multi-tier model?
3. What are possible solutions to improve the performance of the multi-tier model?
The rest of the paper is organized as follows. Section 3 presents the related work done in the area of SA. Section 4 discusses the proposed methodology. Section 5 presents the results, Section 6 discusses them, and Section 7 concludes the paper.

Related Work
The computational study of people's opinions, sentiments, emotions, and attitudes towards entities such as products, services, issues, events, topics, and their attributes is known as SA [2,3]. SA is divided into five sub-tasks: subjectivity classification, sentiment classification, opinion spam detection, implicit language detection, and aspect extraction [8]. All these tasks operate at the sentence or document level, and the text is then passed to a machine learning, deep learning, lexicon-based, or hybrid approach for classification. A sentiment classification model needs to use certain features. Therefore, Chan et al. [6] proposed six categories of features, i.e., Terms and Frequency, Parts of Speech, Rules and Opinions, Sentiment Shifters, and Syntactic Dependency.

Feature Representation of Text
The most common representation for features in sentiment classification is Bag of Words (BOW), which represents a document in terms of the presence or absence of words [7]. However, presence or absence cannot represent the value or quality of the words. For this reason, the TF-IDF scheme is used to give more weight to the important words in the corpus. The drawback of this approach is that the size of the sparse feature vector grows with the document size. To overcome this limitation, one approach is to optimize the feature vector with the use of lexicon and machine learning methods [13]. Moreover, a genetic algorithm (GA) was implemented for feature reduction. The proposed pipeline comprises data cleaning, preprocessing, and an analysis engine. The fitness function is based on the lexical database SentiWordNet and calculates the distance in terms of polarity score.
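As a minimal illustration of the presence/absence representation described above, the following sketch builds binary bag-of-words vectors in plain Python. The toy corpus and the helper names `build_vocabulary` and `binary_bow` are illustrative assumptions and are not taken from the cited works.

```python
def build_vocabulary(documents):
    """Map each distinct lowercase token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def binary_bow(document, vocabulary):
    """Return a 0/1 vector marking which vocabulary words occur in the document."""
    present = set(document.lower().split())
    return [1 if tok in present else 0 for tok in vocabulary]


# Hypothetical toy corpus, for illustration only.
docs = ["good movie", "bad movie bad plot"]
vocab = build_vocabulary(docs)
vectors = [binary_bow(doc, vocab) for doc in docs]
```

Each vector has one entry per vocabulary word, so its length grows with the corpus while most entries stay zero. A repeated word such as "bad" still contributes only a single 1, which is why presence/absence alone cannot express how important a word is.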
The lexicon and TF-IDF may not accurately capture the semantics of the text. To incorporate semantics, researchers have used word embedding. Word embedding maps words that occur in similar contexts to similar dense vectors in a multi-dimensional space, where each dimension represents some aspect of the word. The authors in [14] used pre-trained word embeddings, e.g., Word2Vec and GloVe, to identify the sentiment. However, using pre-trained word embeddings has a few drawbacks. Firstly, a large corpus is required for an optimum representation of word vectors. Secondly, the context of the document is not considered, e.g., "beetle" can be the name of a car or an insect, but both have the same representation [14]. To address these issues, researchers have concatenated Word2Vec, Pos2Vec, Word-position2Vec (Wp2V), and Lexicon2Vec (L2V) for sentiment classification [15]. Pos2Vec is a Part of Speech (POS) tag vector providing syntactic information. Wp2V gives the position of a word within the sentence. L2V is the sentiment score provided by the lexicon. Experimental results showed a performance improvement.
Apart from this, [15] suggested that a domain-specific word vector performs better than a pre-trained vector. The authors studied the Ebola and Zika virus outbreaks. When such an event first occurs, the Twitter corpus is too scanty to train a vector. The solution to this problem was to use prior scholarly text to train the word vectors applied to the unstructured Twitter corpus.
Although ML models with complex word embeddings have shown good performance, they are limited in retaining the contextual information of long text. Target-Dependent BERT [10] was proposed for sentiment classification that takes the target into account.

Multiclass Classification
Generally, sentiment classification relies on three classes, i.e., negative, neutral, and positive. However, these classes may not depict the real-world scenario. Therefore, models based on a multi-tier architecture [6,16] were proposed. The authors in [6] aimed to perform classification efficiently and used Spark for this purpose. In their approach, they trained three models to identify the sentiment of tweets. The results showed that the linear Support Vector Machine (SVM) achieved better accuracy but took more time to classify unseen data. Similarly, [17] used multi-class classification of tweets. However, the authors proposed treating sentiment classification as an ordinal regression problem in which the labels are considered ordinal data. In this approach, a scoring and balancing methodology was applied for the classification of tweets. This approach presented considerably good results. However, some data were misclassified due to the removal of emoticons that conveyed relevant sentiment. In general, replacing emoticons with sentiment can mislead the classifier.
The ML-based approach requires hand-crafted feature engineering. Hand-crafted features may induce a bias in the model. Therefore, researchers are more driven toward deep learning methods. Reference [3] conducted a comparative analysis of the Recurrent Neural Network (RNN) and the Convolutional Neural Network (CNN) on pre-trained word embeddings. Based on their findings, the authors concluded that the combination of CNN and RNN results in better performance. It is important to note that a CNN cannot hold long-sequence semantics in the text. For this reason, a Dilated CNN was proposed, which has three parallel dilated CNN layers. The idea of the parallel layers is to reduce dimension and increase accuracy without losing information. Similarly, CNN and RNN were used together by [11] to make use of long-term dependencies in classification. In their approach, a CNN is applied to extract high-level features, which are then passed to an attention layer to calculate the attention score. Finally, these features and attention scores are passed to the RNN to learn long-term dependencies in the classification. Experimental results on standard datasets show a performance improvement.
Sentiment classification is highly domain-specific, i.e., the polarity can change across domains. Therefore, domain-specific sentiment analysis (DSSA-H) was proposed in [18]. Domain-specific tweets were retrieved using Random Forest. A Domain Adaptive Neural Network (DANN) was applied for sentiment classification. DANN embeds a domain-adversarial component, which relies less on the training data. The proposed method can be applied to other domains, making it more scalable.

Negation Detection
In sentiment classification, negation plays a key role as it may alter the sentiment expressed by the text. Thus, negation detection needs attention; otherwise, the classification may yield an incorrect sentiment. Saeed et al. [19] enhanced the performance of sentiment classification by negation handling, emoticon detection, and removal of spam reviews. The authors used the apriori algorithm to extract the features from the reviews. Scores were assigned to emoticons using an emoticon dictionary. Sentiments of the opinion words and emoticons were calculated, and the scores were then re-evaluated based on the presence of the word in the negation list. The data were then classified as spam or truthful by processing content properties, part of speech (POS) tags, sentiment score, and an enhanced bag of words (BOW). Spam data was eliminated and SA was applied only to truthful reviews. The results show a significant improvement compared with state-of-the-art techniques.
Negation detection is crucial for accurately classifying the sentiment [13]. Negation detection tasks can be performed using two approaches i.e., Rule-based and ML-based. In real-life scenarios, both methods may suffer to some extent. To address this issue, an optimal solution was proposed by the authors based on Reinforcement Learning (RL). The RL is based on a text processing function that evaluates the correlation between the document text and the gold standard assigned to it. Then the tone of the document is calculated to negate or inverse negate a word. Evaluation of the model shows promising results. The proposed methodology removed the bias caused by human error.
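To make the rule-based family of approaches mentioned above concrete, the sketch below marks every token that follows a negation cue with a `NOT_` prefix until the next punctuation mark. This is a widely used heuristic shown for illustration only; it is not the method of [13] or [19], and the cue list and function name are assumptions.

```python
import re

# Hypothetical negation cue list; real systems use larger curated lists.
NEGATION_CUES = {"not", "no", "never", "n't", "cannot"}
# A single punctuation token is assumed to close the negation scope.
CLAUSE_END = re.compile(r"^[.,;:!?]$")


def mark_negation(tokens):
    """Prefix tokens that follow a negation cue with 'NOT_' until the clause ends."""
    marked, in_scope = [], False
    for tok in tokens:
        if CLAUSE_END.match(tok):
            in_scope = False
            marked.append(tok)
        elif tok.lower() in NEGATION_CUES:
            in_scope = True
            marked.append(tok)
        elif in_scope:
            marked.append("NOT_" + tok)
        else:
            marked.append(tok)
    return marked


example = mark_negation(
    ["I", "did", "not", "like", "the", "plot", ",", "but", "the", "cast", "shines"]
)
```

After marking, "like" and "NOT_like" become distinct features, so a vectorizer such as TF-IDF can assign them different weights. Fixed rules of this kind cannot resolve scope in complex sentences, which is one way such methods can suffer in real-life text.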

Aspect Based SA
With the advent of deep learning, most approaches have used deep learning methods for sentiment classification. Forty deep learning models were analyzed by [20], comparing performance metrics of aspect term detection, aspect category detection, and sentiment polarity. The authors found that CNN, Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), and their variants are most frequently used.
Deep learning models learn their parameters and features automatically. The learned features may give more importance to certain aspects than to others. However, these aspects may in some cases not relate to the real world. To model practical scenarios, it is essential to explicitly give importance to certain aspects. This can be achieved by utilizing an attention mechanism. Coattention-MemNet and Coattention-LSTM networks based on a co-attention mechanism were proposed by [21]. The proposed methodology focused on target-level and context-level attention rather than assigning an average score. The context-level attention to target words has shown promising results.
Deep learning models like RNN, LSTM, and BERT have proven efficient in capturing the semantic relation between target and context words. The deriving of contextual and semantic relationships may require more computational power. Therefore, [22] proposed a lightweight Graph Convolutional Neural Network (GCNN) based DNet (Distillation Network). This model used fewer resources with a minute reduction in performance metrics. In comparison with BERT, DNet decreases the model size fifty times and increases the responsiveness by twenty-four times.
A lexicon-based approach that proved to be more context-oriented was incorporated by [23]. For the development of Aspect-Based Sentiment Analysis (ABSA), existing static lexicons were combined to improve accuracy. In the proposed Aspect-Based Frequency Based Sentiment Analysis (ABFBSA), the token distance was used to evaluate the label of the nearest aspect to a given word. A list of seven negation words was used to deal with negation. ABSA is a three-step process, i.e., identification, classification, and aggregation [24]. There are frequency-based and syntax-based methods for aspect detection. Considering that each method has its drawbacks, hybrid models were rated highly. For SA, one approach was to break the text into parts, evaluate their sentiments, and then merge them syntactically. However, aspect detection and SA can be performed jointly [25][26][27][28][29][30][31]. The future of ABSA is inclined toward more semantically rich and concept-centric approaches rather than word-based ones.
While performing ABSA, the biggest challenge is annotating the data with sentiment and aspect tags. It is a time-consuming and expensive task, and erroneous annotation can induce bias. W2VLDA was implemented by [26] to classify aspects into categories, separate aspect terms from opinion words, and perform SA for any given domain and language. The proposed methodology required a single seed word for each domain aspect and a single positive and negative word independent of the domain [32][33][34][35][36][37][38][39]. This made it minimally supervised. The tasks of aspect detection and SA were performed jointly. Varying the seed words of each aspect does not affect the accuracy of the sentiment [40][41][42][43][44][45][46].

Multi-Tier Architecture
A multi-tier architecture is proposed for multi-class SA. The multi-tier model hierarchically applies ternary and binary classifiers, each of which is a trained ML model. Fig. 1 shows the flow diagram of the proposed methodology. The multi-tier model comprises three models, M1, M2, and M3, which are trained simultaneously. Training data for M1 consists of positive, negative, and neutral text. Thus, M1 is trained to classify text into ternary classes. M2 and M3 are trained for binary classification. M2 is trained on positive and strong positive data. M3 is trained on negative and strong negative text. Each classifier was trained using Naïve Bayes, Decision Tree, and SVM [47][48][49][50][51][52][53].
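The decomposition into three training sets can be sketched as follows. The integer label encoding and the merging of the strong classes into their parent class for M1 are assumptions made for illustration, since the text only states that the label "2" is neutral; the function name `build_tier_datasets` is likewise hypothetical.

```python
# Assumed label encoding (only "2" = neutral is stated in the text):
# 0 = strong negative, 1 = negative, 2 = neutral, 3 = positive, 4 = strong positive.
TERNARY_OF = {0: "negative", 1: "negative", 2: "neutral", 3: "positive", 4: "positive"}


def build_tier_datasets(texts, labels):
    """Derive the training sets for M1, M2 and M3 from one 5-class dataset."""
    pairs = list(zip(texts, labels))
    # M1: every sample, with the five labels collapsed to three coarse classes.
    m1 = [(text, TERNARY_OF[label]) for text, label in pairs]
    # M2: only the positive side, keeping the fine-grained labels.
    m2 = [(text, label) for text, label in pairs if label in (3, 4)]
    # M3: only the negative side, keeping the fine-grained labels.
    m3 = [(text, label) for text, label in pairs if label in (0, 1)]
    return m1, m2, m3


# Hypothetical toy data, one sample per class.
toy_texts = ["awful", "weak", "okay", "nice", "superb"]
toy_labels = [0, 1, 2, 3, 4]
m1_data, m2_data, m3_data = build_tier_datasets(toy_texts, toy_labels)
```

Because the three training sets are derived independently from the same labelled data, the three models do not depend on each other's output during training, which is what allows them to be trained at the same time.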

Preprocessing Module
Effective preprocessing is a crucial first step for any ML algorithm. Since ML algorithms cannot operate directly on raw text, preprocessing is needed to convert textual data into numerical vectors. The preprocessing is carried out with the help of the NLTK library in Python [54][55][56][57][58][59][60][61][62]. The preprocessing module, shown in Fig. 2, works according to the following steps:
1. A list of English stop words present in the NLTK corpus was used to remove stop words.
2. Punctuation and special characters were removed.
3. After the data was cleaned, lemmatization was applied. This normalizes the words into their root words according to the dictionary.
4. A TF-IDF vectorizer was used to convert the textual data into vectors. TF-IDF is the product of term frequency and inverse document frequency.
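Steps 1 to 3 above can be sketched as a small function. So that the sketch runs without downloading NLTK data, the stop-word list and the lemmatizer are passed in as parameters and tiny hand-written stand-ins are used in the example; these stand-ins, the whitespace tokenization, and the function name `preprocess` are assumptions rather than the study's actual code. Punctuation is stripped before the stop-word lookup here so that a token such as "and," still matches the list.

```python
import string


def preprocess(text, stop_words, lemmatize):
    """Tokenize, drop punctuation and stop words, then lemmatize each token.

    `stop_words` is a set of lowercase words and `lemmatize` maps a token to
    its dictionary form. With NLTK these could be
    set(nltk.corpus.stopwords.words("english")) and
    nltk.stem.WordNetLemmatizer().lemmatize (both need the NLTK data files).
    """
    # Whitespace tokenization stands in for NLTK's Treebank tokenizer.
    tokens = text.lower().split()
    # Strip ASCII punctuation and special characters from every token.
    table = str.maketrans("", "", string.punctuation)
    tokens = [tok.translate(table) for tok in tokens]
    # Drop empty strings and stop words, then normalize to root forms.
    return [lemmatize(tok) for tok in tokens if tok and tok not in stop_words]


# Tiny stand-ins so the sketch runs without the NLTK corpora.
TOY_STOP_WORDS = {"the", "a", "is", "of", "and", "were"}
TOY_LEMMAS = {"movies": "movie", "actors": "actor"}

cleaned = preprocess(
    "The actors were great, and the movies... amazing!",
    TOY_STOP_WORDS,
    lambda tok: TOY_LEMMAS.get(tok, tok),
)
```

The resulting token lists would then be joined back into strings or passed directly to a vectorizer for step 4.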
Eqs. (1) and (2) show how TF-IDF is calculated, where n is the total number of documents and df(t) is the document frequency of t, i.e., the number of documents in the document set that contain the term t [11]. The idea behind TF-IDF is that frequently occurring words carry little information about sentiment. Thus, their contribution is scaled down and more weight is given to words that occur less frequently. Preprocessing was applied to the entire dataset, so the feature vector covers the words present in both the training and test data. The preprocessed data is then split into dev, train, and test sets according to the 10:70:20 ratio, respectively.

Fig. 2: Steps for pre-processing of textual data
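A standard formulation of TF-IDF that is consistent with the definitions of n and df(t) given above can be written as follows, where tf(t, d) denotes the number of times the term t occurs in document d. Whether the study used exactly this variant is an assumption; scikit-learn's `TfidfVectorizer`, for example, defaults to the smoothed form $\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1$.

```latex
\begin{align*}
\text{tf-idf}(t, d) &= \mathrm{tf}(t, d) \times \mathrm{idf}(t) \\
\mathrm{idf}(t) &= \log \frac{n}{\mathrm{df}(t)}
\end{align*}
```

As a worked instance under this formulation, a term that appears in every document has df(t) = n and hence idf(t) = log 1 = 0, so it contributes nothing to the feature vector, whereas a term found in only one of 100 documents receives idf(t) = log 100.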

Experimental Setup

Dataset
The performance of an ML model is highly data-driven. A labelled dataset is required to perform supervised ML. Assigning 5 levels of sentiment to social media text is a complex task. Labelling can be done either through human annotation or through ML. Human annotation, although accurate, is laborious. Thus, the annotated Sentiment Analysis on Movie Reviews data set [12] provides a practical solution to this problem. In this study, the implementation has been performed on this publicly available movie review dataset. Table 1 shows the distribution of movie reviews among the classes. Most of the reviews belong to the neutral class. The extreme classes, i.e., strong positive and strong negative, contain the fewest reviews. The data is not equally distributed among the labels. However, it is to be expected that the majority of reviews carry neutral sentiment. The issue arises when training the model with imbalanced data, because the ML model tends to be biased towards the majority class. Fig. 3 shows the histogram of the data distribution, created using the labels given in the dataset. As seen in Fig. 3, the neutral class, i.e., "2", is almost three times more frequent than the positive or negative category. This imbalance results in the model predicting the neutral category more often than the others. Fig. 4 displays the distribution of the number of words. Most of the phrases comprise 10-20 words. Fig. 5 shows how the sentiments are distributed among reviews comprising long phrases. Fig. 6 shows the 15 most frequent words. These frequent words are known as stop words and are part of the structure of the sentence. A word cloud pictorially represents the important words in the corpus. The size of a word depicts how frequently it appears in the corpus. Fig. 7 shows the word cloud of the dataset.
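The 10:70:20 dev/train/test split and a quick check of the class distribution can be sketched in plain Python as follows. The text does not state whether the split was shuffled or stratified, so the seeded random shuffle, the function name, and the toy label counts below are all assumptions.

```python
import random
from collections import Counter


def split_dev_train_test(samples, seed=0):
    """Shuffle samples and split them 10:70:20 into dev, train and test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_dev = len(shuffled) // 10
    n_train = (len(shuffled) * 7) // 10
    dev = shuffled[:n_dev]
    train = shuffled[n_dev:n_dev + n_train]
    test = shuffled[n_dev + n_train:]  # remainder, roughly 20%
    return dev, train, test


# Hypothetical labels mimicking a neutral-heavy distribution (not the real counts).
labels = [2] * 50 + [1] * 20 + [3] * 20 + [0] * 5 + [4] * 5
dev, train, test = split_dev_train_test(labels)
class_counts = Counter(train)  # per-class frequencies in the training portion
```

With a heavily imbalanced dataset, a purely random split can leave very few samples of the extreme classes in the dev set, so inspecting `class_counts` for each portion is a cheap sanity check before training.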

Implementation of Multi-Tier Model
Once the models are trained, the sentiment of text data can be inferred. The prediction phase involves predicting the label of unseen movie reviews. The prediction takes place in two steps. First, the ternary model M1 classifies the data into positive, negative, and neutral. In the second step, the data predicted as positive is fed into M2, which further categorizes it into strong positive and positive. The data predicted as negative is fed into M3, which further categorizes it into strong negative and negative. The workflow of the prediction phase is shown in Fig. 8. Model performance was quantified by evaluating precision, recall, F1-score, and accuracy. The implementation was performed with the help of the Scikit-Learn library [63][64][65][66].
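The two-step prediction described above can be sketched as a routing function. The sketch assumes only that each model exposes a scikit-learn style `predict` method that takes a list of samples and returns one label per sample; the string labels, the function name, and the stub models in the example are hypothetical.

```python
def predict_multi_tier(m1, m2, m3, samples):
    """Route each sample through M1, then refine with M2 or M3.

    m1, m2 and m3 are any objects with a predict(list_of_samples) method
    that returns one label per sample.
    """
    final = []
    for sample, coarse in zip(samples, m1.predict(samples)):
        if coarse == "positive":
            final.append(m2.predict([sample])[0])  # positive vs strong positive
        elif coarse == "negative":
            final.append(m3.predict([sample])[0])  # negative vs strong negative
        else:
            final.append("neutral")  # neutral needs no second tier
    return final


class ThresholdModel:
    """Stub classifier for illustration: labels a number with a given rule."""

    def __init__(self, rule):
        self.rule = rule

    def predict(self, samples):
        return [self.rule(x) for x in samples]


# Hypothetical stand-ins for trained models, operating on signed scores.
m1 = ThresholdModel(lambda x: "positive" if x > 0 else "negative" if x < 0 else "neutral")
m2 = ThresholdModel(lambda x: "strong positive" if x > 1 else "positive")
m3 = ThresholdModel(lambda x: "strong negative" if x < -1 else "negative")
predictions = predict_multi_tier(m1, m2, m3, [2, 1, 0, -1, -2])
```

One consequence of this design is that an error made by M1 cannot be corrected by the second tier: a truly positive review that M1 labels as negative is only ever seen by M3.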

Results
Three classifiers have been implemented, and the performance metrics of the multi-tier model, the single-tier model, and the individual classifiers have been evaluated. The performance of the individual models is improved. Table 2 compares the results of the multi-tier and single-tier models. Table 3 presents the results of the ternary classifier M1. Tables 4 and 5 present the results of the binary classifiers M2 and M3, respectively. The multi-tier architecture does not improve the performance significantly. The possible reasons for the ineffectiveness of the proposed model are discussed in the next section.

Discussion
The multi-tier model yields little benefit, although there is some improvement in accuracy compared with the base models. A few reasons for the ineffectiveness of the proposed model are put forth. First, the main drawback that leads to deterioration of the performance is that the feature vector size increases manifold as the data grows. Thus, it is quite likely that the ML model does not perform well on a sparse matrix. One possible solution is to apply the Principal Component Analysis (PCA) algorithm to reduce the number of features. Second, the data is highly imbalanced. It is to be expected that most human-written reviews are of neutral polarity. However, the imbalance of classes can cause a bias in the ML model, which affects accuracy. The individual models do not have a large class imbalance in their training data. Third, the ML models are unable to learn the high-level features of the textual data. For example, in this approach the ML model uses the TF-IDF score, which captures the importance of a word but not its semantics or its relationship with the subject and object. Therefore, the ML model may fail to capture the semantics, for example when a user phrases a comment differently. It is thus important to learn the structure of and relationships among words, so that the model relies not only on the wording but also on other high-level features. For this purpose, word embedding along with Deep Learning (DL) can be useful. Fourth, the textual data might contain sarcasm and negation, which alter the polarity of the text. Thus, negation handling and spam detection are desirable. Finally, sentiments are written by humans, and whenever a person writes a review, he or she links the subject and the object. A graph-based structure that models the user's reviews can therefore help identify the perspective or context of the user.
As shown in Table 2, the recall of the multi-tier model is mostly better than the recall of the single-tier architecture. This suggests that the proposed model is able to learn more of the overall context. However, the proposed model does not extract the semantic and syntactic properties of the text. To incorporate the syntax, word embedding can be applied. Various DL methods can learn the long-term properties of the text data. To keep the semantics of the text, GloVe can be applied.

Conclusion
Fine-grained SA has been an area of interest for researchers. However, more effort is being put into making SA more context-oriented and aspect-based. In this study, a multi-tier architecture has been proposed for the multi-class classification of sentiments in text. The classifiers were trained using three supervised ML algorithms, i.e., Naïve Bayes, Support Vector Machine, and Decision Tree. A comparison of the three algorithms showed that Naïve Bayes achieved the best accuracy. In future work, we aim to improve the performance by incorporating negation handling and spam review detection. Feature optimization can increase the efficiency of the model. Emoticons and POS tags can be concatenated with the feature vector to deliver better insight into the sentiment. Hence, these features can be made part of the preprocessing module to yield promising results in the future.