A Novel Auto-Annotation Technique for Aspect Level Sentiment Analysis

Abstract: In machine learning, sentiment analysis is a technique to find and analyze the sentiments hidden in text. Annotated data is a basic requirement for sentiment analysis. Generally, this data is annotated manually, which is a time-consuming, costly and laborious process. To overcome these resource constraints, this research proposes a fully automated annotation technique for aspect-level sentiment analysis. The dataset is created from the reviews of the ten most popular songs on YouTube. Reviews covering five aspects, namely voice, video, music, lyrics and song, are extracted, and an N-gram based technique is proposed. The complete dataset consists of 369,436 reviews that took 173.53 s to annotate using the proposed technique, whereas the same dataset would have taken approximately 2.07 million seconds (575 h) to annotate manually. To validate the proposed technique, one sub-dataset, Voice, is annotated both manually and with the proposed technique. Cohen's Kappa statistic is used to evaluate the degree of agreement between the two annotations. The high Kappa value (i.e., 0.9571) shows a high level of agreement between the two. This validates that the annotation quality of the proposed technique is as good as manual annotation at a far lower computational cost. This research also contributes by consolidating guidelines for the manual annotation process.


Introduction
In recent years, the internet has gained popularity and has become an eminent platform for socializing among users [1]. It has transformed the real world into a cyber-world [2]. Now almost everyone has easy access to hand-held devices with a reliable internet connection. Over the last few years, it has been witnessed that, due to the popularity of these handheld gadgets, massive data is being generated on a daily basis [3]. This bulk data is generated from diverse sources like social media platforms, e-commerce, games, etc. [4]. The generated data is both in structured and unstructured form [3]. Most of the unstructured data is produced by e-users (i.e., people) on social networks, online blogs and other forums that enable them to discuss various aspects of products or services. Using new technologies, most social networking or e-commerce websites allow users to express their experience regarding products, services and features [13]. These reviews can help in analyzing any product, service or company [14]. The exponential growth of reviews/comments can be witnessed due to the drastic increase in the number of e-users [15]. The internet has re-modelled the communication world as the backbone of a digital era [16]. The showbiz industry is now also using this paradigm to progress [17]. Due to its accessibility and innovativeness, people can now easily access entertainment content, watch it and give their feedback in the form of likes and reviews. Further, this content is judged by the likes, ratings and reviews on it [18]. A simple formula to check the quality of the content is:

CQ > 0 ? "good" : "bad"   (1)

where CQ = TL − TD, i.e., CQ quantifies the content quality using the metrics of total likes (TL) and total dislikes (TD).
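As a minimal illustrative sketch of Eq. (1) (the like/dislike counts below are hypothetical, not drawn from the dataset), the check can be written as:

def content_quality(total_likes: int, total_dislikes: int) -> str:
    # Eq. (1): CQ = TL - TD; a positive CQ is read as "good", otherwise "bad"
    cq = total_likes - total_dislikes
    return "good" if cq > 0 else "bad"

print(content_quality(1500, 300))  # good
print(content_quality(200, 450))   # bad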
This provides limited insight into the quality of the content. A better way to evaluate the quality of content is through the analysis of comments/reviews. If the count of these reviews is not very high, this goal can easily be achieved by reading and analyzing the reviews manually. However, it becomes humanly impossible to analyze these reviews when they are in huge amounts, which creates the need to analyze them through a proper automated channel. In this context, sentiment analysis (SA) is an important paradigm used to learn the general opinion towards the content [19]. SA is a way to categorize people's opinions towards an entity into positive, neutral or negative [20]. It can also be said that it is a way to classify the sentiments according to the class assigned by the reviewer [21].
As discussed above, a huge amount of unstructured data is being generated by e-users on a daily basis [22] in the form of comments and reviews. To analyze this data and mine the hidden patterns, the data is required to be in a structured form. Different preprocessing techniques [23] exist to overcome data anomalies and prepare the data for analysis. In data annotation, every entity of the dataset is assigned a label according to its subjectivity. Different semi-automated and automated tools exist to annotate different types of data content such as video [12], audio [24], image [11] and text [25].
The rest of the paper is organized into six sections. Section 2 focuses on the previous related research on annotation. Section 3 discusses the entire corpus generation process including the steps involved. Section 4 focuses on the N-gram based proposed technique of auto-annotation for the English text data. Section 5 discusses the experimental results obtained from the proposed technique. Finally, Section 6 concludes the paper.

Literature Review
State-of-the-art studies have presented different tools for annotation. These tools can be classified into two categories: annotation for image data and annotation for text data.

Annotation for Image Data
Bio-notate is a web-based annotation tool for biomedical annotation, used to annotate gene-disease associations and binary associations between proteins [26]. AlvisAE, reported in [27], is a semi-automated annotator used to annotate tasks and assign different rules based on expertise; it generates automatic annotations which can also be modified by the users. It is mostly used in biology and crop sciences. GATE Teamware [28] is a web-based, open-source, semi-automatic annotator which performs pre-annotation of fungal enzymes with the facility of manual correction.

Annotation for Text Data
Catma [29] is a web-based annotator which allows users to import text data by browsing documents as well as by entering the Uniform Resource Locator (URL) of a Hypertext Markup Language (HTML) document. It supports corpus creation and has the capability of automated document annotation as well as assigning manual tag sets. FLAT [30] is a web-based annotator which provides linguistic and semantic annotation using the FoLiA format to annotate biomedical documents. MAT [31] is an active-learning tool to annotate text by importing a file in Extensible Markup Language (XML); it exports the annotations in either XML or JavaScript Object Notation (JSON) format and is an offline application to annotate text. BRAT, reported in [32], is a web-based text annotation tool to support Natural Language Processing (NLP) tasks such as named entity recognition and part-of-speech tagging. BioQRator [33] is another web-based tool to annotate biomedical literature.
TeamTat [34] is an open-source web-based document annotation tool that annotates plain-text input as well as document input (BioC XML or XML); the output is a BioC XML inline-annotated document. Djangology [35] is a collaborative document annotator that annotates documents using web services. The document is imported as plain text and, after annotation, the annotated document is exported in plain-text format. In [36], geo-annotator is presented, a collaborative semi-automated platform for constructing geo-annotated text corpora. It is a semi-automatic web-based tool with collaborative visual analytics to resolve place references in natural language.
There also exist annotators for articles. Loomp [37] is a web-based tool for the annotation of articles. RDFa [38] is based on a general-purpose annotation framework to annotate news articles automatically. MyMiner [39] is a web-based annotation tool that can retrieve abstracts and create a corpus for annotation.
A plain document is imported to find binary relationships or to tag entities, and the output is also exported as plain text. WebAnno [40] provides full functionality for syntax- and semantics-based annotations. It allows a variety of formats for importing documents for annotation as well as for exporting the annotated document. PDFAnno [41] is an open-source PDF document annotator. The document is imported in PDF file format, and PDFAnno performs annotation and finds relationships between entities; it also provides the facility to annotate figures and tables. Tagtog [42] provides annotation at the entity level as well as the document level. It uses an active-learning approach to annotate retrieved abstracts or full texts. LightTag [43] is a commercial tool to annotate text and supports different languages. It can learn from active annotators using machine learning and annotate unseen text.
Different automated tools thus exist to annotate image data as well as text data. BRAT [32] performs intuitive annotation, named entity annotation and dependency annotation, whereas the ezTag [44] tool is used to annotate medical text data using lexicon-based tagging concepts. CAT [45] is a tool that annotates Ribonucleic Acid (RNA) sequences and clades and identifies orthology relationships. To the best of our knowledge, there hardly exists any tool to annotate English text (comments/reviews) for SA.
Recent studies like [46][47][48] have witnessed that researchers annotate text manually, and some of them use TextBlob [49][50][51][52]. Tools like PDFAnno [41], MyMiner [39] and BRAT [32] exist for text annotation, but no literature has reported text annotation for sentiment analysis at the aspect level. Manual annotation of reviews is a very hectic and time-consuming task [12]; e.g., this research found that, on average, 5.6 s are required to annotate one review.
This study presents a corpus of 369,436 reviews, annotated through the proposed N-gram based technique. If manual annotation had been performed, it might have taken 2.07 million seconds (574.68 h), i.e., approximately 24 days, to annotate. Manual text annotation is a bottleneck in NLP because it is very time-consuming [12]. To overcome this bottleneck, this study presents an automated annotation technique for English text at the aspect level based on N-grams. The technique is validated using Cohen's Kappa coefficient. After validation of the technique, the entire corpus is annotated at the aspect level using the proposed N-gram based technique.

Corpus Generation
A quality corpus needs systematic collection and thorough preprocessing, which can further be divided into three sub-tasks, namely data collection, preprocessing and data annotation. Details can be seen below:

Dataset Collection
Data is a vital part of any analysis; no analysis can be performed without data. To collect data and build a gold-standard dataset, the top ten songs are selected [50]. Details can be seen in Tab. 1.

Preprocessing
Data quality directly affects data analysis [18]. To separate the reviews carrying the targeted aspects and to obtain processed data, different preprocessing techniques are applied, such as aspect filtration, data integration, lowercasing, emoji removal and string-size standardization. Details are given below.

Aspect Filtration
In this study, five aspects/features (lyrics, music, song, video and voice) are targeted for auto-annotation of reviews. Reviews that contain these aspects are separated by applying filters and saved in CSV file format. A total of 4,886,406 reviews are scraped and, after aspect-level filtration, the number of obtained records is 369,436.
The preprocessing extracted 7,916 reviews for lyrics, 49,238 for music, 199,248 for song, 106,127 for video and 6,907 for voice. The dataset is now split across fifty data files (ten songs for each of the five aspects); details can be viewed in Fig. 1.
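A rough sketch of how such aspect filtration could be done with pandas is given below; the column name "review" and the file names are illustrative assumptions, not the exact files used in this study.

import pandas as pd

ASPECTS = ["lyrics", "music", "song", "video", "voice"]

# Assumed input: one CSV of scraped reviews per song, with a "review" column
reviews = pd.read_csv("scraped_reviews_song01.csv")

for aspect in ASPECTS:
    # keep only the reviews that mention the targeted aspect word
    mask = reviews["review"].str.contains(aspect, case=False, na=False)
    reviews[mask].to_csv(f"song01_{aspect}.csv", index=False)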

Data Integration
The data from the ten different files of one aspect are gathered into one file. As this study covers five different aspects, this results in five files overall, one for each aspect, and we call each of them a sub-dataset. Each sub-dataset is named after its aspect, e.g., sub-dataset-Voice.

Lowercasing
Although letter case has no special impact on the analysis of the data, presenting the same data in different cases has adverse effects [48]; e.g., algorithms will consider "Yes" and "yes" as two different values. Therefore, to overcome this effect, the whole dataset is converted into lowercase.

Noise Removal
It has been reported, time and again, that noise directly affects classification results [51]. It was noted that the collected data contained a lot of noise such as white spaces, special characters, punctuation signs, etc., which have nothing to do with the analysis. To improve the quality of the data, all these characters were removed.

Number Removal
The dataset contained numbers as well as English text, while this study analyzes only English text. The extra data increases the computational cost and can also skew the results [49]. To address these problems, all numbers are removed.

Emoji Removal
Emojis are a popular way to show one's feelings and are widely used by e-users to express their feelings towards an entity; they leave their sentiments using emojis [52]. This study focuses only on text; therefore, emojis are removed from the dataset.

Trim String Size
In the dataset, there were several reviews of extraordinary length. For example, in sub-dataset-Lyrics a review had 11,487 tokens, and in sub-dataset-Music there was a review with 9,914 tokens. Such lengthy reviews are outliers and have a bad impact on classification [53]. To overcome this issue, length standardization is applied. To improve the quality of the data while keeping data loss to a minimum, the maximum length is defined as 150 tokens for all sub-datasets except lyrics (due to its very small number of reviews).
To resolve this issue, the string size of sub-dataset-Lyrics is trimmed to 300 tokens, which covers 77.68% of its data, while for the rest of the sub-datasets the maximum string size remains 150 tokens. The rest of the details can be seen in Tab. 2.
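The preprocessing steps described above (lowercasing, noise removal, number removal, emoji removal and length trimming) can be sketched roughly as follows; the regular expressions and the helper name clean_review are illustrative assumptions rather than the exact implementation used in this study.

import re

MAX_TOKENS = 150  # 300 for sub-dataset-Lyrics

def clean_review(text: str, max_tokens: int = MAX_TOKENS) -> str:
    text = text.lower()                          # lowercasing
    text = re.sub(r"[^\x00-\x7F]+", " ", text)   # drop emojis and other non-ASCII symbols
    text = re.sub(r"\d+", " ", text)             # remove numbers
    text = re.sub(r"[^a-z\s]", " ", text)        # remove punctuation and special characters
    text = re.sub(r"\s+", " ", text).strip()     # collapse extra white space
    return " ".join(text.split()[:max_tokens])   # trim string size

print(clean_review("His VOICE is 100% amazing!!!"))  # his voice is amazing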

Data Annotation
Data annotation is the process of categorizing text (e.g., an instance, review or comment) into positive, neutral or negative based upon its subjectivity. Previous studies show different ways to annotate data, i.e., auto-annotation, semi-automated annotation and manual annotation. Many automated tools have been reported for annotating image and video data.
Very few tools exist to annotate text data and, in particular, no automated tool exists to annotate English text for sentiment analysis. Generally, manual annotation is used to label text data. Details can be seen in the subsequent section.

Manual Annotation
In manual annotation, each review is labelled according to its subjectivity: each review is read one by one and assigned a class (positive, neutral or negative) according to its behaviour. Following the process explained in [54], the manual annotation process was divided into four steps.
In step I, the guidelines for manual annotation are prepared. In step II, three volunteers were contacted to annotate the data; to start with, they were given basic training based upon the guidelines. In step III, the dataset was given to the annotators for annotation, and conflicts were resolved using an inter-annotator agreement. Finally, in step IV, the computational value of the inter-annotator agreement was calculated using Kappa statistics. The details of all steps are given below.

Annotation Guidelines Preparation
Guidelines for each class (positive, negative and neutral) presented in different research works [55][56][57] are mapped onto the current problem. Details are given below:

Guidelines for Positive Class
A review will be assigned "Positive"
• If it shows positive sentiments [53].
• If its behaviour is both neutral and positive [53,58].
• If there exists some positive word(s) in the sentence [59] e.g., good, beautiful, etc.
• If there exist illocutionary speech acts like wow, congrats and smash, which are classified as positive [60].
Examples: In "best music yet" the word "best" clearly shows the positive polarity of the aspect-music. In another review "his voice it's so soft and cool," the behaviour is positive towards aspect-voice.

Guidelines for Negative Class
A review will be assigned "Negative"
• If it shows negative sentiments [55].
• If the use of language is abusive [1].
• If there exists negation in a review, e.g., not good.
Examples: In "music is trash," the word trash expresses the polarity i.e., negative of the review for the aspect-music. In, "this is the stupid voice," the word "stupid" shows negative sentiments of the reviewer on aspect-voice.

Guidelines for Neutral Class
A review will be assigned "Neutral"
• If it is not showing any positive or negative sentiments [56].
• If a review has a piece of realistic information [57].
• If a review has both positive and negative sentiments [61].
Examples: In "that music going too far away" the subjectivity of the review isn't clear so it will be annotated as neutral. In, "the video is neither good nor bad," in the review both sentiments are present so it will be annotated as neutral.

Training of Annotators
For the manual annotation, the help of three volunteers was sought; let us call them A, B and C. The volunteers were graduates, well familiar with reviews and the concept of annotation, and had a good grip on the English language. A three-hour hands-on training session was conducted to explain the guidelines and discuss possible issues with them.

Conflict Resolution
For conflict resolution, a short sample dataset of 100 reviews was created; let us call it SSD100. SSD100 was given to the first two annotators, annotator A and annotator B. Once they completed the annotation, a short meeting was arranged to resolve the conflicts by involving the third annotator as well. After conflict resolution, SSD100 was given to annotator C for annotation.

Problems Faced During Manual Annotation
Although the volunteers were very cooperative, the process still faced a few problems, as listed below: (i) training, (ii) individual perception, (iii) clash removal, (iv) confidence level. Even after 3 h of hands-on practice, annotators were still consulting the trainers for the resolution of issues. Individual perception was a big issue: 5.33% of the manual annotations done by the three annotators were still updated by the trainer. During the annotation process, the annotators were also asked to mention their confidence level (1-10) regarding each annotated label; the average confidence was 90.50%. It took almost 6 h to annotate 3,700 reviews, with an average of 5.6 s per review. This showed that manual annotation, even with qualified and trained annotators, is not perfect, and unseen and unreported lapses always remain. A sample of annotated data is shown in Tab. 3.

Proposed Technique for Auto-Annotation
To overcome the hectic and time-consuming process of manual annotation, this study presents a new, fully automated technique for text annotation (at the aspect level) based upon the N-gram language modelling technique.

N-Gram
Models that assign probabilities to sequences of words are called language models (LMs). The N-gram model is one of the simplest models that assign probabilities to sentences and sequences of words. An N-gram is a sequence of N words. For example, in the sentence "Best song ever justin...", a 2-gram (or bigram) is a two-word sequence like "Best song," "song ever," or "ever justin," and a 3-gram (or trigram) is a three-word sequence like "Best song ever" or "song ever justin." An N-gram model estimates the probability of the last word of an N-gram given the previous words, and also assigns probabilities to entire sequences; thus the term N-gram is used to mean either the word sequence itself or the predictive model that assigns it a probability.
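As a small illustrative sketch (not the implementation used in this paper), the bigrams and trigrams of the example sentence can be enumerated as follows:

def ngrams(tokens, n):
    # all contiguous n-word sequences of the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "best song ever justin".split()
print(ngrams(tokens, 2))  # [('best', 'song'), ('song', 'ever'), ('ever', 'justin')]
print(ngrams(tokens, 3))  # [('best', 'song', 'ever'), ('song', 'ever', 'justin')]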
For the joint probability of each word in a sequence having a particular value, P(W = w_1, X = w_2, Y = w_3, ..., Z = w_n), we will write P(w_1, w_2, w_3, ..., w_n).
The chain rule is then applied to the words, where w_1^n denotes w_1, w_2, w_3, ..., w_n and P(w_x | w_1^y) is the conditional probability of the occurrence of w_x given the occurrence of w_1, w_2, w_3, ..., w_y.
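Written out, the standard chain-rule expansion referred to above (a textbook identity, stated here for completeness) is:

P(w_1^n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1^2) ... P(w_n | w_1^{n−1}) = ∏_{k=1}^{n} P(w_k | w_1^{k−1})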

Re-Definition of Ps
This research redefines the P's to solve the problem at hand. We need to find the polarity of a text having n words, i.e., w_1, w_2, w_3, ..., w_n; we define it as P(w_1^n). The polarity is counted only when the behaviour of a word is expressed with respect to the aspect, so the polarity is checked on pairs of words, one of which is supposed to be the aspect. P(w_x | w_y) defines the two-word polarity, where w_x is the aspect and w_y can be any word. If w_y is positive then the polarity of this two-word combination is positive; if w_y is negative then the polarity of the combination is negative; otherwise it is neutral.
The polarity of the two-word combination is required to be checked for all occurrences of w_y, where 1 ≤ y ≤ n and y ≠ x. To find the aspect among all words of the text, w_x is varied from 1 to n. Hence we have Eq. (2) to express all of this:

P(w_1^n) = ∏_{x=1}^{n} ∏_{y=1, y≠x}^{n} P(w_x | w_y)   (2)

where w_1^n is w_1, w_2, w_3, ..., w_n; w_x is the aspect; P(w_x) is the occurrence of the aspect; and P(w_x | w_y) is the polarity of the occurrence of the aspect, w_x, given the occurrence of w_y. We define the value of P as in Eq. (8):

P(w_a | w_{a−1}) =
  p, if w_a = Aspect and w_{a−1} ∈ Bag_p
  n, if w_a = Aspect and w_{a−1} ∈ Bag_n
  p, if w_a ≠ Aspect and w_a, w_{a−1} ∈ Bag_p
  n, if w_a ≠ Aspect and w_a ∈ Bag_p and w_{a−1} ∈ Bag_n
  n, if w_a ≠ Aspect and w_a or w_{a−1} ∈ Bag_n
  1, otherwise   (8)

where Bag_p and Bag_n are the bag of positive words and the bag of negative words, respectively (lists of such words can easily be found online, e.g., on GitHub, and can then be updated according to the tokens).
For a complete sentence, this results in an expression of the form p^x n^y. Irrespective of the powers, p^x is assigned the value 1 and n^y the value −1:

P(w_n | w_{n−1}) = p^x n^y = { 1, label = positive; −1, label = negative }   (10)

For example, if the value of P(w_n | w_{n−1}) is p^3 n^2, we assign p^3 = 1 and n^2 = −1.
In this way, the nearest words are associated with the targeted aspect and, after computing this value, a label is assigned. To aid understanding, Eq. (8) is explained condition by condition through the examples below:

Example: p, if w_a = Aspect and w_{a−1} ∈ Bag_p
The value p will be assigned to P(w_a | w_{a−1}) if w_a is an aspect and w_{a−1} is a word that belongs to the bag of positive words. E.g., in "justin bieber you have amazing voice," checking the value of P(w_a | w_{a−1}) at the highlighted words gives p, as w_a = voice, which is an aspect, and w_{a−1} = amazing, which belongs to the bag of positive words.

Example: n, if w_a = Aspect and w_{a−1} ∈ Bag_n
The value n will be assigned to P(w_a | w_{a−1}) if w_a is an aspect and w_{a−1} is a word that belongs to the bag of negative words. E.g., in "justin bieber you have annoying voice," checking the value of P(w_a | w_{a−1}) at the highlighted words gives n, as w_a = voice, which is an aspect, and w_{a−1} = annoying, which belongs to the bag of negative words.

Example: p, if w_a ≠ Aspect and w_a, w_{a−1} ∈ Bag_p
The value p will be assigned to P(w_a | w_{a−1}) if w_a is not an aspect and both w_a and w_{a−1} belong to the bag of positive words. E.g., in "love justin bieber voice look is so amazing," checking the value of P(w_a | w_{a−1}) at the highlighted words gives p, as w_a is not an aspect and both w_a and w_{a−1} belong to the bag of positive words.
Example: n, if w_a ≠ Aspect and w_a ∈ Bag_p and w_{a−1} ∈ Bag_n
The value n will be assigned to P(w_a | w_{a−1}) if w_a is not an aspect, w_a belongs to the bag of positive words and w_{a−1} belongs to the bag of negative words. E.g., in "sad to say but justin biebers voice is lighter not good as baby old justin bieber," checking the highlighted pair "not good" gives n, as w_a = good belongs to the bag of positive words while w_{a−1} = not belongs to the bag of negative words.

Example: n, if w_a ≠ Aspect and w_a or w_{a−1} ∈ Bag_n
The value n will be assigned to P(w_a | w_{a−1}) if w_a is not an aspect and w_a or w_{a−1} belongs to the bag of negative words. E.g., in "three billion viewers of justin hates i want to hear his voice," checking the value of P(w_a | w_{a−1}) at the highlighted word "hates" gives n, as w_a is not an aspect and belongs to the bag of negative words.

Example: 1, &otherwise
If the review not having any word that belongs to the bag of positive nor from the bag of negative words, the value 1 is assigned to that type of reviews which is labelled as neutral. E.g., in "the october from bangladesh who are with me rise your voice" not any single word that belongs to the bag of positive words or belongs to the bag of negative words, the polarity of all these types of reviews is declared as neutral.
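A minimal sketch of how the pairwise rule in Eq. (8) and the labelling in Eq. (10) might be implemented is given below; the tiny word bags, the function names and the aggregation rule (any negative pair pulls the label to negative) are illustrative assumptions, not the exact bags or code used in this study.

BAG_P = {"good", "best", "amazing", "beautiful", "love", "cool", "soft"}
BAG_N = {"bad", "trash", "stupid", "annoying", "hates", "not", "worst"}
ASPECTS = {"lyrics", "music", "song", "video", "voice"}

def pair_polarity(w_prev, w):
    # Eq. (8): polarity contributed by the word pair (w_{a-1}, w_a)
    if w in ASPECTS:
        if w_prev in BAG_P: return "p"
        if w_prev in BAG_N: return "n"
    else:
        if w in BAG_P and w_prev in BAG_P: return "p"
        if w in BAG_P and w_prev in BAG_N: return "n"
        if w in BAG_N or w_prev in BAG_N: return "n"
    return None  # contributes the neutral value 1

def annotate(review):
    tokens = review.split()
    values = [pair_polarity(a, b) for a, b in zip(tokens, tokens[1:])]
    # Eq. (10): a negative pair gives -1 (negative), a positive pair with
    # no negative gives +1 (positive), otherwise the review is neutral
    if "n" in values: return "negative"
    if "p" in values: return "positive"
    return "neutral"

print(annotate("justin bieber you have amazing voice"))  # positive
print(annotate("this is the stupid voice"))              # negative
print(annotate("rise your voice"))                       # neutral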

Validation of Proposed Technique
The technique is validated using Cohen's Kappa statistic, which indicates the inter-rater reliability between annotators [62]. According to Cohen's Kappa, a value of Kappa > 90% indicates almost perfect agreement between the annotators [63]. Tab. 4 presents the interpretation of the different ranges of Cohen's Kappa values. To prove the efficacy of the proposed technique, the inter-annotator agreement is calculated using Cohen's Kappa statistic. For this purpose, three experiments were conducted: two on SSD100 and one on sub-dataset-Voice. The details of the experiments are shown in Tab. 5.
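Cohen's Kappa between two annotations can be computed, for example, with scikit-learn; the label lists below are made-up placeholders, not the study's data:

from sklearn.metrics import cohen_kappa_score

manual    = ["positive", "negative", "neutral", "positive", "neutral"]
automatic = ["positive", "negative", "neutral", "positive", "positive"]

kappa = cohen_kappa_score(manual, automatic)
print(f"Cohen's Kappa: {kappa:.4f}")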

Experiment 1
In this experiment, SSD100 was used for manual annotation. The experiment was conducted during the training of annotators A, B and C. The Kappa value, i.e., the inter-annotator agreement between the three annotators, was calculated and came out to be 85.28%.

Experiment 2
In this experiment, the annotation results for SSD100 obtained using manual annotation and using the proposed technique are compared, and the Kappa statistic value is calculated. The estimated value of Kappa is 90.96%. The level of agreement corresponding to this Kappa value is shown in Tab. 4.

Experiment 3
In this experiment, the sub-dataset-Voice is annotated once with the manual annotation technique and then using the proposed technique. The Kappa statistic value is calculated to validate the reliability of the results of the proposed technique; the value came out to be 95.71%. This proves that the proposed technique gives results as good as manual annotation at a far lower computational cost. The details of the Kappa values for the different dataset sizes can be seen in Tab. 5.
The remaining four sub-datasets are also annotated using the proposed N-gram based technique. Details of all sub-datasets can be seen in Tab. 6. It took almost 2 min 53.53 s (173.53 s) to annotate the complete dataset of 369,436 reviews (24,534,205 tokens), an average of 0.46963 ms per review and 7.07 µs per token. If it had been attempted manually, it would have taken approximately 14 weeks, 1 day and 7 h while working 40 h a week, with an average of 5.6 s per review (roughly 84 ms per token). Fig. 3 illustrates the difference between the two: the proposed technique completes the task in a few minutes, whereas manual annotation might have taken days and weeks. In terms of the expected time of manual annotation, the proposed technique is efficient with a ratio of 1:11934.28.

Conclusion
This study also established that manual annotation is highly subjective and not as good as it is supposed to be; inaccuracies creep in unknowingly. This research has presented a new technique that annotates large datasets as well as manual annotation does, and at a far lower computational cost. The dataset of English text reviews was scraped, preprocessed and annotated both manually and with the proposed technique. The technique can benefit users in multiple ways: it needs no additional resources, financial or human, and it is very efficient, requiring very little time at no additional cost. The time ratio of the proposed technique to manual annotation, i.e., 1:11934.28, shows its efficiency. If the complete dataset had been annotated manually, it might have taken approximately 2.07 million seconds (575 h), i.e., 14 weeks, 1 day and 7 h while working 40 h a week, whereas this technique has done the same in 173.53 s (2 min 53.53 s). The high Kappa statistic value of 95.71% validates the reliability of the results generated by the proposed technique. Machine learning and deep learning algorithms can be applied to the two datasets, one with manual annotation and the other with the proposed technique, to extend this analysis and study the variation between the two techniques.
Funding Statement: The authors received no specific funding for this study.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.