3.3 Research methodology

3.3.1 Sentiment analysis models methodology

There are three approaches to building sentiment analysis models: the lexicon-based approach, the machine learning approach and the hybrid approach.

The lexicon-based approach is a simple approach in which each word of a new sentence is looked up in a lexicon. The lexicon lists words together with a polarity label or a number reflecting how strongly the word expresses each polarity/emotion. A polarity/emotion label is then assigned to the sentence according to the majority score: for example, if most of the words are positive, or the positive total is higher than the negative total, the sentence is labelled as positive. This approach has limited accuracy and can be misleading because it depends entirely on the coverage of the lexicon. If none of the words are found in the lexicon, the sentence remains unlabelled.
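The labelling rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lexicon `TOY_LEXICON`, its scores and the function name `label_sentence` are hypothetical, and returning "neutral" on a tie is a choice made for the sketch rather than something prescribed by the approach.

```python
# Hypothetical toy lexicon: word -> polarity score (positive > 0, negative < 0).
TOY_LEXICON = {
    "good": 1.0, "great": 2.0, "clear": 1.0,
    "bad": -1.0, "boring": -1.5, "confusing": -2.0,
}


def label_sentence(sentence, lexicon=TOY_LEXICON):
    """Label a sentence by comparing its positive and negative lexicon totals.

    Returns None when no word of the sentence is found in the lexicon,
    i.e. the sentence remains unlabelled.
    """
    words = sentence.lower().split()
    scores = [lexicon[w] for w in words if w in lexicon]
    if not scores:
        return None  # no lexicon coverage: sentence stays unlabelled
    positive_total = sum(s for s in scores if s > 0)
    negative_total = -sum(s for s in scores if s < 0)
    if positive_total > negative_total:
        return "positive"
    if negative_total > positive_total:
        return "negative"
    return "neutral"  # assumption of this sketch: ties are treated as neutral
```

The dependency on the lexicon is visible in the `None` branch: a sentence whose words are all missing from the lexicon cannot be labelled at all.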

The machine learning approach is a more effective method of detecting polarity/emotion. It uses machine learning classifiers to predict the polarity/emotion, and it is more accurate than the lexicon-based approach [123].

The hybrid approach (or combined approach) uses the lexicon-based and machine learning approaches together. This is usually implemented by including the values/results obtained from the lexicon as inputs (i.e. features) to the machine learning classifier. In this research, the machine learning and hybrid approaches are used. Machine learning is a method of Data Mining (DM).
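One way to picture the hybrid approach is as a feature vector in which lexicon totals are appended to ordinary word counts before the vector is passed to a classifier. The sketch below is a minimal illustration, not the implementation used in this research; the lexicon, the vocabulary and the function name `hybrid_features` are hypothetical.

```python
from collections import Counter

# Hypothetical toy lexicon: word -> polarity score.
TOY_LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "confusing": -2.0}


def hybrid_features(tokens, vocabulary, lexicon=TOY_LEXICON):
    """Build a feature vector: word counts followed by two lexicon totals."""
    counts = Counter(tokens)
    word_counts = [counts[w] for w in vocabulary]  # machine learning features
    scores = [lexicon.get(t, 0.0) for t in tokens]
    positive_total = sum(s for s in scores if s > 0)   # lexicon feature
    negative_total = -sum(s for s in scores if s < 0)  # lexicon feature
    return word_counts + [positive_total, negative_total]
```

For example, with the vocabulary `["lecture", "was", "good"]`, the tokens `["the", "lecture", "was", "good", "good"]` yield three word counts followed by a positive total of 2.0 and a negative total of 0.0.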

Data Mining aims to explore and analyse datasets to extract patterns and knowledge and find relations between attributes [53]. DM is also known as knowledge extraction, information discovery, information harvesting, data archaeology and data pattern processing [93].

Knowledge Discovery (KD) is the process of finding new knowledge about an application domain. DM is one of the steps of knowledge discovery; it applies a particular discovery task within an application domain.

Knowledge Discovery and Data Mining (KDDM) is the process of applying KD to any data source. KDDM covers the entire knowledge extraction process, including developing efficient algorithms to analyse the data, interpreting and visualising the results, and modelling the interaction between human and machine [93].

We can apply the steps of KDDM models to construct sentiment analysis models. There are many existing KDDM models, such as those of Fayyad et al. [53], Cabena et al. [22], Anand and Buchner [5], CRISP-DM [152], Cios et al. [31], and the generic model [93]. The steps of these models are listed in Table 3.1. The steps common to the five earlier models are: Domain Understanding, Data Preparation, DM and Evaluation of the Discovered Knowledge. Fayyad’s nine-step model differs from the other models in that it performs the activities related to choosing the DM task and algorithm late in the process, whereas all other models perform this step before preprocessing the data. The latter ordering is preferable because, when the data is correctly prepared for the DM step, there is no need to repeat any other process steps [93]. One benefit of Fayyad’s model is that, if the prepared data is not suitable for the tool of choice, it allows looping back to the second, third or fourth step [93]. This is also beneficial for data that requires extensive preprocessing.

The models of Cabena et al. and Cios et al. and the CRISP-DM model are very similar; however, Cabena’s model omits the Data Understanding step, which is a necessary step according to Kurgan and Musilek [93]. Cabena’s model is therefore suitable for applications where the data is virtually ready for mining before the project starts.

Anand and Buchner’s model is very detailed and includes steps of the early phases of the KDDM process. However, it does not include necessary activities for putting the discovered knowledge to work.

The generic model was introduced by Kurgan and Musilek [93]. It differs from the other models, all of which involve complex and time-consuming data preparation tasks and include iterations and loops between most of the steps. The generic model consists of six steps and is similar to the Cios et al. [31] and CRISP-DM models. It was built on the older models, and its six steps can accommodate all of them with one minor modification: several original steps are combined into one major step. For this reason, it is the model chosen in this research. As shown in Figure 3.1, when the generic model is applied to the identification of sentiment analysis models, the following steps are carried out:

1. Understanding the educational domain and the system (application) as a whole: this step is important because it determines the other steps.

2. Understanding the students’ feedback: this step allows us to examine the data and plan the preprocessing.

3. Data preprocessing, feature selection and determination of the machine learning techniques: this step explores which preprocessing techniques, features and machine learning techniques perform best for the application (i.e. lead to the highest accuracy).

4. Sentiment analysis step (DM step): this step includes testing the sentiment analysis models on the data.

5. Evaluation: evaluating the models using the data mining evaluation metrics, which are accuracy, precision, recall and F-score.

6. Knowledge Consolidation and Deployment: this step allows us to extract knowledge from the proposed system, such as labelling the students’ feedback.

Fig. 3.1 Generic model for sentiment analysis

Table 3.1 KDDM models

Fayyad et al. [53]
1. Developing and Understanding of the Application Domain
2. Creating a Target Data Set
3. Data Cleaning and Preprocessing
4. Data Reduction and Projection
5. Choosing the DM Task
6. Choosing the DM Algorithm
7. DM
8. Interpreting Mined Patterns
9. Consolidating Discovered Knowledge

Cabena et al. [22]
1. Data Preparation
2. DM
3. Domain Knowledge Elicitation
4. Assimilation of Knowledge

Anand and Buchner [5]
1. Human Resource Identification
2. Problem Specification
3. Data Prospecting
4. Domain Knowledge Elicitation
5. Methodology Identification
6. Data Preprocessing
7. Pattern Discovery
8. Knowledge Post-processing

CRISP-DM [152]
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment

Cios et al. [31]
1. Understanding the Problem Domain
2. Understanding the Data
3. Preparation of the Data
4. DM
5. Evaluation of the Discovered Knowledge
6. Using the Discovered Knowledge

Generic model [93]
1. Application Domain Understanding
2. Data Understanding
3. Data Preparation and Identification of DM Technology
4. DM
5. Evaluation
6. Knowledge Consolidation and Deployment

In other words, to construct sentiment analysis models using a machine learning approach, the domain and the data first need to be understood, and then four main steps are applied: preprocessing the data, selecting the features, applying the machine learning techniques and evaluating the results. The steps involved in model development are illustrated in Figure 3.2. An overview of previous sentiment analysis research related to these steps is given in the following subsections.

Fig. 3.2 Model development

Preprocessing

Preprocessing is the process of cleaning the data of unwanted elements. It increases the accuracy of the results by reducing errors in the data [3]. Omitting preprocessing steps such as spelling correction may lead the system to ignore important words. On the other hand, excessive preprocessing may cause the loss of important information. For example, removing the punctuation from “what does this example mean???” discards the question marks, whose presence (and repetition) may indicate confusion.

There are many general preprocessing techniques, of which the most common are: tokenization, converting text to lower or upper case, removing punctuation, removing numbers, removing repeated letters, removing stop words, stemming and negation handling [27, 32, 115, 126, 136]. Some of these preprocessing techniques are found in previous sentiment analysis research, e.g. Mouthami et al. [115], Pak and Paroubek [126, 127], Chamlertwat et al. [27], Claster et al. [32] and Prasad [136].
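Several of the general techniques listed above can be combined in a short Python function, sketched here using only the standard library. The stop-word list, the negation list and the `NOT_` prefix are assumptions made for illustration, and stemming is deliberately left out because it would require an external stemmer.

```python
import re

# Hypothetical, deliberately tiny word lists used only for this sketch.
STOP_WORDS = {"the", "a", "an", "is", "was", "of", "and", "to"}
NEGATIONS = {"not", "no", "never"}


def preprocess(text):
    """Lower-case, clean and tokenize a text, marking negated words."""
    text = text.lower()                              # convert to lower case
    text = re.sub(r"\d+", " ", text)                 # remove numbers
    text = re.sub(r"[^\w\s]", " ", text)             # remove punctuation
    text = re.sub(r"([a-z])\1{2,}", r"\1\1", text)   # collapse 3+ repeated letters to 2
    tokens = []
    negate = False
    for token in text.split():                       # whitespace tokenization
        if token in NEGATIONS:
            negate = True                            # mark the next kept word
            continue
        if token in STOP_WORDS:
            continue                                 # remove stop words
        tokens.append("NOT_" + token if negate else token)
        negate = False
    return tokens
```

Under these assumptions, `preprocess("The lecture was NOT goooood, 100%!!!")` returns `["lecture", "NOT_good"]`. The same call also shows the risk of over-preprocessing: the repeated exclamation marks, which may carry sentiment, are lost.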

Preprocessing data from social media is different because of the special symbols found in social media data, such as emoticons, hashtags and chat language. Social media is a popular tool for collecting people’s opinions, as it gives users a simple interface that enables and encourages them to express their sentiments. Twitter is one of the largest and most popular social media platforms for capturing opinions [127]. Analysing Twitter data is different from analysing plain text.

Some of the most common preprocessing techniques used for Twitter data are removing hashtags [92], removing URLs [92, 126, 136, 190], removing retweets [190], identifying emoticons [1, 92, 136, 190], removing user mentions in tweets [126, 190], removing Twitter special characters [92] and handling slang/chat language [1]. These techniques are described in more detail in chapters 4 and 5.
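As a rough illustration of the Twitter-specific techniques above, the following sketch removes URLs, user mentions and hashtags, maps emoticons to placeholder tokens and expands a few slang terms. The emoticon and slang dictionaries, the placeholder tokens and the `RT ` prefix test for retweets are all assumptions made for this example; the techniques actually applied in this research are described in later chapters.

```python
import re

# Hypothetical lookup tables used only for this sketch.
EMOTICONS = {":)": "EMO_POS", ":-)": "EMO_POS", ":D": "EMO_POS",
             ":(": "EMO_NEG", ":-(": "EMO_NEG"}
SLANG = {"gr8": "great", "u": "you", "thx": "thanks"}


def is_retweet(tweet):
    """Assumption: a retweet is a tweet whose text starts with 'RT '."""
    return tweet.startswith("RT ")


def clean_tweet(tweet):
    """Remove URLs, mentions and hashtags; map emoticons and slang."""
    tweet = re.sub(r"https?://\S+", " ", tweet)  # remove URLs
    tweet = re.sub(r"@\w+", " ", tweet)          # remove user mentions
    tweet = re.sub(r"#\w+", " ", tweet)          # remove hashtags
    tokens = []
    for token in tweet.split():
        if token in EMOTICONS:                   # identify emoticons
            tokens.append(EMOTICONS[token])
        else:                                    # handle slang/chat language
            lowered = token.lower()
            tokens.append(SLANG.get(lowered, lowered))
    return tokens
```

Emoticons are looked up before lower-casing so that case-sensitive forms such as `:D` are still recognised.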

Features

Features give a more accurate analysis of the sentiments and detailed summarisation of the results [186]. The most common features are n-grams [1, 63, 180] and POS (part-of-speech) tagging [131, 181].
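As a brief illustration of n-gram features, the sketch below extracts all contiguous sequences of n tokens from a tokenized sentence. The function name `ngrams` is chosen for this example; POS tagging is not shown because it relies on a trained tagger.

```python
def ngrams(tokens, n):
    """Return the list of contiguous n-token tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

With `n = 1` this gives unigrams (single words), and with `n = 2` bigrams: the tokens `["the", "lecture", "was", "good"]` produce the bigrams `("the", "lecture")`, `("lecture", "was")` and `("was", "good")`. A sentence shorter than n tokens produces no n-grams.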

Other, less common features include lexicon-based features, Twitter-related features (e.g. the number of hashtags and emoticons) and punctuation-related features (e.g. the number of punctuation marks, particularly question and exclamation marks). More details about these features are presented in the experiments in chapters 4 and 5.
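The Twitter-related and punctuation-related features mentioned above are simple counts, which can be sketched as follows. The feature names and the emoticon pattern are hypothetical and cover only a handful of common emoticons.

```python
import re
import string


def surface_features(text):
    """Count hashtags, emoticons and punctuation marks in a raw text."""
    return {
        "n_hashtags": len(re.findall(r"#\w+", text)),
        # Assumed pattern: eyes, optional nose, mouth, e.g. :) ;-) :( :D
        "n_emoticons": len(re.findall(r"[:;]-?[()DP]", text)),
        "n_question": text.count("?"),
        "n_exclamation": text.count("!"),
        "n_punctuation": sum(ch in string.punctuation for ch in text),
    }
```

These counts have to be taken from the raw text, before any preprocessing step removes punctuation or hashtags. Note that in this sketch the characters making up hashtags and emoticons are also included in the overall punctuation count.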

Machine Learning Techniques

The most popular machine learning techniques for sentiment analysis are Naive Bayes (NB) [17, 57, 63, 111, 129, 160], Multinomial Naive Bayes (MNB) [63, 170], Support Vector Machines (SVM) [17, 44, 63, 111] and Maximum Entropy (ME) [63, 111].
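To give a concrete idea of one of these classifiers, the sketch below implements a small Multinomial Naive Bayes over tokenized documents, using log probabilities and add-one (Laplace) smoothing. It is a teaching sketch under these assumptions rather than the configuration used in the cited studies; the class name `TinyMultinomialNB` is hypothetical, and words unseen during training are simply ignored at prediction time.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Minimal Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {w for doc in docs for w in doc}
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc)
        self.n_docs = len(docs)
        return self

    def predict(self, doc):
        best_label, best_score = None, -math.inf
        for c in self.classes:
            # log prior of the class
            score = math.log(self.class_counts[c] / self.n_docs)
            total = sum(self.word_counts[c].values())
            for w in doc:
                if w not in self.vocab:
                    continue  # assumption: skip words never seen in training
                # smoothed log likelihood of the word given the class
                score += math.log(
                    (self.word_counts[c][w] + 1) / (total + len(self.vocab))
                )
            if score > best_score:
                best_label, best_score = c, score
        return best_label
```

Trained on a handful of labelled token lists, the classifier picks the class whose prior and word likelihoods give the highest total log score.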

Evaluation

The most common evaluation metrics used in sentiment analysis are accuracy, precision, recall and F-score. Their definitions rely on the following four concepts, explained here for a classification into two classes, i.e. positive and negative.

1. True positives (TP) represent the number of correctly classified instances for the positive class, i.e. the instances classified as positive that are in reality positive;

2. False positives (FP) represent the number of incorrectly classified instances for the positive class, i.e. the instances classified as positive that are in reality negative;

3. True negatives (TN) represent the number of correctly classified instances for the negative class, i.e. the instances classified as negative that are in reality negative;

4. False negatives (FN) represent the number of incorrectly classified instances for the negative class, i.e. the instances classified as negative that are in reality positive.

Each of these counts can also be expressed as a proportion by dividing it by the total number of instances; this normalisation does not change the values of the metrics defined below.

The accuracy metric indicates how many instances were correctly classified across all classes:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Precision represents the fraction of retrieved instances that are relevant, e.g. how many of the instances classified as positive are actually positive:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall is the fraction of relevant instances that are retrieved, e.g. how many of the instances that are actually positive were classified as positive:

$$\text{Recall} = \frac{TP}{TP + FN}$$

The F score is a metric of accuracy that combines precision and recall:

$$F = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

Precision, recall and F scores for a classifier are obtained as the average of these values for all the classes.
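The metrics can be computed directly from the true and predicted labels, as in the sketch below. The function names are chosen for this example, and returning 0.0 when a denominator is zero is a convention assumed here, not one stated in the text. Each class is treated in turn as the "positive" class, and the per-class values are then averaged.

```python
def per_class_scores(y_true, y_pred, positive):
    """Precision, recall and F score, treating one class as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # Assumed convention: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return precision, recall, f_score


def evaluate(y_true, y_pred):
    """Return accuracy and class-averaged (precision, recall, F score)."""
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    classes = sorted(set(y_true) | set(y_pred))
    scores = [per_class_scores(y_true, y_pred, c) for c in classes]
    averaged = tuple(sum(values) / len(classes) for values in zip(*scores))
    return accuracy, averaged
```

For the true labels `["pos", "pos", "neg", "neg"]` and predictions `["pos", "neg", "neg", "neg"]`, three of four instances are correct, so the accuracy is 0.75. The positive class has precision 1 and recall 0.5, the negative class has precision 2/3 and recall 1, and the averaged recall is therefore 0.75.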