2.4 Approaches to Sentiment Analysis
2.4.2 Machine Learning Approach
2.4.2.1 Supervised Learning
Supervised machine learning techniques are used to classify documents or sentences into a finite set of classes. The machine learning algorithm is required to generalize from the training data to previously unseen data in a way which is considered reasonable [78]. For these approaches to work, they require the availability of labelled data. Labelled data is data whose classification has been predetermined by humans.
Supervised machine learning techniques require a large corpus of training data, and their performance depends on the existence of a good match between the training and the test data, with respect to domain, topic, as well as time-period [113].
There are a number of supervised learning algorithms that have been commonly used in the literature. We give a list of these here, and go further to explain the four which have been broadly utilized in the literature.
Naive bayes
Maximum Entropy
Decision Trees
Support Vector Machines (SVM)
Hidden Markov Models (HMM)
Random Forests
Supervised Machine Learning Algorithms in Sentiment Analysis
Naive Bayes The Naive Bayes classifier is a probabilistic classifier which makes use of Bayes Theorem [74] with strong independence assumptions. These assumptions are seen as simplistic, as they assume that all attributes of the classified objects are independent [37].
Despite the simplifying assumptions of this algorithm, it has been shown to perform well. It has the advantage of being easy to train, where even a small training set is sufficient to
train the classifier [116]. Another advantage of this algorithm is that its training time is significantly smaller [37].
The conditional independence of Naive Bayes does not hold in real world situations, but it still surprisingly performs quite well, but not quite as well as other sophisticated algorithms like SVM and Maximum Entropy [74].
Maximum Entropy The Maximum Entropy (Max Ent) algorithm has been proven to be a good approach in many applications dealing with automatic text processing. Being more effective than the Naive Bayes classifier, it is not simplified by the assumption of statistical independence, as the former is [37].
Max Ent makes no assumptions about the relationships that exists between features, and hence, might perform better when conditional independence assumptions are not met [74].
Decision Trees The principle by which the decision tree works is to split the dataset recursively into subsets, in such a way as to ensure that each subset contains more or less homogeneous states of a predictable nature [21]. The decision tree can be defined as a tree where each non-leaf node of the tree is one test about the variable value, and the branches represent the possible results [37].
In decision tree classifiers, the training data spaces are hierarchically decomposed, and in this decomposition process, the conditions on the attribute are utilized to split the data [81].The splitting of the data space or this decomposition of the data is performed in a recursive manner to the stage when the leaf nodes may contain at least some number of the records which then are utilized for classification [51].
The decision tree is a technique applied in data mining and in machine learning. There are two major types of these trees; the classification tree, and the regression tree. The classification tree analysis is applied when the prediction output is discrete classes, while the regression tree is used when the predicition outcome is a continous value [110]. The J48 and C45 decision tree classifiers are inbuilt in Weka, which is an open source Java tool and can
be used for text classification1.
Support Vector Machines The goal of the Support Vector Machines (SVM) is to find a hypothesis which can guarantee the lowest generalization error. This generalization error hypothesis is the probability that the hypothesis is false for a randomly selected object from the test set [37]. SVMs are large margin classifiers,and not probabilistic classifiers like Naive bayes and Maximum Entropy. Though they require a large dataset to enable them build a super quality model for classification [17][116], Support Vector Machines have been shown to be highly effective in traditional text classification and outperform Naive Bayes classifiers [74].
Support Vector Machines are one of the most popular supervised learning algorithms, not just in the area of Sentiment Analysis, but also in other areas of classification. Support Vector Machines are fast algorithms and perform with state-of-the-art accuracy. They are used for creating feature-vector-based classifiers. Each instance which is to be classified is basically represented by a vector of real-numbered features.
The text to be classified is converted into word vectors, and a hyperplane is drawn using these word vectors to seperate the data instances of one class from the other. SVM uses training instances also known as support vectors to find this hyper-plane [3]. The training data is used to generate a high-dimensional space which can be divided by this hyperplane, between the positive and negative instances, as the case may be. New instances are subsequently classified by finding their position in space with respect to the hyperplane. Symbolically, this search corresponds to a constrained optimisation problem which lets
cj ∈ {1,−1} which correspond to the positive and negative class, be the correct class of a
document dj. The solution will be written as shown:
~
v :=X
j
αjcjd~j, α≥0, (2.1)
Where αj’s are obtained from solving a dual optimization problem. Those d~j where
αj is greater than zero are referred to as ’support vectors’, as they happen to be the only document vectors which contribute to ~v. Classification of test instances is therefore about determining which side of~v’s hyperplane they fall on [74].
Support Vector Machines are good at combining diverse information sources. They do not assume that features are independent, and deal well with overlaps in information sources. They are however sensitive to sparse and insufficient data. They also have the capability to handle high dimensional spaces through their use of kernels.
For linearly inseperable problems, the kernel function of SVM can be used to concert low dimensional space nonlinear problems to high dimension space linear problems. The mapping of the kernel function can act as a good control of the computational complexity of nonlinear expansion, and avoid the curse of dimensionality [114].
Application of Supervised Learning Algorithms in Sentiment Analysis The Bayesian classifier was used by [20], as one of the algorithms implemented in their work for extract- ing small investor sentiment from stock message boards. The Bayesian classifier relies on a multivariate application of Bayes theorem, and makes use of word-based probabilities. As such, it is indifferent to the language structure. This language independence gives it a wide applicability, which enables investigation of message boards in other financial markets, where the underlying language may not be English.
Pang et al [74] consider the problem of classifying documents not by topic, but by overall sentiment, through the use of three different machine learning techniques; Nave Bayes, Max- imum Entropy and Support Vector Machines (SVMs). They implemented these algorithms on a movie reviews domain, the Internet Movie Database (IMDb). The aim was to examine if it was enough to consider sentiment classification as just another form of topic classifica- tion. They found Nave Bayes performed the poorest, while SVM performed the best, though the difference margin was not so large. They also reported that these algorithms had given better results on previous studies in topic-based classification.
lexical counterparts, [4] carried out a comparative analysis of the effectiveness of machine learning techniques on sentiment polarization in movie blogs, in comaprison to that of lexical approaches. From the results obtained, they concluded that machine learning techniques were superior, as even the worst machine learning result outperformed the best lexical results. They suggested that this may be due to the heavy reliance of lexical approaches on semantic information.
Others, such as [73], [72] and [4] first carried out the conversion of a document into vectors and then used classifiers, which included SVM, Maximum Entropy, Nave Bayes, and ADTree to categorize the documents.