Recursive Neural Conditional Random Fields
for Aspect-based Sentiment Analysis
Wenya Wang†‡ Sinno Jialin Pan† Daniel Dahlmeier‡ Xiaokui Xiao† †Nanyang Technological University, Singapore
‡SAP Innovation Center Singapore
†{wa0001ya, sinnopan, xkxiao}@ntu.edu.sg,‡{d.dahlmeier}@sap.com
Abstract
In aspect-based sentiment analysis, extract-ing aspect terms along with the opinions be-ing expressed from user-generated content is one of the most important subtasks. Previ-ous studies have shown that exploiting con-nections between aspect and opinion terms is promising for this task. In this paper, we pro-pose a novel joint model that integrates recur-sive neural networks and conditional random fields into a unified framework for explicit as-pect and opinion terms co-extraction. The proposed model learns high-level discrimina-tive features and double propagates informa-tion between aspect and opinion terms, simul-taneously. Moreover, it is flexible to incor-porate hand-crafted features into the proposed model to further boost its information extrac-tion performance. Experimental results on the dataset from SemEval Challenge 2014 task 4 show the superiority of our proposed model over several baseline methods as well as the winning systems of the challenge.
1 Introduction
Aspect-based sentiment analysis (Pang and Lee, 2008; Liu, 2011) aims to extract important infor-mation, e.g., opinion targets, opinion expressions, target categories, and opinion polarities, from user-generated content, such as microblogs, reviews, etc. This task was first studied by Hu and Liu (2004a; 2004b), followed by Popescu and Etzioni (2005), Zhuang et al. (2006), Zhang et al. (2010), Qiu et al. (2011), Li et al. (2010). In aspect-based senti-ment analysis, one of the goals is to extract explicit aspects of an entity from text, along with the opin-ions being expressed. For example, in a restaurant review“I have to say they have one of the fastest
de-livery times in the city.”, the aspect term isdelivery times, and the opinion term isfastest.
Among previous work, one of the approaches is to accumulate aspect and opinion terms from a seed collection without label information, by utiliz-ing syntactic rules or modification relations between them (Qiu et al., 2011; Liu et al., 2013b). In the above example, if we know fastest is an opinion word, thendelivery timesis probably inferred to be an aspect because fastestis its modifier. However, this approach largely relies on hand-coded rules and is restricted to certain Part-of-Speech (POS) tags, e.g., opinion words are restricted to be adjectives. Another approach focuses on feature engineering based on predefined lexicons, syntactic analysis, etc. (Jin and Ho, 2009; Li et al., 2010). A sequence labeling classifier is then built to extract aspect and opinion terms. This approach requires extensive ef-forts for designing hand-crafted features and only combines features linearly for classification which ignores higher order interactions.
To overcome the limitations of existing methods, we propose a novel model, named Recursive Neu-ral Conditional Random Fields (RNCRF). Specif-ically, RNCRF consists of two main components. The first component is a recursive neural network (RNN)1 (Socher et al., 2010) based on a
depen-dency tree of each sentence. The goal is to learn a high-level feature representation for each word in the context of each sentence and make the represen-tation learning for aspect and opinion terms inter-active through the underlying dependency structure among them. The output of the RNN is then fed into a Conditional Random Field (CRF) (Lafferty et al., 2001) to learn a discriminative mapping from
high-1Note that in this paper, RNN stands for recursive neural network instead of recurrent neural network.
level features to labels, i.e., aspects, opinions, or
others, so that context information can be well cap-tured. Our main contributions are to use RNN for encoding aspect-opinion relations in high-level rep-resentation learning and to present a joint optimiza-tion approach based on maximum likelihood and backpropagation to learn the RNN and CRF com-ponents simultaneously. In this way, the label in-formation of aspect and opinion terms can be dually propagated from parameter learning in CRF to rep-resentation learning in RNN. We conduct expensive experiments on the dataset from SemEval challenge 2014 task 4 (subtask 1) (Pontiki et al., 2014) to ver-ify the superiority of RNCRF over several baseline methods as well as the winning systems of the chal-lenge.
2 Related Work
2.1 Aspects and Opinions Co-Extraction Hu et al. (2004a) proposed to extract product aspects through association mining, and opinion terms by augmenting a seed opinion set using synonyms and antonyms in WordNet. In follow-up work, syntactic relations are further exploited for aspect/opinion ex-traction (Popescu and Etzioni, 2005; Wu et al., 2009; Qiu et al., 2011). For example, Qiu et al. (2011) used syntactic relations to double propagate and augment the sets of aspects and opinions. Although the above models are unsupervised, they heavily depend on predefined rules for extraction and are restricted to specific types of POS tags for product aspects and opinions. Jin et al. (2009), Li et al. (2010), Jakob et al. (2010) and Ma et al. (2010) modeled the ex-traction problem as a sequence tagging problem and proposed to use HMMs or CRFs to solve it. These methods rely on rich hand-crafted features and do not consider interactions between aspect and opin-ion terms explicitly. Another directopin-ion is to use word alignment model to capture opinion relations among a sentence (Liu et al., 2012; Liu et al., 2013a). This method requires sufficient data for modeling desired relations.
Besides explicit aspects and opinions extraction, there are also other lines of research related to aspect-based sentiment analysis, including aspect classification (Lakkaraju et al., 2014; McAuley et al., 2012), aspect rating (Titov and McDonald, 2008;
Wang et al., 2011; Wang and Ester, 2014), domain-specific and target-dependent sentiment classifica-tion (Lu et al., 2011; Ofek et al., 2016; Dong et al., 2014; Tang et al., 2015).
2.2 Deep Learning for Sentiment Analysis Recent studies have shown that deep learning mod-els can automatically learn the inherent semantic and syntactic information from data and thus achieve better performance for sentiment analysis (Socher et al., 2011b; Socher et al., 2012; Socher et al., 2013; Glorot et al., 2011; Kalchbrenner et al., 2014; Kim, 2014; Le and Mikolov, 2014). These methods gen-erally belong to sentence-level or phrase/word-level sentiment polarity predictions. Regarding aspect-based sentiment analysis, Irsoy et al. (2014) ap-plied deep recurrent neural networks for opinion ex-pression extraction. Dong et al. (2014) proposed an adaptive recurrent neural network for target-dependent sentiment classification, where targets or aspects are given as input. Tang et al. (2015) used Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) for the same task. Neverthe-less, there is little work in aspects and opinions co-extraction using deep learning models.
To the best of our knowledge, the work of Liu et al. (2015) and Yin et al. (2016) are the most re-lated to ours. Liu et al. (2015) proposed to com-bine recurrent neural network and word embeddings to extract explicit aspects. However, the proposed model simply uses recurrent neural network on top of word embeddings, and thus its performance heav-ily depends on the quality of word embeddings. In addition, it fails to explicitly model dependency re-lations or compositionalities within certain syntactic structure in a sentence. Recently, Yin et al. (2016) proposed an unsupervised learning method to im-prove word embeddings using dependency path em-beddings. A CRF is then trained with the embed-dings independently in the pipeline.
embeddings and a CRF separately as in (Yin et al., 2016). Note that Weiss et al. (2015) proposed to combine deep learning and structured learning for language parsing which can be learned by structured perceptron. However, they also separate neural net-work training with structured prediction.
2.3 Recursive Neural Networks
Among deep learning methods, RNN has shown promising results on various NLP tasks, such as learning phrase representations (Socher et al., 2010), sentence-level sentiment analysis (Socher et al., 2013), language parsing (Socher et al., 2011a), and question answering (Iyyer et al., 2014). The tree structures used for RNNs include constituency tree and dependency tree. In a constituency tree, all the words lie at leaf nodes, each internal node repre-sents a phrase or a constituent of a sentence, and the root node represents the entire sentence (Socher et al., 2010; Socher et al., 2012; Socher et al., 2013). In a dependency tree, each node including terminal and nonterminal nodes, represents a word, with de-pendency connections to other nodes (Socher et al., 2014; Iyyer et al., 2014). The resultant model is known as dependency-tree RNN (DT-RNN). An ad-vantage of using dependency tree over the other is the ability to extract word-level representations con-sidering syntactic relations and semantic robustness. Therefore, we adopt DT-RNN in this work.
3 Problem Statement
Suppose that we are given a training set of cus-tomer reviews in a specific domain, denoted byS=
{s1, ..., sN}, whereN is the number of review sen-tences. For anysi∈S, there may exist a set of aspect terms Ai ={ai1, ..., ail}, where eachaij ∈Ai can be a single word or a sequence of words expressing explicitly some aspect of an entity, and a set of opin-ion terms Oi ={oi1, ..., oim}, where each oir can be a single word or a sequence of words expressing the subjective sentiment of the comment holder. The task is to learn a classifier to extract the set of aspect termsAi and the set of opinion termsOi from each review sentencesi∈S.
This task can be formulated as a sequence tag-ging problem by using the BIO encoding scheme. Specifically, each review sentence si is composed
of a sequence of wordssi={wi1, ..., wini}. Each wordwip∈si is labeled as one out of the following 5 classes: BA (beginning of aspect), IA (inside of as-pect), BO (beginning of opinion), IO (inside of opin-ion), and O (others). LetL={BA,IA,BO,IO,O}. We are also given a test set of review sentences de-noted byS0={s01, ..., s0N0}, whereN0is the number
of test reviews. For each test reviews0i∈S0, our ob-jective is to predict the class labely0
iq∈L for each wordw0iq. Note that a sequence of predictions with BA at the beginning followed by IA are indication of one aspect, which is similar for opinion terms.2
4 Recursive Neural CRFs
As described in Section 1, RNCRF consists of two main components: 1) a DT-RNN to learn a high-level representation for each word in a sentence, and 2) a CRF to take the learned representation as input to capture context around each word for explicit as-pect and opinion terms extraction. Next, we present these two components in details.
4.1 Dependency-Tree RNNs
We begin by associating each word w in our vo-cabulary with a feature vector x∈Rd, which cor-responds to a column of a word embedding matrix We∈Rd×v, wherev is the size of the vocabulary. For each sentence, we build a DT-RNN based on the corresponding dependency parse tree with word em-beddings as initialization. An example of the depen-dency parse tree is shown in Figure 1(a), where each edge starts from the parent and points to its depen-dent with a syntactic relation.
In a DT-RNN, each noden, including leaf nodes, internal nodes and the root node, in a specific sen-tence is associated with a wordw, an input feature vectorxw, and a hidden vectorhn∈Rdof the same dimension asxw. Each dependency relationr is as-sociated with a separate matrixWr∈Rd×d. In addi-tion, a common transformation matrixWv∈Rd×dis introduced to map the word embeddingxw at node nto its corresponding hidden vectorhn.
(a) Example of a dependency tree. (b) Example of a DT-RNN tree structure. (c) Example of a RNCRF structure.
Figure 1: Examples of dependency tree, DT-RNN structure and RNCRF structure for a review sentence.
Along with a particular dependency tree, a hidden vectorhnis computed from its own word embedding xwat nodenwith the transformation matrixWvand its children’s hidden vectorshchild(n) with the cor-responding relation matrices{Wr}’s. For instance, given the parse tree shown in Figure 1(a), we first compute the leaf nodes associated with the wordsI
andtheusingWvas follows,
hI = f(Wv·xI+b), hthe = f(Wv·xthe+b),
wheref is a nonlinear activation function andbis a bias term. In this paper, we adopttanh(·)as the
ac-tivation function. Once the hidden vectors of all the leaf nodes are generated, we can recursively gener-ate hidden vectors for interior nodes using the corre-sponding relation matrixWrand the common trans-formation matrixWvas follows,
hfood = f(Wv·xfood+WDET·hthe+b), hlike = f(Wv·xlike+WDOBJ·hfood
+WNSUBJ·hI+b).
The resulting DT-RNN is shown in Figure 1(b). In general, a hidden vector for any node nassociated with a word vectorxwcan be computed as follows,
hn=f
Wv·xw+b+ X
k∈Kn
Wrnk·hk , (1)
whereKndenotes the set of children of noden,rnk denotes the dependency relation between nodenand its child nodek, and hk is the hidden vector of the child nodek. The parameters of DT-RNN,ΘRNN=
{Wv, Wr, We, b}, are learned during training.
4.2 Integration with CRFs
CRFs are a discriminative graphical model for struc-tured prediction. In RNCRF, we feed the output of DT-RNN, i.e., the hidden representation of each word in a sentence, to a CRF. Updates of parameters for RNCRF are carried out successively from the top to bottom, by propagating errors through CRF to the hidden layers of RNN (including word em-beddings) using backpropagation through structure (BPTS) (Goller and K¨uchler, 1996).
Formally, for each sentencesi, we denote the in-put for CRF byhi, which is generated by DT-RNN. Here hi is a matrix with columns of hidden vec-tors{hi1, ..., hini}to represent a sequence of words
{wi1, ..., wini} in a sentence si. The model com-putes a structured output yi = {yi1, ..., yini} ∈Y, where Y is a set of possible combinations of la-bels in label set L. The entire structure can be represented by an undirected graph G = (V, E)
with cliques c ∈ C. In this paper, we employed linear-chain CRF, which has two different cliques: unary cliques (U) representing input-output connec-tion, and pairwise cliques (P) representing adjacent output connections, as shown in Figure 1(c). Dur-ing inference, the model aims to outputˆy with the maximum conditional probabilityp(y|h). (We drop the subscriptihere for simplicity.) The probability is computed from potential outputs of the cliques:
p(y|h) = 1
Z(h)
Y
c∈C
ψc(h,yc), (2)
where Z(h) is the normalization term, and
Figure 2: An example for computing input-ouput potential for the second positionlike.
also incorporate a context window of size 2T+ 1
when computing unary potentials (T is a hyper-parameter). Thus, the potential of unary clique at nodekcan be written as
ψU(h, yk) = exp (W0)yk·hk+ T X
t=1
(W−t)yk·hk−t
+
T X
t=1
(W+t)yk ·hk+t !
, (3)
whereW0,W+tandW−tare weight matrices of the CRF for the current position, thet-th position to the right, and thet-th position to the left within context window, respectively. The subscriptykindicates the corresponding row in the weight matrix.
For instance, Figure 2 shows an example of win-dow size 3. At the second position, the input features forlikeare composed of the hidden vectors at posi-tion 1 (hI), position 2 (hlike) and position 3 (hthe). Therefore, the conditional distribution for the entire sequenceyin Figure 1(c) can be calculated as
p(y|h)= 1 Z(h)exp
4 X
k=1
(W0)yk·hk+
4 X
k=2
(W−1)yk·hk−1
+
3 X
k=1
(W+1)yk·hk+1+
3 X
k=1
Vyk,yk+1 !
,
where the first three terms in the exponential of the RHS consider unary clique while the last term con-siders the pairwise clique with matrixV represent-ing pairwise state transition score. For simplicity in description on parameter updates, we denote the log-potential for clique c∈ {U, P} by gc(h,yc) =
hWc, F(h,yc)i.
4.3 Joint Training for RNCRF
Through the objective of maximum likelihood, up-dates for parameters of RNCRF are first conducted on the parameters of the CRF (unary weight matri-ces ΘU = {W0, W+t, W−t} and pairwise weight matrix V) by applying chain rule to log-potential updates. Below is the gradient forΘU (updates for
V are similar through the log-potential of pairwise cliquegP(y0k, y0k+1)):
4ΘU =
∂−logp(y|h)
∂gU(h, yk0) ·
∂gU(h, yk0) ∂ΘU
, (4)
where
∂−logp(y|h)
∂gU(h, yk0)
=−(1yk=y0k−p(y
0
k|h)), (5)
and yk0 represents possible label configuration of node k. The hidden representations of each word and the parameters of DT-RNN are updated sub-sequently by applying chain rule with (5) through BPTS as follows,
4hroot = ∂−logp(y|h)
∂gU(h, yroot0 ) ·
∂gU(h, yroot0 ) ∂hroot ,(6)
4hk6=root = ∂−logp(y|h)
∂gU(h, y0k)
·∂gU(h, yk0) ∂hk
+4hpar(k)·∂hpar(k)
∂hk
, (7)
4ΘRNN =
K X
k=1
∂−logp(y|h)
∂hk ·
∂hk
∂ΘRNN, (8)
wherehroot represents the hidden vector of the word pointed by ROOT in the corresponding DT-RNN. Since this word is the topmost node in the tree, it only inherits error from the CRF output. In (7), hpar(k)denotes the hidden vector of the parent node of node k in DT-RNN. Hence the lower nodes re-ceive error from both the CRF output and error prop-agation from parent node. The parameters within DT-RNN, ΘRNN, are updated by applying chain
rule with respect to updates of hidden vectors, and aggregating among all associated nodes, as shown in (8). The overall procedure of RNCRF is summa-rized in Algorithm 1.
5 Discussion
Algorithm 1Recursive Neural CRFs
Input: A set of customer review sequences: S =
{s1, ..., sN}, and feature vectors ofddimensions for each
word{xw}’s, window sizeT for CRFs
Output:Parameters:Θ =ΘRNN,ΘU, V
Initialization: InitializeWeusingword2vec. InitializeWv
and {Wr}’s randomly with uniform distribution between h
−√√6 2d+1,
√
6
√
2d+1
i
. InitializeW0, {W+t}’s,{W−t}’s,V,
andbwith all 0’s foreach sentencesido
1: Use DT-RNN (1) to generatehi
2: Computep(yi|hi)using (2)
3: Use the backpropagation algorithm to update parame-tersΘthrough (4)-(8)
end for
the difficulty of incorporating dependency structure explicitly as input features, which motivates the de-sign of our model to use DT-RNN to encode depen-dency between words for feature learning. The most important advantage of RNCRF is the ability to learn the underlying dual propagation between aspect and opinion terms from the tree structure itself. Specif-ically as shown in Figure 1(c), where the aspect is
food and the opinion expression islike. In the de-pendency tree,fooddepends onlikewith the relation
DOBJ. During training, RNCRF computes the hid-den vectorhlike forlike, which is also obtained from hfood. As a result, the prediction forlikeis affected by hfood. This is one-way propagation fromfood tolike. During backpropagation, the error forlikeis propagated through a top-down manner to revise the representationhfood. This is the other-way propa-gation fromliketofood. Therefore, the dependency structure together with the learning approach help to enforce the dual propagation of aspect-opinion pairs as long as the dependency relation exists, either di-rectly or indidi-rectly.
5.1 Adding Linguistic/Lexicon Features
RNCRF is an end-to-end model, where feature engi-neering is not necessary. However, it is flexible to in-corporate light hand-crafted features into RNCRF to further boost its performance, such as features from POS tags, name-list, or sentiment lexicon. These features could be appended to the hidden vector of each word, but keep fixed during training, unlike learnable neural inputs and the CRF weights as de-scribed in Section 4.3. As will be shown in
[image:6.612.344.508.58.105.2]exper-Domain Training Test Total Restaurant 3,041 800 3,841 Laptop 3,045 800 3,845 Total 6,086 1,600 7,686
Table 1: SemEval Challenge 2014 task 4 dataset
iments, RNCRF without any hand-crafted features slightly outperforms the best performing systems that involve heavy feature engineering efforts, and RNCRF with light feature engineering can achieve even better performance.
6 Experiment
6.1 Dataset and Experimental Setup
We evaluate our model on the dataset from SemEval Challenge 2014 task 4 (subtask 1), which includes reviews from two domains: restaurant and laptop3.
The detailed description of the dataset is given in Table 1. As the original dataset only includes manu-ally annotate labels for aspect terms but not for opin-ion terms, we manually annotated opinopin-ion terms for each sentence by ourselves to facilitate our experi-ments.
For word vector initialization, we train word em-beddings with word2vec (Mikolov et al., 2013) on the Yelp Challenge dataset4 for the restaurant
do-main and on the Amazon reviews dataset5(McAuley
et al., 2015) for the laptop domain. The Yelp dataset contains 2.2M restaurant reviews with 54K vocab-ulary size. For the Amazon reviews, we only ex-tracted the electronic domain that contains 1M re-views with 590K vocabulary size. We vary differ-ent dimensions for word embeddings and chose 300 for both domains. Empirical sensitivity studies on different dimensions of word embeddings are also conducted. Dependency trees are generated using Stanford Dependency Parser (Klein and Manning, 2003). Regarding CRFs, we implement a linear-chain CRF using CRFSuite (Okazaki, 2007). Be-cause of the relatively small size of training data and a large number of parameters, we perform pre-training on the parameters of DT-RNN with
cross-3Experiments with more publicly available datasets, e.g. restaurant review dataset from SemEval Challenge 2015 task 12 will be conducted in our future work.
entropy error, which is a common strategy for deep learning (Erhan et al., 2009). We implement mini-batch stochastic gradient descent (SGD) with a mini-batch size of 25, and an adaptive learning rate (AdaGrad) initialized at 0.02 for pretraining of DT-RNN, which runs 4 epochs for the restaurant domain and 5 epochs for the laptop domain. For parameter learning of the joint model RNCRF, we implement SGD with a de-caying learning rate initialized at 0.02. We also try with varying context window size, and use 3 for the laptop domain and 5 for the restaurant domain, re-spectively. All parameters are chosen by cross vali-dation.
As discussed in Section 5.1, hand-crafted features can be easily incorporated into RNCRF. We gen-erate three types of simple features based on POS tags, name-list and sentiment lexicon to show fur-ther improvement by incorporating these features. Following (Toh and Wang, 2014), we extract two sets of name list from the training data for each domain, where one includes high-frequency aspect terms, and the other includes high-probability as-pect words. These two sets are used to construct two lexicon features, i.e. we build a 2D binary vector: if a word is in a set, the corresponding value is 1, otherwise 0. For POS tags, we use Stanford POS tagger (Toutanova et al., 2003), and convert them to universal POS tags that have 15 different cate-gories. We then generate 15 one-hot POS tag fea-tures. For sentiment lexicon, we use the collection of commonly used opinion words (around 6,800) (Hu and Liu, 2004a). Similar to name list, we create a bi-nary feature to indicate whether the word belongs to opinion lexicon. We denote by RNCRF+F the pro-posed model with the three types of features.
Compared to the winning systems of SemEval Challenge 2014 task 4 (subtask 1), RNCRF or RN-CRF+F uses additional labels of opinion terms for training. Therefore, to conduct fair comparison ex-periments with the winning systems, we implement RNCRF-O by omitting opinion labels to train our model (i.e., labels become “BA”, “IA”, “O”). Ac-cordingly, we denote by O+F the RNCRF-O model with the three additional types of hand-crafted features.
6.2 Experimental Results
We compare our model with several baselines:
• CRF-1: a linear-chain CRF with standard lin-guistic features including word string, stylis-tics, POS tag, context string, and context POS tags.
• CRF-2: a linear-chain CRF with both
stan-dard linguistic features and dependency infor-mation including head word, dependency rela-tions with parent token and child tokens.
• LSTM: an LSTM network built on top of word
embeddings proposed by (Liu et al., 2015). We keep original settings in (Liu et al., 2015) but replace their word embeddings with ours (300 dimension). We try different hidden layer di-mensions (50, 100, 150, 200) and reported the best result with size 50.
• LSTM+F: the above LSTM model with the
three additional types of hand-crafted features as with RNCRF.
• SemEval-1, SemEval-2: the top two winning
systems for SemEval challenge 2014 task 4 (subtask 1).
• WDEmb+B+CRF6: the model proposed by (Yin et al., 2016) using word and de-pendency path embeddings combined with linear context embedding features, dependency context embedding features and hand-crafted features (i.e., feature engineering) as CRF input.
The comparison results are shown in Table 2 for both the restaurant domain and the laptop domain. Note that we provide the same annotated dataset (both as-pect labels and opinion labels are included for train-ing) for CRF-1, CRF-2 and LSTM for fair compar-ison. It is clear that our proposed model RNCRF achieves superior performance compared with most of the baseline models. The performance is even bet-ter by adding simple hand-crafted features, i.e., RN-CRF+F, with 0.92% and 3.87% absolute improve-ment over the best system in the challenge for aspect extraction for the restaurant domain and the laptop domain, respectively. This shows the advantage of
Restaurant Laptop Models Aspect Opinion Aspect Opinion
SemEval-1 84.01 - 74.55
-SemEval-2 83.98 - 73.78
-WDEmb+B+CRF 84.97 - 75.16
-CRF-1 77.00 78.95 66.21 71.78
CRF-2 78.37 78.65 68.35 70.05
LSTM 81.15 80.22 72.73 74.98
LSTM+F 82.99 82.90 73.23 77.67
RNCRF-O 82.73 - 74.52
-RNCRF-O+F 84.25 - 77.26
-RNCRF 84.05 80.93 76.83 76.76
[image:8.612.310.542.59.165.2]RNCRF+F 84.93 84.11 78.42 79.44
Table 2: Comparison results in terms of F1 scores.
combining high-level continuous features and dis-crete hand-crafted features. Though CRFs usually show promising results in sequence tagging prob-lems, they fail to achieve comparable performance when lacking of extensive features (e.g., CRF-1). By adding dependency information explicitly in CRF-2, the result only improves slightly for aspect ex-traction. Alternatively, by incorporating dependency information into a deep learning model (e.g., RN-CRF), the result shows more than 7% improvement for aspect extraction and 2% for opinion extraction. By removing the labels for opinion terms, RNCRF-O produces inferior results than RNCRF because the effect of dual propagation of aspect and opinion pairs disappears with the absence of opinion labels. This verifies our previous assumption that DT-RNN could learn the interactive effects within aspects and opinions. However, the performance of RNCRF-O is still comparable to the top systems and even better with the addition of simple linguistic fea-tures: 0.24% and 2.71% superior than the best sys-tem in the challenge for the restaurant domain and the laptop domain, respectively. This shows the ro-bustness of our model even without additional opin-ion labels.
LSTM has shown comparable results for aspect extraction (Liu et al., 2015). However, in their work, they used well-pretrained word embeddings by training with large corpus or extensive external resources, e.g., chunking, and NER. To compare their model with RNCRF, we re-implement LSTM with the same word embedding strategy and label-ing resources as ours. The results show that our
Restaurant Laptop Models Aspect Opinion Aspect Opinion DT-RNN+SoftMax 72.45 69.76 66.11 64.66 CRF+word2vec 82.57 78.83 63.62 56.96
RNCRF 84.05 80.93 76.83 76.76
[image:8.612.72.306.60.211.2]RNCRF+POS 84.08 81.48 77.04 77.45 RNCRF+NL 84.24 81.22 78.12 77.20 RNCRF+Lex 84.21 84.14 77.15 78.56 RNCRF+F 84.93 84.11 78.42 79.44
Table 3: Impact of different components.
model outperforms LSTM in aspect extraction by 2.90% and 4.10% for the restaurant domain and the laptop domain, respectively. We conclude that a standard LSTM model fails to extract the rela-tions between aspect and opinion terms. Even with the addition of same linguistic features, LSTM is still inferior than RNCRF itself in terms of as-pect extraction. Moreover, our result is compara-ble with WDEmb+B+CRF in the restaurant domain and better in the laptop domain (+3.26%). Note that WDEmb+B+CRF appended dependency context in-formation into CRF while our model encode such information into high-level representation learning.
To test the impact of each component of RNCRF and the three types of hand-crafted features, we con-duct experiments on different model settings:
• DT-RNN+SoftMax: rather than using a CRF, a softmax classifier is used on top of DT-RNN.
• CRF+word2vec: a linear-chain CRF with word embeddings only without using DT-RNN.
• RNCRF+POS/NL/Lex: the RNCRF model with POS tag or name list or sentiment lexicon feature(s).
25 50 75 100 125 150 175 200 225 250 275 300 325 350 375 400 dimension
0.72 0.74 0.76 0.78 0.80 0.82 0.84 0.86 0.88
f1-sc
ore
aspect opinion
(a) On the restaurant domain.
25 50 75 100 125 150 175 200 225 250 275 300 325 350 375 400 dimension
0.55 0.60 0.65 0.70 0.75 0.80 0.85
f1-sc
ore
aspect opinion
[image:9.612.105.257.63.352.2](b) On the laptop domain.
Figure 3: Sensitivity studies on word embeddings.
DT-RNN for modeling interactions between aspects and opinions. Hence, the combination of DT-RNN and CRF inherits the advantages from both mod-els. Moreover, by separately adding hand-crafted features, we can observe that name-list-based fea-tures and the sentiment lexicon feature are most ef-fective for aspect extraction and opinion extraction, respectively. This may be explained by the fact that name-list-based features usually contain informative evident for aspect terms and sentiment lexicon pro-vides explicit indication about opinions.
Besides the comparison experiments, we also conduct sensitivity test for our proposed model in terms of word vector dimensions. We tested a set of different dimensions ranging from 25 to 400, with 25 increment. The sensitivity plot is shown in Fig-ure 3. The performance for aspect extraction is smooth with different vector lengths for both do-mains. For restaurant domain, the result is stable when dimension is larger than or equal to 100, with the highest at 325. For the laptop domain, the best result is at dimension 300, but with relatively small
variations. For opinion extraction, the performance reaches a good level when the dimension is larger than or equal to 75 for the restaurant domain and 125 for the laptop domain. This proves the stability and robustness of our model.
7 Conclusion
We have presented a joint model, RNCRF, that achieves the state-of-the-art performance for explicit aspect and opinion term extraction on a benchmark dataset. With the help of DT-RNN, high-level fea-tures can be learned by encoding the underlying dual propagation of aspect-opinion pairs. RNCRF com-bines the advantages of DT-RNNs and CRFs, and thus outperforms the traditional rule-based meth-ods in terms of flexibility, because aspect terms and opinion terms are not only restricted to certain ob-served relations and POS tags. Compared to fea-ture engineering methods with CRFs, the proposed model saves much effort in composing features, and it is able to extract higher-level features obtained from non-linear transformations.
Acknowledgements
This research is partially funded by the Economic Development Board and the National Research Foundation of Singapore. Sinno J. Pan thanks the support from Fuji Xerox Corporation through joint research onMultilingual Semantic Analysisand the NTU Singapore Nanyang Assistant Professorship (NAP) grant M4081532.020.
References
Li Dong, Furu Wei, Chuanqi Tan, Duyu Tang, Ming Zhou, and Ke Xu. 2014. Adaptive recursive neural network for target-dependent twitter sentiment
classi-fication. InACL, pages 49–54.
Dumitru Erhan, Pierre-Antoine Manzagol, Yoshua Ben-gio, Samy BenBen-gio, and Pascal Vincent. 2009. The difficulty of training deep architectures and the effect
of unsupervised pre-training. InAISTATS, pages 153–
160.
Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Domain adaptation for large-scale sentiment
classification: A deep learning approach. In ICML,
Christoph Goller and Andreas K¨uchler. 1996. Learning task-dependent distributed representations by
back-propagation through structure. In ICNN, pages 347–
352.
Sepp Hochreiter and J¨urgen Schmidhuber. 1997. Long
short-term memory. Neural Computation, 9(8):1735–
1780.
Minqing Hu and Bing Liu. 2004a. Mining and
summa-rizing customer reviews. InKDD, pages 168–177.
Minqing Hu and Bing Liu. 2004b. Mining opinion
fea-tures in customer reviews. InAAAI, pages 755–760.
Ozan ˙Irsoy and Claire Cardie. 2014. Opinion
min-ing with deep recurrent neural networks. InEMNLP,
pages 720–728.
Mohit Iyyer, Jordan L. Boyd-Graber, Leonardo
Max Batista Claudino, Richard Socher, and
Hal Daum´e III. 2014. A neural network for
factoid question answering over paragraphs. In
EMNLP, pages 633–644.
Niklas Jakob and Iryna Gurevych. 2010. Extracting opinion targets in a single- and cross-domain setting
with conditional random fields. In EMNLP, pages
1035–1045.
Wei Jin and Hung Hay Ho. 2009. A novel lexicalized hmm-based learning framework for web opinion
min-ing. InICML, pages 465–472.
Nal Kalchbrenner, Edward Grefenstette, and Phil Blun-som. 2014. A convolutional neural network for
mod-elling sentences. InACL, pages 655–665.
Yoon Kim. 2014. Convolutional neural networks for
sen-tence classification. InEMNLP, pages 1746–1751.
Dan Klein and Christopher D. Manning. 2003. Accurate
unlexicalized parsing. InACL, pages 423–430.
John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilis-tic models for segmenting and labeling sequence data.
InICML, pages 282–289.
Himabindu Lakkaraju, Richard Socher, and
Christo-pher D. Manning. 2014. Aspect specific
senti-ment analysis using hierarchical deep learning. In NIPS Workshop on Deep Learning and Representation Learning.
Quoc V. Le and Tomas Mikolov. 2014. Distributed
rep-resentations of sentences and documents. In ICML,
pages 1188–1196.
Fangtao Li, Chao Han, Minlie Huang, Xiaoyan Zhu,
Ying-Ju Xia, Shu Zhang, and Hao Yu. 2010.
Structure-aware review mining and summarization. In COLING, pages 653–661.
Kang Liu, Liheng Xu, and Jun Zhao. 2012. Opinion target extraction using word-based translation model. InEMNLP-CoNLL, pages 1346–1356.
Kang Liu, Liheng Xu, Yang Liu, and Jun Zhao. 2013a. Opinion target extraction using partially-supervised
word alignment model. InIJCAI, pages 2134–2140.
Qian Liu, Zhiqiang Gao, Bing Liu, and Yuanlin Zhang. 2013b. A logic programming approach to aspect
ex-traction in opinion mining. InWI, pages 276–283.
Pengfei Liu, Shafiq Joty, and Helen Meng. 2015. Fine-grained opinion mining with recurrent neural networks
and word embeddings. InEMNLP, pages 1433–1443.
Bing Liu. 2011. Web Data Mining: Exploring
Hy-perlinks, Contents, and Usage Data. Second Edition. Data-Centric Systems and Applications. Springer. Yue Lu, Malu Castellanos, Umeshwar Dayal, and
ChengXiang Zhai. 2011. Automatic construction of a context-aware sentiment lexicon: An optimization
ap-proach. InWWW, pages 347–356.
Tengfei Ma and Xiaojun Wan. 2010. Opinion target
extraction in Chinese news comments. InCOLING,
pages 782–790.
Julian McAuley, Jure Leskovec, and Dan Jurafsky. 2012. Learning attitudes and attributes from multi-aspect
re-views. InICDM, pages 1020–1025.
Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hengel. 2015. Image-based
recom-mendations on styles and substitutes. InSIGIR, pages
43–52.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word
represen-tations in vector space.CoRR, abs/1301.3781.
Nir Ofek, Soujanya Poria, Lior Rokach, Erik Cam-bria, Amir Hussain, and Asaf Shabtai. 2016. Un-supervised commonsense knowledge enrichment for
domain-specific sentiment analysis. Cognitive
Com-putation, 8(3):467–477.
Naoaki Okazaki. 2007. CRFsuite: a fast
im-plementation of conditional random fields (CRFs). http://www.chokkan.org/software/crfsuite/.
Bo Pang and Lillian Lee. 2008. Opinion mining and
sentiment analysis. Foundations and Trends in
Infor-mation Retrieval, 2(1-2).
Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Har-ris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. Semeval-2014 task 4: Aspect
based sentiment analysis. InSemEval, pages 27–35.
Ana-Maria Popescu and Oren Etzioni. 2005. Extract-ing product features and opinions from reviews. In EMNLP, pages 339–346.
Guang Qiu, Bing Liu, Jiajun Bu, and Chun Chen. 2011. Opinion word expansion and target extraction
through double propagation. Computational
Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2010. Learning Continuous Phrase Representa-tions and Syntactic Parsing with Recursive Neural
Net-works. InNIPS, pages 1–9.
Richard Socher, Cliff C. Lin, Andrew Y. Ng, and Christo-pher D. Manning. 2011a. Parsing natural scenes and natural language with recursive neural networks. In
ICML, pages 129–136.
Richard Socher, Jeffrey Pennington, Eric H. Huang, An-drew Y. Ng, and Christopher D. Manning. 2011b. Semi-Supervised Recursive Autoencoders for
Predict-ing Sentiment Distributions. InEMNLP, pages 151–
161.
Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. 2012. Semantic Compositionality
Through Recursive Matrix-Vector Spaces. InEMNLP,
pages 1201–1211.
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christo-pher Potts. 2013. Recursive deep models for se-mantic compositionality over a sentiment treebank. In EMNLP, pages 1631–1642.
Richard Socher, Andrej Karpathy, Quoc V. Le,
Christo-pher D. Manning, and Andrew Y. Ng. 2014.
Grounded compositional semantics for finding and
de-scribing images with sentences. TACL, 2:207–218.
Duyu Tang, Bing Qin, Xiaocheng Feng, and Ting Liu. 2015. Target-dependent sentiment classification with
long short term memory.CoRR, abs/1512.01100.
Ivan Titov and Ryan T. McDonald. 2008. A joint model of text and aspect ratings for sentiment summarization.
InACL, pages 308–316.
Zhiqiang Toh and Wenting Wang. 2014. DLIREC: As-pect term extraction and term polarity classification
system. InSemEval, pages 235–240.
Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech
tagging with a cyclic dependency network. InNAACL,
pages 173–180.
Hao Wang and Martin Ester. 2014. A sentiment-aligned topic model for product aspect rating prediction. In EMNLP, pages 1192–1202.
Hongning Wang, Yue Lu, and ChengXiang Zhai. 2011. Latent aspect rating analysis without aspect keyword
supervision. InKDD, pages 618–626.
David Weiss, Chris Alberti, Michael Collins, and Slav Petrov. 2015. Structured training for neural network
transition-based parsing. InACL-IJCNLP, pages 323–
333.
Yuanbin Wu, Qi Zhang, Xuanjing Huang, and Lide Wu. 2009. Phrase dependency parsing for opinion mining. InEMNLP, pages 1533–1541.
Yichun Yin, Furu Wei, Li Dong, Kaimeng Xu, Ming Zhang, and Ming Zhou. 2016. Unsupervised word and dependency path embeddings for aspect term
ex-traction. InIJCAI, pages 2979–2985.
Lei Zhang, Bing Liu, Suk Hwan Lim, and Eamonn O’Brien-Strain. 2010. Extracting and ranking
prod-uct features in opinion documents. InCOLING, pages
1462–1470.
Li Zhuang, Feng Jing, and Xiao-Yan Zhu. 2006. Movie
review mining and summarization. In CIKM, pages