Recursive Neural Conditional Random Fields for Aspect based Sentiment Analysis

(1)

Recursive Neural Conditional Random Fields

for Aspect-based Sentiment Analysis

Wenya Wang†‡ _{Sinno Jialin Pan}† _{Daniel Dahlmeier}‡ _{Xiaokui Xiao}† †_{Nanyang Technological University, Singapore}

‡_{SAP Innovation Center Singapore}

†_{_{wa0001ya, sinnopan, xkxiao}_}_@ntu.edu.sg,‡_{_d.dahlmeier_}_@sap.com

Abstract

In aspect-based sentiment analysis, extract-ing aspect terms along with the opinions be-ing expressed from user-generated content is one of the most important subtasks. Previ-ous studies have shown that exploiting con-nections between aspect and opinion terms is promising for this task. In this paper, we pro-pose a novel joint model that integrates recur-sive neural networks and conditional random fields into a unified framework for explicit as-pect and opinion terms co-extraction. The proposed model learns high-level discrimina-tive features and double propagates informa-tion between aspect and opinion terms, simul-taneously. Moreover, it is flexible to incor-porate hand-crafted features into the proposed model to further boost its information extrac-tion performance. Experimental results on the dataset from SemEval Challenge 2014 task 4 show the superiority of our proposed model over several baseline methods as well as the winning systems of the challenge.

1 Introduction

Aspect-based sentiment analysis (Pang and Lee, 2008; Liu, 2011) aims to extract important infor-mation, e.g., opinion targets, opinion expressions, target categories, and opinion polarities, from user-generated content, such as microblogs, reviews, etc. This task was first studied by Hu and Liu (2004a; 2004b), followed by Popescu and Etzioni (2005), Zhuang et al. (2006), Zhang et al. (2010), Qiu et al. (2011), Li et al. (2010). In aspect-based senti-ment analysis, one of the goals is to extract explicit aspects of an entity from text, along with the opin-ions being expressed. For example, in a restaurant review“I have to say they have one of the fastest

de-livery times in the city.”, the aspect term isdelivery times, and the opinion term isfastest.

Among previous work, one of the approaches is to accumulate aspect and opinion terms from a seed collection without label information, by utiliz-ing syntactic rules or modification relations between them (Qiu et al., 2011; Liu et al., 2013b). In the above example, if we know fastest is an opinion word, thendelivery timesis probably inferred to be an aspect because fastestis its modifier. However, this approach largely relies on hand-coded rules and is restricted to certain Part-of-Speech (POS) tags, e.g., opinion words are restricted to be adjectives. Another approach focuses on feature engineering based on predefined lexicons, syntactic analysis, etc. (Jin and Ho, 2009; Li et al., 2010). A sequence labeling classifier is then built to extract aspect and opinion terms. This approach requires extensive ef-forts for designing hand-crafted features and only combines features linearly for classification which ignores higher order interactions.

To overcome the limitations of existing methods, we propose a novel model, named Recursive Neu-ral Conditional Random Fields (RNCRF). Specif-ically, RNCRF consists of two main components. The first component is a recursive neural network (RNN)1 _{(Socher et al., 2010) based on a}

depen-dency tree of each sentence. The goal is to learn a high-level feature representation for each word in the context of each sentence and make the represen-tation learning for aspect and opinion terms inter-active through the underlying dependency structure among them. The output of the RNN is then fed into a Conditional Random Field (CRF) (Lafferty et al., 2001) to learn a discriminative mapping from

high-1_{Note that in this paper, RNN stands for recursive neural} network instead of recurrent neural network.

(2)

level features to labels, i.e., aspects, opinions, or

others, so that context information can be well cap-tured. Our main contributions are to use RNN for encoding aspect-opinion relations in high-level rep-resentation learning and to present a joint optimiza-tion approach based on maximum likelihood and backpropagation to learn the RNN and CRF com-ponents simultaneously. In this way, the label in-formation of aspect and opinion terms can be dually propagated from parameter learning in CRF to rep-resentation learning in RNN. We conduct expensive experiments on the dataset from SemEval challenge 2014 task 4 (subtask 1) (Pontiki et al., 2014) to ver-ify the superiority of RNCRF over several baseline methods as well as the winning systems of the chal-lenge.

2 Related Work

2.1 Aspects and Opinions Co-Extraction Hu et al. (2004a) proposed to extract product aspects through association mining, and opinion terms by augmenting a seed opinion set using synonyms and antonyms in WordNet. In follow-up work, syntactic relations are further exploited for aspect/opinion ex-traction (Popescu and Etzioni, 2005; Wu et al., 2009; Qiu et al., 2011). For example, Qiu et al. (2011) used syntactic relations to double propagate and augment the sets of aspects and opinions. Although the above models are unsupervised, they heavily depend on predefined rules for extraction and are restricted to specific types of POS tags for product aspects and opinions. Jin et al. (2009), Li et al. (2010), Jakob et al. (2010) and Ma et al. (2010) modeled the ex-traction problem as a sequence tagging problem and proposed to use HMMs or CRFs to solve it. These methods rely on rich hand-crafted features and do not consider interactions between aspect and opin-ion terms explicitly. Another directopin-ion is to use word alignment model to capture opinion relations among a sentence (Liu et al., 2012; Liu et al., 2013a). This method requires sufficient data for modeling desired relations.

Besides explicit aspects and opinions extraction, there are also other lines of research related to aspect-based sentiment analysis, including aspect classification (Lakkaraju et al., 2014; McAuley et al., 2012), aspect rating (Titov and McDonald, 2008;

Wang et al., 2011; Wang and Ester, 2014), domain-specific and target-dependent sentiment classifica-tion (Lu et al., 2011; Ofek et al., 2016; Dong et al., 2014; Tang et al., 2015).

2.2 Deep Learning for Sentiment Analysis Recent studies have shown that deep learning mod-els can automatically learn the inherent semantic and syntactic information from data and thus achieve better performance for sentiment analysis (Socher et al., 2011b; Socher et al., 2012; Socher et al., 2013; Glorot et al., 2011; Kalchbrenner et al., 2014; Kim, 2014; Le and Mikolov, 2014). These methods gen-erally belong to sentence-level or phrase/word-level sentiment polarity predictions. Regarding aspect-based sentiment analysis, Irsoy et al. (2014) ap-plied deep recurrent neural networks for opinion ex-pression extraction. Dong et al. (2014) proposed an adaptive recurrent neural network for target-dependent sentiment classification, where targets or aspects are given as input. Tang et al. (2015) used Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) for the same task. Neverthe-less, there is little work in aspects and opinions co-extraction using deep learning models.

To the best of our knowledge, the work of Liu et al. (2015) and Yin et al. (2016) are the most re-lated to ours. Liu et al. (2015) proposed to com-bine recurrent neural network and word embeddings to extract explicit aspects. However, the proposed model simply uses recurrent neural network on top of word embeddings, and thus its performance heav-ily depends on the quality of word embeddings. In addition, it fails to explicitly model dependency re-lations or compositionalities within certain syntactic structure in a sentence. Recently, Yin et al. (2016) proposed an unsupervised learning method to im-prove word embeddings using dependency path em-beddings. A CRF is then trained with the embed-dings independently in the pipeline.

(3)

embeddings and a CRF separately as in (Yin et al., 2016). Note that Weiss et al. (2015) proposed to combine deep learning and structured learning for language parsing which can be learned by structured perceptron. However, they also separate neural net-work training with structured prediction.

2.3 Recursive Neural Networks

Among deep learning methods, RNN has shown promising results on various NLP tasks, such as learning phrase representations (Socher et al., 2010), sentence-level sentiment analysis (Socher et al., 2013), language parsing (Socher et al., 2011a), and question answering (Iyyer et al., 2014). The tree structures used for RNNs include constituency tree and dependency tree. In a constituency tree, all the words lie at leaf nodes, each internal node repre-sents a phrase or a constituent of a sentence, and the root node represents the entire sentence (Socher et al., 2010; Socher et al., 2012; Socher et al., 2013). In a dependency tree, each node including terminal and nonterminal nodes, represents a word, with de-pendency connections to other nodes (Socher et al., 2014; Iyyer et al., 2014). The resultant model is known as dependency-tree RNN (DT-RNN). An ad-vantage of using dependency tree over the other is the ability to extract word-level representations con-sidering syntactic relations and semantic robustness. Therefore, we adopt DT-RNN in this work.

3 Problem Statement

Suppose that we are given a training set of cus-tomer reviews in a specific domain, denoted byS=

{s1, ..., sN}, whereN is the number of review sen-tences. For anysi∈S, there may exist a set of aspect terms Ai ={ai1, ..., ail}, where eachaij ∈Ai can be a single word or a sequence of words expressing explicitly some aspect of an entity, and a set of opin-ion terms Oi ={oi1, ..., oim}, where each oir can be a single word or a sequence of words expressing the subjective sentiment of the comment holder. The task is to learn a classifier to extract the set of aspect termsAi and the set of opinion termsOi from each review sentencesi∈S.

This task can be formulated as a sequence tag-ging problem by using the BIO encoding scheme. Specifically, each review sentence si is composed

of a sequence of wordssi={wi1, ..., wini}. Each wordwip∈si is labeled as one out of the following 5 classes: BA (beginning of aspect), IA (inside of as-pect), BO (beginning of opinion), IO (inside of opin-ion), and O (others). LetL=_{BA,IA,BO,IO,O_}. We are also given a test set of review sentences de-noted byS0={s0₁, ..., s0_N0}, whereN0is the number

of test reviews. For each test reviews0_i_∈S0, our ob-jective is to predict the class labely0

iq∈L for each wordw0_iq. Note that a sequence of predictions with BA at the beginning followed by IA are indication of one aspect, which is similar for opinion terms.2

4 Recursive Neural CRFs

As described in Section 1, RNCRF consists of two main components: 1) a DT-RNN to learn a high-level representation for each word in a sentence, and 2) a CRF to take the learned representation as input to capture context around each word for explicit as-pect and opinion terms extraction. Next, we present these two components in details.

4.1 Dependency-Tree RNNs

We begin by associating each word w in our vo-cabulary with a feature vector x_∈Rd_{, which} cor-responds to a column of a word embedding matrix We∈Rd×v, wherev is the size of the vocabulary. For each sentence, we build a DT-RNN based on the corresponding dependency parse tree with word em-beddings as initialization. An example of the depen-dency parse tree is shown in Figure 1(a), where each edge starts from the parent and points to its depen-dent with a syntactic relation.

In a DT-RNN, each noden, including leaf nodes, internal nodes and the root node, in a specific sen-tence is associated with a wordw, an input feature vectorxw, and a hidden vectorhn∈Rdof the same dimension asxw. Each dependency relationr is as-sociated with a separate matrixWr∈Rd×d. In addi-tion, a common transformation matrixWv∈Rd×dis introduced to map the word embeddingxw at node nto its corresponding hidden vectorhn.

(4)

[image:4.612.78.541.59.161.2]

(a) Example of a dependency tree. (b) Example of a DT-RNN tree structure. (c) Example of a RNCRF structure.

Figure 1: Examples of dependency tree, DT-RNN structure and RNCRF structure for a review sentence.

Along with a particular dependency tree, a hidden vectorhnis computed from its own word embedding xwat nodenwith the transformation matrixWvand its children’s hidden vectorsh_{child(n) with the} cor-responding relation matrices{Wr}’s. For instance, given the parse tree shown in Figure 1(a), we first compute the leaf nodes associated with the wordsI

andtheusingWvas follows,

h_I = f(Wv·xI+b), h_the = f(Wv·xthe+b),

wheref is a nonlinear activation function andbis a bias term. In this paper, we adopttanh(_·)as the

ac-tivation function. Once the hidden vectors of all the leaf nodes are generated, we can recursively gener-ate hidden vectors for interior nodes using the corre-sponding relation matrixWrand the common trans-formation matrixWvas follows,

h_food = f(Wv·xfood+WDET·hthe+b), h_like = f(Wv·xlike+WDOBJ·hfood

+W_NSUBJ_·h_I+b).

The resulting DT-RNN is shown in Figure 1(b). In general, a hidden vector for any node nassociated with a word vectorxwcan be computed as follows,

hn=f 

Wv·xw+b+ X

k∈Kn

Wrnk·hk  , (1)

whereKndenotes the set of children of noden,rnk denotes the dependency relation between nodenand its child nodek, and hk is the hidden vector of the child nodek. The parameters of DT-RNN,Θ_RNN=

{Wv, Wr, We, b}, are learned during training.

4.2 Integration with CRFs

CRFs are a discriminative graphical model for struc-tured prediction. In RNCRF, we feed the output of DT-RNN, i.e., the hidden representation of each word in a sentence, to a CRF. Updates of parameters for RNCRF are carried out successively from the top to bottom, by propagating errors through CRF to the hidden layers of RNN (including word em-beddings) using backpropagation through structure (BPTS) (Goller and K¨uchler, 1996).

Formally, for each sentencesi, we denote the in-put for CRF byhi, which is generated by DT-RNN. Here hi is a matrix with columns of hidden vec-tors{hi1, ..., hini}to represent a sequence of words

{wi1, ..., wini} in a sentence si. The model com-putes a structured output yi = {yi1, ..., yini} ∈Y, where Y is a set of possible combinations of la-bels in label set L. The entire structure can be represented by an undirected graph G = (V, E)

with cliques c ∈ C. In this paper, we employed linear-chain CRF, which has two different cliques: unary cliques (U) representing input-output connec-tion, and pairwise cliques (P) representing adjacent output connections, as shown in Figure 1(c). Dur-ing inference, the model aims to outputˆy with the maximum conditional probabilityp(y_|h). (We drop the subscriptihere for simplicity.) The probability is computed from potential outputs of the cliques:

p(y_|h) = 1

Z(h)

Y

c∈C

ψc(h,yc), (2)

where Z(h) is the normalization term, and

(5)

[image:5.612.90.283.59.99.2]

Figure 2: An example for computing input-ouput potential for the second positionlike.

also incorporate a context window of size 2T+ 1

when computing unary potentials (T is a hyper-parameter). Thus, the potential of unary clique at nodekcan be written as

ψU(h, yk) = exp (W0)yk·hk+ T X

t=1

(W₋t)yk·hk−t

+

T X

t=1

(W+t)yk ·hk+t !

, (3)

whereW0,W+tandW−tare weight matrices of the CRF for the current position, thet-th position to the right, and thet-th position to the left within context window, respectively. The subscriptykindicates the corresponding row in the weight matrix.

For instance, Figure 2 shows an example of win-dow size 3. At the second position, the input features forlikeare composed of the hidden vectors at posi-tion 1 (h_{I), position 2 (}h_{like) and position 3 (}h_the). Therefore, the conditional distribution for the entire sequenceyin Figure 1(c) can be calculated as

p(y_|h)= 1 Z(h)exp

4 X

k=1

(W0)yk·hk+

4 X

k=2

(W−1)yk·hk−1

+

3 X

k=1

(W+1)yk·hk+1+

3 X

k=1

Vyk,yk+1 !

,

where the first three terms in the exponential of the RHS consider unary clique while the last term con-siders the pairwise clique with matrixV represent-ing pairwise state transition score. For simplicity in description on parameter updates, we denote the log-potential for clique c_{∈ {}U, P_} by gc(h,yc) =

hWc, F(h,yc)i.

4.3 Joint Training for RNCRF

Through the objective of maximum likelihood, up-dates for parameters of RNCRF are first conducted on the parameters of the CRF (unary weight matri-ces ΘU = {W0, W+t, W−t} and pairwise weight matrix V) by applying chain rule to log-potential updates. Below is the gradient forΘU (updates for

V are similar through the log-potential of pairwise cliquegP(y0k, y0k+1)):

4ΘU =

∂₋logp(y_|h)

∂gU(h, y_k0) ·

∂gU(h, yk0) ∂ΘU

, (4)

where

∂−logp(y|h)

∂gU(h, yk0)

=−(1yk=y0k−p(y

0

k|h)), (5)

and y_k0 represents possible label configuration of node k. The hidden representations of each word and the parameters of DT-RNN are updated sub-sequently by applying chain rule with (5) through BPTS as follows,

4h_root = ∂−logp(y|h)

∂gU(h, yroot0 ) ·

∂gU(h, yroot0 ) ∂h_root ,(6)

4h_k₆₌_root = ∂−logp(y|h)

∂gU(h, y0k)

·∂gU(h, yk0) ∂hk

+4h_par(k)·∂hpar(k)

∂hk

, (7)

4Θ_RNN =

K X

k=1

∂₋logp(y_|h)

∂hk ·

∂hk

∂Θ_RNN, (8)

whereh_{root represents the hidden vector of the word} pointed by ROOT in the corresponding DT-RNN. Since this word is the topmost node in the tree, it only inherits error from the CRF output. In (7), h_par₍_k₎denotes the hidden vector of the parent node of node k in DT-RNN. Hence the lower nodes re-ceive error from both the CRF output and error prop-agation from parent node. The parameters within DT-RNN, Θ_{RNN, are updated by applying chain}

rule with respect to updates of hidden vectors, and aggregating among all associated nodes, as shown in (8). The overall procedure of RNCRF is summa-rized in Algorithm 1.

5 Discussion

(6)

Algorithm 1Recursive Neural CRFs

Input: A set of customer review sequences: S =

{s1, ..., sN}, and feature vectors ofddimensions for each

word{xw}’s, window sizeT for CRFs

Output:Parameters:Θ =Θ_RNN,ΘU, V

Initialization: InitializeWeusingword2vec. InitializeWv

and {Wr}’s randomly with uniform distribution between h

−_√√6 2d+1,

√

6

√

2d+1

i

. InitializeW0, {W+t}’s,{W−t}’s,V,

andbwith all 0’s foreach sentencesido

1: Use DT-RNN (1) to generatehi

2: Computep(yi|hi)using (2)

3: Use the backpropagation algorithm to update parame-tersΘthrough (4)-(8)

end for

the difficulty of incorporating dependency structure explicitly as input features, which motivates the de-sign of our model to use DT-RNN to encode depen-dency between words for feature learning. The most important advantage of RNCRF is the ability to learn the underlying dual propagation between aspect and opinion terms from the tree structure itself. Specif-ically as shown in Figure 1(c), where the aspect is

food and the opinion expression islike. In the de-pendency tree,fooddepends onlikewith the relation

DOBJ. During training, RNCRF computes the hid-den vectorh_{like for}like, which is also obtained from h_{food. As a result, the prediction for}likeis affected by h_{food. This is one-way propagation from}food tolike. During backpropagation, the error forlikeis propagated through a top-down manner to revise the representationh_{food. This is the other-way} propa-gation fromliketofood. Therefore, the dependency structure together with the learning approach help to enforce the dual propagation of aspect-opinion pairs as long as the dependency relation exists, either di-rectly or indidi-rectly.

5.1 Adding Linguistic/Lexicon Features

RNCRF is an end-to-end model, where feature engi-neering is not necessary. However, it is flexible to in-corporate light hand-crafted features into RNCRF to further boost its performance, such as features from POS tags, name-list, or sentiment lexicon. These features could be appended to the hidden vector of each word, but keep fixed during training, unlike learnable neural inputs and the CRF weights as de-scribed in Section 4.3. As will be shown in

[image:6.612.344.508.58.105.2]

exper-Domain Training Test Total Restaurant 3,041 800 3,841 Laptop 3,045 800 3,845 Total 6,086 1,600 7,686

Table 1: SemEval Challenge 2014 task 4 dataset

iments, RNCRF without any hand-crafted features slightly outperforms the best performing systems that involve heavy feature engineering efforts, and RNCRF with light feature engineering can achieve even better performance.

6 Experiment

6.1 Dataset and Experimental Setup

We evaluate our model on the dataset from SemEval Challenge 2014 task 4 (subtask 1), which includes reviews from two domains: restaurant and laptop3_.

The detailed description of the dataset is given in Table 1. As the original dataset only includes manu-ally annotate labels for aspect terms but not for opin-ion terms, we manually annotated opinopin-ion terms for each sentence by ourselves to facilitate our experi-ments.

For word vector initialization, we train word em-beddings with word2vec (Mikolov et al., 2013) on the Yelp Challenge dataset4 _{for the restaurant}

do-main and on the Amazon reviews dataset5_(McAuley

et al., 2015) for the laptop domain. The Yelp dataset contains 2.2M restaurant reviews with 54K vocab-ulary size. For the Amazon reviews, we only ex-tracted the electronic domain that contains 1M re-views with 590K vocabulary size. We vary differ-ent dimensions for word embeddings and chose 300 for both domains. Empirical sensitivity studies on different dimensions of word embeddings are also conducted. Dependency trees are generated using Stanford Dependency Parser (Klein and Manning, 2003). Regarding CRFs, we implement a linear-chain CRF using CRFSuite (Okazaki, 2007). Be-cause of the relatively small size of training data and a large number of parameters, we perform pre-training on the parameters of DT-RNN with

cross-3_{Experiments with more publicly available datasets, e.g.} restaurant review dataset from SemEval Challenge 2015 task 12 will be conducted in our future work.

(7)

entropy error, which is a common strategy for deep learning (Erhan et al., 2009). We implement mini-batch stochastic gradient descent (SGD) with a mini-batch size of 25, and an adaptive learning rate (AdaGrad) initialized at 0.02 for pretraining of DT-RNN, which runs 4 epochs for the restaurant domain and 5 epochs for the laptop domain. For parameter learning of the joint model RNCRF, we implement SGD with a de-caying learning rate initialized at 0.02. We also try with varying context window size, and use 3 for the laptop domain and 5 for the restaurant domain, re-spectively. All parameters are chosen by cross vali-dation.

As discussed in Section 5.1, hand-crafted features can be easily incorporated into RNCRF. We gen-erate three types of simple features based on POS tags, name-list and sentiment lexicon to show fur-ther improvement by incorporating these features. Following (Toh and Wang, 2014), we extract two sets of name list from the training data for each domain, where one includes high-frequency aspect terms, and the other includes high-probability as-pect words. These two sets are used to construct two lexicon features, i.e. we build a 2D binary vector: if a word is in a set, the corresponding value is 1, otherwise 0. For POS tags, we use Stanford POS tagger (Toutanova et al., 2003), and convert them to universal POS tags that have 15 different cate-gories. We then generate 15 one-hot POS tag fea-tures. For sentiment lexicon, we use the collection of commonly used opinion words (around 6,800) (Hu and Liu, 2004a). Similar to name list, we create a bi-nary feature to indicate whether the word belongs to opinion lexicon. We denote by RNCRF+F the pro-posed model with the three types of features.

Compared to the winning systems of SemEval Challenge 2014 task 4 (subtask 1), RNCRF or RN-CRF+F uses additional labels of opinion terms for training. Therefore, to conduct fair comparison ex-periments with the winning systems, we implement RNCRF-O by omitting opinion labels to train our model (i.e., labels become “BA”, “IA”, “O”). Ac-cordingly, we denote by O+F the RNCRF-O model with the three additional types of hand-crafted features.

6.2 Experimental Results

We compare our model with several baselines:

• CRF-1: a linear-chain CRF with standard lin-guistic features including word string, stylis-tics, POS tag, context string, and context POS tags.

• CRF-2: a linear-chain CRF with both

stan-dard linguistic features and dependency infor-mation including head word, dependency rela-tions with parent token and child tokens.

• LSTM: an LSTM network built on top of word

embeddings proposed by (Liu et al., 2015). We keep original settings in (Liu et al., 2015) but replace their word embeddings with ours (300 dimension). We try different hidden layer di-mensions (50, 100, 150, 200) and reported the best result with size 50.

• LSTM+F: the above LSTM model with the

three additional types of hand-crafted features as with RNCRF.

• SemEval-1, SemEval-2: the top two winning

systems for SemEval challenge 2014 task 4 (subtask 1).

• WDEmb+B+CRF6: the model proposed by (Yin et al., 2016) using word and de-pendency path embeddings combined with linear context embedding features, dependency context embedding features and hand-crafted features (i.e., feature engineering) as CRF input.

The comparison results are shown in Table 2 for both the restaurant domain and the laptop domain. Note that we provide the same annotated dataset (both as-pect labels and opinion labels are included for train-ing) for CRF-1, CRF-2 and LSTM for fair compar-ison. It is clear that our proposed model RNCRF achieves superior performance compared with most of the baseline models. The performance is even bet-ter by adding simple hand-crafted features, i.e., RN-CRF+F, with 0.92% and 3.87% absolute improve-ment over the best system in the challenge for aspect extraction for the restaurant domain and the laptop domain, respectively. This shows the advantage of

(8)

Restaurant Laptop Models Aspect Opinion Aspect Opinion

SemEval-1 84.01 - 74.55

-SemEval-2 83.98 - 73.78

-WDEmb+B+CRF 84.97 - 75.16

-CRF-1 77.00 78.95 66.21 71.78

CRF-2 78.37 78.65 68.35 70.05

LSTM 81.15 80.22 72.73 74.98

LSTM+F 82.99 82.90 73.23 77.67

RNCRF-O 82.73 - 74.52

-RNCRF-O+F 84.25 - 77.26

-RNCRF 84.05 80.93 76.83 76.76

[image:8.612.310.542.59.165.2]

RNCRF+F 84.93 84.11 78.42 79.44

Table 2: Comparison results in terms of F1 scores.

combining high-level continuous features and dis-crete hand-crafted features. Though CRFs usually show promising results in sequence tagging prob-lems, they fail to achieve comparable performance when lacking of extensive features (e.g., CRF-1). By adding dependency information explicitly in CRF-2, the result only improves slightly for aspect ex-traction. Alternatively, by incorporating dependency information into a deep learning model (e.g., RN-CRF), the result shows more than 7% improvement for aspect extraction and 2% for opinion extraction. By removing the labels for opinion terms, RNCRF-O produces inferior results than RNCRF because the effect of dual propagation of aspect and opinion pairs disappears with the absence of opinion labels. This verifies our previous assumption that DT-RNN could learn the interactive effects within aspects and opinions. However, the performance of RNCRF-O is still comparable to the top systems and even better with the addition of simple linguistic fea-tures: 0.24% and 2.71% superior than the best sys-tem in the challenge for the restaurant domain and the laptop domain, respectively. This shows the ro-bustness of our model even without additional opin-ion labels.

LSTM has shown comparable results for aspect extraction (Liu et al., 2015). However, in their work, they used well-pretrained word embeddings by training with large corpus or extensive external resources, e.g., chunking, and NER. To compare their model with RNCRF, we re-implement LSTM with the same word embedding strategy and label-ing resources as ours. The results show that our

Restaurant Laptop Models Aspect Opinion Aspect Opinion DT-RNN+SoftMax 72.45 69.76 66.11 64.66 CRF+word2vec 82.57 78.83 63.62 56.96

RNCRF 84.05 80.93 76.83 76.76

[image:8.612.72.306.60.211.2]

RNCRF+POS 84.08 81.48 77.04 77.45 RNCRF+NL 84.24 81.22 78.12 77.20 RNCRF+Lex 84.21 84.14 77.15 78.56 RNCRF+F 84.93 84.11 78.42 79.44

Table 3: Impact of different components.

model outperforms LSTM in aspect extraction by 2.90% and 4.10% for the restaurant domain and the laptop domain, respectively. We conclude that a standard LSTM model fails to extract the rela-tions between aspect and opinion terms. Even with the addition of same linguistic features, LSTM is still inferior than RNCRF itself in terms of as-pect extraction. Moreover, our result is compara-ble with WDEmb+B+CRF in the restaurant domain and better in the laptop domain (+3.26%). Note that WDEmb+B+CRF appended dependency context in-formation into CRF while our model encode such information into high-level representation learning.

To test the impact of each component of RNCRF and the three types of hand-crafted features, we con-duct experiments on different model settings:

• DT-RNN+SoftMax: rather than using a CRF, a softmax classifier is used on top of DT-RNN.

• CRF+word2vec: a linear-chain CRF with word embeddings only without using DT-RNN.

• RNCRF+POS/NL/Lex: the RNCRF model with POS tag or name list or sentiment lexicon feature(s).

(9)

25 50 75 100 125 150 175 200 225 250 275 300 325 350 375 400 dimension

0.72 0.74 0.76 0.78 0.80 0.82 0.84 0.86 0.88

f1-sc

ore

aspect opinion

(a) On the restaurant domain.

25 50 75 100 125 150 175 200 225 250 275 300 325 350 375 400 dimension

0.55 0.60 0.65 0.70 0.75 0.80 0.85

f1-sc

ore

aspect opinion

[image:9.612.105.257.63.352.2]

(b) On the laptop domain.

Figure 3: Sensitivity studies on word embeddings.

DT-RNN for modeling interactions between aspects and opinions. Hence, the combination of DT-RNN and CRF inherits the advantages from both mod-els. Moreover, by separately adding hand-crafted features, we can observe that name-list-based fea-tures and the sentiment lexicon feature are most ef-fective for aspect extraction and opinion extraction, respectively. This may be explained by the fact that name-list-based features usually contain informative evident for aspect terms and sentiment lexicon pro-vides explicit indication about opinions.

Besides the comparison experiments, we also conduct sensitivity test for our proposed model in terms of word vector dimensions. We tested a set of different dimensions ranging from 25 to 400, with 25 increment. The sensitivity plot is shown in Fig-ure 3. The performance for aspect extraction is smooth with different vector lengths for both do-mains. For restaurant domain, the result is stable when dimension is larger than or equal to 100, with the highest at 325. For the laptop domain, the best result is at dimension 300, but with relatively small

variations. For opinion extraction, the performance reaches a good level when the dimension is larger than or equal to 75 for the restaurant domain and 125 for the laptop domain. This proves the stability and robustness of our model.

7 Conclusion

We have presented a joint model, RNCRF, that achieves the state-of-the-art performance for explicit aspect and opinion term extraction on a benchmark dataset. With the help of DT-RNN, high-level fea-tures can be learned by encoding the underlying dual propagation of aspect-opinion pairs. RNCRF com-bines the advantages of DT-RNNs and CRFs, and thus outperforms the traditional rule-based meth-ods in terms of flexibility, because aspect terms and opinion terms are not only restricted to certain ob-served relations and POS tags. Compared to fea-ture engineering methods with CRFs, the proposed model saves much effort in composing features, and it is able to extract higher-level features obtained from non-linear transformations.

Acknowledgements

This research is partially funded by the Economic Development Board and the National Research Foundation of Singapore. Sinno J. Pan thanks the support from Fuji Xerox Corporation through joint research onMultilingual Semantic Analysisand the NTU Singapore Nanyang Assistant Professorship (NAP) grant M4081532.020.

References

Li Dong, Furu Wei, Chuanqi Tan, Duyu Tang, Ming Zhou, and Ke Xu. 2014. Adaptive recursive neural network for target-dependent twitter sentiment

classi-fication. InACL, pages 49–54.

Dumitru Erhan, Pierre-Antoine Manzagol, Yoshua Ben-gio, Samy BenBen-gio, and Pascal Vincent. 2009. The difficulty of training deep architectures and the effect

of unsupervised pre-training. InAISTATS, pages 153–

160.

Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Domain adaptation for large-scale sentiment

classification: A deep learning approach. In ICML,

(10)

Christoph Goller and Andreas K¨uchler. 1996. Learning task-dependent distributed representations by

back-propagation through structure. In ICNN, pages 347–

352.

Sepp Hochreiter and J¨urgen Schmidhuber. 1997. Long

short-term memory. Neural Computation, 9(8):1735–

1780.

Minqing Hu and Bing Liu. 2004a. Mining and

summa-rizing customer reviews. InKDD, pages 168–177.

Minqing Hu and Bing Liu. 2004b. Mining opinion

fea-tures in customer reviews. InAAAI, pages 755–760.

Ozan ˙Irsoy and Claire Cardie. 2014. Opinion

min-ing with deep recurrent neural networks. InEMNLP,

pages 720–728.

Mohit Iyyer, Jordan L. Boyd-Graber, Leonardo

Max Batista Claudino, Richard Socher, and

Hal Daum´e III. 2014. A neural network for

factoid question answering over paragraphs. In

EMNLP, pages 633–644.

Niklas Jakob and Iryna Gurevych. 2010. Extracting opinion targets in a single- and cross-domain setting

with conditional random fields. In EMNLP, pages

1035–1045.

Wei Jin and Hung Hay Ho. 2009. A novel lexicalized hmm-based learning framework for web opinion

min-ing. InICML, pages 465–472.

Nal Kalchbrenner, Edward Grefenstette, and Phil Blun-som. 2014. A convolutional neural network for

mod-elling sentences. InACL, pages 655–665.

Yoon Kim. 2014. Convolutional neural networks for

sen-tence classification. InEMNLP, pages 1746–1751.

Dan Klein and Christopher D. Manning. 2003. Accurate

unlexicalized parsing. InACL, pages 423–430.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilis-tic models for segmenting and labeling sequence data.

InICML, pages 282–289.

Himabindu Lakkaraju, Richard Socher, and

Christo-pher D. Manning. 2014. Aspect specific

senti-ment analysis using hierarchical deep learning. In NIPS Workshop on Deep Learning and Representation Learning.

Quoc V. Le and Tomas Mikolov. 2014. Distributed

rep-resentations of sentences and documents. In ICML,

pages 1188–1196.

Fangtao Li, Chao Han, Minlie Huang, Xiaoyan Zhu,

Ying-Ju Xia, Shu Zhang, and Hao Yu. 2010.

Structure-aware review mining and summarization. In COLING, pages 653–661.

Kang Liu, Liheng Xu, and Jun Zhao. 2012. Opinion target extraction using word-based translation model. InEMNLP-CoNLL, pages 1346–1356.

Kang Liu, Liheng Xu, Yang Liu, and Jun Zhao. 2013a. Opinion target extraction using partially-supervised

word alignment model. InIJCAI, pages 2134–2140.

Qian Liu, Zhiqiang Gao, Bing Liu, and Yuanlin Zhang. 2013b. A logic programming approach to aspect

ex-traction in opinion mining. InWI, pages 276–283.

Pengfei Liu, Shafiq Joty, and Helen Meng. 2015. Fine-grained opinion mining with recurrent neural networks

and word embeddings. InEMNLP, pages 1433–1443.

Bing Liu. 2011. Web Data Mining: Exploring

Hy-perlinks, Contents, and Usage Data. Second Edition. Data-Centric Systems and Applications. Springer. Yue Lu, Malu Castellanos, Umeshwar Dayal, and

ChengXiang Zhai. 2011. Automatic construction of a context-aware sentiment lexicon: An optimization

ap-proach. InWWW, pages 347–356.

Tengfei Ma and Xiaojun Wan. 2010. Opinion target

extraction in Chinese news comments. InCOLING,

pages 782–790.

Julian McAuley, Jure Leskovec, and Dan Jurafsky. 2012. Learning attitudes and attributes from multi-aspect

re-views. InICDM, pages 1020–1025.

Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hengel. 2015. Image-based

recom-mendations on styles and substitutes. InSIGIR, pages

43–52.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word

represen-tations in vector space.CoRR, abs/1301.3781.

Nir Ofek, Soujanya Poria, Lior Rokach, Erik Cam-bria, Amir Hussain, and Asaf Shabtai. 2016. Un-supervised commonsense knowledge enrichment for

domain-specific sentiment analysis. Cognitive

Com-putation, 8(3):467–477.

Naoaki Okazaki. 2007. CRFsuite: a fast

im-plementation of conditional random fields (CRFs). http://www.chokkan.org/software/crfsuite/.

Bo Pang and Lillian Lee. 2008. Opinion mining and

sentiment analysis. Foundations and Trends in

Infor-mation Retrieval, 2(1-2).

Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Har-ris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. Semeval-2014 task 4: Aspect

based sentiment analysis. InSemEval, pages 27–35.

Ana-Maria Popescu and Oren Etzioni. 2005. Extract-ing product features and opinions from reviews. In EMNLP, pages 339–346.

Guang Qiu, Bing Liu, Jiajun Bu, and Chun Chen. 2011. Opinion word expansion and target extraction

through double propagation. Computational

(11)

Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2010. Learning Continuous Phrase Representa-tions and Syntactic Parsing with Recursive Neural

Net-works. InNIPS, pages 1–9.

Richard Socher, Cliff C. Lin, Andrew Y. Ng, and Christo-pher D. Manning. 2011a. Parsing natural scenes and natural language with recursive neural networks. In

ICML, pages 129–136.

Richard Socher, Jeffrey Pennington, Eric H. Huang, An-drew Y. Ng, and Christopher D. Manning. 2011b. Semi-Supervised Recursive Autoencoders for

Predict-ing Sentiment Distributions. InEMNLP, pages 151–

161.

Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. 2012. Semantic Compositionality

Through Recursive Matrix-Vector Spaces. InEMNLP,

pages 1201–1211.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christo-pher Potts. 2013. Recursive deep models for se-mantic compositionality over a sentiment treebank. In EMNLP, pages 1631–1642.

Richard Socher, Andrej Karpathy, Quoc V. Le,

Christo-pher D. Manning, and Andrew Y. Ng. 2014.

Grounded compositional semantics for finding and

de-scribing images with sentences. TACL, 2:207–218.

Duyu Tang, Bing Qin, Xiaocheng Feng, and Ting Liu. 2015. Target-dependent sentiment classification with

long short term memory.CoRR, abs/1512.01100.

Ivan Titov and Ryan T. McDonald. 2008. A joint model of text and aspect ratings for sentiment summarization.

InACL, pages 308–316.

Zhiqiang Toh and Wenting Wang. 2014. DLIREC: As-pect term extraction and term polarity classification

system. InSemEval, pages 235–240.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech

tagging with a cyclic dependency network. InNAACL,

pages 173–180.

Hao Wang and Martin Ester. 2014. A sentiment-aligned topic model for product aspect rating prediction. In EMNLP, pages 1192–1202.

Hongning Wang, Yue Lu, and ChengXiang Zhai. 2011. Latent aspect rating analysis without aspect keyword

supervision. InKDD, pages 618–626.

David Weiss, Chris Alberti, Michael Collins, and Slav Petrov. 2015. Structured training for neural network

transition-based parsing. InACL-IJCNLP, pages 323–

333.

Yuanbin Wu, Qi Zhang, Xuanjing Huang, and Lide Wu. 2009. Phrase dependency parsing for opinion mining. InEMNLP, pages 1533–1541.

Yichun Yin, Furu Wei, Li Dong, Kaimeng Xu, Ming Zhang, and Ming Zhou. 2016. Unsupervised word and dependency path embeddings for aspect term

ex-traction. InIJCAI, pages 2979–2985.

Lei Zhang, Bing Liu, Suk Hwan Lim, and Eamonn O’Brien-Strain. 2010. Extracting and ranking

prod-uct features in opinion documents. InCOLING, pages

1462–1470.

Li Zhuang, Feng Jing, and Xiao-Yan Zhu. 2006. Movie

review mining and summarization. In CIKM, pages