yananli.pdf

(1)

Yanan Li. Multi-aspect Sentiment Analysis for Car Reviews by Modeling Aspects and Overall Ratings. A Master’s Paper for the M.S. in I.S degree. December, 2019. 60 pages. Advisor: Stephanie W. Haas

In this paper, I compare the impact of sentiment score of review text, overall rating and chunk parsing on aspect rating prediction. The best outcome is achieved when using chunk parsing to divide sentences describing more than one aspect into specific chunks, then take sentiment score of those chunks and sentences depicting one aspect as input, with addition of the overall rating.

Headings:

Natural language processing Rating prediction

(2)

MULTI-ASPECT SENTIMENT ANALYSIS FOR CAR REVIEWS BY MODELING ASPECTS AND OVERALL RATINGS

by Yanan Li

A Master’s paper submitted to the faculty of the School of Information and Library Science of the University of North Carolina at Chapel Hill

in partial fulfillment of the requirements for the degree of Master of Science in

Information Science.

Chapel Hill, North Carolina December 2019

Approved by

(3)

1 INTRODUCTION

Customers’ way of making purchase decisions has been continually influenced, as the popularity of online reviews has grown. A study1_{about “how online reviews influence} sales” conducted in 2017 by Spiegel Research Center in Northwestern University presents that nearly 95% people read online reviews before making a purchase. In addition to the apparent fact that reviews offer valuable information for potential

customers, they also are effective channels for businesses to understand customers’ needs better. According to the study, customers are more willing to buy a product when reviews are available, and reviews have a greater impact on purchase probability for higher-priced items than on cheaper ones. Cars are typical high-priced items and cars reviews from TrueCar2_{are used in this paper for analyzing.}

TrueCar is a digital automotive marketplace, which operates its own site and powers car-buying programs for over 500 companies. TrueCar has a network of over 14,000

Certified Dealers and currently, over one third of all new car buyers engage with the TrueCar network during their purchasing process. A TrueCar user can share his or her

experience with a car by posting a review of the model, an overall rating and aspects ratings (car aspects, meaning the features of a car). Common car aspects include:

(5)

technology, reliability, efficiency, etc. Consider a typical car review, as shown in Figure 1. This review shows multiple aspects of the car, such as Audio, Interior and Technology. The reviewer gives an overall rating as well as an explicit rating on each aspect. Thus, a user would be able to easily know the reviewer’s opinion on each aspect.

Figure 1: A Sample Car Review

(Qu et al., 2010) proved that ratings are more informative than polarity expressed through review texts based on experiments on three target domains of Amazon reviews: books, movies (DVDs), and music (CDs), using a novel kind of “bag-of-opinions”

(6)

document, since different reviewers may give the same overall rating for very different reasons. For example, in a car review, one reviewer may like the audio feature of a vehicle, but another individual may have enjoyed the technology. Thus, the same overall rating could represent satisfaction (and dissatisfaction) on very different aspects. To help users tell the difference, it is necessary to understand a reviewer’s rating on each aspect. Not only could the analysis of these aspect ratings benefit mining interested aspects for users, but it could also help manufacturers improve specific features of their services and products. Therefore, such analysis can be very beneficial in predicting ratings, for

different aspects in a review. However, in some instances, there is a mismatch between the review and the rating (Shrestha & Nasoz, 2019). Therefore, analysis of review text is valuable to understand customers’ opinions.

(7)

2 LITERATURE REVIEW

2.1 Multi-Aspect Sentiment Analysis

(8)

Research Features Model Dataset (Size) Lu et al.,

(2011) text representation Support vector regression model Three datasets: two restaurant review datasets from OpenTable (73,495) and CitySearch.com (652)

respectively, one hotel review dataset from TripAdvisor (66,512)

McAuley et al., (2012)

text

representation Preference and Attribute Learning from Labeled

Groundtruth and Explicit Ratings (PALE LAGER)

Five datasets: two beer review datasets from BeerAdvocate (1,586,259) and Ratebeer (2,924,127) respectively, pubs review dataset from

BeerAdvocate (18,350) Toys & Games review from Amazon (373,974)

Audio Books review from Audible (10,989) Titov and McDonald (2008) text representation Multi-Aspect Sentiment model (MAS)

One dataset: hotel review dataset from TripAdvisor (10,000) Wang et al., (2010) text representation Latent Rating Regression (LRR) model

One dataset: hotels review dataset from TripAdvisor (108,891)

Wang et

al., (2011) text representation Latent Aspect Rating Analysis Model (LARAM)

Two datasets:

hotels review dataset from TripAdvisor (37,181) MP3 reviews from amazon (16,680) Diao et al., (2014) text representation Jointly modeling aspects, ratings and sentiments

(JMARS)

One dataset: movies review dataset from IMDB (348,415)

Pappas and Popescu-Belis, (2014)

text

representation APWeights: a multiple-instance regression model

Seven datasets: three beer review datasets from

BeerAdvocate and RateBeer one audiobooks review dataset

one Toys and Games review dataset

(9)

Research Features Model Dataset (Size) one Ted talks comments dataset Toh and Su, (2016) text representation NLANGP: Supervised Machine Learning System for Aspect Category

Classification and Opinion Target Extraction

Two datasets:

product reviews from

Amazon (Blitzer et al., 2007): labeled data: 2,000

unlabeled data: 3685 (DVDs) and 5945 (kitchen)

user reviews found in the Yelp

Table 1: Feature, model and datasets for Multi-Aspect Sentiment Analysis

Research Evaluation Metric Baseline Methods Lu et al.,

(2011) Accuracy Precision Recall F1 score LDA

Local LDA (Brody et al., 2010) MultiGrain Lda (Titov et al.2008) Segmented Topic Models (STM) (Du et al.2010)

McAuley et al., (2012)

Accuracy

Mean Squared Error (MSE) LDA SVM Titov and McDonald (2008) Precision Recall

Logistic Regression models

Wang et

al., (2010) Mean Squared Error (MSE) Correlation

Mean Average Precision (MAP)

(10)

Research Evaluation Metric Baseline Methods Wang et

al., (2011) Mean Squared Error (MSE) Pearson correlation Mean Average Precision (MAP) Percentage of mis-ordered aspects inside reviews

nDCG of aspect ranking inside reviews

LDA+LRR (Wang et al., 2010) sLDA+LRR (Wang et al., 2010) Bootstrap+LRR (Wang et al., 2010)

Diao et al., (2014)

Mean Squared Error

(MSE) Predict the rating as the average of past ratings Probabilistic matrix factorization (PMF) (Mnih et al., 2007)

Hidden factors with topics (HFT) (McAuley et al., 2013)

Pappas and Popescu-Belis, (2014)

Mean Absolute Error (MAE)

Mean Squared Error (MSE)

Support Vector Regression (SVR) Lasso Regression

Toh and

Su, (2016) F1 measure and F1 loss a Support Vector Machine (SVM) with a linear kernel (Pontiki et al., 2016) Table 2: Evaluation metric and baseline methods for Multi-Aspect Sentiment Analysis

(Lu et al., 2011) assumed that aspects are fixed, and that it is sufficient to identify a single aspect for each sentence in a document. To identify aspects, they evaluated weakly

supervised topic models, then used minimal prior knowledge in the form of seed words to encourage a correspondence between topics and ratable aspects, and labeled each

(11)

PALE LAGER (McAuley et al., 2012) is a model that combine aspects, and ratings on aspects, as a function of the words that appear in each sentence of a review. Since each aspect uses different sentiment words, it is beneficial to learn sentiment models per aspect (Fahrni & Klenner, 2008). However, it is also necessary to explicitly model relationships between aspects. For example, based on the dataset from BeerAdvocate, the word

“skunky” was among the strongest 1-star predictors for all aspects, even though the word clearly refers only to smell. Unsurprisingly, a product that smells “skunky” is unlikely to be rated favorably in terms of its taste. Therefore, a dependency term representing correlation of aspects was added in final multi-class SVM objective (McAuley et al., 2012) to acknowledge the correlation between aspects. And this model is able to “disentangle” content and sentiment words: content words that are indicative of a particular aspect as well as the aspect-specific sentiment words that are indicative of a particular rating are automatically learned. For example, when a beer is described as having “stale hops, but a velvety body”, the model is supposed to learn that “body” refers to palate, and “stale” and “velvety” refer to negative and positive sentiments about those aspects.

(12)

(Titov and McDonald 2008) used topic-models to identify “topics” whose words are highly correlated with the aspects on which users vote. They applied their model to assign aspect labels to sentences in a review. Their MAS model consists of two parts. The first part is based on Multi-Grain Latent Dirichlet Allocation ((MG-LDA) (Titov and McDonald, 2008), which has been previously shown to build topics that are

representative of ratable aspects. The second part is a set of sentiment predictors per aspect that are designed to force specific topics in the model to be directly correlated with a particular aspect.

(Wang et al., 2010) introduced a two-stage approach to analyze opinions expressed about an entity in an online review at the level of topical aspects. In the first stage, they

employed a bootstrapping-based algorithm to identify the major aspects (guided by a few seed words describing the aspects) and segment reviews into aspects. In the second stage, a generative Latent Rating Regression (LRR) model was proposed which aims at

inferring aspect ratings and weights for each individual review based only on the review text and the associated overall rating. Wang et al., (2011) presented Latent Aspect Rating Analysis Model (LARAM), which extends LRR to perform both aspect segmentation and aspect rating prediction in a unified framework.

(13)

special effects, but only if the user is interested in them and if the movie has special effects worth discussing. The model is able to capture the sentiment in each aspect of a review, and predict partial scores under different aspects. In addition, the user interests and movie topics can also be inferred from the model. The model also provides a link between interest and relevance on a per-aspect basis and it allows to differentiate between positive and negative sentiments on a per-aspect basis.

(Pappas and Popescu-Belis 2014) treated each review text as a bag of sentences

(instances) and each sentence is modeled as a word vector. For learning from texts with known aspect ratings, the model performs multiple instance regression and assigns importance weights to each of the sentences or paragraphs of a text, uncovering their contribution to the aspect ratings.

(14)

In summary, methods in this section of multi-aspect sentiment analysis share the same assumptions about words and sentences, for example, using the word syntax to determine whether a word serves for aspect or sentiment, or relating a specific aspect to a sentence, such as identifying a single aspect for each sentence in a document.

2.2 Aspect-Based Sentiment Classification

Another related research area is aspect-based sentiment classification. Earlier studies treated aspect extraction and sentiment analysis as separate phases. They first extracted aspect expressions from sentences (Pontiki et al., 2014), and then determined sentiments. For instance, in the sentence “This camera is sleek and very affordable”, “sleek” and “affordable” are aspect expressions since the word “sleek” refers to the appearance of the camera and the word “affordable” to its price. Feature, model and datasets used by

researches of aspect-based sentiment classification are illustrated in Table 3. Evaluation metric and baseline methods are displayed in Table 4.

Research Features Model Dataset (Size)

Dong et al., (2014) text representation Adaptive Recursive Neural Network (AdaRNN)

One dataset: tweets (6,940)

Wang et al., (2016) text representation Attention-based Long Short-Term

Memory Network (AT-LSTM)

Two datasets: laptop (3,845) and restaurant datasets (3,841)

Pontiki et al., (2014)

text

representation

Models are submitted by 32 teams

Two datasets: laptop (3,845) and restaurant datasets (3,841)

Lakkaraju et al., (2014)

text

representation Recurrent Neural Networks and its variants

Two datasets: beer reviews (8532) and camera reviews (5008)

(15)

Research Evaluation Metric Baseline Methods Dong et al.,

(2014) Accuracy Macro-average F1-score

SVM-independent features SVM-dependent features

SVM-converted dependency tree RNN

Wang et

al., (2016) Accuracy LSTM Target-Dependent LSTM (TD-LSTM) and Target-Connection LSTM (TC-LSTM) (Tang et al., 2015a)

Pontiki et al., (2014)

F1 measure Majority

Lakkaraju et al., (2014)

Accuracy FACTS (FACeT and Sentiment extraction model) and CFACTS (Coherence based FACeT and Sentiment extraction model)

Multi-class Support Vector Machines Naive Bayes classifiers with tf-idf vectors of words as features

Table 4: Evaluation metric and baseline methods for Aspect-Based Sentiment Classification

(Lakkaraju et al., 2014) employed recurrent neural networks and its variants for the task of extraction of aspect sentiment pair. Their approach did not treat aspect extraction and sentiment analysis as separate phases or enforce explicit modeling assumptions on how these two phases should overlap and interact.

(Pontiki et al., 2014) conducted SemEval-2014 Task4 aiming to foster research in the field of aspect-based sentiment analysis, where the goal is to identify the aspects of given target entities and the sentiment expressed for each aspect. There are four subtasks in this research, including aspect term extraction (SB1), aspect term polarity (SB2), aspect category detection (SB3) and aspect category polarity (SB4). For SB1 and SB2,

(16)

review sentences, SB1 and SB2 ask to identify all aspect terms present in each sentence (e.g., “wine”, “waiter”, “appetizer”, “price”) and determine the polarity of each aspect term (positive, negative, conflict or neutral). For SB3 and SB4, a predefined set of aspect categories (e.g., PRICE, FOOD) are given, and the task is to identify the aspect

categories discussed in each sentence and determine the polarity. The result shows that analysis systems would perform better if set of aspect categories are predefined.

(Dong et al., 2014) focused on integrating target information with Recursive Neural Network (RNN) to leverage the ability of deep learning models. Target is the entity in a sentence, for example, the tweet “@ballmer: windows phone is better than ios!” has three targets (@ballmer, windows phone, and ios). The user expresses neutral, positive, and negative sentiments for them, respectively.

(Wang et al., 2016) proposed attention-based Long Short-Term memory for aspect-level sentiment classification. The models are able to attend different parts of a sentence when different aspects are concerned. For example, the sentiment polarity of “Staffs are not that friendly, but the taste covers all.” will be positive if the aspect is food but negative when considering the aspect service.

(17)

2.3 Document-Level Sentiment Classification

Another related research field is document-level sentiment classification (Tang et al., 2015b; Chen et al., 2016; Tang et al., 2016; Yang et al., 2016) because single aspect sentiment classification can be treated as an individual document classification task, which assumes that each document expresses a sentiment of each aspect in turn on a single product and targets. To illustrate, given a document with 3 aspects with 3 different ratings, this document will be represented 3 times in the dataset, once for each aspect. Feature, model and datasets used by researches of document-level sentiment

classification are illustrated in Table 5. Evaluation metric and baseline methods are displayed in Table 6.

Research Features Model Dataset (Size)

Tang et al., (2016)

text

representation

Deep memory network approach for aspect level sentiment classification

Two datasets: laptop (3,845) and restaurant datasets (3,841) Tang et al., (2015b) representations of individual preferences of users, overall qualities of products and review texts

User Product Neural Network (UPNN)

Three datasets derived from IMDB and Yelp Dataset Challenge IMDB (84,919) Yelp 2014 (231,163) Yelp 2013(78,966) Chen et al., (2016) representations of individual preferences of users, overall qualities of products and review texts

A hierarchical LSTM model

Three datasets derived from IMDB and Yelp Dataset Challenge IMDB (84,919) Yelp 2014 (231,163) Yelp 2013(78,966) Yang et al., (2016) text representation Hierarchical Attention Network (HAN)

(18)

Research Features Model Dataset (Size) Yelp 2014 (1,125,457) Yelp 2015 (1,569,264) IMDB review (348,415) Yahoo answer

(1,450,000) Amazon review (3,650,000)

Table 5: Feature, model and datasets for Document-Level Sentiment Classification

Research Evaluation Metric Baseline Methods Tang et al.,

(2016)

Accuracy Running speed

Majority is a basic baseline method, which assigns the majority sentiment label in training set to each instance in the test set.

feature-based SVM

LSTM and attention-based LSTM models (Tang et al.,2015a)

Tang et al., (2015b)

Accuracy MAE RMSE

Majority

Trigram (Fan et al., 2008)

TextFeature (Kiritchenko et al., 2014) AvgWordvec + SVM

sentiment-specific word embeddings (SSWE) + SVMg (Tang et al., 2014b)

Paragraph Vector (Le and Mikolov, 2014) Recursive Neural Tensor Network (RNTN) (Socher et al., 2013) + Recurrent Neural Network (RNN)

a state-of-the-art recommendation algorithm JMARS (Diao et al., 2014).

Chen et al., (2016)

Accuracy RMSE

Majority

Trigram (Fan et al., 2008)

TextFeature (Kiritchenko et al., 2014) AvgWordvec + SVM

sentiment-specific word embeddings (SSWE) + SVM(Tang et al., 2014b)

Paragraph Vector (Le and Mikolov, 2014) Recursive Neural Tensor Network (RNTN) (Socher et al., 2013) + Recurrent Neural Network (RNN)

(19)

Research Evaluation Metric Baseline Methods Yang et al.,

(2016) Accuracy A linear classifier based on multinomial logistic regression TextFeature (Kiritchenko et al., 2014)

AvgWordvec + SVM

sentiment-specific word embeddings (SSWE) + SVM (Tang et al., 2014b)

CNN-word (Kim, 2014) CNN-char (Zhang et al., 2015) LSTM

Conv-GRNN and LSTM-GRNN (Tang et al., 2015).

Table 6: Evaluation metric and baseline methods for Document-Level Sentiment Classification

(20)

(Tang et al., 2015b) and (Chen et al., 2016) considered the effect of individual

preferences of users and overall qualities of products toward the sentiment rating of a review. (Tang et al., 2015b) built a model dubbed UPNN. Users and products are encoded in continuous vector spaces. In continuous vector spaces, a numerical representation (e.g. the real numbers) of user preferences and product qualities is captured and these representations are further integrated with continuous text

representation in a unified neural framework for sentiment classification. And the output is the sentiment polarity label of a document. (Chen et al., 2016) built a hierarchical Long short-term memory (LSTM) model to generate sentence-level representation and

document-level representation jointly.

(21)

The studies in this section of document-level sentiment classification contain approaches which are based on neural networks in a hierarchical structure. However, they did not include multiple aspects.

2.4 Document-level Multi-aspect Sentiment Classification

An overview of multitask learning (Caruana, 1997) together with other researches (Collobert et al., 2011; Luong et al. 2016) shows that multitask learning can be used as straightforward approach for document-level multi-aspect sentiment classification. The following characterizations of multi-aspect learning is defined by (Caruana, 1997):

“Multitask Learning is an approach to inductive transfer that improves

generalization by using the domain information contained in the training signals of related tasks as an inductive bias. It does this by learning tasks in parallel while using a shared representation; what is learned for each task can help other tasks be learned better.”

(22)

is view?”. With these questions, they guess possible keywords for each aspect. After that, the person simply read the review, find those keywords, and examine comments.

2.5 Summary of Literature Review

To sum up, multi-aspect sentiment analysis uses the word syntax to determine whether a word serves for aspect or sentiment, or identifying a specific aspect for each sentence. Aspect-based sentiment classification extract aspect expressions from sentences, and then determine sentiments. Document-level sentiment classification assumes each document expresses a sentiment on a single product for each aspect, but it doesn’t include multi-aspects. With the introduction of multitask learning, there are researches focusing on mimicking human’s reaction when being asked to evaluate an aspect for a product.

(23)

(24)

3 DATA

In this section, I first introduce real-world dataset used in this paper and present some observations about the distribution of aspects rating and overall rating.

3.1 Data Set and Preprocessing

I crawled 19,932 car reviews from TrueCar. These include all the records in the website with an actual text comment. These review records contain 35 car brands in market, provided by reviewers between Jan, 2016 and Jun, 2019. The website review system allows reviewers to enter an overall rating, a review comment, and 13 aspect ratings in each review: Satisfaction, Value, Safety, Reliability, Efficiency, Ride, Seating, Exterior, Interior, Audio, Comfort, Technology and Performance. However, the users may not see

any reason to rate and describe all 13 aspects, and they may just include aspects that they care most about. The overall rating and Satisfaction rating are integers from 1 star to 5 stars, while ratings of other 12 aspect ratings include 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, and 5 stars. These can serve as ground-truth for quantitative evaluation of aspect rating prediction.

I first perform simple pre-processing on this dataset:

(25)

in Table 7, the reviews without aspect ratings (48.02%) are almost half of all reviews. After this pre-processing stage, 10,361 reviews are left.

# of aspects being

rated Count of Reviews Percentage of Reviews

0 9571 48.02%

1 285 1.43%

2 9 0.05%

3 1 0.01%

4 1 0.01%

5 6 0.03%

6 2 0.01%

7 31 0.16%

8 17 0.09%

9 0 0.00%

10 27 0.14%

11 41 0.21%

12 620 3.11%

13 9321 46.76%

Total 19932 100%

Table 7: Number of aspects rated per review

(26)

Aspect Keywords

Technology

ac, acc, android, apple, Bluetooth, button, camera, cellphone, computer, connect, connection, connectivity, cruise, display, electronic,

entertainment, feature, gps, horsepower, infotainment, iphone, keyless, lane, map, nav, navi, navigation, phone, play, power, radio, remote, sensor, smartphone, software, tech, technologically, touch, touchscreen, voice

Audio aux, cd, loud, media, mp3, music, player, song, sound, speaker, stereo Performance accelerate, acceleration, auto, automate, automatic, awd, cylinder, diesel, drive, drivetrain, engine, engineer, exhaust, gasket, gear, gearbox,

gearshift, handle, haul, performance, steering, tow, transmission, wheel Exterior color, door, exterior, flat, gate, headlamp, headlight, lift, liftgate, paint, _{rear, style, tail, tailgate, tire, wheel, window, windshield, wiper}

Seating backseat, heat, lumbar, memory, seat, seatbelt

Interior

beam, box, cargo, cigarette, compartment, console, cup, dash, dashboard, door, glove, glovebox, headrest, headroom, holder, interior, legroom, light, lighter, lock, mat, mirror, moonroof, pedal, pocket, roof, room, roomy, space, steering, storage, sun, sunglass, sunroof, taillight, trunk, unlock, view, visor, wheel

Efficiency battery, charge, consumption, efficient, fuel, gas, hybrid, mileage, tank Ride ride

Reliability noise, rattle, rattling, reliability, shake

Safety airbag, assistant, blind, brake, camera, departure, lane, quality, safe, safety, sensor, warning

Comfort

airflow, arm, armrest, backseat, climate, comfort, comfortable,

comfortably, conditioning, convenience, heat, headrest, lumbar, luxury, memory, seat, seatback, seatbelt, support, temperature, uncomfortable, vent

Value cheap, expensive, money, price, useful, value, wallet

Satisfaction

able, absorbed, admirable, admired, affable, affectionate, afraid, aggravated, agreeable, aggressive, alarmed, amazed, amazing, ambivalent, amused, angry, annoy, annoyed, anxious, apprehensive, ashamed, astonished, awful, awkward, bad, beautiful, best, cheesy, complain, decent, destroy, detriment, disappointed, dissatisfied, enjoy, excellent, fun, glad, good, great, handsome, happy, hate, horrible, impressive, incomplete, like, love, lovely, negative, nice, pleased, pleasure, positive, regret, ridiculous, satisfactory, satisfy, shame, suck, terrific, ugly, wonderful, worst

Table 8: keywords list for each aspect

(27)

# of aspects being described

Count of Reviews

Percentage of Reviews

0 334 3.22%

1 1587 15.32%

2 1796 17.33%

3 1691 16.32%

4 1464 14.13%

5 1104 10.66%

6 826 7.97%

7 594 5.73%

8 434 4.19%

9 296 2.86%

10 135 1.30%

11 66 0.64%

12 31 0.30%

13 3 0.03%

Total 10361 100.00%

Table 9: Number of aspects described per review

2) convert all the words into lower case; and removing punctuation and stop words. I use a default list of stop words in English defined by NLTK3_{. After the pre-processing, the} data set contains 10,027 car reviews and 289 car models.

The average data set in research described in the literature review (section 2) is 467,492 records. In comparison, the data size in this paper is only 2.14% of the average size even though I scraped all available data from TrueCar website. One potential explanation is that the datasets being used in related study before are mainly about reviews of hotels, restaurants and low-priced products, such as mp3. However, considering the type of business and how frequently consumers tend to use that type of business, a small data size for car review is acceptable. For instance, business types that people use frequently

(28)

(such as restaurants) will probably have more reviews because more people simply visit these types of business more often and on a regular basis. But people tend not to change their car regularly. In addition, Yelp and TripAdvisor are dominant toward reviews for restaurants and hotels, but TrueCar has some major competitors such as KBB.

3.2 Observation

3.2.1 Trends of overall ratings

When scoring a product, users may consider multiple aspects of the product. If these aspects could meet the users’ requirement, they can give a high overall rating, otherwise, they could give low scores. Therefore, overall rating can partly reflect the users’ attitudes to aspects. To investigate the trend of overall ratings, I computed the distribution for overall ratings, aspects ratings, and total aspect ratings. It shows that more than half of reviewers give high ratings for overall ratings: 4 and 5 stars for their purchases. Only about 2% reviewers are extremely dissatisfied and give the rating of 1 star. The same patterns are also observed in the distribution of aspect ratings for Performance, Exterior, Reliability, Safety, Satisfaction and total aspect ratings. The details of overall rating distribution are shown in Table 10 and Figure 2, and aspect rating distribution, total aspect ratings distribution is showed in Table 11 and Figure 3.

Overall

Rating 1 2 3 4 5 Total

Count of

Reviews 164 2098 2066 2257 3481 10066

Percentage of

Reviews 0.02 0.21 0.21 0.22 0.35 1

(29)

(30)

28

Aspect 5.0 4.5 4.0 3.5 3.0 2.5 2.0 1.5 1.0 Total

Technology Count 1178 980 1022 1057 1459 1533 1431 636 292 9588

Percentage 12.29% 10.22% 10.66% 11.02% 15.22% 15.99% 14.92% 6.63% 3.05% 100.00%

Audio Count 2738 821 1191 619 2062 572 1491 301 214 10009

Percentage 27.36% 8.20% 11.90% 6.18% 20.60% 5.71% 14.90% 3.01% 2.14% 100.00%

Performance Count 2164 1565 1479 1232 1685 1000 736 164 34 10059

Percentage 21.51% 15.56% 14.70% 12.25% 16.75% 9.94% 7.32% 1.63% 0.34% 100.00%

Exterior Count 3105 1300 1449 966 1679 733 682 108 37 10059

Percentage 30.87% 12.92% 14.41% 9.60% 16.69% 7.29% 6.78% 1.07% 0.37% 100.00%

Seating Count 1726 1126 1669 1057 1997 1026 1172 219 58 10050

Percentage 17.2% 11.2% 16.6% 10.5% 19.9% 10.2% 11.7% 2.2% 0.6% 100.0%

Interior Count 1905 1565 1394 1214 1669 1011 1022 212 65 10057

Percentage 18.94% 15.56% 13.86% 12.07% 16.60% 10.05% 10.16% 2.11% 0.65% 100.00%

Efficiency Count 1565 643 991 688 1641 798 2547 323 781 9977

Percentage 15.69% 6.44% 9.93% 6.90% 16.45% 8.00% 25.53% 3.24% 7.83% 100.00%

Ride Count 1748 1099 1567 1069 1909 993 1271 294 115 10065

Percentage 17.37% 10.92% 15.57% 10.62% 18.97% 9.87% 12.63% 2.92% 1.14% 100.00%

Reliability Count 2865 913 1310 607 2069 641 1156 273 136 9970

Percentage 28.74% 9.16% 13.14% 6.09% 20.75% 6.43% 11.59% 2.74% 1.36% 100.00%

Safety Count 2334 1287 1539 1008 2031 742 917 120 28 10006

(31)

29

Aspect 5.0 4.5 4.0 3.5 3.0 2.5 2.0 1.5 1.0 Total

Comfort Count 1442 1508 1481 1452 1677 1332 930 210 33 10065

Percentage 14.33% 14.98% 14.71% 14.43% 16.66% 13.23% 9.24% 2.09% 0.33% 100.00%

Value Count 2171 321 1369 238 2601 199 2494 0 565 9958

Percentage 21.80% 3.22% 13.75% 2.39% 26.12% 2.00% 25.05% 0.00% 5.67% 100.00%

Satisfaction Count 4432 0 1775 0 2290 0 1244 0 435 10176

Percentage 43.55% 0.00% 17.44% 0.00% 22.50% 0.00% 12.22% 0.00% 4.27% 100.00% Table 11-b: Aspect rating distribution

(32)

3.2.2 Aspect Description Distribution

Table 12 and Figure 4 demonstrate the distribution of the number of aspects being described in review text. There are only 3 reviews that present all 13 aspects. By the fact that 76.98% reviews describe 0 to 5 aspects, it is known that most reviewers are not willing to offer a comprehensive review including all aspects.

# of aspects described Count Percentage

0 334 3.22%

1 1587 15.32%

2 1796 17.33%

3 1691 16.32%

4 1464 14.13%

5 1104 10.66%

6 826 7.97%

7 594 5.73%

8 434 4.19%

9 296 2.86%

10 135 1.30%

11 66 0.64%

12 31 0.30%

13 3 0.03%

Total 10361 100.00%

(33)

Figure 4: Number of aspects described per review

(34)

# of Reviews in which the Aspect is/is

not Described

YES NO

Technology 6283 3744

Audio 1191 8836

Performance 3340 6687

Exterior 3023 7004

Seating 2074 7953

Interior 3485 6542

Efficiency 1561 8466

Ride 820 9207

Reliability 1007 9020

Safety 2235 7792

Comfort 3317 6710

Value 3020 7007

Satisfaction 8009 2018

Table 13: Number of reviews in which the aspect is/is not described

(35)

4 METHODS

In this section, I discuss the methods for mapping the sentences to one or more aspects, getting sentiment score for each sentence, computing sentiment for each aspect, and adding overall rating into prediction.

4.1 Mapping Aspect to Sentence

There is an assumption in this paper: that it is clear which words are discussing which aspects in a review. Therefore, I first perform sentence and aspect mapping in a review document based on the given keywords describing aspects. The keywords list is

illustrated in Table 8.

(36)

drive” would be related to Technology and Performance as it has two words from keywords list: technology and drive.

4.2 Sentiment Analysis of Sentences

I used VADER4_{(Valence Aware Dictionary and sEntiment Reasoner) model for} recognizing sentiment for each aspect-relevant sentence. VADER is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. This model outputs sentiment scores to 4 classes of sentiments: Negative, Neutral, Positive and Compound. The score of Compound is computed by summing the valence scores of each word in the lexicon, which is then normalized to be between -1 (most extreme negative) and +1 (most extreme positive)5_.

4.3 Aspect Rating Prediction with Text Feature

In order to compute rating for aspect A in a review R, a sum of sentiment scores of all sentences in R relating to A will be calculated. Then I use the average sentiment score as the aspect rating prediction. Because the sentiment score outputted by VADER is range from -1 to 1, the last step is to map the range to [1,5]. The following formula shows the process of calculation:

Aspect Sentiment Score = (

∑_2%&'%&*%2$%&'()%&' $*+,% -+, .%/%01&' $%&'%&*%2 34)5%, +- .%/%01&' $%&'%&*%2 67) ∗:

;

4_{http://www.nltk.org/howto/sentiment.html}

(37)

If the Aspect Sentiment Score is not an integer or among [1.5, 2.5, 3.5, 4.5], it will be rounded to the nearest integer or decimal in [1.5, 2.5, 3.5, 4.5].

Consider an example review contains 4 sentences:

“In other words, at this level, there is no excuse for the standard sound system to be a

"dial" and not a touch screen style stereo. The sound is good, but it's totally antiquated.

Additionally, I would comment that the front seat can be a bit of a challenge to get out of

at times, and seems a little cramped at times. Thank you.”

Table 14 shows the relevant aspects and sentiment score for each sentence.

Sentence Relevant Aspects Sentiment Score 1st sentence [Technology, Audio, Exterior] -0.2263 2nd sentence [Audio, Satisfaction] 0.2382

3rd sentence [Seating, Comfort] 0.0772

4th sentence []

Table 14: Relevant aspects and sentiment score for example review To get sentiment score for Audio, the following computation is performed:

Aspect Sentiment Score = <

=>.@@AB C >.@BD@ @ 67E∗:

; = 2.51

Therefore, the final predicted rating for Audio of this review would be 2.5.

4.4 Aspect Rating Prediction with Overall Rating Feature and Text Feature

(38)

rating. The formula is as follows, where the Overall Rating Weight in this formula ranges from 0.0 to 1.0.

AvgPct = Overall Rating * Overall Rating Weight + Aspect rating prediction with text feature * (1 - Overall Rating Weight)

4.5 Aspect Rating Prediction with Chunk Parsing

I use sentence sentiment score to compute aspect sentiment score, however, one concern about this methodology is when one sentence describes more than one aspect with different emotions. For example, given a sentence “Lexus is quieter on road and has a great ride but driver seat is less comfortable.”, this sentence relates to three aspects:

Ride, Seating and Comfort but the sentiment for Ride is different from the sentiment for

Seating and Comfort. In the dataset, there are 14,142 out of 46,503 (30.41%) sentences describing 2 or more aspects. And 3441 out of 10027 (34.32%) reviews include sentences relating to more than 1 aspect. To mitigate the effect of this problem, I detect chunks in all sentences depicting more than one aspect.

(39)

I use “pos_tag” package in NLTK library as “part-of-speech” tagger. The tagger defines words tags based on standard English sentence structure. But for review texts not following a formal structure, words are likely to be tagged differently if tagging is done by a person. For example, given a sentence “fun to drive and excellent on gas”,

“excellent” is tagged as “VB” (verb), instead of “JJ” (adjective). In addition, words with all letters uppercased are tagged as “NNP” regardless the actual parts of speech.

Therefore, I have adjusted the tags in the 7 patterns to follow NLTK tagging rules. In the sentence “fun to drive and excellent on gas”, “excellent on gas” can be extracted with pattern “<VB><IN>?<DT>?<NN>+”. And “EXCELLENT SOUND” can be got through pattern “<NNP>+<CC>?”.

Compared to the original sentence, aspects are expressed more specifically in chunks. In the above example, there are two chunks after chunk detection: “great ride” and “driver seat is less comfortable”. Then, instead of analyzing the sentiment score for a full sentence, I get the sentiment score for each chunk and calculate sentiment score for each aspect.

Tag Pattern Example

<JJ><NN>+6_<VBZ>?7_<RBR>*8_<JJ>* _{Tech-Plus package is} phenomenal

excellent safety quiet ride

engine performance is awesome <NN>+<TO><VB>* fun to drive

convenient to use mirror hard to adjust

technology need to improve

(40)

Tag Pattern Example <RB>+<TO><VB>* fun enough to drive

very hard to drive <VB><IN>?<DT>?<NN>+ excellent on gas

Love the ride

perform like first-generation technology

love that safety aspect

<NNP>+<CC>? EXCELLENT SOUND

<DT>?<NN>+<VBZ><RB>*<JJ>+<CC>?<JJ>? <NN>?

car is reliable

seat entry is simply difficult and uncomfortable

glove compartment is too small the navigation system is

somewhat complex

the trunk is surprisingly roomy <NNP>+<IN><NNP>+ EXCELLENT FEEL OF THE

WHEEL Table15: Tag patterns and examples

Tag Meaning (Example)

CC coordinating conjunction DT determiner

IN preposition/subordinating conjunction JJ adjective (“big”)

NN noun, singular (“desk”)

NNP proper noun, singular (“Harrison”) RB adverb (“very”, “silently”)

TO to

VB verb, base form (“take”) VBD verb, past tense (“took”)

VBG verb, gerund/present participle (“taking”) VBZ verb, 3rd person singular, present (“takes”)

(41)

5 RESULTS AND DISCUSSION

(42)

40

5.1 MSE with Various Overall Rating Weight (Without Chunk Parsing)

I use Mean Squared Error (MSE) as the evaluation metrics, and the results of MSE with various overall rating weight, but without

chunk parsing are shown in Table 17.

Aspect

MSE with Various Overall Rating Weight (Without Chunk Parsing) OR Weight 0.0 OR Weight 0.1 OR Weight 0.2 OR Weight 0.3 OR Weight 0.4 OR Weight 0.5 OR Weight 0.6 OR Weight 0.7 OR Weight 0.8 OR Weight 0.9 OR Weight 1.0

Technology 1.7114 1.4263 1.1868 0.9749 0.8951 0.8576 0.7658 0.7880 0.8237 1.0429 1.0824 Audio 2.5757 2.2491 1.8656 1.5880 1.3833 1.2263 1.1007 1.0487 1.0565 1.1853 1.2375 Performance 1.8444 1.6081 1.2551 1.0085 0.8553 0.6471 0.6111 0.5361 0.5372 0.5833 0.6021 Exterior 2.7211 2.3435 1.9148 1.5484 1.3375 0.9454 0.9188 0.7797 0.7248 0.6907 0.6921 Seating 1.8873 1.5892 1.2466 0.9772 0.8173 0.6330 0.5430 0.4786 0.4705 0.5366 0.5593 Interior 2.0448 1.7187 1.3450 1.0256 0.8631 0.6028 0.5165 0.4172 0.3923 0.4138 0.4255 Efficiency 2.1186 2.0105 1.6771 1.5719 1.4394 1.5064 1.3236 1.3378 1.3753 1.4898 1.5285 Ride 1.4177 1.2851 0.9969 0.8083 0.6967 0.6273 0.5256 0.4985 0.4889 0.5391 0.5549 Reliability 2.0076 1.7254 1.3976 1.1275 0.9965 0.7510 0.6506 0.5685 0.5520 0.5586 0.5641 Safety 2.1191 1.8278 1.4183 1.1357 0.9334 0.6741 0.5933 0.4978 0.4586 0.4608 0.4842 Comfort 1.5956 1.3715 1.0303 0.7895 0.6520 0.5024 0.4028 0.3463 0.3356 0.3770 0.3968 Value 2.0300 1.7265 1.4224 1.1716 1.0516 0.9347 0.7948 0.7604 0.7554 0.8765 0.8947 Satisfaction 2.1169 1.9068 1.6372 1.2916 1.1132 1.0087 0.9399 0.8714 0.8428 0.8691 0.8799

(43)

An overall rating weight of 0, which means no overall rating information included in the

model at all, produces very poor result. In contrast, the method with an overall rating

weight of 0.8 achieves better MSE toward the average and 8 aspects: Performance,

Seating, Interior, Ride, Reliability, Safety, Comfort, Value and Satisfaction. For the other

5 aspects: Technology and Efficiency run better when overall rating weight is 0.6; Audio

and Performance perform better with an overall rating weight of 0.7; Exterior

outperforms others with an overall rating weight of 0.9.

(44)

42

5.2 MSE with Various Overall Rating Weight (With Chunk Parsing)

Table 18 displays results of MSE with various overall rating weight, and chunk parsing.

Aspect

MSE with Various Overall Rating Weight (With Chunk Parsing) OR Weight 0.0 OR Weight 0.1 OR Weight 0.2 OR Weight 0.3 OR Weight 0.4 OR Weight 0.5 OR Weight 0.6 OR Weight 0.7 OR Weight 0.8 OR Weight 0.9 OR Weight 1.0

Technology 1.6190 1.2523 1.0874 0.8451 0.8133 0.7578 0.6998 0.7409 0.7674 1.0564 1.0824

Audio 2.2133 1.8466 1.5876 1.3257 1.2681 1.0822 1.0421 1.0214 1.0477 1.2082 1.2375

Performance 2.0419 1.6589 1.3398 1.0191 0.9400 0.5964 0.6055 0.5325 0.5332 0.5905 0.6021

Exterior 2.7508 2.2128 1.8710 1.4394 1.3775 0.8231 0.8951 0.7578 0.7369 0.6918 0.6921

Seating 1.7451 1.2884 1.0900 0.7631 0.7547 0.4735 0.4726 0.4255 0.4280 0.5576 0.5593

Interior 2.0523 1.5942 1.2916 0.9234 0.8578 0.4990 0.4833 0.3834 0.3739 0.4215 0.4255

Efficiency 1.9518 1.7369 1.5231 1.3529 1.3057 1.3263 1.2120 1.2544 1.2826 1.5036 1.5285

Ride 1.8758 1.4501 1.2223 0.8527 0.8009 0.5219 0.5176 0.4433 0.4602 0.5481 0.5549

Reliability 2.1970 1.7165 1.4620 1.0969 1.0700 0.6668 0.6430 0.5406 0.5390 0.5570 0.5641

Safety 2.4302 1.9038 1.5680 1.1423 1.0526 0.5946 0.6060 0.4876 0.4693 0.4735 0.4842

Comfort 1.8367 1.3600 1.1200 0.7444 0.7095 0.3897 0.3825 0.3042 0.3073 0.3910 0.3968

Value 1.8760 1.5015 1.2946 1.0257 0.9793 0.8269 0.7293 0.7075 0.7078 0.8925 0.8947

Satisfaction 2.3033 2.0495 1.7694 1.3812 1.1953 1.0636 0.9773 0.9002 0.8506 0.8696 0.8799

(45)

After chunk parsing all the sentences depicting more than one aspect, the model performs

best with an overall rating weight of 0.7. This result further supports the point that the

MSE result would be better considering overall rating weight. One noticeable change is

that compared to the best result of the model without chunk parsing, the average MSE

decreases 3.57% using to the model with chunk parsing. And the latter model has a lower

MSE in 10 aspects, including Technology, Audio, Performance, Seating, Interior,

Efficiency, Ride, Reliability, Comfort and Value. A more direct comparison is shown in

Table 19.

Aspect MSE Without Chunk Parsing MSE With Chunk Parsing

OR Weight 0.8 OR Weight 0.7

Technology 0.8237 0.7409

Audio 1.0565 1.0214

Performance 0.5372 0.5325

Exterior 0.7248 0.7578

Seating 0.4705 0.4255

Interior 0.3923 0.3834

Efficiency 1.3753 1.2544

Ride 0.4889 0.4433

Reliability 0.5520 0.5406

Safety 0.4586 0.4876

Comfort 0.3356 0.3042

Value 0.7554 0.7075

Satisfaction 0.8428 0.9002

AVERAGE 0.6780 0.6538

Table 19: Comparison of MSE for model with and without chunk parsing

The fact that the model with chunk parsing achieves better results indicates reviewers are

(46)

sentences for each review is 4.64 and there are 13 aspects in total, it is essentially

impossible for each reviewer to write a single sentence for each aspect.

5.3 Over-Prediction and Under-Prediction

In this part I discuss the change of over-prediction and under-prediction as overall rating

weight varies.

• Over-Prediction: the predicted rating is greater than actual rating.

• Under-Prediction: the predicted rating is less than actual rating.

• Accurate-Prediction: the predicted rating and actual rating are the same.

As shown in Table 20 and Figure 6, the number of over-prediction and

accurate-prediction rise as the overall ratings weight increases. With an overall rating weight of

1.0, over-prediction is 56.21% above it is when the weight is 0.0. And accurate prediction

increases dramatically by 203.85% from a weight from 0.0 to 1.0. In contrast,

(47)

45 Prediction

Type

# of Over-Prediction, Under-prediction and Accurate-Prediction with Various Overall Rating Weight OR Weight 0.0 OR Weight 0.1 OR Weight 0.2 OR Weight 0.3 OR Weight 0.4 OR Weight 0.5 OR Weight 0.6 OR Weight 0.7 OR Weight 0.8 OR Weight 0.9 OR Weight 1.0

Over-Prediction 8840 8971 9305 10086 9873 12542 10196 11658 11497 13746 13809

Under-prediction 24438 24094 23053 22586 22266 17697 18824 17318 14924 9870 9191

Accurate-

Prediction 5042 5255 5962 5648 6181 8081 9300 9344 11899 14704 15320

(48)

(49)

47 Table 21 and Figure 7 display MSE of over-prediction, prediction and all prediction (including over-prediction,

under-prediction and accurate-under-prediction) with different overall rating weight. As it appears in Figure 7, when adding weight to overall

rating, MSE of under-prediction and all prediction decreases and reaches the lowest level at overall rating weight 0.7 and 0.8

respectively, then gradually increases. In contrast, over-prediction has its lowest MSE when overall rating weight is 0.4, and its largest

MSE of 1.36 with an overall rating weight of 1.0.

Prediction Type

MSE of Over-Prediction, Under-prediction and All-Prediction OR Weight 0.0 OR Weight 0.1 OR Weight 0.2 OR Weight 0.3 OR Weight 0.4 OR Weight 0.5 OR Weight 0.6 OR Weight 0.7 OR Weight 0.8 OR Weight 0.9 OR Weight 1.0

Over-prediction 1.20 1.20 1.00 0.90 0.86 0.94 0.96 0.98 1.07 1.32 1.36

Under-prediction 2.80 2.22 1.96 1.43 1.35 0.99 0.96 0.83 0.88 1.12 1.19

All-Prediction 2.06 1.68 1.42 1.08 1.00 0.76 0.73 0.67 0.66 0.76 0.77

(50)

(51)

According to all prediction MSE in Figure 7, the best overall rating weight is 0.8.

However, the number of accurate-predictions is the greatest when overall rating weight is

1.0. This shows that mean squared error and predictive accuracy do not always agree

when it comes to identifying an optimal prediction model.

The possible reason for under-prediction is that reviewers exaggerate their sentiment

when rating. Additionally, over-prediction implies that reviewers may offer less positive

rating than their text comments. However, another possibility is that the sentiment

analysis itself could be wrongly-rating the sentiment. The only way to determine this

would be to interview the reviewers themselves, learning what their true intentions are.

The reasons of the mismatch of review text sentiment polarity and review rating offer a

potential area for further exploration.

5.4 Aspect-Based MSE Summarization

The aspect-based MSE change is reflected in Figure 8. MSE for each single aspect is

presented in Table 22 and Table 23. Plus, Table 24 and Table 25 further display MSE of

each overall rating weight. It is observable that Audio and Efficiency have larger MSE

than other aspects when overall rating weight is 0.5 or higher. Compare to the best

average MSE 0.6538 at overall rating weight 0.7, the smallest MSE for Audio is 1.0214,

(52)

50 With regarding to change percentage from top to bottom, the sharpest change of MSE happens to Comfort, with a drop of 83.44%

from weight 0.0 to weight 0.7. Safety has the second most abrupt change of MSE of 80.69%. Another big fall of MSE is with Interior,

which is 81.78%.

(53)

(54)

(55)

(56)

(57)

6 CONCLUSIONS

The major goal for this paper is to respond to the question of whether overall rating

would improve the accuracy of aspect rating prediction or not. I propose a model

mapping aspect to sentence. The model takes review texts and overall ratings as input,

and discovers each review’s ratings on the given aspects. Then, an empirical experiment

on a car review data set shows that with overall rating, aspect ratings are predicted more

precisely. Furthermore, the prediction result is more accurate after extracting chunks

narrating aspects more specifically.

A major limitation of this methodology is the authority of keyword lists for aspects and

common tag patterns for chunk parsing. Keywords for each aspect could be overlapping,

for example: words depicting “front seat” can be related to both Comfort and Seating.

Additionally, consumers have different interpretations toward aspects. Another limitation

concerns the use of chunk parsing. People may use complicated and diverse phrases in

reviews for expression and communication. The tag patterns that I use in this research are

unlikely to correctly identify all chunks in review texts. Future optimization of the current

model will focus on addressing limitations, such as those listed above, and on validating

(58)

7 REFERENCES

Anderson, M., & Magruder, J. (2012). Learning from the Crowd: Regression Discontinuity Estimates of the Eects of an Online Review Database. The Economic Journal, doi: 10.1111/j.1468-0297.2012.02512.x, 2012.

Caruana, R. (1997). Multitask learning. Machine Learning 28(1):41–75.

Chen, H., Sun, M., Tu, C., Lin, Y., & Liu, Z. (2016). Neural sentiment classification with user and product attention. In Proceedings of EMNLP. pages 1650–1659.

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning

Research 12(Aug):2493–2537.

Diao, Q., Qiu, M., Wu, C.Y., Smola, A.J., Jiang, J., & Wang, C. (2014). Jointly modeling aspects, ratings and sentiments for movie recommendation (jmars). In Proceedings of KDD, pages 193–202.

Dong, L., Wei, F., Tan, C., Tang, D., Zhou, M., & Xu, K. (2014). Adaptive recursive neural network for target-dependent twitter sentiment classification. In Proceedings of ACL, pages 49–54.

Fahrni, A., & Klenner, M. (2008). Old wine or warm beer: Targetspecific sentiment analysis of adjectives. Affective Language in Human and Machine.

Lakkaraju, H., Socher, R., & Manning, C. (2014). Aspect specific sentiment analysis using hierarchical deep learning. In NIPS Workshop on Deep Learning and

Representation Learning.

Liu, B. (2012). Sentiment analysis and opinion mining. Synthesis lectures on human language technologies, 5(1):1–167.

Lu, B., Ott, M., Cardie, C., & Tsou, B.K. (2011). Multi-aspect sentiment analysis with topic models. In Proceedings of ICDM Workshops, pages 81–88.

(59)

McAuley, J., Leskovec, J., & Jurafsky, D. (2012). Learning attitudes and attributes from multiaspect reviews. In Proceedings of ICDM. IEEE, pages 1020–1025.

Pang, B., & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and trends in information retrieval, 2(1-2):1–135.

Pappas, N., & Popescu-Belis, A. (2014). Explaining the stars: Weighted multiple-instance learning for aspect-based sentiment analysis. In Proceedings of EMNLP, pages 455–466. Association for Computational Linguistics.

Pontiki, M., Galanis, D., Papageorgiou, H., Androutsopoulos, I., Manandhar, S., ALSmadi, M., Al-Ayyoub, M., Zhao, Y., Qin, B., Clercq, O.D., Hoste, V., Apidianaki, M., Tannier, X., Loukachevitch, N., Kotelnikov, E., Bel, N., Jim´enez-Zafra, S.M., & Eryi˘git, G. (2016). Semeval-2016 task 5: Aspect based sentiment analysis. In

Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 19–30. Association for Computational Linguistics.

Pontiki, M., Galanis, D., Pavlopoulos, J., Papageorgiou, H., Androutsopoulos, I., & Manandhar, S. (2014). Semeval-2014 task 4: Aspect based sentiment analysis. Proceedings of International Workshop on Semantic Evaluation at, pages 27–35.

Qu, L., Ifrim, G., & Weikum, G. (2010). The bag-of-opinions method for review rating prediction from sparse text patterns. In Proceedings of the 23rd International Conference on Computational Linguistics. Association for Computational Linguistics, 2010, pp. 913– 921.

Shrestha, N., & Nasoz, F. (2019). Deep Learning Sentiment Analysis of Amazon.com Reviews and Ratings. ArXiv, abs/1904.04096.

Socher, R., Perelygin, A., Wu, J.Y., Chuang, J., Manning, C.D., Ng, A.Y., & Potts, C. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP.

Sutskever, I., Vinyals, O., & Le, Q. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS 2014).

Tang, D., Qin, B., & Liu, T. (2015a, September). Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of EMNLP, pages 1422–1432, Lisbon, Portugal, September. Association for Computational Linguistics.

Tang, D., Qin, B., & Liu. T. (2015b). Learning semantic representations of users and products for document level sentiment classification. In ACL. pages 1014–1023.

(60)

Titov, I., & McDonald, R. (2008). A joint model of text and aspect ratings for sentiment summarization. ACL.

Toh, Z., & Su. J. (2016). Nlangp at semeval-2016 task 5: Improving aspect based

sentiment analysis using neural network features. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 282–288. Association for Computational Linguistics.

Wang, H., Lu, Y., & Zhai, C. X. (2010). Latent aspect rating analysis on review text data: a rating regression approach. In Proceedings of KDD, pages 783–792.

Wang, H., Lu, Y., & Zhai, C.X. (2011). Latent aspect rating analysis without aspect keyword supervision. In Proceedings of KDD, pages 618–626.

Wang, Y., Huang, M., Zhao, L., & Zhu, X. (2016). Attention-based lstm for aspect-level sentiment classification. In Proceedings of EMNLP.

Xia, R., Zong, C., & Li, S. (2011). Ensemble of feature sets and classification algorithms for sentiment classification. Information Sciences, 181(6):1138–1152.

Yang, Z., Yang, D., Dyer, C., He, X., Smola, A.J., & Hovy, E.H. (2016). Hierarchical attention networks for document classification. In HLT-NAACL, pages 1480–1489.

Yin, Y., Song, Y., & Zhang, M. (2017). Document-level multi-aspect sentiment