Full text


Peilin Wang. A Study of Review Helpfulness Prediction Using Text from Different Domains. A Master’s Paper for the M.S. in I.S. degree. May, 2020. 46 pages. Advisor: Jaime Arguello

In the era of user-generated content, user reviews play an important role. Learning about user comments is of great significance to businesses, users and all other business participants. The Matthew effect tells us that the more exposure reviews get, the easier it is for them to continue getting exposure, which causes many quality comments to never receive attention from other users. How to mine high-quality reviews has therefore always been a concern of text mining researchers. In this study, we built different review helpfulness prediction models based on text data from different domains on Amazon. We generated different kinds of features from review text and used them as input. We found that the effect of features and models varies across domains and scenarios.


Natural Language Processing Text Mining



by Peilin Wang

A Master’s paper submitted to the faculty of the School of Information and Library Science of the University of North Carolina at Chapel Hill

in partial fulfillment of the requirements for the degree of Master of Science in

Information Science.

Chapel Hill, North Carolina May 2020

Approved by





2.1 Data Preparation

2.2 Feature Engineering

2.3 Problem Resolution & Model Building


3.1 Dataset Description

3.1.1 Initial Dataset

3.1.2 Data Selection

3.2 Dataset Exploration

3.2.1 Dataset Overview

3.2.2 Dataset—Time Span

3.2.3 Dataset—Time Distribution

3.2.4 Dataset—Word Cloud

3.3 Dataset Processing

3.3.1 Data Labeling & Balancing

3.3.2 Dataset Preprocessing

3.4 Feature Engineering

3.5 Model Building


4.1 Evaluation Method Overview

4.2 Result

4.2.1 Office Product Dataset Result

4.2.2 Digital Music Dataset Result

4.2.3 Grocery and Gourmet Food Dataset Result

4.3 Comparison & Discussion




With the rapid development of the Internet and related applications, traditional information distribution methods have undergone a radical change. Users are increasingly involved in the process of information interaction. In other words, consumers need and want to be co-creators of content, not merely an audience being broadcast to, and user reviews are a key part of the entire user-generated content ecosystem.

User reviews can be very convenient for other users because consumers can learn about the real experience of others, which helps them make decisions about a product or service. Besides that, analyzing user reviews also brings great benefits to businesses: by analyzing user reviews, product and service providers can get to know user preferences and improve their products. So high-quality reviews are of great value to both online businesses and consumers.


the votes it receives. After that, reviews posted later usually get less exposure, which results in fewer votes. The factor of review quality is not that prominent, and as a result, many useful reviews will be ignored.



In this part, I review some of the previous papers related to this work. The literature review follows the process of the whole research, in order to find reference material for each step.

2.1 Data Preparation


Speaking of machine learning projects, an important problem to deal with is data imbalance. In their paper, He et al. (2009) raised concerns about the performance of machine learning models trained on imbalanced data, due to underrepresented data and severe class distribution skews. They also claimed that training models on imbalanced data sets requires new understandings and algorithms to process raw data efficiently and generate data sets that are more suitable for training machine learning models. Based on their research, the authors surveyed multiple sampling techniques such as random over-sampling and under-sampling, informed under-sampling, synthetic sampling with data generation, adaptive synthetic sampling, and sampling with data cleaning techniques.
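As a concrete illustration of the simplest of these techniques, random under-sampling can be sketched in a few lines of Python. The data below is a toy stand-in, not any dataset from the surveyed papers; the function name and structure are illustrative.

```python
import random

def undersample(majority, minority, seed=42):
    """Randomly under-sample the majority class down to the minority size."""
    rng = random.Random(seed)
    sampled = rng.sample(majority, len(minority))
    return sampled + minority

# Toy example: 100 "no vote" reviews (label 0) vs. 10 "helpful" reviews (label 1).
majority = [("review %d" % i, 0) for i in range(100)]
minority = [("review %d" % i, 1) for i in range(100, 110)]
balanced = undersample(majority, minority)
```

After under-sampling, both classes contribute the same number of instances, at the cost of discarding part of the majority-class data.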

2.2 Feature Engineering

Feature engineering is a very important part of machine learning projects. It is a process of analyzing, extracting and reorganizing a large amount of data and variables, with the goal of eventually outputting features which are helpful for solving the problem. For the specific problem of review helpfulness prediction, feature engineering is about digging deeper into the problem itself and finding the most representative features.


them, set length is the number of unique words in the review. In addition to those basic features, the authors added three others: polarity, subjectivity and reading ease. The polarity of a text represents whether the opinion expressed in that text is negative, positive, or neutral (Wilson et al., 2009). In this paper, the authors used the SentiWordNet database to calculate the positive and negative scores of a text. Subjectivity is added to measure the intensity of emotional expression in the reviews. It is calculated by finding how many sentences of the review express an opinion and dividing this number by the total number of sentences in the review. Reading ease is an effective measurement of the readability of a review. The authors introduced two calculation methods: one is called Flesch Reading Ease and the other is the Dale-Chall formula. The Flesch Reading Ease score is given as:

FRE = 206.835 − (1.015 × ASL) − (84.6 × ASW)

In this formula, FRE is the score, ASL is the average sentence length in words (the number of words divided by the number of sentences) and ASW is the average number of syllables per word. The core idea of the above formula is to use word length to measure readability, while the Dale-Chall formula uses a count of difficult words. It uses a list of 3000 words that groups of fourth-grade American students could reliably understand. The formula is as below:

Raw Score = 0.1579 × (PDW) + 0.0496 × (ASL)
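Both readability scores are simple arithmetic once the text statistics are available. The sketch below uses the standard published constants of the two formulas and takes the averaged statistics as inputs directly; the function names are illustrative, and syllable counting (the hard part in practice) is assumed to be done elsewhere.

```python
def flesch_reading_ease(asl, asw):
    """Flesch Reading Ease from average sentence length (in words)
    and average syllables per word; higher means easier to read."""
    return 206.835 - 1.015 * asl - 84.6 * asw

def dale_chall_raw(pdw, asl):
    """Dale-Chall raw score from the percentage of difficult words
    (words outside the 3000-word list) and average sentence length."""
    return 0.1579 * pdw + 0.0496 * asl

# A text averaging 10 words per sentence and 1.5 syllables per word:
print(flesch_reading_ease(10, 1.5))  # 69.785
```

A score near 70 corresponds to "plain English" on the Flesch scale, which is typical for product reviews.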


In another study, Pei-Ju Lee et al. (2018) used data from online hotel reviews. During the feature engineering procedure, they did something similar to the above study: they also used natural language features as part of their independent variables, and they examined subjectivity as well. However, there are two main differences. Firstly, the researchers expanded the review sentiment measurement, considering both supervised and unsupervised sentiment analysis techniques. Secondly, they took reviewer characteristics into consideration when evaluating review helpfulness. They argued that review helpfulness is closely related to the reviewer’s characteristics, because the user’s personal review habits are likely to affect the quality of reviews. They introduced many user-related variables into their research, including reviewer age, reviewer gender, reviewer level, total number of past reviews, total number of past received votes, review frequency and so on.

Many related studies value features relevant to the reviewer. Liu, Huang, An & Yu (2008) examined the reviews on several popular websites, including CNET, Amazon and IMDB, and conducted preliminary experiments to evaluate the various factors involved. The three kinds of important features they found are


but filled with insulting remarks. If these differences could be properly represented in the prediction model, the model could be more reasonable. Otterbacher (2009) did something similar. She defined this kind of feature as believability reputation, including whether the reviewer uses a real name, whether the reviewer has a top reviewer badge, the reviewer’s rank in the community, the total number of reviews contributed by the reviewer, and the helpful votes received by the reviewer. The research was based on the thinking that these attributes might be used by community members to assess reviewer reputation.

Otterbacher (2009) also mentioned an interesting point in her research: consumers with extreme opinions of a product are more likely to write reviews and often want to vent their frustrations. Similarly, Cao et al. (2011) found that reviews expressing extreme opinions are more impactful than reviews with neutral or mixed opinions.


2.3 Problem Resolution & Model Building

2.3.1 Dependent Variable

For machine learning projects, it is important to label your data correctly. In other words, we should determine the dependent variable. A dependent variable is what you measure in the experiment and what is affected during the experiment. For review helpfulness prediction, I need to reasonably measure helpfulness based on the data I have.

There exist several different kinds of methods to determine whether the review is helpful or not. Different methods are determined by two main factors. On the one hand, the method is determined by the data collected. On the other hand, it is also determined by the researcher’s purpose.

Lee, Hu and Lu (2018) defined the review helpfulness as follows:

Helpfulness_i = HelpfulVotes_i / ElapsedDay_i

Where HelpfulVotes_i denotes the number of votes for review i, and ElapsedDay_i denotes the number of days between the date the review was posted and the date the review was crawled. The purpose of doing this is to normalize for the age of the review, because older posts would have greater opportunities to accumulate helpfulness votes. After the


In the study conducted by Kim and Arguello (2017), they used the threshold of

ξ = 2 in deciding whether a review is useful. Levi & Mokryn (2014) used a similar method with a threshold of ξ = 5. Kim & Arguello also tried setting the threshold at ξ = 5. The difference is that Levi & Mokryn classified all reviews with fewer than five usefulness votes as not_useful, while Kim & Arguello only considered reviews with zero usefulness votes as not_useful.


2.3.2 Model Building

Although researchers are all working on the problem of review helpfulness prediction, each has a different research perspective. After obtaining the required data and generating the required features, in the process of dismantling the problem and building the model, the specific method needs to serve the specific problem being studied.

Jyoti Prakash Singh et al. (2016) used an ensemble learning technique, the gradient boosting algorithm, to analyze the data. Ensemble learning employs multiple base learners and combines their predictions. The researchers applied this method for several reasons. Firstly, each data chunk was relatively small, so the cost of training a classifier on it was not high. Secondly, they saved a well-trained classifier instead of all the instances in the data chunk, which cost much less memory. They divided the total dataset into two parts: 70% of the data went into the training set and the rest into the testing set. For evaluation, they used MSE (mean squared error), the average squared difference between the estimated values and the actual values. They continued to push the limits of the model by increasing the number of trees, to find the optimal number of trees in the ensemble.


regression (LGR) and support vector machine (SVM). A DT is a tree structure that consists of multiple internal nodes and leaf nodes. It uses a simple tree structure to represent the rules between independent and dependent variables. An RF is an ensemble learning method developed by constructing multiple DTs. LGR is a nonlinear regression model which aims at predicting the probability of an event occurring by fitting data to a logistic function, thereby allowing inputs with any values to be transformed and constrained to a value between 0 and 1. Derived from statistical learning theory, SVM is currently one of the most effective methods for high-dimensional data and is widely used in classification analysis. The core concept of SVM is structural risk minimization, which minimizes the boundary error. After model building, they used precision, recall and other derived measures to evaluate their models. I will expand on this part in the next section.

Siering, Deokar and Janze (2018) also tried to build different predictive models to assess the ability of a review’s text contents to accurately predict a reviewer’s


consist of a variety of computational neurons appearing in interconnected input, hidden, and output layers, and are intended to mimic the behavior of human neural networks.

2.3.3 Related Issues Expansion



In this section, I will focus on the approaches I took. I will start with a description of the dataset I chose. After that, I will process and analyze the dataset, and build models based on my needs and on the findings from the earlier research above.

3.1 Dataset Description

3.1.1 Initial Dataset

The dataset I chose to use is the Amazon reviews dataset. There are several reasons for this choice. Firstly, Amazon is one of the most famous e-commerce sites in the world, so its reviews are quite representative. Secondly, since I want to study the problem of review helpfulness prediction across domains, Amazon is a good choice because e-commerce websites contain reviews of various categories. Last but not least, reviews on e-commerce websites are more informative than reviews on forums such as Reddit. Responses and reviews on forums are strongly affected by their topics, which makes it inconvenient to measure their quality based on characteristics.


3.1.2 Data Selection

The whole dataset is too big for me to use, so I chose the smaller 5-core dataset instead. Julian McAuley and his team reduced the data to extract the k-core, such that each of the remaining users and items has k reviews. From all the domains, I chose three: Grocery_and_Gourmet_Food, Digital_Music and Office_Products. I chose these three to make the span of the dataset as large as possible, so that in the later discussion I could look for relationships and differences between the helpfulness predictions for reviews of different categories.

3.2 Dataset Exploration

3.2.1 Dataset Overview

After getting the original dataset, the first thing I was concerned about was the helpfulness votes for each review. A basic overview of the three datasets is as follows:

                      Office_Products  Digital_Music  Grocery_and_Gourmet_Food
with 1 or more votes  50213            7379           77149
no votes              750144           162402         1066711
total                 800357           169781         1143860

Table 1 Dataset Overview

We can see that most of the reviews do not have a helpfulness vote. The distribution of the amount of data has a guiding effect on the labeling.

I explored the data from several different perspectives. The first is time. The time span of the dataset I use is large, and in order to use the method of Lee, Hu and Lu (2018), which evaluates helpfulness while taking the exposure time factor into


votes. I have done some basic exploration of that in the above part. In addition to the numbers, I tried to use word clouds to show the characteristics of the review text, including the differences between domains and which words appear frequently.

3.2.2 Dataset—Time Span

Figure 1 Initial Dataset

The above picture is a sample of the initial dataset of office product reviews. I used Python datetime to convert the date into a timestamp. After that, I sorted the dataset and found the time span. I did the same for the other datasets. The result is as follows:

                       Office_Products  Digital_Music  Grocery_and_Gourmet_Food
earliest comment date  1999/10/11       1998/7/9       2000/8/9
latest comment date    2018/10/2        2018/9/26      2018/10/2

Table 2 Time Span of Datasets


3.2.3 Dataset—Time Distribution

Studying time distribution of this dataset is of significance, especially for labeling. If I want to use the combination of votes and exposure time to label, I need to know which years of data to select from as a sample. The time distribution of three datasets are as follows:

Figure 1 Time Distribution for Office Products

Figure 2 Time Distribution for Digital Music
Figure 3 Time Distribution for Grocery and Gourmet Food

As we can see from the above figures, the time distributions of the three datasets share clear characteristics. Firstly, the data is very unevenly distributed over time; data from recent years accounts for most of it. Secondly, the latest data in the datasets comes from 2018, but there is not much data for 2018, mainly because the year had not finished at the time of the dataset’s release.

3.2.4 Dataset—Word Cloud

A word cloud is a visualization method that shows the keywords of a text. For the three different domains, I generated three word clouds to look at the text characteristics of these datasets.
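The quantity a word cloud visualizes is simply word frequency. A minimal sketch of the underlying counting, on made-up reviews and with a tiny illustrative stopword list (not the list used in the study):

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "it", "and", "of", "to", "i", "this"}

def top_terms(reviews, n=5):
    """Count word frequencies across reviews; a word-cloud library
    sizes each word by exactly these counts."""
    counts = Counter()
    for text in reviews:
        for word in text.lower().split():
            if word.isalpha() and word not in STOPWORDS:
                counts[word] += 1
    return counts.most_common(n)

reviews = ["Great pen and great paper", "The pen broke", "Paper is great"]
print(top_terms(reviews, 3))  # [('great', 3), ('pen', 2), ('paper', 2)]
```

Libraries such as the Python `wordcloud` package take frequencies like these and render them as an image.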



Figure 4 Office Product WordCloud


Figure 6 Grocery and Gourmet Food WordCloud


3.3 Dataset Processing

3.3.1 Data Labeling & Balancing

I used the following method to label data:

Helpfulness_i = HelpfulVotes_i / ElapsedDay_i


Where HelpfulVotes denotes the number of votes for review i, and ElapsedDay denotes the number of days between the date the review was posted and the date the review was crawled. Specifically, I calculated ElapsedDay by subtracting the time of each review from the day after the latest review date in each dataset. For example, in the office product dataset, if a review was released on 2017-10-11, I calculated the time difference between 2017-10-11 and 2018-10-3. I used 2018-10-3 instead of 2018-10-2 so that the denominators of the latest reviews would not be 0. After that, based on references, I sorted the helpfulness scores and took the top 1% to be useful reviews. For the digital music dataset, I made that fraction 5% because that dataset has less data. For the reviews that are not useful, considering data balance, I randomly extracted the same number of reviews as helpful reviews from the rest of the dataset. I chose not to simply pick the bottom 1% or 5% of the dataset: although the model might be more accurate that way, models trained like that cannot stand the test of complex situations, so I chose random sampling to make the dataset messier. Finally, I labeled useful data as 1 and useless data as 0. The overview of the three final datasets I used for machine learning model building is as below:

          office_final  music_final  grocery_final
useful    8004          8489         11439
useless   8004          8489         11439
total     16008         16978        22878

Table 3 Overview of Final Datasets
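The labeling scheme described above can be sketched as follows. The mini-dataset, the reference date handling and the cutoff fraction (a toy top-third rather than the real 1%/5%) are illustrative, not the actual data:

```python
from datetime import date

# Reference date: one day after the latest review, so the newest review
# never has an elapsed-day denominator of 0.
reference = date(2018, 10, 3)
reviews = [
    {"votes": 40, "posted": date(2017, 10, 11)},
    {"votes": 3,  "posted": date(2018, 9, 1)},
    {"votes": 0,  "posted": date(2016, 1, 1)},
]

# Helpfulness_i = HelpfulVotes_i / ElapsedDay_i
for r in reviews:
    elapsed = (reference - r["posted"]).days
    r["helpfulness"] = r["votes"] / elapsed

# Sort by score and label the top fraction as helpful (1), the rest as 0.
reviews.sort(key=lambda r: r["helpfulness"], reverse=True)
cutoff = max(1, int(len(reviews) * 0.34))  # toy fraction for 3 reviews
for i, r in enumerate(reviews):
    r["label"] = 1 if i < cutoff else 0
```

Note how the 40-vote review outranks the 3-vote one only after both are normalized by their exposure time.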

3.3.2 Data Preprocessing

Before building machine learning models, I need to preprocess the text data, which mainly includes some cleaning and regularization work of text data. During the process, I found that there are a few problems with some of the texts in the data, so I deleted them. The overview of datasets I finally used for feature engineering and model building are as below:

          office   music    grocery
useful    8000     8488     11437
useless   8002     8480     11437
total     16002    16968    22874

Table 4 Overview of Final Datasets (2.0)

3.4 Feature Engineering

The features I used fall into 4 different categories. The first is natural language features. These features are mined from the review text itself and are shown below:

feature name             feature meaning
word_count               the number of words in each review
word_allcaps_count       the number of uppercase words in each review

review_stopwords         the number of stopwords in each review
review_nonstopwords      the number of non-stopword words in each review
review_positive_words    the number of positive words in each review
review_negative_words    the number of negative words in each review
review_pos_sentiment     review positive score
review_neg_sentiment     review negative score
review_obj_sentiment     review objective score

Table 5 Natural Language Features

For the counts of positive and negative words, I used lists of positive and negative words found online. When computing review sentiment scores, I used the SentiWordNet database. SentiWordNet is a lexical resource for sentiment analysis; it assigns to each synset of WordNet three sentiment scores: positivity, negativity and objectivity.
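The word-list part of this feature set is a straightforward lookup. A minimal sketch with tiny stand-in lexicons (the study used full positive/negative word lists plus SentiWordNet scores, which are not reproduced here):

```python
# Tiny illustrative lexicons, not the real word lists.
POSITIVE = {"great", "excellent", "love", "good"}
NEGATIVE = {"bad", "broke", "terrible", "poor"}

def sentiment_word_counts(text):
    """Count positive and negative lexicon hits in a review."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return pos, neg

print(sentiment_word_counts("Great product, but the cap broke."))  # (1, 1)
```

The two counts become the review_positive_words and review_negative_words features; the sentiment-score features come from SentiWordNet instead of simple counting.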

The second kind of feature I used is review readability. I used the Flesch Reading Ease score based on the previous study. The Flesch Reading Ease score goes for:

FRE = 206.835 − (1.015 × ASL) − (84.6 × ASW)

In this formula, FRE is the score, ASL is the average sentence length in words (number of words divided by the number of sentences) and ASW is the average syllables per word.

Third, user characteristics. To accomplish this, for each of the three datasets, firstly, I traversed the initial dataset to get two dictionaries, called user_vote and


each user that appears in the final dataset based on the above two dictionaries and used them as features.

What’s more, I added unigram features with TF-IDF weights and Doc2Vec to the feature set. TF-IDF is one of the methods used to measure the importance of a word in a text. TF-IDF accentuates terms that are frequent in a document, but not frequent in general. IDF stands for Inverse Document Frequency, which goes for:

idf_t = log(N / df_t)

N = number of documents in the collection
df_t = number of documents in which term t appears

And TF-IDF goes for:

tf-idf_t = tf_t × log(N / df_t)
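The formula above can be implemented directly. A minimal sketch on two toy documents (the study built its vectors over the full review corpus, not this hand-rolled version):

```python
import math

def tfidf(docs):
    """tf-idf_t = tf_t * log(N / df_t), applied per document."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):          # document frequency counts each doc once
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        w = {}
        for term in doc:
            tf = doc.count(term)       # raw term frequency in this document
            w[term] = tf * math.log(n / df[term])
        weights.append(w)
    return weights

docs = [["great", "pen"], ["great", "great", "paper"]]
w = tfidf(docs)
# "great" appears in both documents, so idf = log(2/2) = 0 and its weight is 0;
# "pen" and "paper" each appear in one document, weight = 1 * log(2).
```

This is exactly the accentuation described above: terms that appear everywhere are zeroed out, while document-specific terms keep their weight.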

Doc2Vec is an unsupervised algorithm that can obtain vector representations of sentences, paragraphs and documents; it is an extension of Word2Vec. The fundamental idea behind Word2Vec is that when training language models such as Skip-Gram and CBOW, we obtain some by-products, namely the activated weights of the neural network during training, and those vectors can be used to represent the words. The difference between Word2Vec and Doc2Vec is that in Doc2Vec, instead of simply inputting word vectors, we add another vector, called the paragraph vector, together with the word vectors into the training set of the language model. Afterwards, the vector can be used to find the similarity between sentences, paragraphs and documents by calculating distance, and can also be used for text


Before using the above methods, I did another round of data preprocessing based on the final dataset. The main steps included lowercasing text, tokenizing text, removing punctuation, removing stop words, removing empty tokens, lemmatizing text, etc. After this preparation, I generated Doc2Vec vectors for every review using Python gensim, an open-source third-party Python toolkit. When training, I set the vector size to 100 and kept the other parameters at their default values. When adding TF-IDF columns, I only added columns for words that appear in at least 20 different reviews, to filter some of them and reduce the size of the final output, because too many features would weaken model training.

3.5 Model Building

Within the model building part, I decided to use three models: Logistic Regression, Naïve Bayes and RandomForest classifier. I implemented them in Python using the machine learning library scikit-learn. For the Naïve Bayes model, I used the MultinomialNB classifier, which fits NLP projects better.
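The three-model setup can be sketched with scikit-learn as follows. The data here is synthetic, a stand-in for the engineered review features, and the features are min-max scaled to the non-negative range that MultinomialNB requires; this is a sketch of the setup, not the study's exact configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the engineered feature matrix and labels.
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=10, random_state=0)
X = MinMaxScaler().fit_transform(X)  # MultinomialNB needs non-negative input

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": MultinomialNB(),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))
```

Keeping the three classifiers behind one dictionary makes it easy to fit and evaluate them on identical splits, which is what the per-dataset comparisons in the next section require.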



In this part, I will mainly focus on analyzing the results of the models and comparing them with each other. I adopted several evaluation metrics, including accuracy, precision, recall, F1-score, and ROC curve, to examine the effectiveness of all the models.

4.1 Evaluation Method Overview

(1) Accuracy

Accuracy is the most direct way to measure the prediction results. Accuracy goes for:

Accuracy = (number of accurately predicted instances) / (total number of instances)

(2) Precision & Recall

For machine learning prediction problems, precision is the fraction of predicted positive instances that are actually positive, while recall is the fraction of actual positive instances that are correctly predicted.

            Predicted 0       Predicted 1
Actual 0    True Negative     False Positive
Actual 1    False Negative    True Positive

Based on the above confusion matrix, precision goes for:

Precision = True Positive / (True Positive + False Positive)

Recall = True Positive / (True Positive + False Negative)

(3) F1-score

There is a trade-off between precision and recall. When we compare two different models and the two indicators conflict, it is difficult to choose between the models. The F1-score is one method of solving this problem.

The F1-score is an indicator used in statistics to measure the accuracy of a binary classification model. It takes both precision and recall into account and can be regarded as the harmonic mean of model precision and recall. F1-score goes for:

F1 = 2 × (precision × recall) / (precision + recall)
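All three derived metrics follow directly from the confusion-matrix counts. A minimal sketch on hypothetical counts (not taken from any table in this paper):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical classifier: 80 true positives, 20 false positives,
# 40 false negatives.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(p, r, f1)  # 0.8, 0.666..., ~0.727
```

Note how the harmonic mean pulls F1 (about 0.727) toward the weaker of the two component scores rather than splitting the difference arithmetically.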

(4) ROC Curve & AUC

ROC curve, which stands for receiver operating characteristic curve, is an important way to measure the effectiveness of models.


Each time we choose a different threshold, we get a pair of FPR (false positive rate) and TPR (true positive rate) values, which gives one point on the ROC curve.

AUC (area under curve) is the area under the ROC curve. As a numerical value, AUC can directly evaluate the quality of the classifier. The larger the value, the better the classifier is.
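Computing one point of the ROC curve is a small exercise in counting. A sketch on made-up scores and labels (the threshold, scores and labels are all illustrative):

```python
def roc_point(scores, labels, threshold):
    """TPR and FPR at one threshold: a single point on the ROC curve."""
    tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
    fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
    pos = sum(labels)
    neg = len(labels) - pos
    return tp / pos, fp / neg  # (TPR, FPR)

scores = [0.9, 0.8, 0.6, 0.4, 0.2]   # classifier confidence per review
labels = [1, 1, 0, 1, 0]             # ground-truth helpfulness
print(roc_point(scores, labels, 0.5))  # TPR = 2/3, FPR = 0.5
```

Sweeping the threshold from 1 down to 0 traces out the full curve, and integrating under it gives the AUC.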

4.2 Result

4.2.1 Office Product Dataset Result

When applying RandomForest Classifier to the office product review dataset, the result is as below (the number of support is the corresponding sample size):

precision recall f1-score support

not helpful 0.869 0.832 0.85 2373

helpful 0.843 0.878 0.86 2428

micro avg 0.855 0.855 0.855 4801

macro avg 0.856 0.855 0.855 4801

weighted avg 0.856 0.855 0.855 4801

Accuracy = 0.855

Table 6 Result of RandomForest using office dataset

When applying Logistic Regression model to the office product review dataset, the result is as below:

precision recall f1-score support

not helpful 0.844 0.857 0.851 2373

helpful 0.858 0.846 0.852 2428

micro avg 0.851 0.851 0.851 4801

macro avg 0.851 0.851 0.851 4801

weighted avg 0.851 0.851 0.851 4801

Accuracy = 0.851

Table 7 Result of Logistic Regression using office dataset


When applying Multinomial Naïve Bayes model to the office product review dataset, the result is as below:

precision recall f1-score support

not helpful 0.778 0.89 0.83 2373

helpful 0.875 0.751 0.809 2428

micro avg 0.82 0.82 0.82 4801

macro avg 0.826 0.821 0.819 4801

weighted avg 0.827 0.82 0.819 4801

Accuracy = 0.820

Table 8 Result of Multinomial Naïve Bayes using office dataset

4.2.2 Digital Music Dataset Result

When applying RandomForest Classifier to the digital music review dataset, the result is as below:

precision recall f1-score support

not helpful 0.794 0.866 0.828 2525

helpful 0.855 0.779 0.815 2566

micro avg 0.822 0.822 0.822 5091

macro avg 0.825 0.822 0.822 5091

weighted avg 0.825 0.822 0.822 5091

Accuracy = 0.822

Table 9 Result of RandomForest using music dataset

When applying Logistic Regression model to the digital music review dataset, the result is as below:

precision recall f1-score support

not helpful 0.764 0.845 0.803 2525

helpful 0.83 0.743 0.784 2566

micro avg 0.794 0.794 0.794 5091

macro avg 0.794 0.794 0.793 5091

weighted avg 0.794 0.794 0.793 5091


Table 10 Result of Logistic Regression Model using music dataset

When applying Multinomial Naïve Bayes to the digital music review dataset, the result is as below:

precision recall f1-score support

not helpful 0.713 0.934 0.808 2525

helpful 0.906 0.629 0.743 2566

micro avg 0.78 0.78 0.78 5091

macro avg 0.809 0.782 0.776 5091

weighted avg 0.81 0.78 0.775 5091

Accuracy = 0.780

Table 11 Result of Multinomial Naïve Bayes using music dataset

4.2.3 Grocery and Gourmet Food Dataset Result

When applying RandomForest Classifier to the grocery review dataset, the result is as below:

precision recall f1-score support

not helpful 0.832 0.791 0.811 3385

helpful 0.806 0.845 0.825 3478

micro avg 0.818 0.818 0.818 6863

macro avg 0.819 0.818 0.818 6863

weighted avg 0.819 0.818 0.818 6863

Accuracy = 0.818

Table 12 Result of RandomForest Classifier using grocery dataset

When applying Logistic Regression model to the grocery review dataset, the result is as below:

precision recall f1-score support

not helpful 0.828 0.814 0.821 3385

helpful 0.822 0.835 0.828 3478

micro avg 0.825 0.825 0.825 6863


weighted avg 0.825 0.825 0.825 6863

Accuracy = 0.825

Table 13 Result of Logistic Regression model using grocery dataset

When applying Multinomial Naïve Bayes to the grocery review dataset, the result is as below:

precision recall f1-score support

not helpful 0.691 0.935 0.795 3385

helpful 0.904 0.593 0.716 3478

micro avg 0.762 0.762 0.762 6863

macro avg 0.798 0.764 0.756 6863

weighted avg 0.799 0.762 0.755 6863

Accuracy = 0.762

Table 14 Result of Multinomial Naïve Bayes using grocery dataset

4.3 Comparison & Discussion

At the beginning of the discussion of the results, there are several things that need to be clarified. Firstly, about the Multinomial Naïve Bayes model. The Multinomial Naïve Bayes model does not accept negative feature input, so the dataset used to train it is a little different from that used for the other two models. Among all the above features, the negative input comes from Doc2Vec. When training this model, I did not include the Doc2Vec vectors, and all the other features stayed the same. As a result, the performance of Multinomial Naïve Bayes was influenced


on the inner comparison between the three domains instead of comparing it with the other two models. Secondly, there are three other notions within the above tables: micro avg, macro avg and weighted avg. Micro-averaging establishes a global confusion matrix over every sample in the dataset regardless of category, and then calculates the corresponding index. For example, the micro-averaging precision goes as:

P_micro = TP / (TP + FP) = (Σ_i TP_i) / (Σ_i TP_i + Σ_i FP_i)


Macro-averaging refers to the arithmetic mean of the statistical index values. The macro-averaging precision goes as:

P_macro = (1/n) × Σ_i P_i



The difference between macro-averaging and micro-averaging is that macro-averaging gives each class the same weight, while micro-averaging gives each sample the same weight.
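That difference is easiest to see on an imbalanced toy case, which the sketch below constructs (the per-class counts are made up for illustration):

```python
def micro_macro_precision(per_class):
    """per_class: list of (TP, FP) pairs, one per class."""
    micro = sum(tp for tp, fp in per_class) / sum(tp + fp for tp, fp in per_class)
    macro = sum(tp / (tp + fp) for tp, fp in per_class) / len(per_class)
    return micro, macro

# A frequent class predicted well (90/100) and a rare class predicted
# poorly (1/10):
micro, macro = micro_macro_precision([(90, 10), (1, 9)])
print(micro, macro)  # micro ≈ 0.827, macro = 0.5
```

The micro average is dominated by the frequent class, while the macro average weights both classes equally and exposes the weak rare-class performance. On the balanced datasets in this study the two averages nearly coincide, as the tables above show.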


Figure 7 ROC Curve of the best performance models (Office Product, Digital Music, Grocery and Gourmet Food in order)

The Digital Music dataset appears to be the one whose performance is most influenced by category. For the reviews which are not helpful in the digital music dataset,


for helpful reviews. The phenomenon is especially obvious when applying Naïve Bayes. And when combining them with the F1-score, for the digital music dataset we can tell that the models judge not-helpful reviews better.

Another thing we can find from the above results is that the difference between the micro-average and macro-average results is small. This is mainly because, before building the models, I made the different kinds of samples basically balanced. The comparison of micro average and macro average may play a bigger role when the data is not balanced, because in that case we can see an index which better reflects the real situation.

When using the RandomForest model, besides the prediction results, I also generated the feature importance of all the features and tried to find out which features influence the result most. The results are as below:

feature importance
doc2vec_vector_45 0.00851
doc2vec_vector_97 0.008234

Table 15 Feature Importance of the office product dataset

feature importance
user_past_vote 0.066577
doc2vec_vector_11 0.020785
neg_sentiment 0.020357
word_count 0.018438
obj_sentiment 0.017721
review_stopwords 0.017405
doc2vec_vector_85 0.017008
char_count 0.014594
review_nonstopwords 0.012903
doc2vec_vector_55 0.011294
doc2vec_vector_69 0.011005
doc2vec_vector_98 0.010563
sentence_count 0.010558
doc2vec_vector_68 0.009699
pos_sentiment 0.009471
doc2vec_vector_33 0.008903
doc2vec_vector_63 0.008666
doc2vec_vector_26 0.00865
doc2vec_vector_65 0.008473
doc2vec_vector_92 0.007907

Table 16 Feature Importance of digital music dataset

feature importance
user_past_vote 0.043724
char_count 0.031214
neg_sentiment 0.028445
review_stopwords 0.021726
review_nonstopwords 0.021003
obj_sentiment 0.016767
word_count 0.01674
doc2vec_vector_23 0.011824
doc2vec_vector_80 0.010674
doc2vec_vector_79 0.010625
doc2vec_vector_53 0.009335
sentence_count 0.008904
doc2vec_vector_60 0.008807
doc2vec_vector_17 0.008689
doc2vec_vector_90 0.008455
word_allcaps_count 0.008269
doc2vec_vector_2 0.008196
doc2vec_vector_59 0.007973
review_positive_words 0.007765

Table 17 Feature Importance of the grocery dataset

As we can see, user_past_vote is a highly important feature for the prediction of review helpfulness. For all three datasets, user_past_vote is the most important feature, and its importance score is clearly higher than that of the second-ranked feature.

user_past_vote is the number of votes the reviewer had received previously. This result indicates that it is of great significance to analyze a reviewer's history. Deeper signals may lie beneath the surface of the past votes received, including the user's writing style, willingness to participate, personality and so on. If we can build a more comprehensive user profile based on data associated with users, we could make more accurate predictions and even explore further uses of reviews.
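As an illustration, a user_past_vote-style feature could be derived from a review log as follows. This is a hedged sketch, assuming hypothetical column names and a cumulative-count definition, not the thesis's exact procedure:

```python
import pandas as pd

# Tiny synthetic review log; the column names are hypothetical
reviews = pd.DataFrame({
    "reviewer_id":   ["a", "a", "b", "a", "b"],
    "timestamp":     [1, 2, 1, 3, 2],
    "helpful_votes": [3, 0, 5, 2, 1],
})

reviews = reviews.sort_values(["reviewer_id", "timestamp"])

# Cumulative votes per reviewer, minus the current review's own votes,
# so each row only counts votes received strictly before it
reviews["user_past_vote"] = (
    reviews.groupby("reviewer_id")["helpful_votes"].cumsum()
    - reviews["helpful_votes"]
)
```

Subtracting the current row's votes prevents the feature from leaking the very quantity the model is trying to predict.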


reviews are more unstructured than office product reviews and grocery reviews, being filled with different kinds of symbols and special characters. So, for digital music reviews, when identifying the text category, the number of words matters more than the number of characters. For a similar reason, char_count plays a bigger role in review helpfulness prediction for office product reviews and grocery reviews than for digital music reviews. One strange finding is that the Flesch reading ease score does not appear in any of the three rankings; by conventional understanding, it should be an important indicator. The reason is worth further investigation.


What is more, I find that features associated with review sentiment are quite useful for review helpfulness prediction. The sentiment scores generated based on



In conclusion, this paper conducts an exploration of review helpfulness prediction using data from different domains. First, I went through related previous work to gather background knowledge as well as methods for feature engineering and model building. Second, I processed the data as needed, including data cleaning, data balancing and feature engineering. After that, I built different models on datasets from different domains. Finally, I evaluated the results using several metrics.







