Authors:
Sam Westra Adam Briles
Yelp Review Analysis
Introduction:
88 percent of consumers trust reviews they read online [1]. Yelp offers a platform on which millions of people have been able to critique businesses and services. Yelp’s platform, while effective in its own right, has also opened the door for popular websites that commonly refer to these reviews and averaged star ratings to make lists of the best and worst businesses. Professional or not, these reviews are read by prospective customers and help to shape the opinions of those who have never set foot in the establishment. This raises the question of whether any of this information is well informed. It is left to chance whether someone actually knows what they’re talking about when it comes to a particular business.
Looking at the ways prospective customers perceive reviews could help inform a business on how to properly respond to common complaints. A business that better understands how it is viewed in at least one medium can estimate how it is perceived as a whole, or alter its advertising to address the issues its community raises. It would be able to identify the reviews likely to gain the most traction and respond before too many potential customers have their opinions shaped by one or two particularly effective reviews. In certain cases, it could use reviews to better approach customer issues, potentially even to the point of altering aspects of how it does business.
Finding out what makes a review great is not simple. What makes a positive review good will differ from what makes a negative review good. We address the question of what makes a review great by assessing the characteristics of reviews. The fields we examine include the unique words in each category, review length, and word frequencies. With this data, we can infer a lot about the basic rhetoric used in a review. We also apply a few other tactics to screen out bad data. To know if our results are valid, we check them against existing research and try to assess them in terms of human behavior. We are gathering data about reviews in a way that hasn’t been done before, so there is no existing result we can compare against directly to check that our code is valid. However, we can look at reviews that have the discovered characteristics and ask peers if they think the review is helpful. It will also be easy to identify outliers if they exist. For example, “wonderful” should not be common in negative reviews. We did gain some valuable and surprising results from this dataset. Users prefer longer reviews written in simple vocabulary to flowery ones, and there are definite differences in what customers prefer in different review types. With this information, the problem space then shifts to creating a system that can identify a good positive or negative review. With such a system, reviews could be pushed to the top as they are written, before the community comes to a consensus.
Problem Characterization:
Finding what a reader likes in a review is not an easy task. Different rhetoric will be used for different review types. The length, word choice, and structure will differ greatly. Because of this, simply looking for frequently occurring words will not work if done on the entire review dataset. For example, “wonderful” and “dreadful” should probably not appear in the same review. Therefore, reviews must be differentiated from each other somehow to avoid finding conflicting rhetoric. People will score differently too. Some might think a five star review is a baseline, with stars only being docked if they are inconvenienced. Others might consider a three star review to be the standard. This makes it difficult to decide what to consider a positive review. In fact, age groups differ in how they view user reviews. Where younger people consider a business’ average rating to matter the most, older adults do not pay much attention to it [2]. Therefore, younger people could be more inclined to increase the useful score if the rating was high.
Reviews are long, and an aggregation of them takes up a large amount of storage. To get accurate information on all of them quickly, the IO bottleneck needs to be circumvented. Because of this, a cluster must be used to analyze a large set of reviews. However, the reviews will be distributed across the cluster, and words will be counted on different machines. Thus, finding the most frequently occurring words and ordering them appropriately will be difficult. Working in a distributed environment with this large dataset could also cause problems with the heap on the driver. Pulling too much data down to the driver to handle or store for analysis could easily crash it. Therefore, data will have to be stored sparingly. The best approach would be to never aggregate large amounts of data on the driver.
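The idea of counting words on separate machines and only combining compact results can be sketched in plain Python. This is an illustration only, not the Spark code used in the project: the lists below stand in for cluster partitions, and the function names are hypothetical.

```python
from collections import Counter
import heapq


def count_partition(reviews):
    """Count lowercase, whitespace-split words in one partition of reviews."""
    counts = Counter()
    for text in reviews:
        counts.update(text.lower().split())
    return counts


def top_words(partitions, k):
    """Merge the per-partition counts, then keep only the k most frequent words."""
    total = Counter()
    for part in partitions:
        # Only the compact word counts are combined, never the raw review text.
        total.update(count_partition(part))
    # Ties in count are broken by the word itself so the result is deterministic.
    return heapq.nlargest(k, total.items(), key=lambda kv: (kv[1], kv[0]))


# Two hypothetical partitions standing in for reviews held on different machines.
partitions = [
    ["great food great service", "slow service"],
    ["great coffee", "slow slow line"],
]
top_two = top_words(partitions, 2)
```

Because each partition is reduced to word counts before anything is merged, the combined structure grows with the vocabulary rather than with the total volume of review text.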
When trying to track down what keywords are used in a review, some words must be completely ignored. Looking for words like “the”, “of”, and “was” will not tell us much about the nature of a review. To really know what makes reviews good, uncommon words must be found. Filtering against a large dictionary of words would be computationally expensive, so a balance between performance and the number of words filtered must be found. Finding uncommon words will therefore involve filtering out uninformative words in some fashion.
Splitting reviews is not an easy task either. To find what words are in a review, the text must be split on something. Regex patterns are a great idea, but they quickly get complex. Simply splitting on whitespace returns strings with periods and other punctuation still attached, which lowers the exact counts of some words.
Verifying the results is going to be a big issue. Reviews are entirely subjective, so it is hard to judge whether the results returned are valid just by looking at them. Where one person might think long reviews are helpful, another might prefer short ones. Therefore, verifying by eye that the computed results are correct and the program is bug free will be difficult.
Dominant approaches to the problem:
One approach to increasing recommendation accuracy doesn’t use user-given scores at all. Instead, a rating is derived from the text body of a review. Users are then grouped by similarity using soft clustering techniques [3]. It was found that “...using textual information results in better review score predictions than those derived from the coarse numerical star ratings given by the users.” [3]. This work does not, however, take into account the useful, cool, or funny scores. It also doesn’t focus on finding what people think of as a good review. Instead, it looks to replace the star rating system to produce better review score predictions. This allows it to get around customer review bias and differences.
Another method attempts to recreate the process Amazon uses to handle its own reviews. Keywords prove a powerful tool for pairing reviews and products together and for searching through them to find relevant data. The method starts with natural language processing [4]: it tokenizes the reviews and tags the grammatical function of each token. Using an NLP framework, one could even handle words that are related to one another in order to find words of more importance. In this way, we would be able to parse the reviews as a whole more effectively, even down to recognizing the entities named within a review, such as understanding that a name like Walmart refers to a corporation because the rest of the sentence discusses its business processes. The drawbacks of this design relate solely to the deep-learning aspect of NLP frameworks: implementing a neural network in a way that was meaningful across the broad spectrum we were looking at would have been difficult within our time constraints.
Methodology:
One of the most difficult decisions we had to make was choosing how to split our reviews into categories. The data must be split to avoid collecting nonsense word occurrences. Words that differ greatly in tone should not usually appear in the same review. If the author is taking a negative tone, “wonderful” won’t appear next to “disgusting”. To solve this problem, we split our reviews into three categories: positive, average, and negative. This allowed us to pick out what makes each type of review great rather than getting a nonsense average of the whole dataset. A positive review was five stars, an average review was three to four stars, and a negative review was one to two stars. After splitting reviews by rating, we then split them by usefulness. This was binary: either useful or not. To define what was useful, we found the average useful value of the dataset, which was approximately three. Anything above the average was considered useful and anything below it wasn’t. We chose the average instead of requiring just one useful vote because people differ in what they prefer. One outlier might have different preferences compared to the norm. We didn’t want outliers swaying the data, so finding what the average person thinks is useful is the better option. The more votes, the safer it is to consider the review useful. Combining this definition of useful with the three rating categories resulted in six categories. Having six categories keeps conflicting words out of each group and identifies specifically what makes each review type useful. Going into this, we didn’t know whether people would prefer long reviews to short ones in general or only for negative reviews. Having the six categories helped us pick out these differences.
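As a rough illustration, the six-way split described above could be expressed in plain Python as follows. This is a sketch, not the project's Spark code: the function names and category labels are hypothetical, and treating a review with exactly the average number of useful votes as not useful is an assumption, since the text only covers values above and below the average.

```python
# Approximate dataset average of "useful" votes, as stated in the text.
USEFUL_THRESHOLD = 3


def star_category(stars):
    """Map a star rating to positive (5), average (3-4), or negative (1-2)."""
    if stars >= 5:
        return "positive"
    if stars >= 3:
        return "average"
    return "negative"


def review_category(stars, useful_votes):
    """Combine the star category with a binary usefulness label."""
    # Assumption: exactly the average counts as not useful.
    usefulness = "useful" if useful_votes > USEFUL_THRESHOLD else "not_useful"
    return f"{star_category(stars)}_{usefulness}"


examples = [review_category(5, 10), review_category(4, 3), review_category(1, 0)]
```

Three star categories combined with two usefulness labels give the six groups that the rest of the analysis is run on.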
With these reviews, there are naturally a lot of words that are useless when trying to find which word choices people enjoy. Articles, pronouns, and transition words won’t say much about the tone of a review. Therefore, they need to be excluded. Creating a massive list of these words by hand didn’t seem like a great idea, and we couldn’t find a suitable list online. We therefore settled on generating our own list from the data. After finding the words that occur most often across all of the reviews, we looked at how many words we could exclude before we started losing adjectives. At 110 words removed, we started catching adjectives. We ended up excluding an even 100 words, which caught very few adjectives. Excluding these 100 words had the added effect of making the sorted word list for each category show a clear tone. To actually filter the words out, we made a list of the 100 most common words and iterated through our six separate dataframes, dropping each piece of text that matched something in the unwanted words list.
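The data-driven stop list described above can be sketched in a few lines of plain Python. This is an illustration under assumptions rather than the project's implementation: the function names are hypothetical, and the tiny sample uses the top 2 words where the project used the top 100.

```python
from collections import Counter


def build_stop_list(reviews, n):
    """Return the n most frequent lowercase words across all reviews."""
    counts = Counter()
    for text in reviews:
        counts.update(text.lower().split())
    return {word for word, _ in counts.most_common(n)}


def filter_words(words, stop_list):
    """Drop every word that appears in the stop list, keeping the order."""
    return [w for w in words if w not in stop_list]


# Hypothetical sample data; the project derived its list from the full dataset.
reviews = ["the food was great", "the service was slow", "the place was the best"]
stop_list = build_stop_list(reviews, 2)
kept = filter_words("the pasta was amazing".split(), stop_list)
```

Storing the stop list as a set keeps each membership check cheap, which matters when every word of every review has to be tested against it.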
Reading the dataset in was no easy task. We initially tried to split a CSV file into fields by separating on the comma. This ruined the data, and everything in the dataframe was inconsistent, because the text of the reviews itself contains commas. After some digging, we found a JSON version of our Yelp review dataset, which read in without a hitch and solved our bad data problem.
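A small example shows why separating on commas breaks review text while JSON does not. The sample record below is made up for illustration, and its field names are assumptions rather than a statement of the dataset's exact schema.

```python
import json

# A hypothetical CSV row: stars, review text, useful votes.
csv_line = '5,"Great food, great service, will return",3'
# Splitting on every comma also cuts the review text into pieces.
naive_fields = csv_line.split(",")

# The same hypothetical record as one line of JSON.
json_line = '{"stars": 5, "text": "Great food, great service, will return", "useful": 3}'
# JSON delimits the string explicitly, so commas inside it are preserved.
record = json.loads(json_line)
```

The naive split produces five fields instead of three, so every column after the review text shifts, which matches the inconsistency described above.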
Splitting the words within the text was not easy, and we ultimately decided to simply split on whitespace. This leaves punctuation attached to some strings, creating several variants of some words. For example, “the” and “the.” count as two separate words when we compute the word count. We decided against solving this problem because keeping words with their surrounding punctuation gives additional context about how a word was used. There might be common sentence endings that people expect, and leaving the punctuation in helps us identify them. As a result, the count of unique words really measures unique words combined with their basic sentence placement. This provides us with more information about word use while skewing unique word counts a bit high, a tradeoff we were willing to make.
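The effect of this tradeoff on unique word counts can be seen in a short plain-Python comparison. The sample sentence is made up, and stripping punctuation with `str.strip` is shown only as the alternative that was decided against, not as something the project did.

```python
import string

text = "The food was great. I loved the food!"

# Whitespace splitting keeps trailing punctuation attached to words.
tokens = text.lower().split()
unique_raw = set(tokens)

# Alternative: strip leading and trailing punctuation from each token.
stripped = [t.strip(string.punctuation) for t in tokens]
unique_stripped = set(stripped)
```

Here "food" and "food!" are counted as different words under whitespace splitting, so the raw unique count is one higher than the stripped count.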
We gathered data on which words were used and their frequency, the average review length in characters, total unique words, average usefulness, average cool score, and average funny score for all six review categories. We included the average funny and cool scores because we wanted to know whether they correlate with a review being useful. We included length analysis to find out whether Yelp users like long, flowery reviews or short, to-the-point ones.
Almost everything we did in this project revolved around Spark’s SQL libraries. These libraries made it fairly painless to calculate averages for certain categories and count the occurrences of words. One of the most difficult pieces of this project was figuring out how to actually break the reviews apart into their words. We eventually settled on the explode function, which broke each review’s words out into distinct rows. Since all of these words were placed into a dataframe and distributed, our driver’s heap did not overflow. Initially we tried to separate everything into a list, but this did not work at all: the driver’s heap could not store all the information, so it had to stay distributed. With all the words broken down into rows, we just called count to find word occurrences. Then, to find the number of distinct words, we called count again. The first count produced a dataframe with two columns, the words and the number of times each occurred. Calling count on that result counts its rows, one per word, so duplicates are not counted. To avoid counting the same word separately under different capitalization, we converted every word to lowercase.
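The explode-then-count sequence can be mirrored in plain Python to make the two counting steps concrete. This is an analogue for illustration, not Spark code: the generator stands in for exploding a split text column into rows, and all names and sample reviews are hypothetical.

```python
from collections import Counter


def explode_words(reviews):
    """Yield one (review_id, word) row per word, lowercased."""
    for review_id, text in reviews:
        for word in text.lower().split():
            yield review_id, word


# Hypothetical (review_id, text) pairs.
reviews = [(1, "Great food great staff"), (2, "Food was cold")]
rows = list(explode_words(reviews))

# First count: occurrences of each word across all rows.
word_counts = Counter(word for _, word in rows)
# Second count: number of entries in that result, one per distinct word.
distinct_words = len(word_counts)
```

Lowercasing before counting is what lets "Food" and "food" land in the same entry.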
Experimental Benchmarks:
In our project, we only identified what information could be useful in differentiating useful and not useful reviews. We never built a filter or printed all the reviews in the dataset that matched our model. However, to confirm that our results were valid, we did go through the dataset and find a few reviews that matched the criteria we found. These criteria excluded matching words against what we found was typical for each category, but they did include a review length within 10% of 1000 characters, a cool score greater than or equal to three, and a funny score greater than or equal to two. Then, we asked a few of our peers whether the reviews that matched our metrics were more useful than reviews that didn’t. Five out of five of our peers said the reviews that we found were more useful than the reviews that didn’t match our criteria.
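The selection criteria used for this spot check can be written as a single predicate. The sketch below is in plain Python with hypothetical names; it assumes the length criterion refers to the whole review measured in characters and that the 10% band is inclusive at both ends.

```python
def matches_criteria(text, cool, funny, target_length=1000, tolerance=0.10):
    """Check length within 10% of the target, cool >= 3, and funny >= 2."""
    low = target_length * (1 - tolerance)
    high = target_length * (1 + tolerance)
    return low <= len(text) <= high and cool >= 3 and funny >= 2


# Hypothetical reviews represented only by their length and scores.
in_range = matches_criteria("x" * 950, cool=3, funny=2)
too_long = matches_criteria("x" * 1200, cool=5, funny=5)
low_cool = matches_criteria("x" * 1000, cool=2, funny=2)
```

A filter like this could select candidate reviews to show to peers, although the project itself only picked a few matching reviews by hand.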
Additionally, we compared our data against papers and reports on similar studies. In reviewing a paper on vocabulary, we found that, not only in reviews but in other forms of text as well, people generally try to expand their active vocabulary while writing [5]. This is done in order to create a sense of intelligence, and the person writing this is actively trying to avoid doing that right now. Our benchmark for a subtle effect like this was our vocabulary tracking, and the results agree with the source. There is a large difference in vocabulary range between useful and not useful reviews, and on average the useful reviews have less complex vocabularies. In terms of accuracy, we reminded ourselves, only after turning in the code, that a considerable number of words have punctuation attached to the front or end of them, inflating vocabulary counts by an amount we’re uncertain of. The effect should be fairly consistent across the board, however, since all of the reviews were parsed the same way and each category ended up with a similarly inflated vocabulary. With regard to usefulness in comparison to length, we found research showing that when content is longer, people are more likely to engage with it, and therefore to consider it more useful and necessary [6].
In the case of these reviews, reviews that were around 100 words in length were considered less useful on average than reviews that were 200 words or longer. Checking accuracy here was simpler than for the vocabulary. We simply took the average length of the reviews in each category and quickly saw the correlation between our data and the research Rob Marsh had written about [6]. We additionally took two reviews, one of which was longer and fairly basic in vocabulary, showed them to friends, and asked which one they found more useful overall. Each of them chose the longer, simpler review. In terms of actual
performance of our code, it takes approximately 10 minutes to run from the command line, provided that we don’t lose any executors along the way. Because of lazy execution, the only parts that take some real time to run are in the final printing phase. Here the work can be seen inside the job, working out the solutions one after the other. We’d hoped to see it map and print multiple jobs at once, but we couldn’t get it to run more than one across the cluster at a time.
Insights Gleaned:
Going into this project, nobody on our team had ever done any functional programming, and neither of us knew what was going on initially. However, by the end of the project we understood just how powerful it was. Being able to act on a massive dataset with just one function was extremely convenient. Chaining functions together made operating on the data far simpler than using MapReduce.
Our biggest issue was figuring out which words to filter out. We couldn’t find an online dictionary of words that we could import, and we didn’t want to systematically decide by hand which words of the English language to exclude. To get around this issue, we ended up just finding the top 100 words across all the reviews. The top 100 were almost all the words we expected: common articles and pronouns like “her” or “him” were in it. Going out to the top 120 words started catching words like “beautiful”. This led us to simply take the top 100 words and turn them into a list. Then, before printing out the most common words, we filtered out the top 100. This gave us acceptable results, with the top 100 words of our subcategories showing large differences in the most common words used. While not a perfect solution, it worked well.
Printing our output was difficult due to the nature of a distributed environment. The use of Spark made it hard to identify which words were in fact the most frequent. Twenty-four different files would be written for one output statement, and there was far too much data to sift through. Creating a list to iterate through and print the results to a file only resulted in overflowing our heap. We ended up discovering that limiting a dataframe caused all of the data to end up on one node. Limiting the dataframe also allowed us to easily write a text file to our HDFS. Discovering limit, though, forced us to understand how data is actually distributed, and that anytime something isn’t in an RDD or dataframe it is on one computer, which is what led the heap to overflow. In terms of speed, we couldn’t figure out which of persist’s options we were supposed to use for our data. We learned a lot about the intricacies of caching in the system and how well it could have performed with what we were doing, but there simply was not enough time left to implement it.
User reviews have picked up more and more traction over the years. Yelp is not the only service that uses them; marketplaces like Amazon and Steam do too. As consumers come to depend more and more on reviews, reviews need to be better. Customer reviews present an issue, though. If products and services rely on reviews to get consumers to buy, they can be hurt by negative reviews. Research has been done to understand how consumers best read these reviews, understanding that the star rating isn’t something that can be perfect. For consumers to best believe they are getting a good product, the star rating should be around 4.4 [7]. Review bombing has also become a problem over the last few years. Recently the studio Psyonix was purchased by Epic Games. Due to controversy around Epic Games and its storefront, Psyonix’s customer base review bombed its games on Steam. Customers reacted so aggressively because of false reporting: it was rumored that the popular game Rocket League would be removed from Steam’s storefront, though there were no such plans [8]. The rumor alone was enough to drop the game’s review score from 90% positive overall to 43% positive in its recent reviews. A rumor alone was able to hurt this product and, by extension, its publisher. The more popular reviews get with consumers, the more review bombing could damage a product’s sales. Malevolent actors could even organize to review bomb a product at release, which could ruin its lifetime sales.
This issue needs to be addressed quickly given how important reviews are becoming to the average person. Just as we found what a useful review looks like, malicious reviews need to be detected too. This isn’t simple, though. Words can’t simply be analyzed, because of people’s creativity: using character symbols, users have been known to create profane pictures. The future of this problem won’t simply be about finding useful reviews, but about detecting and stopping review bombing, especially as product owners put pressure on retailers to fix this problem. If a retailer doesn’t have a way to stop review bombing, why should a product maker risk having their product dip in sales if a destructive rumor starts to spread? Solving the review bombing problem could even give a retailer a competitive advantage. Product makers are far more likely to prefer a service that doesn’t allow review bombing, since there is less risk that their product sales will be negatively affected by a few people’s whims.
Conclusion:
From the information we gathered on what makes a Yelp review useful, we can say that useful reviews definitely have higher cool and funny scores than reviews that aren’t useful. Reviews that have fewer unique words also tend to be more useful. From looking at the words that occur the most, it is hard to tell what makes a positive review useful and what makes a positive review not useful; the words are quite similar for the two sets. The dataset did not have equal numbers of each review type, so our word count data might be a bit skewed. The useful and not useful subtypes always share similar words, with relatively similar occurrence counts. The data that most clearly indicates whether a review was helpful is whether it was rated funny or cool and whether it was relatively long. Almost all useful reviews were twice the length in characters of their not useful counterparts. Taken together, this suggests users favor detailed reviews that are funny and cool, but also relatively simple to read.
To find useful reviews in a dataset without relying on user-controlled statistics, textual analysis would have to be used. Length is a big indicator of usefulness, but it’s not even close to enough. Using the user-generated statistics for cool and funny, the words and sentence structures that make a review cool or funny could be found. Then, textual analysis could be performed on a newly written review to find out if it’s funny, cool, and potentially useful. The sets of words we found for each of the six categories could be used to find the tone of a text. If a lot of words match the high-star reviews, then the reviewer is likely giving a positive review. If this matches the stars the reviewer gave, they are providing a valid review. Then, if the text suggests it is funny or cool and the length is appropriate, the review could be posted at the top of the webpage to help consumers choose what to buy or where to eat.
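One very simple way the proposed tone check might look is sketched below in plain Python. This is speculative and was not built in the project: the word sets are made-up stand-ins for the per-category word lists, the function names are hypothetical, and the star thresholds reuse the positive (five stars) and negative (one to two stars) definitions from the methodology.

```python
def tone_from_words(words, positive_words, negative_words):
    """Guess the tone by comparing matches against two category word sets."""
    pos = sum(1 for w in words if w in positive_words)
    neg = sum(1 for w in words if w in negative_words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "unclear"


def tone_matches_stars(tone, stars):
    """Treat a review as consistent when its text tone agrees with its stars."""
    if tone == "positive":
        return stars == 5
    if tone == "negative":
        return stars <= 2
    return False


# Hypothetical word sets standing in for the category word lists.
positive_words = {"wonderful", "delicious"}
negative_words = {"dreadful", "disgusting"}
tone = tone_from_words(
    "the food was wonderful and delicious".split(), positive_words, negative_words
)
consistent = tone_matches_stars(tone, 5)
```

A real system would need far richer signals than word matching, but the sketch shows how the category word lists could feed a consistency check between the text and the star rating.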
Bibliography
[1] "Customer Reviews For SEO, Marketing & Sales | Trustpilot Business", Business.trustpilot.com, 2019. [Online]. Available: https://business.trustpilot.com/reviews/why-do-people-read-reviews-what-our-research-revealed. [Accessed: 3 May 2019].
[2] von Helversen, B., Abramczuk, K., Kopeć, W. and Nielek, R. (2018). Influence of consumer reviews on online purchasing decisions in older and younger adults. Decision Support Systems, 113, pp.1-10.
[3] Ganu, G., Kakodkar, Y. and Marian, A. (2019). Improving the quality of predictions using textual information in online user reviews.
[4] Towards Data Science. (2019). How to use Natural Language Processing to analyze product reviews?. [online] Available at: https://towardsdatascience.com/how-to-use-natural-language-processing-to-analyze-product-reviews-17992742393c [Accessed 3 May 2019].
[5] D. Oppenheimer, "Consequences of Erudite Vernacular Utilized Irrespective of Necessity: Problems with Using Long Words Needlessly", APPLIED COGNITIVE PSYCHOLOGY, vol. 20, pp. 139–156, 2006. Available: https://www.affiliateresources.org/pdf/ConsequencesErudite.pdf. [Accessed 3 May 2019].
[6] R. Marsh, "The data's in! Should you write short posts or long ones?", Copywriting for startups and marketers, 2019. [Online]. Available: https://copyhackers.com/2016/02/short-long-content/. [Accessed: 3 May 2019].
[7] N. Pesce, "This is exactly how many reviews it takes to get someone to buy something", MarketWatch, 2019. [Online]. Available: https://www.marketwatch.com/story/this-is-exactly-how-many-reviews-it-takes-to-get-someone-to-buy-something-2017-08-22-12883123. [Accessed: 3 May 2019].
[8] Saini, G. Gaming News, Reviews, and Articles - TechRaptor.net. (2019). Rocket League Review Bombed Following Epic Games Acquisition of Psyonix. [online] Available at: https://techraptor.net/content/rocket-league-review-bombed-following-epic-games-acquisition-of-psyonix [Accessed 2 May 2019].