4.7.5 Algorithmic Baseline Study
We now examine the performance of several baseline variants of C3EL, each ablating one system component. Tables 4.10 and 4.11 report the results obtained on ECB and ClueWeb09, respectively. Specifically, we study the effects of the following modules:
– Co-occurring Mentions: Removing co-occurring mention contexts during the creation of the mentions’ context summaries reduces the semantic information available for disambiguation and hence adversely affects both the NEL and CCR procedures. We thus observe a sharp decrease in CCR performance and also a degradation in NEL.
– Link Validation: The link validation step (with threshold τ) in C3EL filters mention-to-entity links by corroborating the mention context keywords with the features of the linked KB entity. This enhances the detection of new or emerging entities by reducing the noise introduced during the CCR phase. Removing this process permits aggressive entity linking and introduces noise, affecting new/emerging entity detection. From the above tables, on removal of the link validation step, we observe a nearly 20% reduction in accuracy (on both datasets) in the identification of out-of-KB entities compared to C3EL.
– NEL Categorization: The differentiation of mentions into classes (during NEL) using the confidence of their mapping to a KB entity reduces the collusion of “strong” linked mentions with other “noisy” mention contexts. This reduces incorrect grouping of different mentions with similar surface forms and contexts, thereby improving the precision of CCR. Eliminating the classification approach is observed to degrade the CCR results, which in turn increases spurious entity linkage and decreases NEL efficiency.
– Distant KB features: As observed by [Baker, 2012; Zheng et al., 2013], extracted external KB features provide global and enhanced information cues that promote CR. We similarly observe that CCR attains the lowest F1 scores (compared to the other baselines) when the KB features are ignored. This in turn affects the linking of (some) well-known entities due to reduced context, leading to incorrect or low-confidence NEL. Since no feature inclusion is performed for out-of-KB mentions (due to failed link validation), no effect is observed for such entities.

Table 4.11 – CCR and NEL results (%) on ClueWeb09 for different baselines of C3EL.
                                 CCR results             NEL results
                                                         Within-KB             Out-of-KB
Baselines                        P      R      B3 F1     C      I      U       C      I
Ignored Mention Co-occurrence    69.3   72.2   70.7      83.8   14.6   1.6     80.6   19.4
Link Validation (τ) ignored      74.8   81.0   77.8      88.9   10.1   1.0     69.8   30.2
Removed NEL Classification       70.1   77.6   73.6      86.1   12.3   1.6     79.5   20.5
Distant KB feature dropped       66.4   72.9   69.5      85.4   13.0   1.6     83.7   16.3
C3EL                             75.8   81.4   78.5      88.3   10.1   1.6     83.7   16.3
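As a sanity check on the CCR columns of Table 4.11, the short sketch below recomputes F1 from the reported P and R values. It assumes that the B3 F1 column is the harmonic mean of the B3 precision and recall columns; the function name `f1` and the `rows` dictionary are illustrative only. Small deviations from the reported values are expected, since P and R are themselves rounded.

```python
# CCR precision, recall, and reported B3 F1 (in %) copied from Table 4.11.
rows = {
    "Ignored Mention Co-occurrence": (69.3, 72.2, 70.7),
    "Link Validation (tau) ignored": (74.8, 81.0, 77.8),
    "Removed NEL Classification": (70.1, 77.6, 73.6),
    "Distant KB feature dropped": (66.4, 72.9, 69.5),
    "C3EL": (75.8, 81.4, 78.5),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall (assumed definition of F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Recompute F1 for every baseline and print it next to the reported value.
recomputed = {name: f1(p, r) for name, (p, r, _) in rows.items()}
for name, (_, _, reported) in rows.items():
    print(f"{name:32s} reported={reported:5.1f} recomputed={recomputed[name]:6.2f}")
```

Under this assumption, the recomputed ordering matches the discussion above: the full C3EL system scores highest and the variant without distant KB features scores lowest.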
Discussion: Hence, from the above empirical setup and evaluations, we observe that the joint CCR-NEL formulation in C3EL encompassing multiple information sources (from source text and external KB) and noise filtering (by link validation) enables global information propagation across the iterations, thereby providing mutually enhanced performance for both CCR and NEL.
4.8 Summary
This chapter presented the novel C3EL framework, the first approach for the joint computation of cross-document co-reference resolution (CCR) and named entity linking (NEL). Our approach utilizes context summaries, including co-occurring mention contexts and external KB features, allowing for global context and feature propagation across documents, together with link validation for the precise detection of out-of-KB entities. The iterative approach embedded in the interleaved CCR-NEL stages enables information feedback between CCR (providing corpus-wide information cues) and NEL (providing distant KB features), enhancing the performance of both tasks and yielding highly accurate identification of new or emerging entities. Experimental results on large news and Web data demonstrate the robustness and performance gains of our framework compared to existing methodologies.
5 CREDIBILITY OF ENTITY-CENTRIC TEXTS
Knowledge base (KB) construction entails the efficient representation of extracted entity-related facts and relationships. The quality of the information about entities obtained from the input corpus thus plays a pivotal role in the overall applicability of the KB. In this setting, the precise categorization of entity-centric textual information as credible or non-credible poses a significant challenge, given the diversity and subtle introduction of possibly spam, irrelevant, biased, and fake information.
This chapter presents a novel methodology, based on language and temporal models, to efficiently identify facts and information of low credibility. The proposed method harnesses features derived from texts, associated user/reader ratings and sentiments, and publication timestamps to train classifiers that label extracted textual snippets as credible or not. Experimental results on large real-life datasets demonstrate significant improvements in classification accuracy over state-of-the-art approaches.
5.1 Introduction
Motivation. The extraction of accurate and meaningful entity-based facts and entity-entity relationships from news articles, blogs, and forums forms the bedrock of large knowledge base construction procedures and of their applicability to search-related applications.
Given the vast amount of data generated across diverse domains and the popularity of social media, there has been an unfortunate increase in the proportion of non-credible documents and facts: fake (aimed at the promotion/demotion of entities), incompetent (irrelevant), biased (distorted), or sensationalized. In fact, recent studies¹ found that a majority of web users tend to share news and information without reading or verifying the article details. This serious trend is even more pronounced in customer-centric domains such as product, service, and travel review forums (e.g., TripAdvisor, Yelp, and Amazon), wherein manipulative/deceptive item reviews amounted to nearly 20%, with a further 16% of user reviews deemed “not recommended” by Yelp [Luca and Zervas, 2015]. Hence, entity-centric facts extracted from such sources (and represented in KBs) might severely degrade the reliability of the KB, necessitating the identification of potentially non-credible information.
Several approaches geared towards fact checking [Metzger and Flanagin, 2013; Wu et al., 2014] have been proposed to alleviate the problem. In this work, we aim to assess the credibility of natural language texts before fact extraction, which can then be combined with existing fact verification methodologies to provide a robust framework for constructing clean knowledge repositories. Owing to the scarcity of labeled credibility data in most domains, this work primarily focuses on the detection of deceptive review texts in customer-oriented product/service review portals such as TripAdvisor and Yelp. We later show that the proposed method is domain-independent and can hence be transferred easily to other text-based scenarios, thereby addressing the current lack of labeled training data.
Existing research on this topic has cast the problem of review credibility as a binary classification task: a review is either credible or deceptive. To this end, supervised and semi-supervised methods relying on features about users, entities, and activities have been proposed [Jindal and Liu, 2008]. However, information about user histories and activities is not available in many scenarios, for example in the case of “long tail” items or users. On the other hand, language-based approaches [Mihalcea and Strapparava, 2009; Ott et al., 2011; Ott et al., 2013] consider word-level unigrams or bigrams as features to learn latent topic models and classifiers (e.g., [Li et al., 2013]). User activity and the deviation of user behavior from the mean/majority ratings have also been used by the industry [Mukherjee et al., 2013a], but this tends to over-emphasize trusted long-term contributors and suppress outlier opinions. All these approaches rely on aggregated metadata, and are thus hardly viable for cross-domain adaptation or for new items with very few reviews, which are often written by less active users or newcomers to the community.
Problem Statement. In this chapter, we aim to efficiently detect non-credible entity review texts with limited information, that is, in the absence of rich data about user histories and community-wide correlations, and for “long tail” items (with sparse review texts and ratings), thereby providing domain independence. Interestingly, prior methods shown to achieve high classification accuracy do not provide any interpretable evidence as to why a certain review is classified as non-credible. Our goal is thus not only to compute a credibility score for review texts, but also to provide interpretable evidence explaining why certain reviews have been categorized as non-credible.
5.1.1 Approach and Contributions
The proposed method efficiently performs credibility analysis of entity reviews by exploiting inconsistencies across features derived from the user item review sentiments and the corresponding item ratings. Further, temporal “burst” features – where a number of extreme reviews are written within a short span of time – are also fed to Support Vector Machines for obtaining credibility scores for reviews and identifying possible causes leading to a review being categorized as deceptive.
To this end, the novel components of the proposed approach are:
– a classification model based on extracted feature vectors from limited item-user metadata across items and users, to compute review credibility score for detecting non-credible reviews.
– a novel notion of interpretable evidence for entity texts based on language models, sentiment, timestamps, and ratings, to characterize why a review is deemed deceptive.
The above features are used to identify, score, and highlight inconsistencies that may appear between reviews, ratings, and the community’s overall characterization of an item, for classifying item reviews as credible or otherwise. In a nutshell, the major contributions of this chapter are as follows:
– a novel consistency model for credibility analysis of reviews that works with limited information, with particular attention to “long tail” items, and offers interpretable evidence for reviews classified as non-credible (Section 5.3);
– an investigation of how credibility scores and the learnt model can be transferred across different domains and communities, thereby addressing the scarcity of labeled training data (Section 5.3.4); and
– an experimental evaluation on TripAdvisor and Yelp datasets to demonstrate the viability and advantages of the proposed method over state-of-the-art baselines in terms of classification accuracy and interpretable evidence (Section 5.4).
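To make the notion of consistency features more concrete, the following minimal sketch derives three features per review from ratings, sentiment scores, and timestamps only. It is an illustration under stated assumptions rather than the feature set of this chapter: the `Review` fields, the 1 to 5 rating scale, the sentiment range [-1, 1], the 3-day burst window, and all function names are hypothetical. The resulting vectors could then be passed to a classifier such as a Support Vector Machine.

```python
from dataclasses import dataclass


@dataclass
class Review:
    rating: int       # star rating on a 1..5 scale (assumed)
    sentiment: float  # text sentiment in [-1, 1] from any sentiment scorer (assumed)
    day: int          # publication time as a day index (assumed)


def rating_sentiment_gap(review):
    """Absolute gap between rating and sentiment after mapping both to [0, 1]."""
    rating01 = (review.rating - 1) / 4
    sentiment01 = (review.sentiment + 1) / 2
    return abs(rating01 - sentiment01)


def burst_size(review, item_reviews, window_days=3):
    """Count extreme (1- or 5-star) reviews of the item posted within the window.

    The count includes the review itself when it is extreme.
    """
    return sum(
        1
        for other in item_reviews
        if other.rating in (1, 5) and abs(other.day - review.day) <= window_days
    )


def feature_vector(review, item_reviews):
    """Hypothetical features: text/rating gap, deviation from item mean, burst size."""
    item_mean = sum(r.rating for r in item_reviews) / len(item_reviews)
    return [
        rating_sentiment_gap(review),
        abs(review.rating - item_mean),
        burst_size(review, item_reviews),
    ]


# Toy item: three 5-star reviews within three days, one with negative text.
reviews = [
    Review(rating=5, sentiment=0.9, day=10),
    Review(rating=5, sentiment=-0.6, day=11),
    Review(rating=5, sentiment=0.8, day=12),
    Review(rating=2, sentiment=-0.5, day=40),
]
vectors = [feature_vector(r, reviews) for r in reviews]
```

In this toy example, the second review combines a top rating with negative text and falls inside a burst of extreme ratings, so it receives both a large rating-sentiment gap and a large burst size, whereas the last review is consistent and isolated.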
5.2 Related Work
Existing approaches for fake review and opinion spam detection have primarily focused on two different aspects of the problem:
Linguistic Analysis [Mihalcea and Strapparava, 2009; Ott et al., 2011; Ott et al., 2013] – These approaches exploit the distributional difference in the wording of authentic and manually created fake reviews, using word-level features to learn latent topic models and classifiers [Li et al., 2014b]. However, the artificially created fake review datasets (written by Amazon Mechanical Turk workers) used for the studied tasks were shown to give away explicit features that are not dominant in real-world data. This was confirmed by a study on Yelp filtered reviews [Mukherjee et al., 2013a], where n-gram word-level language features together with specific lexicons (e.g., the LIWC psycholinguistic lexicon [Pennebaker et al., 2001] and WordNet Affect [Strapparava and Valitutti, 2004]) performed poorly. Additionally, linguistic features such as text sentiment [Yoo and Gretzel, 2009], readability scores (e.g., Automated Readability Index (ARI), Flesch reading ease, etc.) [Hu et al., 2012], textual coherence [Mihalcea and Strapparava, 2009], and rules based on Probabilistic Context Free Grammar (PCFG) [Feng et al., 2012] have also been studied.
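As an illustration of a readability feature, the sketch below computes the Automated Readability Index using its standard formula, ARI = 4.71 · (characters / words) + 0.5 · (words / sentences) − 21.43. The tokenization is a simplifying assumption (regular-expression word matching and sentence splitting on terminal punctuation), and the function name is illustrative; the cited works may preprocess text differently.

```python
import re


def automated_readability_index(text):
    """ARI = 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43."""
    # Words: runs of letters/digits, optionally with one internal apostrophe part.
    words = re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)
    # Sentences: non-empty segments between terminal punctuation marks.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    # Characters: letters and digits only (apostrophes are not counted).
    characters = sum(len(re.sub(r"[^A-Za-z0-9]", "", w)) for w in words)
    return 4.71 * characters / len(words) + 0.5 * len(words) / len(sentences) - 21.43


simple = automated_readability_index("The room was clean. The staff was kind.")
dense = automated_readability_index(
    "Notwithstanding considerable infrastructural deficiencies, "
    "accommodation remained satisfactory."
)
print(f"simple={simple:.2f} dense={dense:.2f}")
```

Longer words and longer sentences both raise the score, so the second example text is rated as considerably harder to read than the first.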
Rating and Activity Analysis – In the absence of proper ground-truth data, prior works make simplistic assumptions about non-credibility, e.g., that duplicates and near-duplicates are fake, and make use of extensive background information such as brand name, item description, user history, IP addresses, and location [Jindal and Liu, 2007; Jindal and Liu, 2008; Lim et al., 2010; Wang et al., 2011; Mukherjee et al., 2012; Mukherjee et al., 2013b; Mukherjee et al., 2013a; Li et al., 2014a; Rahman et al., 2015] to train regression models on extracted features that classify reviews as credible or deceptive. Similar works in this area also consider ad-hoc features such as extreme ratings, user activity (number of posts, friends, etc.), review length, rating deviation from the community mean, burstiness, and simple language features (such as content similarity and the presence of literals, numerals, capitalizations, and POS tags) for learning models. Although such approaches perform quite well in practice, their reliance on extensive and aggregated features limits their application to broader domains, where the related information is often lacking. In contrast to these works, our approach uses limited information about users and items to construct several consistency features harvested primarily from user ratings and review texts only, thereby catering to a broad domain of applications. Further, none of the existing approaches provides an interpretation as to why a review should be deemed non-credible, which we aim to tackle in this chapter.
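The near-duplicate assumption and the content-similarity feature mentioned above can be illustrated with a small sketch based on Jaccard similarity over word bigrams. This is one common way to measure content overlap and is not necessarily the measure used in the cited works; the shingle size, the threshold of 0.5, and all function names are assumptions.

```python
def shingles(text, k=2):
    """Set of word k-grams (shingles) of a lower-cased, whitespace-tokenized text."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(texts, threshold=0.5):
    """Index pairs (i, j), i < j, whose shingle sets reach the similarity threshold."""
    sets = [shingles(t) for t in texts]
    return [
        (i, j)
        for i in range(len(texts))
        for j in range(i + 1, len(texts))
        if jaccard(sets[i], sets[j]) >= threshold
    ]


texts = [
    "great hotel with a lovely pool and friendly staff",
    "great hotel with a lovely pool and helpful staff",
    "the food was cold and the service slow",
]
pairs = near_duplicates(texts)
print(pairs)
```

The pairwise comparison is quadratic in the number of reviews, which is acceptable for a single item but would call for hashing-based approximations on a full corpus.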
Learning to Rank – Supervised models have also been developed to rank items based on constructed item feature vectors [Liu, 2009]. Such techniques optimize measures such as Discounted Cumulative Gain, Kendall's tau, and Reciprocal Rank to generate item rankings similar to those in the training data. A related area of study involves the re-ranking of items according to their proper rating, wherein the “credible” reviews are implicitly gauged from the constructed feature vectors.
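For reference, two of the ranking measures named above can be computed as in the following sketch. It uses one common formulation of Discounted Cumulative Gain, with gain rel_i discounted by log2(i + 1) at rank i, together with its normalized variant and the reciprocal rank of the first relevant item; other formulations (e.g., with gain 2^rel − 1) exist, and the function names are illustrative.

```python
import math


def dcg(relevances):
    """Discounted Cumulative Gain: sum of rel_i / log2(i + 1) over ranks i = 1..n."""
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def reciprocal_rank(relevances):
    """1 / rank of the first item with positive relevance, or 0.0 if there is none."""
    for pos, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / pos
    return 0.0


# Graded relevance of items in the order produced by some ranker.
ranking = [3, 2, 0, 1]
print(f"DCG={dcg(ranking):.4f} nDCG={ndcg(ranking):.4f} RR={reciprocal_rank(ranking):.2f}")
```

A ranking that already lists items in descending order of relevance attains a normalized score of 1, and any other ordering of distinct relevance grades scores lower.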