4.4 Analysis of Content Credibility Evaluations
4.4.2 Consensus and Controversy
An obvious way to assess the credibility of a given Website is to ask an expert for an assessment. The content being assessed must, of course, belong to the assessing expert’s field of expertise. If a group of experts is assembled to make an assessment, their joint decision should, through synergy, be smarter than that of the smartest expert in the group. However tempting expert assessments may be, experts are often unavailable. Even so, a group of lay people with different levels of expertise can still produce meaningful assessments according to the wisdom of crowds principle: “When our imperfect judgments are aggregated in the right way, our collective intelligence is often excellent” [254]. The concept of harvesting the wisdom of crowds has become influential and widely used; however, it remains a subject of criticism concerning the limitations of crowd wisdom and its proper application [131].
There are several requirements to be met in order to achieve a wise crowd. According to Surowiecki [254], these are diversity, independence and decentralization. In other words, the crowd needs to consist of individuals with different levels of knowledge who express their private opinions without influence from others, drawing on their local knowledge and private information. These conditions may be difficult to satisfy because reputation systems frequently lack robust control over the diversity of the pool of respondents. If the crowd does not meet all the conditions, and specifically if it consists of uninformed members only, we cannot expect optimal decisions to be made. With regard to diversity in collaborative environments, there is also a possibility that a minority of well-informed experts is marginalized by a majority of less informed lay people, so that the constructive feedback of the knowledgeable vanishes in the noise [80].
According to Lanier, the answers that the crowd is asked to give should be no more complicated than a single number or value [130]. Taleb draws attention to the limits of crowd wisdom, which should not be applied to questions with complex outcomes and unknown distributions [256]. So what kind of task is Web credibility assessment? Surowiecki presents three types of problems the crowd is capable of solving: cognition, coordination and cooperation. Credibility assessment is a cognition task in which the crowd is asked whether a given Website is credible. In this type of task, well-formulated questions should have a single right answer [254], which unfortunately is not always the case for Web credibility assessment. In the majority of cases aggregation of the crowd responses works surprisingly well; however, there are cases in which aggregation should be avoided, and we should accept that no meaningful assessment can be produced.
When evaluating credibility, we need to consider the thematic category of the evaluated content, as the crowd approach works best for general knowledge content. Given the strong subjective component of credibility evaluation, several subjects need to be treated with caution. Such subjects are inherently controversial and include sex, religion and other culture-dependent taboos. Human beliefs are hardly amenable to an assessment that is expected to produce a single right answer. For example, it is so far impossible to objectively prove or disprove the existence of God. Another example comes from the Reconcile studies dataset. Among the many categories of pages labeled by crowdsourcing, there was “Cannabis”, a category of pages concerning the use of marijuana. The distribution of the credibility ratings gathered in this category visibly differs from the overall distribution, as depicted in Figure 4.9. The “Cannabis” category received an almost uniform distribution of ratings (except for rating 1). This is not surprising, as subjects related to drug use are widely considered “controversial”.
FIGURE 4.9: Distribution of credibility ratings in “Cannabis” category in comparison to overall distribution of credibility.
Surowiecki says, “Diversity and independence are important because the best collective decisions are the product of disagreement and contest, not consensus or compromise” [254]. As mentioned in the preceding paragraphs, this rule does not apply well to all kinds of problems one might wish to solve using a wise crowd. For Web content assessment, one should carefully monitor the thematic categories of the items selected for evaluation. When the number of pages is high or the selection method is automated, this becomes a difficult task because of the diversity of content on the Web. One solution is to present, at the aggregation step, a personalized result reflecting the current user’s standpoint and preferences. Another solution to this dilemma is to monitor the level of agreement among the respondents for the assessed item. Figure 4.10 shows possible extreme rating distributions. Perfect agreement, on the left, has all the responses concentrated in one class. No agreement, in the middle, is depicted as a uniform distribution of the ratings, and polarization is presented on the right. The perfect polarization depicted in Figure 4.10 is an even split of the ratings between two distant and opposite classes, which later in this section will be interpreted as strong controversy.
FIGURE 4.10: Levels of agreement depending on the ratings distribution.
Respondents rating Web content credibility are likely to be asked to place their perceptions on a scale defined by two polar opposites reflecting non-credibility and credibility. The concentration of those reported perceptions can be referred to as the agreement on the item’s credibility [270]. Most likely, the scale used to measure such opinions will be an ordered scale of the Likert type, on which the measurement of concentration or dispersion needs to be handled properly. Assuming an interval scale for the Likert categories and using, e.g., the standard deviation risks leading to false conclusions [106]. Thus, consensus measures for ordinal variables should be used instead to capture the extent of inter-rater agreement. The agreement or ordinal consensus measures of Van der Eijk [270], Leik [140] and Tastle and Wierman [263] can be used for this purpose. Such agreement measures are typically normalized from 0, representing polarization, to 1, representing perfect agreement, and thus effectively indicate controversy.
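As a minimal sketch of how such measures can be computed, the Python below implements Leik's ordinal dispersion D, reported as consensus 1 − D, alongside the Tastle and Wierman consensus. The function names, the 5-point scale and the example counts are assumptions made for illustration. Leik's measure is sometimes rescaled to other ranges; the 1 − D form is used here only because it matches the 0-to-1 range described above.

```python
import math
from itertools import accumulate


def leik_consensus(counts):
    """Return 1 - D, where D is Leik's ordinal dispersion.

    counts[i] is the number of ratings in the i-th ordered category.
    """
    m = len(counts)
    n = sum(counts)
    if m < 2 or n == 0:
        raise ValueError("need at least two categories and one rating")
    # d_i = min(F_i, 1 - F_i) for each cumulative relative frequency F_i.
    # Working on integer counts avoids floating-point drift in F_i.
    total_d = sum(min(c, n - c) for c in accumulate(counts))
    dispersion = 2 * total_d / (n * (m - 1))
    return 1 - dispersion


def tastle_wierman_consensus(counts):
    """Return Cns = 1 + sum_i p_i * log2(1 - |x_i - mean| / (m - 1)).

    Categories are scored 1..m (an assumption of this sketch).
    """
    m = len(counts)
    n = sum(counts)
    if m < 2 or n == 0:
        raise ValueError("need at least two categories and one rating")
    mean = sum(x * c for x, c in enumerate(counts, start=1)) / n
    width = m - 1
    cns = 1.0
    for x, c in enumerate(counts, start=1):
        if c:  # empty categories contribute nothing and would risk log2(0)
            cns += (c / n) * math.log2(1 - abs(x - mean) / width)
    return cns


# Hypothetical rating counts on a 5-point scale, mirroring the three
# extreme cases of Figure 4.10.
agreement = [0, 0, 0, 0, 20]
uniform = [4, 4, 4, 4, 4]
polarized = [10, 0, 0, 0, 10]

for name, counts in [("agreement", agreement),
                     ("uniform", uniform),
                     ("polarized", polarized)]:
    print(name, round(leik_consensus(counts), 3),
          round(tastle_wierman_consensus(counts), 3))
```

On a 5-point scale, both measures give 1 for perfect agreement and 0 for an even split between the two extreme classes, while a uniform distribution yields a Leik consensus of 0.4.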
FIGURE 4.11: Distribution of Leik consensus values.
Given a training dataset covering controversy and credibility or trustworthiness assessments, it is feasible to build a controversy classifier. One possible source of training data is Wikipedia’s Article Feedback Tool (AFT), an internal Wikipedia survey that engages readers in the assessment of article quality. However, as depicted in Figure 4.11, consensus measures, specifically Leik consensus, already perform well at discriminating between controversial and non-controversial pages based on the distributions of user ratings. There is a visible concentration of controversial pages at consensus values below 0.4, which is the threshold for polarization.
An agreement measure can be used to monitor the potential polarization of the raters’ credibility perceptions. Strong polarization may be a sign of controversy, a state in which two opposite outlooks on the same matter exist. Ratings concerning controversial content should not be aggregated. In the Reconcile Web Credibility Corpus, the pages considered controversial on the basis of their ratings distribution amount to about 5% of all evaluated pages, as depicted in Figure 4.12.
FIGURE 4.12: Reconcile Web Credibility Corpus median credibility ratings distribution, including “controversial” tag.
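The rule that controversial ratings should not be aggregated can be sketched as follows: compute a consensus value for a page's ratings and report a median rating only when the consensus is not below the threshold. This is an illustration under stated assumptions, not the Reconcile pipeline. The function names, the 5-point scale, the use of the lower median and the reuse of the 0.4 cutoff mentioned for Leik consensus are all choices made here.

```python
from collections import Counter
from itertools import accumulate
from statistics import median_low

SCALE = (1, 2, 3, 4, 5)        # assumed 5-point Likert-type scale
POLARIZATION_THRESHOLD = 0.4   # cutoff quoted in the text for Leik consensus


def leik_consensus(ratings, scale=SCALE):
    """Return 1 - D (Leik's ordinal dispersion) for a list of ratings."""
    if not ratings:
        raise ValueError("no ratings to evaluate")
    if any(r not in scale for r in ratings):
        raise ValueError("rating outside the scale")
    tally = Counter(ratings)
    counts = [tally[x] for x in scale]
    n = len(ratings)
    # d_i = min(F_i, 1 - F_i), computed on cumulative integer counts.
    total_d = sum(min(c, n - c) for c in accumulate(counts))
    return 1 - 2 * total_d / (n * (len(scale) - 1))


def aggregate(ratings, threshold=POLARIZATION_THRESHOLD):
    """Return (label, consensus).

    label is the lower median rating, or the tag "controversial" when
    consensus falls below the threshold and aggregation is withheld.
    """
    consensus = leik_consensus(ratings)
    if consensus < threshold:
        return "controversial", consensus
    # median_low always returns a value that is on the ordinal scale.
    return median_low(ratings), consensus


# Hypothetical ratings for two pages.
print(aggregate([5, 5, 4, 5, 4, 5]))   # ratings concentrated at the top
print(aggregate([1, 1, 1, 5, 5, 5]))   # ratings split between the extremes
```

The lower median is used instead of the mean so that the aggregate stays on the ordinal scale and no interval assumption is needed.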
The existence of controversy, that is, of opposite credibility perceptions, can be explained in the light of bounded rationality or Prominence-Interpretation (P-I) theory [67]. Controversies that are easily explained by P-I theory are manageable: the wisdom of crowds approach can still be applied and the ratings aggregated. In contrast, for inherently controversial subjects, taboos and beliefs, the crowd ratings should not be aggregated.