
4.3 Data Collection

4.3.1 Collection Means

We propose three practical ways of obtaining a credibility-related research dataset. The first, and the most convenient, is to reuse an already available dataset. The second is to make use of data stored by tools that support credibility evaluation. The third is to collect a dataset by manually labeling Websites, e.g., via crowdsourcing.

4.3.1.1 Existing Datasets

Although reusing a dataset is cheaper and less time-consuming, not putting effort into building one's own dataset has several shortcomings. A nontrivial issue is that the available dataset may not fit one's needs, which can be a severe problem because the list of publicly accessible credibility datasets is currently limited. Reusing a dataset also carries the risk of having no access to the same versions of the content that was evaluated in the given dataset. Because Webpages are updated or abandoned, there is a high chance that they can no longer be reevaluated.

One publicly available dataset comes from a study on Web credibility and augmenting search results by Schwarz and Morris [237] and was kindly shared by Microsoft (footnote 2). In this dataset, each of the 1000 examined Webpages was labeled a single time. Another, much larger dataset, consisting of 6000 rated Webpages, is available on the Webpage of the Reconcile research project (footnote 3). This dataset is presented in Section 4.3.3.

4.3.1.2 Data from Tools Supporting Credibility Evaluation

Another approach to obtaining a dataset is to use external sources that provide credibility-related data, although not in a structured form. By repeating the steps taken in the work of Sondhi et al. [247], one can look for lists of a priori credible or not credible content. Sondhi et al. used a list of credible medical sites accredited by the Health on the Net Foundation (footnote 4). As proposed in Fritch [73], directories of pages prescreened by raters with a certain level of expertise can serve as sources of expert-labeled data. An example of a directory of trusted pages assessed and chosen by librarians is the IPL2 service (footnote 5). Conversely, a potential source of a priori noncredible sites is lists of pages infected with malware or spyware. Such blacklists are publicly available, e.g., from the StopBadware organization (footnote 6). Reputation systems focused on trust can also be used as a source for a credibility dataset, e.g., WOT (Web of Trust), Reconcile, and Factlink. A wide overview of such tools is presented in Section 4.3.2.
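The list-based approach can be sketched in a few lines of Python. This is only a minimal illustration, assuming that the credible and noncredible lists have already been exported as sets of domain names. The domain names, the label strings, and the function names below are hypothetical and do not come from any of the cited sources.

```python
from urllib.parse import urlparse

# Hypothetical example lists (assumption). In practice these would be exported
# from an accreditation directory (credible) and a malware blacklist (not credible).
CREDIBLE_DOMAINS = {"trusted-health.example"}
NON_CREDIBLE_DOMAINS = {"infected-site.example"}


def domain_of(url):
    """Return the lowercased host of a URL without a leading 'www.'.

    URLs given without a scheme (e.g. 'www.site.example') have no parsed host
    and yield an empty string.
    """
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def a_priori_label(url):
    """Label a URL by list membership; None means no list covers it."""
    domain = domain_of(url)
    if domain in CREDIBLE_DOMAINS:
        return "credible"
    if domain in NON_CREDIBLE_DOMAINS:
        return "not credible"
    return None
```

Pages that receive `None` are simply not covered by either list, so they would still need labels from another source, such as a reputation system or human labelers.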

4.3.1.3 Data from Labelers

Provided that one has the time and resources, it is a good idea to prepare a unique dataset tailored to the task's requirements. The set of Webpages chosen for the study needs to be labeled according to credibility, and this task needs to be performed with expertise and without bias. A natural way of achieving this goal is to hire domain experts to determine the credibility of the content of interest. The number of experts is limited by our resources and by the type of incentives used for the raters.

2. http://research.microsoft.com/en-us/projects/credibility/
3. http://www.reconcile.pl
4. http://www.hon.ch
5. www.ipl2.org
6. https://www.stopbadware.org/

Another option is judgment crowdsourcing [42], specifically carried out using crowdsourcing marketplaces, e.g., Amazon Mechanical Turk or Clickworker. Howe [101] introduced the definition of crowdsourcing as “the act of a company or institution taking a function once performed by employees and outsourcing it to an undefined (and generally large) network of people in the form of an open call”. In this special case, incentives for the labelers are also the key factor determining the size of the dataset. Another aspect of crowdsourcing credibility is the unknown expertise level of the hired workers. Thus, the feedback gathered from individual raters on the credibility of the given pages needs to be aggregated into final assessment labels. This is not a trivial task, due to the unknown expertise of the workers as well as their motivation and honesty. Noise and spam in crowdsourced data are a ubiquitous issue recently discussed in the field's literature [102, 242]. The manner of filtering the data and dealing with such labeler-related issues is discussed in the later sections of this chapter.
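To make the aggregation step concrete, the following sketch applies plain majority voting to the ratings of each page. It is a simple baseline chosen here for illustration, not the filtering approach described later in the chapter, and it ignores worker expertise and honesty entirely. The input format (tuples of worker, page, and label), the function name, and the `min_ratings` threshold are assumptions.

```python
from collections import Counter, defaultdict


def aggregate_by_majority(ratings, min_ratings=3):
    """Aggregate individual ratings into one label per page by majority vote.

    ratings: iterable of (worker_id, page_id, label) tuples (assumed format).
    Pages with fewer than min_ratings ratings, or with a tie between the two
    most frequent labels, get None so they can be re-rated or inspected.
    """
    labels_by_page = defaultdict(list)
    for _worker_id, page_id, label in ratings:
        labels_by_page[page_id].append(label)

    final_labels = {}
    for page_id, labels in labels_by_page.items():
        if len(labels) < min_ratings:
            final_labels[page_id] = None  # too few ratings to decide
            continue
        top = Counter(labels).most_common(2)
        if len(top) > 1 and top[0][1] == top[1][1]:
            final_labels[page_id] = None  # tie between the top two labels
        else:
            final_labels[page_id] = top[0][0]
    return final_labels
```

Returning `None` rather than forcing a decision keeps undecided pages visible, which matters when some of the ratings may be noise or spam.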

The risk of dishonest or biased workers applies to crowdsourcing with hired workers as well as to reputation systems. Both methods of obtaining data need to address the credibility of the raters themselves [197]. Fabricated ratings, especially if the system allows feedback to be given anonymously, can jeopardize the goal of gathering reliable data [138]. Another issue with data gathered via a crowdsourcing marketplace might be the sample of raters participating in the task. If one's study or work requires the raters' sample to be balanced, this issue needs to be addressed. Amazon Mechanical Turk user demographics in 2009 were an approximation of the U.S. Internet user population (footnotes 7 and 8). However, according to one of the most recognizable practitioners, Panos Ipeirotis, the profile of the Mechanical Turk users (workers) is changing, and the proportion of users coming from the United States dropped from almost 80% to about 50% in 2012 [102]. It should be expected that the distribution of the raters' demographic features will not be balanced.
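Whether a rater sample is balanced can be checked by comparing the observed share of each demographic group with a target share. The sketch below is one simple way of doing so, under stated assumptions: the group names, the target shares, and the tolerance are hypothetical values that a study would have to define for itself, and the function names are illustrative.

```python
from collections import Counter


def group_shares(values):
    """Return the share of each group among the raters, e.g. by country."""
    counts = Counter(values)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def imbalanced_groups(values, target_shares, tolerance=0.10):
    """Return groups whose observed share deviates from the target share.

    target_shares: mapping of group -> desired share (assumed, study-specific).
    The result maps each flagged group to (observed share - target share).
    """
    observed = group_shares(values)
    deviations = {}
    for group, target in target_shares.items():
        difference = observed.get(group, 0.0) - target
        if abs(difference) > tolerance:
            deviations[group] = difference
    return deviations
```

A negative deviation means the group is underrepresented among the raters, which could be addressed by recruiting more raters from that group or by reweighting their ratings.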

In document (Chapman & Hall_CRC Machine Learning & Pattern Recognition) -Computational Trust Models and Machine Learning-Chapman & Hall Crc (2014) (Page 106-108)