
CHAPTER 3. METHODS

3.2 First phase

3.2.1 Data collection

Data can be collected from delicious.com in two ways. One method is to use the Really Simple Syndication (RSS, a simple Web feed) feature offered by the site. Delicious.com provides a number of RSS feeds, including Recent RSS (a feed of recently made bookmark postings) and Hotlist RSS (a feed of the URLs that are most popular at a particular point in time). Of these feeds, the Recent RSS feed was used for collecting data for this study [10]. The other method is to crawl the pages on the site. On delicious.com, there are three kinds of pages corresponding to the three distinct entities involved in bookmarking activities: information object (URL), user, and tag. Each user has their own page(s) listing all of their bookmarks, and each information object (URL) has a page listing all the bookmark postings of that URL (made by different users). Each tag also has a page containing all the bookmark postings associated with it. Since all these pages are open to the public, one can crawl them as needed [11]. As described below, this method was used to get the entire history of sample users and URLs. Figure 3.1 and Figure 3.2 show examples of a user page and a URL page on delicious.com, respectively.
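As a rough illustration of the feed-based method, the sketch below turns a feed document into posting records. The feed layout (RSS 2.0-style `item` elements with `link`, `author`, and `pubDate` children), the sample data, and the function name `parse_postings` are all assumptions made for illustration; the actual format of the delicious.com Recent feed and the scripts used in this study are not described here.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed snippet; the real delicious.com feed layout may differ.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Recent bookmarks (illustrative)</title>
    <item>
      <link>http://example.org/a</link>
      <author>alice</author>
      <pubDate>Mon, 14 Jan 2008 10:00:00 GMT</pubDate>
    </item>
    <item>
      <link>http://example.org/b</link>
      <author>bob</author>
      <pubDate>Mon, 14 Jan 2008 10:01:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_postings(feed_xml):
    """Return one (user, url, timestamp) tuple per <item> in the feed."""
    root = ET.fromstring(feed_xml)
    postings = []
    for item in root.iter("item"):
        user = item.findtext("author", default="").strip()
        url = item.findtext("link", default="").strip()
        stamp = item.findtext("pubDate", default="").strip()
        # Skip items that lack either of the two entities of interest.
        if user and url:
            postings.append((user, url, stamp))
    return postings


postings = parse_postings(SAMPLE_FEED)
```

Each record pairs a user with an information object (URL), which is the unit referred to as a posting in this chapter.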

[10] [...] collect a sample of bookmarking activities that occurred during the period. Although the collected data do not include all posting activities (because of the time interval between fetches), it can be assumed that no systematic bias was introduced by the data collection process.

[11] While delicious.com provides application programming interfaces (APIs), they could not serve the purpose of this study because the APIs require authentication and allow access only to one's own account. Therefore, an alternative method, crawling and page scraping, was used as the primary data collection method. Note that web scraping necessarily relies on the consistency of the page structure, over which the researcher has no control. Major changes in the site design, for instance, may make scripts written for the previous version obsolete. In fact, at the end of July 2008, delicious.com changed the entire 'look and feel' of the site, and the underlying HTML document structure of each page changed as well. Had the data collection for this study not been completed by then, all the scripts would have had to be rewritten. Another limitation of the crawling approach is that many servers restrict the amount of data that can be crawled from a single IP address in a given time period. In fact, this was one of the [...]

Figure 3.1 A user page on delicious.com

Given the huge scale of the information space being studied, it was important to find a way to capture both the breadth and the depth of the space. To this end, two complementary methods of data collection were used. For capturing the breadth of the space, a large sample of recent bookmarking activities was collected into a dataset called Recent. The Recent dataset includes a sample of each of the two main entities involved in bookmarking activities, users and information objects. The range of each entity in this dataset provides a sense of the breadth of the information space. For representing the depth of the space, two separate samples were drawn from the Recent dataset: one is a sample of users from the user population and the other is a sample from the population of information objects. For each sample set, the entire history of bookmarking activities associated with each sample element was collected. The resulting datasets are called the User History dataset and the URL History dataset, respectively. Figure 1 illustrates the range and coverage of each dataset.

The Recent dataset was collected from January 14, 2008 to April 21, 2008 (14 weeks), using the Really Simple Syndication (RSS) feature provided by delicious.com. Through a subscription to the Recent RSS feed, a sample of the most recent bookmarking activities was collected. In total, 1,226,472 postings were collected, covering 999,835 distinct URLs saved by 288,727 distinct users. As described above, this dataset represents the current breadth of the activities on the delicious.com site.
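Because successive fetches of a feed can return overlapping items, the fetched postings have to be merged before totals such as those above can be counted. The following is a minimal sketch of that bookkeeping, assuming that a posting is identified by its (user, URL) pair; the function names and the toy data are hypothetical and do not reproduce the scripts used in this study.

```python
def merge_fetches(fetches):
    """Merge successive feed fetches, keeping each (user, url) posting once."""
    seen = set()
    merged = []
    for fetch in fetches:
        for user, url in fetch:
            if (user, url) not in seen:
                seen.add((user, url))
                merged.append((user, url))
    return merged


def summarize(postings):
    """Count postings, distinct users, and distinct URLs."""
    return {
        "postings": len(postings),
        "users": len({user for user, _ in postings}),
        "urls": len({url for _, url in postings}),
    }


# Two toy fetches; the second repeats one posting from the first.
fetches = [
    [("alice", "http://example.org/a"), ("bob", "http://example.org/a")],
    [("bob", "http://example.org/a"), ("carol", "http://example.org/b")],
]
recent = merge_fetches(fetches)
summary = summarize(recent)
```

In this toy input the repeated posting is counted once, so the summary reports three postings by three users on two URLs.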

In order to get data that accumulated over time, two additional datasets were collected: the URL History dataset and the User History dataset. The URL History dataset includes the entire set of postings associated with 10,000 sample URLs, and the User History dataset contains the entire set of postings ever made by each of 10,000 sample users. Sample URLs and users were randomly selected from the Recent dataset. The final URL History dataset has 1,733,178 postings (of the 10,000 sample URLs) made by 484,034 users, and the final User History dataset has 3,521,843 postings of 2,451,711 distinct URLs (made by the 10,000 sample users). Table 3.1 summarizes the size of each dataset.
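The sampling step can be sketched as follows. The text states only that sample URLs and users were randomly selected from the Recent dataset; drawing a simple random sample without replacement from the distinct users and the distinct URLs, with a fixed seed for reproducibility, is an assumption of this sketch, as are the function name `draw_samples` and the toy data.

```python
import random


def draw_samples(postings, n_users, n_urls, seed=0):
    """Draw simple random samples of distinct users and distinct URLs."""
    rng = random.Random(seed)
    # Sort first so that the result depends only on the seed, not on set order.
    users = sorted({user for user, _ in postings})
    urls = sorted({url for _, url in postings})
    return rng.sample(users, n_users), rng.sample(urls, n_urls)


# Toy stand-in for the Recent dataset: 50 users, 20 URLs.
toy_recent = [(f"user{i}", f"http://example.org/{i % 20}") for i in range(50)]
sample_users, sample_urls = draw_samples(toy_recent, n_users=10, n_urls=5)
```

In the study itself, the sample size was 10,000 for each entity, and the full posting history of every sampled user and URL was then crawled.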

Table 3.1 The size of each dataset

Dataset        No. of postings   No. of users   No. of URLs
Recent               1,226,472        288,727       999,835
URL History          1,733,178        484,034        10,000
User History         3,521,843         10,000     2,451,711

Having these two history datasets, in addition to the Recent dataset, allows us to look at the question of accumulation and overlap from two views: a resource-centric view and a user-centric view. From the resource-centric view, we examined, for instance, what proportion of resources (represented by URLs) is shared by multiple users. From the user-centric view, on the other hand, we looked at how many users share one or more resources with other users.
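The two views can be illustrated with a small computation over (user, URL) postings. This is only a sketch of the two example questions posed above, not the measures defined later in this chapter; the function name `overlap_measures` and the toy data are hypothetical.

```python
from collections import defaultdict


def overlap_measures(postings):
    """Compute one resource-centric and one user-centric overlap figure."""
    users_by_url = defaultdict(set)
    for user, url in postings:
        users_by_url[url].add(user)
    if not users_by_url:
        return {"shared_url_ratio": 0.0, "sharing_user_ratio": 0.0}

    # Resource-centric view: URLs bookmarked by more than one user.
    shared_urls = [url for url, users in users_by_url.items() if len(users) > 1]

    # User-centric view: users who hold at least one such shared URL.
    sharing_users = set()
    for url in shared_urls:
        sharing_users |= users_by_url[url]

    all_users = {user for user, _ in postings}
    return {
        "shared_url_ratio": len(shared_urls) / len(users_by_url),
        "sharing_user_ratio": len(sharing_users) / len(all_users),
    }


toy_postings = [
    ("alice", "http://example.org/a"),
    ("bob", "http://example.org/a"),
    ("carol", "http://example.org/b"),
    ("alice", "http://example.org/c"),
]
measures = overlap_measures(toy_postings)
```

In the toy data, one of the three URLs is shared by multiple users, and two of the three users share a resource with someone else.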

3.2.2 Measures of accumulation and overlap

In document Network analysis of shared interests represented by social bookmarking behaviors (Page 132-137)