
4.2 Factors of Association

4.2.2 Data collection and data set

Data collection for this component of the study ran from August 30, 2005, the first day of the fall semester, to December 13, 2005, the end of the fall semester. On a weekly basis, a web crawler captured the Facebook profiles of individuals who self-identified as freshmen at the University of North Carolina at Chapel Hill. Only individuals with publicly accessible profiles within the University of North Carolina network are included in the study.¹ At the time, a profile that was open to the “UNC” network exposed data to approximately 35,000 students, faculty, and staff; for this reason the IRB provided a research exemption (Appendix A).

Table 4.1: Observations by week, longitudinal data collection

Week  Observations  Percent  Cumulative %
   1         3,087     5.85          5.85
   2         3,177     6.02         11.86
   3         3,229     6.12         17.98
   4         3,205     6.07         24.05
   5         3,280     6.21         30.26
   6         3,304     6.26         36.52
   7         3,325     6.30         42.82
   8         3,331     6.31         49.13
   9         3,331     6.31         55.44
  10         3,349     6.34         61.78
  11         3,356     6.36         68.14
  12         3,361     6.37         74.50
  13         3,368     6.38         80.88
  14         3,366     6.38         87.26
  15         3,365     6.37         93.63
  16         3,363     6.37        100.00

Counts of unique observations per week, fall 2005 longitudinal collection.
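The percentage columns in Table 4.1 follow directly from the weekly counts. The short sketch below recomputes them as a consistency check; the counts are copied from the table, while the variable names are illustrative and not part of the original analysis.

```python
from itertools import accumulate

# Weekly observation counts copied from Table 4.1 (weeks 1 through 16).
weekly_counts = [3087, 3177, 3229, 3205, 3280, 3304, 3325, 3331,
                 3331, 3349, 3356, 3361, 3368, 3366, 3365, 3363]

total = sum(weekly_counts)

# Share of all observations contributed by each week, and the running share.
percents = [100 * count / total for count in weekly_counts]
cumulative = [100 * running / total for running in accumulate(weekly_counts)]

for week, (count, pct, cum) in enumerate(
        zip(weekly_counts, percents, cumulative), start=1):
    print(f"{week:>2} {count:>6,} {pct:6.2f} {cum:7.2f}")
print(f"total observations: {total:,}")
```

The total recovered this way matches the 52,797 observations reported in the text.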

¹ Between 2004 and 2007, Facebook’s global network was segmented into smaller networks, such as schools, workplaces, and geographic regions. At the time, these “networks” represented a meaningful privacy boundary. See boyd and Hargittai (2010) for historical perspective. As of writing, the concept of “networks” as privacy boundary has been largely deprecated in Facebook.

During the data collection, I observed 3,499 unique profiles. These unique profiles accounted for 52,797 observations over the course of the 16-week data collection. Counts of observations per week are presented in Table 4.1.

In longitudinal data collection, attrition within the subject pool is a prime threat to the validity of findings (Harris, 1998). There are many causes of attrition, including subject mortality, relocation, and unwillingness to participate. These causes primarily affect long-running, burdensome studies. The data collection for this study, on the other hand, was observational in nature and occurred during a fairly short time interval of one semester. Therefore, case-level missingness is most likely attributable to a privacy policy change (i.e., making the profile private and unavailable to the crawler), data collection error (e.g., the website failed to respond to a query, or data were corrupted in transfer), or entry to the subject pool after data collection had begun. Table 4.2 provides insight into patterns of case-missingness in the subject pool, showing that for 82% of unique profiles, all 16 weeks of observations are present. Visual inspection of the missingness patterns indicates that the majority of missingness is due to late pool entry, rather than attrition during the study.

Upon collection, the Facebook profiles were processed using an XML parser, and individual profile elements were both anonymized and abstracted. This process involved removing personally identifiable information and converting it into derivative factors. For example, many individuals shared their “IM Screenname.” The screenname itself was removed from the data set, but a derivative effects code that measures whether the case shared a screenname remains. Another example is the listing of interests and favorites. After processing, the only derivatives that remain are counts of the interests and favorites.
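The abstraction step described above can be sketched as follows. This is a minimal illustration, not the parser used in the study: the field names (`im_screenname`, `interests`, `favorites`), the comma-separated list format, and the 0/1 coding of the screenname indicator are all assumptions made for the example.

```python
def abstract_profile(profile: dict) -> dict:
    """Reduce a parsed profile to non-identifying derivative factors.

    The input keys used here are hypothetical; only derived values are
    returned, so identifying fields never reach the output record.
    """
    def count_items(field: str) -> int:
        # Assumes list-like fields arrive as comma-separated strings;
        # a missing or empty field counts as zero items.
        raw = profile.get(field) or ""
        return sum(1 for item in raw.split(",") if item.strip())

    screenname = (profile.get("im_screenname") or "").strip()
    return {
        # Indicator only: 1 if a screenname was shared, 0 otherwise.
        # The screenname itself is deliberately not copied.
        "shares_screenname": int(bool(screenname)),
        "n_interests": count_items("interests"),
        "n_favorites": count_items("favorites"),
    }


# Hypothetical parsed profile, for illustration only.
example = {
    "name": "Jane Doe",
    "im_screenname": "jdoe05",
    "interests": "running, film, ",
    "favorites": "",
}
print(abstract_profile(example))
```

Building the output from an explicit list of derived keys, rather than deleting fields from the input, means an unexpected identifying field is dropped by default.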

As a result of the profile data extraction, I was able to build network “edge lists” of the articulated ties within the freshman network. Within a Facebook profile, an individual can articulate a reciprocal tie to any other willing member in the service.

Table 4.2: Patterns of missing data, longitudinal data set

Frequency  Percent  Cumulative  Pattern
     2885    82.45       82.45  1111111111111111
       96     2.74       85.20  .111111111111111
       60     1.71       86.91  ..11111111111111
       46     1.31       88.23  111.111111111111
       38     1.09       89.31  ...11111111111
       30     0.86       90.17  ....111111111111
       30     0.86       91.03  ...1111111111111
       21     0.60       91.63  ...1111111111
       18     0.51       92.14  ...1111111
      275     7.86      100.00  (other patterns)
     3499   100.00      100.00

This table describes patterns of case-wise missingness within the 16 weeks of data collection. The majority of subjects are represented for all 16 weeks. The rightmost column indicates the shape of the missing data: a “1” marks a week in which the profile was observed, and a “.” marks a missing week. For example, 96 profiles are missing week one and no other weeks.
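A pattern table of this kind can be produced from a list of (profile, week) observations. The sketch below shows one way to do so; the function names and the toy observations are hypothetical, and this is not the procedure used to produce Table 4.2.

```python
from collections import Counter

N_WEEKS = 16


def missingness_pattern(weeks_observed, n_weeks=N_WEEKS):
    """Return a string with '1' for each observed week and '.' for each missing week."""
    return "".join("1" if week in weeks_observed else "."
                   for week in range(1, n_weeks + 1))


def tabulate_patterns(observations, n_weeks=N_WEEKS):
    """Count profiles per missingness pattern.

    observations: iterable of (profile_id, week) pairs, weeks numbered from 1.
    """
    weeks_by_profile = {}
    for profile_id, week in observations:
        weeks_by_profile.setdefault(profile_id, set()).add(week)
    return Counter(missingness_pattern(weeks, n_weeks)
                   for weeks in weeks_by_profile.values())


# Toy data: one complete case, one late entrant, one case missing week 4.
toy = ([("a", w) for w in range(1, 17)]
       + [("b", w) for w in range(2, 17)]
       + [("c", w) for w in range(1, 17) if w != 4])

for pattern, freq in tabulate_patterns(toy).most_common():
    print(freq, pattern)
```

Late pool entry shows up as leading dots, whereas attrition would show up as trailing dots, which is what makes visual inspection of the patterns informative.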

Because I am interested in the freshman cohort, I extracted only dyadic ties articulated between freshmen. In creating the edge lists, anonymous identifiers were assigned to each member of the data set, ensuring that the network representation cannot be directly linked back to actual identities.²
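The construction of an anonymized, within-cohort edge list can be sketched as below. This is an illustration under stated assumptions rather than the study's implementation: the input structure (`friend_lists`, `cohort`), the use of random tokens as anonymous identifiers, and the function name are all hypothetical.

```python
import secrets


def build_edge_list(friend_lists, cohort):
    """Return a sorted list of undirected, anonymized ties within the cohort.

    friend_lists: dict mapping a profile id to an iterable of friend ids.
    cohort: set of profile ids belonging to the cohort of interest.
    """
    anon = {}  # real id -> random token; discarded when the function returns

    def anon_id(profile_id):
        if profile_id not in anon:
            anon[profile_id] = secrets.token_hex(8)
        return anon[profile_id]

    edges = set()
    for ego, friends in friend_lists.items():
        if ego not in cohort:
            continue
        for alter in friends:
            # Keep only ties where both endpoints are cohort members.
            if alter in cohort and alter != ego:
                # Ties are reciprocal, so each pair is stored once, unordered.
                edges.add(tuple(sorted((anon_id(ego), anon_id(alter)))))
    return sorted(edges)


# Toy input: "x" is outside the cohort, so the a-x tie is dropped.
toy_friends = {"a": ["b", "c", "x"], "b": ["a"], "c": ["a"], "x": ["a"]}
toy_edges = build_edge_list(toy_friends, cohort={"a", "b", "c"})
print(len(toy_edges), "edges")
```

Because the mapping from real identifiers to tokens is not retained, the edge list carries no direct key back to a profile, although structural re-identification can remain possible.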

² It must be noted that anonymity in social network data is theoretical. With network structure and vertex attributes, it is generally possible to identify individuals within a network. Therefore, these data may never be shared, they are protected with password-based access control and encryption, and they are abstracted so that the impact of unintended leakage would be minimal.

In survey research, it is fairly uncommon to encounter a data set with near-complete coverage of a large population. Generally, when the sample exceeds 5% of the target population, the Finite Population Correction (FPC) can be applied to account for the increased precision associated with high coverage (Kish, 1965). I have not applied the FPC to the following estimates, for reasons both technical and empirical. The primary empirical reason is that non-FPC standard errors are larger and therefore more conservative, decreasing the likelihood of type I error. The second empirical reason is the self-reported nature of the data. Had I been working from a data set with rigorous collection procedures (e.g., administrative records, in-person survey administration), I would feel more comfortable applying the FPC. Therefore, in the following study, standard errors are presented with the assumption of an infinite population.
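For reference, the usual form of the correction multiplies a standard error by sqrt((N - n) / (N - 1)), where n is the sample size and N is the population size. The sketch below shows how the factor shrinks the standard error as coverage rises; the function name and every numeric value are hypothetical and are not figures from this study.

```python
import math


def fpc_factor(n: int, N: int) -> float:
    """Finite population correction factor for a sample of n drawn from N."""
    if not 0 < n <= N:
        raise ValueError("require 0 < n <= N")
    if N == 1:
        return 0.0  # a census of a single unit has no sampling error
    return math.sqrt((N - n) / (N - 1))


# Hypothetical standard error computed under the infinite-population assumption.
se_infinite = 0.05

for n in (50, 500, 900):
    factor = fpc_factor(n, N=1000)
    # The corrected standard error is never larger than the uncorrected one.
    print(n, round(factor, 3), round(se_infinite * factor, 4))
```

Since the factor never exceeds 1, omitting it leaves standard errors at least as large as the corrected ones, which is the conservative direction.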
