
3.3 Production of Hit List Show for BBC Radio 5 live

3.3.3 Findings 1 Data Acquisition and Pre-processing: Linking Large Datasets

boyd and Crawford (2012) define “Big Data” through three pillars: technology, analysis and mythology. While in this thesis I stray from using Big Data terminology, preferring social data science instead, it is noteworthy how well the three-pillar framework matches some of the key issues involved in the production of the Hit List. To start with, the technological pillar is defined as “maximizing computation power and algorithmic accuracy to gather, analyze, link, and compare large data sets.” (boyd and Crawford, 2012, p. 663). Linking and comparing datasets lay at the heart of the process of composing the Hit List chart, as the analyst collated data from different online platforms. This is by no means an exclusive property of the Hit List case compared to the other cases studied in this thesis. In the previously discussed “Shakespeare Lives” project our team also dealt with data from Facebook, Twitter, VK.com and Sina Weibo. However, two particular aspects are specific to the Hit List case:

1. The data from different platforms were “pre-cooked” to different degrees.

2. The analyst had to collate all the data into one and only one hit list.

Both issues are discussed in detail below.

3.3.3.1 How raw the “raw data” were

After the first several months of the show’s run, the list of platforms was firmly established as Twitter, Facebook, Google Trends and YouTube. The only mode of accessing data from those platforms that was realistic, given the budget constraints and the need to produce the show weekly, was to use automated data acquisition jobs that connected to the platforms’ APIs or feeds to obtain the platforms’ “raw” data. However, as Gitelman (2013, p. 3) notes, there is no such thing as truly raw data, since “data need to be imagined as data to exist and function as such, and the imagination of data entails an interpretive base”. While all the Hit List data captured one underlying phenomenon – a user’s interaction with an online platform – the way this essence was represented in the acquired data varied due both to the specifics of the data sources and to the decisions of the analyst. As a result, at the very moment of acquisition the data from different platforms were “pre-cooked” to a varying degree.

Twitter data were arguably the rawest. The analyst employed the Twitter Streaming API’s Sample endpoint to continuously acquire a 1% random sample of all the tweets published in the world in real time. Such data were potentially the richest; however, they also required the most initial treatment from the analyst before the core analysis tasks (identifying which narrow themes in the data were topical or news-worthy, and how wide the boundaries of an individual theme were) became possible at all. First of all, the data coming from parts of the world other than the UK had to be filtered out, which is discussed in detail in Section 3.3.3.3. Second, the individual tweets had to be thematically aggregated in some way. The most straightforward way to do that was to identify the most commonly used hashtags and collate the tweet counts for those that belonged to one topic. Since multiple hashtags were often used in one tweet, the analyst first derived the list of the most common hashtags and then, for each of those common hashtags, a list of their co-occurring hashtags. The analyst used their judgement to pick and choose from these lists, thus forming topical hashtag clusters.
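The hashtag step described above can be illustrated with a minimal sketch. It is not the analyst's actual code: the function name `hashtag_cooccurrence`, its parameters and the sample tweets are all hypothetical, and it assumes that the hashtags have already been extracted from each (UK-filtered) tweet.

```python
from collections import Counter, defaultdict
from itertools import permutations


def hashtag_cooccurrence(tweets, top_n=3, co_n=3):
    """Map each of the top_n hashtags to its co_n most frequent co-occurring hashtags.

    `tweets` is an iterable of hashtag lists, one list per tweet (an assumption
    about the input format, not something stated in the text).
    """
    tag_counts = Counter()
    co_counts = defaultdict(Counter)
    for tags in tweets:
        unique = {t.lower() for t in tags}  # count each hashtag once per tweet
        tag_counts.update(unique)
        for a, b in permutations(unique, 2):  # every ordered pair within the tweet
            co_counts[a][b] += 1
    return {tag: co_counts[tag].most_common(co_n)
            for tag, _ in tag_counts.most_common(top_n)}


# Hypothetical, hand-made input: one list of hashtags per tweet.
sample = [
    ["Brexit", "EUref"],
    ["brexit", "EUref", "VoteLeave"],
    ["Brexit"],
    ["GBBO"],
]
clusters = hashtag_cooccurrence(sample, top_n=2)
```

The output is only a candidate list per common hashtag. Deciding which co-occurring hashtags actually belong to the same topic remained a matter of the analyst's judgement.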

In the case of YouTube, the analyst could make use of quite significant data “pre-cooking” on the platform’s side thanks to YouTube’s automatically generated and continuously updated playlists of most popular videos in each country, including the UK[15]. While YouTube did not openly state how often the UK playlist was updated and how long it was, through trial-and-error the analyst determined that the updates happened approximately every 20 minutes and that the playlist always consisted of 200 videos. Thus, the YouTube data collection job made calls to the YouTube API three times an hour throughout the week to access the current contents and order of this playlist. Such use of the playlist made the acquired data pre-cooked on several levels compared to Twitter:

• Out-of-the-box, the data reflected the interests specifically of the UK public.

• Instead of coming at the level of individual user interactions (posting a tweet), YouTube data came at the level of individual objects (videos) with which the interactions (watching, liking, commenting, etc.) could happen. Therefore, no additional analysis stage of reducing individual interactions to thematic objects (analogous to reducing tweets to their hashtags) was required.

• The order of videos in each individual playlist gave the analyst a pre-cooked popularity score for each video. Instead of deriving their own popularity measure (e.g. some integrated rating that would take into account views, likes, comments and other metrics), the analyst could simply assign a high score to the videos at the top of the playlist and a low score to the videos at the bottom of the playlist.

[15] At the moment of this writing, the UK’s playlist can be found at https://www.youtube.com/playlist?list=PL-DfNcB3lim9IZmUXEjE1Ov0Ir1NDa3Yr

The procedure that YouTube used to order its auto-generated playlists was completely black-boxed; however, this was not much of a concern in the case of the Hit List. Moreover, YouTube’s playlist ordering was most certainly better informed than any measure the analyst could design, since the publicly available data on videos’ popularity were presumably not as rich as YouTube’s internal data. The use of this pre-cooked popularity score avoided repeated additional API calls to acquire details on each individual video, which might have led to rate-limiting. Considerations of what data to use were therefore influenced by other factors, to do with the proprietary nature of the data.

The data from Google Trends were acquired in a similar manner – i.e. repeatedly with a 20-minute interval – from the Google Trends Atom feed. In many ways these data were similar to YouTube data (presumably since both platforms belonged to the Google product ecosystem). The data represented a list of the most popular Google search terms in the UK at the moment that could be treated in a similar manner to the YouTube playlist. The only differences were that the Google Trends lists of search terms were sorted by only one parameter – the total number of searches for a term within the reported time frame – and that the data did include an approximate number of the search requests, for example “100,000+” or “5,000+”, with larger numbers having lower granularity. Having to aggregate such estimates rather than exact values was a typical example of a limitation associated with dealing with proprietary data. However, since such aggregates provided insight into at least the order of magnitude of the search term popularity, using those still gave higher precision than only using their chart positions in the Atom feed.

The subsequent aggregation of weekly data for both YouTube and Google Trends boiled down to summation of the chosen popularity scores for each video / search term across all the considered scrapes.
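The weekly summation described above can be sketched as follows. This is an illustration under stated assumptions: the text does not specify the exact score assigned to a playlist position, so the linear `position_score` is hypothetical, as are the function names and the sample scrapes. Reading an estimate such as “100,000+” as its lower bound is likewise one possible choice.

```python
from collections import defaultdict


def position_score(position, playlist_length=200):
    """Hypothetical linear score: position 1 gets playlist_length, the last gets 1."""
    return playlist_length - position + 1


def parse_trend_estimate(text):
    """Read an approximate count such as '100,000+' as its lower bound (100000)."""
    return int(text.rstrip("+").replace(",", ""))


def weekly_totals(scrapes, score):
    """Sum the per-item scores over every scrape collected during the week."""
    totals = defaultdict(int)
    for scrape in scrapes:
        for item, raw in scrape:
            totals[item] += score(raw)
    return dict(totals)


# Hypothetical scrapes: (video id, playlist position) and (search term, estimate).
youtube_scrapes = [
    [("video_a", 1), ("video_b", 2)],
    [("video_b", 1), ("video_a", 3)],
]
trends_scrapes = [
    [("term_x", "100,000+"), ("term_y", "5,000+")],
    [("term_x", "50,000+")],
]
youtube_week = weekly_totals(youtube_scrapes, position_score)
trends_week = weekly_totals(trends_scrapes, parse_trend_estimate)
```

Because every scrape contributes to the total, an item that stays in the list for many consecutive scrapes accumulates a higher weekly score than one that appears only briefly at the same position.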

The Facebook data were almost “well-done” immediately at the point of acquisition. In principle, the Facebook Graph API allowed data to be acquired at an arbitrary level of granularity – e.g., if there had been such a desire, the analyst could have acquired detailed data on each individual “like”. However, the analyst’s experiments revealed that Facebook effectively put severe restrictions on the number of requests that could be made to the API: the speed with which Facebook returned responses to the API requests fell dramatically over time. Because of that and a lack of convenient methods to collect a random sample of the relevant Facebook data, the analyst had to come up with an acquisition approach that would be very selective both in terms of the particular Facebook pages – i.e. either individual user pages or organisational public pages – and in terms of the acquired data types.


The chosen approach was to limit the acquisition to weekly posts from the public pages of the UK news media organisations with a substantial following on Facebook and to use post-level engagement metrics – comments, likes and shares. This approach involved a noteworthy trade-off: on the one hand, it went somewhat against the motto of the show, as it did not expand the show’s agenda beyond the items already reported in mainstream news media. On the other hand, the derived Facebook data were very convenient to work with, as by their very nature they represented almost exclusively news-worthy topics judged to be interesting to the UK public. Minimal moderation from the analyst’s side was required. When comparing this back to the complex process of making the Twitter data workable, it becomes self-evident that calling data coming from these two platforms equally “raw” would be a drastic oversimplification.

3.3.3.2 Aggregating data across platforms

While the discussion above shows the complexity of bringing the data to a state in which the popularity of the represented topics could be assessed within each platform, turning those platform-wise popularity estimations into a coherent integrated rating was a significant issue on its own. The data from different platforms represented qualitatively different interactions between the users and the topics and were also affected by the relative popularity of the platforms and the difference in their demographic profiles (Duggan and Brenner, 2013). It is non-trivial to judge how much a single Facebook comment is worth in search queries or in tweets. Moreover, the YouTube score, as discussed above, was based on relative positions of a video in an auto-generated playlist rather than on raw counts of underlying user interactions.

The approach that the data science team agreed upon was making separate charts for individual platforms and then aggregating those via assigning each platform a weight. Thus, the 40 most popular topics on each platform received a platform-specific score distributed from 40 to 1 inversely to the topic’s position in the platform chart. A weighted sum of those scores across all platforms then gave a topic’s overall score.
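The scoring scheme just described can be sketched in a few lines. The weights below are hypothetical placeholders, since the actual values are not reported here; the topic names and function names are likewise invented for illustration.

```python
from collections import defaultdict


def platform_scores(chart, chart_size=40):
    """Score the top chart_size topics from chart_size (first place) down to 1."""
    return {topic: chart_size - rank
            for rank, topic in enumerate(chart[:chart_size])}


def overall_chart(charts, weights, chart_size=40):
    """Combine per-platform charts into one ranking via a weighted sum of scores."""
    totals = defaultdict(float)
    for platform, chart in charts.items():
        for topic, score in platform_scores(chart, chart_size).items():
            totals[topic] += weights[platform] * score
    # Highest overall score first; ties are broken alphabetically.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical per-platform charts (most popular topic first) and weights.
charts = {
    "twitter": ["brexit", "us election", "gbbo"],
    "facebook": ["us election", "brexit"],
    "youtube": ["cat video", "brexit"],
}
weights = {"twitter": 1.0, "facebook": 1.0, "youtube": 0.5}
ranking = overall_chart(charts, weights)
```

In this toy example "brexit" scores 40 + 39 + 0.5 × 39 = 98.5, whereas "cat video", first on one platform only, scores 0.5 × 40 = 20.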

This approach had profound consequences for the contents of the show. First, it allowed a good degree of variety in the covered topics. Each platform had its own characteristic trending content – partially due to the platform mechanics and the demographics of their user-base, partially due to the way we collected the data from each platform (cf. Ruths and Pfeffer, 2014). For example, YouTube appeared to be prone to carrying “viral”, entertaining content, while Facebook, by the construction of the data acquisition process, was “newsy”. With the approach taken, the most popular content from each platform was presented somewhere in the chart even in those weeks when that platform had not gained high absolute levels of interaction. On the other hand, a platform-specific topic, no matter how overwhelmingly popular it was within one platform, could never take a top spot from a theme that was trending across multiple data sources. This sometimes led to a more predictable top-5/top-10 (with Brexit and the US Presidential Elections being at the top of the list for around half of 2016), but gave us additional confidence in the top slots of the chart.

However, an arguably even more important consequence of the approach taken to link the platform-specific datasets was the ability to weight the platforms differently. The weights fluctuated during the first year of the programme, but from August 2015 to the end of the show they remained constant. The key factors behind the chosen weights were:

• Striving for a diverse and balanced list of topics. Some of the key dimensions to balance were (a) uniqueness of the topics in a weekly chart vs. coverage of the topics trending across the mainstream media, (b) UK vs. international focus, (c) hard news vs. entertainment. The degree to which the balance was achieved was assessed mainly by the show producers, for whom this balance was a major selling point of the show. As one of the producers said in an interview, “the breadth of the stories [...] from either side of the spectrum and everything in between was vital to what made [the Hitlist] so good”. Moreover, as her colleague observed:

“I think [the Hit List] is really reflective of how people digest news. So many people actually just digest news through social media, so they are looking at one second a cute panda and the next second Donald Trump. [... Traditional] news, they haven’t caught up with this; that’s why this is so important.”

• Striving for a workable list. One aim in compiling the list was to allow the show producers to give the topics a proper journalistic treatment. While the producers were up for the challenge that the data-driven reporting brought and were happy to study topics they would not normally come across in their work, some topics lacked substance beyond repair by journalistic work. YouTube, as the least event-driven and the most content-driven platform, contributed the lion’s share of this problem, which naturally led to it being the lowest rated platform. In addition to using the weightings, informal rules emerged that aimed to exclude content that did not fit the definition of “news” for a general audience. Examples were music videos or gaming videos (see Section 3.3.4 for a more detailed discussion of content exclusion).

• The level of confidence in and experience with processing data from various platforms. At the very start of the show Twitter was the only data source. While its data were in a sense the most problematic to deal with, the data science team members (including the lead analyst) had had the most previous experience in dealing with Twitter data out of all online platforms. For these reasons, Twitter stayed the highest weighted platform for the course of the show. By contrast, Facebook was initially added with a low weight, but after several months, when its high news-value and important role in counter-balancing data from other platforms had been robustly established, it was advanced to the same highest weight as Twitter.

It is worth noting that the motivation behind the selected weights was very practical and pragmatic. The weights did not necessarily reflect our perception of the relative importance of the selected platforms, or our understanding of the overall volumes of interaction on them. Rather, they were used to support the end-goal of the project: production of an interesting and varied weekly news chart for live radio that would resonate with the audience. This may seem a bit counter-intuitive given the show’s data-driven, evidence-based nature, but it is arguably a common goal for all such attempts at creating “trending” features – for example, in the case of the aforementioned YouTube playlist of the most popular UK videos, it is hard to imagine that Google compiles it for any other reason than engaging audiences.

3.3.3.3 Filtering UK data

The discussion above suggests that most of the data processing that had to be performed beyond the straightforward automated data aggregation was done manually by the analyst, who exercised their judgement on the topic boundaries and news-worthiness. However, it has also been mentioned that the Twitter data were acquired in the form of a real-time 1% random sample. This motivated the use of machine learning to filter out the tweets that did not come from the UK. Indeed, when charting the topics discussed on Twitter, the analyst could not afford to rely only on the tweets that contained geolocation data, as such tweets were extremely sparse. Geolocation codes are contained in only about 5% of tweets (Graham et al., 2014) – and that would have been 5% of a 1% sample. Hence, for the vast majority of data we had to infer their country of origin using a classification algorithm. The analyst employed a Bayesian classifier that had been implemented by a different data scientist on the team and had been trained on the tweets that did have geocodes. A revised version of the classifier is discussed by Zubiaga et al. (2017). The version of the classifier employed in the production of the Hit List reported precision and recall of 85% and 68% respectively, which implied an expected level of 15% false positives in the filtered data. It is worth examining how the application of an algorithm with known limitations affected the chart itself, the team’s perception of that chart and how it was dealt with.
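The figures quoted above can be unpacked with a little arithmetic. The sketch below is only a worked illustration: the count of 1,000 truly-UK tweets is a hypothetical round number, and the function name is invented. It uses the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN).

```python
def filter_outcomes(true_uk, precision, recall):
    """Expected counts after filtering, given the number of truly-UK tweets."""
    true_positives = recall * true_uk        # UK tweets the classifier keeps
    kept = true_positives / precision        # everything labelled "UK"
    false_positives = kept - true_positives  # non-UK tweets left in the data
    false_negatives = true_uk - true_positives  # UK tweets wrongly discarded
    return kept, false_positives, false_negatives


# Share of all tweets that are both in the 1% sample and geocoded (about 5%).
geotagged_share = 0.05 * 0.01

# Hypothetical batch containing 1,000 tweets that truly come from the UK.
kept, false_pos, false_neg = filter_outcomes(1000, precision=0.85, recall=0.68)
```

Under these assumptions roughly 800 tweets pass the filter, of which about 120 (15%) are false positives, while about 320 genuine UK tweets are lost as false negatives. Geocoded tweets in the sample amount to around 0.05% of all tweets.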

While the filtering algorithm definitely allowed for a collation of a much more UK-centric list of topics, it systematically left in a loosely defined set of topics (expressed as hashtags) that the Hit List team strongly suspected to be false positives. The suspicion was rooted in (a) the absence of these topics in the charts of the other individual platforms, (b) the Hit List team’s expectations and perceptions of what might be of interest for the UK public on Twitter and beyond, and (c) common features shared by many of those topics. Those topics tended to be US-centric, which could be explained by the fact that the US sector of Twitter was by far the largest and that the majority of British and American tweets shared a common language, which could lead to further confusion for the classifier. Some of those topics were left in the chart (e.g. the highly topical political ones around the US presidential election and the Black Lives Matter movement), while others (e.g. related to some of the US-specific TV talk shows) were eliminated manually by the analyst. Interestingly, if the analyst had not manually discarded those topics, some of them would have consistently charted for consecutive weeks. Such repeating irrelevant chart entries could have been particularly annoying to the show’s audience.

A closely related concern was the low recall rate of the algorithm. The time and resource constraints of the Hit List production did not allow for studying the unfiltered version of a chart, so while the analyst did their best to deal with false positives, they were ill-equipped against false negatives. In principle, since the hashtags/topics were not used as features in the country classifier, it was not completely unreasonable to assume that a probability of a

In document Project management in social data science : integrating lessons from research practice and software engineering (Page 97-110)