
CHAPTER 3. METHODOLOGY

3.2. Proposal for an Interdisciplinary Approach to Television Studies

3.2.2. Data Collection, Data Discovery, Data Cleaning

This section is intended as a walk-through of the process of data collection, discovery and cleaning, as well as of the archival and database sources I used. It is therefore not strictly part of the overview of the theoretical framework that guided my methodological approach, but rather a way to begin a more pragmatic description of the methodology and modus operandi used in this research. The corpus of anthology series I selected can be summed up in the following elements: medium (television), text type (anthology dramas), temporal frame (broadcast between 1947 and 2019), geographical location (United States). In order to be visualized, the information needs to be organized in a structured database. To this end, data were first extracted, collected and organized through research in physical archives (UCLA Film & Television Archive, Paley Center for Media, AFI Louis B. Mayer Library) and online databases (IMDb and Wikipedia). The data collected therefore come from an uneven set of archives, and include both “vertical”, institutional archives and “horizontal”, user-generated and participatory archives. The process of data collection is fundamental, as it lays the basis for data visualization and analysis. In Digital Humanities, this systematic procedure for gathering data was notably examined by Johanna Drucker. Drucker points out that data collection is a method borrowed from the natural and social sciences and that it needs to be readapted for the study of culture. In particular, she discusses the concept of data as something that should be “rethought through a humanistic lens and characterized as capta, taken and constructed” (Drucker 2015: 238). This approach starts from the premise that “capta is ‘taken’ actively while data is assumed to be a ‘given’ able to be recorded and observed. […] Humanistic inquiry acknowledges the situated, partial, and constitutive character of knowledge production, the recognition that knowledge is constructed, taken, not simply given as a natural representation of pre-existing fact.” (ibidem)

A “cultural-type survey” (Lorusso 2015) therefore poses significant problems related to the principle of selection of the corpus, as well as its size, density, complexity and level of interconnectedness. Furthermore, archival collections and online databases both contain messy data, often not ready for comprehensive information retrieval across the entire catalogue. To clarify the methodology adopted here, I will describe the process I followed for actively collecting data, or, better, capta, which I then displayed in graphical form. My research originally started from physical archives. I initially adopted archival research as a means of collecting information in order to define a corpus of anthology series. While the Paley Center for Media was useful for contextual research on early U.S. television history, the UCLA Film & Television Archive served as the main source for mapping U.S. anthology series, complemented by research at the American Film Institute’s Louis B. Mayer Library.

The UCLA Film & Television Archive contains “over 160,000 holdings spanning the entire course of broadcast history”,²⁹ making it one of the largest television archive collections in the United States. A search for television anthologies in the archive’s online catalogue returns 1072 episode titles listed as part of several anthology series (see Fig. 1).

²⁹ See: https://www.cinema.ucla.edu/collections/explore-collections.

Figure 1. Screenshot of the results for the query I ran on the UCLA Film & Television Archive online catalog.³⁰

The list found in the online catalogue, however, includes titles that cannot be screened due to the precarious state of their preservation. A close reading of the entire catalogue is therefore not possible and, even where it is, the amount of content to retrieve and analyze is very large, which makes it difficult to produce an unbiased analysis offering a complete overview of the corpus. Although I did watch individual episodes by booking a copy at the archive, I therefore decided to opt for other methods of information retrieval. This is how I began moving towards a digital humanities approach. Even though it is not possible to export metadata directly from the UCLA Film & Television Archive’s online catalogue as a .csv or .json file, which are the most common formats for data visualization, I was able to save a PDF file containing all the data available for each entry, e.g. title, format, year, subject, publisher (i.e. television network), and other additional notes and descriptions (see Fig. 2). These data were then moved manually into tabular form and combined with data found on Wikipedia, which was used to organize the UCLA catalogue in a more structured format.

³⁰ Query/Search request: Topic/Genre/Form(anthologies) AND Keyword(television) AND Keyword(United States) (LANG=ENG). https://cinema.library.ucla.edu/vwebv/search?searchArg1=anthologies&argType1=all&searchCode1=SKEY&combine2=and&searchArg2=television&argType2=all&searchCode2=GKEY&combine3=and&searchArg3=United+States&argType3=all&searchCode3=GKEY&year=2018-2019&fromYear=&toYear=&location=all&place=all&language=ENG&recCount=100&searchType=2&page.search.search.button=Search.

Figure 2. Screenshot of a single item’s section extracted from the printable view (file format: PDF) of the query I ran on the UCLA Film & Television Archive online catalog (see Fig. 1).

For the purposes of my research, in addition to the data collected from the UCLA Film & Television Archive’s catalogue and Wikipedia, I extracted data from online databases, with the intent of making the information more complete and minimizing the bias of relying on a single archival source. I notably referred to two online databases: IMDb, which was used to fill in missing information, and Wikidata, which was used as the main reference. The work on Wikipedia is of particular interest because it shows possible methods for extracting information from Wikipedia’s lists or items. On the one hand, I manually extracted the list contained on the following page: https://en.wikipedia.org/wiki/Anthology_series. I then matched this list with the UCLA list in order to group individual episode items into the corresponding series, e.g. “He’s for me” (S02E21) in Alcoa Hour (NBC, 1955-1957). In parallel, I extracted data from the Wikipedia category “American_anthology_television_series” (https://en.wikipedia.org/w/index.php?title=Category:American_anthology_television_series) through the Wikidata Query Service, a tool that gives access to a Wikidata dataset through a specific query, written in the semantic query language SPARQL. Figure 3 shows the SPARQL query I used for extracting the information stored in RDF (Resource Description Framework). More specifically, I filtered all the items in the category “anthology series”, and selected the following attributes: title, genre, production company, distributor, original network.

Figure 3. Screenshot of the SPARQL Query I used for extracting the information stored in RDF (Resource Description Framework) format from the Wikidata Query Service’s platform.
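Since the query itself appears only as a screenshot, the following Python sketch illustrates how a query of this kind could be assembled for the Wikidata Query Service. It is not a reproduction of the query in Figure 3. It assumes that the items are selected through the “instance of” property (P31) and that the attributes correspond to the Wikidata properties genre (P136), production company (P272), distributor (P750) and original broadcaster (P449). The function names and the `class_qid` parameter are hypothetical, and the identifier of the class to filter on has to be looked up on Wikidata.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(class_qid: str) -> str:
    """Return a SPARQL query listing the items of one class with their attributes.

    class_qid is the Wikidata identifier of the class to filter on. It is a
    parameter on purpose: no identifier is hard-coded in this sketch.
    """
    return f"""
SELECT ?item ?itemLabel ?genreLabel ?companyLabel ?distributorLabel ?networkLabel
WHERE {{
  ?item wdt:P31 wd:{class_qid} .
  # Each attribute may be absent, which is where null values come from.
  OPTIONAL {{ ?item wdt:P136 ?genre . }}
  OPTIONAL {{ ?item wdt:P272 ?company . }}
  OPTIONAL {{ ?item wdt:P750 ?distributor . }}
  OPTIONAL {{ ?item wdt:P449 ?network . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
"""


def build_url(class_qid: str) -> str:
    """Return a GET URL that asks the endpoint for JSON results."""
    params = {"query": build_query(class_qid), "format": "json"}
    return WDQS_ENDPOINT + "?" + urlencode(params)
```

Because every attribute is optional in this sketch, an item with no recorded genre or network still appears in the results with an empty value, and an item with several genres appears on several rows. Both effects are relevant to the discovery and cleaning steps described next.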

The Wikidata Query Service allows data to be extracted in a format that can be read by visualization tools, thus enabling a process of data discovery through visualization. “Visual data discovery is the use of visually-oriented, self-service tools designed to guide users to insights through the effective use of visual design principles […].” (Ryan 2016: 40) Data discovery is the preliminary step for data cleaning. If the dataset is extracted automatically, there is a high chance of finding errors and null values, a risk that is largely avoided when the capta are gathered manually from archives, as I did in the first phase of data collection. The query service offers options either for viewing the dataset in tabular form or for generating visual models through programs embedded directly in the platform. Polestar, for example (Fig. 4), makes it possible to plot the dataset on a graph by selecting the x and y axes (in this case, respectively, genre and the original network’s label) and by showing the size (#count).

Figure 4. Screenshot of the dataset extracted from the Wikidata Query Service as visualized on Polestar.
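The counts that such a plot encodes as point sizes can also be computed directly. The sketch below is a minimal illustration that uses only the Python standard library. The rows are hypothetical stand-ins for the extracted dataset, not actual records, and the field names `genre` and `network` are assumptions.

```python
from collections import Counter

# Hypothetical rows standing in for the extracted dataset.
# None marks a value that the query did not return.
rows = [
    {"title": "Series A", "genre": "drama", "network": "NBC"},
    {"title": "Series B", "genre": "drama", "network": "NBC"},
    {"title": "Series C", "genre": "science fiction", "network": "CBS"},
    {"title": "Series D", "genre": None, "network": "ABC"},
    {"title": "Series E", "genre": "drama", "network": None},
]


def pair_counts(rows, x="genre", y="network"):
    """Count rows per (x, y) pair, labelling missing values as 'undefined'."""
    return Counter(
        (row.get(x) or "undefined", row.get(y) or "undefined") for row in rows
    )


def null_report(rows, fields=("genre", "network")):
    """Return, for each field, the number of rows that have no value."""
    return {f: sum(1 for row in rows if not row.get(f)) for f in fields}


counts = pair_counts(rows)
missing = null_report(rows)
```

Here `counts` plays the role of the point sizes on the graph, while `missing` shows how many rows would fall under the “undefined” label for each axis.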

In the screenshot above we observe that the dataset contains data labeled as “undefined”, as well as duplicate entities such as ABC versus American Broadcasting Company, which should be collapsed into a single entity. Nevertheless, despite the messiness of these data, we can still verify that the dataset presents a fairly accurate picture of the occurrences of the anthology form, in the sense that it shows the major networks that originally broadcast this televisual form (ABC, CBS, NBC). If we scroll down, Netflix emerges as a medium-sized point, matching the scale of FX, HBO and Showtime as far as anthology content is concerned. This suggests that in its relatively short history as a platform (2007-2019), compared to the longer history of cable channels, it was still able to create a fairly large collection of anthologies. I will observe these dynamics in more depth in chapters 4 and 5, by combining distant and close reading.
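Collapsing duplicate entities of this kind amounts to mapping every variant label onto one canonical label. The sketch below shows one possible way to do this in Python. The alias table is hypothetical and contains only the ABC example mentioned above; a real table would have to be compiled by inspecting the dataset.

```python
from collections import Counter

# Hypothetical alias table: each variant label maps to one canonical label.
ALIASES = {
    "american broadcasting company": "ABC",
    "abc": "ABC",
}


def canonical_network(label):
    """Map a network label to its canonical form. None stays None."""
    if label is None:
        return None
    key = " ".join(label.split()).lower()  # normalise whitespace and case
    return ALIASES.get(key, label.strip())


def collapsed_counts(labels):
    """Count occurrences per canonical label, ignoring missing values."""
    return Counter(canonical_network(l) for l in labels if l is not None)


labels = ["ABC", "American Broadcasting Company", "NBC", None, " abc "]
network_counts = collapsed_counts(labels)
```

With the sample labels, the three spellings of ABC are counted as one entity, while labels that are not in the alias table pass through unchanged.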

Another tool for visual data discovery is RAWGraphs, which offers an open source visualization framework, “providing a missing link between spreadsheet applications (e.g. Microsoft Excel, Apple Numbers, OpenRefine) and vector graphics editors (e.g. Adobe Illustrator, Inkscape, Sketch).”³¹ For the visual discovery of the dataset I extracted, I used the alluvial diagram by RAWGraphs, which is designed to display relational flows between different categories, grouping the entities that belong to the same label by color. Figure 5 provides a graphical visualization of the flows of anthological content between production company, genre and original network, in order to observe the emergence of certain genres (e.g. drama, science fiction) or networks above others (e.g. ABC, BBC, CBS, ITV, NBC, Netflix). These visual operations do not allow for interpretation. Rather, they give an overview of the dataset, indicating where errors, inaccuracies and repetitions are located within it, and which nodes we might want to explore further to understand trends and patterns in the industrial-cultural network traced by anthology series.

³¹ See: https://app.rawgraphs.io/

Figure 5. Alluvial diagram made with RAWGraphs based on the dataset extracted from the Wikidata Query Service.
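The flows drawn in an alluvial diagram are counts of rows that share the same values in two adjacent dimensions. The following sketch computes such counts in Python. It is only an illustration of the principle and does not reproduce how RAWGraphs works internally. The rows and the field names `company`, `genre` and `network` are hypothetical.

```python
from collections import Counter

# Hypothetical rows with the three dimensions shown in the diagram.
rows = [
    {"company": "Studio X", "genre": "drama", "network": "NBC"},
    {"company": "Studio X", "genre": "drama", "network": "CBS"},
    {"company": "Studio Y", "genre": "science fiction", "network": "CBS"},
]


def alluvial_links(rows, dimensions):
    """Count the flows between each pair of adjacent dimensions.

    Keys have the form (left dimension, left value, right dimension, right value).
    """
    links = Counter()
    for row in rows:
        for left, right in zip(dimensions, dimensions[1:]):
            links[(left, row[left], right, row[right])] += 1
    return links


links = alluvial_links(rows, ["company", "genre", "network"])
```

Each key of `links` corresponds to one ribbon of the diagram and its value to the ribbon's width, so unusually thin or unexpected ribbons point to the places in the dataset that deserve a closer look.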

As a third step, I therefore proceeded to data cleaning with Python to eliminate null values, collapse duplicates into single attributes (e.g. true crime and neo-noir were collapsed under a single umbrella term), and filter out items or information I did not need for my analysis (e.g. films, non-U.S. TV series). Finally, using a fuzzy matching algorithm in Python, I filled in missing data on production company, original network, release date and number of episodes by merging the Wikidata dataset with an IMDb dataset accessed via API. The datasets resulting from these processes of data collection and data filtering count fewer than 500 items. On the one hand, what I call dataset I was the outcome of a manual process of assembling data from the UCLA Film & Television Archive’s catalog with data from a Wikipedia page about anthology series. On the other hand, dataset II was generated through a semi-automated process of data extraction from online databases such as Wikidata and IMDb. Since dataset I and dataset II are relatively small corpora compared to other corpora in Digital Humanities, I was also able to compare the two and merge them manually, by moving items from the less complete dataset to the more complete one. The final dataset is therefore the outcome of a long process of data manipulation, aimed at organizing the final corpus, making it readable by visualization tools and ultimately turning messy data into structured capta ready for generating meaningful graphic representations.
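The cleaning and fuzzy matching steps can be illustrated with a short Python sketch. It is not the script used for this research, which is not reproduced in the text. It assumes `difflib` from the standard library as a stand-in for the fuzzy matching algorithm, a hypothetical umbrella term (“crime”) for the two genres named above, and invented titles, field names and values.

```python
import difflib

# Hypothetical umbrella term for the two genres named in the text.
GENRE_MAP = {"true crime": "crime", "neo-noir": "crime"}


def clean(rows):
    """Drop rows without a title, keep only U.S. series, and map genres."""
    cleaned = []
    for row in rows:
        if not row.get("title"):
            continue  # no title: nothing to match on
        if row.get("type") == "film" or row.get("country") != "United States":
            continue  # outside the corpus (rows with no country are dropped too)
        genre = row.get("genre")
        cleaned.append({**row, "genre": GENRE_MAP.get(genre, genre)})
    return cleaned


def normalise(title):
    """Lower-case a title and collapse runs of whitespace."""
    return " ".join(title.lower().split())


def fill_from(primary, secondary, fields, cutoff=0.85):
    """Fill missing fields in primary from the closest-titled secondary row."""
    by_title = {normalise(r["title"]): r for r in secondary}
    merged = []
    for row in primary:
        match = difflib.get_close_matches(
            normalise(row["title"]), list(by_title), n=1, cutoff=cutoff
        )
        donor = by_title[match[0]] if match else {}
        filled = dict(row)
        for field in fields:
            if filled.get(field) is None and donor.get(field) is not None:
                filled[field] = donor[field]
        merged.append(filled)
    return merged


# Invented sample rows, not actual records.
wikidata_rows = [
    {"title": "The Midnight Theatre", "type": "series", "country": "United States",
     "genre": "drama", "network": None, "episodes": None},
    {"title": "Some Film", "type": "film", "country": "United States",
     "genre": "neo-noir", "network": None, "episodes": None},
    {"title": None, "type": "series", "country": "United States",
     "genre": "drama", "network": "NBC", "episodes": None},
    {"title": "Crime Files", "type": "series", "country": "United States",
     "genre": "true crime", "network": "CBS", "episodes": None},
]
imdb_rows = [
    {"title": "Midnight Theatre", "network": "Example Network", "episodes": 12},
]

cleaned = clean(wikidata_rows)
merged = fill_from(cleaned, imdb_rows, ["network", "episodes"])
```

The `cutoff` value is a design choice: a high threshold avoids false matches between different series with similar titles, at the cost of leaving some gaps unfilled. With a corpus of this size, the matches and the remaining gaps can then be checked by hand.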

This sometimes neglected preliminary phase is at the very core of a digital humanities methodology, which is not confined to the moment of production of visual models or digitized content, but lies in the very process of defining a workable corpus. Even before data analysis and interpretation are performed, the creation of a corpus and its visualization already set the premises for the production of knowledge. Much like in cultural semiotics, the corpus I selected tries to be as unbiased as possible, by blending together different sources. Quoting the semiotician Anna Maria Lorusso, it notably aims to be “significant and representative”, while avoiding “both the logic of exemplum (taking a single case and postulating a posteriori that it explains everything else) and the most extreme derivation of constructivism (by defining an ad hoc corpus that confirms the original hypothesis and that, therefore, does not really test it or have the ability to modify it).” (Lorusso 2015: 55) I want to stress this point, which I believe should represent the core of any Digital Humanities inquiry and was notably adopted for the present project. Of course, the analysis will always present a point of view, as one cannot avoid choosing criteria for the selection or starting from an interpretative hypothesis, but the aim is to render the perspective as objective as possible. In other words, “while in deconstructionism we proceed by digging inside a single case—all the better if it is exceptional—chosen in a rather idiosyncratic manner, in semiotic analysis we explain our research hypothesis and our own corpus building procedures, avoiding extraordinary examples and focusing instead on a series of ordinary cases that are significant because they demonstrate regularity.” (ibidem)

In document Redefining the Anthology: Forms and Affordances in Digital Culture (Page 140-149)