Interactive content analysis : evaluating interactive variants of non-negative Matrix Factorisation and Latent Dirichlet Allocation as qualitative content analysis aids



Interactive Content Analysis: Evaluating Interactive

Variants of Non-negative Matrix Factorisation and

Latent Dirichlet Allocation as Qualitative Content

Analysis Aids

Aneesha Bakharia

Bachelor of Engineering in Microelectronic Engineering, Griffith University
Master of Digital Design, Griffith University

A THESIS SUBMITTED TO

THE FACULTY OF SCIENCE AND ENGINEERING OF QUEENSLAND UNIVERSITY OF TECHNOLOGY IN FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

Principal Supervisor: Professor Peter Bruza
Associate Supervisor: Associate Professor Jim Watters
Associate Supervisor: Dr Bhuva Narayan
Associate Supervisor: Dr Laurianne Sitbon

Information Ecology Discipline
Information Systems School
Faculty of Science and Engineering
Queensland University of Technology


Statement of Original Authorship

The work contained in this thesis has not been previously submitted to meet requirements for an award at this or any other higher education institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made.

Signature:

Date:

Copyright in Relation to This Thesis

© Copyright 2014 by Aneesha Bakharia. All rights reserved.

QUT Verified Signature


To my Grandmother, Rada Bakharia.


Abstract

Numerous software applications exist to assist researchers with qualitative content analysis. Despite the growing number of these applications, researchers continue to argue that there has been little progress in computer-aided content analysis over the last few decades. Although more sophisticated algorithms for automated thematic discovery have been available, mainstream adoption has been limited. The main contributing factors have been a lack of trust in automation and a lack of interactivity provided by the algorithms. Interactivity is seen as a means by which a researcher can incorporate domain knowledge to better contextualise the automatically derived themes for their own research. This research was designed to directly address the issues of trust and interactivity associated with thematic discovery algorithms. The central hypothesis was that the use of interactive thematic discovery algorithms able to incorporate domain knowledge improves the ability of a content analyst to respond to their research questions.

Recent thematic discovery algorithms such as Non-negative Matrix Factorisation (NMF) and Latent Dirichlet Allocation (LDA) are particularly suited to the task of finding latent themes within document collections. NMF produces a matrix decomposition in which the two resulting matrices contain only non-negative values, and simultaneously groups together both the words and the documents that belong to themes. LDA is a generative model that represents a document as a mixture of themes, each of which is a probability distribution over words. Both algorithms possess features that are not found in other statistical procedures such as Latent Semantic Analysis and k-means clustering. NMF and LDA are not hard clustering solutions and have the capacity to assign documents to multiple themes.


This research evaluated an approach to the analysis of qualitative research data that uses thematic discovery algorithms. In order to achieve this, the themes derived by manual coders were compared with the themes derived with the aid of a thematic discovery algorithm. A key finding was that fine-grained themes based on specific domain-related relationships were missing from the list of themes derived by the thematic discovery algorithm. While interactivity was seen as a means for content analysts to provide additional relationships such as domain knowledge to the algorithm, the nature of the interactivity was ill-defined and had to be determined from a carefully designed experiment. The types of interactivity that qualitative content analysts required were the ability to create themes by specifying the words that make up a theme, the ability to merge themes, and the ability to split themes. Interactive variants of NMF and LDA, namely Penalised Non-negative Matrix Factorisation (PNMF) and Dirichlet Forest Latent Dirichlet Allocation (DF-LDA), that matched the interactivity requirements were selected and evaluated by recruited experiment participants. The performance of both algorithms was comparable, and through analysis of participant usage of the algorithms, evidence in support of the research hypothesis was gathered. Via an iterative process, participants were able to supply domain knowledge to the algorithms and discover themes that were directly related to their research questions.

Numerous design guidelines for the implementation of thematic discovery algorithms as qualitative content analysis aids emerged from the analysis of experiment results. These design guidelines formed the basis for the development of an Evaluation Model for Interactive Content Analysis which was made up of four key categories, namely: Trust, Interpretability, Theme Coherence and Interactivity. A Conceptual Framework for Interactive Content Analysis was also proposed that provides a theoretical foundation for a qualitative analysis process where interaction occurs between the analyst and the thematic discovery algorithm. The Conceptual Framework for Interactive Content Analysis incorporates an interactive feedback loop as a means for the researcher to provide their domain knowledge to an algorithm and set the context to answer their research questions.


Keywords

Qualitative Content Analysis, Computer-Aided Content Analysis, Inductive Content Analysis, Conventional Content Analysis, Thematic Analysis, Topic Modelling, Text Analysis, Non-negative Matrix Factorisation, Latent Dirichlet Allocation, NMF, LDA


Acknowledgments

I would like to thank Professor Peter Bruza for his excellent supervision and for introducing me to the potential of the NMF and LDA algorithms at a very early stage in my research. I would also like to thank Professor Peter Bruza for his patience, guidance and insight all along my journey to completion. My associate supervisors, Associate Professor Jim Watters, Dr Bhuva Narayan and Dr Laurianne Sitbon have my sincere thanks for always being available to listen and provide both timely and expert advice. I thank Associate Professor Jim Watters for sharing his expertise on research methodologies and always providing inspiration. I especially thank Dr Bhuva Narayan for always sharing her knowledge, being approachable and inspiring.

Embarking on a PhD is a journey that I would not have been able to complete were it not for the love, support and guidance of my Mum (Juleka), Dad (Abdulah), Aunts (Julekhai, Hajira and Kulsum) and Uncle (Ebrahem). I began my journey with my Grandmother sitting beside me as I wrote my proposal for this PhD and although she has passed away, I finished the PhD with her very much in my thoughts and heart.

I would also like to thank the members of the QI Research Group at QUT (Kirsty Kitto, Bevan Koopman, Mike Symonds, David Galea, Lance Devine, Rune Rasmussen and Guido Zuccon) for always being willing to hear about my research and provide insightful feedback.

A special thanks also goes out to my friends, Nhu Tran, Shane Dawson, Nadia Chester, Elizabeth Hall, Deidre Seeto and Craig Comerford for their support and advice along the way.


Table of Contents

Abstract v

Keywords vii

Acknowledgments ix

List of Figures xxi

List of Tables xxiv

1 Introduction 1

1.1 Overview . . . 2

1.2 Background . . . 3

1.3 Factors Affecting the Usage of Computational Aids . . . 4

1.3.1 Lack of Transparency . . . 5

1.3.2 Lack of Interactivity . . . 5

1.3.3 Lack of Accurate Comparative Studies . . . 6

1.3.4 Ability to Address Research Questions . . . 6

1.3.5 Accessibility and Usability . . . 7

1.3.6 Quality of Derived Themes . . . 7

1.3.7 Research Methodology Bias . . . 8

1.4 Thematic Discovery Algorithms . . . 8

1.5 Research Significance . . . 10


1.7 Hypothesis and Research Questions . . . 12

1.8 Research Scope . . . 13

1.9 Chapter Summary . . . 14

1.10 Structure of the Thesis . . . 15

2 Literature Review 17

2.1 Defining Content Analysis . . . 18

2.1.1 The History of Content Analysis . . . 19

2.1.2 The Definition of Content Analysis . . . 20

2.1.3 Types of Qualitative Content Analysis . . . 21

2.1.4 The Qualitative Content Analysis Process . . . 23

2.2 Issues Affecting the Uptake of Computer-Aided Analysis . . . 26

2.2.1 Human versus Computer-Aided Coding Comparative Studies . . . 28

2.3 A Survey of the Functionality of Content Analysis Software . . . 29

2.4 A Conceptual Foundation for Content Analysis . . . 34

2.5 Qualitative Research Methodologies . . . 36

2.5.1 Grounded Theory . . . 37

2.6 Evaluation of Manual Content Analysis . . . 40

2.6.1 Reliability . . . 40

2.6.2 Validity . . . 42

2.7 Algorithms for Conventional Content Analysis . . . 42

2.7.1 Applying NMF and LDA as Content Analysis aids . . . 43

2.7.2 Non-negative Matrix Factorisation . . . 45

2.7.3 Latent Dirichlet Allocation . . . 50

2.8 Semi-supervised Variants of NMF and LDA . . . 54

2.8.1 Constrained Clustering . . . 56

2.8.2 NMF Interactive Variants . . . 57


2.8.3 LDA Interactive Variants . . . 63

2.8.4 Logic-LDA . . . 65

2.9 Evaluation of Computer-Aided Content Analysis . . . 67

2.9.1 The UCI Metric . . . 68

2.9.2 The UMass Metric . . . 70

2.9.3 Metrics based on Distributional Semantics . . . 72

2.9.4 Aggregate Coherence Metrics . . . 73

2.9.5 Research Gaps in Theme Evaluation . . . 73

2.10 Summary and Research Direction . . . 74

3 Research Methodology 77

3.1 Introduction . . . 78

3.2 Rationale for the Design-based Research Methodology . . . 79

3.3 Research Design . . . 81

3.3.1 Experiment 1: Determining Interactivity Requirements . . . . 83

3.3.2 Experiment 2: Semantic Validity Enhancement . . . 86

3.3.3 Experiment 3: Interactive Content Analysis for Qualitative Researchers . . . 88

3.4 The Application of Design-based Research . . . 90

3.5 Theory Development . . . 92

3.6 Summary . . . 92

4 Determining Interactivity Requirements 93

4.1 Introduction . . . 94

4.2 Experiment Design . . . 94

4.3 Participant Recruitment . . . 95

4.4 Participant Environment . . . 95

4.5 Dataset . . . 96


4.6.1 Group A . . . 96

4.6.2 Group B . . . 97

4.7 Analysis . . . 101

4.7.1 Analysis of Manually Coded Themes . . . 101

4.7.2 Analysis of Algorithmically Derived Themes . . . 107

4.7.3 Comparison of Manual and Algorithmically Derived Themes . 112

4.7.4 Finding the Number of Themes . . . 113

4.7.5 Determining Interactivity Requirements . . . 116

4.7.6 Theme Labelling . . . 121

4.8 Discussion . . . 123

4.9 Summary . . . 125

5 Conceptual Foundations for Interactive Content Analysis 127

5.1 Overview . . . 128

5.2 Conceptual Foundation for Content Analysis . . . 128

5.3 Thematic Discovery Algorithms as Analytical Constructs . . . 130

5.4 Conceptual Foundation for Interactive Content Analysis . . . 130

5.5 An Evaluation Model for Interactive Content Analysis . . . 133

5.5.1 Theme Coherence . . . 138

5.5.2 Interpretability . . . 138

5.5.3 Interactivity . . . 140

5.5.4 Trust . . . 142

5.6 Designing a Questionnaire for the Evaluation Model . . . 144

5.6.1 Questionnaire Section 1 - Theme Coherence Questions . . . . 145

5.6.2 Questionnaire Section 2 - Interpretability Questions . . . 146

5.6.3 Questionnaire Section 3 - Trust Questions . . . 146

5.6.4 Questionnaire Section 4 - Interactivity Questions . . . 147


5.7 Summary . . . 148

6 Algorithms for Interactive Content Analysis 151

6.1 Overview . . . 151

6.2 Mapping Interactivity Requirements to Constrained Clustering Terminology . . . 152

6.3 Non-negative Matrix Factorisation Interactive Variants . . . 154

6.3.1 Graph Regularised Non-negative Matrix Factorisation . . . . 154

6.3.2 Penalised Matrix Factorisation . . . 155

6.3.3 Semantic Kernels and Kernel Non-negative Matrix Factorisation . . . 155

6.4 Latent Dirichlet Allocation . . . 156

6.4.1 Dirichlet Forest LDA . . . 156

6.4.2 Logic-LDA . . . 157

6.5 Comparison of Interactive Content Analysis Algorithms . . . 158

6.6 Summary . . . 161

7 Evaluating Interactive Algorithms 163

7.1 Introduction . . . 163

7.2 Experiment 2: Semantic Validity Enhancement . . . 165

7.2.1 Participant Recruitment . . . 165

7.2.2 Participant Environment . . . 166

7.2.3 Dataset Selection . . . 166

7.2.4 Participant Research Questions . . . 167

7.2.5 User Interface Design . . . 168

7.2.6 Participant Session Tracking . . . 175

7.2.7 Evaluation . . . 176

7.2.8 Questionnaire Analysis . . . 176

7.2.9 Analysis of Participant Sessions . . . 187


7.2.11 Discussion . . . 196

7.3 Experiment 3: Interactive Content Analysis . . . 200

7.3.1 Research Process . . . 201

7.3.2 Researcher A . . . 201

7.3.3 Researcher B . . . 208

7.3.4 Discussion . . . 212

7.4 Addressing Research Questions . . . 214

7.5 Summary . . . 216

8 Conclusion and Future Directions 217

8.1 Background . . . 217

8.2 Addressing the Research Questions . . . 219

8.2.1 Research Question 1. Are thematic discovery algorithms such as NMF and LDA suitable computational aids for inductive content analysis? . . . 219

8.2.2 Research Question 2. What are the types of interactivity that content analysts require when using a theme discovery algorithm such as NMF and LDA? . . . 221

8.2.3 Research Question 3. How do semi-supervised variants of NMF and LDA meet the interactivity requirements of qualitative content analysts? . . . 223

8.2.4 Research Question 4. Does the addition of domain knowledge via an interactive and iterative process improve a researcher's ability to address research questions? . . . 225

8.3 Theoretical Contributions . . . 227

8.3.1 Design Guidelines for Interactive Qualitative Content Analysis Software . . . 227

8.3.2 Conceptual Foundation for Interactive Content Analysis . . . 229


8.3.3 Evaluation Model for Interactive Content Analysis . . . 231

8.4 Research Implications . . . 232

8.5 Research Limitations . . . 234

8.6 Future Research Directions . . . 235

8.6.1 Interactivity Requirements . . . 235

8.6.2 Qualitative Research Methodology . . . 236

8.6.3 Thematic Discovery Algorithm Design . . . 237

8.6.4 User Interface Design . . . 237

8.6.5 Theme Coherence Metrics . . . 238

8.7 Summary . . . 238

A Appendix A: Source Material for Experiments 241

A.1 Experiment 1 . . . 242

A.1.1 Dataset Excerpt . . . 242

A.1.2 Template for Group A (Manual Coders) Participants . . . 243

A.1.3 Instructions for Group B (Algorithm-aided) participants . . . 244

A.2 Experiment 2 . . . 245

A.2.1 Instructions Provided to Participants . . . 245


List of Figures

1.1 The basic functions of Computer-aided Qualitative Data Analysis Software (CAQDAS) tools . . . 3

1.2 An illustrated overview of Thematic Discovery Algorithms . . . 10

1.3 A flowchart of the main thesis chapters that address research questions. 14

2.1 Comparing conventional and directed content analysis . . . 22

2.2 Qualitative Data Analysis Framework . . . 24

2.3 Conceptual Foundation of Content Analysis . . . 35

2.4 The display of themes derived from the LDA and NMF algorithms. . . 44

2.5 Factor model for the Latent Dirichlet Allocation algorithm. . . 52

2.6 Factor model for the DF-LDA algorithm . . . 65

2.7 Fold·all factor graph for Logic-LDA . . . 67

3.1 The four phases of Design-based research. . . 80

3.2 A flowchart illustrating the main thesis chapters that address research questions. . . 84

4.1 Example theme template provided to manual coders (Group A). . . . 97

4.2 Instructions provided to Group B participants in Experiment 1. . . 98

4.3 The user interface to display derived themes for Group B participants in Experiment 1. . . 98

4.4 The keyword-in-context tool for Group B participants in Experiment 1. 99


pants in Experiment 1. . . 99

4.6 The online questionnaire interface provided to Group B participants in Experiment 1. . . 100

4.7 Example theme template completed by manual coders in Group A. . . 102

4.8 Venn diagrams illustrating theme overlap present in themes manually coded by participants in Group A. . . 106

4.9 Group B participant selection of the number of themes (k) in Experiment 1. . . 115

5.1 Conceptual Foundation of Content Analysis . . . 129

5.2 Conceptual Foundation for Interactive Content Analysis. . . 132

5.3 Evaluation Framework for Conceptual Foundation of Interactive Content Analysis. . . 137

7.1 Instructions provided to participants prior to completing the research task in Experiment 2. . . 169

7.2 The user interface provided to participants in Experiment 2. . . 169

7.3 The display of themes generated by interactive algorithms in Experiment 2. . . 170

7.4 Keyword-In-Context functionality included within the user interface provided to participants in Experiment 2. . . 171

7.5 Theme creation included within the user interface provided to participants in Experiment 2. . . 172

7.6 Removing theme creation entries in the autocomplete control. . . 172

7.7 Creating multiple rules within the user interface provided to participants in Experiment 2. . . 173

7.8 The interface to merge themes provided to participants in Experiment 2. 173

7.9 The interface to split themes provided to participants in Experiment 2. 174

7.10 The number of themes validation message displayed to participants. . 174


7.11 A summary of the rules specified by a participant for review and re-creation. . . 175

7.12 The interface provided to participants in Experiment 2 to rate and label themes. . . 175

7.13 Radar plots for the Theme Coherence Questionnaire Section. . . 178

7.14 Radar plots for the Interpretability Questionnaire Section. . . 181

7.15 Radar plots for the Trust Questionnaire Section. . . 182

7.16 Radar plots for the Interactivity Questionnaire Section. . . 184

7.17 Group A Participant Sessions (DF-LDA). . . 189

7.18 Group B Participant Sessions (PNMF). . . 190

7.19 Spearman rank-order correlation coefficients for all Group A (DF-LDA) participants. . . 194

7.20 Spearman rank-order correlation coefficients for all Group B (PNMF) participants. . . 195

7.21 Spearman rank-order correlation coefficients for the initial theme review by Researcher A. . . 205

7.22 Spearman rank-order correlation coefficients for the initial theme review by Researcher B. . . 210

A.1 Example theme template provided to manual coders (Group A) in Experiment 1. . . 243

A.2 Instructions provided to Group B participants in Experiment 1. . . 244

A.3 Instructions provided to participants prior to completing the research task in Experiment 2. . . 245


List of Tables

2.1 Coding differences between the three approaches to content analysis . 24

2.2 Criteria for evaluating Computer-aided Qualitative Data Analysis Software (CAQDAS). . . 32

2.3 Comparative analysis of functionality included in Computer-aided Qualitative Data Analysis Software (CAQDAS). . . 33

2.4 Characteristics of five qualitative approaches . . . 37

2.5 The three stages of Grounded Theory analysis. . . 38

2.6 Sample dataset for illustrating properties of the NMF and LDA algorithms. . . 47

2.7 Display of simultaneous word and document clustering produced by NMF. . . 50

2.8 Display of simultaneous word and document clustering produced by LDA. . . 54

2.9 UMass Coherence Scores for topics from the NIH Corpus . . . 71

3.1 Research activity mapped to phases within the Design-based Research Methodology . . . 81

3.2 Mapping research questions to experiments. . . 82

4.1 Profiles for participants recruited for Experiment 1. . . 95

4.2 Themes derived by Participants in Group A. . . 103

4.3 Themes derived by Participants in Group B . . . 108


content analysis. . . 121

5.1 Design Guidelines for Interactive Qualitative Analysis Software. . . . 134

6.1 Interactivity Requirements for Algorithms Used to Support Qualitative Content Analysis. . . 152

6.2 Interactivity requirements for algorithms used to support Qualitative Content Analysis. . . 153

6.3 Domain Rules included in DF-LDA. . . 157

6.4 First Order Logic (FOL) rules included in Logic-LDA. . . 158

6.5 A comparison of NMF and LDA interactive algorithm variants. . . 160

7.1 Evaluation Model Questionnaire Section 1: Theme Coherence Collated Responses. . . 177

7.2 Evaluation Model Questionnaire Section 2: Interpretability Collated Responses. . . 179

7.3 Evaluation Model Questionnaire Section 3: Trust. . . 181

7.4 Evaluation Model Questionnaire Section 4: Interactivity Collated Responses. . . 183

7.5 Summary statistics for Group A and Group B participant sessions. . . 188

7.6 Linking rules to research questions for Group A and Group B participants. . . 191

7.7 Initial themes derived by Researcher A. . . 203

7.8 Initial themes derived by Researcher B. . . 209

8.1 Full list of Design Guidelines for Interactive Content Analysis. . . 228

A.1 Excerpt from Dataset used in Experiment 1 and Experiment 2 . . . 242


Chapter 1

Introduction

Content analysis is a systematic approach to the identification and interpretation of themes within text. It is an essential analysis and inference technique employed by researchers involved in qualitative research. A variety of qualitative research methodologies dictate the use of content analysis, with most researchers performing content analysis manually by reading all of the textual research data. Content analysis completed entirely in a manual manner is tedious and time consuming, and the process could be greatly enhanced by the use of computational aids. In particular, recent developments in thematic discovery algorithms (a form of unsupervised machine learning) could play an important role in identifying the key themes present in a collection of textual documents. However, the use of algorithms and advanced computational techniques as qualitative content analysis aids has not achieved mainstream acceptance and adoption. The use of advanced computational aids is becoming imperative as the size of text-based research data continues to increase. The research presented in this thesis aims to address the issues affecting the adoption of computational techniques for qualitative content analysis by proposing the use of recent thematic discovery algorithms.

In this introductory chapter, the research significance will be established by exploring the predominant reasons for the lack of adoption of advanced computational techniques (more specifically, recent advances in thematic discovery algorithms). This will allow the key research gaps to be identified, along with a clearly articulated central hypothesis and supporting research questions. The research background, motivation and scope will also be briefly discussed. An overview of each chapter in the thesis is also provided.

1.1

Overview

Researchers, particularly within the social sciences and humanities, encounter and also create diverse textual datasets within their qualitative research studies, which require analysis. Data originates from traditional sources such as workshops, focus groups, interviews and surveys, where content analysis is performed on open-ended textual data in order to either verify or generate a theory. In recent years, however, with the proliferation of social media, researchers are able to gain access to larger datasets from blogs and microblogging networks such as Twitter. In many instances it is impossible for a researcher, or even a team of researchers, to read each textual response and perform analysis manually. The need for computational aids to assist in the process of qualitative content analysis is therefore becoming increasingly important.

In recent years two algorithms have been introduced with properties that are ideally suited to the task of grouping similar documents or shorter textual responses together, namely Non-negative Matrix Factorisation (NMF) (Seung and Lee, 2001) and Latent Dirichlet Allocation (LDA) (Blei et al., 2003). Both algorithms are able to simultaneously group words and documents into themes (also known as clusters or topics) and allow documents to be assigned to more than one theme. While advances have mainly focused on improvements in the speed, accuracy and convergence of the algorithms, some research is beginning to emerge on how humans interpret themes (Chang et al., 2009), and utilise these algorithms within social sciences research (Ramage et al., 2009). A number of extensions to NMF and LDA that support interactivity have also been proposed.
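The non-negativity and soft-membership properties just described can be made concrete with a small sketch. The following is an illustrative NMF implementation using the Lee and Seung multiplicative update rules on an invented toy term-document matrix (the vocabulary, counts and parameters are not drawn from this thesis, and a production analysis would use an established library): both factor matrices remain non-negative throughout, and the mixed document receives weight in more than one theme.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy term-document matrix V (documents x terms); values are word counts.
# Vocabulary and counts are invented purely for illustration.
vocab = ["teacher", "student", "class", "exam", "forum", "post", "reply", "thread"]
V = np.array([
    [3, 2, 2, 1, 0, 0, 0, 0],   # response about classroom teaching
    [2, 3, 1, 2, 0, 0, 0, 0],
    [0, 0, 0, 0, 3, 2, 2, 1],   # response about forum discussion
    [0, 0, 0, 1, 2, 3, 1, 2],
    [1, 1, 0, 0, 1, 1, 1, 0],   # mixed response: touches both themes
], dtype=float)

k = 2  # number of themes
n_docs, n_terms = V.shape
W = rng.random((n_docs, k)) + 0.1   # document-theme weights
H = rng.random((k, n_terms)) + 0.1  # theme-term weights

# Lee & Seung multiplicative updates: W and H stay non-negative because
# each update multiplies a non-negative value by a ratio of non-negative terms.
eps = 1e-9
for _ in range(300):
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

for t in range(k):
    top = np.argsort(H[t])[::-1][:3]
    print(f"theme {t}: " + ", ".join(vocab[i] for i in top))

# Row-normalised W gives per-document theme proportions: the mixed
# document loads on more than one theme (soft clustering).
print((W / W.sum(axis=1, keepdims=True)).round(2))
```

Row-normalising W is what allows a single response to serve as evidence for several themes at once, which is precisely the property that distinguishes NMF and LDA from hard clustering procedures such as k-means.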

How can recent advances in text analysis and theme discovery algorithms help researchers to make sense of the diverse existing and newly created textual document collections? The research outlined in this thesis addresses this question by reviewing recent advances in thematic discovery algorithms and by investigating the extent to which these algorithms are able to serve as inductive content analysis aids.


1.2

Background

Krippendorff (2012) describes content analysis as a “careful, detailed, systematic examination and interpretation of a body of material to identify patterns, themes, biases and meanings”. The Neuendorf (2002) definition states that content analysis is “the use of replicable and valid methods for making specific inferences from text to other states or properties of its source”. Researchers usually adhere to a qualitative methodology such as Phenomenology, Grounded Theory or Ethnography while conducting research, with content analysis being performed as a key data analysis step. The data analysis step of many qualitative methodologies requires theme and pattern extraction (Creswell, 2012, p 78-79) and would benefit from the thematic discovery algorithms proposed within this research.

Figure 1.1: The basic functions of Computer-aided Qualitative Data Analysis Software (CAQDAS) tools, reproduced from Silver and Patashnick (2011).

Content analysis historically has been viewed as tedious, time-consuming and demoralising (Danielson and Mullen, 1965). Although Computer-aided Qualitative Data Analysis Software (CAQDAS) is available, the main focus of such software has been to facilitate manual coding and serve as text retrieval systems (Lewins and Silver, 2007; Silver and Fielding, 2008). Figure 1.1, reproduced from Silver and Patashnick (2011), shows the functionality that is currently available in popular CAQDAS such as ATLAS.ti (ATLAS.ti Scientific Software Development GmbH, 2013) and NVivo (QSR International, 2013). The inclusion of exploratory data analysis tools that incorporate advanced computational techniques, in particular, is lacking in current CAQDAS (see Section 2.3: A Survey of the Functionality of Content Analysis Software). Tools with more sophisticated algorithms, such as CATPAC (Doerfel and Barnett, 1999) and Leximancer (Smith and Humphreys, 2006), which perform semantic and cluster analysis, have been available but are still not widely used. The content analyst is nevertheless still required to read all of the textual content in order to identify the main themes. Though ten years old at the time of writing this thesis, the criticism of Berg (2004, p 372) that no progress has been made in the last few decades is still relevant.

1.3

Factors Affecting the Lack of Widespread Usage of Computational Aids

In this section, I will discuss seven factors that have contributed to the lack of widespread usage of computational aids to support qualitative content analysis. Essentially, qualitative researchers do not trust the derived output of computational techniques because the algorithm being used is not disclosed, no support for interacting with the algorithm is provided, and the derived themes may be of poor quality and may not answer the research question. This impedes the researchers’ ability to trust the derived output and present credible research findings.

CAQDAS in general also has issues that have had a negative impact on adoption. According to di Gregorio (2010), CAQDAS is not used in everyday practice by senior practitioners. The main reasons put forward by di Gregorio (2010) include the claims that CAQDAS is not user-friendly, that it requires significant institutional training and support, and that it demands a considerable investment of time by the researcher. In many cases where CAQDAS has been used, the usage is at a very superficial level, mainly to perform data management (Lu and Shulman, 2008). According to Fielding (1998), there is little use of Boolean retrieval, which is readily available in most CAQDAS. In the following subsections, seven concerns raised about CAQDAS will be discussed.


1.3.1 Lack of Transparency

Content analysts have a sceptical view of software products that promise theme discovery but do not describe or expose their algorithms (Krippendorff, 2004). The lack of transparency regarding the algorithms used in many of these programs has reduced the credibility that a researcher has in the derived themes and has resulted in a lack of trust in the software. The use of a form of artificial intelligence in QualRus (The Idea Works, Inc., 2013) is described as controversial as no disclosure is made about the nature of the technique used to suggest codes (tags or classifications), with Lewins and Silver (2006, p 273) going on to say that “the functioning of this aspect of the software needs to be well understood early on in order to make effective and appropriate use of the available tools”.

Semantic Validity is important in establishing credibility. Semantic Validity is the degree to which textual statements (documents, paragraphs, text segments) correspond to the predetermined themes to which they have been mapped. Certain commercial software packages do not display the derived themes in a manner that the analyst can explore in order to gather supporting evidence. In some cases the algorithm being used is able to map a theme either to the documents within the theme or to the words that belong to the theme, but not to both the words and documents together. The inability of these algorithms to provide theme output in the form of both words and documents makes it difficult for researchers to gather supporting evidence for the presence of a theme. Graphical depictions that represent clusters of documents with circles and connecting lines are harder to interpret than textual representations that show document titles and main keywords (Hearst, 2009, p 209). It is difficult to know what a document is actually about without at least reading some of the text (Kleiboemer et al., 1996).
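The theme display being argued for here can be sketched directly: given the two factor matrices that NMF or LDA produce, each theme can be presented as its top-weighted words paired with its top-weighted documents. The helper below (`theme_report`, its data and all names are invented for illustration, not taken from any CAQDAS package) shows the idea.

```python
import numpy as np

def theme_report(W, H, vocab, doc_titles, n_words=4, n_docs=2):
    """Pair each derived theme with its top words AND its top documents,
    so an analyst can check semantic validity against the source text.
    Illustrative helper; matrices would normally come from NMF or LDA."""
    lines = []
    for t in range(H.shape[0]):
        words = [vocab[i] for i in np.argsort(H[t])[::-1][:n_words]]
        docs = [doc_titles[i] for i in np.argsort(W[:, t])[::-1][:n_docs]]
        lines.append(f"Theme {t}: {', '.join(words)}  |  evidence: {'; '.join(docs)}")
    return "\n".join(lines)

# Invented factor matrices standing in for thematic discovery output.
vocab = ["grade", "exam", "forum", "reply"]
titles = ["Interview 1", "Interview 2", "Survey response 7"]
W = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])          # document-theme
H = np.array([[0.7, 0.6, 0.0, 0.1], [0.0, 0.1, 0.8, 0.7]])  # theme-word
print(theme_report(W, H, vocab, titles))
```

Presenting words and documents side by side lets the analyst click through from a theme label to the underlying text, which is exactly the supporting-evidence trail that the opaque packages criticised above fail to provide.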

1.3.2 Lack of Interactivity

The incorporation of interactivity is also an issue that has not been addressed directly in relation to content analysis. Denzin and Lincoln (2005, p 638) describe interactivity in terms of the lack of support for situated and contextualized analysis from an analyst’s perspective. They make a valuable point when they say “Software programs for the qualitative researcher need to be interactive, allowing for many different interpretive spaces to emerge, spaces that connect patterns with meanings and experience”. Interactivity is seen as a means by which a content analyst can provide domain knowledge specifically related to the research question that the analyst is seeking to answer. The content analyst needs to use thematic discovery algorithms in an iterative manner by reviewing the derived themes, supplying additional domain knowledge as required and repeating the process. There is, however, very limited research that details the exact nature of the required interactivity.
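The feedback loop described here can be illustrated with a deliberately simple mechanism: boosting analyst-chosen seed words in one theme's weights before NMF's multiplicative updates run, which biases the factorisation toward a theme built around those words. This is only a stand-in for the penalty- and prior-based approaches (PNMF, DF-LDA) evaluated later in the thesis; the corpus, seed words and boost value are invented.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["teacher", "class", "lesson", "stress", "anxiety", "worry"]
V = np.array([
    [3, 2, 2, 0, 0, 0],   # teaching-related responses
    [2, 3, 1, 0, 0, 0],
    [0, 0, 0, 3, 2, 2],   # wellbeing-related responses
    [0, 0, 1, 2, 3, 2],
], dtype=float)

k = 2
W = rng.random((V.shape[0], k)) + 0.1
H = rng.random((k, V.shape[1])) + 0.1

# Analyst-supplied domain knowledge: "build a theme around these words".
# Boosting the seed words in H[0] before the updates steers theme 0
# toward a wellbeing theme (a crude stand-in for PNMF/DF-LDA constraints).
seed_words = {"stress", "anxiety"}
for j, word in enumerate(vocab):
    if word in seed_words:
        H[0, j] += 5.0

eps = 1e-9
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

top3 = {vocab[i] for i in np.argsort(H[0])[::-1][:3]}
print("seeded theme words:", sorted(top3))
```

After reviewing the resulting themes, the analyst can adjust the seed words and re-run, which is the iterative review-refine-repeat cycle this section argues for.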

1.3.3 Lack of Accurate Comparative Studies

There has generally been a lack of studies comparing themes derived from manual coding with themes derived with the assistance of computational aids, in particular thematic discovery algorithms. Certain studies have used the term “computer-aided” liberally to encompass a number of techniques. Conway (2006) compared the coding results obtained by humans with computer-assisted coding in the analysis of political campaign coverage in newspapers, concluding that the two processes yielded significantly different results: the computer-aided coders discovered broader areas of coverage as opposed to the more nuanced results obtained by manual coders. However, the study used a limited and very basic text-search interface, and its findings cannot be generalised to larger bodies of text. Nevertheless, such studies have led to assumptions about computer-aided analysis that have produced complacency regarding further research in this direction, while qualitative researchers continue to distrust such analysis.

1.3.4 Ability to Address Research Questions

Analysts use a model, known as an analytical construct, to operationalise the knowledge that the analyst has in relation to the context of the research. The analytical construct is a key element in Krippendorff’s conceptual foundation for content analysis (Krippendorff, 2004, p 35). The analytical construct is used by the analyst to make the appropriate inferences in order to answer the research questions. In simplistic terms, the analytical construct can be thought of as a set of rules that the analyst employs.


When an algorithm is used to aid content analysis, it serves, in part, as an analytical construct. Most algorithmic approaches rely on word frequency and proximity to determine similarity, but this may not be a valid assumption for all content analysis scenarios (Weitzman, 2005). There is a fear that researchers will be misled by focusing on quantity (frequency counts) instead of actual meaning, whether frequent or not (Odena, 2012). In many instances an algorithm may derive statistically valid themes, but these themes may be of no relevance to the research questions the analyst is seeking to answer (Krippendorff, 2004).
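The limits of frequency-based similarity can be illustrated with a minimal sketch (the example sentences are invented, and cosine similarity over raw term-frequency vectors is a deliberate simplification of what thematic discovery algorithms compute): two statements that share meaning but few words can score lower than two statements that share words but differ in meaning.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between raw term-frequency vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values()))
    norm *= math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

# Same meaning, little word overlap ...
same_meaning = cosine_similarity("the film was excellent", "the movie was superb")
# ... versus shared words with opposite meaning.
shared_words = cosine_similarity("the film was excellent", "the film was terrible")
print(same_meaning, shared_words)  # 0.5 0.75: word overlap outweighs meaning
```

The second pair scores higher despite expressing opposite sentiments, which is exactly the concern Odena (2012) raises about quantity standing in for meaning.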

1.3.5 Accessibility and Usability

While natural language processing and machine learning researchers publish new algorithms for text analysis, the source code for these algorithms is rarely released. This effectively places the algorithms beyond the reach of qualitative content analysts. Even when source code is released, considerable data processing skills are still required to convert data into the correct format and execute the algorithms from a command line interface. The lack of documentation, combined with the absence of a user-friendly interface, puts these algorithms out of the reach of most qualitative content analysts. Ramage et al. (2009) chose to directly address this issue by integrating the LDA algorithm with Microsoft Excel, a spreadsheet package that most researchers have access to and are familiar with.

1.3.6 Quality of Derived Themes

A major obstacle to the adoption and user acceptance of thematic discovery algorithms outside of the algorithm development community is trust (Mimno et al., 2011). Thematic discovery algorithms sometimes produce themes that are of poor quality, i.e., themes made up of a mix of unrelated words (Mimno et al., 2011). This greatly reduces user confidence (Mimno et al., 2011) and the use of these algorithms in real-world tasks (Newman et al., 2010). A derived theme of poor quality (e.g., a theme with the main words being: banana, sky, canoe, furious) can still be statistically significant but of no real value to end users (Newman et al., 2010).
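The intuition behind automated theme-quality scoring can be sketched with a simplified co-document-frequency measure in the spirit of the UMass coherence metric of Mimno et al. (2011); the corpus and topic word lists below are invented for illustration. Words from a coherent theme tend to appear in the same documents, while the words of a junk theme do not.

```python
import math
from itertools import combinations

def umass_coherence(topic_words, documents):
    """Simplified co-document-frequency coherence (after Mimno et al.,
    2011): higher scores mean the topic's words co-occur in documents."""
    docs = [set(d.lower().split()) for d in documents]
    def df(*words):  # number of documents containing all the given words
        return sum(1 for d in docs if all(w in d for w in words))
    return sum(math.log((df(w1, w2) + 1) / max(df(w2), 1))
               for w1, w2 in combinations(topic_words, 2))

corpus = [
    "students submit the assignment online",
    "students review the assignment online",
    "the sky was clear",
    "a banana for breakfast",
    "the sky again",
    "another banana snack",
]
coherent = umass_coherence(["students", "assignment", "online"], corpus)
junk = umass_coherence(["banana", "sky", "canoe"], corpus)
print(coherent > junk)  # the mixed-word theme scores lower
```

A metric of this kind gives a positive score to the co-occurring theme and a negative score to the unrelated word mix, which is how poor-quality themes like the banana/sky/canoe example can be flagged before an analyst sees them.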


1.3.7 Research Methodology Bias

In the 1990s desktop computers became more readily accessible and affordable, resulting in the widespread use of computers and their incorporation into everyday-life activities. CAQDAS and computational techniques to aid content analysis began to emerge at this time as well. In 1990 Strauss and Corbin (1990) published their book on the Grounded Theory methodology, which was widely read by qualitative practitioners. As a result, Grounded Theory heavily influenced the features that were included in qualitative software (Kelle et al., 1995). Qualitative content analysts were concerned that the epistemologies of the software developers would influence the data analysis tools and bias the analysis towards a particular methodology (di Gregorio, 2010). CAQDAS were thought of as “presupposing a way of doing research” and blamed for de-contextualising the research data (Lu and Shulman, 2008, p 180). Most CAQDAS include generic tools that are applicable to a range of methodologies (Tesch, 1990), but concerns about methodology bias, towards Grounded Theory (Glaser and Strauss, 1967) in particular, are still present.

1.4 Thematic Discovery Algorithms

In the previous section, issues affecting the widespread adoption of computational aids for qualitative content analysis were identified. These issues are many, complex and varied. As it is not possible to address all of them within the scope of a single thesis, I concentrate on the most significant issues that can be addressed by recent advances in thematic discovery algorithms. I have therefore chosen to focus on the three issues that directly affect trust, namely transparency, interactivity and quality. I propose the use of two very well known thematic discovery algorithms, namely Non-negative Matrix Factorisation (NMF) (Seung and Lee, 2001) and Latent Dirichlet Allocation (LDA) (Blei et al., 2003). Although NMF and LDA are motivated from different mathematical perspectives, NMF from linear algebra and LDA from probabilistic graphical models, both algorithms have properties that make them applicable to support the qualitative content analysis process:


• The output of both algorithms can be interpreted as a network that links documents to themes and themes to words. The link values are only positive. This is an advantage over earlier techniques such as Latent Semantic Analysis (LSA) (Landauer and Dumais, 1997), where the negative values produced were difficult to interpret. The themes derived from NMF and LDA are easily interpretable, which allows content analysts to gather supporting evidence for the existence of a theme.

• Neither NMF nor LDA places a document into only a single theme; rather, each document is placed into all of the themes that are relevant to it. NMF and LDA are not hard clustering algorithms like k-means, in which a choice must be made about which theme or topic to place a document within (Hearst, 2009, p 209). Documents are simultaneously about multiple topics (e.g., “Should an article on government controls on RU486 be placed in a cluster on pharmaceuticals or one on women’s rights, religion, or politics?” (Hearst, 2009, p 209)). Figure 1.2, reproduced from Blei (2012), provides an illustration of the topics within a corpus and the words within a single document that contribute to the various topics.

• The “theory of meaning” behind the mathematical models underpinning NMF and LDA has been published and is highly cited. Both algorithms have also been implemented in a variety of programming languages and are available under open source licenses. This has enabled a good level of transparency, with both algorithms being continually extended and evaluated.

• Various extensions also exist that facilitate the addition of domain knowledge by allowing users to provide information about documents that should be grouped together in the same theme. These variants are examples of semi-supervised algorithms and support the addition of domain knowledge via constraints (Basu et al., 2008). These algorithms have the potential to meet the interactivity requirements of qualitative content analysts.

• Numerous Theme Coherence metrics have been proposed (Mimno et al., 2011; Newman et al., 2010; Stevens and Buttler, 2012) for NMF and LDA. These metrics have the potential to identify themes that are of poor quality.
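The first two properties can be made concrete with a small sketch of NMF itself, using the Lee and Seung multiplicative update rules on an invented five-document, four-word term matrix (this assumes NumPy is available; LDA exposes the same two-way structure as probability distributions rather than non-negative weights):

```python
import numpy as np

def nmf(V, k, iters=200, seed=0):
    """Factorise V ≈ W @ H with W, H >= 0 using the Lee and Seung
    multiplicative update rules. Rows of V are documents, columns are
    words; W holds document-theme weights, H holds theme-word weights."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + 0.1
    H = rng.random((k, m)) + 0.1
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

# Invented document-term matrix: docs 0-1 use theme A's words,
# docs 2-3 use theme B's words, and doc 4 mixes both.
V = np.array([[3, 2, 0, 0],
              [2, 3, 0, 0],
              [0, 0, 3, 2],
              [0, 0, 2, 3],
              [2, 1, 1, 2]], dtype=float)
W, H = nmf(V, k=2)
proportions = W / W.sum(axis=1, keepdims=True)
print(np.round(proportions, 2))  # doc 4 gets weight on both themes
```

Because every entry of W and H is non-negative, each theme can be read directly as a ranked word list and each document as a mixture of themes, which is the interpretability and soft-membership behaviour described in the first two points above.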


Figure 1.2: An illustrated overview of Thematic Discovery Algorithms, reproduced from Blei (2012).

1.5 Research Significance

The significance of the proposed research relates to the widespread use of qualitative content analysis, the emergence of large document collections from social media sources and the growing demand for computational techniques that remove the need for researchers to perform analysis manually (Burnap et al., 2013). There is a lack of published research studies that focus on the evaluation of computational techniques (specifically thematic discovery algorithms) applied to qualitative content analysis. Various issues also exist that have in the past impeded the use of computational techniques.

Numerous researchers have criticised the computer-aided content analysis techniques currently in use because of their lack of interactivity and inability to incorporate the domain knowledge of the analyst (Denzin and Lincoln, 2005, p 638). Interactivity is largely seen as a way to contextualise the analysis so as to better answer the research questions being asked by content analysts. Within the proposed research, theme discovery is evaluated as an interactive and iterative process. The proposed research is significant because the findings have the potential to make thematic discovery algorithms (NMF and LDA) more contextually relevant, accessible and credible to content analysts.

As far as I am aware, this is the first research to address the issues that have contributed directly to the lack of widespread usage of computational aids for qualitative content analysis. This research proposes the use of thematic discovery algorithms (NMF and LDA) and extensions to these algorithms that support interactivity. The findings of this research have the potential to transform the way qualitative content analysis is performed. In particular, the proposed interactive thematic discovery algorithms have the ability not only to improve the trust and credibility that a researcher has in the derived themes, but also to enable researchers to answer questions related directly to their research and gather supporting evidence.

1.6 Research Motivation

I am motivated to bring together two diverse discipline areas by applying thematic discovery algorithms as computational aids to qualitative content analysis. Content analysis is often described as tedious and time-consuming. Although computer-aided qualitative content analysis software is widely used (Neuendorf, 2002), the main focus of such software has been to facilitate manual coding and serve as text retrieval systems (Lewins and Silver, 2007; Silver and Fielding, 2008). The content analyst is nevertheless still required to read all of the textual content in order to identify the main themes. The main motivation behind this research is to simplify the workflow surrounding theme discovery and encourage the use of more sophisticated algorithmically aided techniques.

I am also motivated to address issues that have had a negative impact on the use of computational techniques for qualitative content analysis. I am particularly interested in improving the content analyst’s trust and confidence in the themes derived with the assistance of a thematic discovery algorithm. This is important because the uptake of thematic discovery algorithms depends not only on the availability of these algorithms and their ability to incorporate domain knowledge, but also on the willingness of content analysts to use these computational aids. Increased confidence can only be achieved if the proposed algorithms and their output can be substantiated with evidence. This will improve research credibility and increase confidence in thematic discovery algorithms, thereby promoting the use of these advanced computational aids.

1.7 Hypothesis and Research Questions

In this research I will address the lack of transparency and interactivity in algorithms used as computational aids for qualitative content analysis. As a hypothesis I propose that:

• Performing conventional content analysis with thematic discovery algorithms that allow the inclusion of domain and contextual information via an interactive and iterative process improves the semantic validity of the themes produced in relation to the context of the research study.

To assess the above mentioned hypothesis, I need to consider the suitability and applicability of thematic discovery algorithms as qualitative content analysis aids. This can be achieved by comparing the themes derived from manual coding with the themes derived with the aid of a thematic discovery algorithm. I will also allow themes derived via NMF and LDA to be rated by the research participants. This leads to Research Question 1.

Interactivity is seen as essential to allow analysts to use their domain knowledge to better contextualise the derived themes in relation to their research questions, thereby improving semantic validity. Interactivity, however, is ill-defined, with no studies specifically seeking to identify the types of manipulations that are required. Within this research I must therefore determine the interactivity requirements. This is reflected in Research Question 2.

Once the interactivity requirements are known, I then need to select and evaluate algorithms that meet these requirements. This supports the inclusion of Research Question 3.


I also need to determine whether the use of algorithms that support the addition of domain knowledge via an interactive and iterative process leads to improved semantic validity. This need is reflected in Research Question 4.

In order to support the hypothesis, I need to address the following research ques-tions:

• Research Question 1. Are thematic discovery algorithms such as NMF and LDA suitable computational aids for inductive content analysis?

• Research Question 2. What are the types of interactivity that content analysts require when using a theme discovery algorithm such as NMF and LDA?

• Research Question 3. How do semi-supervised variants of NMF and LDA meet the interactivity requirements of qualitative content analysts?

• Research Question 4. Does the addition of domain knowledge via an interactive and iterative process improve a researcher’s ability to address research questions?

1.8 Research Scope

Content analysis as a research methodology is applicable to a variety of media (audio, video, images, etc.), but within this research it is restricted to data of a textual nature. The scope of this research is limited to the evaluation of algorithms that are able to support summative and conventional (inductive) content analysis, as opposed to directed (deductive) content analysis. Algorithms such as Non-negative Matrix Factorisation and Latent Dirichlet Allocation are able to discover latent themes within textual data and are therefore appropriate algorithmic aids for content analysis studies where no a priori coding scheme is required. Directed content analysis, on the other hand, involves content analysts mapping textual responses to categories from a predefined coding scheme. Directed content analysis is a classification task and more suited to supervised learning algorithms such as Support Vector Machines (Cortes and Vapnik, 1995). The evaluation of supervised learning algorithms as computational aids to directed content analysis is beyond the scope of this research.
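The supervised setting can be sketched in code (the categories and training sentences are invented, and a minimal nearest-centroid word-overlap classifier stands in for the Support Vector Machine that would be used in practice): directed analysis starts from labelled examples of a predefined coding scheme and assigns new text to one of those categories, unlike NMF and LDA, which discover the categories themselves.

```python
from collections import Counter

def centroid(texts):
    """Build a normalised word-frequency profile for one category."""
    c = Counter()
    for t in texts:
        c.update(t.lower().split())
    total = sum(c.values())
    return {w: n / total for w, n in c.items()}

def classify(text, centroids):
    """Assign text to the category whose word profile overlaps most."""
    words = text.lower().split()
    def score(profile):
        return sum(profile.get(w, 0.0) for w in words)
    return max(centroids, key=lambda label: score(centroids[label]))

# An a priori coding scheme with labelled training examples per category
# (both categories and sentences are hypothetical).
training = {
    "usability": ["the interface was confusing",
                  "menus were hard to navigate"],
    "performance": ["the search was slow",
                    "pages took long to load"],
}
centroids = {label: centroid(texts) for label, texts in training.items()}
print(classify("the navigation menus were confusing", centroids))
```

The key contrast with the thematic discovery algorithms evaluated in this thesis is that the coding scheme here is fixed in advance; the algorithm only maps text onto it.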


1.9 Chapter Summary

In this chapter I identified and discussed issues that have affected the adoption of computational aids for qualitative content analysis. The key contributing factor was found to be a lack of interactivity. Interactivity was seen largely as a means by which a researcher could add domain knowledge. The lack of interactivity in turn affected researchers’ ability to trust the findings uncovered using a computational technique. Recent algorithms, however, have features that are well suited to supporting theme discovery in textual document collections, and interactive variants of these algorithms have emerged.

In this thesis I aim to evaluate thematic discovery algorithms within the context of qualitative content analysis. I was able to construct a clear hypothesis relating to the use of thematic discovery algorithms and the impact interactivity has on semantic validity. Four key research questions were also posed to help address the central hypothesis.


1.10 Structure of the Thesis

This section contains an outline of the thesis structure. Figure 1.3 shows the key chapters in this thesis that address the four research questions introduced in Section 1.7.

In Chapter 2, I conduct a comprehensive literature review into the use of computational techniques as aids for qualitative content analysis. Various aspects of qualitative content analysis and thematic discovery algorithms are also explained. Within the literature review, research gaps are identified and mapped to the research questions proposed in this chapter. Key metrics for the evaluation of thematic discovery algorithms are also introduced, as they will be used in the experiments.

In Chapter 3, the research design and chosen research methodology are outlined. The rationale for selecting the design-based research methodology is detailed. Three experiments are then designed, with each experiment seeking to answer specific research questions. The first experiment explores the suitability of thematic discovery algorithms as computational aids for qualitative content analysis, compares the themes derived by human coders with the themes derived by an algorithm, and determines interactivity requirements. The second experiment uses the interactivity requirements obtained from the first experiment and evaluates, with content analysts, two thematic discovery algorithms that meet these requirements. In the first and second experiments, the evaluation is conducted using a small corpus, mainly to allow participants to complete their analysis in a short timeframe. In the third experiment, researchers analyse their own datasets using interactive thematic discovery algorithms.

Chapter 4 contains a description of the first experiment (Experiment 1: Determining Interactivity Requirements) in terms of the dataset, participants and methods of evaluation. The results of the experiment are also reported and analysed in a systematic manner. Numerous design guidelines relating to the use of theme discovery algorithms as qualitative content analysis aids emerge from the analysis, and the interactivity requirements for theme discovery algorithms are determined.


In Chapter 5, I analyse the feedback received in the first experiment and construct a theoretical model for the use of thematic discovery algorithms as aids to the qualitative content analysis process. I also develop an evaluation framework for interactive content analysis. A survey instrument is developed with a set of questions for each of the criteria in the evaluation framework. The survey is primarily developed for use in the second experiment.

In Chapter 6, I select and evaluate interactive variants of thematic discovery algorithms against the criteria determined in Chapter 4 (Experiment 1: Determining Interactivity Requirements). The theory behind semi-supervised algorithms that support instance-level constraints is summarised, and the mathematical underpinning of each algorithm is explained. Finally, two algorithms are selected for evaluation in the second experiment.

Chapter 7 details the second (Experiment 2: Semantic Validity Enhancement) and third (Experiment 3: Interactive Content Analysis for Qualitative Researchers) experiments in which the two interactive variants of thematic discovery algorithms are evaluated. The chapter explains the experimental design for each experiment including information about the datasets used, the participant recruitment and the evaluation process. The results for both experiments are evaluated and a comparative review of the results obtained from both algorithms is presented. As with Chapter 4, additional design guidelines that emerge during analysis are documented.

Chapter 8 presents the results and findings of the research conducted and documented within this study, discussing the findings emerging from all three experiments. The design guidelines for the use of interactive thematic discovery algorithms as aids for qualitative content analysis are consolidated across the three experiments. The final versions of the Conceptual Framework for Interactive Content Analysis and the Evaluation Model for Interactive Content Analysis are presented. Research implications and future research directions are also discussed.


Chapter 2

Literature Review

The aim of this study is to address issues affecting the adoption of computational techniques for qualitative content analysis by proposing the use of recent thematic discovery algorithms that support interactivity. In order to achieve this aim, a thorough understanding of the factors that have impeded the uptake of algorithmic computational aids must first be obtained. The primary focus of this Literature Review chapter is to uncover issues impeding the uptake of algorithmic computational aids for qualitative content analysis and to identify recent algorithmic advancements that are able to adequately address these issues. The main issues identified include a lack of analyst trust in computational techniques and a lack of support for interactivity within these techniques.

The Literature Review begins with a brief history of content analysis, then seeks to explore and analyse various definitions of qualitative content analysis. Three types of qualitative content analysis (i.e., Summative, Conventional and Directed) are then described and mapped to various machine learning algorithms (Supervised, Unsupervised and Semi-Supervised), with Conventional (also known as Inductive) content analysis being selected as the focus of the research documented within this thesis.

The next section in this Literature Review highlights the need for computational aids in the qualitative content analysis process. Various prior studies, comparatively analysing the results obtained by content analysts not using computational aids with those using computational aids, are then reviewed with the aim of uncovering additional reasons for the lack of mainstream usage. A review of various computer-aided qualitative software packages is also included in this Literature Review to highlight the functionality and types of algorithms that are not currently implemented and readily available for use by content analysts. A section on qualitative content analysis frameworks and methodologies is included to gain an understanding of the analysis and coding process from a data analysis and methodological perspective, again with a focus on identifying where computational aids may fit into the workflow of a content analyst. Reliability and validity are then discussed in the context of qualitative content analysis evaluation.

The final section of this Literature Review chapter focuses on algorithms that are able to address the issues identified as impeding the uptake of computational aids for qualitative content analysis, by introducing theme discovery algorithms. The Non-negative Matrix Factorisation (NMF) and Latent Dirichlet Allocation (LDA) algorithms are detailed because both allow corpus documents to belong to multiple themes and have variants that are able to incorporate domain knowledge and support interactivity. The chapter concludes with a discussion of metrics and evaluation approaches for NMF and LDA and identifies shortcomings in applying these evaluation techniques in the context of qualitative content analysis.

2.1 Defining Content Analysis

It is important to gain a clear understanding of content analysis and the different approaches to content analysis currently in use, as this provides a sound foundation for the Literature Review. This section begins with a background on the emergence of content analysis and a brief history of the field. Various definitions of content analysis are considered in terms of the types of activity they encompass. Three types of content analysis (Summative, Conventional and Directed) are introduced and mapped to various machine learning algorithms. Because building computational aids to address all three types of content analysis is not feasible, Conventional content analysis is selected as the research focus for this thesis. The Miles and Huberman (1994) Data Analysis framework, which proposes a content analysis workflow that is generic in nature and supports all three types of content analysis, is then introduced.


2.1.1 The History of Content Analysis

The emergence of content analysis can be traced back to the introduction of the printing press, a time at which the church started to analyse non-religious printed material. As a technique, content analysis became more popular with the mass production and distribution of newspapers in the 20th century (Krippendorff, 2004, p. 3). The first qualitative newspaper analysis sought to answer a rhetorical question: ’Do newspapers now give the news?’ (Speed, 1893; cited in Krippendorff, 2004). In this study, newspapers published between 1881 and 1893 were analysed, and it was found that the focus of coverage had shifted from religion and science to gossip and sports. As mass media continued to infiltrate society, content analysis was extended to study other types of media including radio, television and movies.

In the 1930s the survey research method became popular, and this led to an abundance of open responses that required analysis. Notable content analysis studies from this period include Woodward’s (1934; cited in Krippendorff, 2004) use of qualitative analysis of newspapers as a technique for opinion research and McDiarmid’s (1937; cited in Krippendorff, 2004) analysis of US presidential inauguration speeches.

According to Riffe and Freitag (1997; cited in Neuendorf, 2002), over a 24-year period there was a sixfold increase in content analysis articles published in Journalism and Mass Communications Quarterly, a prominent journal focusing on mass media. By the mid-1980s, content analysis had been incorporated into 84 percent of journalism masters degree programs in the US (Fowler, 1986; cited in Neuendorf, 2002). Content analysis is now used as a qualitative research method in diverse domains including communications research, political science, criminology, psychology, sociology, business, literary analysis and education.

The use of computers to aid with the qualitative analysis of textual content dates back to the 1960’s. Most researchers at the time viewed manual content analysis as a time-consuming, tedious and demoralising task (Danielson and Mullen, 1965) and inter-coder reliability issues were encountered when large datasets were analysed. It was presumed that computers would eliminate the need for multiple coders.


The General Inquirer was the first software to focus on content analysis. It performed word counts and also contained various dictionaries that were used to assess valence, emotion and cognitive orientation (Neuendorf, 2002, p. 231). A personal computer compatible version of the General Inquirer is still available. The dictionary-based approach can still be found in many current qualitative analysis software applications. Diction is the most prominent of these applications (Hart, 2001); it contains 33 dictionary lists made up of 10,000 words and can search documents for evidence of five main semantic features (activity, optimism, certainty, realism and commonality). Other examples include DIMAP (Litkowski, 1997) and LIWC (Pennebaker et al., 2007).

2.1.2 The Definition of Content Analysis

The simplest definition of content analysis is perhaps provided by Weber (1990), who describes content analysis as a means by which to categorise text. The definition provided by Bogdan and Biklen (1998), “. . . the coding and data interpretation of various forms of human communication”, extends Weber’s definition by incorporating a context for analysis. Krippendorff (2012) reflects the scientific character of content analysis, describing it as a “careful, detailed, systematic examination and interpretation of a body of material to identify patterns, themes, biases and meanings”. Neuendorf (2002), however, provides the most comprehensive definition by articulating the requirement for reliability and validity: content analysis is “the use of replicable and valid methods for making specific inferences from text to other states or properties of its source”.

In attempting to evaluate the use of more recent and sophisticated algorithms to aid the content analysis process, these definitions serve to guide the important topics that must be covered within this Literature Review. This includes the exploration of pattern and theme discovery algorithms in the context of coding, data interpretation, inference, reliability and validity. Krippendorff (2012) also provides a definition that incorporates the purpose of content analysis, describing it as a technique for making inferences about phenomena that cannot be observed directly.


2.1.3 Types of Qualitative Content Analysis

Leading on from the Weber (1990) definition of content analysis, in which text is coded into explicit categories to represent meaning and described using statistics, I sought a more formal and specific categorisation of widely used approaches to content analysis. Hsieh and Shannon (2005) articulate such a categorisation, finding that there are three distinct approaches to qualitative content analysis: Summative, Conventional and Directed. The differences between these approaches arise from coding scheme selection, the origination of codes and the trustworthiness of the approach (Hsieh and Shannon, 2005). Table 2.1, adapted from Hsieh and Shannon (2005), summarises the differences and maps the approaches to appropriate machine learning algorithms. Figure 2.1, reproduced from Elo and Kyngas (2008), illustrates the differences between conventional (inductive) and directed (deductive) content analysis. In the sub-sections that follow, a brief overview of each approach is presented along with an introduction to the main types of machine learning algorithms that match the approach. It is not feasible to develop computational aids addressing the issues relating to all three approaches, particularly as each approach maps to different types of algorithms. In order to reduce scope, this research focuses only on algorithms for conventional content analysis.

Summative Content Analysis

The goal of summative content analysis is to explore word usage in context (Hsieh and Shannon, 2005). Summative content analysis begins with the calculation of word frequency counts as a way to quantify word usage and then proceeds with the identification of words that need to be studied (Kondracki et al., 2002). Occurrences of the selected words are then located, usually via a search, and analysed in terms of their contextual usage. If the analysis is purely quantitative, it is referred to as manifest content analysis, as no inferred meaning within the text is explored (Potter and Levine-Donnerstein, 1999). Summative content analysis, however, seeks to go beyond pure word counts to achieve latent content analysis, where meaning is discovered from the context in which words are used (Babbie, 1992; cited by Hsieh and Shannon, 2007).
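The two summative steps, quantifying word usage and then reading selected words in context, can be sketched as follows (the survey responses are invented, and the keyword-in-context routine is a simple stand-in for the search step):

```python
import re
from collections import Counter

def word_frequencies(texts):
    """Step 1 of summative analysis: quantify word usage."""
    counts = Counter()
    for t in texts:
        counts.update(re.findall(r"[a-z']+", t.lower()))
    return counts

def keyword_in_context(texts, keyword, window=3):
    """Step 2: locate each occurrence with its surrounding words,
    so the analyst can read the contextual (latent) usage."""
    hits = []
    for t in texts:
        tokens = re.findall(r"[a-z']+", t.lower())
        for i, tok in enumerate(tokens):
            if tok == keyword:
                hits.append(" ".join(tokens[max(0, i - window):i + window + 1]))
    return hits

responses = [
    "The feedback on my essay was vague.",
    "I wanted more detailed feedback from the tutor.",
]
print(word_frequencies(responses).most_common(2))
print(keyword_in_context(responses, "feedback"))
```

The frequency counts identify candidate words worth studying; the concordance lines then support the move from manifest counting to latent reading of how each word is actually used.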


Figure 2.1: Comparing conventional and directed content analysis, reproduced from Elo and Kyngas (2008).


2.1. DEFINING CONTENT ANALYSIS 23

Conventional Content Analysis

Conventional content analysis is best suited to studies that are required to describe a phenomenon but for which limited existing theory and literature exist. A researcher conducting conventional content analysis derives the coding categories directly from the raw textual data. The themes are said to “emerge” or “flow” from the data (Hsieh and Shannon, 2005). No predefined coding schemes are used in conventional content analysis. Conventional content analysis is also known as inductive content analysis (Boyatzis, 1998; Elo and Kyngas, 2008; Mayring, 2000).

Within conventional content analysis, researchers participate in what is known as “open coding”. Text is read, line-by-line, with the researcher highlighting and tagging words that are representative of concepts. After the initial coding process, the tags or labels are grouped together into higher-level categories, which form the initial coding scheme. Qualitative methodologies like Grounded Theory and Phenomenology use conventional content analysis but go further to generate a theory or a theoretically connected approach from the raw textual data.
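As Table 2.1 indicates, conventional content analysis maps to unsupervised algorithms such as NMF, in which candidate themes emerge from the data with no predefined coding scheme. The following sketch implements NMF with the standard Lee–Seung multiplicative updates on a toy corpus; the corpus, topic count and iteration count are illustrative assumptions, and a real analysis would use a mature library (e.g. scikit-learn) on a full document-term matrix.

```python
# Unsupervised theme discovery via Non-negative Matrix Factorisation:
# V (documents x terms) is approximated by W (documents x topics)
# times H (topics x terms). High-weight terms in each row of H
# suggest a candidate theme for the analyst to interpret.
import numpy as np

def nmf(V, k, iterations=200, seed=0):
    """Factorise V into non-negative W and H using Lee-Seung
    multiplicative updates for the Frobenius-norm objective."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + 1e-3
    H = rng.random((k, m)) + 1e-3
    for _ in range(iterations):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

docs = ["students discuss the assignment forum",
        "forum posts discuss the assignment",
        "lecture recording covers the exam",
        "exam revision uses the lecture recording"]
vocab = sorted({w for d in docs for w in d.split()})
V = np.array([[d.split().count(w) for w in vocab] for d in docs], float)
W, H = nmf(V, k=2)
for topic in H:
    top = [vocab[i] for i in np.argsort(topic)[::-1][:3]]
    print(top)
```

Unlike supervised classification, nothing here prescribes the categories: the analyst inspects each topic's top-weighted terms and decides whether they constitute a meaningful emergent theme, which is precisely where the interactivity examined in this thesis becomes relevant.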

Directed Content Analysis

Directed content analysis is employed when an existing theory needs to be validated or conceptually extended. Directed content analysis is able to find either supporting or non-supporting evidence for a theoretical framework. With directed content analysis, the researcher approaches the data with a predefined set of analytic codes and categories known as a coding scheme, which has been derived to support existing theories. The researcher is required to read the text and assign it to a specified category. Directed content analysis is also referred to as deductive content analysis (Boyatzis, 1998; Elo and Kyngas, 2008; Mayring, 2000; Potter and Levine-Donnerstein, 1999).
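Because directed content analysis starts from a predefined coding scheme, it maps naturally to supervised classification (Table 2.1): segments already coded by the researcher train a model that suggests categories for new text. The sketch below implements a small multinomial Naïve Bayes classifier from scratch; the toy coding scheme and example segments are illustrative assumptions, not drawn from a real study.

```python
# Directed content analysis as supervised classification: researcher-coded
# (text, category) pairs train a multinomial Naive Bayes model with
# Laplace smoothing, which then suggests a category for unseen text.
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (text, category) pairs coded by the researcher."""
    word_counts = defaultdict(Counter)
    cat_counts = Counter()
    vocab = set()
    for text, cat in examples:
        tokens = text.lower().split()
        word_counts[cat].update(tokens)
        cat_counts[cat] += 1
        vocab.update(tokens)
    return word_counts, cat_counts, vocab

def classify(text, model):
    word_counts, cat_counts, vocab = model
    total = sum(cat_counts.values())
    best, best_score = None, -math.inf
    for cat in cat_counts:
        # log prior + Laplace-smoothed log likelihood of each token
        score = math.log(cat_counts[cat] / total)
        denom = sum(word_counts[cat].values()) + len(vocab)
        for tok in text.lower().split():
            score += math.log((word_counts[cat][tok] + 1) / denom)
        if score > best_score:
            best, best_score = cat, score
    return best

coded = [("praise for the helpful tutor", "positive feedback"),
         ("tutor was helpful and clear", "positive feedback"),
         ("confused by the unclear assessment", "negative feedback"),
         ("assessment criteria were confusing", "negative feedback")]
model = train(coded)
print(classify("the tutor was clear", model))
```

The contrast with the conventional approach is that the categories ("positive feedback", "negative feedback") exist before the data is examined, which is why supervised algorithms such as Naïve Bayes, Support Vector Machines and Decision Trees fit this approach.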

2.1.4 The Qualitative Content Analysis Process

In the previous section, three approaches to content analysis were described. In this section I focus on the workflow of the content analyst and introduce a framework that is general in nature and able to be used by all three of the approaches introduced in the


Table 2.1: Coding differences between the three approaches to content analysis, adapted from Hsieh and Shannon (2005).

Coding Approach | Study Begins With | Derivation of Codes | Algorithms
Summative | Keywords | Keywords identified before and during analysis. | Frequency counts and concordances.
Conventional | Observation | Categories developed during analysis. | Unsupervised and semi-supervised algorithms: NMF, LDA and clustering algorithms such as k-means.
Directed | Theory | Categories derived from pre-existing theory prior to analysis. | Supervised classification algorithms: Support Vector Machines, Decision Trees and Naïve Bayes.

previous section (Summative, Conventional and Directed). The Miles and Huberman (1994) Framework for Qualitative Data Analysis breaks the qualitative content analysis process down into four concurrently interacting streams (see Figure 2.2), namely: Data Collection, Data Reduction, Data Display, and Drawing Conclusions and Verification.

Figure 2.2: Qualitative Data Analysis Framework, reproduced from Miles and Huberman (1994, p. 11).



Data Reduction

The Data Reduction stream encapsulates all areas of analysis. Within this stream, summarisation and coding are the key foundational activities. Coding is the process of assigning labels (or codes) to text segments (words, phrases, sentences, etc.). Codes facilitate the process of identifying patterns and deriving themes from the summarised data. Initially, descriptive, low-inference codes are applied while the content analyst immerses themselves in the data. As familiarity with the data grows, higher-order interpretive codes are assigned. These higher-order codes are either prescribed, meaning that they have emerged from a theory (directed content analysis), or derived from the data itself (conventional content analysis). Irrespective of whether directed or conventional content analysis is employed, a definition or clear meaning needs to be attached to each code. This definition is essential to operationalise the code within the context of the research and the data being analysed. The key indicators that place a text segment within a category must be identifiable and documented to ensure reliability and provide an audit trail.
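The bookkeeping that Data Reduction implies — a code with an operational definition and key indicators, plus a documented record of each application of that code to a segment — can be made concrete with a small data-structure sketch. The field names and example codes here are illustrative assumptions, not a schema used by any particular tool.

```python
# A codebook entry carries the code's operational definition and the key
# indicators that place a segment within it; every application of a code
# is recorded so the coding decisions form an audit trail.
from dataclasses import dataclass

@dataclass
class Code:
    label: str
    definition: str      # operationalises the code for this study
    indicators: list     # key indicators that place text in this code

@dataclass
class Coding:
    code: Code
    segment: str         # the text segment the code was applied to
    rationale: str       # documented reason, supporting the audit trail

codebook = {"ENG": Code("ENG", "Expressions of learner engagement",
                        ["asks questions", "responds to peers"])}
audit_trail = [Coding(codebook["ENG"],
                      "I replied to three classmates this week",
                      "responding to peers indicates engagement")]
print(audit_trail[0].code.label, "->", audit_trail[0].segment)
```

Keeping the rationale alongside each coding decision is what makes the trail auditable: another analyst can check that the stated indicators genuinely appear in the segment.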

Data Display

The Data Display stream encompasses how the data is laid out, organised and summarised for perusal by a content analyst. Miles and Huberman (1994) stress that the “display of data” is crucial to the process of conducting qualitative analysis; better display, in essence, leads to improved and valid conclusions (Miles and Huberman, 1994, p. 11).

Drawing Conclusions and Verification

Miles and Huberman (1994) advocate the use of abstraction and comparison in drawing and verifying research findings. Abstraction is the process by which lower-level concrete concepts are grouped together into higher-level abstract (or more general) concepts. Comparison is seen as a fundamental technique for the identification of patterns. Comparison essentially leads to greater abstraction and the conceptualisation of higher-level concepts.
