Chapter 2 Literature Review
2.3 Process Mining
2.3.1 Analysing Writing Processes
Over the last thirty years, interest in discovering writing strategies and exploring the stages of the writing process has grown steadily. During the 1980s, researchers studied how authors wrote and revised successive drafts of their work, and proposed methods for analysing these procedures. Initially, these analyses were performed by hand: researchers collected hardcopies of revisions and analysed them manually, as in the studies conducted by Faigley & Witte (1981) and Boiarsky (1984). More recently, new software and advances in computational linguistics have allowed researchers to collect revisions in electronic form, together with logs or revision histories that include timestamp and authorship information. This partially automates the collection of revisions and allows researchers to concentrate on analysing the writing process. Many software applications are now used in the study of writing processes, ranging from single-user keystroke-logging applications such as InputLog (Leijten & Van Waes, 2006) to version-controlled document applications that support collaborative writing, such as Google Docs (Google Docs, 2013), EtherPad (EtherPad, 2013), and Wikipedia (Wikipedia, 2013).
Several researchers (Caporossi & Leblay, 2011; Leijten & Van Waes, 2006; Tillema et al., 2011) have used InputLog to study and analyse the writing processes of individual authors. Of particular interest is the study by Tillema et al. (2011), which investigated whether the (meta)cognitive activities (e.g. reading the assignment, planning, text production, and revising) of secondary school students during writing tasks, as measured by think-aloud techniques and keystroke logging, could be predicted by their individual writing styles: planning or revising. The researchers assumed that writing style was determined by the temporal distribution of (meta)cognitive activities across the writing process. A multilevel regression model was employed to model the occurrence of the (meta)cognitive activities over the course of the writing process. The results showed that, among all activities, the online temporal distributions of reading the assignment and of planning differed across different degrees of the students’ writing styles. Although this study investigated single authors, the analysis technique could nevertheless be applied to collaborative writing by computing the temporal distribution of (meta)cognitive activities across individual students’ writing processes.
Although a great deal of recent research has used Wikipedia’s revision history for applications in Natural Language Processing (Ferschke et al., 2013), these studies used the revision data and its history record as the basis for practical applications such as spelling correction, vandalism detection, automatic article quality assessment, or trustworthiness estimation. Research on automatically extracting and analysing collaborative writing processes is still scarce.
Among the researchers who have used Wikipedia’s revision history to analyse the evolution of Wikipedia articles, Han et al. (2011) applied a Markov model technique to analyse the lifecycle of articles. The authors defined six states through which an article usually passes before reaching convergence: 1) building structure, 2) contributing text, 3) discussing text, 4) contributing structure and text, 5) discussing structure, and 6) text/content agreement. Three features were used as observation variables to determine these states: 1) update type, including insertion, deletion, and modification; 2) content type, including structure, text, format, structure + text, text + format, and structure + format; and 3) revision granularity, including heading level, word level, sentence level, paragraph level, section level, and link level. A sequence of these observation variables was used to build a Markov model of a particular article, and revision cycle patterns were extracted from this model. Correlations between these patterns and human-evaluated quality classes were then used to assess the quality of an article automatically. It should be noted that, although the authors referred to hidden Markov models, the hidden states were simply the six states listed above. No learning algorithm was used to obtain the hidden states; instead, the Markov states were predefined based on the values of the three features used as observation variables. For example, inserting a heading was determined to be part of the “building structure” state, whereas inserting words in a paragraph was deemed part of the “contributing text” state.
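As an illustration of how such predefined states can be assigned without a learning algorithm, the following minimal Python sketch maps each observed revision, described by the three features, to a state through a fixed rule table and then counts the transitions between consecutive states. Only the two rules named in the example above are taken from the text. The third rule, the "unclassified" fallback, and the names STATE_RULES, assign_state, label_revisions, and count_transitions are assumptions made for illustration; this is not the implementation of Han et al. (2011).

```python
from collections import Counter

# Rule table keyed by (update type, content type, revision granularity).
# The first two rules follow the examples given in the text; the third
# is a hypothetical rule added only to show the shape of the table.
STATE_RULES = {
    ("insertion", "structure", "heading"): "building structure",
    ("insertion", "text", "word"): "contributing text",
    ("modification", "structure + text", "section"): "contributing structure and text",  # assumed
}


def assign_state(update_type, content_type, granularity):
    """Return the predefined state for one revision, or a fallback label."""
    key = (update_type, content_type, granularity)
    return STATE_RULES.get(key, "unclassified")  # fallback label is an assumption


def label_revisions(revisions):
    """Map a chronological list of feature triples to a list of states."""
    return [assign_state(*revision) for revision in revisions]


def count_transitions(states):
    """Count how often each state is directly followed by another state."""
    return Counter(zip(states, states[1:]))


# Hypothetical revision history of a single article.
revisions = [
    ("insertion", "structure", "heading"),
    ("insertion", "text", "word"),
    ("insertion", "text", "word"),
]
states = label_revisions(revisions)
transitions = count_transitions(states)
```

Because the mapping is a fixed lookup, the state sequence is fully determined by the observed features, which is why no hidden states need to be learned in this setting.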
Nonetheless, the approach of Han et al. outperformed the earlier results of Dalip et al. (2009), who worked on the same task and data but did not use features based on revision history. Features based on revision history therefore proved helpful not only for predicting the quality of Wikipedia articles but also for analysing the history of jointly authored documents such as Wikipedia articles.
Another study using Wikipedia revision history and types of text edits was performed by Zeng et al. (2006), who were the first researchers to develop and evaluate a model of article trustworthiness based on revision histories. Their model was based on author reputation, edit type features, and the trustworthiness of the previous revision. The edit type features used were addition and deletion, measured as the number of inserted and/or deleted words. The authors applied a dynamic Bayesian network (DBN) based on these features to estimate the trustworthiness of a revision from a sequence of previous states, i.e. revisions. Although this work was not related to the writing process, the proposed techniques, especially the DBN, could also be employed in analysing writing processes using revision histories and different types of features.
Although the works discussed above all involved collaborative writing by a web community such as Wikipedia, small-scale wikis in classrooms, which also provide revision histories, can be used to identify students’ collaborative writing patterns. Heeter and Jeong (2012) conducted a study to extract students’ collaborative processes in wikis in order to discover whether group members preferred to work individually (sequentially or in parallel) rather than collaboratively (or reciprocally). The authors systematically developed a coding scheme and then manually coded the text edits captured in the revision histories of a wiki. Based on sequences of coded text edits, they built Markov models and identified patterns in the action sequences that students performed in the wiki. Their results were consistent with prior research, which found that students preferred to work on an individual rather than a collaborative basis (Heeter & Jeong, 2012). Although this study analysed the sequences of individual and collaborative writing actions observed in the wiki, the proposed coding scheme was based on the raw text edits captured in the wiki’s revision history and did not explore the semantic nature of those edits.
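To make the step from coded edit sequences to a Markov model concrete, the following minimal Python sketch estimates first-order transition probabilities by counting how often one code directly follows another and normalising each row. The codes "I" (individual edit) and "C" (collaborative edit), the example sequences, and the function name transition_probabilities are hypothetical; they do not reproduce the coding scheme or data of Heeter and Jeong (2012).

```python
from collections import defaultdict


def transition_probabilities(sequences):
    """Estimate first-order Markov transition probabilities.

    Each sequence is a chronological list of action codes. The result maps
    each code to a dictionary of next-code probabilities (maximum-likelihood
    estimates, i.e. normalised transition counts).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for sequence in sequences:
        # Pair every action with the action that immediately follows it.
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1

    probabilities = {}
    for current, row in counts.items():
        total = sum(row.values())
        probabilities[current] = {code: n / total for code, n in row.items()}
    return probabilities


# Hypothetical coded edit sequences from two groups:
# "I" = individual edit, "C" = collaborative edit (assumed labels).
coded_sequences = [
    ["I", "I", "C", "I"],
    ["I", "I", "I", "C"],
]
probs = transition_probabilities(coded_sequences)
```

In this toy data, a high probability of moving from "I" to "I" would correspond to the kind of pattern in which students keep working individually; real analyses would need far longer sequences and a validated coding scheme.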
This concludes the overview of representative modelling techniques for analysing both individual and collaborative writing processes, such as multilevel regression models, hidden Markov models, and dynamic Bayesian networks. The following section presents existing techniques that use graphs and visualisation methods for analysing writing processes.