Chapter 9 Discussion, Future Work, and Conclusion

9.4 Future Work

Figure 9-1. Summary of algorithms adapted and created in the toolbox, and algorithms which can be used for improving the toolbox in the future.

Abbreviations: Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), Diff Latent Dirichlet Allocation (DiffLDA), Diff Author-Topic Model (DiffATM), Natural Language Processing (NLP), Hidden Markov Model (HMM), Dynamic Bayesian Network (DBN), and ProM, an open-source process mining framework.

To build a toolbox that automatically extracts process models of writing processes and provides visualisations illustrating aspects of collaborative writing, this thesis creates and adapts several algorithms from two main fields: text mining and process mining.

Figure 9-1 summarises all the algorithms in the toolbox. Each oval depicts an algorithm developed in one of the two main fields, and the grey ovals show the algorithms used or created in this thesis. Text mining algorithms are used extensively in the heuristic that automatically identifies writing activities. Latent Semantic Analysis (LSA) (Landauer & Dumais, 1997; Landauer et al., 2007) is used to compute the text cohesion of each revision and to detect cohesion changes during the writing process. The LSA-based document clustering algorithm Lingo (Osinski & Weiss, 2005) is used for extracting topics and calculating topic overlap. A probabilistic graphical modelling technique, Latent Dirichlet Allocation (LDA) (Blei et al., 2003), is also employed: DiffLDA (Thomas, 2011; Thomas et al., 2010a) is adapted to extract topics for topic evolution charts, and the Diff Author-Topic Model (DiffATM) is created to construct topic-based collaboration networks. From the field of process mining, two techniques are used for extracting writing process models. The first is based on the process mining framework ProM (ProM, 2013), such as Heuristic Miner (Weijters & Ribeiro, 2010; Weijters et al., 2006). The second is based on Markov models, such as the Hidden Markov Model (HMM) (Rabiner, 1989). Several further techniques could be integrated into the toolbox, as shown in Figure 9-1. For instance, a Dynamic Bayesian Network (DBN) can be used to model collaborative writing processes and extract patterns of collaborative writing activities, and Natural Language Processing (NLP) techniques can be used to improve the heuristic for automatically identifying collaborative writing activities. The following subsections are concerned only with techniques that can be used to improve the toolbox in the future.
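To make the Markov-based view of a writing process mentioned above more concrete, the following is a minimal sketch that estimates first-order transition probabilities between writing activities. It is a plain, fully observable Markov chain, which is simpler than the HMM used in this thesis, and the activity labels in the usage note are hypothetical placeholders rather than the activity set defined by the heuristic.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate first-order transition probabilities from activity sequences.

    `sequences` is a list of lists of activity labels, one list per document.
    Returns a nested dict: probs[current_activity][next_activity] -> probability.
    """
    counts = defaultdict(Counter)
    for sequence in sequences:
        # Count every adjacent pair (current activity, next activity).
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1

    probabilities = {}
    for current, following_counts in counts.items():
        total = sum(following_counts.values())
        probabilities[current] = {
            following: n / total for following, n in following_counts.items()
        }
    return probabilities
```

For example, calling `transition_probabilities([["drafting", "revising", "editing"], ["drafting", "drafting", "revising"]])` yields a model in which "drafting" is followed by "revising" in two of three observed transitions. A DBN or HMM would add hidden states or further variables on top of this kind of transition structure.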

9.4.1 Improving the Heuristic with Natural Language Processing

In this thesis, the heuristic for automatically identifying collaborative writing activities is based purely on text mining methods. In the future, it could incorporate techniques from natural language processing (NLP). Recently, Bronner and Monz (2012) proposed a method for automatically distinguishing between factual and fluency edits performed on Wikipedia articles. Factual edits alter meaning, whereas fluency edits improve style or readability. Their approach is based on supervised machine learning using language model probabilities, string similarity measured over different representations of user edits, comparison of part-of-speech tags and named entities, and a set of adaptive features extracted from large amounts of unlabelled user edits. Although the method requires a large amount of labelled data, which can be acquired from Wikipedia, it achieves high classification accuracy. Other NLP techniques that may help to improve the accuracy and effectiveness of automatically identifying writing activities include recognising textual entailment, identifying paraphrases, and simplifying sentences. For instance, if a sentence in the current revision can be identified as a paraphrase of the same sentence in the previous revision, the text edit that transforms the sentence can be designated as a revising activity. When these techniques are applied to collaborative writing processes, however, a problem arises in ensuring that the sentence in the current revision corresponds to the same sentence in the previous revision. Nevertheless, natural language processing techniques appear to be a promising avenue for automatically identifying writing activities.
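The sentence-alignment problem described above can be illustrated with a minimal sketch. It pairs each sentence of the current revision with its most similar sentence in the previous revision using character-level string similarity from Python's standard `difflib` module, and treats a close but non-identical match as a candidate revising activity. This is only a surface-similarity stand-in for real paraphrase identification, not the method of Bronner and Monz (2012); the function names, the labels, and the threshold of 0.6 are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def best_match(sentence, previous_sentences):
    """Return (index, similarity) of the most similar previous sentence."""
    best_index, best_score = None, 0.0
    for index, previous in enumerate(previous_sentences):
        score = SequenceMatcher(None, sentence, previous).ratio()
        if score > best_score:
            best_index, best_score = index, score
    return best_index, best_score


def label_sentences(previous_sentences, current_sentences, threshold=0.6):
    """Label each current sentence as 'unchanged', 'revising', or 'new'.

    The threshold is an illustrative assumption and would need tuning
    on labelled revision data.
    """
    labels = []
    for sentence in current_sentences:
        index, score = best_match(sentence, previous_sentences)
        if score == 1.0:
            label = "unchanged"   # identical sentence found in previous revision
        elif score >= threshold:
            label = "revising"    # similar sentence: candidate paraphrase/rewrite
        else:
            label = "new"         # no sufficiently similar sentence
        labels.append((sentence, index, label))
    return labels
```

A paraphrase that shares few surface characters with its source would be labelled "new" by this sketch, which is exactly the gap that entailment and paraphrase models are meant to close.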

The aforementioned natural language processing techniques for classifying text edits, especially those of Bronner and Monz (2012), are based on several features of the writing process, chosen from hundreds of possible features. In Wikipedia text edit classification, researchers use not only text-based features, similar to those in this work, but also other types of features such as surface, vandalism, and revert. Surface edits affect mark-up segments. Vandalism edits deliberately compromise Wikipedia’s integrity, and revert edits restore a previous state of a page. Which features should be included in the classification of text edits, and how they should be weighted, remains an open research question. Recently, neural network techniques have regained attention as a way to learn features from a set of inputs and labelled outputs. Interestingly, Sutskever et al. (2011) use recurrent neural networks to generate text, character by character, given an initial set of words or phrases. Although the technique requires a great deal of computational resources, it will be interesting to see, once sufficiently powerful computers are easily accessible, whether neural network techniques can be incorporated to improve the automatic detection of collaborative writing activities.

9.4.2 Improving Topic Extraction

To extract topics for computing topic overlap, the heuristic prefers the document clustering algorithm Lingo to topic modelling techniques such as Latent Dirichlet Allocation (LDA). The reason for this preference is that, unlike topic modelling, which outputs a topic as a related group of words (thus creating the interpretation difficulty previously noted), Lingo first finds the labels for topics before performing the clustering task. In addition, Lingo uses Latent Semantic Analysis, which is also used to compute cohesion. Therefore, both topic overlap and cohesion changes can be computed in one operation every time a new revision is produced. Recently, Liu et al. (2009) proposed a new technique for measuring the cohesion of classes in software repositories, based on the analysis of latent topics embedded in comments and identifiers in source code. This approach, named Maximal Weighted Entropy, uses topic modelling and information entropy measures to quantitatively evaluate the cohesion of classes in software. Interestingly, based on this technique and the topic evolution, the topics and the cohesion measure of a revision can be extracted by applying LDA only once. The only drawback of the approach is the time factor, as inferring the number of topics and the topic membership can take a significant amount of time (see Subsection 9.2.5). As more powerful computing resources become available and inference techniques improve to speed up the computation, topic modelling is likely to become a viable method for extracting both topics and cohesion.
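The idea of deriving a cohesion score from topic proportions with an entropy measure can be sketched as follows. This is an illustrative proxy only and is not the Maximal Weighted Entropy formula of Liu et al. (2009): it assumes that a revision has already been assigned a vector of topic proportions (for example by LDA) and scores the revision as more cohesive when its content concentrates on fewer topics. The function names are hypothetical.

```python
import math


def normalised_entropy(topic_proportions):
    """Shannon entropy of a topic distribution, scaled to the range [0, 1]."""
    total = sum(topic_proportions)
    if total <= 0:
        raise ValueError("topic proportions must sum to a positive value")
    if len(topic_proportions) < 2:
        return 0.0  # a single topic carries no uncertainty
    # Renormalise and drop zero entries, since 0 * log(0) is taken as 0.
    probabilities = [p / total for p in topic_proportions if p > 0]
    entropy = -sum(p * math.log(p) for p in probabilities)
    return entropy / math.log(len(topic_proportions))


def cohesion_proxy(topic_proportions):
    """Illustrative cohesion score: 1 when one topic dominates, 0 when uniform."""
    return 1.0 - normalised_entropy(topic_proportions)
```

Under these assumptions, a revision with proportions `[0.85, 0.05, 0.05, 0.05]` scores higher than one with `[0.25, 0.25, 0.25, 0.25]`, so a single LDA run per revision would supply both the topics and the input to the cohesion score.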

9.4.3 Creating Interactive Visualisations

All the visualisation types proposed in this thesis were created as proof-of-concept prototypes. The revision map provided to students as they were writing their documents was intended to serve the purpose of the user study (as described in Chapter 8). More interactive types of visualisation can be developed as well. For example, revision maps can be created using the JavaScript library D3.js (Bostock, 2012), which manipulates documents based on data and renders them using HTML, SVG and CSS. D3's emphasis on web standards offers the full capabilities of modern browsers without ties to a proprietary framework, combining powerful visualisation components with a data-driven approach to DOM manipulation.
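Because D3 binds data records to page elements, an interactive revision map would first need the revision history in a record-per-cell form. The following is a minimal sketch of that preparation step, written in Python; the rendering itself would still be done in the browser. The record fields (`revision`, `paragraph`, `words`, `changed`) are hypothetical, and aligning paragraphs by position is a simplifying assumption that ignores inserted, deleted, or moved paragraphs.

```python
import json


def revision_map_records(revisions):
    """Flatten a revision history into one record per (revision, paragraph) cell.

    `revisions` is a list of revisions, each a list of paragraph strings.
    """
    records = []
    for revision_index, paragraphs in enumerate(revisions):
        previous = revisions[revision_index - 1] if revision_index > 0 else []
        for paragraph_index, text in enumerate(paragraphs):
            # Compare with the paragraph at the same position in the previous
            # revision (positional alignment is an assumption of this sketch).
            before = previous[paragraph_index] if paragraph_index < len(previous) else ""
            records.append({
                "revision": revision_index,
                "paragraph": paragraph_index,
                "words": len(text.split()),
                "changed": text != before,
            })
    return records


def to_json(revisions):
    """Serialise the records so a browser-side script can load them."""
    return json.dumps(revision_map_records(revisions), indent=2)
```

Each record could then be joined to one cell of the map, with `changed` or `words` driving the cell's colour.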

As another example, revision maps can be designed around a split-attention effect: to understand the data, the teacher or researcher juxtaposes two windows, one showing the map and the other displaying the text, and goes back and forth between them to connect the data with the contents of a paragraph. With a bit of creativity, the visualisation could instead be integrated into the text itself, using parameters such as the colour of the text, the colour of the background, and a bar chart placed vertically in the margins, making the data more useful.
