Topic-based Collaboration Networks - Visualising Collaborative Writing Processes

Chapter 8 Visualising Collaborative Writing Processes

8.4 Topic-based Collaboration Networks

For further analysis, it is useful to visualise how students collaborate around topics, with particular emphasis on ascertaining whether students develop their ideas and concepts independently or work together on the same topics. Figure 8-5 shows a topic-based collaboration network from a group of four students jointly writing a document for the prototype experiment, which is described in Section 8.6. Each node represents a student author: a square depicts the group coordinator and circles represent group members. A connection (link) between two nodes indicates that those two students have written about the same topics during their tasks. Figure 8-5 shows that the group coordinator a1 and group member a2 have both worked with all group members to draft, revise, and edit some of the document topics. The group coordinator is responsible for assigning writing tasks to individual members and for making sure the assigned tasks progress according to plan. Group members a3 and a4, however, have not written about the same topics. In other words, a3 and a4 have each worked with a1 and a2, but not with each other, to develop some topics.

Figure 8-5. A topic-based collaboration network for collaborative writing. The network is inspired by the social network proposed by Broniatowski and Christopher (2012). Nodes represent students a1 to a4. The square is the group coordinator and circles are group members. A connection between two nodes means that the two corresponding students have written about the same topics.

The contribution of this thesis toward accomplishing the visualisation lies in the creation of a Diff Author-Topic Model (DiffATM), which is an extension of the Author-Topic Model (ATM) (Rosen-Zvi et al., 2003). Just as DiffLDA overcomes the duplication effect in LDA, DiffATM is developed to deal with the duplication effect in ATM. In this research, similarly to DiffLDA, DiffATM is applied to text edits identified at the paragraph and word levels in order to extract topics. Instead of providing a cluster of topics per revision, however, DiffATM provides a cluster of topics per author. Based on a number of revisions, a particular author can be represented by a membership of the topics written in those revisions. As with DiffLDA for writing processes, the number of topics and the hyper-parameters of DiffATM are selected based on the trade-off between model fit and the simplicity of the model structure. In addition, the social networks proposed by Broniatowski and Christopher (2012) are applied to collaborative writing tasks, based on the topic membership of individual authors.

The following subsections describe the Diff Author-Topic Model and then the method used to construct topic-based collaboration networks.


8.4.1 Diff Author-Topic Model for writing processes

This thesis develops the Diff Author-Topic Model (DiffATM), which builds on a variant of LDA known as the author-topic (AT) model (Rosen-Zvi et al., 2003). The AT model adds probabilistic pressure to assign each author to a specific topic, so shared topics are more likely to represent common ideas and concepts. The DiffATM model provides an analysis that is guided by the authorship data of the documents (provided by revision histories) and by the word co-occurrence data used by DiffLDA. Each author is modelled as a multinomial distribution over a fixed number of topics, which is selected empirically as explained below. Each topic is, in turn, modelled as a multinomial distribution over words.

As described in Subsection 8.1, the Text Comparison Utility (TCU) outputs the delta documents (i.e. added and deleted paragraphs) and each revision is produced by one or more authors. The authors of a revision are assigned to the delta documents of that revision. The Author-Topic Model (ATM) is then applied to the entire set of delta documents.
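The assignment of revision authors to delta documents can be sketched as follows. This is a minimal illustration and not the TCU implementation: the `Revision` and `DeltaDocument` types, their field names, and the choice of one delta document per added or deleted paragraph are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Revision:
    # Hypothetical record of one revision taken from the revision history.
    revision_id: int
    authors: List[str]
    added_paragraphs: List[str]
    deleted_paragraphs: List[str]


@dataclass
class DeltaDocument:
    # Hypothetical delta document: one added or deleted paragraph.
    revision_id: int
    kind: str  # "added" or "deleted"
    text: str
    authors: List[str]


def build_delta_documents(revisions: List[Revision]) -> List[DeltaDocument]:
    """Give every delta document the authors of the revision that produced it."""
    deltas = []
    for rev in revisions:
        groups = (("added", rev.added_paragraphs),
                  ("deleted", rev.deleted_paragraphs))
        for kind, paragraphs in groups:
            for text in paragraphs:
                deltas.append(
                    DeltaDocument(rev.revision_id, kind, text, list(rev.authors))
                )
    return deltas


revisions = [
    Revision(1, ["a1"], ["intro text"], []),
    Revision(2, ["a1", "a2"], ["method text"], ["intro text"]),
]
delta_documents = build_delta_documents(revisions)
```

The resulting list is the corpus to which the author-topic model would then be applied, with each delta document carrying its own author list.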

As in DiffLDA, the hyper-parameters defining each Dirichlet prior (α and β) of DiffATM are dependent on the number of topics, which is selected independently for each document using the trade-off between the model fitting and the simplicity of the model structure as described in Subsection 8.3.4. The likelihood of two authors writing the same topic will depend on the hyper-parameters chosen (Broniatowski & Christopher, 2012). In general, larger values of α will lead to more topic overlap for any given corpus, motivating the use of a consistent hyper-parameter selection algorithm across all corpora analysed. All hyper-parameter settings used for the analyses presented in this thesis follow the guidelines derived empirically by Griffiths and Steyvers (2004). In particular, α = 50/(# topics), inducing topics that are mildly smoothed across authors, and β = 200/(# words), inducing topics that are specific to small numbers of words.
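As a worked example of the settings above, the short sketch below computes α and β for a given number of topics. It assumes that "# words" refers to the vocabulary size; the function name and the example values (10 topics, 4000 distinct words) are illustrative only.

```python
def dirichlet_hyperparameters(num_topics: int, vocabulary_size: int):
    """Return (alpha, beta) with alpha = 50 / T and beta = 200 / W."""
    if num_topics <= 0 or vocabulary_size <= 0:
        raise ValueError("num_topics and vocabulary_size must be positive")
    alpha = 50.0 / num_topics        # smoothing of topics across authors
    beta = 200.0 / vocabulary_size   # smoothing of words within topics
    return alpha, beta


# Illustrative values: 10 topics and a vocabulary of 4000 distinct words.
alpha, beta = dirichlet_hyperparameters(10, 4000)
```

With these illustrative values, α is 5 and β is 0.05. Because both values depend only on the number of topics and the vocabulary size, the same rule can be applied consistently to every corpus.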

Like DiffLDA, the DiffATM model is fit by using a Markov-chain Monte Carlo (MCMC) approach. Information about individual authors is included in the Bayesian inference mechanism, so that each word is assigned to a topic in proportion to the number of words by that author already in that topic, and in proportion to the number of times that specific word appears in that topic. Thus, if two authors use the same word in two different senses, the DiffATM model will account for this polysemy.


Details of the MCMC algorithm derivation are given in the paper by Rosen-Zvi et al. (2003).
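The sampling step described above can be illustrated with a small collapsed Gibbs sampler for an author-topic model. This is a simplified sketch in pure Python and not the implementation used in this thesis: the function name, the integer encoding of words and authors, and the toy corpus are assumptions. For each word token, an (author, topic) pair is drawn with weight proportional to how often that author already uses the topic and how often the word already appears in the topic.

```python
import random


def fit_author_topic(docs, doc_authors, num_authors, num_topics, vocab_size,
                     alpha, beta, sweeps=50, seed=0):
    """Return author-topic distributions estimated from the final Gibbs state."""
    rng = random.Random(seed)
    author_topic = [[0] * num_topics for _ in range(num_authors)]
    author_total = [0] * num_authors
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    def update(author, topic, word, delta):
        # Add or remove one token from all count tables.
        author_topic[author][topic] += delta
        author_total[author] += delta
        topic_word[topic][word] += delta
        topic_total[topic] += delta

    # Random initialisation: each token gets one of its document's authors.
    tokens = [(d, w) for d, words in enumerate(docs) for w in words]
    assignments = []
    for d, w in tokens:
        a = rng.choice(doc_authors[d])
        t = rng.randrange(num_topics)
        update(a, t, w, +1)
        assignments.append((a, t))

    for _ in range(sweeps):
        for i, (d, w) in enumerate(tokens):
            a_old, t_old = assignments[i]
            update(a_old, t_old, w, -1)  # exclude the current token
            candidates, weights = [], []
            for a in doc_authors[d]:
                for t in range(num_topics):
                    author_term = ((author_topic[a][t] + alpha)
                                   / (author_total[a] + num_topics * alpha))
                    word_term = ((topic_word[t][w] + beta)
                                 / (topic_total[t] + vocab_size * beta))
                    candidates.append((a, t))
                    weights.append(author_term * word_term)
            a_new, t_new = rng.choices(candidates, weights=weights, k=1)[0]
            update(a_new, t_new, w, +1)
            assignments[i] = (a_new, t_new)

    return [[(author_topic[a][t] + alpha)
             / (author_total[a] + num_topics * alpha)
             for t in range(num_topics)]
            for a in range(num_authors)]


# Toy corpus: three delta documents over a four-word vocabulary.
docs = [[0, 1, 0], [2, 3, 3], [0, 2]]
doc_authors = [[0], [1], [0, 1]]
theta = fit_author_topic(docs, doc_authors, num_authors=2, num_topics=2,
                         vocab_size=4, alpha=25.0, beta=50.0, sweeps=20)
```

Each row of `theta` is one author's distribution over topics, which is the quantity needed to build the collaboration network. A practical implementation would also handle burn-in and draw several samples per chain.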

After the number of topics, T, has been selected, a T-topic DiffATM model is fit to all delta documents. Ten samples are taken from each of 20 randomly initialised Markov chains, giving 200 samples in total. These samples are used to construct topic-based collaboration networks, as described below.

8.4.2 Construction of Networks from Topics

After an ATM has been fit, networks are constructed in order to analyse student collaboration, with particular interest in linking together two students who often use the same topics of discourse over the writing period. The method proposed by Broniatowski and Christopher (2012) is used, which computes the joint probability of each pair of authors writing about the same topic as:

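\[
P(a_i \cap a_j) = \sum_{t=1}^{T} P(Z = t \mid a_i)\, P(Z = t \mid a_j)
\]

where a_i and a_j are two authors, T is the number of topics, and P(Z = t | a) is the probability that author a writes about topic t.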
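As a small numerical illustration of this joint probability, the sketch below multiplies the two authors' topic distributions topic by topic and sums the products. The function name and the example distributions are assumptions chosen for illustration.

```python
def joint_topic_probability(theta_a, theta_b):
    """Sum over topics of P(topic | author a) * P(topic | author b)."""
    if len(theta_a) != len(theta_b):
        raise ValueError("both authors need a distribution over the same topics")
    return sum(p * q for p, q in zip(theta_a, theta_b))


# Illustrative topic distributions over T = 3 topics.
theta_a1 = [0.7, 0.2, 0.1]
theta_a2 = [0.6, 0.3, 0.1]
theta_a3 = [0.1, 0.1, 0.8]

p_a1_a2 = joint_topic_probability(theta_a1, theta_a2)  # 0.42 + 0.06 + 0.01
p_a1_a3 = joint_topic_probability(theta_a1, theta_a3)  # 0.07 + 0.02 + 0.08
```

Here a1 and a2 concentrate on the same topic and obtain a joint probability of about 0.49, whereas a1 and a3 concentrate on different topics and obtain about 0.17.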

A joint probability of two authors that exceeds 1/T (e.g. 0.1 if T = 10) is indicated by creating a link between the two nodes; the reason for choosing this condition is explained by Broniatowski and Christopher (2012). A square author-author matrix is constructed with entries equal to one for each linked author pair and entries equal to zero otherwise. This procedure is repeated several times for each document (Broniatowski & Christopher, 2012) to average across whatever probabilistic noise might exist in the DiffATM fit. Authors who link across multiple DiffATM fits more often than would be expected by chance are considered to be linked in the network for that document. In this thesis, the author-author matrix is accumulated over the 200 samples of DiffATM, and each author pair with an entry higher than 125 is considered linked. Five topic-based collaboration networks of four student groups are presented in Figure 8-6. The networks differ in their numbers of connections, which demonstrates that the dynamics of topic sharing during the writing process differ among groups.
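The network construction procedure described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the code used in this thesis: the function names are hypothetical, each sample is assumed to be a list of per-author topic distributions, and the diagonal of the matrix is left at zero.

```python
from itertools import combinations


def sample_adjacency(theta):
    """Author-author matrix for one sample: 1 if joint probability > 1/T."""
    num_authors = len(theta)
    num_topics = len(theta[0])
    adjacency = [[0] * num_authors for _ in range(num_authors)]
    for i, j in combinations(range(num_authors), 2):
        joint = sum(p * q for p, q in zip(theta[i], theta[j]))
        if joint > 1.0 / num_topics:
            adjacency[i][j] = adjacency[j][i] = 1
    return adjacency


def consensus_network(theta_samples, min_count=125):
    """Sum the per-sample matrices and keep pairs linked more than min_count times."""
    num_authors = len(theta_samples[0])
    counts = [[0] * num_authors for _ in range(num_authors)]
    for theta in theta_samples:
        adjacency = sample_adjacency(theta)
        for i in range(num_authors):
            for j in range(num_authors):
                counts[i][j] += adjacency[i][j]
    links = [(i, j) for i, j in combinations(range(num_authors), 2)
             if counts[i][j] > min_count]
    return counts, links


# Illustrative data: 200 samples for three authors and T = 2 topics.
overlapping = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]  # authors 0 and 1 overlap
uniform = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]      # no pair exceeds 1/T
samples = [overlapping] * 130 + [uniform] * 70
counts, links = consensus_network(samples)
```

In this example, authors 0 and 1 are linked in 130 of the 200 samples, which exceeds the threshold of 125, so they are the only pair connected in the resulting network.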


Figure 8-6. Topic-based collaboration networks of four different groups of students writing documents. Squares represent group coordinators. Circles are group members. Links between two nodes indicate that the two corresponding authors have written about the same topics.

In document A Data Mining Toolbox for Collaborative Writing Processes (Page 159-163)