Chapter 5 WriteProc: A Framework for Data Collection
5.2 Writing Environment: Google Docs
Earlier in this thesis, it was established that in order to explore the workings of the writing process, it is necessary to have a front-end writing tool that supports collaborative writing (CW) and stores all revisions and their metadata. For this reason, tools such as Microsoft Word or OpenOffice, which do not allow CW, were not viewed as viable choices for this work, although these tools do provide some
functionality to detect changes and who produced those changes. Web 2.0 tools such as Google Docs and the incipient Microsoft Word Live allow users to write on a web application, or to write offline and synchronise the material later, with the service provider storing the different versions of the document. For this reason, Google Docs was selected as the front-end writing tool for this work.
Google Docs is a web-based utility that provides most of the functionality needed for word processing and allows users to share documents with other team members and to write synchronously. Authors can access Google Docs through their web browsers from anywhere and at any time. Each author requires a Gmail account, which can be obtained from Google free of charge, in order to access the tool; this Gmail account is referred to as the author ID in the framework of this study.
The writing process begins with the creation of a particular document that is then assigned to a group of students by course administrators/lecturers. Students work on the documents by writing and editing, after which they submit a final version. As previously noted, the crucial aspect of this particular writing process is the fact that Google Docs stores all revisions and revision histories made from beginning to end.
Each document created in Google Docs is assigned a unique document identification number (i.e. document ID). Google Docs also keeps track of all versions by numbering each one incrementally (i.e. revision ID) every time the document is edited. Whenever an author edits a document, Google Docs stores the edited content text and keeps a record of the following information in the revision history:
• The version number (revision ID).
• The identification of the author (author ID).
• The timestamp (date and time) of the changes made.
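The information listed above can be pictured as one record per revision. The following is a minimal sketch only: the class name, field names, and sample values are hypothetical and chosen for illustration; they are not the data model used by Google Docs or by WriteProc.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Revision:
    """One entry of a document's revision history (illustrative field names)."""
    document_id: str     # unique ID assigned to the document
    revision_id: int     # version number, incremented with each stored version
    author_id: str       # the author's Gmail account
    timestamp: datetime  # date and time of the changes made
    content: str         # edited content text stored for this version


# Hypothetical sample data for a single document edited by two authors.
history = [
    Revision("doc-001", 1, "u1@gmail.com", datetime(2010, 3, 1, 9, 0), "Intro"),
    Revision("doc-001", 2, "u2@gmail.com", datetime(2010, 3, 1, 9, 5), "Intro and aims"),
]

# The most recent version is the one with the highest revision ID.
latest = max(history, key=lambda r: r.revision_id)
print(latest.author_id, latest.timestamp.isoformat())
```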
There are occasions when many authors edit the same content in a document at almost the same time. In this case, because changes are not transmitted over the Internet instantaneously, an author who makes a change temporarily creates a local version of the document that differs from the versions that other collaborators see. When this occurs, Google Docs implements a mechanism to make sure that all the text change operations eventually converge on the same correct version of the document¹.
Figure 5-2. The web-based interface of the revision history (on the right panel) of Google Docs, which shows a list of revisions. Each revision contains a timestamp (date and time) and an author ID (different colours for different authors).
Authors can also access the revision history of their documents by using the web-based interface via the command “see revision history” under the “file” main menu on Google Docs, as shown in Figure 5-2. Google Docs displays the web-based interface of the revision history on the right-hand panel, which includes a list of revisions containing timestamps and author IDs of the corresponding edits. Different authors are assigned different colours for identification. The web-based interface incorporates two types of revision history for each document, designated as more and less detailed revisions. For both types of revision history, an author can select a particular revision to view the edited content.
Since Google Docs automatically saves documents every few seconds even when no changes have occurred², there may be any number of revisions for any particular document.
1 Since 2011, Google Docs has used a new algorithm for merging changes called operational transformation. It also uses a collaboration protocol to make sure that each author knows when there are changes that need to be merged. Please see the three white papers “what’s different about new Google Docs” (Google Docs White Paper, 2010a, 2010b, 2010c) for a detailed explanation of operational transformation and the collaboration protocol (these two were originally developed as engines driving Google Wave).
2 This framework was implemented in 2010. In 2011, Google changed the auto-saving functionality so that Google Docs only auto-saves when there is a change in the content text of the documents.
As previously elaborated, this research uses an application programming interface (API) with the goal of automatically retrieving all authors' revisions and revision histories of documents. Before describing the API, however, it is important to first define some terminology used in this thesis.
A writing session, as defined for the purposes of this study, is composed of consecutive revisions that are made less than 30 minutes apart. A time threshold of 30 minutes was established to distinguish between writing sessions, as used in the data analysis for web usage mining (Markov & Larose, 2007). If two consecutive revisions show a timestamp difference of more than 30 minutes, the later revision becomes the first revision of a new writing session. Every author’s writing sessions are determined according to this 30-minute cut-off. It is considered that students perform their text edits continuously during a writing session. The inactive time that occurs when students pause to read the text written so far and to think about what they are going to write next should consist of a fairly short interval (pause). If this inactive time becomes longer, it is assumed that a new writing session follows.
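The 30-minute rule can be sketched as follows. This is an illustrative sketch rather than the WriteProc implementation: the function name and sample timestamps are hypothetical, the input is assumed to be the revision timestamps of a single author, and a gap of exactly 30 minutes (which the definition above leaves open) is assumed to stay within the same session.

```python
from datetime import datetime, timedelta

# Threshold separating two writing sessions, as defined in the text.
SESSION_GAP = timedelta(minutes=30)


def split_into_sessions(timestamps, gap=SESSION_GAP):
    """Group one author's revision timestamps into writing sessions."""
    sessions = []
    for ts in sorted(timestamps):
        # Open a new session for the first revision, or when the pause since
        # the previous revision is longer than the threshold.
        if not sessions or ts - sessions[-1][-1] > gap:
            sessions.append([ts])
        else:
            sessions[-1].append(ts)
    return sessions


# Hypothetical timestamps: three revisions in the morning, then a 65-minute
# pause, then two more revisions.
day = datetime(2010, 3, 1)
stamps = [day.replace(hour=9, minute=m) for m in (0, 10, 25)] + [
    day.replace(hour=10, minute=30),
    day.replace(hour=10, minute=45),
]
sessions = split_into_sessions(stamps)
print([len(s) for s in sessions])  # [3, 2]
```

Here the 65-minute pause between 09:25 and 10:30 exceeds the threshold, so the revision at 10:30 becomes the first revision of a second writing session.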
Since Google Docs automatically saves documents frequently, the resulting number of revisions is very high and must be reduced to a more manageable size. This is accomplished by grouping the revisions into major revisions, which are defined as the final revisions that end a writing session. All revisions within a major revision originate from the same author. In this thesis, the creation of major revisions is performed by WriteProc after retrieving all revisions. Figure 5-3 shows an example of revision history and major revisions.
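One way to reduce a revision history to major revisions is sketched below. It is an assumption-laden illustration, not the WriteProc code: the names `Revision` and `major_revisions` and the sample data are hypothetical, and the sketch assumes that a session ends either when the next revision is more than 30 minutes later or when the next revision comes from a different author (so that every major revision covers a single author, as stated above). Grouping each author's revisions separately before applying the threshold would be an equally plausible reading.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Revision(NamedTuple):
    revision_id: int
    author_id: str
    timestamp: datetime


def major_revisions(revisions, gap=timedelta(minutes=30)):
    """Return the final revision of each writing session."""
    ordered = sorted(revisions, key=lambda r: r.revision_id)
    majors = []
    # Pair every revision with its successor (None for the last one).
    for current, following in zip(ordered, ordered[1:] + [None]):
        ends_session = (
            following is None                                 # end of history
            or following.author_id != current.author_id       # assumed: author change
            or following.timestamp - current.timestamp > gap  # pause over 30 minutes
        )
        if ends_session:
            majors.append(current)
    return majors


# Hypothetical history: (revision ID, author, minutes after 09:00).
start = datetime(2010, 3, 1, 9, 0)
raw = [(1, "U1", 0), (2, "U1", 5), (3, "U1", 12), (4, "U2", 15),
       (5, "U2", 20), (6, "U2", 60), (7, "U2", 64)]
history = [Revision(i, a, start + timedelta(minutes=m)) for i, a, m in raw]

majors = major_revisions(history)
print([r.revision_id for r in majors])  # [3, 5, 7]
```

In this example, revision 3 closes the first session because the next revision is by another author, revision 5 closes the second because of a 40-minute pause, and revision 7 closes the last session.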
Figure 5-3. Revision history before 2011 showing 13 revisions (R1-R13) written by two authors (U1 and U2). Each revision has a timestamp associated with it. Σ and σ are time difference of
In 2011, Google changed the auto-saving functionality so that Google Docs auto-saves only when the content text of a document changes. In addition, as mentioned previously, Google Docs implements a mechanism to make sure that all the text change operations eventually converge on the same correct version of the document. As a result, the number of revisions has been reduced significantly. For a particular document, the revision history retrieved by the Google Documents List API (version 3.0) is the less detailed revision history shown in the web-based interface, as described above. For data collected since 2011, reducing the number of revisions is therefore no longer needed. This thesis treats the revisions of the less detailed revision history as major revisions, which have timestamps and author IDs associated with them. Although all revisions, including those shown in the more detailed revision history, can be downloaded, their timestamps and author IDs are not available. Figure 5-4 shows the revision history provided by the Documents List API since 2011.
Figure 5-4. Revision history since 2011. Only revisions displayed in the “less detailed” revision history have timestamps and user IDs.