The first step in developing and evaluating the metadiscourse tagging approach for academic lectures is to select a source of data appropriate for lecture discourse analysis. One of the goals of this thesis is to facilitate access to the OCW lecture platforms, which are freely available online and permit the use of high-quality educational materials, organised as courses, under Creative Commons licences². This goal matters because creating and preparing OCW online courses requires substantial initial and ongoing investment of human labour. Compared with other online courses, such as massive open online courses (MOOCs), OCW content is less structured, so an automatic process for organising the materials would aid the learning process. For these reasons, we restrict the analysis to OCW sources that meet the following criteria:
1. they offer related lectures (i.e. lecture courses) on a wide range of topics in two different disciplines, delivered by different speakers within each discipline, so that the dataset is representative;
2. they provide audio material, which is used to develop the ASR system that generates automatic transcriptions;
3. they provide reference transcriptions, which are needed for the annotation task;
4. they provide segment boundaries that represent the discourse structure of each lecture, which are needed in the application task to validate the proposed metadiscourse tagging approach.
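The four criteria can be read as a simple conjunctive filter over candidate platforms. The following is a minimal sketch of that reading, not the procedure actually used in this thesis: the `OcwSource` record, its field names, and the placeholder platforms are all hypothetical and serve only to illustrate the selection logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcwSource:
    """Hypothetical record of what one OCW platform offers."""
    name: str
    covers_both_disciplines: bool    # criterion 1: related courses in both disciplines
    has_audio: bool                  # criterion 2: audio for ASR development
    has_reference_transcripts: bool  # criterion 3: transcripts for annotation
    has_segment_boundaries: bool     # criterion 4: discourse segment boundaries


def meets_criteria(source: OcwSource) -> bool:
    """A source qualifies only if all four criteria hold."""
    return (
        source.covers_both_disciplines
        and source.has_audio
        and source.has_reference_transcripts
        and source.has_segment_boundaries
    )


# Placeholder entries for illustration only; they do not describe real platforms.
candidates = [
    OcwSource("platform_a", True, True, True, True),
    OcwSource("platform_b", True, True, True, False),
    OcwSource("platform_c", False, True, False, False),
]

selected = [s.name for s in candidates if meets_criteria(s)]
print(selected)
```

Because the criteria are combined with a logical AND, a platform that lacks any single resource, such as segment boundaries, is excluded regardless of how well it satisfies the others.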
There are many OCW sources of spoken discourse from different universities, such as MIT OCW³ at the Massachusetts Institute of Technology, Open Yale Courses⁴ at Yale University, UCI OCW⁵ at the University of California, Irvine, and Stanford OCW⁶. However, not all of them fulfil the aforementioned criteria. A comparison of these resources led us to choose MIT OCW and Open Yale Courses over the other universities’ platforms. Firstly, they are the only platforms that provide gold-standard discourse segment boundaries, which will be used to train and test the application task in this thesis. Secondly, both are known to provide high-quality recordings made under the same settings across all lectures, which is beneficial in developing the ASR system. This contrasts with other OCW platforms, whose recording conditions vary, making them harder to process automatically.
² https://creativecommons.org/licenses/by-nc-sa/3.0/us/
³ http://ocw.mit.edu/index.htm
⁴ http://oyc.yale.edu
⁵ http://ocw.uci.edu
⁶ http://online.stanford.edu/courses

A further decision concerned the disciplines to include in the final dataset. Lecture courses from two different disciplines, Physics and Economics, were chosen. This decision was mainly based on the availability of similar introductory courses on both platforms, MIT OCW and Open Yale Courses. For example, the Physics course from MIT OCW is called “Classical Mechanics” and the one from Open Yale Courses is called “Fundamentals of Physics”. These courses cover approximately the same scientific material but are taught by different lecturers at different institutions. Another reason for choosing these disciplines is to enable an investigation of whether the detection of metadiscourse acts differs between Natural Science and Social Science lectures. In total, 106 OCW lectures were collected from the two disciplines (two courses for each); the following sections provide more detail about the chosen courses.

                                    Physics   Economics   Overall
# Lectures                               57          49       106
Average # of segments per lecture         6           7       6.5
# Segments                              395         354       749
# Tokens                            4004990     3894639   7899629
# Words                               11309       14280     25589
# Utterances                          32903       30756     63659

Table 2.4: Lecture Corpus Statistics. The first column shows the statistics for the collection of Physics lectures, in terms of the number of lectures, the average number of thematic segments per lecture, the number of thematic segments, and the numbers of tokens, words and utterances, respectively. The second column presents the same statistics for the set of Economics lectures. The last column presents the overall statistics across both disciplines.
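Counts of the kind reported in Table 2.4 can be derived directly from segmented transcripts. The sketch below is illustrative only and rests on assumptions that the text does not state: that "Tokens" are running tokens, that "Words" are distinct word types, and that lower-cased whitespace splitting is an adequate tokeniser. The nested-list input format and the function name `corpus_statistics` are likewise hypothetical.

```python
def corpus_statistics(lectures):
    """Compute corpus counts from nested transcripts.

    `lectures` is a list of lectures; each lecture is a list of thematic
    segments; each segment is a list of utterance strings (assumed format).
    """
    n_lectures = len(lectures)
    n_segments = sum(len(lecture) for lecture in lectures)
    utterances = [u for lecture in lectures for segment in lecture for u in segment]
    # Simplifying assumption: lower-cased whitespace tokenisation.
    tokens = [t for u in utterances for t in u.lower().split()]
    return {
        "lectures": n_lectures,
        "segments": n_segments,
        "avg_segments_per_lecture": n_segments / n_lectures if n_lectures else 0.0,
        "tokens": len(tokens),        # running tokens
        "words": len(set(tokens)),    # distinct word types (assumed reading of "Words")
        "utterances": len(utterances),
    }


# Toy input, not taken from the corpus.
toy_lectures = [
    [["today we study motion", "motion is change"], ["next topic is force"]],
    [["force equals mass times acceleration"]],
]
stats = corpus_statistics(toy_lectures)
print(stats)
```

Under these assumptions, the per-discipline columns would be produced by calling the function on the Physics and Economics lectures separately.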
Physics Lecture Dataset
The Physics dataset consists of spoken lecture transcripts taken from undergraduate introductory Physics classes. Two Physics courses are included, one from MIT OCW and the other from Open Yale Courses. In contrast to the Economics lecture dataset, this corpus contains much longer texts and consists of 57 lectures. A typical 75-minute lecture has 500–600 sentences and up to 8500 words, which corresponds to about 15 pages of raw text. Table 2.4 shows further statistics for the Physics corpus.
As stated above, this corpus also contains annotations of thematic segment boundaries, together with segment labels. The thematic segments are, in fact, a multi-dimensional, heterogeneous collection of pragmatic- and semantic-oriented text units. The segmentations were produced by the teaching staff of the Physics courses at MIT and Yale. As stated earlier, the purpose of these annotations was to facilitate access to the lecture recordings available on the class website under the OCW initiative. On average, a lecture was annotated with six segments, with a typical segment corresponding to two pages of transcript. These segment boundaries are required later to investigate whether metadiscourse tags are indicative of high-level discourse structure (thematic boundaries), as will be demonstrated in Chapter 7.
Economics Lecture Dataset
The second lecture dataset differs in both subject matter and lecturing style. It comprises two Economics courses taken from MIT OCW and Open Yale Courses. The undergraduate introductory Economics corpus has, in total, 49 lectures of 75 minutes each, with 650–800 sentences per lecture on average, which corresponds to roughly 8500 words. Further statistics for the Economics lectures are also presented in Table 2.4.
As with the Physics lectures, the thematic segmentation boundaries were obtained from the course website, and again the purpose of these annotations is to facilitate access to OCW resources. On average, an Economics lecture was annotated with seven segments, with a typical segment corresponding to two pages of transcript. As noted for the Physics lectures, the thematic segments are a heterogeneous collection of pragmatic- and semantic-oriented units. The thematic boundary annotations are used as the gold standard for the application task (thematic segmentation), which is presented in Chapter 7.
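One common way to use such boundary annotations as a gold standard is to convert them into one binary label per utterance. The sketch below illustrates this under stated assumptions: it is not the representation used in this thesis, the function name `boundary_labels` is hypothetical, and the convention that a label of 1 marks the first utterance of each new segment (other than the first) is chosen only for illustration.

```python
def boundary_labels(segment_lengths):
    """Turn per-segment utterance counts into per-utterance boundary labels.

    A label of 1 marks the first utterance of every segment except the first
    (assumed convention); all other utterances are labelled 0.
    """
    labels = []
    for index, length in enumerate(segment_lengths):
        if length <= 0:
            raise ValueError("each segment must contain at least one utterance")
        first = 1 if index > 0 else 0
        labels.extend([first] + [0] * (length - 1))
    return labels


# Toy lecture with three segments of 3, 2 and 4 utterances.
labels = boundary_labels([3, 2, 4])
print(labels)  # [0, 0, 0, 1, 0, 1, 0, 0, 0]
```

With this convention, a lecture annotated with seven segments yields six boundary labels, which a segmentation system's predictions can then be compared against.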