

4.2 Introduction to the data sections

4.2.1 Project Description: GeWiss Project

This section introduces the background, aims, objectives and funding of the project from which the data for this PhD study was taken. The project name GeWiss is an acronym that stands for ‘gesprochene Wissenschaftssprache’ (spoken academic language). The project was funded by the Volkswagen Foundation and ran from January 2010 until September 2012.17

The GeWiss website18 explains the aims and objectives of the GeWiss project, namely identifying key practices in spoken academic discourse on a contrastive dimension. Researchers at Aston University focus on analysing specialist presentations in German and English with special reference to humour and metaphors.

Besides audio and video recordings, an extensive set of meta data has been collected. The meta data categories that have been used for the project corpus are available in the appendix, in 9.4. Potential new projects with the aim of either analysing or expanding the GeWiss corpus have been discussed in Schmidt and Wörner (2012).

The GeWiss project collected its data following the criterion of authenticity, so this concept is briefly introduced here. Authenticity of data is a major factor for linguistic research in achieving credibility and in determining to what extent research results can be applied. A common position across a wide range of approaches is inductivism, which means that research results and categories should emerge from the data as opposed to, for example, invented examples. The notion of inductivism is also compatible with the notion of authenticity, which aims to keep researcher bias and influence to a minimum. Authentic data for research can be defined as data that has not been generated specifically for the purpose of a study, as discussed by e.g. Weijenberg (1980). Weijenberg argues that service encounters, such as selling food items in a store, would be most authentic if recorded by a hidden recorder because there would be no interference by researchers or other staff. While this idea is interesting as a thought experiment that helps to illustrate how authenticity is maximised by minimising interference in actual interaction, it is not feasible to conduct covert recordings in reality. Ethics has to be taken into consideration, which is why the closest approximation to the imagined covert recording of actual interaction is recording real speech events in their natural settings after having informed the actants and obtained their consent. This is what was done in the GeWiss project. To comply with data protection and research ethics, all main actants (speakers in talks, examiners, lecturers, exam candidates) had to sign consent forms prior to the event that was recorded.

17 http://www1.aston.ac.uk/lss/research/research-projects/gewiss-spoken-academic-discourse/ (14/04/13) as well as in German on https://gewiss.uni-leipzig.de/de/projektbeschreibung.html (14/04/13).

18 http://www1.aston.ac.uk/lss/research/research-projects/gewiss-spoken-academic-

In the following paragraphs, authenticity is discussed in connection with data analysis, namely which notions of ‘authentic’ texts help to establish the specific advantages of combining corpus linguistic (quantitative) and discourse analytic (qualitative) approaches. Which criteria does textual material have to fulfil in order to qualify as ‘authentic’ from the perspective of corpus linguistics? Virtanen (2009: 1056) discusses the notion of ‘raw data’ in the corpus and states that this is an ‘illusion’ because, particularly in the area of spoken corpora, there are too many steps between data collection and analysis for the data still to be called ‘raw’ at all (cf. ibid.). The specific problem of authenticity in a spoken corpus is that the data in the form of transcripts has gone through a process of interpretation, namely transcription. One idea that comes close to a solution (cf. ibid. p. 1057) is to include audio and, if possible, video recordings with a spoken corpus so that they can be compared with the transcriptions.

The discussion of authenticity in corpus linguistics continues with the aspects that corpus linguistics (CL) and discourse analysis (DA) approaches share. The first aspects to consider are ethical issues alongside some others (ibid. p. 1065):

“There are ethical issues that are common to both areas. Data collection involves a balance between, on the one hand, authenticity, naturalness and representativeness, and on the other hand, ethics, metalinguistic awareness and availability. In this respect corpus design can profit from the experience of linguists of different orientations.”

So both CL and DA share ethical issues and a desire for naturalness, authenticity and representativeness, while it is pointed out that corpus design can profit from DA linguistics, particularly with reference to the notion of ‘context’. The DA notion of context differs strongly from CL views. For DA, it is important to have the original context, e.g. the newspaper with its surrounding articles. If an analyst finds a newspaper article in the new context of a corpus, then this context differs from the original one. Such differences are important to discourse researchers, as the environment of an article, its setting so to speak, might play a role in the analysis. Similar aspects are also raised in Stubbs (2001) with reference to critical discourse analysis (CDA), as well as in Biber (1993), (1994), Biber and Conrad (2001), Biber and Jones (2009), Schiffrin et al. (2003), Scherer (2000), Hunston (2002) and O'Keeffe and McCarthy (2010). Wichmann (2008) focuses particularly on spoken corpus design and gives an overview of existing spoken corpora.

Decisions on spoken corpus design also include decisions on segmentation. In general, segmentation is the process of dividing discourse into units that are smaller than the whole of the given material. For written texts, these units are paragraphs, sentences and clauses, see also Himmelmann (2006). Even written texts can be segmented into utterances, as Stoll (1998) showed. For spoken discourse, another layer of complexity emerges: spoken discourse has to be transferred from audio or video recordings into text. This process is called transcription, and its results are called transcriptions or transcripts. The act of segmentation can begin during transcription, as is the case with the GAT and GAT 2 conventions. GAT stands for gesprächsanalytisches Transkriptionssystem (conversation-analytic transcription system), see Selting et al. (1998) and Selting et al. (2009). The basics of a GAT 2 transcription are explained with the help of the following figure, a screenshot of a transcript opened in the Partitureditor, part of the EXMARaLDA (Extensible Markup Language for Discourse Annotation)19 package.

Figure 2: Extract from a transcript, part of the English sub corpus

Now, with the help of Figure 2,20 segmentation will be defined for the specific context of a transcript created with the GAT 2 conventions. In this transcript, two types of segments can be identified. The first, more technical type of segment appears in the top grey line of the figure and carries a number (e.g. 2, 3 etc.). The length of this segment is 2 seconds by default, but it can be made longer at the discretion of the transcriber. The other type of segmentation, which also happens during transcription, is deeply rooted in the GAT 2 conventions, namely intonation units. As Selting et al. (2009) emphasise, the form of transcript displayed in Figure 2 represents “an iconic reflection of the temporal sequence of events in real time”. The latter refers to what is being said in the communicative event, in this case a research talk. As part of this linear sequence of transcribed spoken discourse, one can see intonation units, for example the name at the beginning of Figure 2, which is followed by a pause and then by the rest of the transcript. A second approach to segmentation stems from the theory of funktionale Pragmatik (functional pragmatics), see Rehbein (1977), (1995), (2001) and Fiehler (2004). As Rehbein (2001) emphasises, the transcription and analysis of the data is not a decisionistic one-to-one transliteration of the spoken discourse into a priori categories, but a process that consists of several stages (cf. ibid. p. 927). Segmentation is already done during transcription, and its nature is two-fold. First, the transcription conventions allow the use of punctuation, which already constitutes a segmentation into clauses and sentences, following the conventions of written discourse. Then, the segmentation is continued by dividing the data into so-called sprachliche Prozeduren (linguistic procedures), see Rehbein (1995). A linguistic procedure is smaller than a speech act, which in turn is smaller than the discourse, the whole of the material (cf. Rehbein (1995) and (2001)).

19 www.exmaralda.org/ (16/02/13).

The transcription conventions endorsed by Rehbein (see references above) are called HIAT (Halbinterpretative Arbeitstranskription). HIAT transcriptions are also orthographic (unless the research is about phonetics, in which case there would be an additional phonetic transcription), and the ‘partitur’ transcription shows the parallel actions of verbal and nonverbal communication. Other aspects, such as phonetics and intonation, can easily be added (cf. ibid. p. 930). The main advantage of such transcription conventions is that they show simultaneous events of speaker and hearer. Rehbein (ibid.) even maintains that the partitur notation combined with the HIAT conventions is the only system with this advantage. However, this is not the case: the partitur notation using e.g. the GAT 2 conventions has the same advantage of showing the simultaneity of communicative events of speaker and hearer while also allowing the annotation of any aspects a researcher might be interested in. The publication dates show that HIAT is the older system and that Rehbein had the idea of using a partitur-like notation first; the GAT systems were developed after that. One major criticism by Rehbein (cf. ibid.) of conversation analysis remains true: among other typical simultaneous events in discourse, conversation analysis attempts to force overlaps into an unsystematic line-by-line notation.


Now that two transcription and segmentation conventions (GAT 2 and HIAT) have been compared, the GeWiss corpus will be introduced, followed by more detailed remarks on how the GAT 2 conventions were adapted for the GeWiss project.

An overview of the whole GeWiss corpus is presented here. The GeWiss corpus consists of three main genres: specialist presentations, student presentations and oral examinations. Only the specialist presentations are discussed in detail here because the other genres are not used in this study. More information on the corpus as a whole, including the genres that are not discussed here, is available online in the corpus handbook.21 Key figures are summarised in Table 3 below (see also pp. 4f. in the handbook, see footnote 21):

Size of the whole GeWiss corpus with all genres in all three languages (English, German and Polish): 1,273,529 tokens
Total duration of the corpus recordings: 126:05h
Total number of communications (genres): 371, subdivided into:
    Specialist presentations: 58
    Student presentations: 89
    Oral examinations: 224
Total number of main speakers (see footnote 22): 462
Genders in the corpus:
    Female: 330
    Male: 132

Table 3: Overview of the whole GeWiss corpus

Size: 44,316
Duration: 5:06h
Number of specialist presentations: 5
Number of speakers: 6
    Female: 2
    Male: 4

Table 4: Overview of the English L1 speakers' sub corpus (GeWiss)

21 https://gewiss.uni-leipzig.de/open.php?url=Handbuch.pdf (18/04/13). Registration is required in order to access the resources, which include the GeWiss corpus.

22 Main speakers refer to the main actants in the respective genres i.e. speakers in a talk,


A comparison of the durations in hours (h) shows that the sub corpus of all talks held by L1 speakers of English amounts to about four per cent of the whole GeWiss corpus. The data used in this study (see 4.2.3 below) is a subset of the corpus outlined in Table 4.
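The proportion stated above can be checked with a few lines of arithmetic. The following is only a minimal sketch: the figures are copied from Tables 3 and 4, and the helper name hhmm_to_hours is an illustrative assumption, not part of any GeWiss tooling.

```python
# Figures are copied from Tables 3 and 4; the helper name is illustrative only.

def hhmm_to_hours(duration: str) -> float:
    """Convert an 'H:MM' duration string such as '126:05' to decimal hours."""
    hours, minutes = duration.split(":")
    return int(hours) + int(minutes) / 60

whole_corpus_hours = hhmm_to_hours("126:05")  # Table 3, total duration
english_l1_hours = hhmm_to_hours("5:06")      # Table 4, English L1 sub corpus

# Share of the English L1 sub corpus in the whole corpus, by duration.
share = english_l1_hours / whole_corpus_hours * 100
print(f"Share by duration: {share:.1f} %")    # prints "Share by duration: 4.0 %"

# Consistency checks on the subtotals reported in Table 3.
assert 58 + 89 + 224 == 371   # communications by genre
assert 330 + 132 == 462       # main speakers by gender
```

The subtotals in Table 3 add up to the reported totals, and the duration ratio rounds to the four per cent given in the text.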

All data was transcribed with GAT 2, the second version of the GAT transcription conventions, see Selting et al. (2009), which enable and facilitate the analysis of the corpus from a conversation-analytic perspective as well as from others. GAT stands for ‘gesprächsanalytisches Transkriptionssystem’, which means discourse and conversation-analytic transcription system (cf. ibid.). The advantages of GAT 2 are that it is

• claimed to be usable in Conversation Analysis, Discourse Analysis and Interactional Linguistics because orthographic conventions are followed, which maximises readability

• easily accessible to novice transcribers for the same reason

• structured into different levels: GAT 2 offers a simple initial level of transcription, the minimal transcript, which mainly notates the wording of discourse but can be expanded to more detailed levels later (all cf. ibid. p. 3).23

The transcription conventions are based on Selting et al. (2009) and have two main functions. First, they represent the exact wording of what is said in the spoken events that form the basis of the transcripts. This includes deletions, cliticisations, regionalisms, compound nouns, abbreviations and numbers. Second, the notation includes markers of hesitation and non-verbal reactions to the discourse, such as pauses, laughter, breathing in and out, unintelligible passages, and non-verbal events or actions.

The adaptations of the transcription conventions to English follow the conventions in Selting et al. (2009) and Selting et al. (2011). Some adaptations were specific to the GeWiss project24 and concern the following phenomena. For easier searchability of the corpus, short forms and clippings were standardised, e.g. cos, cuz, cus, cause for because were all transcribed as cause. For cliticisations, apostrophes were substituted by underscores (_): i_m, we_ll, don_t. Less conventional forms that were strongly influenced by processes of assimilation and reduction were noted as one word, e.g. wanna (want to), gimme (give me). The orthographically correct spelling, e.g. want to, was put into the comment tier of the transcript.

23 The more detailed levels of transcription using GAT 2 will not be discussed here, as they have no relevance for this study.

24 These and the following paragraphs about transcription conventions are based on the GeWiss corpus handbook, which is available from https://gewiss.uni-leipzig.de/open.php?url=Handbuch.pdf after a free registration at http://gewiss.uni-leipzig.de.

Compound nouns were spelt according to the Oxford English Dictionary. When something was spelt out by a speaker, this was realised in the transcript by noting down the individual letters, separated by spaces, e.g. m a. This differs from the German transcripts, where it is realised by a syllabic spelling of the letters. In the comment tier, the pronunciation is given in a syllabic notation, e.g. emm ay for M.A., and the acronym is explained, e.g. Master of Arts.

Hesitation markers are reduced to as few variants as possible: er, erm, um; hm_hm, hm, yeah, no. Any hesitation markers that strongly deviate from these forms are noted in the form in which they are realised: yep, nope.
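The GeWiss-specific adaptations for short forms, clitics and reduced forms described above can be illustrated with a small normalisation routine. This is a hedged sketch only, not the project's actual tooling: the function names, the lowercasing of tokens, the whitespace tokenisation and the format of the comment-tier note are all assumptions made for the illustration. Only the example forms (cos, cuz, cus, wanna, gimme, i_m, we_ll, don_t) come from the text.

```python
import re
from typing import List, Optional, Tuple

# Short forms of "because" that were standardised to "cause" (examples from the text).
SHORT_FORMS = {"cos": "cause", "cuz": "cause", "cus": "cause"}

# Reduced forms kept as one word; the full spelling goes to the comment tier.
REDUCED_FORMS = {"wanna": "want to", "gimme": "give me"}

# Straight and typographic apostrophes, as used in clitic forms such as "I'm".
APOSTROPHES = re.compile("['\u2019]")

def normalise_token(token: str) -> Tuple[str, Optional[str]]:
    """Return (transcribed form, full spelling for the comment tier or None)."""
    lowered = token.lower()  # assumption: transcripts are kept in lower case
    if lowered in SHORT_FORMS:
        return SHORT_FORMS[lowered], None
    if lowered in REDUCED_FORMS:
        return lowered, REDUCED_FORMS[lowered]
    # Clitic apostrophes are replaced by underscores: i'm -> i_m.
    return APOSTROPHES.sub("_", lowered), None

def normalise_utterance(text: str) -> Tuple[str, List[str]]:
    """Normalise a whitespace-separated utterance and collect comment-tier notes."""
    forms, comments = [], []
    for token in text.split():
        form, full_spelling = normalise_token(token)
        forms.append(form)
        if full_spelling is not None:
            comments.append(f"{form} = {full_spelling}")
    return " ".join(forms), comments

line, notes = normalise_utterance("I'm staying cuz we'll wanna talk")
print(line)   # i_m staying cause we_ll wanna talk
print(notes)  # ['wanna = want to']
```

Mapping every variant to a single transcribed form is what makes the corpus easier to search, since one query for cause then retrieves all realisations of the short form.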

Other potential research directions that the GeWiss corpus offers lie mainly on the contrastive dimension. The corpus makes it possible to compare German language use and cultural factors in academic settings across Germany, the UK and Poland. Furthermore, the contrastive dimension can also be between L1s and L2s, so that questions such as how German as an L1 and as an L2 differ can be explored, as well as the same question for English L1 and L2 speakers. Finally, the teaching of German for academic purposes (GAP) can be enhanced using an empirical basis like the GeWiss corpus, which has so far been missing and hence constituted a real gap in research (more information in footnote 17).

4.2.2 Data collection, post-processing, meta data and