4.3 Data collection: using tests
4.3.1 Rationale for test type
We have noted that the use of this kind of pre- and post-test is well established in experimental and quasi-experimental research designs of this nature because, provided the sample is representative and the groups are equivalent, it allows us to measure the effect of a key variable: the type of instruction given to each group. A free response test allowed us to gather quantitative data in the form of overall means and gain scores for usage, alongside global scores and scores for interactive ability and discourse management. These scores could then be tested for significance of variance. The data could thus act as one objective measure of the effectiveness of these types of instruction, which would act as a useful counterweight to the qualitative data collection and help us to find at least part of the answer to our research questions. We can imagine, for example, students stating in diaries and interviews that they prefer one type of instruction to another, and it is useful to establish whether such a preference is reflected in pre- and post-test gain scores. In this sense, the tests allowed for triangulation of the data.
There is, however, an obvious threat to internal validity in two key areas: ‘the practice effect’ and ‘participant desire to meet expectation’ (Dornyei 2007:53). As students become more familiar with the test format, their performance could improve as a result of this and not the type of experimental instruction they have received. This is a genuine threat to validity if it has any effect on the test results and we need to carefully explain how we guarded against this so that it did not have a significant impact on the test scores in this study.
If we wish to measure the gains in the use of DMs in a pre- and post-test, it is clear we need to use the same test format at least but it does not mean we need to use precisely the same topics or questions in each test, providing each test is equivalent. For this reason, a variant of the same test was used in each case so that the topics chosen differed between each test but the test followed exactly the same format and each test was equivalent. It was hoped this would militate against the practice effect to some extent as students would not be able to rehearse answers from test to test, although they would become more familiar with the test format.
The second threat to validity is that each experimental group may try to ‘exhibit a performance which is expected of them’ (Dornyei 2007:54). In other words, they may try to use DMs in the test in order to please the researcher or because that is what they believe they should do. To a certain extent, we could not remove this possibility but we could try to ensure it did not have a significant effect upon the test scores. Firstly, students were not told that the study was trying to measure use of DMs but were given a broad description of it, telling them it was aimed at helping them to improve their spoken language. Such a description does not constitute any lack of research ethics but does mean that learners would not be focused on using DMs in order to please the researcher. Secondly, they were not given any explicit instruction to use DMs in any of the tests. Thirdly, if students did try to use as many DMs as possible to please the researcher, this would not guarantee that the DMs were used correctly, i.e. with the functions taught. The tests were also viewed by a second researcher (a senior lecturer in ELT) when calculating which target DMs were used. Only DMs used with the correct function were included when calculating overall usage and gain scores. This meant that a learner who simply tried to use as many DMs as possible would not necessarily achieve a greater gain score.
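The counting rule described above can be illustrated with a minimal sketch. It assumes each use of a target DM in a test has already been judged by the raters as correct or incorrect in function; the names `DMUse`, `usage_score` and `gain_score`, and the sample data, are hypothetical and are not taken from the study's actual scoring procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DMUse:
    """One occurrence of a target DM in a test (hypothetical record)."""
    marker: str
    correct_function: bool  # rater judgement: used with a taught function?


def usage_score(uses):
    """Count only those DM uses judged to have the correct function."""
    return sum(1 for use in uses if use.correct_function)


def gain_score(pre_uses, post_uses):
    """Gain = post-test usage score minus pre-test usage score."""
    return usage_score(post_uses) - usage_score(pre_uses)


# Invented example: a learner who uses many DMs in the post-test,
# but only some of them with the correct function.
pre = [DMUse("so", True)]
post = [
    DMUse("so", True),
    DMUse("you know", False),
    DMUse("cos", True),
    DMUse("you know", False),
]

raw_increase = len(post) - len(pre)  # every use counted, correct or not
gain = gain_score(pre, post)         # only correct uses counted
```

In this invented case the raw number of DMs rises by three but the gain score rises by only one, which is the sense in which simply producing more DMs does not necessarily lead to a greater gain score.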
As we have noted, this type of study also frequently employs both an immediate and delayed post-test. The reason for this is clear. We wish to measure immediate gains from the experimental treatment and gains over time, something we were unable to do during the pilot study. This then allows us to analyse the results in terms of their effect on acquisition. Schmitt (2010:2), discussing studies of this type focused on vocabulary acquisition, suggests that a delayed post-test ‘shows durable learning’ and an immediate post-test shows ‘whether treatment had an effect’. This gives us a clear rationale for the use of an immediate and delayed post-test.
The question then turns to how we define ‘immediate’ and ‘delayed’. In their overview of FFI studies, Norris and Ortega (2001) show that, in studies of a similar design, definitions of these terms vary considerably, and unfortunately there is no real consensus regarding the optimal amount of time after a study to hold a delayed test (Schmitt 2010:156). For the purposes of this study, ‘immediate’ was taken to mean directly after the experimental instruction had finished, on the last day of teaching for each group. This meant that each group took an immediate post-test after ten hours of instruction and it was used in order to find out if the treatment had any effect.
A delayed post-test took place after eight weeks. This followed a suggestion made by Truscott (1998) that a delay of more than five weeks but less than one year may be enough to measure the longer-term effect of FFI upon acquisition. A delayed test held eight weeks after the immediate post-test fitted the timescale suggested by Truscott and was also a practical time limit because learners had not yet left the university and were still available.
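The relationship between the three test points can be sketched as follows. This is an illustration only: the scores are invented, the group labels are placeholders, and we assume here that both immediate and delayed gains are calculated relative to the pre-test.

```python
from statistics import mean

# Hypothetical usage scores per learner: (pre-test, immediate post-test,
# delayed post-test). The values and group labels are invented.
scores = {
    "group_1": [(1, 4, 3), (0, 2, 2)],
    "group_2": [(2, 3, 2), (1, 3, 1)],
}


def mean_gains(group_scores):
    """Return (mean immediate gain, mean delayed gain) for one group.

    Assumption: both gains are measured against the pre-test score.
    """
    immediate = mean(post - pre for pre, post, _ in group_scores)
    delayed = mean(late - pre for pre, _, late in group_scores)
    return immediate, delayed


summary = {group: mean_gains(data) for group, data in scores.items()}
```

A group whose mean delayed gain falls back towards zero, as in the second invented group, would show an effect of treatment on the immediate post-test but little evidence of durable learning eight weeks later.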
As we have noted above, there are several kinds of test available to us. It is worthwhile, then, discussing each type in turn and justifying our choice of a free response test. Metalinguistic judgement tests allow learners to demonstrate that they can observe the target language and judge correct usage. For us, this would mean asking learners to look at samples of the target DMs and to comment on correct or incorrect usage. This type of test seems largely to assess declarative knowledge and, whilst valid for this, does not allow us to assess how well the learners can actually produce the target DMs. Our research questions show we are interested in how two different types of teaching impact upon the acquisition of the target DMs and the ability to produce them is one clear way to measure this. In other words, we wanted to assess the procedural knowledge of the learners when using the target DMs and a metalinguistic judgement test would not allow us to do this.
Selected response tests also allow students to assess correct usage, but by looking at a choice of language samples. As with metalinguistic judgement tests, this type of test assesses declarative knowledge and as such did not serve our purpose.
Constrained constructed response tests allow learners to produce the target language in very controlled ways, for example by filling in gaps in sentences using the target language. This kind of test assesses declarative knowledge to a certain extent (learners need to analyse the correct form to use) and procedural knowledge (learners need to decide which to use in the context). The advantage of this type of test is that we can design it to focus very explicitly on the target forms and thus we can test only those forms in focus. The disadvantage is that it only tests procedural knowledge to a limited extent. It is clear that the ability to fill in the gaps with a target form, without time pressure and with the visual support of the written word, is not the same as being able to use a form in spontaneous speech. In fact, we can easily imagine a learner being able to do the first successfully but not the second. We have also noted that this study attempted to measure spoken DMs and as such, the appropriateness of a written test format is at least questionable.
Free constructed response tests allow learners to produce the target language in a much ‘freer’ format, through, for example, the use of role-plays. This kind of test aims to measure procedural knowledge by giving students the opportunity to use the target forms, but not ‘forcing’ them to do so. This study employed such a test because, as we have noted, we wished to measure the effect of two different types of instruction on the usage of the target DMs and such tests have been used successfully to demonstrate significant effects of FFI (Norris and Ortega 2000, 2001; N. Ellis 2007).
The problem with such a test is of course that students may not use the target forms at all. This may not mean they have not acquired them through the different types of instruction but that the test simply allows for avoidance of the target forms. There is no doubt this is a risk with this kind of test but, as we have noted, if we wish to measure spontaneous use of the target forms, then other test types do not serve our purpose. What we need to ensure is that the test chosen does not restrict the types of responses students can give, i.e. it does allow for free responses and gives learners the opportunity to use the target DMs.
The test chosen allowed for this in two ways. Firstly, it is an established test, in commercial use. This means it has been extensively piloted both in its design and choice of topics to ensure a good variety of interaction, between both the interlocutor and students and between students themselves. This is clearly reflected in the use of non-specialised topics of general interest and the three-part design, which allows for a variety of interaction and free responses. The full version of the tests can be found in appendix two. Table eighteen shows the stages of the free response speaking test used in the pilot and main study.
Table 18 Stages of the free response speaking test (pilot and main study)
Part 1 – Introductions
Interview to elicit personal information. Candidates respond to the interlocutor and not to each other. The interview consists of a number of short turns with candidates being invited to respond alternately. Part 1 lasts for 3 minutes divided equally between both candidates. In the event of three candidates, allow 4 minutes divided equally between all candidates.
Part 2 – Interactive discussion
Candidates discuss a topic based on two prompts provided by the interlocutor. They exchange ideas and opinions and sustain a discussion for four minutes. The interlocutor does not take part in the discussion. If candidates start to address the interlocutor directly, hand or other gestures should be used to indicate that the candidates should speak to each other.
Part 3 – Responding to questions
A three-way discussion between interlocutor and candidates based on the topic from Part 2 of the test. The interlocutor leads the discussion by selecting from the questions below. It is not necessary to use all the questions. The interlocutor may ask for a specific response from one candidate or throw the discussion open to both candidates. The interlocutor should encourage candidates to elaborate on, or react to, their partner’s response by verbal invitation (e.g. What do you think? Do you agree?) or non-verbal gesture. Candidates should be given equal opportunities to speak but the interlocutor may wish to give a candidate who has been rather reticent in earlier parts of the test a chance to redress the balance. This part of the test lasts about five minutes.
The marking scheme for the test (see appendix three) reflects the different opportunities for learners to display various facets of language competence. This is because it includes both a global marking system and bandings for grammar, vocabulary, pronunciation, discourse management and interactive ability. This ensures that learners who attempt to restrict their responses so that they are, for instance, always grammatically accurate are unlikely to score very highly. Secondly, by examining recordings of the test made with learners at the same level who had not been subject to any experimental instruction, we can see these learners do make use of several of the target DMs. Table nineteen shows the usage of the target DMs by two sets of learners at B2 level, studying at the university. The first two learners (students A and B) were Chinese and the second pair (students C and D) Japanese and Spanish. They therefore represent a realistic sample of the international student population in the context of our study. The recordings were made for marking and standardisation purposes and none of the students were given any explicit instruction in the use of the target DMs before the test.
Table 19 Sample test responses without teaching of target discourse markers
Function DM(s) Student A Student B Student C Student D
Opening So 1 1
Monitoring You know 2
Justifying Cos 2 2
Although the students’ use of DMs is limited, we can clearly argue that this does at least demonstrate that the test provided opportunities to use the target DMs.