
2.3 Test methods

2.3.1 Direct, indirect, semi-direct, analytic and holistic methods

A well-designed reading comprehension multiple-choice item can only be answered correctly or incorrectly. Tests may also contain open-ended questions, matching tasks or the like but, even so, if they are well designed and their key is clear, they are not difficult to mark. Such items are frequently used to assess receptive skills. For productive skills, a different type of assessment is needed.

Things are far more complex when it comes to analyzing spoken performance. The performance of candidates in productive skills is not simply correct or incorrect, and it frequently builds on multiple facets. Other powerful forces shape the final mark of candidates, such as rater cognition (see section 2.2.2) or the analytic scales, the rubrics which raters use to assess candidates’ performance. Rater cognition can be partly controlled by keeping raters well trained and standardized. Rubrics, on the other hand, offer a more stable standpoint because, once they are finished, they remain unchanged for very long periods of time.

But to define more exactly what rubrics are, we must first delve into the notions of direct and indirect assessment and then into the difference between holistic and analytic approaches.

Clark (1979:36) was the first to discuss, in 1975, a variety of techniques for measuring speaking ability, and he proposed the terms direct and indirect to distinguish two broad types of testing approaches. Four years later he added semi-direct methods to the classification. “Direct speaking tests were considered to include any and all procedures in which the examinee is asked to engage in face-to-face communicative exchange with one or more human interlocutors” (ibid.).

“Indirect speaking tests were considered to include both (1) those situations in which the examinee is not actually required to speak and (2) speech based on recorded or printed stimuli” (ibid.). Finally, the term semi-direct is used “to characterize those tests which, although eliciting active speech by the examinee, do so by means of tape recording, printed test booklets, or other ‘nonhuman’ elicitation procedures, rather than through face-to-face conversation with a live interlocutor” (ibid.). The tests that we are preparing our rubrics for are a hybrid of direct and semi-direct methods. They are direct in the sense that they require candidates to engage in real, face-to-face communication in two of their three parts. They are semi-direct in so far as candidates are also given visual prompts to elicit a sustained monologue from them. Candidates always take the oral test in groups of two and, exceptionally, in groups of three.

The tests whose outcome will be measured by our rubrics consist of three different parts and, basically, they follow the structure of a proficiency interview. In part 1 (direct) the rater asks candidates introductory questions to elicit personal information and to break the ice. In part 2 (semi-direct) candidates must describe pictures which normally present opposing views of the same matter (reading digital books vs. reading traditional books; studying alone vs. studying in a group, etc.). In part 3 (direct) candidates are also given visual prompts but, this time, with the objective of engaging them in conversation.
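
To keep that structure at hand, the sketch below restates it as a small Python data structure. This is purely a schematic summary of the three parts as characterized above; the field names and constants are our own and do not come from any official test specification.

```python
# Schematic summary of the three-part oral test described above.
# Field names and constants are illustrative, not taken from official specifications.
ORAL_TEST_PARTS = [
    {"part": 1, "method": "direct",
     "task": "introductory questions eliciting personal information (ice-breaking)"},
    {"part": 2, "method": "semi-direct",
     "task": "monologue describing pictures that present opposing views of the same matter"},
    {"part": 3, "method": "direct",
     "task": "conversation between candidates prompted by visual material"},
]

# Candidates take the test in groups of two and, exceptionally, in groups of three.
GROUP_SIZES = {"standard": 2, "exceptional": 3}
```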

The proficiency interview method for testing oral performance enjoys a very high degree of face validity, but it also has its downside. This “technique does not perfectly reflect real-life conversational settings”; it has difficulty in “eliciting certain fairly common language patterns typical of real-life conversation” and offers candidates little room “to demonstrate productive control of interrogative patterns unless the interviewer takes special pains to ‘turn the conversation around’ at one or more points during the interview” (Clark, 1979:38). Yet, “of the currently available testing procedures, the face-to-face interview appears to possess the greatest degree of validity as a measure of global speaking proficiency and is clearly superior in this regard to both the indirect (non-speaking) and semi-direct approaches” (ibid.).

As stated in the introduction of the present section,

[f]or both direct and semi-direct speaking tests, the reliability question is somewhat more complicated in that examinee performance must be evaluated by human judges rather than through such mechanical means as answer key stencils or computer scoring devices. Two distinct types of reliability enter the picture here: intra-rater reliability, which refers to the extent to which a given scorer is able to consistently assign the same scores to individual tests that he or she evaluates two or more times in succession; and inter-rater reliability, which refers to the extent to which two or more different raters assign the same scores to a given test performance.

Clark (1979:41)

Modern advances in testing have made analyses beyond inter-rater and intra-rater reliability possible. Both types of analysis assume that judgments are made with one well-calibrated tool and that raters are the only factor that can introduce randomness into the observations made. In other words, they assume that raters are the only source of variability. Nowadays we know that this is not so. There may be a great degree of inter-rater unreliability if the assessment criteria used are not easy to understand or if they leave too much room for interpretation. Many raters will recognize themselves in the words of Knoch (2009:12) below, which we already quoted in the introduction:

I often found that the descriptors provided me with very little guidance. On what basis was I meant to, for example, decide that a student uses cohesive devices ‘appropriately’ rather than ‘adequately’ or that the style of a writing script ‘is not appropriate to the task’ rather than displaying ‘no apparent understanding of style’? […] This lack of guidance by the rating scale often forced me to return to a more holistic form of marking where the choice of the different analytic categories was mostly informed by my first impression of a writing script […]. I often felt that this was not a legitimate way to rate and that important information might be lost in this process.

Although Knoch refers to writing rubrics, the same goes for speaking ones.
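
To make Clark’s two reliability notions concrete, the sketch below computes, on invented band scores, a consistency estimate (Pearson correlation) and an agreement estimate (Cohen’s kappa). Neither the statistics chosen nor the sample data come from the sources cited here; they are a minimal illustration of what intra-rater and inter-rater analyses typically quantify.

```python
# Toy intra- and inter-rater reliability estimates on invented 0-5 band scores.
import numpy as np

# One rater scoring the same ten performances on two occasions (intra-rater case).
rater_a_time1 = np.array([3, 4, 2, 5, 3, 4, 1, 3, 4, 2])
rater_a_time2 = np.array([3, 4, 3, 5, 3, 4, 2, 3, 4, 2])

# A second rater scoring the same ten performances (inter-rater case).
rater_b = np.array([3, 5, 2, 4, 3, 4, 1, 2, 4, 3])

def pearson(x, y):
    """Consistency: Pearson correlation between two score vectors."""
    return float(np.corrcoef(x, y)[0, 1])

def cohen_kappa(x, y, categories=range(6)):
    """Agreement: Cohen's kappa, correcting raw agreement for chance."""
    p_observed = float(np.mean(x == y))
    p_chance = sum(float(np.mean(x == c)) * float(np.mean(y == c)) for c in categories)
    return (p_observed - p_chance) / (1 - p_chance)

print("intra-rater consistency (r):", round(pearson(rater_a_time1, rater_a_time2), 2))
print("inter-rater consistency (r):", round(pearson(rater_a_time1, rater_b), 2))
print("inter-rater agreement (kappa):", round(cohen_kappa(rater_a_time1, rater_b), 2))
```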

Modern psychometric analyses have shown long-established rubrics to be faulty (Jansen et al., 2015), and that is the reason why rubrics themselves must also be (re)considered as objects of psychometric analysis.

Besides the consideration of direct, indirect and semi-direct methods and their implications for reliability, in the minds of test designers there is also a never-ending struggle between holistic and analytic methods. With holistic approaches we think we know what we are assessing, but we remain, happily or unhappily, uncertain about the accuracy or replicability of our assessment. With analytic approaches we tend to be sure enough of our measurement, but we may jeopardize our certainty as to what exactly we have measured.

The term holistic derives from the Greek word ὅλος (pronounced /'ɔlɔs/), which means “whole”. Holistic scoring thus requires raters to respond to oral performance as a whole and to base their score on a general impression of the candidate. This approach reflects the idea that oral performance is a single entity, which is best captured by a single score that integrates its inherent qualities (cf. Knoch, 2009:39). Davies et al. (1999:75) define holistic scoring as

[a] type of marking procedure which is common in communicative language testing whereby raters judge a stretch of discourse (spoken or written) impressionistically according to its overall properties rather than providing separate scores for particular features of the language produced (eg (sic) accuracy, lexical range) […]. A problem with holistic judgements, however, is that different raters may choose to focus on different aspects of the performance, leading potentially to poor reliability if only one rater is used. For the sake of reliability, therefore, test performance is normally judged by several raters and their judgements pooled. A further drawback of holistic scoring is that it does not allow detailed diagnostic information to be reported.

Jonsson and Svingby (2007:131-132), in turn, describe it as follows:

Two main categories of rubrics may be distinguished: holistic and analytical. In holistic scoring, the rater makes an overall judgement about the quality of performance, while in analytic scoring, the rater assigns a score to each of the dimensions being assessed in the task. Holistic scoring is usually used for large-scale assessment because it is assumed to be easy, cheap and accurate. Analytical scoring is useful in the classroom since the results can help teachers and students identify students’ strengths and learning needs. Furthermore, rubrics can be classified as task specific or generic.

In fact, it was the apparent lack of reliability mentioned above by Davies et al. (1999:75) that triggered the design of our rubrics. In our opinion, holistic scoring offers certain benefits in achievement or placement tests but, as the sole source of reference in proficiency tests, it leaves too much room for individual interpretation.

As will be shown in chapter 4, our rubrics, we felt, had to be analytic and as accurate as possible, well-balanced and cognitively friendly (so that they did not demand too much of raters’ working memory), and they had to leave little room for interpretation. This is easier said than done, but it was clear from the very beginning that our rubrics had to be analytic. Knoch (2009:40) defines analytic scoring applied to the assessment of writing as follows:

A common alternative to holistic scoring is analytic scoring. Analytic scoring makes use of separate scales, each assessing a different aspect of writing, for example vocabulary, content, grammar and organization. Sometimes scores are averaged so that the final score is more usable […]. A clear advantage of analytic scoring is that it protects raters from collapsing categories together as they have to assign separate scores for each category. Analytic scales help in the training of raters and in their standardization […] and are also more useful for ESL learners, as they often show a marked or uneven profile which a holistic rating scale cannot capture accurately.

Adapting the categories assessed, the same can be said about analytic scoring of oral performance. Generally speaking, trained raters feel more confident using analytic scales than holistic ones. There is the underlying belief that “(j)ust as a discrete-point test becomes more reliable when more items are added, a rating scale with multiple categories improves the reliability” (Knoch, 2009:40).
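
As a rough illustration of the two points just made (the averaging of separate category scores mentioned by Knoch, and the belief that more rating categories improve reliability), consider the sketch below. The category names and weights are invented, and the use of the Spearman-Brown prophecy formula to formalize the reliability claim is our own illustrative choice rather than something taken from the cited sources.

```python
# Illustrative analytic scoring sheet; category names and weights are invented.
CATEGORY_WEIGHTS = {
    "grammar": 0.25,
    "vocabulary": 0.25,
    "pronunciation": 0.25,
    "interaction": 0.25,
}

def analytic_score(category_scores):
    """Collapse per-category band scores (0-5) into one weighted final score."""
    return sum(CATEGORY_WEIGHTS[c] * s for c, s in category_scores.items())

def spearman_brown(single_reliability, n_parts):
    """Predicted reliability of a measure lengthened to n_parts parallel parts."""
    r = single_reliability
    return (n_parts * r) / (1 + (n_parts - 1) * r)

# A candidate's analytic profile collapsed into a final score.
profile = {"grammar": 3, "vocabulary": 4, "pronunciation": 3, "interaction": 4}
print(analytic_score(profile))            # 3.5

# Predicted gain from rating four parallel categories instead of one,
# assuming a single-category reliability of 0.60.
print(round(spearman_brown(0.60, 4), 2))  # 0.86
```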

We did not consider the possibility of using primary trait scales because this type of rubric is defined with respect to the specific task to be judged and to the degree of success in it (Weigle, 2000:110). In other words, primary trait scales are so task-specific that it would have been impossible to create a suitable rubric with such a design for the context described in sections 3.3 and 4.1.