The Five Principles of Language Assessment

Teachers need to consider five principles of language assessment when they create assessments (Brown & Abeywickrama, 2010). These principles, which are all of equal importance, may be used to evaluate a designed assessment:

1. PRACTICALITY

Practicality refers to evaluating the assessment according to cost, time needed, and usefulness. This principle is important for classroom teachers.

An effective test is practical. This means that it:

 is not excessively expensive. A test that is prohibitively expensive is impractical.

 stays within appropriate time constraints. A test of language proficiency that takes a student 10 hours to complete is impractical.

 is relatively easy to administer. A test that takes a few minutes for a student to take and several hours for an examiner to evaluate is impractical for most classroom situations.

 has a scoring/evaluation procedure that is specific and time-efficient. A test that can be scored only by computer is impractical if the test takes place a thousand miles away from the nearest computer.

(Brown, 2004).

In addition, Brown and Abeywickrama (2010) have explained the attributes of practical tests as follows. A practical test:

 stays within budgetary limits

 can be completed by the test-taker within appropriate time constraints

 has clear directions for administration

 appropriately utilizes available human resources

 does not exceed available material resources

 considers the time and effort involved for both design and scoring (Brown & Abeywickrama, 2010).

Furthermore, for a test to be practical:

 administrative details should be clearly established before the test,

 students should be able to complete the test reasonably within the set time frame,

 all materials and equipment should be ready,

 the cost of the test should be within budgeted limits.


2. RELIABILITY

Reliability means that the assessment is consistent and dependable (Brown & Abeywickrama, 2010): the same score will be achieved by the same type of students no matter when the test is scored or who scores it. Brown and Abeywickrama (2010) have summarized the features of this principle as follows. A reliable test:

 is consistent in its conditions across two or more administrations

 gives clear directions for scoring/evaluation

 has uniform rubrics for scoring/evaluation

 lends itself to consistent application of those rubrics by the scorer

 contains items/tasks that are unambiguous to the test-taker (Brown & Abeywickrama, 2010).

To make a test reliable, especially a subjective and open-ended assessment, it is important to write scoring procedures clearly and to train teachers to score the assessment correctly (Linville, 2011, Unit 2, p. 11).
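One way to make scoring procedures explicit is to write the rubric down as data, so that every rater applies the same bands. The sketch below (Python; the categories, descriptors, and band values are invented for illustration, not taken from the sources cited) shows a minimal analytic rubric and a scoring helper:

```python
# A hypothetical analytic rubric for a short writing task. Writing the
# bands down explicitly helps raters apply them uniformly.
RUBRIC = {
    "content":      {3: "fully addresses the prompt",
                     2: "partly addresses the prompt",
                     1: "off topic"},
    "organization": {3: "clear paragraph structure",
                     2: "some structure, weak transitions",
                     1: "no discernible structure"},
    "language":     {3: "few errors, wide vocabulary",
                     2: "errors rarely block meaning",
                     1: "errors often block meaning"},
}

def total_score(ratings: dict[str, int]) -> int:
    """Sum one rater's band choices, checking each is a defined band."""
    for category, band in ratings.items():
        if band not in RUBRIC[category]:
            raise ValueError(f"{band} is not a defined band for {category}")
    return sum(ratings.values())

print(total_score({"content": 3, "organization": 2, "language": 2}))  # 7
```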

Factors affecting reliability are (Heaton, 1975: 155-156; Brown, 2004: 21-22):

1. Student-related reliability: students' personal factors such as motivation, illness, and anxiety can keep them from their 'real' performance.

2. Rater/scorer reliability: both intra-rater and inter-rater inconsistency lead to subjectivity, error, and bias during the scoring of tests.

3. Test administration reliability: when the same test is administered on different occasions, it can produce different results. An example is a test of aural comprehension delivered by tape recorder: when the tape recorder played the items, students sitting next to the windows could not hear the tape accurately because of the street noise outside the building.

4. Test reliability: this concerns the duration of the test and the test instructions. If a test takes a long time to complete, fatigue, confusion, or exhaustion may affect the test-takers' performance, and some test-takers do not perform well on timed tests. Test instructions must be clear to all test-takers, since they are under mental pressure.

Several methods are employed to establish the reliability of an assessment (Heaton, 1975: 156; Weir, 1990: 32; Gronlund and Waugh, 2009: 59-64). They are:

1. Test-retest method: the same test is administered again after a lapse of time, and the two sets of scores are then correlated.

2. Parallel-form/equivalent-forms method: two cloned tests are administered at the same time to the same test-takers, and the results of the two tests are then correlated.

3. Split-half method: a test is divided into two halves and corresponding scores are obtained; the extent to which they correlate with each other governs the reliability of the test as a whole (see the sketch after this list).

4. Test-retest with equivalent forms: a combination of the test-retest and parallel-form methods, in which two cloned tests are administered to the same test-takers on different occasions.

5. Intra-rater and inter-rater methods: employing one person to score the same test at different times is called intra-rater reliability. Some hints for minimizing unreliability are employing a rubric, avoiding fatigue, giving scores on the same scale, and having students write their names on the back of the test paper. When two people score the same test, it is inter-rater reliability: a rubric and a discussion must first be developed so that both raters share the same perception. The two sets of scores, whether intra- or inter-rater, are then correlated.
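Several of these methods come down to correlating two sets of scores. The sketch below (Python; all scores are invented for illustration) estimates split-half reliability with the Spearman-Brown correction and computes a simple inter-rater correlation:

```python
from statistics import correlation  # Pearson r (Python 3.10+)

# Hypothetical item-level scores for six test-takers on a 10-item test.
# Each inner list holds one student's per-item scores (1 = correct).
items = [
    [1, 1, 0, 1, 1, 0, 1, 1, 1, 0],
    [1, 0, 0, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 0, 1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 1, 0, 0, 0, 1],
]

# Split-half method: total the odd-numbered and even-numbered items
# separately for each student, then correlate the two half-scores.
odd_halves = [sum(s[0::2]) for s in items]
even_halves = [sum(s[1::2]) for s in items]
r_half = correlation(odd_halves, even_halves)

# Spearman-Brown correction estimates the reliability of the full-length
# test from the half-test correlation: r_full = 2r / (1 + r).
r_full = (2 * r_half) / (1 + r_half)
print(f"split-half r = {r_half:.2f}, corrected reliability = {r_full:.2f}")

# Inter-rater method: two raters score the same essays with one rubric;
# their scores are correlated directly.
rater_a = [4, 3, 5, 2, 4, 1]
rater_b = [4, 2, 5, 3, 4, 2]
print(f"inter-rater r = {correlation(rater_a, rater_b):.2f}")
```

The Spearman-Brown step matters because a half-length test is less reliable than the full test; the correction projects the half-test correlation back to full length.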

3. VALIDITY

By far the most complex criterion of an effective test, and arguably the most important principle, is validity: the extent to which inferences made from assessment results are appropriate, meaningful, and useful in terms of the purpose of the assessment. A valid test of reading ability actually measures reading ability, not 20/20 vision, previous knowledge of a subject, or some other variable of questionable relevance. To measure writing ability, one might ask students to write as many words as they can in 15 minutes and then simply count the words for the final score. Such a test would be easy to administer (practical), and the scoring would be quite dependable (reliable). But it would not constitute a valid test of writing ability without some consideration of ideas, among other factors.

How is the validity of a test established? There is no final, absolute measure of validity, but several different kinds of evidence may be invoked in its support. In some cases, it may be appropriate to examine the extent to which a test calls for performance that matches that of the course or unit of study being tested. In other cases, we may be concerned with how well a test determines whether or not students have reached an established set of goals or level of competence. Statistical correlation with other related but independent measures is another widely accepted form of evidence. Other concerns about a test's validity may focus on the consequences of a test, beyond measuring the criteria themselves, or even on the test-taker's perception of validity. We will look at these five types of evidence below:

1. Content-Related Evidence

If a test actually samples the subject matter about which conclusions are to be drawn, and if it requires the test-taker to perform the behavior that is being measured, it can claim content-related evidence of validity, often popularly referred to as content validity. You can usually identify content-related evidence observationally if you can clearly define the achievement that you are measuring.

Another way of understanding content validity is to consider the difference between direct and indirect testing. Direct testing involves the test-taker in actually performing the target task. In an indirect test, learners do not perform the task itself but rather a task that is related in some way. For example, if you intend to test learners' oral production of syllable stress and your test task is to have learners mark (with written accent marks) the stressed syllables in a list of written words, you could, with a stretch of logic, argue that you are indirectly testing their oral production. A direct test of syllable production would require that students actually produce the target words orally.

The most feasible rule of thumb for achieving content validity in classroom assessment is to test performance directly. Consider, for example, a listening/speaking class that is doing a unit on greetings and exchanges that includes discourse for asking for personal information (name, address, hobbies, etc.) with some focus on the form of the verb to be, personal pronouns, and question formation. The test on the unit should include the actual performance of listening and speaking.

2. Criterion-Related Evidence

A second form of evidence of the validity of a test may be found in what is called criterion-related evidence, also referred to as criterion-related validity: the extent to which the criterion of the test has actually been reached.

Criterion-related evidence usually falls into one of two categories: concurrent and predictive validity. A test has concurrent validity if its results are supported by other concurrent performance beyond the assessment itself. For example, the validity of a high score on the final exam of a foreign language course will be substantiated by actual proficiency in the language. The predictive validity of an assessment becomes important in the case of placement tests, admissions assessment batteries, language aptitude tests, and the like. The assessment criterion in such cases is not to measure concurrent ability but to assess (and predict) a test-taker's likelihood of future success.
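Predictive validity is typically reported as a correlation between test scores and a later criterion measure. A minimal sketch (Python; the placement scores and course grades are hypothetical, not from the sources cited):

```python
from statistics import correlation  # Pearson r (Python 3.10+)

# Hypothetical data: placement-test scores and the same students'
# end-of-course grades one semester later (both on a 0-100 scale).
placement = [62, 71, 55, 88, 74, 90, 48, 67]
course = [65, 70, 58, 85, 80, 92, 50, 60]

# A strong positive correlation is evidence of predictive validity:
# the placement test anticipates later success in the course.
print(f"predictive validity (r) = {correlation(placement, course):.2f}")
```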

3. Construct-Related Evidence

A third kind of evidence that can support validity, but that does not play as large a role for classroom teachers, is construct-related evidence, commonly referred to as construct validity. A construct is any theory, hypothesis, or model that attempts to explain observed phenomena in our universe of perception. Constructs may or may not be directly or empirically measured; their verification often requires inferential data. Proficiency and communicative competence are linguistic constructs; self-esteem and motivation are psychological constructs.

4. Consequential Validity

As well as the three widely accepted forms of evidence above, two other categories may be of some interest and utility in your own quest for validating classroom tests. Several researchers underscore the potential importance of the consequences of using an assessment. Consequential validity encompasses all the consequences of a test, including such considerations as its accuracy in measuring intended criteria, its impact on the preparation of test-takers, its effect on the learners, and the (intended and unintended) social consequences of a test's interpretation and use. In other words, consequential validity is how well the use of assessment results accomplishes intended purposes and avoids unintended effects.

5. Face Validity

An important facet of consequential validity is the extent to which students view the assessment as fair, relevant, and useful for improving learning, or what is popularly known as face validity. Face validity refers to the degree to which a test looks right, and appears to measure the knowledge or abilities it claims to measure, based on the subjective judgment of the examinees who take it, the administrative personnel who decide on its use, and other psychometrically unsophisticated observers.

Sometimes students do not know what is being tested when they take a test. They may feel, for a variety of reasons, that a test is not testing what it is supposed to test. Face validity means that the students perceive the test to be valid. Face validity cannot be empirically tested by a teacher or even by a testing expert, because it is based on the subjective judgment of the examinees who take it.

4. AUTHENTICITY

A fourth major principle of language testing is authenticity, a concept that is a little slippery to define, especially within the art and science of evaluating and designing tests. Bachman and Palmer define authenticity as the degree of correspondence of the characteristics of a given language test task to the features of a target language use task, and then suggest an agenda for identifying those target language tasks and for transforming them into valid test items.

In a test, authenticity may be present in the following ways:

 The language in the test is as natural as possible.

 Items are contextualized rather than isolated.

 Topics are meaningful (relevant, interesting) for the learner.

 Some thematic organization to items is provided, such as through a story line or episode.

 Tasks represent, or closely approximate, real-world tasks.

5. WASHBACK

The effects of tests on teaching and learning are called washback. Washback refers to the influence of the form and content of the test on what happens in classrooms. Teachers must be able to create classroom tests that serve as learning devices through which washback is achieved. Washback enhances intrinsic motivation, autonomy, self-confidence, language ego, interlanguage, and strategic investment in students. Instead of giving letter grades and numerical scores, which give no information about the students' performance, giving generous and specific comments is a way to enhance washback.

In large-scale assessment, washback generally refers to the effects a test has on instruction in terms of how students prepare for the test. "Cram" courses and "teaching to the test" are examples of such washback. Another form of washback, which occurs more in classroom assessment, is the information that "washes back" to students in the form of useful diagnoses of strengths and weaknesses. Washback also includes the effects of an assessment on teaching and learning prior to the assessment itself, that is, on preparation for the assessment.

Informal performance assessment is by nature more likely to have built-in washback effects, because the teacher is usually providing interactive feedback. Formal tests can also have positive washback, but they provide no washback if the students receive only a simple letter grade or a single overall numerical score. Washback enhances a number of basic principles of language acquisition: intrinsic motivation, autonomy, self-confidence, language ego, interlanguage, and strategic investment, among others. One way to enhance washback is to comment generously and specifically on test performance. Washback implies that students have ready access to the teacher to discuss the feedback and evaluation they have been given. Teachers can raise the washback potential by asking students to use test results as a guide to setting goals for their future effort.

Washback can be negative or positive. Negative washback is easy to find, such as narrowing language competencies down to only those involved in tests and neglecting the rest: although language is a tool of communication, most students and teachers in a language class focus only on the language competencies in the test. On the other hand, a test has positive washback if it encourages better teaching and learning, although this is quite difficult to achieve. An example of the positive washback of a test is the National Matriculation English Test in China: after the test was administered, students' proficiency in English for actual or authentic language-use situations improved.

CAN THESE PRINCIPLES APPLY TO CLASSROOMS?

The five principles of practicality, reliability, validity, authenticity, and washback can serve as guidelines for evaluating classroom testing step by step. Validity is clearly the first priority, and practicality is of minor importance. The questions below, based on features of the five principles, guide that evaluation:

1. Are the test procedures practical?

Practicality is determined by the teacher's (and the students') time constraints, costs, and administrative details, and to some extent by what occurs before and after the test. To determine whether a test is practical for your needs, you may want to use the checklist below:

 Are administrative details clearly established before the test?

 Can students complete the test reasonably within the set time frame?

 Can the test be administered smoothly, without procedural "glitches"?

 Are all materials and equipment ready?

 Is the cost of the test within budgeted limits?

 Is the scoring system feasible in the teacher's time frame?

 Are methods for reporting results determined in advance?

As this checklist suggests, after you account for the administrative details of giving a test, you need to think about the practicality of your plans for scoring the test.

2. Is the test itself reliable?

Reliability applies to both the test and the teacher, and at least four sources of unreliability must be guarded against. Consistency can be achieved by making sure that all students receive the same quality of input, whether written or auditory. Part of achieving test reliability depends on the physical context: make sure, for example, that

 every student has a cleanly photocopied test sheet,

 sound amplification is clearly audible to everyone in the room,

 video input is equally visible to all,

 lighting, temperature, extraneous noise, and other classroom conditions are equal (and optimal) for all students, and

 objective scoring procedures leave little debate about the correctness of an answer.


3. Does the procedure demonstrate content validity?

There are two steps to evaluating the content validity of a classroom test.

1. Are classroom objectives identified and appropriately framed? Underlying every good classroom test are the objectives of the lesson, module, or unit of the course in question. So the first measure of an effective classroom test is the identification of objectives.

2. Are the lesson objectives represented in the form of test specifications? The next content-validity issue that can be applied to a classroom test centers on the concept of test specifications. Don't let this term scare you. It simply means that a test should have a structure that follows logically from the lesson or unit you are testing, as in the sketch below.
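To make the idea concrete, a test specification can be as simple as a mapping from each lesson objective to the section, item format, and number of items that will test it. A hypothetical sketch (Python) for the greetings unit discussed under content validity; the section letters, formats, and counts are invented:

```python
# A hypothetical test specification for the greetings/personal-information
# unit: each lesson objective maps to a test section, an item format,
# and an item count.
TEST_SPEC = [
    {"objective": "ask for personal information orally",
     "section": "A", "format": "paired role-play", "items": 1},
    {"objective": "understand spoken questions about name/address/hobbies",
     "section": "B", "format": "listening, short answers", "items": 6},
    {"objective": "form questions with 'to be' and personal pronouns",
     "section": "C", "format": "guided question formation", "items": 5},
]

# A quick check that every objective is actually represented on the test,
# which is the content-validity question the specification answers.
assert all(row["items"] > 0 for row in TEST_SPEC)
for row in TEST_SPEC:
    print(f"Section {row['section']}: {row['items']} item(s) on "
          f"{row['objective']!r}")
```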

4. Is the procedure “biased for best?”

This question integrates the concept of face validity with the importance of structuring an assessment procedure to elicit the optimal performance of the student. Students will generally judge a test to be face valid if

 Directions are clear

 The structure of the test is organized logically

 Its difficulty level is appropriately pitched

 The test has no "surprises"

 Timing is appropriate

A phrase that has come to be associated with face validity is "biased for best", a term that goes a little beyond how the student views the test to a degree of strategic involvement on the part of student and teacher in preparing for, setting up, and following up on the test itself. To give an assessment procedure that is "biased for best", a teacher


 Offers student appropriate review and preparation for the test  Suggests strategies that will be beneficial and

 Structures the test so that the best students will be modestly challenged and the weaker student will not be overwhelmed.

5. Are the test tasks as authentic as possible?

Evaluate the extent to which a test is authentic by asking the following questions:

 Is the language in the test as natural as possible?

 Are items as contextualized as possible rather than isolated?

 Are topics and situations interesting, enjoyable, and/or humorous?

 Is some thematic organization provided, such as through a story line or episode?

 Do tasks represent, or closely approximate, real-world tasks?

Consider two excerpts from tests, one with contextualized multiple-choice tasks and one with decontextualized multiple-choice tasks, and the concept of authenticity may become a little clearer.

6. Does the test offer beneficial washback to the learner?

The design of an effective test should point the way to beneficial washback. A test that achieves content validity demonstrates relevance to the curriculum in question and thereby sets the stage for washback. When test items represent the various objectives of a unit, and/or when sections of a test clearly focus on major topics of the unit, classroom tests can serve in a diagnostic capacity even if they are not specifically labeled as such.


CONCLUSION:

1. A good test has practicality, high reliability, good validity, authenticity, and positive washback.

2. The five principles provide guidelines for both constructing and evaluating the tests.

3. Teachers should apply these five principles in constructing or evaluating tests which will be used in assessment activities.
