Methods of Assessment
What is item analysis?
What is the facility value of an item? What is its discrimination index?
Item analysis is for judging the value of items in norm-referenced testing. The main information to be obtained about individual items (e.g. Multiple Choice or True / False) is ITEM DIFFICULTY and ITEM DISCRIMINATION.
ITEM DIFFICULTY - finding out the % of people who get the item right in the try-out group.
In norm-referenced testing, one rejects items which are too easy or too difficult because the purpose is to discriminate.
The difficulty of an item = its FACILITY INDEX: % who give the right answer. The usual aim of the test setter is to achieve even to middling facility indices ranging from about 40-60%.
The DISCRIMINATION of an item is judged by comparing those individuals who succeed on a given item with those who score highly on the test as a whole:
Discrimination for any given item = (Correct Tops - Correct Bottoms) / (1/2 × Number of students)
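As a rough illustration of the two indices above, the following Python sketch computes the facility index and the discrimination index for a single item. The response data and the function names are invented for the example. It also assumes that "Tops" and "Bottoms" are the higher- and lower-scoring halves of the group, ranked by total test score; the notes give only the formula.

```python
def facility_index(item_correct):
    """Percentage of the try-out group who answered the item correctly."""
    return 100.0 * sum(item_correct) / len(item_correct)


def discrimination_index(item_correct, total_scores):
    """(Correct Tops - Correct Bottoms) / (1/2 x number of students).

    Assumption: "Tops" and "Bottoms" are the higher- and lower-scoring
    halves of the group, ranked by total test score. With an odd number
    of students the middle-ranked student falls in neither half.
    """
    n = len(item_correct)
    half = n // 2
    # Rank student indices from highest to lowest total score.
    ranked = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
    tops, bottoms = ranked[:half], ranked[n - half:]
    correct_tops = sum(item_correct[i] for i in tops)
    correct_bottoms = sum(item_correct[i] for i in bottoms)
    return (correct_tops - correct_bottoms) / (n / 2)


# Hypothetical try-out group of 10 students: 1 = item answered correctly.
item_correct = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0]
total_scores = [38, 35, 33, 30, 29, 24, 22, 20, 17, 12]

print(facility_index(item_correct))                      # 60.0
print(discrimination_index(item_correct, total_scores))  # 0.4
```

Here 6 of the 10 students get the item right (facility 60%), and 4 of the top five but only 2 of the bottom five succeed, giving (4 - 2) / 5 = 0.4.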
What are the main difficulties and limitations of language testing?
The state of knowledge of Language & Language Learning. Content validity: does what we are testing represent language? Does it match up to the goals of language learning and the objectives of the language learner?
Difficulties in testing Communicative Competence / Interaction - administrative difficulties.
Objective and Norm-referenced tests tend to point towards measures of receptive learning. Learners don't have to produce language in many of these tests. Validity/ Reliability tension.
A good deal of judgement has been used in developing and setting them, e.g. in the rejection or acceptance of items. They also invite guessing (number of alternatives = 4 or 5 for each item).
A much wider sample of grammar, vocabulary & phonology can generally be included in an objective test than a subjective one, but objective tests can never test ability to communicate in the target language, nor can they evaluate actual performance.
Objective tests tend to test receptive learning. The students may produce no language at all.
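To put a number on the guessing point above: if each item has k alternatives and a candidate guesses blindly, the chance of success on each item is 1/k. A small Python sketch follows; the 40-item test length and the function name are assumptions made purely for illustration.

```python
def expected_chance_score(num_items, num_alternatives):
    """Expected number of items answered correctly by blind guessing."""
    return num_items / num_alternatives


# Hypothetical 40-item test with 4 or 5 alternatives per item.
for k in (4, 5):
    print(k, expected_chance_score(40, k))
# 4 alternatives: 10.0 items (25%); 5 alternatives: 8.0 items (20%)
```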
What are norm-referenced tests? Criterion-referenced tests?
Norm-referenced: The purpose is to compare the level of performance of an individual with the general standard of performance shown by the total group that he or she belongs to and can be compared with, e.g. an IQ test.
A Norm-referenced test compares the behaviour of the individual with the behaviour of others.
Criterion-referenced tests: The emphasis is not on how the individual stands in relation to his/her peers BUT on whether or not the individual student knows something rather specific that (s)he ought to know, or can perform something specific that (s)he is supposed to be able to do.
A Criterion-referenced test describes the behaviour of the individual with reference to externally predetermined & specified objectives. The criterion is some externally defined objective.
Examples: Attainment scales for Cambridge Oral Examiners or B.J. Carroll's operational specifications: a set of performance criteria, e.g. the learner possesses the level of skills necessary to allow him or her to function adequately as a tourist.
In NORM-REFERENCED testing, the primary purpose is to DISCRIMINATE. Item difficulty: finding out the % of people in a try-out group who got the item right. Item discrimination: one rejects items that are too easy or too difficult in a norm-referenced test, since those items do not contribute very much to TOTAL DISCRIMINATION.
To establish "discrimination values" compare those individuals who succeed on a given item with those who score highly on the test as a whole. Desired
In CRITERION-REFERENCED testing, each objective is taken one by one. Not interested in the global sum. We assume that there are two kinds of individuals - those who can carry out the operation and those who cannot.
We are interested in dividing them up into two entirely separate, discontinuous groups. But this dichotomy is a problem for criterion-referenced tests: there is a third group of people within the population who are currently engaged in the process of learning to carry out the operation. In this group there will be a continuous rather than a Yes/No distribution. It is the job of the items in a criterion-referenced test to reflect a dichotomy, yet the individuals in the 3rd group will pass some items and fail others.
What is (a) objective and (b) subjective testing?
From the point of view of marking, the objective test has only one correct answer per item, yet the subjective test may result in a range of possible answers, some of which are more acceptable than others.
Note: It is not really the tests that are objective, but the systems by which they are marked.
What are the main subjective testing techniques? Discuss their advantages and disadvantages.
"Global quality scaling": putting in rank order after a quick scan (too clumsy for more than 20 students). Refinement: the Nine Pile System, with approximately the same numbers in each pile.
Assessment in Categories (e.g. vocabulary, grammar, content, form) 5 marks to each; total mark out of 20: probably the commonest way of marking essays. Disadvantage: practice needed to apply the system reliably; reliability between markers may vary.
Also for continuous written work, marker can count off sections of e.g. 8 words and see what can be given credit: sequence of correct words, vocabulary, verb forms, idioms. Needs clearly defined credit points. Useful for marking e.g. letters.
Division of answer into sense groups: marking system can be based on
What are the main objective testing techniques? Discuss the pros and cons.
Objective marking with MC & T / F formats such as i) scripted speech ii) text & argument. Objective marking with EXACT CLOZE and MATCHING formats like iii) transfer iv) unscripted speech. In all these tests, a list of KEYS gives the only correct answers.
With the exceptions of FACE and PREDICTIVE validity, the types of validity outlined above are all ultimately circular, in spite of the existence of many esoteric statistical techniques for assessing validity in these terms.
If our assumptions about the nature of language and language learning are called into question, tests that are perfectly valid in terms of these assumptions, must themselves be called into question.
For example, a test that perfectly satisfies criteria of CONTENT, CONSTRUCT & CONCURRENT validity may nonetheless fail to show in any interesting way how well a candidate performs in a target language.
This happens if the CONSTRUCT of language learning (LL) theory and the CONTENT of a syllabus are themselves not related to this aim, or if the test is validated against other language tests which do not concern themselves with this objective, i.e. communicative competence / performance of X.
Validity exists only in terms of specified criteria. If the criteria turn out to be the wrong ones, then validity claimed in terms of them turns out to be spurious.
The CONTENT VALIDITY of a test is assured by the accuracy of the specification. In reviewing results, if results are the same as before, or as intended or reasonable in the circumstances for which the test is designed, the test can be counted as satisfactory.
PRAGMATIC VALIDITY: Correlate test scores with the scores or ratings obtained from a criterion measure. By definition, anything which serves as a criterion is taken to possess validity.
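A minimal sketch of such a correlation in Python is given here. It assumes the Pearson product-moment coefficient (the notes say only "correlate", so another coefficient may have been intended), and the scores, ratings and function name are invented for the example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of cross-products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: test scores and criterion ratings for six candidates.
test_scores = [52, 61, 45, 70, 66, 58]
criterion_ratings = [3, 4, 2, 5, 4, 3]

print(round(pearson_r(test_scores, criterion_ratings), 2))
```

A coefficient close to +1 would indicate that the test ranks candidates in much the same way as the criterion measure does. It says nothing about whether the criterion itself is the right one.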
What are the main requirements of an efficient test?
The problem of test rubric: How far do format and instructions affect assessment?
Instructions given to students on how to do the test are an important aspect of validity. The wording should be chosen very carefully, especially at lower levels. Familiarity with the form of the test also matters. For clarity, you should use the L1 if this is the same for all students.
Wider sense of "rubric": all arrangements made by the setter to enable the student to give attention to content: layout, length of pauses in LC etc.
If the rubric is poor, you will not be able to tell from the test whether students have learnt the required skill or not. Note: use of L1 in placement test rubrics.
What is test discrimination? When is it not needed?
Discrimination is the capacity of a test to discriminate among different candidates and to reflect the differences in the performances of the individuals in the group. E.g. 70% means nothing at all unless all other scores obtained in the test are known.
Tests on which almost all candidates score 70% clearly fail to discriminate between various students.
Tests which are to be standardized (i.e. intended for a large test population) are first tried out on a representative sample of testees: SAMPLING. The small sample mirrors the much larger group for whom the test is intended.
The results of this test are then examined to determine the extent to which it discriminated between individuals who are different. Discriminatory powers must be established.
Discriminatory powers may be needed with PLACEMENT TESTS, but may not be needed with DIAGNOSTIC TESTS or tests concerned with how much of a syllabus students have mastered (i.e. administered internally).
Also with Criterion-Referenced Tests / Performance tests, all you may be interested in is whether or not the testees can perform the criterion, though note the existence of the third group: learning to perform - some right; some wrong.
What are the problems of assessing communication in an L2?
The above situation increases the difficulty of analysing precisely what is being tested at any one time. Without tape-recording (to check back), the language is transient. The examiner is making subjective judgements quickly under great pressure.
Limited time. It is often impossible to administer this kind of oral test with large numbers of students.
The nature of language: Interaction based (even letter-writing is affected by interaction factors in a real situation)
Unpredictability (processing of unpredictable data in real time = vital aspect of using language); Context of situation (physical environment, role/status, formality); Purpose (every utterance is made for a purpose); Performance (successive approximation); Authenticity (candidate's ability to come to terms with the unknown, i.e. the unsimplified authentic text). Watered-down items don't measure communicative ability.
Behaviour-based (behavioural outcomes: primacy of content over form in language. A test of communicative ability will have to be criterion referenced against the operational performance of a set of authentic language tasks.)
Performance Tests: Global Task + Enabling Skills, but what if a candidate can handle all the enabling skills individually but cannot mobilize them in a use situation, e.g. driving a car? Hence the need for performance tests.
How do we avoid hopelessly subjective measures of production, and how do we set receptive tasks appropriate for different levels of language proficiency? Note: B.J. Carroll's 1977 specification: size, complexity, range etc.
Summary of problems: which performance operations; level of proficiency anticipated; enabling skills involved in performing, tested individually or together; specifying content; format.