This chapter covers how tests are scored and interpreted:
the importance and functions of tests in education, the
analysis of test data, and the interpretation and use of
test results. In the analysis process, descriptive
statistics help the teacher summarize student scores in a
comprehensible way. Different methods of interpreting test
results are outlined. The chapter ends with a brief
discussion of the ways in which test results are used by
different stakeholders in the education process.
Scoring of Tests
The following guidelines are suggested for scoring tests:
Remember that multiple-choice tests are difficult to design and difficult to administer, especially in a large class, but easy to score. They are easy to score because each item usually has one correct answer, which must be accepted across the board.
Essay tests are relatively easy to set and administer, especially in a large class. They are, however, difficult to mark or assess, because essay questions require a lot of writing in sentences and paragraphs, and the examiner must read all of it.
Scoring or marking on impression is dangerous. Some students are very good at impressing examiners with flowery language without real academic substance. If you mark on impression, you may be carried away by the language rather than by the relevant content.
Scoring can be done question by question or all questions at a time. The best way is to mark one question across the board for all students before moving to the next. Sometimes this may not be feasible, or may be tedious, especially in a large class.
USING TEST RESULTS
As mentioned earlier, conducting tests is not an end in itself; the results are meant to be used. Before the results can be used, however, the teacher needs to know how well designed the test is in terms of difficulty level and discrimination power. The teacher should then be able to compare a child’s performance with that of his peers in the class. Occasionally, the teacher may also want to compare the child’s performance in one subject area with another.
To do this, the teacher carries out the following activities at various times:
1. Item analysis.
2. Drawing of frequency distribution tables.
3. Finding measures of central tendency.
4. Finding measures of variability.
5. Derived scores.
Item analysis is the process of examining class-wide performance on individual test items; it helps to decide whether a test is good or poor. There are three common types of item analysis, which provide teachers with three different types of information:
Item Difficulty gives information about the difficulty level of a question.
Item Discrimination indicates how well each question shows the difference (discriminates) between the bright and dull students. In essence, item analysis is used for reviewing and refining a test.
Analysis of Response Options - In addition to examining the performance of an entire test item, teachers are often interested in examining the performance of individual distractors (incorrect answer options) on multiple-choice items. By calculating the proportion of students who chose each answer option, teachers can identify which distractors are "working" and appear attractive to students who do not know the correct answer, and which distractors are simply taking up space and not being chosen by many students. To eliminate blind guessing which results in a correct answer purely by chance (which hurts the validity of a test item), teachers want as many plausible distractors as is feasible. Analyses of response options allow teachers to fine tune and improve items they may wish to use again with future classes.
Item Difficulty
Teachers produce a difficulty index for a test item by calculating the proportion of students in class who got an item correct. (The name of this index is counter-intuitive, as one actually gets a measure of how easy the item is, not the difficulty of the item.) By difficulty level we mean the proportion of students who got a particular item right in a given test.
For example, if in a class of 37 students, 24 of the students got a question correct, then the difficulty level is 0.65 or 65% (24/37). The proportion ranges from 0 to 1, or 0% to 100%.
Example:
Interpretation: An item with an index of 0 is too difficult, since everybody missed it, while an item with an index of 1 is too easy, since everybody got it right. Items with an index of about 0.5 are usually suitable for inclusion in a test.
Though items with indices of 0 and 1 may not really contribute to an achievement test, they help the teacher determine how well the students are doing in that particular area of the content being tested. Hence, such items could be included. However, the mean difficulty level of the whole test should be about 0.5 or 50%.
Number of students choosing each answer option (* marks the correct answer):

Students    A*    B    C    D    Total of students
Number      11    0    1    1    13
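The difficulty index described above can be sketched in a few lines of Python. This is only an illustration: the function name `item_difficulty` and the variable names are assumptions made for this example, and the data are the two cases given in the text (24 of 37 students, and the option counts 11, 0, 1, 1).

```python
def item_difficulty(num_correct, num_students):
    """Return the proportion of students who answered the item correctly."""
    if num_students <= 0:
        raise ValueError("num_students must be positive")
    return num_correct / num_students


# Class of 37 students in which 24 answered the question correctly.
p_class = item_difficulty(24, 37)

# Counts per answer option from the table above; A is the keyed answer.
option_counts = {"A": 11, "B": 0, "C": 1, "D": 1}
total_students = sum(option_counts.values())
p_item = item_difficulty(option_counts["A"], total_students)

# Share of the class choosing each option, useful for checking distractors.
option_shares = {opt: n / total_students for opt, n in option_counts.items()}

print(round(p_class, 2))  # 0.65
print(round(p_item, 2))   # 0.85
```

An index of 0.85 would mark the second item as a fairly easy one, and the shares show that option B attracted nobody.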
Item Discrimination
The discrimination index shows how well a test item discriminates between the bright and the dull students. A test with many poor questions will give a false impression of the learning situation. Usually, a discrimination index of 0.4 or above is acceptable. Items which discriminate negatively are bad; this may be because of wrong keys, vagueness or extreme difficulty.
Item Discrimination is the proportion of the better-prepared students who had the item correct minus the proportion of less-prepared students who had the item correct.
The following table is an example of an item analysis for item 12 from a 20-item, four-option multiple-choice test (* marks the correct answer).
Students        A       B       C*      D
Upper 25%       0.20    0.00    0.80    0.00
Lower 25%       0.30    0.00    0.20    0.50
All students    0.26    0.00    0.38    0.36
None of the best-prepared students (the upper 25%) chose option B or D, whereas 80% of those students chose option C (the correct choice) and 20% chose option A. The next row of the item analysis shows the proportion of students from the lowest quartile (the 25% of the students who had the lowest total scores on the test). Of this group, 50% chose option D, 30% chose option A, and only 20% chose the correct answer, option C.
The last row of the item analysis shows the proportion of all students in the class who chose each of the four options. The item difficulty level is therefore 0.38 (38% chose option C, the correct answer).
Note: Discrimination Indices range from -1.0 to 1.0.
To determine the item discrimination index, take the difference between the proportion of the better-prepared students who had the item correct and the proportion of less-prepared students who had the item correct. For our example: Item discrimination index = 0.80 - 0.20 = 0.60
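The same subtraction can be written as a short Python sketch. The function name `discrimination_index` and the dictionary layout are assumptions for illustration; the proportions are those given for item 12.

```python
def discrimination_index(p_upper, p_lower):
    """Upper-group proportion correct minus lower-group proportion correct."""
    for p in (p_upper, p_lower):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie between 0 and 1")
    return p_upper - p_lower


# Proportion of each group choosing each option for item 12 (C is the key).
item_12 = {
    "upper": {"A": 0.20, "B": 0.00, "C": 0.80, "D": 0.00},
    "lower": {"A": 0.30, "B": 0.00, "C": 0.20, "D": 0.50},
}
key = "C"
d_index = discrimination_index(item_12["upper"][key], item_12["lower"][key])

print(round(d_index, 2))  # 0.6
```

A negative result, for instance when the lower group outperforms the upper group, would flag the item for review.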
Distribution and Measures
Central tendency
Indicates the value around which the data tend to cluster: mean, median and mode
Measures of Variability
Indicate how concentrated the data are around the mean:
standard deviation, coefficient of variation, range, variance, maximum and minimum
Point Measures (quantiles)
Divide an ordered set of data into groups with the same number of individuals:
Percentiles, deciles, quartiles, ...
Distribution
Measures of Central Tendency
Measures of central tendency provide information about the average or typical score in a data set.
Mean: the most widely used and familiar average, and the most reliable and stable of all measures of central tendency.
$$\bar{x} = \frac{\sum x}{n} = \frac{40 + 70 + 100 + 150 + 140 + 80 + 45}{7} = \frac{625}{7} = 89.29 \text{ inches}$$
Interpretation: The average height in inches is 89.29
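As a quick check of the calculation, the mean can be computed in Python with the standard `statistics` module. The variable names are chosen for this sketch only.

```python
from statistics import mean

# Heights in inches from the example.
heights_in = [45, 80, 140, 150, 100, 70, 40]

# Sum of the values divided by how many there are.
by_hand = sum(heights_in) / len(heights_in)
average_height = mean(heights_in)

print(sum(heights_in), len(heights_in))  # 625 7
print(round(average_height, 2))          # 89.29
```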
Descriptive Statistics with SPSS version 23
For the above example, follow the procedure for entering data into SPSS.
SPSS procedure: Create data file
Creating a new SPSS data file consists of two stages:
• Defining the variables
• Entering the data
Defining Variable
Step 1. Click Variable View in the lower-left corner of the data editor window (see the following figure). Type the variable name in the first cell under the Name column, choosing a name that reflects the data you want to analyze.
Height in inches: 45, 80, 140, 150, 100, 70, 40
Step 3. Go to Analyze > Descriptive Statistics > Frequencies and follow the steps in the figure.
Step 4. Output

Statistics: Height in Inches
N    Valid      7
     Missing    0
Mean            89.29
Median          80.00
Measures of Central Tendency Cont.
Median: the middle score in a set of ranked scores; the score that divides the distribution into halves. It is sometimes called the counting average.
Steps to computing the median
1. Line up the scores from lowest to highest
2. Count up to the middle score
• If there is 1 middle score, that’s the median
• If there are 2 middle scores, the median is their average
Mode
Most common value. In the previous example (the height in inches), there is no mode, because nobody has the same height.
Median (Me) = 80 inches
Interpretation: 50% of the participants have a height of 80 inches or less.
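The median steps and the check for a mode can be sketched as follows, using the height data from the example. The helper names are assumptions for illustration; `median` and `multimode` come from Python's standard `statistics` module.

```python
from statistics import median, multimode

heights_in = [45, 80, 140, 150, 100, 70, 40]

# Step 1: line up the scores from lowest to highest.
ranked = sorted(heights_in)

# Step 2: count up to the middle score (average the two middle scores
# when the number of scores is even).
middle = len(ranked) // 2
if len(ranked) % 2 == 1:
    manual_median = ranked[middle]
else:
    manual_median = (ranked[middle - 1] + ranked[middle]) / 2

# multimode returns every value tied for the highest count; when all
# values are distinct it returns all of them, i.e. there is no mode.
most_common = multimode(heights_in)
has_mode = len(most_common) < len(set(heights_in))

print(manual_median)  # 80
print(has_mode)       # False
```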
Measures of variability
Indicate how concentrated the data are around the mean, that is, how far the measurements are from the center. Special cases are:
variance, standard deviation, coefficient of variation, range, maximum and minimum
Range
The range is the difference between the maximum and minimum values in a set:
RANGE = (Xlargest – Xsmallest)
Example
Data set 1: [1, 25, 50, 75, 100]; R = 100 - 1 = 99
Data set 2: [48, 49, 50, 51, 52]; R = 52 - 48 = 4
The range ignores how data are distributed and only takes the extreme scores into account.
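A minimal Python sketch of the range for the two data sets in the example. The function name `value_range` is an assumption for illustration.

```python
def value_range(scores):
    """Difference between the largest and smallest value."""
    return max(scores) - min(scores)


data_set_1 = [1, 25, 50, 75, 100]
data_set_2 = [48, 49, 50, 51, 52]

range_1 = value_range(data_set_1)
range_2 = value_range(data_set_2)

print(range_1, range_2)  # 99 4
```

Both sets have the same middle value of 50, yet their ranges differ widely, which is why a measure of spread is reported alongside the average.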
Variance
Formula of Variance:
Standard Deviation
Shows how the data scatter about the mean. The standard deviation (SD) quantifies variability. It is expressed in the same units as the data.
A small standard deviation means that the group has small variability, that is, it is relatively homogeneous. For normally distributed data, about 68% of the observations lie within one standard deviation of the mean, and about 95% lie within two standard deviations.
Example: A sample of 9 students is taken and their scores (0 - 100) are recorded. You want to know the variability of these scores.
X (score): 54, 77, 67, 68, 46, 64, 62, 56, 38
Interpretation: The variability of the scores around the mean is 11.98 ≈ 12. The standard deviation, usually reported together with the mean, helps you see how a set of data values is distributed around its mean. In our example you conclude that most scores of the students in this sample lie between 47.13 (59.11 - 11.98) and 71.09 (59.11 + 11.98).
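The sample variance and standard deviation for these nine scores can be checked with Python's `statistics` module, which divides by n - 1. This is a sketch with illustrative variable names. At full precision the results come out near 143.36 and 11.97, slightly below the figures quoted in the text; the gap is probably due to rounding the mean to 59.11 in the hand calculation (an inference, not something the text states).

```python
from statistics import mean, stdev, variance

scores = [54, 77, 67, 68, 46, 64, 62, 56, 38]

m = mean(scores)
s2 = variance(scores)  # sample variance, divides by n - 1
s = stdev(scores)      # square root of the sample variance

# Interval of one standard deviation on either side of the mean.
low, high = m - s, m + s

# Coefficient of variation: SD expressed as a percentage of the mean.
cv_percent = 100 * s / m

print(round(m, 2), round(s2, 2), round(s, 2))  # 59.11 143.36 11.97
```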
Population variance: $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$
Sample variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
For the example: $\bar{x} = 59.11$, $s^2 = 143.51$, $s = \sqrt{143.51} = 11.98$
Point Measures: Quantiles
A quantile of a given order is the value of the variable below which a given cumulative frequency of the data falls. Special cases are the quartiles, deciles and percentiles:
Quartiles: divide the data into 4 equal parts
Deciles: divide the data into 10 equal parts
Percentiles: divide the data into 100 equal parts
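A short sketch of quartiles and deciles using `statistics.quantiles` from the Python standard library, applied to the nine scores from the standard deviation example. Note that statistical packages use slightly different interpolation rules, so the first and third quartile may differ a little from hand or SPSS results; the second quartile always coincides with the median.

```python
from statistics import median, quantiles

scores = [54, 77, 67, 68, 46, 64, 62, 56, 38]

# n=4 returns the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(scores, n=4)

# n=10 returns the nine cut points that split the data into tenths.
deciles = quantiles(scores, n=10)

print(q2 == median(scores))  # True
print(len(deciles))          # 9
```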
DERIVED SCORES
The Normal Distribution: A “bell-shaped” curve in which most of the scores are clustered around the mean; the farther from the mean, the less frequently a score occurs. For a normal distribution:
mean = median = mode
Commonly Reported Test Scores Based on the Normal Curve
Z Scores
• The most fundamental standard score, which is a simple conversion of an individual’s raw score to a new score that has a mean of 0 and standard deviation of 1.
• To compute a Z score, subtract the mean from a raw score and divide by the standard deviation (SD):
$$Z = \frac{X - M}{SD}$$
• To convert a Z score back to a raw score, multiply the Z score by the SD and then add the mean:
$$X = (Z)(SD) + M$$
Where:
• Z = Z-score
• X = any raw score
• M = the mean
• SD = standard deviation
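The two conversions can be sketched as a pair of Python functions. The function names and the sample values (a raw score of 70 on a test with mean 60 and SD 5) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def z_score(x, m, sd):
    """Convert a raw score to a Z score: (X - M) / SD."""
    if sd <= 0:
        raise ValueError("sd must be positive")
    return (x - m) / sd


def raw_score(z, m, sd):
    """Convert a Z score back to a raw score: (Z)(SD) + M."""
    return z * sd + m


# Hypothetical values for illustration only.
z = z_score(70, 60, 5)
x = raw_score(z, 60, 5)

print(z)  # 2.0
print(x)  # 70.0
```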
Example of converting a raw score to a z-score
Example: Let’s say five students take a Statistics exam and the data are as follows:

Student    Raw score (x)    x²     Z-score
1          15               225     0.66
2          10               100    -0.71
3          17               289     1.21
4          13               169     0.11
5           8                64    -1.26
Total      63               847

Steps:
1. Mean: $\bar{x} = \frac{\sum x}{n} = \frac{63}{5} = 12.6$
2. Variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$ or $s^2 = \frac{\sum x^2 - n\bar{x}^2}{n - 1} = \frac{847 - 5 \times 12.6^2}{4} = 13.3$
3. Standard deviation: $s = \sqrt{13.3} \approx 3.65$
4. Z-score: $Z = \frac{x - \bar{x}}{s}$
Student    Raw score (x)    Z-score    Normal probability value    Percentile
1          15                0.66      0.7454                      74.54%
2          10               -0.71      (1 - 0.7611) = 0.2389       23.89%
3          17                1.21      0.8869                      88.69%
4          13                0.11      0.5438                      54.38%
5           8               -1.26      (1 - 0.8962) = 0.1038       10.38%
Interpreting: From this example we can see that student 1, who scored 15 on the exam, has a z-score of 0.66. Looking this value up in the standard normal probability table gives 0.7454, which means the student has a percentile score of approximately 74.54%.
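The table lookup can be reproduced in Python, where `statistics.NormalDist().cdf` plays the role of the printed normal table. This is a sketch with illustrative names; the z-scores are rounded to two decimals first, as one would do before using a printed table.

```python
from statistics import NormalDist, mean, stdev

raw_scores = {1: 15, 2: 10, 3: 17, 4: 13, 5: 8}

values = list(raw_scores.values())
m = mean(values)   # 12.6
s = stdev(values)  # sample SD, about 3.65

standard_normal = NormalDist()  # mean 0, standard deviation 1

results = {}
for student, x in raw_scores.items():
    z = round((x - m) / s, 2)                   # rounded as for a z-table
    percentile = standard_normal.cdf(z) * 100   # area below z, in percent
    results[student] = (z, round(percentile, 2))

for student, (z, percentile) in results.items():
    print(student, z, percentile)
```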
Z Scores - Example
Consider the scores obtained in Inferential and Research in the table below. We cannot easily tell which of the subjects was more demanding or in which the examiner was more generous. Hence, for justice and fair play, it is advisable to convert the scores in the two subjects into common scores (standard scores) before they are ranked. Z scores are often used.
Total: Mean = 93, SD = 5.558
Student    Score in Inferential    Score in Research    Total    Rank    Z-Score
A          68                      20                    88       8      -0.900
B          58                      45                   103       1       1.800
C          47                      39                    86       9      -1.260
D          45                      40                    85      10      -1.440
E          54                      42                    96       3       0.540
F          50                      48                    98       2       0.900
G          62                      30                    92       7      -0.180
H          59                      36                    95       4       0.360
I          48                      46                    94       5       0.180
J          52                      41                    93       6       0.000
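The totals, ranks and Z scores in this table can be reproduced with a short Python sketch. The variable names are assumptions for illustration, and the computed Z scores may differ from the table in the third decimal place because of rounding.

```python
from statistics import mean, stdev

inferential = {"A": 68, "B": 58, "C": 47, "D": 45, "E": 54,
               "F": 50, "G": 62, "H": 59, "I": 48, "J": 52}
research = {"A": 20, "B": 45, "C": 39, "D": 40, "E": 42,
            "F": 48, "G": 30, "H": 36, "I": 46, "J": 41}

# Total score per student across the two subjects.
totals = {name: inferential[name] + research[name] for name in inferential}

total_mean = mean(list(totals.values()))
total_sd = stdev(list(totals.values()))  # sample SD (n - 1)

z_scores = {name: (t - total_mean) / total_sd for name, t in totals.items()}

# Rank 1 goes to the highest total.
ordered = sorted(totals, key=totals.get, reverse=True)
ranks = {name: position for position, name in enumerate(ordered, start=1)}

print(total_mean, round(total_sd, 3))  # 93 5.558
```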
Z Scores - SPSS
• Percentiles represent the point on the normal curve below which a
given percentage of test scores falls.
• A student’s percentile rank on a test indicates the percentage of
students in the comparison group who scored lower.
For example, if a student is ranked in the 55th percentile, the
student scored higher than 55% of the comparison group who
took the test.
Percentile Ranks
This expresses a given score in terms of the percentage of scores below
it. For example, in a class of 30, Emanuel scored 60 and 24 pupils
scored below him. The percentage of scores below 60 is:
$$\frac{24}{30} \times 100 = 80\%$$
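A minimal Python sketch of the percentile rank as defined here (percentage of scores below a given score). The function name and the class list are hypothetical: the list is built only so that 24 of 30 pupils fall below Emanuel's score of 60.

```python
def percentile_rank(score, all_scores):
    """Percentage of scores in all_scores that are below the given score."""
    if not all_scores:
        raise ValueError("all_scores must not be empty")
    below = sum(1 for s in all_scores if s < score)
    return 100 * below / len(all_scores)


# Hypothetical class of 30: 24 pupils below 60, Emanuel at 60, 5 above.
class_scores = [50] * 24 + [60] + [70] * 5

emanuel_rank = percentile_rank(60, class_scores)

print(emanuel_rank)  # 80.0
```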
Stanines
Stanines are standard scores based on normalized z-scores. They extend from 1 to 9 with a mean of 5 and a standard deviation of 2. Parents find stanine results easiest to understand because their child’s standardized test scores are reported as:

Stanine    Letter Grade    Remark
9          A1              Excellent
8          A2              Very Good
7          A3              Good
6          C4              Credit
5          C5              Credit
4          C6              Credit
3          P7              Pass
2          P8              Pass
Stanines - Example
Student    Raw score (0 to 30)    Z-score    Stanine
A          15                     -0.61      4
B          26                      1.01      7
C          10                     -1.35      2
D          17                     -0.32      4
E          28                      1.30      8
F          19                     -0.02      5
Compare the z-score to the ranges of stanine scores.
Stanine 1 consists of z-scores below -1.75; stanine 2 is -1.75 to -1.25; and so on in steps of 0.5, up to stanine 9, which consists of z-scores above 1.75.
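The comparison of a z-score with the stanine ranges can be sketched in Python. The cut points below follow the usual convention of bands half a standard deviation wide (an assumption beyond the two ranges the text spells out), and the function name is illustrative. A z-score that falls exactly on a cut point is assigned to the higher stanine here.

```python
from bisect import bisect_right

# Boundaries between stanines 1|2, 2|3, ..., 8|9 in z-score units.
STANINE_CUTS = [-1.75, -1.25, -0.75, -0.25, 0.25, 0.75, 1.25, 1.75]


def stanine(z):
    """Map a z-score to a stanine from 1 to 9."""
    return bisect_right(STANINE_CUTS, z) + 1


# Z-scores from the example table.
z_scores = {"A": -0.61, "B": 1.01, "C": -1.35,
            "D": -0.32, "E": 1.30, "F": -0.02}
stanines = {student: stanine(z) for student, z in z_scores.items()}

# -0.32 lies in the -0.75 to -0.25 band, which is stanine 4.
print(stanines)
```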
Normal distribution curve showing
Exercises
1. The following tables show the results from a Testing, Measurement and Evaluation exam. (Note: the correct answer for each item is marked with an asterisk.) Analyze each item (difficulty and discrimination for each table).
a. Find the item difficulty and interpret the result.
b. Find the item discrimination index and interpret the result.
Item 1:
Students        A       B       C*      D
Upper 25%       0.25    0.35    0.40    0.00
Lower 25%       0.40    0.10    0.50    0.00
All students    0.30    0.20    0.45    0.05
Item 2:
Students        A*      B       C       D
Upper 25%       0.75    0.05    0.15    0.05
Lower 25%       0.73    0.00    0.20    0.07
All students    0.74    0.02    0.18    0.06
2. Find and interpret the difficulty of the item for the following data
Note: B* is the correct answer
Students A B* C D
Number of Students Choosing Each
2. In a class of 9, the scores are 29, 85, 78, 73, 40, 35, 20, 10 and 5.
a. Find the measures of central tendency and interpret them. Which measure is the most representative for these data, and why?
b. Find and interpret the measures of variability (standard deviation and coefficient of variation).
c. Find and interpret the Z scores.
3. (a) Find the mean and standard deviation for the following marks and interpret the result. (b) Find the Z-scores and interpret them.
20, 45, 39, 40, 42, 48, 30, 46 and 41.
4. Explain why:
5. Which student did better in the test? The table shows the performance of each student.
Student    Mon    Tue    Wed    Thu    Fri
Samuel     20     21     22     20     21
Pheneas    30     15     12     36     28
a. Find the mean and standard deviation for Samuel and Pheneas.
b. Who is most consistent?
c. Who makes the most parts in a week?
6. Harry scored 55 in an English test for which the mean was 50 and the standard deviation was 6. He scored 64 in a Mathematics test for which the mean was 59 and the standard deviation 9.
a. Calculate his standardized score for each subject.
Rules to interpret the discrimination index:
Items with a discrimination index of 0.40 or greater are very good items.
Items with a discrimination index between 0.30 and 0.39 are quite good, but could be improved.
Items with a discrimination index between 0.20 and 0.29 should be corrected and improved.
Items with a discrimination index of 0.19 or less are very weak and must be removed from the test if they cannot be corrected and improved.
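These rules can be written as a small Python helper. This is a sketch: the function name and the label wording are assumptions, and the bands are treated as continuous (for example, anything from 0.30 up to but not including 0.40 counts as "quite good"), which the rules above do not state explicitly.

```python
def classify_discrimination(index):
    """Return a verbal rating for a discrimination index (-1.0 to 1.0)."""
    if not -1.0 <= index <= 1.0:
        raise ValueError("discrimination index must lie between -1 and 1")
    if index >= 0.40:
        return "very good"
    if index >= 0.30:
        return "quite good, could be improved"
    if index >= 0.20:
        return "to be corrected and improved"
    return "very weak, remove if it cannot be corrected"


# 0.60 is the index found earlier for item 12; the others are hypothetical.
for value in (0.60, 0.35, 0.25, -0.10):
    print(value, "->", classify_discrimination(value))
```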
7. Interpret the following table
Item Analysis of the Academic Achievement Test for Inferential Statistics Lesson
Item No   Difficulty (Pj)   Discrimination (rjx)   Item No   Difficulty (Pj)   Discrimination (rjx)
1 0.24 0.25 21 0.20 0.24
2 0.78 0.37 22 0.62 0.56
3 0.19 0.13 23 0.21 -0.02
4 0.6 0.42 24 0.65 0.66
5 0.65 0.45 25 0.87 0.37
6 0.58 0.53 26 0.54 0.6
7 0.63 0.54 27 0.37 0.44
8 0.54 0.47 28 0.67 0.59
9 0.47 0.49 29 0.71 0.61
10 0.34 0.37 30 0.71 0.61
11 0.41 0.37 31 0.66 0.67
12 0.72 0.57 32 0.85 0.46
13 0.59 0.52 33 0.77 0.57
14 0.67 0.59 34 0.72 0.6
15 0.49 0.53 35 0.85 0.45
16 0.64 0.65 36 0.64 0.55
17 0.65 0.54 37 0.69 0.57
18 0.56 0.7 38 0.5 0.58
19 0.57 0.57 39 0.68 0.57