SPECIAL ARTICLE
AMERICAN BOARD OF PEDIATRICS
It is desirable that as many people as possible should know how much care and thought is given both to the preparation and to the subsequent analysis of the examinations of the American Board of Pediatrics. For this reason the following comments on the examination of January 1950 are published for general information.
The written examination of the American Board of Pediatrics given in January 1950
has been subjected to statistical analysis. It is proposed that all subsequent examinations be analyzed in a similar or improved way in order to learn whether modifications are
accomplishing the purpose for which they were made and in the endeavor to improve
the accuracy of the grading. Some of the results of this first statistical analysis may be of interest and may help in understanding how reliable the examination in its present form is.
The examination consisted of 200 false and true statements and was taken by 353
candidates. A majority of the candidates marked all of the statements as being either
false or true. That is to say, they marked with confidence when they knew and guessed when they did not know. However, a fairly large number of candidates refused to
commit themselves at all when they did not know. It is of interest that one of the highest
grades ever earned in these examinations was achieved by a candidate also distinguished
by having refused to commit himself on the largest number of statements. The method
of grading is one which yields essentially the same figure whether or not the candidate
elects to guess. To the number of correctly answered statements one adds half the number
of unanswered statements and from the total subtracts 100; the remainder is the grade.
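The grading rule can be illustrated with a minimal Python sketch. The function names are hypothetical, and the second function assumes that a candidate answers correctly every statement he knows and has an even chance on each statement he guesses.

```python
def grade(correct, unanswered):
    """Grade as described in the text: correct + half of unanswered - 100."""
    return correct + unanswered / 2 - 100


def expected_grade_if_guessing(known_correct, unknown):
    """Average grade when every unknown statement is guessed.

    Assumption (not from the article): each guess on a true/false
    statement is right with probability 0.5.
    """
    expected_correct = known_correct + 0.5 * unknown
    return grade(expected_correct, 0)


# A candidate who knows 150 statements and leaves 20 blank ...
abstaining = grade(150, 20)
# ... earns, on the average, the same grade as if he had guessed those 20.
guessing = expected_grade_if_guessing(150, 20)
```

Under these assumptions both figures come to 60, which is the sense in which the grade is essentially the same whether or not the candidate elects to guess. Any single guessing candidate may of course fall above or below that average.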
The distribution of the grades earned by the 353 candidates is shown on the
accompanying chart (histogram). The form of the histogram suggests that we may be dealing
with an essentially normal distribution. The suggestion is corroborated by additional data on the chart. Medians and quartiles have been determined; these statistics are quite
independent of an assumption of normality of distribution. The median is a grade which
divides the candidates in half in such a way that 50% have lower grades and 50% have
higher grades. The median is given as 54.9. Since no candidate can earn a grade ending
in the decimal 9, the meaning of the decimal deserves a word of explanation. Let us
suppose that 10 candidates earned a mark of exactly 54 and that 9 must be placed in the
lower one-half and 1 in the upper one-half to effect the equal division of the total
number of candidates about a median. Statistically this is done by using the decimal 9.
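The decimal convention just described can be sketched in Python. The function name and the twenty grades below are hypothetical, chosen only to reproduce the situation in the text (10 candidates tied at 54, of whom 9 belong in the lower half). The rule of adding the fraction directly to the tied grade follows the explanation above and is not necessarily the exact procedure used for the chart.

```python
from collections import Counter


def interpolated_median(grades):
    """Median with a decimal showing how a tied group is split.

    Walk up the sorted grades until the tied group containing the
    halfway point is reached, then add the fraction of that group
    which must fall in the lower half.
    """
    half = len(grades) / 2
    counts = Counter(grades)
    below = 0  # candidates with grades strictly lower than `value`
    for value in sorted(counts):
        tied = counts[value]
        if below + tied >= half:
            return value + (half - below) / tied
        below += tied
    raise ValueError("no grades supplied")


# Hypothetical data: 1 grade below 54, 10 tied at 54, 9 above 54.
sample = [50] + [54] * 10 + [60] * 9
median = interpolated_median(sample)  # 54 + 9/10
```

With these twenty grades the halfway point is the tenth candidate, so 9 of the 10 tied candidates fall in the lower half and the median is reported as 54.9.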
Quartiles are determined in a similar way. Since one-fourth of the grades are below the first quartile and one-fourth are higher than the third quartile, it follows that half the grades are between the first and third quartiles. Again let us stress the point that these
statistics are determined by actual counting of the grades and do not depend on the type of distribution, normal or otherwise.
In contrast, the fidelity with which a mean (or average) and its standard deviation
describe a distribution does depend upon normality. These statistics have also been
from it, the probable error of the distribution, has been used in preparing the chart.
This has been done because the range covered by the mean plus and minus one probable
error will for normally distributed data encompass 50% of the figures. Direct comparison
with the quartile figures is then possible. The essential identity of the ranges obtained by
the two methods of computation speaks for the normality of the distribution of these 353
grades.
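The relation between the probable error and the quartiles can be checked with a short Python sketch. It assumes a perfectly normal distribution and takes the mean and the mean plus and minus one probable error from the chart (55.6, 48.2 and 63.1); the standard deviation is not quoted in the text and is inferred here from those figures.

```python
from statistics import NormalDist

# For a normal distribution the probable error is about 0.6745 standard
# deviations: the distance from the mean to the third quartile.
PE_FACTOR = NormalDist().inv_cdf(0.75)


def coverage_within_probable_error(mean, sd):
    """Fraction of a normal distribution lying within mean +/- one P.E."""
    dist = NormalDist(mean, sd)
    pe = PE_FACTOR * sd
    return dist.cdf(mean + pe) - dist.cdf(mean - pe)


chart_mean = 55.6
chart_pe = (63.1 - 48.2) / 2        # half the range mean-P.E. to mean+P.E.
implied_sd = chart_pe / PE_FACTOR   # inferred, not stated in the article
covered = coverage_within_probable_error(chart_mean, implied_sd)
```

The coverage comes out at one-half, so for normal data the range 48.2 to 63.1 should agree with the range between the first and third quartiles, which is the comparison the chart invites.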
[Chart: AMERICAN BOARD OF PEDIATRICS, EXAMINATION OF JANUARY 1950, 353 CANDIDATES. Histogram of examination grades; horizontal axis, EXAMINATION GRADES, marked from 20 to 85 in steps of 5. Column figures refer to the central point of each grouping of grades.]
Statistics shown on the chart:
Mean - 1 P.E.  48.2    First quartile  48.2
Mean           55.6    Median          54.9
Mean + 1 P.E.  63.1    Third quartile  63.3
There are of course more elaborate and more exact methods of testing for normality
of distribution. In the present case the good to be gained does not justify the labor.
So much for the manner in which these 353 candidates are distributed with respect to
grades. A lowest mark of 17.5 and a highest mark of 85 provide ample range for
differential selection of candidates. There remains the important question of how accurately
the examination defines the position of the individual candidate against the background
re-
analysis of variance. The method is one in which total variability (total variance) is
expressed as the sum of the squares of the differences between each grade and the mean for all of the grades. The total variance can then be divided into the three components contributing to it. These components are: variability of ability among the 353 candidates; an actual difference in difficulty between the odd-numbered and even-numbered statements; a residual variability which measures the degree of success of the examination
in fixing the grades of the individual candidate. We are interested here in this residual
variability from which an error of assessment can be computed. The error calculated
from the analysis of variance is that pertaining to an examination based on 100
statements. The error appropriate to the entire examination based on 200 statements is easily
derived from it by dividing by the square root of two. Calculated in this way the standard
error of assessment (usually called the standard error of measurement or S.E.M.) turns
out to be 5.028. Similarly, the probable error of measurement or P.E.M. is 3.391.
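The division of the total variability into three components can be sketched in Python as a two-way analysis without replication (candidates by halves). This is only an illustration under stated assumptions: the function name and the sample scores are hypothetical, each candidate is assumed to have one score on the odd-numbered statements and one on the even-numbered statements, and the scale on which the Board's half-scores were expressed is not given in the text.

```python
import math


def split_half_anova(pairs):
    """Partition the sum of squares for (odd_score, even_score) pairs."""
    n = len(pairs)
    values = [x for pair in pairs for x in pair]
    grand = sum(values) / (2 * n)

    ss_total = sum((x - grand) ** 2 for x in values)
    # Between candidates: each candidate mean rests on 2 scores.
    ss_candidates = 2 * sum(((a + b) / 2 - grand) ** 2 for a, b in pairs)
    # Between halves: each half mean rests on n scores.
    half_means = [sum(p[j] for p in pairs) / n for j in (0, 1)]
    ss_halves = n * sum((m - grand) ** 2 for m in half_means)
    ss_residual = ss_total - ss_candidates - ss_halves

    # Residual degrees of freedom: (n - 1) * (2 - 1).
    ms_residual = max(ss_residual, 0.0) / (n - 1)
    sem_half = math.sqrt(ms_residual)
    return {
        "ss_total": ss_total,
        "ss_candidates": ss_candidates,
        "ss_halves": ss_halves,
        "ss_residual": ss_residual,
        "sem_half": sem_half,
        # Step stated in the text: divide by the square root of two.
        "sem_full": sem_half / math.sqrt(2),
    }


result = split_half_anova([(10, 12), (14, 16), (18, 20)])
```

In this small example the even half is uniformly 2 points easier and the candidates differ in ability, but nothing is left over, so the residual sum of squares and the error of assessment are zero. Disagreement between a candidate's two halves beyond the general difference in difficulty is what enlarges the residual.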
The practical meaning of these statistics can be clarified in the following way: If it
were possible for a single candidate to take an unlimited number of similar examinations,
each based on 200 statements of identical difficulty, the average value of all of the grades
would be a constant. This constant is conveniently termed the true grade. For a single examination the odds are even, i.e., 50-50, that the assigned grade will differ from the
true grade by not more than 3.39, the probable error of measurement. Derived from these statistics are the facts that one
time in 20 the assigned grade will differ from the true grade by more than 9.85 and
that once in a hundred times the difference will be greater than 12.95.
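These limits follow from the S.E.M. of 5.028 if the errors of assessment are assumed to be normally distributed. A minimal Python sketch, with a hypothetical function name:

```python
from statistics import NormalDist

SEM = 5.028  # standard error of measurement quoted in the text


def limit_exceeded_with_probability(p, sem=SEM):
    """Difference from the true grade exceeded with probability p.

    Assumes normally distributed errors and counts departures in
    either direction (two-sided).
    """
    z = NormalDist().inv_cdf(1 - p / 2)
    return z * sem


even_odds = limit_exceeded_with_probability(0.5)       # about 3.39
one_in_twenty = limit_exceeded_with_probability(0.05)  # about 9.85
one_in_hundred = limit_exceeded_with_probability(0.01) # about 12.95
```

The even-odds limit is the probable error of measurement (3.391), and the other two figures reproduce the 9.85 and 12.95 given above.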
An additional and possibly useful item of information can be gleaned from these
statistics. If the relative abilities of two candidates are to be compared on the basis of their written examination grades, the odds are only slightly better than even of a real difference of abilities if the grades differ by only 5. We can be fairly sure of a real difference (odds, 20 to 1) if the grades differ by 14; we can be highly sure (odds,
100 to 1) of a real difference if the grades differ by 18. It is hardly necessary to add that
the differences refer only to relative abilities in scoring on this type of false and true
examination. The extent to which the differences measure relative abilities in pediatrics
remains as the $64 question.
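The figures for comparing two candidates are consistent with the following sketch, which assumes that the two candidates' errors of assessment are independent and normally distributed, so that the standard error of a difference is the S.E.M. multiplied by the square root of two. The article does not state this step, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

SEM = 5.028  # standard error of measurement quoted in the text


def difference_needed(p, sem=SEM):
    """Grade difference exceeded by chance alone with probability p.

    Assumes independent, normally distributed errors for the two
    candidates, giving a standard error of sem * sqrt(2) for the
    difference between their grades.
    """
    se_difference = sem * math.sqrt(2)
    return NormalDist().inv_cdf(1 - p / 2) * se_difference


even_odds = difference_needed(0.5)       # about 4.8, i.e. roughly 5
one_in_twenty = difference_needed(0.05)  # about 13.9, i.e. roughly 14
one_in_hundred = difference_needed(0.01) # about 18.3, i.e. roughly 18
```

Rounded to whole grades these are the differences of 5, 14 and 18 quoted above.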