The Effect of Open- vs. Closed-Book Testing on Performance on a Multiple-Choice Examination in Pediatrics
Charles F. Schumacher, Ph.D., Diane W. Butzin, Laurence Finberg, M.D., and Fredric D. Burg, M.D.
From the Departments of Psychometrics and Graduate and Continuing Medical Evaluation, National Board of Medical Examiners, Philadelphia; the American Board of Pediatrics, Philadelphia; and the Department of Pediatrics, Montefiore Hospital and Medical Center, Albert Einstein College of Medicine of Yeshiva University, New York
ABSTRACT. A study was undertaken to test the effect of open- vs. closed-book testing conditions on performance on a graduate-level, multiple-choice examination in pediatrics. A group of practicing pediatricians and a group of medical students took the examination. For the practice group, no significant difference between mean scores was observed, and the correlation between scores under the two testing conditions was high. In the student group, however, the mean score was significantly higher under open-book conditions and the correlation between scores under the two testing conditions was positive but low. The mean score obtained by practitioners was significantly higher than the mean score obtained by students under both testing conditions. The effects of time limit and level of motivation were not explored in the present study. Pediatrics 61:256-261, 1978, recertification examination, open-book testing.
Most American specialty boards have now made the decision to evaluate their current diplomates periodically for purposes of recertification. A number of alternative methods for accomplishing this evaluation are being considered, including the administration of a written examination. If examination is adopted as one means of evaluation for recertification purposes, a number of important questions arise regarding both the type of examination to be developed and the conditions under which this examination should be administered.
The present study was undertaken at the request of the American Board of Pediatrics (ABP) and the American Academy of Pediatrics (AAP) to investigate one of the latter questions which has been raised on several occasions by members of the ABP. Specifically, it has been
suggested that if the ABP were to administer an examination as one method for reevaluating current diplomates, such an examination should not measure simply the examinee’s ability to recall information that is available in pediatric textbooks. Following from this premise, it has further been suggested that examinees be given an opportunity to refer to textbooks to find whatever factual information they might need in order to answer questions or solve problems posed
by the examination. In brief, some members of the ABP and the AAP feel that a recertification examination should be administered under “open-book” conditions rather than under the traditional “closed-book” conditions that normally apply for certification examinations.
One general question that has arisen from the discussion of this suggestion is, does it matter whether an examinee takes the test under open-book conditions or closed-book conditions? More specifically, do examinees taking an examination under open-book conditions obtain higher scores, on the average, than these same examinees taking a similar examination under closed-book conditions? Does open- vs. closed-book testing produce the same effect for practicing physicians as it does for medical students? Regardless of the effect on overall level of performance, do examinees rank-order themselves differently when tested under open-book conditions vs. closed-book conditions?
Received October 28; revision accepted for publication November 15, 1977.
ADDRESS FOR REPRINTS: (D.W.B.) The American Board of Pediatrics, Inc., Children’s Hospital of Philadelphia, 34th Street and Civic Center Boulevard, Philadelphia, PA
These questions led to the generation of the following hypotheses:
1. A group of practicing pediatricians given the opportunity to refer to textbooks in general pediatrics while taking a multiple-choice examination in pediatrics (open-book testing) will obtain a higher mean score on this test than they will on a similar examination taken without benefit of any reference materials (closed-book testing).
2. A group of third- and fourth-year medical students, tested under the same conditions described above for practicing pediatricians, will also obtain a higher mean score under open-book testing than under closed-book testing.
3. A group of practicing pediatricians will obtain higher mean scores on a multiple-choice examination in pediatrics than a group of third- and fourth-year medical students when both groups are tested under the same conditions, either open-book or closed-book.
4. For both practicing pediatricians and medical students there will be a high positive correlation between scores obtained on a multiple-choice examination taken under open-book conditions and scores obtained on a similar examination taken under closed-book conditions.
The present study was undertaken to test these hypotheses and to gather information about the opinions of examinees taking the test under open- vs. closed-book conditions.
Review of the Literature
Although occasional editorials in favor of open-book exams have appeared,1,2 the literature contains little experimental evidence to support or refute open-book examinations.
Stalnaker and Stalnaker3 reported that scores on open-book college exams were the same as scores achieved the previous year in a closed-book format. Studies by Kalish,4 Marco,5 and Jehu et al.6 indicate that allowing books or notes in college exams does not improve performance.
These three studies were conducted in systems
where exams were generally in a closed-book format. In a study of students in a medical school operating under an open-book system, Krarup et
al.7 found that performance on a physiology exam
was slightly better on recall items but no better
on the total exam using an open-book format.
Michaels and Kieren8 found that secondary school students in mathematics achieved high scores on open-book exams for knowledge and comprehension items, but not for application items.
Feldhusen9 and Marco5 reported that students believed open-book exams to be less anxiety-provoking. Michaels and Kieren8 found that secondary school students were significantly less anxious in the closed-book setting. The results of Jehu et al.6 indicate that access to notes is accompanied by anxiety reduction during, but not before, the exam. Krarup et al.,7 on the other hand, observed that students in an open-book system felt that the open-book exam was more difficult and more stressful.
A study by Mankin et al.10 was designed to study the effect of allowing orthopaedic residents to take an examination in an unproctored, permissive environment. While the study did not test directly the effect of using texts in an examination, the results do provide some answers to this question, since 56% of the examinees in the unproctored examination did use reference texts or journals while none of the examinees in the proctored setting did so. Performances of the unproctored group were significantly better for recall questions, and for interpretation questions, but not significantly different for problem-solving questions. Interestingly, although the majority of the residents indicated they enjoyed the unproctored setting, a significant number indicated they preferred the proctored exam and were in many cases confused by different opinions in their reference sources.
None of these studies involved examinations in
pediatrics and none involved examinees in a continuing education mode. These limitations plus the lack of conclusive evidence on the value of open-book exams suggested a need for the study described here.
METHOD
Overview
The hypotheses to be tested in this study required the development of two examinations that were essentially equivalent with respect to content, the identification of groups of practitioners and medical students willing to participate in the study, the administration of the tests to examinees under both open- and closed-book conditions, and the analysis of the resulting data.
Instruments
TABLE I
PERFORMANCE OF PRACTICE GROUP ON FIRST ABP WRITTEN EXAMINATION

                         No. of Participants
Quartile
  1st                    28
  2nd                    28
  3rd                    31
  4th                     4
Year of first exam
  1941-1950               4
  1951-1960              29
  1961-1970              55
  1971-1972               3
To obtain a rough index of the extent to which the two resulting forms were equivalent statistically, a mean difficulty index and a mean discrimination index (biserial r) were calculated for each form, using item statistics that had been generated from the 1975 ABP candidate group (mean P: .7517 vs. .7566; mean biserial r: .2886 vs. .2937).
In addition, the items in each form were classified by a physician-member of the study team with respect to the cognitive process that was apparently being measured by the item (recall of information or problem solving: .84 to .16 vs. .82 to .18).
Both of these analyses suggested that the two forms were similar enough to each other to be used for purposes of this study. Exact equivalence was not required since the assignment of test forms to testing conditions was counterbalanced for each examinee group. However, if the two forms were grossly different from each other, an additional source of unwanted variability would have been introduced into the analyses, complicating both the analyses and the interpretation of the results.
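The two item statistics named above can be illustrated with a short Python sketch. It computes a difficulty index (the proportion of examinees answering an item correctly) and a biserial correlation between a right/wrong item and total score. The function names and the eight-examinee data set are hypothetical, and the biserial formula is the standard textbook form, which may differ in detail from the computation used by the study team.

```python
from statistics import NormalDist, mean, pstdev

def difficulty_index(item_scores):
    """Proportion of examinees who answered the item correctly (P)."""
    return mean(item_scores)

def biserial_r(item_scores, total_scores):
    """Biserial correlation between a 0/1 item and total test score.

    r_bis = (M1 - M0) / s * (p * q / y), where y is the standard normal
    density at the z value that divides the distribution into p and q.
    """
    p = mean(item_scores)
    q = 1 - p
    if p in (0, 1):
        raise ValueError("item needs both correct and incorrect responses")
    m1 = mean(t for i, t in zip(item_scores, total_scores) if i == 1)
    m0 = mean(t for i, t in zip(item_scores, total_scores) if i == 0)
    s = pstdev(total_scores)
    std_normal = NormalDist()
    y = std_normal.pdf(std_normal.inv_cdf(p))
    return (m1 - m0) / s * (p * q / y)

# Hypothetical responses of eight examinees to one item, with total scores.
item = [1, 1, 1, 0, 1, 0, 1, 1]
totals = [72, 65, 80, 50, 68, 55, 77, 61]
print(round(difficulty_index(item), 4))
print(round(biserial_r(item, totals), 4))
```

Averaging these two values over all items in a form would give form-level summaries of the kind compared in the text.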
In addition to the testing instrument described above, a brief questionnaire was also developed to obtain the examinee’s opinion about his/her own performance under the open- vs. closed-book testing conditions, to determine which reference texts were used, to determine the extent to which the examinee was satisfied with these references, and to ascertain the manner in which the references were used (whether it was to find answers or to confirm answers).
Participants
The practice group for this study was identified and recruited by members of the ABP and ABP official examiners in accordance with a set of general guidelines that were provided by the study team. In September 1975, each Board member and examiner was contacted by mail, given an outline of the protocol for the study, and asked to recruit at least five practicing pediatricians in his/her community who had been certified by the ABP not later than 1972. In addition, each recruiter was asked to select potential participants in such a way as to provide a range of competence in this group, from individuals having average and borderline competence to those identified as outstanding.
A total of 123 practitioners agreed to participate in the study. This group was later reduced to 96 individuals for reasons described below. To obtain a rough indication of the extent to which the participant group included a range of competence, the records on file with the ABP for these 96 practitioners were analyzed to determine the distribution of scores they obtained on their first attempt on the ABP written examination. Records were available for 91 of these participants and the results of this analysis are shown in Table I. These data suggest that the practice group was indeed heterogeneous with respect to competence as measured by the ABP certifying examination, and also heterogeneous with respect to number of years since the first certifying exam was taken.
The third- and fourth-year student groups were
recruited from classes at eight medical schools in
the eastern and midwestern regions of the country. Most of these students were volunteers who were paid a nominal stipend to participate in the study. A total of 223 students were tested, of whom 100 were included in the analyses. This
study group was selected as described below.
Test Administration
All examinations were administered under controlled conditions either by an ABP official examiner (practice group) or a medical school faculty member (student group) at a single testing
session in which the two forms of the test were
given in consecutive two-hour periods. In a few instances practitioners were permitted to take the examinations on two consecutive evenings. To control for possible undetected differences in the two test forms and to control for the sequence in which the tests were taken, the administration of forms according to testing conditions was counterbalanced for each examinee group. The numbers of examinees tested according to sequence of testing conditions and test forms used are summarized in Table II.
TABLE II
NUMBER OF EXAMINEES IN EACH SUBGROUP

                       Form A Open,            Form B Open,
                       Form B Closed           Form A Closed
Sequence               Practice    Student     Practice    Student
                       Group       Group       Group       Group
Open-book first        24          65          40          44
Closed-book first      27          52          32          62
been tested in each of the four subgroups shown in Table II. The original study design called for a total of 25 students in each subgroup. In addition, it was important that the study group should not come predominantly from a single school. Therefore, it was decided that any one school should provide less than half of the students in the study group. Finally, to prevent any differences in performance that might be related to a specific school from influencing results, it was essential to have the same number of students from each school in all four of the student subgroups shown in Table II.
To meet these conditions, three schools were dropped from the sample and some students were eliminated at random from the remaining five schools. This selection process yielded a study group of 100 students.
In selecting the practice group to be included in the analyses, it was also considered essential that each of the practice subgroups shown in Table II be represented by an equal number of examinees. It was necessary to eliminate 12 practitioners because they did not take the exam under the sequence prescribed. After these had been eliminated, subgroup 1 with 24 examinees was the smallest of these groups. Therefore, examinees were excluded at random from each of the other subgroups until those groups also contained 24 subjects each. Thus, the total number of practitioners included in the analyses was reduced from 123 to 96.
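The random exclusion step described above can be sketched in Python as follows. This is only an illustration: the function name, the examinee IDs, and the subgroup sizes are hypothetical. The text gives the smallest subgroup (24) and the total after removing 12 practitioners (111), but not the size of each remaining subgroup, so the sizes below are placeholders chosen to sum to 111.

```python
import random

def trim_to_smallest(groups, seed=0):
    """Randomly drop members so every group matches the smallest group's size."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    target = min(len(members) for members in groups.values())
    return {name: sorted(rng.sample(members, target))
            for name, members in groups.items()}

# Placeholder subgroup sizes (hypothetical; not reported in the text).
sizes = {
    "Form A open, open-book first": 24,
    "Form A open, closed-book first": 26,
    "Form B open, open-book first": 30,
    "Form B open, closed-book first": 31,
}

# Assign consecutive hypothetical examinee IDs to each subgroup.
groups = {}
next_id = 1
for name, n in sizes.items():
    groups[name] = list(range(next_id, next_id + n))
    next_id += n

balanced = trim_to_smallest(groups)
print({name: len(members) for name, members in balanced.items()})
print(sum(len(members) for members in balanced.values()))  # 4 x 24 = 96
```

Whatever the actual subgroup sizes were, trimming four subgroups to 24 each yields the 96 practitioners reported in the text.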
Statistical Analyses
In order to test the first three hypotheses posed by this study, a two-way repeated measures analysis of variance was performed. The independent variables were practitioners vs. students and open- vs. closed-book conditions. The dependent variable was raw score on the examination.
To test hypothesis 4, a product-moment correlation coefficient was obtained between scores on the test taken under open-book conditions and scores obtained under closed-book conditions for the practice group and for the student group, separately.
TABLE III
RAW SCORE MEANS AND STANDARD DEVIATIONS

Group                      Mean      SD
Students (N = 100)
  Open-book                61.18      8.73
  Closed-book              54.29      8.21
Practitioners (N = 96)
  Open-book                68.32     12.22
  Closed-book              67.27     11.73

TABLE IV
ANALYSIS OF VARIANCE RESULTS

Source                                   df     Mean Square    F
Uncorrelated data
  Groups (students vs. practitioners)      1    9,919.79       52.30*
  Error                                  194      189.67
Correlated data
  Conditions (open vs. closed)             1    1,544.64       63.77*
  Interaction (groups X conditions)        1      834.57       34.45*
  Error                                  194       24.22

*Significant at .01 level.
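As a consistency check on Table IV, each F value should equal the effect mean square divided by the corresponding error mean square. The Python sketch below does that arithmetic with the reported mean squares; the variable names are illustrative, and small discrepancies in the second decimal place are to be expected because the published mean squares are themselves rounded.

```python
# Mean squares as reported in Table IV.
ms_groups = 9919.79         # between-subjects effect (students vs. practitioners)
ms_error_between = 189.67   # between-subjects error term
ms_conditions = 1544.64     # within-subjects effect (open vs. closed)
ms_interaction = 834.57     # groups x conditions interaction
ms_error_within = 24.22     # within-subjects error term

def f_ratio(ms_effect, ms_error):
    """F statistic: effect mean square over its error mean square."""
    return ms_effect / ms_error

f_groups = f_ratio(ms_groups, ms_error_between)
f_conditions = f_ratio(ms_conditions, ms_error_within)
f_interaction = f_ratio(ms_interaction, ms_error_within)

reported = {"groups": 52.30, "conditions": 63.77, "interaction": 34.45}
computed = {"groups": f_groups, "conditions": f_conditions,
            "interaction": f_interaction}
for name in reported:
    print(name, round(computed[name], 2), "reported:", reported[name])
```

The between-subjects effect is tested against the between-subjects error, and both within-subjects effects against the within-subjects error, which is the usual arrangement for a design with one grouping factor and one repeated factor.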
RESULTS
Tables III and IV show the outcome of the repeated-measures analysis of variance to test for differences between examinee groups and examination conditions.
The first noteworthy finding from this analysis was a highly significant interaction effect between groups and conditions. This interaction indicates that the effects of the two testing conditions were not the same for the student group as they were for the practice group. Inspection of the mean scores that were obtained by the two groups shows that the student group achieved a noticeably higher mean score when the test was administered under open-book conditions (61.2) than they did when the test was given under closed-book conditions (54.3). A t test for correlated measures indicated that this difference was indeed significant at the .01 level.
The practice group obtained mean scores of 68.3 and 67.3 under the open- and closed-book conditions, respectively. This difference was not significant at the .05 level, again using a t test for correlated measures.
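The t values for these correlated-measures tests are not reported, but they can be approximated from the published summary statistics. The Python sketch below is a reconstruction under stated assumptions, not the authors' computation: it takes the means and standard deviations from Table III and the open- vs. closed-book correlations reported in the Results (.41 for students, .84 for practitioners), and uses the identity Var(difference) = s1² + s2² − 2·r·s1·s2. The function name is hypothetical.

```python
from math import sqrt

def paired_t_from_summary(mean1, sd1, mean2, sd2, r, n):
    """Approximate correlated-measures t from summary statistics.

    The variance of the paired differences is rebuilt from the two
    standard deviations and the correlation between the two scores.
    """
    var_diff = sd1**2 + sd2**2 - 2 * r * sd1 * sd2
    se_diff = sqrt(var_diff / n)
    return (mean1 - mean2) / se_diff

# Open-book vs. closed-book, using rounded values from Table III.
t_students = paired_t_from_summary(61.18, 8.73, 54.29, 8.21, r=0.41, n=100)
t_practice = paired_t_from_summary(68.32, 12.22, 67.27, 11.73, r=0.84, n=96)
print(round(t_students, 2), round(t_practice, 2))
```

With these inputs the student value comes out near 7.5 and the practitioner value near 1.5. The first is far beyond the two-tailed .01 critical value for 99 degrees of freedom (about 2.6) and the second falls below the .05 critical value for 95 degrees of freedom (about 2.0), which agrees with the significance statements in the text.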
TABLE V
SUMMARY OF QUESTIONNAIRE RESPONSES

                                                      % Responding
Question                                       Practitioners    Students
                                               (N=96)           (N=100)
1. I think I did better
   On the closed-book exam                     31               17
   On the open-book exam                       32               51
   No difference                               31               26
   No response                                  6                6
3. Now that I have seen the questions,
   I am satisfied with the texts I selected    74               86
   I wish I had selected a different text      15                6
   No response                                 11                8
4. When I did use the text,
   I used it primarily to find an answer       51               71
   I used it primarily to confirm an answer    42               23
   Other                                        6                3
   No response                                  1                3
When an examination of the type that was developed for this study is given to medical students under open-book conditions, the average student performs better than when the test is administered under closed-book conditions. However, when the test is administered to practicing pediatricians, average scores are essentially the same under both testing conditions.
A second finding of note was a highly significant difference between the mean scores obtained by students and those obtained by practitioners under both testing conditions. Because of the significant interaction effect noted earlier, the difference between student and practitioner mean scores was also tested for each condition separately, using a t test for independent samples.
The results of these analyses indicated that the mean score obtained by practitioners was significantly higher (.01 level) than the mean score obtained by students for both the open- and closed-book examinations. Thus, the results tend to confirm hypothesis 3, i.e., as might be expected, on an examination that was originally designed for purposes of specialty board certification, the average practicing pediatrician scored significantly higher than the average student.
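The independent-samples comparisons can likewise be approximated from the Table III summary statistics. The Python sketch below assumes the classical pooled-variance form of the t test; the text does not say which form was used, so this is an assumption, and the function name is hypothetical.

```python
from math import sqrt

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Independent-samples t statistic using a pooled variance estimate."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se

# Practitioners (N = 96) vs. students (N = 100), values from Table III.
t_open = pooled_t(68.32, 12.22, 96, 61.18, 8.73, 100)
t_closed = pooled_t(67.27, 11.73, 96, 54.29, 8.21, 100)
print(round(t_open, 2), round(t_closed, 2))
```

Under this assumption the values come out near 4.7 for the open-book condition and near 9.0 for the closed-book condition, both well beyond the two-tailed .01 critical value for 194 degrees of freedom (about 2.6). The larger closed-book value reflects the wider gap between groups when students had no access to references.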
Product-moment correlation coefficients that were obtained for each group between scores obtained under the two testing conditions were
.84 for the practice group and .41 for the student
group. Thus, regarding hypothesis 4, it appears
that the rank-ordering of practitioners was quite similar for the two different testing conditions,
whereas a considerable difference in rank-order
occurred for the student group.
Table V contains a summary of the opinions expressed by the practitioners and students in response to the questionnaire that was completed after the examination had been administered.
The opinions expressed by the two groups in response to the first question (Do you think you did better on the open-book or on the closed-book portion of the test, or was there no difference?) were surprisingly consistent with the actual examination performance of the two groups. The practice group was about equally divided in their response to this question, whereas a majority of the students felt that they had done better on the open-book portion of the test.
A large majority of the participants in each group were satisfied with the textbooks that they had selected. However, the few who wished they had selected different texts were more likely to have been practitioners than students.
Responses to the questions regarding the manner in which textbooks were used again differed rather sharply between the practice group and the student group. A large percentage of students (71%) used textbooks primarily to find answers, whereas the practice group was about equally divided between those who used texts primarily to find answers and those who used texts primarily to confirm answers. This difference may well be due to the fact that the examination was drawn from a test that was designed for individuals who had completed residency training in pediatrics, and, therefore, called for information that was reasonably well-known by most of the practitioners but not yet known by many of the students.
DISCUSSION
In attempting to interpret the results that were obtained in this study, and before any firm conclusions are drawn from these results, certain
uncontrolled factors that might have influenced the performance of the participants should be considered.
Two variables in particular might have produced different results if the test had been given as a “live” recertification exam.
These variables are the amount of time allowed to
complete the examination and the degree of motivation to do well on the test.
It is quite possible that the time limit for the test was too short for the practice group. If this were the case, the practice group may have been unwilling to spend much time attempting to look up answers in their textbooks, thus negating the possible effect that these reference materials may have had. In a live recertification test, one would presumably allow enough time for those who wished to use textbooks to do so without jeopardizing the examinee’s opportunity to complete the test.
A second variable that might produce different results if the examination were live is the level of motivation in the practice group. Although practitioners participated in the study willingly, it is difficult to believe that they were as highly motivated to perform well on the test as they would have been if the examination had been a live recertification exam. Thus, in a live examination, especially if more time were allowed, one might well find that practitioners would use textbooks to a much greater extent than they did in the current study and, thereby, possibly improve their test performance under open-book conditions.
SUMMARY AND RECOMMENDATIONS
This study compared the performance of 100 medical students and 96 practicing pediatricians taking a graduate-level examination in pediatrics under open- vs. closed-book conditions. For each of these groups, the parameters of test performance that were investigated were overall mean score under each testing condition and correlation between performance under the two testing conditions.
In the student group, the mean score obtained under the open-book condition was significantly
higher than the mean score obtained when the test was given under closed-book conditions and
the correlation between scores obtained under
the two testing conditions was positive but low (.41). For the practice group, however, no significant difference between mean scores was
observed and the correlation between scores obtained under the two testing conditions was
substantially higher (.84).
These findings suggest that administering a
recertification test of cognitive knowledge to
practicing pediatricians under open-book conditions might yield essentially the same overall performance as a similar test given under closed-book conditions, and that the rank-ordering of examinees would be similar on both examinations.
Two variables that were not explored in the present study (time limit and level of motivation) might, however, have a substantial impact on the performance of practicing pediatricians taking a test of cognitive knowledge, under open-book conditions, for purposes of recertification. The
anticipated effects of lengthening the time limit
and increasing the level of examinee motivation would be a higher overall level of performance under open-book conditions than under closed-book conditions. There is no way to anticipate the effect of these variables on the rank ordering of examinees under the two testing conditions.
REFERENCES
1. Tussig L: A consideration of the open-book examinations. Educ Psychol Meas 11:597, 1951.
2. Upson RH: Open-book examination. J Engin Educ 43:429, 1953.
3. Stalnaker JM, Stalnaker RC: Open-book examinations: Results. With the Technicians 5:214, 1935.
4. Kalish BA: An experimental evaluation of the open-book examination. J Educ Psychol 49:200, 1958.
5. Marco GL: Psychological and Psychometric Correlates of Open and Closed Book Achievement Test Modes. Doctoral dissertation, University of Illinois, 1966.
6. Jehu D, Picton CJ, Futcher S: The use of notes in examination. Br J Educ Psychol 40:335, 1970.
7. Krarup N, Naeraa N, Olsen C: Open-book tests in a university course. High Educ 3:157, 1974.
8. Michaels SA, Kieren TR: An investigation of open-book and closed-book examinations in mathematics. Alberta J Educ Res 19:202, 1973.
9. Feldhusen J: An evaluation of college students’ reactions to open book examinations. Educ Psychol Meas 21:637, 1961.
10. Mankin HJ, Carter RM, Krawczyk M: The effect of permissive environment on scoring of the orthopaedic in-training examination. J Bone Joint Surg