Supporting Online Material for


www.sciencemag.org/cgi/content/full/319/5862/414/DC1


Application of Bloom’s Taxonomy Debunks the “MCAT Myth”

Alex Y. Zheng, Janessa K. Lawhorn, Thomas Lumley, Scott Freeman*

*To whom correspondence should be addressed. E-mail: [email protected]

Published 25 January 2008, Science 319, 414 (2008) DOI: 10.1126/science.1147852

This PDF file includes:

Materials and Methods
SOM Text
Tables S1 to S4
References


Supporting Online Material

Additional data

To test the hypothesis that essay and short-answer questions test at a higher level than multiple-choice questions, we compared the two categories of questions from exams in our sample that contained both question types: AP Biology and undergraduate exams (Table S1).

Table S1: Differences in multiple-choice vs. written answer questions from the same exams. All measurements reflect mean +/- SE. (In all cases, multiple-choice questions are lower.)

| Exam source | Weighted proportion of higher-order questions | Weighted ratings |
|---|---|---|
| AP Biology | -0.21 +/- 0.15 | -0.95 +/- 0.31 |
| Undergraduate | -0.46 +/- 0.10 | -0.90 +/- 0.20 |

To evaluate how multiple-choice questions compare among the five sources of exams, we compared the weighted proportion of higher-order questions and weighted Bloom’s ratings for multiple-choice questions only. Values from the GRE, MCAT, and Medical School exams are given in Table 1; values from the AP Biology and Undergraduate exams are provided in Table S2.

Table S2: Bloom’s ratings from AP biology and undergraduate exams, multiple choice questions only. All measurements reflect mean +/- SE.

| Exam source | n | Weighted proportion of higher-order questions | Weighted ratings |
|---|---|---|---|
| AP Biology | 146 | 0.28 +/- 0.05 | 1.97 +/- 0.09 |
| Undergraduate | 60 | 0.17 +/- 0.05 | 1.74 +/- 0.11 |


Materials and Methods

The exam questions we analyzed came from five types of sources.

• AP Biology: We obtained permission to use the 1999 and 2002 exams—the most recent exams available from the College Board.

• Introductory biology courses for undergraduate majors: We obtained recent midterm or final exam questions from instructors at Brigham Young University (BYU), Montana State University (MSU), and the University of Washington (UW). The BYU questions were from one of the two courses in their yearlong introductory course; the MSU questions were from both courses in their yearlong introductory course; and the UW questions were from one of the three courses in their yearlong introductory course. We chose these schools because they are among the top ten feeder schools to the University of Washington School of Medicine (UWSOM). Because UWSOM has been rated the top medical school for training primary care physicians in the U.S. for over 10 consecutive years, admission is extremely selective. Thus, it is logical to expect that undergraduate exams from its top feeder schools may be more rigorous than average undergraduate exams.

• MCAT (Biology Portion): We obtained permission from the Association of American Medical Colleges to use the biology portions of the 2005 and 2006 practice exams. The Association does not make its actual exams available, but maintains that its practice exams are accurate representations of actual exam questions.

• Biology GRE: We were granted permission to use the 2002 Biology GRE, which was the most recent exam available from the Educational Testing Service.

• First-year Medical School Courses: We obtained recent midterm or final exam questions from five instructors in five first-year courses at the UWSOM.

We randomly chose approximately equal numbers of test questions from each of the five categories for analysis. There was one exception to the random sampling: we sampled the essay questions from each of the AP exams with certainty. This was necessary because the essay questions represented just 4 of the 124 total questions on each exam but 40% of the point value.

We kept multi-part or sequential questions together during the sampling and rating process. For example, the MCAT and GRE contain numerous question series clustered around the same case history or graph. We kept these question series together, in the original sequence, for the experts to consider. Multi-part essay questions were treated in the same way. Each such series represented one draw during the random question-selection process, but was rated as multiple questions. It was not possible to analyze exactly equal numbers from each source because some questions were clustered, and because the raters were not able to complete all of the questions we originally sampled.


We re-formatted all questions to a common font and style, randomized their order in a hard copy document, and presented them to a panel of three educational experts who categorized the questions based on Bloom’s Taxonomy of learning. The experts did not know the nature of the sources. They were also unaware of the nature of the study.

Applying Bloom’s Taxonomy for Biology-related Questions

Each level of assessment from Bloom’s Taxonomy was assigned a numerical value between 1 and 6 (1 = knowledge of terms and recall of information; 2 = comprehension, or conceptual understanding; 3 = application of information and concepts to new situations; 4 = analysis, including the ability to identify patterns and the relationships among underlying components; 5 = synthesis, or the ability to connect disparate and/or new sources of information; and 6 = evaluation of ideas, evidence, and logic).

The rating process began with three sessions in which the experts scored 25-35 questions independently, then discussed each question until they were able to come to a consensus rating. These sessions were designed to clarify criteria for assigning questions to a particular level on Bloom’s Taxonomy and to foster agreement among raters. In the course of these initial discussions, the experts devised an extensive chart for evaluating biology exam questions at each level of Bloom’s Taxonomy, similar to a previously published chart and descriptions for biology-related questions (S1).

The raters are experienced teachers. When rating questions, they assumed that questions were being answered by students with a suitable level of knowledge at the introductory biology level. Although this assumption is appropriate for standardized exams, it may have created a bias towards rating the questions from the courses in our sample at too high a level. This bias is unavoidable, because it is not possible for raters to know which pieces of information were explicitly stated during class. For example, if a multiple-choice question asked the student to choose an appropriate control for a given experimental design, the raters assumed the student had been taught the requirements for appropriate controls, but had not been specifically told which design would be correct for the situation presented on the exam. Thus, a question of this type would be rated at the Application level, not the Knowledge (recall) level.

Evaluating Inter-rater Reliability

To evaluate the agreement between raters, we calculated a weighted Kappa and intra-class correlation coefficient for all questions. The Kappa statistic is used to evaluate the level of agreement between raters for ratings that fit into discrete categories (S2); the weighted Kappa statistic is appropriate for quantifying between-rater agreement regarding discrete, ordinal categories, such as the levels in Bloom’s Taxonomy (S3). The intra-class correlation coefficient is asymptotically equivalent to weighted Kappa if disagreements between categories are assumed to be proportional to the square of the distance between the categories. Values were calculated with the R statistical computation environment (S4).

The average of three pair-wise linear weighted Kappa calculations yielded a result of 0.53, while the intra-class correlation coefficient calculation yielded a result of 0.68 with a 95% confidence interval of 0.64 < ICC < 0.71. These values indicate a moderately high level of agreement between judges.
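As an illustration only, the following Python sketch shows one standard way to compute a weighted Kappa for a pair of raters and to average it over the three rater pairs. It is not the authors' code (the study used R), and the function names, the `power` parameter, and the example ratings are hypothetical. Setting `power=1` gives linear weights, as in the pair-wise calculation described above; `power=2` gives the squared-distance weights under which weighted Kappa approaches the intra-class correlation coefficient.

```python
from itertools import combinations


def weighted_kappa(ratings_a, ratings_b, categories, power=1):
    """Cohen's weighted kappa for two raters on ordinal categories.

    power=1 uses linear disagreement weights; power=2 uses quadratic weights.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same non-empty set of questions")
    k = len(categories)
    if k < 2:
        raise ValueError("at least two categories are required")
    index = {category: i for i, category in enumerate(categories)}
    n = len(ratings_a)

    # Observed joint proportions of (rater A level, rater B level).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[index[a]][index[b]] += 1.0 / n

    # Marginal proportions for each rater.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    # Weighted disagreement actually observed vs. expected by chance.
    disagreement_observed = 0.0
    disagreement_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (abs(i - j) / (k - 1)) ** power
            disagreement_observed += weight * observed[i][j]
            disagreement_expected += weight * row[i] * col[j]
    if disagreement_expected == 0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return 1.0 - disagreement_observed / disagreement_expected


def mean_pairwise_kappa(ratings_by_rater, categories, power=1):
    """Average the weighted kappa over every pair of raters."""
    pairs = list(combinations(ratings_by_rater, 2))
    return sum(weighted_kappa(a, b, categories, power) for a, b in pairs) / len(pairs)


# Hypothetical ratings on the 1-6 Bloom scale (not data from the study).
BLOOM_LEVELS = [1, 2, 3, 4, 5, 6]
judge_1 = [1, 2, 2, 3, 1, 4, 2, 1]
judge_2 = [1, 2, 3, 3, 1, 3, 2, 2]
judge_3 = [1, 1, 2, 3, 2, 4, 2, 1]
print(mean_pairwise_kappa([judge_1, judge_2, judge_3], BLOOM_LEVELS))
```

The weights are expressed as disagreement penalties, so identical ratings contribute nothing and ratings several levels apart count more heavily than adjacent ones.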

Although Kappa values as high as we initially observed may be considered justification for only collecting data from a single rater, we took a conservative approach and had all three experts rate the other 477 questions in the study, for a total of 593 questions with ratings.

To evaluate whether any of the three experts consistently rated questions differently from the other two, we calculated a disagreement or deviation value for each of the raters. The equation we used was:

$$\text{Average deviation} = \frac{1}{n} \sum_{i=1}^{n} \left| (\text{average rating of 3 raters})_i - (\text{individual rater's rating})_i \right|$$

This calculation allowed us to evaluate how much each rater deviated from the average rating of the three raters, over a sample of questions. The deviation values for each judge indicate that none of the judges consistently deviated strongly from the others in terms of ratings (Table S3).

Table S3: Average Deviation in Bloom’s Taxonomy Scale from Average Rating

| | Judge 1 | Judge 2 | Judge 3 | Average |
|---|---|---|---|---|
| Average deviation | 0.26 | 0.31 | 0.30 | 0.29 |
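The deviation calculation can be sketched in a few lines of Python. This is a minimal illustration with hypothetical ratings, not the authors' R code; the function name `average_deviation` is an assumption introduced here.

```python
def average_deviation(ratings_by_rater):
    """Mean absolute deviation of each rater from the per-question mean rating.

    ratings_by_rater: one equal-length list of ratings per rater.
    Returns one average deviation per rater, in the same order.
    """
    n_raters = len(ratings_by_rater)
    n_questions = len(ratings_by_rater[0])
    if any(len(r) != n_questions for r in ratings_by_rater):
        raise ValueError("every rater must rate the same questions")

    # Average rating of all raters for each question.
    question_means = [
        sum(r[i] for r in ratings_by_rater) / n_raters for i in range(n_questions)
    ]
    # For each rater, average |mean rating - own rating| over the questions.
    return [
        sum(abs(question_means[i] - r[i]) for i in range(n_questions)) / n_questions
        for r in ratings_by_rater
    ]


# Hypothetical ratings for three judges on three questions.
print(average_deviation([[1, 2, 3], [1, 2, 3], [1, 2, 6]]))
```

A rater who is consistently far from the group would show a much larger value than the other two, which is the pattern the comparison is meant to detect.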


Assigning Final Ratings

For the 116 questions that the three raters discussed, the final ratings used in data analysis consisted of the consensus value that emerged. For the remaining 477 questions in the study, the final rating had to be assigned without discussion among the experts. We were able to make these final assignments based on patterns in how the experts resolved conflicts and came to a consensus on the 116 “discussed questions.” These patterns allowed us to construct conflict resolution rules and assign final ratings to the 477 “non-discussed questions,” as follows:

• When all raters agreed, there was no conflict. This situation occurred in 41 of the 116 discussed questions and 170 of the 477 “non-discussed questions.” For these questions, which represent 35.6% of the total, the common value was considered the final rating.

• When two raters agreed and one disagreed, the discussion resulted in the third rater agreeing with the other two raters (n = 60). A total of 269 of the 477 (56.4%) “non-discussed questions” were in this category. For these types of non-discussed questions, the rating common to two experts was taken as the final rating. Note that in 92% of cases, at least 2 of the 3 raters agreed on the rating of a non-discussed question.

• When all three raters disagreed and ratings formed a sequential run of ratings—for example 1, 2, and 3—the consensus rating from discussion ended up being the intermediate value (n = 14). Just 33 of the 477 “non-discussed questions” (6.9%) were in this category. For this category of non-discussed question, we assigned the middle rating of the three as the final rating.

• When all three raters disagreed and ratings differed by more than one level on Bloom’s taxonomy, discussion resulted in a consensus for the middle value (n = 1). Only five of the 477 “non-discussed questions” (1%) were in this category. For these five questions, an average of the three values was taken as the final rating.

• Questions that did not fall into one of these categories were dealt with by averaging or were thrown out. Only 5 questions of the 477 were in this category.
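The rules listed above can be expressed as a small decision function. The following Python sketch is an illustration under stated assumptions, not the authors' procedure: the name `resolve_rating` is hypothetical, ratings are assumed to be integers from 1 to 6, and the final catch-all rule is not modelled because the text does not specify when a question was averaged rather than discarded.

```python
def resolve_rating(ratings):
    """Assign a final rating to a non-discussed question from three ratings."""
    if len(ratings) != 3:
        raise ValueError("exactly three ratings are expected")
    low, middle, high = sorted(ratings)

    # Rule 1: all three raters agree, so the common value is final.
    if low == high:
        return low
    # Rule 2: two raters agree; after sorting, the majority value
    # is always the middle element.
    if low == middle or middle == high:
        return middle
    # Rule 3: three different ratings forming a sequential run
    # (for example 1, 2, 3) take the intermediate value.
    if high - low == 2:
        return middle
    # Rule 4: three different ratings that are not a sequential run
    # take the average of the three values.
    return (low + middle + high) / 3


print(resolve_rating((2, 2, 2)))  # all agree
print(resolve_rating((1, 3, 3)))  # two agree
print(resolve_rating((3, 1, 2)))  # sequential run
print(resolve_rating((1, 2, 4)))  # spread of more than one level
```

Sorting first keeps the majority and middle-value rules to a single comparison each, at the cost of hiding which judge gave which rating.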

To assess the validity of these conflict resolution rules, we compared the actual agreements made by judges during their discussions to ratings that were 1) predicted by the rules, 2) calculated by taking straight averages, and 3) determined by using a single judge as a sole decision-maker. Table S4 shows the average error of each of the methods, and indicates that using the decision rules allowed us to assign final ratings that were much closer to approximating the judges’ decision-making during discussions than straight averages or using any of the judges as a sole decision-maker.


Table S4: Average Errors for Various Methods of Approximating Judge Decision-Making

| Method used | Average error |
|---|---|
| Direct Average | 0.27 |
| Conflict Resolution | 0.17 |
| Judge 1 decision-maker | 0.35 |
| Judge 2 decision-maker | 0.27 |
| Judge 3 decision-maker | 0.37 |

Weighting Test Questions

In cases where questions on the same exam were awarded different point values, we weighted each question to reflect its importance on the exam in question, relative to other questions from the same exam, and the probability that it was sampled. For example, we sampled 88 questions from the 1999 AP exam—79 multiple-choice questions and four essay questions, each of which had one to three parts. Each multiple-choice question was worth 0.5% of the total exam while each of the four essay questions was worth 10% of the total exam. (For essay questions with multiple parts, we assumed that the 10% total was divided equally among parts.)

To create weights for each question in our sample, we multiplied its point value as a fraction of the total possible on the exam by the reciprocal of the probability that the question was sampled. For example, each multiple-choice question sampled from the 1999 AP exam received a weight of 0.005 * (120/79) = 0.0076. Because they were sampled with certainty, each essay question on that exam received a weight equal to its relative value on the exam (e.g. each part of the three-part questions received a weight of 0.033, while the one-part question received a weight of 0.10).
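The arithmetic in this example can be checked with a short Python sketch. The function name `question_weight` is an assumption introduced for illustration; the figures (120 multiple-choice questions on the exam, 79 sampled, 0.5% and 10% point values) are the ones given above.

```python
def question_weight(point_fraction, sampling_probability):
    """Weight = share of the exam's points / probability of being sampled."""
    if not 0 < sampling_probability <= 1:
        raise ValueError("sampling probability must be in (0, 1]")
    return point_fraction / sampling_probability


# Multiple-choice question on the 1999 AP exam: worth 0.5% of the exam,
# with 79 of the 120 multiple-choice questions sampled.
multiple_choice = question_weight(0.005, 79 / 120)

# Essay questions were sampled with certainty (probability 1), so each part
# keeps its share of the 10% assigned to its question.
three_part_essay_part = question_weight(0.10 / 3, 1.0)
one_part_essay = question_weight(0.10, 1.0)

print(round(multiple_choice, 4), round(three_part_essay_part, 3), one_part_essay)
```

Dividing by the sampling probability is the same as multiplying by its reciprocal, which is how the calculation is described in the text.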

This weighting scheme allowed us to evaluate questions in the context of each exam—in effect, to analyze what proportion of the total assessment from each exam required students to demonstrate mastery at each level on Bloom’s taxonomy.

To generate the histograms in Figure 1, we summed the weights for questions sampled at each level on Bloom’s taxonomy from each exam source, and converted each sum to a proportion of the total.
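A minimal Python sketch of this summation for a single exam source follows. The function name and the example (level, weight) pairs are hypothetical and are not data from the study.

```python
from collections import defaultdict


def weighted_level_proportions(questions):
    """Sum question weights per Bloom level and convert to proportions.

    questions: iterable of (bloom_level, weight) pairs for one exam source.
    """
    totals = defaultdict(float)
    for level, weight in questions:
        totals[level] += weight
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("weights must sum to a positive value")
    return {level: totals[level] / grand_total for level in sorted(totals)}


# Hypothetical weighted questions: two at level 1 and one at level 2.
print(weighted_level_proportions([(1, 0.5), (2, 0.25), (1, 0.25)]))
```

Each returned value corresponds to the height of one histogram bar for that source.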

Data Analysis

Because some of the exams that we analyzed contained clustered questions, our sampling scheme was analogous to surveys where respondents are clustered by geographic area or other categories that can induce correlations among data points (S5). We thus approached the data as a stratified cluster sample, with sets of dependent questions sampled as clusters. Weights were then post-stratified to give each question a weight proportional to the points available for that question. Because the goal was to generalize to the processes of question selection rather than to the specific finite populations of questions, we did not use a finite-population correction to the variance.

Data were analyzed using linear regression in Survey for R (S4, S6, S7); logistic regressions gave qualitatively identical results. Tests comparing multiple groups were design-based F-tests of the null hypothesis that all groups have the same mean; pairwise tests used Hommel’s correction for multiple comparisons (S8, S9).

Analyzing Exams within Categories

Before comparing ratings between types of exams, we asked whether there were significant differences among exams from each of our five sources. Linear regressions, with weights assigned as described above, indicated that there were no significant differences in weighted proportion of higher-order questions or weighted ratings between the 1999 and 2002 AP biology exams, or between the 2005 and 2006 MCAT biology practice exams. For the final data analysis, then, we combined questions from the two years of each exam type.

Linear regressions with weighted data showed that there were significant differences among the exams from the three undergraduate institutions in weighted proportion of higher-order questions (F = 3.62, df 2, 108, p = 0.03), but not in average ratings (F = 1.70, df 2, 108, p = 0.18). There were also significant differences among the five UWSOM courses (F = 10.03, df 4, 96, p < 0.0001). We combined the data within each of these two sources for the final analyses, however, because they represent a sample of a distinct set of student experiences: specifically, a sampling of exams at the introductory undergraduate level and the introductory level in medical school.

References and notes

S1. D. Allen, K. Tanner, Cell Bio. Ed. 1, 63 (2002).

S2. C. Schuster, Ed. Psych. Meas. 64, 243 (2004).

S3. D. O’Connell, A. Dobson, Biometrics 40, 973 (1984).

S4. R Development Core Team, ISBN 3-900051-07-0, http://www.R-project.org (accessed 1 October 2007).

S5. B.I. Graubard, E.L. Korn, J. Natl. Cancer Inst. 91, 1005 (1999).

S6. T. Lumley, J. Stat. Software 9, 1 (2004).

S7. T. Lumley, R package version 3.6-12.

S8. G. Hommel, Biometrika 75, 383 (1988).

S9. S.P. Wright, Biometrics 48, 1005 (1992).