CGU Theses & Dissertations CGU Student Scholarship
2011
Finding Relationships Between Multiple-Choice
Math Tests and Their Stem-Equivalent
Constructed Responses
Nayla Aad Chaoui
Claremont Graduate University
This Open Access Dissertation is brought to you for free and open access by the CGU Student Scholarship at Scholarship @ Claremont. It has been accepted for inclusion in CGU Theses & Dissertations by an authorized administrator of Scholarship @ Claremont.
Recommended Citation
Chaoui, Nayla Aad, "Finding Relationships Between Multiple-Choice Math Tests and Their Stem-Equivalent Constructed Responses" (2011). CGU Theses & Dissertations. Paper 21.
http://scholarship.claremont.edu/cgu_etd/21
Finding Relationships Between Multiple-Choice Math Tests And Their Stem-Equivalent Constructed Responses
By
Nayla Aad Chaoui
A dissertation submitted to the Faculty of Claremont Graduate University in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate Faculty of Education
Claremont Graduate University 2011
APPROVAL OF THE DISSERTATION COMMITTEE
We, the undersigned, certify that we have read, reviewed, and critiqued the dissertation of Nayla Aad Chaoui and do hereby approve it as adequate in scope and quality for meriting the degree of Doctor of Philosophy.
______________________________________________________________ Dr. Mary Poplin
Chair
School of Educational Studies
______________________________________________________________ Dr. June Hilton
Committee Member
School of Educational Studies
______________________________________________________________ Dr. Phil Dreyer
Committee Member
Abstract of the Dissertation
Finding Relationships Between Multiple-Choice Math Tests And Their Stem-Equivalent Constructed Responses
By
Nayla Aad Chaoui
Claremont Graduate University, 2011
This study takes a close look at relationships between scores on a standardized mathematics test in two different testing formats: Multiple-Choice (MC) and Constructed Response (CR). Many studies have been dedicated to finding correlations between item format characteristics and race and gender. Few studies, however, have attempted to explore differences in the performance of English Learners in a low-performing, predominantly Latino high school. The study also examined relationships between math scores and gender, between math scores and language proficiency, and between CAHSEE and CST scores.
Statistical analyses were performed using correlations, descriptive statistics, and t-tests. Empirical data were also disaggregated and analyzed by gender and language proficiency. Results revealed significant positive correlations between MC and CR formats. T-tests showed statistically significant differences between the means of the formats, with boys and English Only students scoring higher than their counterparts. Frequency tables examining proficiency levels of students by gender and language proficiency revealed differences between MC and CR tests, with boys and English Only students attaining higher levels of proficiency. Significant positive correlations were found between CST scores and multiple-choice items, but none were found between CST scores and constructed response items.
DEDICATION
To my husband Nabil, whose patience and many sacrifices are what allowed me to complete this work uninterrupted.
To my wonderful children, who thrived and finished their schooling while their mother was busy attending classes and focusing on her study.
To my students, who are my biggest fans, and who accompanied me on this journey. May this work inspire you to always believe in yourselves and reach for the stars.
And finally, to my parents, who instilled in me the love of learning, and taught me to appreciate culture and diversity.
ACKNOWLEDGEMENTS
I am indebted to my chairperson, Dr. Mary Poplin, whom I consider not only an invaluable source of information and guidance, but also a very dear friend. I am eternally grateful to my committee members, Dr. June Hilton and Dr. Phil Dreyer, who graciously stepped in when I needed them the most. I would especially like to acknowledge Dr. Hilton for her efforts in teaching me how to use the software to analyze the data.
I want to thank Mr. Stacey Wilkins, my principal, without whom this study would not have been possible; Mrs. Valerie Cordova, our secretary, who patiently accessed scores and provided me with the necessary data; and Mrs. Glenda Vazquez Hermosillo, our assistant principal in charge of testing, who supplied me with vital information as well.
Finally, I would like to acknowledge my friends and colleagues, who were rooting for me and supporting me throughout this journey. Your kindness and consideration did not go unnoticed.
TABLE OF CONTENTS
Dedication
Acknowledgements
Table of Contents
List of Tables
CHAPTER I INTRODUCTION
Overview
Theoretical Framework
Background
Significance of the Topic
Research Questions
Methodology
Summary
CHAPTER II LITERATURE REVIEW
Overview
Constructed Response Tests
Multiple Choice Tests
MC versus CR Items
Gender
Ethnicity and Language
Scoring Rubrics
Validity
Guidelines for Writing MC Questions
Guidelines for Writing CR Questions
Effective Math Instruction
CHAPTER III METHODOLOGY
Overview
Data Set
Key Variables
Instrumentation
The CAHSEE Math Standards
The Math Test
The CMC Scoring Rubric
Procedures
CHAPTER IV RESULTS
Correlations between Percents of Correct Answers
T-tests for Gender
T-tests for Language
Proficiency Levels on MC and CR Tests
Pearson Correlations between MC and CR Scores
T-tests for Language
Pearson Correlations between CST and CAHSEE Scores
Pearson Correlations for Gender and Language
CHAPTER V CONCLUSION
Research Findings
Limitations of the Study
Implications of the Study
Appendix A
LIST OF TABLES
Table 1: Demographics of Students
Table 2: Content Standards of the Mock CAHSEE
Table 3: CMC Scoring Rubric
Table 4: Correlations between Percents of Correct Answers on MC and CR Items
Table 5: T-test Results for the Differences between Percents of Correct Answers (Gender)
Table 6: T-test Results for the Differences between Percents of Correct Answers (Language)
Table 7: Proficiency Levels of 9th Graders on MC Test
Table 8: Proficiency Levels of 9th Graders on CR Test
Table 9: Comparison of Proficiency Levels of 9th Graders on CAHSEE
Table 10: Proficiency Levels of 10th Graders on MC Test
Table 11: Proficiency Levels of 10th Graders on CR Test
Table 12: Comparison of Proficiency Levels of 10th Graders on CAHSEE
Table 13: Pearson Correlation between MC and CR Scores
Table 14: Pearson Correlations between MC and CR Questions by Gender
Table 15: Pearson Correlations between MC and CR Questions by Strand
Table 16: T-test Results for Relationships between MC and CR Scores by Gender
Table 17: T-test Results for Relationships between MC and CR Scores by Language
Table 18: Correlations between CST and CAHSEE
Table 19: Correlations between CST and CAHSEE for Gender and Language
Table 20: Correlations between MC and CR for Number Sense
Table 21: Correlations between MC and CR for Statistics and Probability
Table 22: Correlations between MC and CR for Algebra 1
Table 23: Correlations between MC and CR for Measurement and Geometry
Table 24: Correlations between MC and CR for
CHAPTER I INTRODUCTION
Overview
There are multiple ways to assess student learning in the field of mathematics. Methods range from standardized testing, using multiple-choice and open-ended questions, to oral questioning and teacher-made examinations. This study focuses on the two formats used in state standardized tests: multiple choice (MC) and constructed response (CR). Many questions can be raised about the potential differences between multiple-choice and free-response item formats. Multiple-choice (MC) tests are depicted as assessing simple factual recognition, and free-response or constructed-response (CR) tests are depicted as evaluating higher order thinking. A great deal of research has been devoted to comparing scores from multiple-choice and constructed response tests (Bridgeman, 1992; Frederiksen, 1984; Ackerman & Smith, 1988). Many studies have also been dedicated to finding correlations between item format characteristics and race and gender. Some showed that there was a small advantage for men on multiple-choice items, and a small mean advantage for women on constructed response items (Burton, 1996; Mazzeo, Schmitt, & Bleistein, 1991). Garner and Engelhardt (1999) investigated gender differences in mathematics and found that women showed a statistically significant and consistent advantage over men on multiple-choice items in algebra. However, few studies have shed light on the performance of English Learners on free-response compared to multiple-choice tests. Language ability might have a confounding effect on the scores for open-ended mathematics items, and open-ended items are more likely to be omitted by examinees than multiple-choice items (Martinez, 1991).
The study aimed to find relationships between mathematics scores on two formats of the mock CAHSEE, multiple-choice (MC) and constructed response (CR) items; differences in performance by gender and by language proficiency; and correlations between mock CAHSEE and CST scores. Statistical analyses were performed using correlations, descriptive statistics, and t-tests. Empirical data were also disaggregated and analyzed by gender and language proficiency.
Theoretical Framework
The theoretical framework of the study is based on work by W. James Popham in educational measurement. In Popham’s opinion, today’s educators are increasingly caught up in a measurement-induced maelstrom focused on raising student scores on high-stakes tests. Standards-based standardized tests are in multiple-choice formats, with which teachers are more and more familiar. Due to intense pressure to raise students’ scores, some teachers “design their instruction around actual items taken from a high-stakes test to teach toward clone items – items only slightly different from the test’s actual items” (p. 23). Because students are familiar with test content and format, they are trained to respond to questions by “recognizing” information, and may show mastery because they were strictly and specifically taught the content on the test. The rationale of the study is to investigate the relationships between MC tests and their stem-equivalent constructed responses, allowing us to determine the degree to which student proficiency in one format relates to proficiency in the other.
Background
In the field of educational psychology, much of the literature suggests that item formats should be selected to reflect instructional intent, especially when trying to assess higher-level thinking. For instance, Haladyna (1997) writes that open-ended and performance items are more appropriate than selection items for measuring high-inference mental skills or abilities where we want the student to construct an answer. Rodriguez (2003) suggests that although multiple-choice tests provide greater sampling of the domain in a short time with a high level of reliability, the use of constructed response items allows greater depth of processes. One study found that teachers chose test formats according to the diverse achievement levels of their students (Fleming, Ross, Tollefson & Green, 1998). Those teachers assigned multiple-choice tests to low ability students and constructed response tests to students with higher cognitive abilities.
It is most generally assumed that multiple-choice tests do not adequately measure skills and cognitive abilities, and although they may measure some constructs, they may neglect others (Stenmark, 1989). Each person has an individual profile of characteristics, abilities and challenges that result from learning and development. These are manifested as individual differences in intelligence, creativity, cognitive style, motivation, natures and the capacity to process information, communicate, and relate to others.
Advantages and Disadvantages of MC and CR tests
Both multiple-choice and constructed response items have advantages and disadvantages. Some of the advantages of MC items are that they are machine gradable, which increases scoring accuracy (Holder & Mills, 2001), and they are particularly useful in large-scale evaluation projects. They facilitate timely feedback for test takers in classes (Delgado & Prieto, 2003), and they enable instructors to ask a large number of questions on a wider range of subject materials (Becker & Johnston, 1999), so that a wider variety of abilities can be measured. Other advantages are:
- Student difficulties can be diagnosed by analyzing incorrect responses.
- It is possible to vary the questions’ level of difficulty.
- They are economical.
Some of the disadvantages of MC items are:
- They may not accurately measure student ability, since it may be assumed that students are guessing (Stenmark, 1989).
- Students are not able to synthesize content of any sort (Popham, 2010).
- They are unable to tap higher order thinking skills.
- It takes a lot of time to construct a good MC test.
- The test is not useful in measuring the ability to organize and present ideas (Popham, 2010).
Some of the advantages of constructed response items are that results are reported in words, diagrams, or graphs (Stenmark, 1989), and that they give students an opportunity to show their prowess at carrying out a carefully reasoned analysis of the problem (Popham, 2010). One major advantage is that responses are less affected by guessing, and clues about students’ thought processes can be provided. A few of the disadvantages of CR tests are that they contain relatively few questions, which in some cases prevents adequate sampling of the subject matter (Powell & Gillespie, 1990); they are costly; and there are potential inaccuracies associated with their scoring.
Standardized Tests and Assessment
Standardized tests are designed to assess student understanding of the content. They are formative and summative criterion-referenced tests that measure how well a person has learned a specific body of knowledge and skills.
A variation of criterion-referenced testing is “standards-based assessment”. All states and districts have adopted content standards (or curriculum frameworks), which describe what students should know to reach the basic, proficient, or advanced levels in the subject area.
Testwiseness and guessing
Testwiseness is any skill that allows a student to choose the correct answer on an item without knowing the correct answer. Students who are testwise look for mistakes in test construction, make guesses based on teacher tendencies, and search for any unintentional clues that can be found in a test. This is an issue of validity because the score on a test should be a reflection of the level of the trait that the test is designed to measure (knowledge, skill), not a reflection of a general ability to do well on poorly made tests.
It is important to distinguish between random guessing and an educated guess. Good tests are designed to protect against random guessing. An educated guess is not as harmful to the validity of a test because it indicates that the student has some knowledge of the content and has narrowed down the possibilities to the most reasonable alternative (Cronbach, 1998).
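To make the cost of random guessing concrete, the following minimal sketch computes the score a student would be expected to earn by blind guessing alone. It uses the 35-item length of the mock CAHSEE described in this study; the figure of four options per item and the function name `expected_chance_score` are assumptions made only for this illustration.

```python
from fractions import Fraction


def expected_chance_score(num_items: int, options_per_item: int) -> Fraction:
    """Expected number of correct answers when every item is guessed blindly."""
    return Fraction(num_items, options_per_item)


# 35 items matches the mock CAHSEE in this study.
# Four options per item is an ASSUMPTION for illustration only.
chance = expected_chance_score(35, 4)
print(float(chance))  # 8.75
```

Under these assumptions, a student who knew none of the content would still be expected to answer about a quarter of the MC items correctly, whereas the same strategy would earn essentially nothing on the constructed response version.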
Reliability, Validity and Bias
Test reliability refers to the degree to which a test is consistent and stable in measuring what it is intended to measure. It must be consistent within itself and across time.
Test validity refers to the degree to which the test actually measures what it claims to measure. It is the extent to which inferences, conclusions, and decisions made on the basis of test scores are appropriate and meaningful.
The presence of bias invalidates score inferences about target constructs that affect student performance differently across groups, namely constructs related to gender, race, ethnicity, linguistic background, and low socio-economic status (Lam, 1995). For example, the ability to read and understand written problems is a biasing factor in measuring mathematics skills because it is irrelevant to mathematics skills and it can affect Limited English Proficient students’ performance differently on a math test (Stenmark, 1989).
A good assessment has both validity and reliability. In practice, however, an assessment is rarely valid or reliable. In the field of educational testing, there will often be trade-offs between validity and reliability.
Significance Of The Topic
A review of the California State Department of Education’s report on open-ended questions, A Question of Thinking, shows that most students lack opportunities to express mathematical ideas in writing, with fewer than 25% able to write completely about any of the problems given (Stenmark, 1989). Part of effective instruction is giving students opportunities to explain their thinking in writing, using proofs, multiple steps, organizers and written sentences.
Historically, there was little emphasis on communication in the math classroom, but we now know that in order to learn mathematics, students must learn to communicate mathematically (NCTM, 2000). This means listening, speaking, reading, and interpreting. It means explaining how a problem is solved, and explaining the problem and its solution using a variety of representations: words, symbols, graphs, charts, visuals, models, and manipulatives (Leiva, 1995).
The Principles and Standards of the National Council of Teachers of Mathematics (2000) include a communication standard for school mathematics. Specifically, the standard states that instructional programs from kindergarten through grade 12 should enable students to:
- Organize and consolidate their mathematical thinking through communication.
- Communicate their mathematical thinking coherently and clearly to peers, teachers and others.
- Analyze and evaluate the mathematical thinking and strategies of others.
- Use the language of mathematics to express mathematical ideas precisely (p. 60).
The more lessons focused on teaching conceptual understanding and problem solving, reading comprehension, and writing composition, the more likely the students were to demonstrate proficiency in all these areas (Knapp, Adelma, Marder, McCollum, Needles & Padilla, 1995).
The district where the research is conducted is plagued by dismal math scores on the California Standards Test. In four of the five comprehensive high schools, eighty percent of the students are scoring below and far below basic in mathematics, with under ten percent of students scoring in the advanced categories (California Department of Education, 2009).
Research Questions
This study attempts to find out whether the students, as a group and by subgroups such as gender and English Language Learners, perform similarly on MC math tests and their stem-equivalent constructed response items.
Specifically, in this research, the following questions are being asked:
1) What is the relationship between the percents of students’ correct answers on the multiple-choice format and correct answers on the stem-equivalent constructed responses? What are the differences by gender and language?
2) What is the relationship between students’ math scores on the multiple-choice standardized mock CAHSEE test and their scores on stem-equivalent constructed responses?
3) Are there gender differences between the students’ scores on the mock CAHSEE multiple-choice questions? Are there gender differences between students’ scores on the stem-equivalent constructed responses?
4) Are there differences for English Learners (EL) between their scores on the multiple-choice questions and their stem-equivalent constructed responses? Are there differences for English Only (EO) students between their scores on multiple-choice questions and their stem-equivalent constructed responses?
5) What is the relationship between the students’ mathematics California Standards Test (CST) scores and their scores on the multiple-choice test?
6) What is the relationship between the students’ CST scores and their scores on the constructed response tests on the mock CAHSEE?
Definition Of Terms
Multiple choice or selected response items (MC): Multiple-choice items consist of a stem and a set of options. The stem is the beginning part of the item that presents the item as a problem to be solved, a question asked of the respondent, or an incomplete statement to be completed, as well as any other relevant information. The options are the possible answers that the examinee can choose from, with the correct answer called the key and the incorrect answers called distractors. Only one answer can be keyed as correct.
Constructed response, or open-ended response or free response (CR): A constructed response is a student response to a specific prompt or question given in the context of a test. It requires students to use creativity, organization skills, and logic to develop an answer. Most commonly, a constructed response takes the form of an essay response or a short-answer response.
Stem-equivalent: Multiple-choice and constructed response questions will have the same stem, which is basically a math question or a problem to be solved. For example, if a student is asked a question about finding the perimeter of a figure, the MC test will provide the answer options, and the CR test will ask the same question and the student will have to show the solving process.
Standardized testing: Tests are called standardized when all students answer the same questions under similar conditions and their responses are scored in the same way. They include norm-referenced tests as well as criterion-referenced or standards-based exams.
The CAHSEE: The California High School Exit Exam (CAHSEE) is a requirement for high school graduation in the state of California, created by the California Department of Education to improve the academic performance of California high school students, and especially of high school graduates, in the areas of reading, writing, and mathematics. Public school students must pass the exam before they can receive a high school diploma, regardless of any other graduation requirements.
Methodology
Research Design
A number of statistical analyses were used. Correlations were run to determine relationships between scores on both testing formats (MC and CR), as well as between these scores and those on the California Standards Test in Mathematics. Frequency tables were run to investigate percentages of students scoring at various levels of proficiency on both formats. T-tests were also performed to compare groups by gender and by language status (English Learners versus English Only).
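The two core computations named above can be sketched in a few lines. This is only an illustration under stated assumptions: the source does not say which software or which t-test variant was used, so the independent-samples (Welch) statistic, the function names `pearson_r` and `welch_t`, and all of the scores below are hypothetical and are not data from the study.

```python
import math
from statistics import mean, variance


def pearson_r(x, y):
    """Pearson product-moment correlation between two paired score lists."""
    mean_x, mean_y = mean(x), mean(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


def welch_t(group_a, group_b):
    """t statistic for two independent groups, not assuming equal variances."""
    se = math.sqrt(variance(group_a) / len(group_a)
                   + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


# HYPOTHETICAL scores out of 35 for six students; not data from the study.
mc_scores = [20, 25, 30, 18, 27, 22]
cr_scores = [12, 18, 24, 10, 20, 15]

r_mc_cr = pearson_r(mc_scores, cr_scores)         # MC versus CR, same students
t_groups = welch_t(mc_scores[:3], mc_scores[3:])  # two arbitrary subgroups
print(f"r = {r_mc_cr:.3f}, t = {t_groups:.3f}")
```

A correlation near +1 would indicate that students who score high in one format also tend to score high in the other. The t statistic compares two group means relative to their variability, and would still need to be referred to a t distribution to obtain a significance level.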
Sampling
The sample consisted of 737 students enrolled as freshmen (n = 394) and sophomores (n = 343) in Algebra 1, Algebra 2, and Geometry at a comprehensive high school in the Pomona Unified School District. The majority of the students were Latinos, but there were also Asian students of different ethnic backgrounds, African American students, and some white students. The ethnicity variable was initially considered, but the comparatively small percentage of non-Latinos (9%) caused it to be discarded.
Instrumentation
The instrument is the Mock CAHSEE in mathematics. It is a test designed by the district to help the students familiarize themselves with the content before taking the actual CAHSEE, and it is aimed at assessing student knowledge in order to plan for intervention and remediation by the time they take the CAHSEE. All 35 questions on the test cover the mathematics standards required to pass the CAHSEE. Eleven questions are related to Number Sense, four are related to Statistics and Probability, four are related to Algebra and Functions, six to Algebra 1, and ten to Measurement and Geometry.
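As a quick check on the strand breakdown just described, the short sketch below tallies the item counts and confirms that they sum to the 35 questions on the test. The counts come from the paragraph above; the variable names and the percentage shares are added here only for illustration.

```python
# Item counts per strand, as stated in the Instrumentation paragraph.
strand_items = {
    "Number Sense": 11,
    "Statistics and Probability": 4,
    "Algebra and Functions": 4,
    "Algebra 1": 6,
    "Measurement and Geometry": 10,
}

total_items = sum(strand_items.values())
assert total_items == 35  # matches the stated length of the test

# Share of the test devoted to each strand, rounded to one decimal place.
strand_share = {name: round(100 * n / total_items, 1)
                for name, n in strand_items.items()}
for name, share in strand_share.items():
    print(f"{name}: {share}%")
```

Number Sense and Measurement and Geometry together account for 21 of the 35 items, so overall scores are weighted most heavily toward those two strands.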
Procedures
Thirty-five questions were selected from the Mock CAHSEE math booklet (2008 edition) in such a manner that they reflected different standards from the strands of Number Sense, Statistics and Probability, Algebra and Functions, Algebra 1, and Measurement and Geometry. It is customary at this particular school to administer the Mock CAHSEE to ninth graders on the day that the tenth graders are taking the actual CAHSEE. The school is on a special schedule because the test is administered all day, from 8 a.m. to 1:30 p.m. Twelve teachers administered the test to 394 freshmen, who were given the test in constructed response format first, then in multiple-choice format later in the day after a thirty-minute lunch break from 10:30 to 11:00 a.m.
The tenth graders were given the test before the ninth graders, in their math classes two weeks before they were to take the CAHSEE. All math teachers agreed to give the multiple-choice format test first on the same day, and waited to give the constructed response test the following week over a period of two days.
Protection of human subjects
All scantrons and constructed response tests had student ID numbers written on them to protect the identity of the students. The students were previously handed a consent form to be signed by their parents, and an assent form to be signed by them agreeing to take the test willingly. They were all aware that it was not just per school policy that the test was given, but that their scores would be evaluated for the purpose of the study. The results of the study will only be released to their teachers or administrator of the school as was previously agreed upon and approved before the launch of the experiment.
Scoring rubric
The California Mathematics Council rubric is called a general, or holistic, rubric and is used on national or state assessments that must take into account a broad range of mathematical tasks and students. It is aimed at assigning an overall score rather than a score for particular processes. This type of rubric is appropriate for assessments that are more summative, such as major tests or examinations (Kulm, 1994). “The descriptions of each score are precise enough so that in a short time, teachers can be trained to use the scoring scale with high levels of agreement and reliability” (p. 88).
Summary
An extensive review of the literature describing the various findings on the different testing formats is presented in Chapter Two. Issues such as the advantages and disadvantages of MC and CR tests, as well as reliability and validity issues in writing those tests, are also included. Chapter Three explains the methodology used in the study, the data set, the procedures and the instrumentation.
Descriptive statistics, correlations and t-tests are presented in Chapter Four. Results from this analysis provide insight into performance on the two formats among different groups of students. The implications of the study findings are discussed in detail in Chapter Five.
CHAPTER II LITERATURE REVIEW
Overview
Testing formats have their advantages and disadvantages. Previous studies have lauded the effectiveness of some formats in assessing student learning, while denigrating other formats for their poor assessment quality. In mathematics, notably, it is most important to discern and evaluate the effectiveness, or lack thereof, of the testing formats in an effort to select the best method of assessing student content knowledge.
Constructed Response Tests
Advantages. The California Mathematics Council (CMC) has been a leader in stressing the use of open-ended questions as a technique of alternative assessment. Open-ended questions provide insights into the misconceptions of students and allow the teacher to evaluate the various techniques they use. They also determine if students can “clarify their own thinking, make generalizations, recognize key points in the problem, and organize and interpret information” (Kulm, 1994, p. 42).
First, constructed response tests reduce measurement error by eliminating random guessing. Second, they eliminate the unintended corrective feedback that is inherent in MC items (Bridgeman, 1992). Bridgeman (1992) found that 81% of the students reported working backwards to solve problems. For example, an algebra problem such as 2(x+4)=38-x becomes a much simpler arithmetic problem if the examinee can just substitute the possible values of x given in the answer choices until the correct value is found.
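The working-backwards strategy can be sketched as a simple loop over the answer choices. The equation is the one given above; the four answer choices and the function name `work_backwards` are hypothetical, since the source does not list the options that accompanied the item.

```python
def work_backwards(options):
    """Return the first option that satisfies 2(x + 4) = 38 - x, if any."""
    for x in options:
        # Pure arithmetic check: no algebraic manipulation is required.
        if 2 * (x + 4) == 38 - x:
            return x
    return None


# HYPOTHETICAL answer choices; the original item's options are not given.
answer_choices = [6, 8, 10, 12]
print(work_backwards(answer_choices))  # 10
```

Solving the equation algebraically gives 3x = 30, so x = 10, but the loop finds the same value using substitution alone. On the stem-equivalent constructed response item no options are available to test, so this shortcut disappears.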
A constructed-response test allows us to watch a student marshal evidence, arrange arguments, and take purposeful action to address the problem (Wiggins, 1989). Rather than rely on right or wrong answers and unfair “distractors”, authentic tests identify strengths, which may even be hidden (Wiggins, 1989). They assess dynamic cognitive processes (Bennett, Ward, Rock, & Lahart, 1990), identify students’ misconceptions in diagnostic testing (Birenbaum & Tatsuoka, 1987), and communicate to teachers and students the importance of practicing these real-world tasks (Sebrechts, Bennett, & Rock, 1991).
Haladyna (1997) writes that open-ended and performance items are more appropriate than selection items for measuring high-inference mental skills or abilities and some physical skills and abilities where you want the student to construct an answer. Marzano, Pickering, and McTighe (1993) argue that, in order to assess higher order thinking, performance assessments are a more appropriate item type than selection items because they require students to construct new knowledge, which is essential to effective learning.
The shift from an emphasis on producing correct answers to the expectation that students think and communicate is a major one for many students and teachers (Kulm, 1994). Even though the answer may not be correct, the reasoning and mathematical processes can earn high marks.
Open-ended problems must be provided to all students, even the most able ones, if we want them to develop solving strategies. The process and strategies themselves must be the objects of assessment and evaluation (Kulm, 1994, p. 26).
Some of the advantages of constructed response items are that results are reported in words, diagrams, or graphs (Stenmark, 1989), and that they give students an opportunity to show their prowess at carrying out a carefully reasoned analysis of the problem (Popham, 2010). One major advantage is that responses are less affected by guessing, and clues about students’ thought processes can be provided.
Open-ended questions send out a message to students about the nature of math (Brahier, 2001). Students “learn” that mathematics transcends “right” and “wrong” answers (p. 22). Marzano et al. (2001) stress that explaining their thinking helps students to enhance their understanding of the experimental inquiry process and their use of the steps involved. Also, the range of cognitions, such as knowledge, procedures, images and skills, that can be elicited by CR items is greater than the range elicited by MC items (Martinez, 1999).
Disadvantages. There are many things to consider when
choosing between constructed-response and selected-response tests. Constructed-response tests are much more difficult to grade, even though they are relatively easy to prepare. A considerable amount of time must be spent in creating clear criteria, such as scoring rubrics, for assessing the answers. One of the most evident disadvantages is the time-consuming nature of scoring those tests. The scoring of constructed-response test items involves at least some subjectivity, even when criteria have been carefully established (Powell & Gillespie, 1990; Brahier, 2001).
Another disadvantage is that these tests contain relatively few questions, which in some cases prevents adequate
Test anxiety may have a debilitating effect on scores. Research by Crocker and Schmitt (1987) found that the
negative effects of test anxiety on scores were moderate on MC questions but severe on the constructed response items. The prospect of having to provide an explanation can induce anxiety to the point that it interferes with cognition, therefore reducing the ability of the test taker to express proficiency (Powers, 1988). Popham (2008) suggests that if there were too few items, odds were greater that the
teacher would “draw an invalid inference from the
performance data, concluding erroneously that students have or have not mastered the building block to an acceptable degree” (p.58).
Open-ended questions may not align with instructional techniques (Brahier, 2001). If students are not often asked these types of questions in the classroom, it may be
unrealistic to expect them to answer open-ended questions on a more formal assessment (p.22). As Kulm (1994) points out, most students have not been required or requested to write or give verbal explanations of problem-solving
processes. “The idea of an assessment or grade based on anything except the correct answer is quite foreign” (p.39).
Multiple-Choice Tests
Advantages. Some of the advantages of MC items are that
they are machine gradable, which increases scoring accuracy (Holder & Mills, 2001), and they are particularly useful in large-scale evaluation projects (Dufresne,
Leonard & Gerace, 2002). They facilitate timely feedback for test takers in class (Delgado & Prieto, 2003); and they enable teachers to ask a large number of questions on a wide range of subjects (Becker & Johnston, 1999), therefore a wider range of abilities can be measured. Student
difficulties can be diagnosed by analyzing incorrect responses, and it is possible to vary the questions’ difficulty level (Simkin & Kuechler, 2005). Roediger and Marsh (2005) postulate that in addition to being easy to score, multiple-choice tests generally improve student performance on later tests, referring to that as the
testing effect. There is a perceived objectivity in the
grading process (Wainer & Thissen, 1993); they help students avoid losing points for poor spelling or poor
writing ability (Zeidner, 1987); students find it easier to prepare for those tests (Scouller, 1998); they reduce
student anxiety (Snow, 1993); teachers may choose to write multiple versions of the same MC test to thwart cheating
(Kreig & Uyar, 2001); students can eliminate unlikely choices and ultimately increase their probability of picking the right answer (Bridgeman, 1992).
Multiple-choice items are amenable to item analysis, which enables the teacher to improve the item by replacing distractors that are not functioning properly. In addition, the distractors chosen by the student may be used to
diagnose misconceptions of the student or weaknesses in the teacher’s instruction (Burton et al., 1991).
Disadvantages. Some of the disadvantages are that MC items may not accurately measure student ability, since students may simply be guessing (Stenmark, 1989); they do not give students the opportunity to synthesize content of any sort (Popham, 2010); and they are limited in their ability to tap higher-order
thinking skills. It takes a lot of time to construct a good MC test, and the test is not useful in measuring the ability to organize and present ideas (Popham, 2010). The format makes it easy for students to guess rather than to think through the problem.
MC items are limited in their ability to tap higher-order
thinking, and they allow for a higher probability of guessing correctly, which lowers the reliability of the test for lower-ability students (Cronbach, 1988). By design, MC items severely constrain the behavior of examinees.
Consequently, some aspects of proficiency that require complex performance are beyond the reach of the MC format (Messick, 1993). If a test consists entirely and
exclusively of MC items, it raises the possibility of construct under-representation and the validity of the
assessment will suffer because the test will fail to assess the cognitive processes that help identify the main
construct (Messick, 1995).
Webb (1997) argues that multiple-choice tests
inherently favor some students over others, so alternative forms of assessment are required to achieve fair measures of student performance. Hambleton & Murphy (1992) concluded that multiple-choice tests foster a one-right-answer
mentality, they narrow the curriculum, they focus on
discrete skills, and they under-represent the performance of lower SES examinees. Martinez (1991) argues that
language ability might have a confounding effect on the scores for open-ended mathematics items and that open-ended items are more likely to be omitted by the examinee than multiple-choice items.
Test takers are exposed to numerous incorrect answers, many of which are constructed so as to appear to be
correct. Roediger (2005) found that students tended to remember these incorrect lures as being correct when asked
about them later, suggesting that students actually learn the wrong things as part of the testing process. A related disadvantage is that students receive corrective feedback whenever their own answer does not appear as one of the available alternatives, a prompt to reconsider the question and correct their mistake that would not be present in an open-ended assessment (Bridgeman, 1992). Some students
react to the availability of the possible answers by working backwards to answer the question, particularly on
quantitative problems. Students expecting a multiple-choice test, relative to an essay test, spend less time studying for the test (Kulhavey, Dyer, & Silver, 1975) and they take notes on different materials than do students expecting an essay exam (Rickards & Friedman, 1978).
According to the NCTM (1991), although the commonly used MC format may yield important data, it can have a
negative impact on how students are taught and evaluated at the school level because: a) Student scores are generated solely on the basis of right and wrong answers with no consideration or credit given to students’ strategies, b) Routine timing measures how quickly students can respond but not necessarily how well they think – some students may be excellent mathematicians but may not be fast (p.22), and
c) Mathematics tools such as calculators and measurement devices are not permitted (p.8).
MC Items versus CR Items
How they differ. Martinez (1999) hypothesized that MC and
CR item formats differ not only in their cognitive demand but also in the range of cognitions they can elicit. And even though the distinction between them is useful, it could be misleading. In his meta-analysis of research on test item formats, Martinez (1999) discusses research pertaining to the complexity of both MC and CR formats. Haladyna (1994) proposed that there was considerable variety within the MC format, partly in how items are structured and in the cognition they evoke. He further asserts that MC items can be written to elicit complex cognitions, such as understanding, prediction, evaluation, and problem solving. In other words, it is possible for the MC items to tap complex performances and for CR items to tap basic processes such as recall. And even when MC items evoke recall, the retrieval of information from long-term memory may require complex search strategies to access memories from various learning episodes (Nuthall & Alton-Lee, 1995). Messick (1995), however, warns that even though MC questions can be designed to elicit complex thought
processes, this does not mean that the full range of complex thought represented in constructed responses can be captured by MC items.
Many studies have found that student scores on open-ended questions were so closely related to their scores on multiple-choice tests as to suggest that both types of
questions were measuring the same things (Bridgeman, 1992; Lukhele, Thissen, & Wainer, 1994; Walstad & Becker, 1994), suggesting that the difficult-to-administer open-ended questions might not be worth the extra effort because
multiple-choice alone could be used to assess the learning. Popham (1978, pp. 44-45) states that for measuring
knowledge of factual information, the selected-response test is more efficient. This type of test is also useful when a high degree of specificity is needed, such as tests designed to see if re-teaching of facts is necessary.
However, for measuring originality, the ability to
synthesize ideas, write effectively, or solve problems, constructed-response tests are obviously better.
In an experiment led by Fleming (1998), it was found that teachers assigned tests of different formats based on students’ cognitive abilities. Low ability students were given MC tests and high ability students were given essay type or constructed-response test items. They concluded
that teachers judged essay questions to be more difficult than multiple-choice items, and they evaluated items that measured higher order thinking skills to be more difficult than items assessing application or memory skills.
Format preference. In a study by Hamilton (1994) high
school students enrolled in geometry, algebra 2 and algebra 1 were given a math test with multiple-choice and
constructed-response formats in counterbalanced order. After taking the tests, students were interviewed to determine which format was preferred and why. Eighty
percent of students found MC to be easier. Several students also recognized that the probability of answering an item correctly when they did not know the answer was much
greater for MC than CR. Over fifty percent of the students who preferred the CR test reported that they liked the challenge it presented. Although the majority of students preferred the MC test, only a very small percentage said that it was a better indicator of what they knew.
Parmenter (2009) reflects that the literature tends to favor multiple-choice over constructed-response as far as validity and reliability are concerned. For example,
Bridgeman (1992) suggested that although multiple-choice is less reliable on a question by question basis due to
time to answer and grade would allow an exam made up
entirely of multiple-choice to contain more questions and therefore be more reliable than an exam containing fewer open-ended questions. It is generally assumed that correct answers to MC items can be guessed at more readily than CR items, thus MC tests are less difficult, less
discriminating and less reliable than CR tests of the same content. In addition, having multiple answers – one of
which is the correct one – may alert the examinee who makes a mistake in the computation and ends up with an answer that is not on the list of choices, to check and/or redo the computation. However, these expectations are not
supported by findings of empirical research (Traub and McRury, 1990).
Traub and McRury (1990) report that students have more positive attitudes towards multiple choice tests in
comparison to free response tests because they think that these tests are easier to prepare for, easier to take, and thus will bring in relatively higher scores. In the study by Ben-Chaim and Zoller (1997), the examination format
preferences of secondary school students were assessed by a questionnaire and structured interviews. Their findings suggested that students preferred written, unlimited time examinations and those in which the use of supporting
material was permitted. Assessment formats that reduce stress will, according to these authors, increase the chance of success, and students vastly prefer examinations that emphasize understanding rather than rote learning.
Martinez (1999), however, describes the students’ preferences for CR formats as just a “perception”. Their opinions did not constitute reliable evidence that MC items tapped lower-level cognitive processes. Birenbaum (1997) found that differences in assessment preferences correlated with differences in learning strategies. Moreover,
Birenbaum and Feldman (1998) discovered that students with a deep study approach tended to prefer essay type
questions, while students with a surface study approach tended to prefer multiple-choice formats. Students with high test anxiety had more favorable attitudes toward
multiple-choice questions while those with low test anxiety tended to prefer open-ended formats (Birenbaum, 1997).
Scouller (1998) investigated the relationships between students’ learning approaches, preferences, perceptions, and performance outcomes in two assessment contexts: a multiple-choice question examination requiring knowledge across the whole course, and assignment essays requiring in-depth study of a limited area of knowledge. The results indicated that if students preferred essays, then they
would do better on the essay items than if they preferred multiple-choice questions.
Study skills and performance. A review of the California
State Department of Education’s report on open-ended questions, A Question of Thinking, showed that most
students lacked opportunities to express mathematical ideas in writing, with fewer than 25% able to write completely about any of the problems given (Stenmark, 1989). According to NCTM (1991), it is the task that requires students to construct their own responses that more closely models real work and prepares students for life outside school. Tests that emphasize narrow recall will not effectively prepare students for a world that demands thinking and
communication. There is evidence that students study
differently depending on the type of test they anticipate and this alters the nature and quality of student learning. Studies are mixed in their detection of anticipation
effects; however, a majority of studies have found that
response formats make a difference in anticipatory learning and that the expectation of CR tests favors concept
learning while the anticipation of MC tests favors detail memorization (Martinez, 1999; Traub & McRury, 1990).
Douglas Reeves, chairman and founder of the Center for Performance Assessment and the International Center for
Educational Accountability, has said that “even if the state test is dominated by lower-level thinking skills and questions are posed in a multiple-choice format, the best preparation for such tests is not mindless testing drills, but extensive student writing, accompanied by thinking, analysis, and reasoning” (2004, p. 92).
Testwiseness. Testwiseness is any skill that allows a
student to choose the correct answer on an item without knowing the correct answer. Students who are testwise look for mistakes in test construction, make guesses based on teacher tendencies, and search for any unintentional clues that can be found in a test. Millman, Bishop and Ebel
(1965, in McPhail, 1981), known for their theoretical work on testwiseness, state that “testwiseness is defined as a subject’s capacity to utilize the characteristics and
format of the test and/or the test taking situation to
receive a high score. Testwiseness is logically independent of the examinee’s knowledge of the subject matter for which items are supposedly measured” (McPhail, 1981, p.707).
A number of researchers have investigated the belief that the results of MC tests can be influenced by
“testwiseness” (Simkin & Kuechler, 2005). The most common technique is to eliminate one or more MC answers based on
only a partial understanding of the knowledge being tested and thus generate misleadingly high test scores. Studies by Rogers and Hartley (1999) and Zimmerman and Williams (2003) both corroborate the influence of testwiseness on MC
examinations. Researchers have found that testwiseness skills introduced additional variance into examination scores (Fagley, 1987), and that there was a positive association between testwiseness skills and classroom examination performance (Fagley, Miller, and Downing,
1990). McPhail (1981) argued that teaching testwiseness would improve the validity of test results, would likely strengthen critical thinking, and would provide equal education, employment and opportunity for minorities. There are two ways of learning testwiseness: associative learning and problem solving. Associative learning means learning from being told and from practice and drill. In problem solving, students search for a pattern; they are presented with evidence and are asked to investigate the data and draw conclusions (McPhail, 1981).
It is also beneficial to raise English Language
Learners’ awareness of the typical discourse and formats of standardized tests. ELLs may not be familiar with the kind of language that is used in tests, including many
predictable patterns and phrases. It may also be beneficial to teach test-taking skills (e.g., how to approach a
multiple-choice question, how to locate the main idea in a reading passage) to help prepare ELLs for specific types of test items they may encounter. Armed with a variety of
test-taking skills and strategies, ELLs may be empowered to demonstrate their knowledge on a test, rather than being intimidated by unfamiliar terms and formats (McPhail, 1981).
Guessing. Differences among students on variables that
affect the amount of guessing have been identified as a source of error on multiple-choice tests (Cronbach, 1980). Guessing on a multiple-choice item may be categorized as random (among all choices) or informed (where some wrong choices are eliminated) (Frary, Cross & Lowry, 1977). Most researchers agree that the influence of blind guessing on the scores of a test diminishes as the length of a test and the number of options per item increase (Ebel & Frisbie, 1991). The guessing factor reduces the reliability of multiple-choice item scores somewhat, but increasing the number of items on the test offsets this reduction in reliability. For example, if the test includes a section with only two multiple-choice items of 4 alternatives each (a b c d), one can expect 1 out of 16 students to
correctly answer both items by guessing blindly. On the other hand, if a section has 15 multiple-choice items of 4 alternatives each, one can expect only 1 out of 8,670 students to score 70% or more on that section by guessing blindly (Burton et al., 1991).
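The two figures quoted from Burton et al. (1991) can be checked with a short calculation. The sketch below is illustrative only and is not taken from the cited source: it assumes that each blind guess is independent, that every alternative is equally likely to be chosen, and that "70% or more" of 15 items means at least 11 correct. The function name `p_at_least` is a hypothetical helper, and `math.comb` requires Python 3.8 or later.

```python
from fractions import Fraction
from math import comb


def p_at_least(n_items: int, n_options: int, min_correct: int) -> Fraction:
    """Exact probability of at least `min_correct` right answers out of
    `n_items` when every answer is a blind guess among `n_options`
    equally likely alternatives (binomial model, an assumption here)."""
    p = Fraction(1, n_options)   # chance of guessing one item correctly
    q = 1 - p                    # chance of guessing one item incorrectly
    return sum(
        comb(n_items, k) * p**k * q**(n_items - k)
        for k in range(min_correct, n_items + 1)
    )


# Two items with 4 alternatives each: both must be guessed correctly.
two_items = p_at_least(2, 4, 2)

# Fifteen items with 4 alternatives each: 70% of 15 is 10.5,
# so "70% or more" is taken to mean at least 11 correct.
fifteen_items = p_at_least(15, 4, 11)

print(two_items)                 # 1/16
print(round(1 / fifteen_items))  # 8670
```

Under these assumptions the results agree with the quoted odds of 1 in 16 and roughly 1 in 8,670, which shows how quickly the influence of blind guessing shrinks as the number of items grows.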
Gender
Research studies have shown that male/female
differences on constructed-response questions often do not parallel the male/female differences on the multiple-choice questions in the same subject (Mazzeo, Schmitt, &
Bleistein, 1992). Typically, when women and men perform equally well on the multiple-choice questions, the women outperform the men on the constructed-response questions. When women and men perform equally well on the constructed-response questions, the men outperform women on the
multiple-choice questions. The differences occur even though the multiple-choice scores and the constructed-response scores tend to agree strongly within each group. In academic subjects, there is usually a strong tendency for the students who are stronger in the skills measured by the multiple-choice questions to be stronger in the skills measured by the constructed-response questions. But if all students improve in the skills tested by the CR questions,
their performance on the MC questions may not reflect that improvement (Livingston, 2009).
Learning Strategies. Kimball (1989) hypothesized that
gender-related differences in performance are the result of different approaches to learning mathematics. Gallagher (1992) found that most of the items favoring men required insightful strategies, whereas all the items favoring women required standard algorithmic strategies.
Format preferences. In a study done by DeMars (1997),
scores from mathematics and science sections of pilot forms of a high school proficiency test were examined for
evidence of an interaction between gender and response
format (MC or CR). When students of all ability levels were considered, the interaction was small in science and non-existent in math. When only the highest ability students were considered, male students scored higher on the
multiple-choice section, whereas female students either scored higher on the constructed-response section or the degree to which the male students scored higher was less on the constructed-response section. Correlations between the formats were high and did not vary by gender.
Beller and Gafni (2000) gave an overview of several studies, which analyzed the students’ preferences for
and the influence of gender differences. In a range of studies, they found some consistent conclusions suggesting that, if gender differences are found (which was not always the case), female students preferred essay formats, and male students showed a slight preference for
multiple-choice formats. Furthermore, male students scored better on multiple-choice questions than female students, and female students scored relatively better on open-ended questions than on multiple-choice questions (Ben-Shakhar and Sinai, 1991; DeMars, 1997).
MC and CR formats require different sets of skills, and these skills may differ for genders. An example is the influence of verbal fluency for writing tasks. Some studies have found that females have higher verbal fluency than males (Halpern, 1992). If this is true, these higher
fluency skills may give females an advantage over males in CR tasks. Willingham and Cole (1997) reviewed national and state assessment results and concluded that writing often appeared to play a role in gender format score differences. The research they reviewed suggested writing skills and fluency differences as possible factors in the female advantage on CR tasks. They also reported that requested discussion and explanation of responses consistently showed female advantages. Clements and Ballista (1992) suggested
that males and females differ on preferred solution
strategies with more females choosing verbal strategies and more males choosing non-verbal strategies.
The age factor. In a meta-analysis performed by Hyde,
Fennema, and Lamon (1990) on gender differences in mathematics performance, they found that overall
differences in mathematics performance were not apparent in early childhood, but that they appeared in adolescence and usually favored boys in tasks involving high cognitive complexity, such as problem-solving, and favored girls in tasks of less complexity, such as computation. In addition, there was a slight female superiority in performance in the elementary and middle school years. A moderate male
superiority emerged in the high school years. Females were superior in computation in elementary and middle school, and the difference was essentially zero in the high school years. The gender difference was essentially zero for
understanding of mathematical concepts at all ages for which data was available. It was in problem solving that dramatic age trends emerged. The gender difference in
problem solving favored females slightly in the elementary and middle school years, but in the high school and college years, there was a moderate effect size, favoring males. It was assumed that this occurred because in high school and
college, students were permitted to select their own
courses, and females chose fewer mathematics courses than did males (Meece, 1992). Differences in course selection appeared to account for some but not all of the gender
difference in performance on standardized tests in the high school and college years (Kimball, 1989).
Ethnicity and Language
According to the recently published Guidelines for the Assessment of English Language Learners, by the Educational Testing Service (2009), English Language Learners (ELLs) represent one in nine students in U.S. classrooms from
pre-Kindergarten through 12th grade, but most are concentrated
in the lower grades. Eighty percent are native speakers of Spanish, and about five percent are of Asian descent.
English Language Learners are concentrated in six states: California, Arizona, Texas, New York, Florida and Illinois. In California, more than 25% of the students in grades pre-K-12 are ELLs.
ELLs vary greatly as individuals. Therefore, there is no particular response format that is most advantageous for all. If the multiple-choice format is decided upon, large amounts of text make it less likely that they will
If the constructed-response format is selected to assess their knowledge, the examiner might consider including tasks that allow examinees to respond, not in long, wordy sentences, but in diagrams or other visual representations (Snow, 2000). It may be challenging for students learning English to show what they know and can do in mathematics if the test items that assess this knowledge also test their English language skills. The complexity of the language in a math test item may interfere with the ability of ELLs to demonstrate their understanding of math concepts on
achievement tests (Abedi, 2002). Mathematics test items can be reworded to minimize their language load without
altering the content assessment (Abedi, 2002).
Low scores on a standardized test may mean nothing more than that a learner has not yet mastered enough English to demonstrate his or her content knowledge and skills on a test. Multiple assessments, including some performance-based or alternative assessments that mirror what students are learning in class, will paint a much more accurate picture of students’ knowledge, skills, and
progress than any single test score can indicate (Coltrane, 2002).
Accommodations. Using Mathematics test items from the
National Assessment of Educational Progress (NAEP), Abedi et al. (2002) employed accommodation strategies (modified
English, use of dictionary, extra time) and the results indicated that ELL students scored, on average, 5 points lower than non-ELL students on a 35-item math test. Also, students who were better readers achieved higher math
scores. In an earlier study using the 1990 NAEP Mathematics Assessment, it was found that members of some ethnicities were less likely to respond to open-ended items than were students in other groups. This finding suggests that the experiences students bring to the testing situation may interact with test format to influence their performance, and that elimination of the multiple-choice format may increase, rather than reduce, achievement gaps (Myerberg, 1996).
Bronwyn Coltrane of the Center for Applied Linguistics advocates teaching ELLs the discourse of tests and test-taking skills: “It is ... beneficial to raise ELLs’ awareness of the typical discourse and formats of
standardized tests. ELLs may not be familiar with the kind of language that is used in tests, including many
predictable patterns and phrases. It may also be beneficial to teach test-taking skills (e.g., how to approach a
multiple-choice question, how to locate the main idea of a reading passage) to help prepare ELLs for specific types of test items they may encounter. Armed with a variety of
test-taking skills and strategies, ELLs may be empowered to demonstrate their knowledge on a test, rather than being intimidated by unfamiliar terms and formats.” This
preparation in how to approach test questions and answer sheets is especially important for ELLs who are recent
immigrants. Even those who have some proficiency in English may never have been exposed to the format of U.S.
standardized testing.
Scoring Rubrics
Scoring constructed-response items written by ELLs may present additional challenges. ELLs’
constructed responses may differ in two ways: in features that stem from
language background and in the style of the response (Abedi & Lord, 2001). For example, if they have to use sentences to write a proof, the scorer must overlook errors in grammar and syntax and focus on the content knowledge and the range of that knowledge. Also, arithmetic operations are learned differently in other countries. To name a few differences, the
conventions for long division vary, and the decimal separator is written as a comma in parts of Europe and Asia.
Formatting
Formatting is important for students whose processing strategies and decoding efforts result in literacy and language challenges (Abedi, 2002). Some critics suggest that, for ELLs, the most humane approach is to focus almost exclusively on the reduction of language in the text. In mathematics, for instance, asking students to simplify “3x + 5x” would be more fitting and less confusing than asking them to simplify “the sum of three times a number and five times that same number”. Although it may seem that English Language
Learners would fare better on multiple-choice tests because they are not obligated to express their reasoning in
writing – which may prove to be weak – testing them largely or exclusively on multiple-choice tests may mask their real abilities.
Empirically, Kopriva and Lowrey (1994) found that a large percentage of ELLs in California said they would rather have an open-answer format as compared with
multiple-choice format for providing their responses. They said that the CR format provided them with the chance to explain what they know. It is therefore recommended that CR items be used to allow for different approaches to
demonstrating mastery, such as charts, diagrams and pictures.
Edwards and his colleagues (2007) investigated
subgroup differences on a multiple-choice and constructed-response test of scholastic achievement in a sample of African American and White students. Although both groups had lower mean scores on the constructed-response test, the results showed a 39% reduction in subgroup differences
compared with the multiple-choice test. The study also indicated that African American students had more favorable perceptions of the constructed-response test. The authors concluded that integrating constructed-response items would be a viable alternative for minimizing subgroup differences on high-stakes testing.
Validity
Many researchers and practitioners believe that standards-based reform and high-stakes testing will have the greatest impact on Blacks, Latinos, English-language learners, students with disabilities, and low-SES students (Heubert, 2009). As beneficial as it may be to include ELLs in high-stakes tests, some complications arise concerning the validity and reliability of such tests for this group of learners (Coltrane, 2002). Educators must consider what is actually being assessed by any given test: Is the test measuring ELLs’ academic knowledge and skills, or is it primarily a test of their language skills? When ELLs take
standardized tests, the results tend to reflect their
English language proficiency and may not accurately assess their content knowledge or skills (Menken, 2000), therefore weakening the test’s validity for them. If ELLs are not able to demonstrate their knowledge due to the linguistic difficulty of a test, the test results will not be a valid reflection of what the students know and can do.
Popham (1999) hints that there are test questions that “may appear to be appropriate for assessing students’
skills and knowledge, but in reality, there is a real presence of SES-linked content that gives an edge to children, whose parents are middle or upper class, are
better off financially or have received a higher education” (p.59). Perhaps most importantly, educators must be
cautious when interpreting the test results of ELLs. As with all learners, it is crucial to remember that one test cannot accurately reflect everything that a person has learned and is able to do. This point is particularly important if the validity and reliability of the test are questionable for ELLs, or if the students were not given appropriate testing accommodations. Similarly, high-stakes decisions should not be made regarding a program, school, or district with high numbers of ELLs based solely on test data. Such data may merely indicate that a school or
district has a high percentage of ELLs, and not be reflective of instructional quality or program effectiveness (Menken, 2000).
Guidelines for Writing Multiple-Choice Questions
From a teaching and learning point of view, question construction has to address specific criteria for good assessment (Earl, Land and Wise, 2000). Questions have to be a) reliable: they must produce consistent results, b) valid: they must test what the student has been taught, c) useful: they must help the student progress and reinforce the learning, d) fair: all students who take the assessment should have an equal chance of scoring full marks, and e) cost effective: they must be efficient enough to produce the required results for the students and the institution in general.
Haladyna and Downing (1989) are recognized as major contributors to the research on multiple-choice testing. They devised guidelines for procedural and content item writing, as well as stem construction and option and distractor development. They advise the following:
1. Avoid the complex multiple-choice (Type K) format.
(e.g., A and D, A and C, All the above, None of the Above, A, B and C, etc.).
3. Avoid trick items, which mislead or deceive examinees into answering incorrectly.
4. Base each item on an educational or instructional objective.
5. Keep the vocabulary consistent with the examinees’ level of understanding.
6. Use multiple-choice to measure higher-level thinking.
7. Test for important or significant material; avoid trivial material.
8. State the stem in question form or completion form (note: recent research findings favor question form over completion).
9. Ensure that the directions in the stem are clear, and that wording lets the examinee know exactly what is being asked.
10. Avoid window dressing (excessive verbiage) in the stem.
11. Word the stem positively; avoid negative phrasing.
12. Include the central idea and most of the phrasing in the stem.
13. Use as many options as are feasible; more options are desirable.
14. Place options in logical or numerical order.
15. Keep the length of the options fairly consistent.
17. Avoid, or use sparingly, the phrase “none of the above.”
18. Avoid the use of the phrase “I don’t know.”
19. Avoid distractors that can clue test-wise examinees; for example, avoid clang associations, absurd options, formal prompts, or semantic (overly specific or overly general) clues.
20. Avoid giving clues through the use of faulty grammatical construction.
21. Avoid specific determiners, such as “never” and “always.”
22. Make sure there is one and only one correct option.
23. Use plausible distractors; avoid illogical distractors.
24. Incorporate common errors of students in distractors.
25. Avoid technically phrased distractors.
26. Use familiar yet incorrect phrases as distractors.
27. Use true statements that do not correctly answer the item.
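A few of the guidelines above are mechanical enough to be screened automatically before an item reaches a human reviewer. The following is a minimal sketch of such a screen, not a procedure described by Haladyna and Downing (1989). The function name check_item, the word lists, and the length-ratio threshold are all hypothetical choices made for illustration, and only guidelines 11, 14, 15, 17, 18, and 21 are covered. Guidelines that require judgment, such as plausibility of distractors, are left to the item writer.

```python
import re

# Word lists and the length-ratio threshold are illustrative assumptions,
# not values taken from Haladyna and Downing (1989).
NEGATIVE_WORDS = {"not", "except"}
SPECIFIC_DETERMINERS = {"never", "always"}
DISCOURAGED_OPTIONS = {"none of the above", "i don't know"}
MAX_LENGTH_RATIO = 2.0


def _words(text):
    """Lowercase word tokens, keeping apostrophes inside words."""
    return re.findall(r"[a-z']+", text.lower())


def _as_numbers(options):
    """Return the options as floats, or None if any option is not numeric."""
    try:
        return [float(opt) for opt in options]
    except ValueError:
        return None


def check_item(stem, options):
    """Return warnings for guidelines 11, 14, 15, 17, 18, and 21."""
    warnings = []

    # Guideline 11: word the stem positively.
    if NEGATIVE_WORDS & set(_words(stem)):
        warnings.append("11: stem uses negative phrasing")

    numbers = _as_numbers(options)
    if numbers is not None:
        # Guideline 14: place numeric options in numerical order.
        if numbers != sorted(numbers):
            warnings.append("14: numeric options are not in ascending order")
    else:
        # Guideline 15: keep the length of text options fairly consistent.
        lengths = [len(opt.strip()) for opt in options]
        if min(lengths) > 0 and max(lengths) / min(lengths) > MAX_LENGTH_RATIO:
            warnings.append("15: option lengths vary widely")

    for opt in options:
        # Normalize case, curly apostrophes, and a trailing period.
        normalized = opt.strip().lower().replace("’", "'").rstrip(".")
        if normalized in DISCOURAGED_OPTIONS:
            # Guidelines 17 and 18: discouraged catch-all options.
            warnings.append(f"17/18: discouraged option '{opt}'")
        elif SPECIFIC_DETERMINERS & set(_words(opt)):
            # Guideline 21: avoid specific determiners.
            warnings.append(f"21: specific determiner in '{opt}'")

    return warnings


if __name__ == "__main__":
    for warning in check_item("Which number is not even?", ["8", "4", "7", "2"]):
        print(warning)
```

A screen of this kind can only flag candidates for revision. A flagged item may still be acceptable (guideline 17, for instance, allows sparing use of “none of the above”), so each warning is best read as a prompt for the item writer rather than as a verdict.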
Guidelines for the Constructed-Response Items
Many references describe how to construct valid constructed-response items. General guidelines can be gleaned from them and summarized as follows:
1. Design CRs so that students are challenged to think and not just to provide memorized answers.