Activity #1: Application of Rules and Expectations Across Groups
“MINING EXPEDITION”
In this activity, it is vital that you follow all instructions.
Treat the situation as a two-part, high-stakes performance testing environment.
PART 1:
20-minute Mining Expedition
(PERFORMANCE TEST)
This is not group work; individual skills are being tested.
You should receive a plastic bag and a set of rules or directions for your task.
You have 3 minutes to read, understand, and memorize the directions for mining (collecting treasures and alphabets).
Absolutely NO talking as you follow your specific instructions.
TWO ring tones will indicate when to begin.
You have 20 minutes for the Mining Expedition.
At the end of 20 minutes, MULTIPLE ring tones will signal the end of your performance task (no more mining) and that it is time to find a seat.
PART 2: Scoring!
Examine your cache of treasures and alphabets. Calculate the value of your cache based on the rubric below.
Treasures: Gold=10, Gems=8, Silver=6, Shells=5
Alphabets: C = 5, A E I O U = 3, all others = 2
If you make an English word using 3 alphabets, double (X2) the total word value.
e.g. S U N = (2+3+2)*2 = 14
If you make an English word using four or more alphabets, triple (X3) the total word value.
e.g. P A C K = (2+3+5+2)*3 = 36
e.g. S H E L L = (2+2+3+2+2)*3 = 33
Scores are invalid for any person who was given
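The scoring rubric above can be sketched in a few lines of Python. This is only an illustration: the point values come from the rubric, but the function names are hypothetical, and the sketch assumes that treasure values apply per item, that letters not used in a word count at face value, and that strings shorter than three letters earn no multiplier. It does not check whether a word is valid English.

```python
# Point values copied from the rubric; everything else here is illustrative.
TREASURE_VALUES = {"gold": 10, "gems": 8, "silver": 6, "shells": 5}
VOWELS = set("AEIOU")


def letter_value(letter):
    """Return the rubric value of a single alphabet (letter)."""
    letter = letter.upper()
    if letter == "C":
        return 5
    if letter in VOWELS:
        return 3
    return 2


def word_value(word):
    """Sum the letter values, then double 3-letter words and triple longer ones."""
    base = sum(letter_value(ch) for ch in word)
    if len(word) >= 4:
        return base * 3
    if len(word) == 3:
        return base * 2
    return base  # assumption: shorter strings earn no multiplier


def cache_value(treasures, words=(), loose_letters=""):
    """Total a cache: treasure counts, claimed words, and leftover letters."""
    total = sum(TREASURE_VALUES[name] * count for name, count in treasures.items())
    total += sum(word_value(word) for word in words)
    total += sum(letter_value(ch) for ch in loose_letters)
    return total


print(word_value("SUN"))    # 14
print(word_value("PACK"))   # 36
print(word_value("SHELL"))  # 33
print(cache_value({"gold": 2, "shells": 1}, ["SUN"], "QE"))  # 44
```

The three word examples reproduce the worked values in the rubric, which is a quick check that the rubric is being applied the same way for every participant.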
Debrief / Discussion
The environment – space, restrictions
Clarity of rules and instructions
Engagement – rules and instructions, participation, frustrations, time constraints
Scoring (ease and results) – standardization, efficacy, interpretation, sense of accomplishment
Fairness – what about the hints? Inherent privileges or advantages given to particular groups
Cheating (by talking, writing, or reference to rule sheets) – discuss penalties for perception of cheating. What were you thinking when others were not following the same rules?
Implications for evaluating the quality of
Examining commercial or large-scale tests: Test Manuals – Test Plans
Test description and purpose
Blueprints
Item types
Item distribution
Sample items
Sample scoring
Reliability
Validity
Fairness
Cost effectiveness/efficiency
Examining the quality of your own examinations
In examining the quality or utility of a commercially available test or any large-scale test, the first effort should be spent examining the test manual, if one exists.
Test manuals provide descriptions and valuable information about the purpose and technical information pertaining to the test.
Also review studies conducted about the test, both independent and by the producer/publisher. Read professional reviews about the test.
For teacher-made tests there are typically no manuals, but there should nevertheless be a plan. Plans might correspond to a course outline or syllabus, or even an instructor manual.
Most of you build tests for your classes
Do you use instructor’s manuals to help build tests?
They help in regard to time pressure but provide inferior questions
Rarely written by professors
Quickly made; there is no money for the publisher
Trust yourself
Ask colleagues to consider your thinking
Develop rubrics for assessment; ask colleagues what they think
Try to test (mostly) important information
How did you feel as a student when teachers did not test important information?
How did you feel when tests were unbalanced, that is, when the subject matter was not fairly represented?
Blueprints serve as guides to the test in terms of content coverage and level of objectives.
Expressed as a matrix (table)
Topics or categories covered
Number of items
Identifies objectives and skills to be tested
Relative weight for each item or category
Item types can be categorized as being
objective or subjective and as having
constructed or selected response formats:
Objective: a test that measures content or skills in such a manner that a correct response is achieved independent of rater bias or the examiner's own beliefs. In other words, the test is usually built from a bank of items whose responses are marked against exacting, completely standardized scoring mechanisms (rubric).
Subjective: a test that measures opinions or perspectives, including one’s expression and the rationale or evidence that one would provide to support a response. In this form of testing, there may be more than one way to approach or express a correct response. Depending on the content and how the item prompt is designed, it may still be possible to establish a rubric for how points are awarded.
Examples of the different types of response formats
Selected response: Multiple choice, True and False, Matching
Constructed response: Cloze, Fill in the blanks, Problem solving, Short answer, Essays
Within a test blueprint, test developers assign
how content is covered – item distribution.
However, this information is not often included in the published test manuals.
Distribution decisions are sometimes expressed as the percentage of items included per content area, based on the importance of sub-categories to the profession or the difficulty level associated with sub-topics.
One general purpose of a test is to differentiate those who can from those who can’t. Another is to determine the extent of mastery or competence of the test taker.
Page counts in textbooks
Time spent by teachers
Ratings of the importance of topics
Faculty members can use these same considerations in weighting different topics in examinations
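As a rough illustration of turning topic weights into an item distribution, the sketch below allocates a fixed number of items in proportion to weights such as page counts, teaching time, or importance ratings. The function name, the topic labels, and the weights are hypothetical, and the rounding rule (largest remainder) is just one reasonable choice.

```python
def allocate_items(weights, total_items):
    """Split total_items across topics in proportion to their weights."""
    total_weight = sum(weights.values())
    exact = {topic: total_items * w / total_weight for topic, w in weights.items()}
    # Start from the whole-number part of each topic's share.
    counts = {topic: int(share) for topic, share in exact.items()}
    leftover = total_items - sum(counts.values())
    # Hand any remaining items to the topics with the largest fractional parts.
    by_remainder = sorted(weights, key=lambda t: exact[t] - counts[t], reverse=True)
    for topic in by_remainder[:leftover]:
        counts[topic] += 1
    return counts


# Hypothetical importance ratings for three topics on a 20-item exam.
ratings = {"Topic A": 5, "Topic B": 3, "Topic C": 2}
print(allocate_items(ratings, 20))  # {'Topic A': 10, 'Topic B': 6, 'Topic C': 4}
```

The resulting counts can be entered directly into the "number of items" column of a blueprint.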
As a student, did you ever take a test that you considered unfair? Why?
Activity #2: The TEST PLAN
Given the information in the
previous slides:
What is the relevance or utility of a test plan?
What information would you include in a test plan?
Activity #2: The TEST PLAN
Did you include responses to any of the following questions?
What is the purpose of this test?
Who is the test designed for?
What information is to be included in the test?
How will items be scored?
What do scores provided to the test-taker mean?
Most psychometricians and testing experts
would list reliability among the most important characteristics that tests and other kinds of
psychological data need to have.
Reliability and the test standards are regarded
with great importance by professional
associations representing people who work in testing.
Reliability is best defined as consistency and
sometimes as accuracy as well.
Another perspective of reliability is that it is
really a family of procedures that provide evidence of various kinds of consistency in tests.
Signal-to-noise ratio
Error is noise
Signal is reliable variance
Foundations of Classic Test Theory regarding scores: X = T + e (observed score = true score + error)
Test-Retest
Alternate or Parallel Forms
Inter-Rater
Internal Consistency (Alpha vs. Split-Half)
When a single form of a test is administered and then administered again after a period of time has passed, we can estimate test-retest reliability.
This testing approach permits us to see how stable scores are, or how much scores change over time (pre-test / post-test / opportunity to retest).
A simple Pearson product-moment correlation coefficient between the two sets of scores serves as the reliability estimate when test-retest reliability is studied.
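A minimal sketch of the test-retest calculation, assuming the same examinees are listed in the same order for both administrations. The scores are made up for illustration and the function name is hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when all scores are identical")
    return cov / sqrt(ss_x * ss_y)


# Hypothetical scores for eight examinees on two administrations of one form.
first_administration = [72, 85, 90, 64, 78, 88, 95, 70]
second_administration = [75, 82, 93, 60, 80, 85, 96, 74]

print(round(pearson_r(first_administration, second_administration), 3))
```

A coefficient close to 1 suggests that examinees kept roughly the same rank order across the two occasions.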
Sometimes alternate forms are needed to guard against cheating when some people will take the test on an alternate day. Another reason may be that one or more scores are invalidated (e.g., there was an unusual event that interfered with the testing), so retesting is required.
Alternate forms generally follow the same test outline, have the same length, and are built or adjusted to yield highly similar means and standard deviations for the resultant test scores.
When two or more test forms are taken with little time between them, the resulting reliability coefficient reflects the degree to which content sampling affects reliability.
Where the time between administrations is longer, the resulting reliability coefficient reflects errors due to temporal differences, content sampling differences, or both.
Inter-rater reliability is used when there are numerous raters whose consistency of ratings needs to be considered.
Frankly, if one has two raters, they can be considered as parallel or alternate forms of each other.
If there are more than two, an index such as Coefficient Alpha can be used, treating each rater as though he or she were a test item.
Sometimes it is only possible to administer the
test to a group of clients once regardless of number of test forms available. In such
instances, internal consistency approaches to reliability can be used to assess reliability.
Internal consistency approaches to reliability cannot demonstrate the stability of scores over time because they are almost certainly based on data collected in a single sitting/test administration. Nevertheless, they can indicate the effect of content sampling.
We will discuss two approaches to calculating
internal consistency: Split-Half and Coefficient Alpha
Split-Half Reliability
This technique compares the subscores emerging from two halves of a test (for example, odd- versus even-numbered items).
The two halves are correlated with each other, and an adjustment known as the Spearman-Brown Prophecy Formula is applied, because the correlation between the two halves indicates the reliability of half a test rather than of the test as a whole.
This technique was developed in a time before
computerized test administration, the optical
scanning of test forms, computers, and statistical software packages.
There is normally no reason to use it today. Better
internal consistency techniques exist.
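For readers who want to see the mechanics, here is a sketch of an odd/even split-half estimate with the Spearman-Brown adjustment. The response matrix is invented (rows are examinees, columns are items scored 0 or 1) and the function names are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def spearman_brown(r, k=2):
    """Project reliability r to a test k times as long (k=2 for split-half)."""
    return k * r / (1 + (k - 1) * r)


def split_half_reliability(responses):
    """Return (half-test correlation, Spearman-Brown adjusted estimate)."""
    odd = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    r_half = pearson_r(odd, even)
    return r_half, spearman_brown(r_half)


# Hypothetical data: six examinees, four items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]
r_half, r_full = split_half_reliability(responses)
print(round(r_half, 3), round(r_full, 3))
```

The same `spearman_brown` function with a larger `k` shows why lengthening a test with comparable items tends to raise reliability.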
Coefficient Alpha
The most common reliability formulation today is
Coefficient Alpha and its variations.
Coefficient alpha is said to be the mean of all
possible split half divisions of the test.
Essentially, this technique is based on only two factors:
the strength of the correlations among items
composing the test and the number of items on the test.
Cronbach’s alpha usually falls between 0 and 1 (it can be negative when items correlate negatively with one another) and will generally increase when the correlations between the items increase. For this reason the coefficient measures the internal consistency of the test.
Typical advice proffered for increasing reliability is to lengthen the measure (more items) and eliminate items that do not correlate with the others.
Cronbach’s alpha (coefficient alpha) is rather
versatile, so it can be used with dichotomous, continuous and non-dichotomous data.
Therefore it can be used for objective tests as
well as tests in which participants can earn partial credit and for questionnaires using a Likert scale.
Common interpretation of alpha:
acceptable reliability = 0.7 (some say 0.6)
good reliability = 0.8 or higher
very high reliability = 0.95 or higher (not necessarily desirable; it may be an indicator of redundant items)
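Coefficient alpha can be computed from a table of item scores with the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), where k is the number of items. The sketch below assumes rows are respondents and columns are items; the Likert-style data and the function name are hypothetical.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Coefficient alpha for a respondents-by-items table of numeric scores."""
    k = len(scores[0])                 # number of items
    items = list(zip(*scores))         # transpose: one tuple of scores per item
    item_variance_sum = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical responses: five respondents, four items on a 1-5 Likert scale.
likert = [
    [4, 5, 4, 5],
    [3, 3, 2, 3],
    [5, 5, 4, 4],
    [2, 1, 2, 2],
    [3, 4, 3, 3],
]
print(round(cronbach_alpha(likert), 3))
```

The same function can be applied to a table in which each column is a rater rather than an item, in line with the inter-rater use described earlier.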
Longer tests are more reliable
The more components they have, the
more reliable they become
Develop and use rubrics for scoring subjective responses
Make decisions about what should
influence scoring
Do grammar and proper language use count?
Can one get partial credit for good work that leads to incorrect answers?
RELIABILITY
Though calculating reliability is not a simple
task for a non-psychometrician, it is not impossible with some guidance or
coaching.
The following slide is an example of how reliability can be calculated using Excel for an assessment with 8 participants.
The calculations show differences for the assessment when there are 10 items, 15 items and 20 items.
Today, reliability is viewed as absolutely necessary but insufficient evidence in validation of test scores.
There are two reasons for this perception.
First, reliability sets limits on validity.
Second, newer views of validity include all test results that bear on the interpretation of data, whether from the test itself or from other sources.
Currently, testing and assessment experts
acknowledge that validity is the most
important characteristic to be evaluated in testing.
It is a testing or an assessment that is valid, not the test instrument itself. Tests produce scores or results, and it is the interpretations of these scores that need to be validated.
Test scores or test results are valid to the extent
that we have evidence to support a given use, during a given time period, and with a particular well-defined population.
That evidence does not extrapolate to other uses, times of use, or populations; additional validation research would be required.
Validity is assessed through research.
Validation has become a more prominent word because it focuses upon the research that substantiates the evidential basis for test uses.
Validation is ultimately the process of
accumulating empirical evidence as well as theoretical bases to support particular uses of tests. The process is cumulative and on-going.
The primary focus of validation is the justification of interpretations for specific test scores and for their proposed uses. It involves the construction of arguments for particular interpretations of test scores and then subjecting these arguments to rigorous theoretical and empirical scrutiny.
Rarely would research be done on a classroom examination.
Does it measure the appropriate topics? Does it assess using the correct types of
thinking?
What is the role of memory? Of concept learning? Of problem solving?
Do scores from the test lead to accurate conclusions about the students’ performance, knowledge, or capacity?
Define the construct(s) intended to be measured so that its components are all assessed to the extent possible.
Explicitly outline and describe the likely uses to which such a measure might be employed.
Explicitly outline and describe the primary proposed interpretations for specific scores or test results that are likely to be
suggested.
Evidence-based or hypothesized theory is often the starting point.
Testing specialists talk about different kinds of evidence to justify valid test
interpretations and uses.
Validation is ultimately the responsibility of both the test developer and the test user, which in the case of faculty members may be the same person or people in most
instances.
The test publisher is primarily responsible for identifying the uses to which a test can be put and for providing initial evidence to demonstrate that the test aids in accomplishing that particular test use.
Oscar K. Buros
Buros Center for Testing
The Mental Measurements Yearbook
The “consumer report” of testing
The review process
How to locate test reviews
Tests in Print
Pruebas Publicadas en Español
Assessment Literacy Series
Activity #3: Validity Evidence
Read an MMY review of the GRE
Compare notes with someone who read the other review
Discuss
What is the purpose and audience for the test?
What is similar in the reviews?
What is different in the reviews?
What technical evidence is presented?
Has the review influenced your perception?
How can this model be applied to your
Differences across pre-existing groups in performance that are unrelated to the group’s actual ability to perform
It is not “mean differences”
I am avoiding statistical definitions (often defined using regression analyses)
What does fairness mean in Saudi Arabia? At King Fahd University?
In the USA
Racial and ethnic groups
Gender groups
People with disabilities
Immigrants, people differing by country of origin
Language minorities (sometimes)
Not just differences in averages
We all want our examination scores to be considered valid
To be valid, they need to be both reliable and fair
Reliability and fairness are components of validity
There are things we can do to make our examinations more reliable, fair and valid
For course exams, and maybe even program exams, it is highly unlikely that there would be the time and resources to create a test manual and conduct full-fledged validation studies.
So, what tools or approaches are possible for you to use to evaluate your examinations at the course level or the program level?
Why would you or should you invest time and effort evaluating the quality of your examinations?
Test/Exam blueprint
Scoring rubric
Classic Test Theory Statistics
Descriptive Statistics: Mean, Mode & score distribution, Variance, Standard Deviation
Test data: Item variance, Reliability, Standard Errors of Measurement (SEM)
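As a small sketch of the statistics named above, the code below computes descriptive statistics for a set of exam scores and the standard error of measurement using the usual Classic Test Theory formula, SEM = SD * sqrt(1 - reliability). The scores and the reliability value of 0.84 are hypothetical; in practice the reliability would come from an estimate such as coefficient alpha.

```python
from math import sqrt
from statistics import mean, stdev, variance


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability), with reliability between 0 and 1."""
    return sd * sqrt(1 - reliability)


# Hypothetical exam scores for eight students.
scores = [12, 15, 15, 18, 20, 14, 16, 18]
assumed_reliability = 0.84  # assumption for illustration only

sd = stdev(scores)
print("mean:", mean(scores))
print("variance:", round(variance(scores), 2))
print("standard deviation:", round(sd, 2))
print("SEM:", round(standard_error_of_measurement(sd, assumed_reliability), 2))
```

The SEM is expressed in score points, so it gives a sense of how much an individual score might vary because of measurement error alone.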
How to avoid “halo”
Read anonymously
Read one question at a time
Do not read a single paper through
Have rubrics or scoring protocols
For higher stakes tests, use multiple raters
May provide printouts of test data
Check the reliability
Look for items or questions that do not correlate with others
Look for answers that are different but come from successful students
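One way to look for items that do not correlate with the others is a corrected item-total correlation: each item is correlated with the total of the remaining items. The sketch below is illustrative only; the data, the function names, and the 0.2 flagging threshold are assumptions rather than fixed rules.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation; reported as 0.0 here when either list has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        return 0.0
    return cov / sqrt(ss_x * ss_y)


def item_rest_correlations(responses):
    """Correlate each item with the sum of all the other items."""
    n_items = len(responses[0])
    results = []
    for i in range(n_items):
        item = [row[i] for row in responses]
        rest = [sum(row) - row[i] for row in responses]
        results.append(pearson_r(item, rest))
    return results


def flag_items(responses, threshold=0.2):
    """Return zero-based indexes of items whose item-rest correlation is below threshold."""
    correlations = item_rest_correlations(responses)
    return [i for i, r in enumerate(correlations) if r < threshold]


# Hypothetical data: five examinees, four items scored 0 or 1.
responses = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]
print(flag_items(responses))
```

A flagged item is a prompt to reread the question and its key, not an automatic reason to discard it.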
Conditions: environment, equipment, and materials
Test security
Clarity of Directions / Instructions
Time expectations and Task relevance
Item prompts and what information they generate:
Objective
Subjective
Formatting consistency
Item length
Item structure
Language
Rubrics