
Kurt F. Geisinger, Ph.D.

(1)
(2)

Activity #1: Application of Rules

and Expectations Across Groups

“MINING EXPEDITION”

In this activity, it is vital that you follow all instructions.

Treat the situation as a TWO PART high-stakes performance testing environment.

(3)

PART 1:

20-minute Mining Expedition

(PERFORMANCE TEST)

 This is not group work; individual skills are being tested.

You should receive a plastic bag and a set of rules or directions for your task.

 You have 3 minutes to read, understand and memorize

the directions for mining (collecting treasures and alphabets).

Absolutely NO talking as you follow your specific instructions.

 TWO ring tones will indicate when to begin.

 You have 20 minutes for the Mining Expedition.

 At the end of 20 minutes MULTIPLE ring tones will signal

the end of your performance task (no more mining) and time to find a seat.

(4)

PART 2: Scoring!

Examine your cache of treasures and alphabets. Calculate the value of your cache based on the rubric below.

Treasures: Gold=10, Gems=8, Silver=6, Shells=5

Alphabets: C = 5, A E I O U = 3, all others = 2

If you make an English word using 3 alphabets, double (X2) the total word value.

e.g. S U N = (2+3+2)*2 = 14

If you make an English word using four or more alphabets, triple (X3) the total word value.

e.g. P A C K = (2+3+5+2)*3 = 36

e.g. S H E L L = (2+2+3+2+2)*3 = 33
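To make the scoring mechanics concrete, here is a minimal Python sketch of the rubric above (the function names and example inputs are hypothetical, invented for illustration):

# Scoring sketch for the Part 2 rubric.
TREASURE_VALUES = {"gold": 10, "gems": 8, "silver": 6, "shells": 5}

def letter_value(ch):
    ch = ch.upper()
    if ch == "C":
        return 5
    if ch in "AEIOU":
        return 3
    return 2

def word_value(word):
    base = sum(letter_value(ch) for ch in word)
    if len(word) >= 4:
        return base * 3  # four or more letters: triple the word value
    if len(word) == 3:
        return base * 2  # exactly three letters: double the word value
    return base

def cache_value(treasures, words):
    # treasures: e.g. ["gold", "shells"]; words: e.g. ["SUN", "PACK"]
    return (sum(TREASURE_VALUES[t] for t in treasures)
            + sum(word_value(w) for w in words))

# word_value("SUN") == 14, word_value("PACK") == 36, word_value("SHELL") == 33,
# matching the worked examples above.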

 Scores are invalid for any person who was given

(5)

Debrief /Discussion

 The environment – space, restrictions
 Clarity of rules and instructions

 Engagement - Rules & Instructions, participation,

frustrations, time constraints

 Scoring (ease and results) – standardization, efficacy, interpretation, sense of accomplishment

 Fairness – what about the hints? Inherent privileges or

advantages given to particular groups

 Cheating (by talking, writing, or referring to rule sheets) – discuss penalties for perceived cheating; what were you thinking when others were not following the same rules?

 Implications for evaluating the quality of

(6)
(7)

 Examining Commercial or Large Scale tests:
Test Manuals – Test Plans
Test description and purpose
Blueprints
Item types
Item Distribution
Sample items
Sample scoring
Reliability
Validity
Fairness
Cost Effectiveness/Efficiency

 Examining the quality of your own examinations

(8)

 In examining the quality or utility of a commercially available test, or any large-scale test, the first effort should be spent examining the test manual, if one exists.

Test manuals describe the test’s purpose and provide valuable technical information pertaining to the test.

Also review studies conducted about the test – both independent studies and those by the producer/publisher – and read professional reviews of the test.

 For teacher-made tests there are typically no manuals, but there should nevertheless be a plan. The plan might correspond to a course outline, a syllabus, or even an instructor manual.

(9)

 Most of you build tests for your classes
 Do you use instructor’s manuals to help build tests?
 They help in regard to time pressure
 But they provide inferior questions
Rarely written by professors
Quickly made – there is no money in it for the publisher

(10)

 Trust yourself
 Ask colleagues to consider your thinking
 Develop rubrics for assessment; ask colleagues what they think
 Try to test (mostly) important information
 How did you feel as a student when teachers did not test important information?
 How did you feel when tests were unbalanced – that is, when the subject matter was not fairly represented?

(11)

 Blueprints serve as guides to the test in terms of content coverage and level of objectives.

 Expressed as a matrix (table):
Topics or categories covered
Number of items
Objectives and skills to be tested
Relative weight for each item or category
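As an illustration of the matrix idea, a blueprint can be sketched as a small table in Python; the topics, item counts, and weights below are hypothetical, not taken from the slides:

# Illustrative blueprint: (topic, objective/skill, number of items, weight).
blueprint = [
    ("Reliability", "define and compute", 10, 0.25),
    ("Validity",    "evaluate evidence",  15, 0.40),
    ("Fairness",    "identify bias",       8, 0.20),
    ("Test design", "build a blueprint",   6, 0.15),
]

total_items = sum(row[2] for row in blueprint)   # 39 items in this sketch
total_weight = sum(row[3] for row in blueprint)  # weights should sum to 1.0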

(12)

 Item types can be categorized as being objective or subjective and as having constructed or selected response formats:

Objective: a test that measures content or skills in such a manner that a correct response is identified independent of rater bias or of the examiner’s own beliefs. In other words, the test is usually built from a bank of items that are marked and compared against exacting scoring mechanisms (rubrics) that are completely standardized.

(13)

 Item types can be categorized as being objective or subjective and as having constructed or selected response formats:

Subjective: a test that measures opinions or perspectives, including one’s expression and the rationale or evidence one would provide to support a response. In this form of testing, there may be more than one way to approach or express a correct response. Depending on the content and how the item prompt is designed, a rubric may still be possible to establish how points are awarded.

(14)

 Item types can be categorized as being objective or subjective and as having constructed or selected response formats:

 Examples of the different types of response formats:

Selected response
 Multiple choice
 True and False
 Matching

Constructed response
 Cloze
 Fill in the blanks
 Problem solving
 Short Answer
 Essays

(15)

 Within a test blueprint, test developers assign how content is covered – the item distribution. However, this information is not often included in published test manuals.

 Distribution decisions are sometimes expressed as the percentage of items per content area, based on the importance of sub-categories to the profession or on the difficulty level associated with sub-topics.

 One general purpose of a test is to differentiate those who can from those who can’t. Another is to determine the extent of mastery or competence of the test taker.

(16)

 Page counts in textbooks
 Time spent by teachers
 Ratings of the importance of topics
 Professors can use these same topics
 These considerations can be used by faculty members in weighting different topics in examinations
 As a student, did you ever take a test that you considered unfair? Why?

(17)

Activity #2: The TEST PLAN

Given the information in the

previous slides:

What is the relevance or utility of a test plan?

What information would you include in a test plan?

(18)

Activity #2: The TEST PLAN

Did you include responses to either of the following questions?

What is the purpose of this test? Who is the test designed for?

What information is to be included in the test?

How will items be scored?

What do the scores provided to the test-taker mean?

(19)

 Most psychometricians and testing experts

would list reliability among the most important characteristics that tests and other kinds of

psychological data need to have.

 Reliability and the test standards are regarded

with great importance by professional

associations representing people who work in testing.

 Reliability is best defined as consistency and

sometimes as accuracy as well.

 Another perspective of reliability is that it is

really a family of procedures that provide evidence of various kinds of consistency in tests.

(20)

 Signal-to-noise ratio
 Error is noise
 Signal is reliable variance

(21)

 Foundations of Classical Test Theory regarding scores: X = T + e
 Test-Retest
 Alternate or Parallel Forms
 Inter-Rater
 Internal Consistency (Alpha vs Split-Half)
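A brief simulation can make the X = T + e decomposition concrete. This is a sketch with assumed variances, not an example from the slides:

import numpy as np

# Simulate observed scores X = T + e for 1,000 examinees.
rng = np.random.default_rng(0)
T = rng.normal(50, 8, size=1000)  # true scores (SD = 8, so var = 64)
e = rng.normal(0, 4, size=1000)   # random error, independent of T (var = 16)
X = T + e                         # observed scores

# In CTT, reliability = var(T) / var(X); here about 64 / (64 + 16) = 0.80.
print(T.var() / X.var())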

(22)

 When a single form of a test is administered and then administered again after a period of time has passed, we can estimate test-retest reliability.

 Obviously, this testing approach permits us to see how stable scores are, or how much scores change over time (pre-test / post-test / opportunity to re-test).

 A simple Pearson product-moment correlation coefficient serves as the reliability of the test when test-retest reliability is studied.
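A sketch of that computation, using hypothetical scores for eight examinees tested twice:

import numpy as np

time1 = np.array([12, 15, 9, 18, 14, 11, 16, 13], dtype=float)
time2 = np.array([13, 14, 10, 17, 15, 10, 18, 12], dtype=float)

# The Pearson correlation between the two administrations is the
# test-retest reliability estimate.
r_test_retest = np.corrcoef(time1, time2)[0, 1]
print(round(r_test_retest, 3))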

(23)

 Sometimes duplicate forms are created to guard against cheating when some people will take the test on an alternate day. Another reason may be that one or more scores are invalidated (e.g., an unusual event interfered with the testing), so retesting is required.

 Alternate forms generally follow the same test outline and the same length, and are adjusted so that the forms have highly similar means and standard deviations for the resultant test scores.
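One quick check of that expectation is simply to compare the descriptive statistics of the two forms; the scores below are hypothetical:

import numpy as np

form_a = np.array([71, 64, 80, 58, 75, 69, 73, 66], dtype=float)
form_b = np.array([70, 66, 79, 60, 74, 67, 72, 68], dtype=float)

# Alternate forms should show highly similar means and standard deviations.
print(form_a.mean(), form_a.std(ddof=1))
print(form_b.mean(), form_b.std(ddof=1))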

(24)

 When two or more test forms are taken with little time between them, the resulting reliability coefficient reflects the degree to which content sampling impacts reliability.

 Where the time between administrations is more extensive, the resulting reliability coefficient reflects errors due to both temporal and content sampling differences.

(25)

 Inter-rater reliability is used when there are numerous raters whose consistency of ratings needs to be considered.

 Frankly, if one has two raters, they can be considered as parallel or alternate forms of each other.

 If there are more than two, an index such as Coefficient Alpha can be used, treating each rater as though he or she were a test item.
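For the two-rater case, the parallel-forms treatment amounts to correlating the raters; a sketch with hypothetical ratings of six responses:

import numpy as np

rater_a = np.array([4, 2, 5, 3, 1, 4], dtype=float)
rater_b = np.array([5, 3, 5, 3, 2, 4], dtype=float)

# Treat the two raters as parallel forms and correlate their scores.
r_inter_rater = np.corrcoef(rater_a, rater_b)[0, 1]
print(round(r_inter_rater, 3))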

(26)

 Sometimes it is only possible to administer the test to a group of clients once, regardless of the number of test forms available. In such instances, internal consistency approaches can be used to assess reliability.

 Internal consistency approaches cannot demonstrate score stability because they are almost certainly based on data collected in a single sitting/test administration. Nevertheless, they can indicate content sampling.

 We will discuss two approaches to calculating internal consistency: Split-Half and Coefficient Alpha.

(27)

Split-Half Reliability

This technique compares the subscores emerging from two halves of a test (odd / even).

The two halves are correlated with each other, and an adjustment known as the Spearman-Brown Prophecy Formula is made, given that the correlation between the two halves indicates the reliability of half the test rather than the test as a whole.

This technique was developed in a time before computerized test administration, the optical scanning of test forms, computers, and statistical software packages.

There is normally no reason to use it today; better internal consistency techniques exist.
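Even though the technique is mostly of historical interest, a compact Python sketch shows its mechanics – an odd/even split, a correlation, and the Spearman-Brown step-up; the examinees-by-items data layout is an assumption:

import numpy as np

def split_half_reliability(items):
    # items: an examinees x items matrix of scored responses.
    items = np.asarray(items, dtype=float)
    odd = items[:, 0::2].sum(axis=1)   # subscore on odd-numbered items
    even = items[:, 1::2].sum(axis=1)  # subscore on even-numbered items
    r_half = np.corrcoef(odd, even)[0, 1]
    # Spearman-Brown correction: step up from half-length to full length.
    return 2 * r_half / (1 + r_half)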

(28)

Coefficient Alpha

The most common reliability formulation today is Coefficient Alpha and its variations.

Coefficient alpha is said to be the mean of all possible split-half divisions of the test.

Essentially, this technique is based on only two factors: the strength of the correlations among the items composing the test and the number of items on the test.

With a range from 0 to 1, Cronbach’s alpha will generally increase when the correlations between the items increase. For this reason the coefficient measures the internal consistency of the test.

Typical advice proffered for increasing reliability is to lengthen the measure (more items) and to eliminate items that do not correlate with the others.
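A minimal implementation of this formulation, assuming an examinees-by-items matrix of scored responses (dichotomous or partial credit):

import numpy as np

def cronbach_alpha(items):
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                              # number of items
    item_variances = items.var(axis=0, ddof=1)      # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of total scores
    # alpha = k/(k-1) * (1 - sum of item variances / total score variance)
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)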

(29)

 Cronbach’s alpha (coefficient alpha) is rather versatile, so it can be used with dichotomous, continuous, and non-dichotomous data.

 Therefore it can be used for objective tests as well as for tests in which participants can earn partial credit, and for questionnaires using a Likert scale.

 Common interpretation of alpha:
Acceptable reliability = 0.7 or higher (some say 0.6)
Good reliability = 0.8 or higher
Very high reliability = 0.95 or higher (not necessarily desirable; it may be an indicator of redundant items)
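Those rule-of-thumb cut-offs can be written as a small helper; the labels follow the slide, though the exact boundaries vary by source:

def describe_alpha(alpha):
    if alpha >= 0.95:
        return "very high (check for redundant items)"
    if alpha >= 0.8:
        return "good"
    if alpha >= 0.7:
        return "acceptable"
    return "questionable"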

(30)

 Longer tests are more reliable
 The more components they have, the more reliable they become
 Develop and use rubrics for scoring subjective responses
 Make decisions about what should influence scoring
Do grammar and proper language use count?
Can one get partial credit for good work that leads to incorrect answers?

(31)

RELIABILITY


 Though calculating reliability is not a simple task for a non-psychometrician, it is not impossible with some guidance or coaching.

 The following slide is an example of how reliability can be calculated using Excel for an assessment with 8 participants.

 The calculations show differences for the assessment when there are 10 items, 15 items, and 20 items.

(32)
(33)

 Today, reliability is viewed as absolutely necessary but insufficient evidence in the validation of test scores.

 There are two reasons for this perception.

First, reliability sets limits on validity.

Second, newer views of validity include all results that bear on the interpretation of data, whether from the test itself or from other sources.
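The first reason can be stated precisely. A standard classical test theory result, not spelled out on the slide, bounds a validity coefficient by the reliabilities of the two measures involved:

r_{XY} \le \sqrt{r_{XX'} \, r_{YY'}}

So even against a perfectly measured criterion, a test with reliability 0.64 cannot show a validity coefficient above \sqrt{0.64} = 0.80.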

(34)

 Currently, testing and assessment experts acknowledge that validity is the most important characteristic to be evaluated in testing.

 A testing or an assessment is valid, not the test instrument itself. Tests produce scores or results, and it is the interpretations of these scores that need to be validated.

 Test scores or test results are valid to the extent that we have evidence to support a given use, during a given time period, and with a particular well-defined population.

 There is no extrapolation to other uses, times of use, or populations; additional validation research would be required.

(35)

 Validity is assessed through research.

Validation has become a more prominent word because it focuses upon the research that substantiates the evidential basis for test uses.

 Validation is ultimately the process of

accumulating empirical evidence as well as theoretical bases to support particular uses of tests. The process is cumulative and on-going.

 The primary focus of validation is the justification of interpretations for specific test scores and for their proposed uses. It involves the construction of arguments for particular interpretations of test scores and then subjecting these arguments to rigorous theoretical and empirical scrutiny.

(36)

 Rarely would research be done on a classroom examination.
 Does it measure the appropriate topics?
 Does it assess using the correct types of thinking?
 What is the role of memory? Of concept learning? Of problem solving?
 Do scores from the test lead to accurate conclusions about the students’ performance, knowledge, or capacity?

(37)

 Define the construct(s) intended to be measured so that their components are all assessed to the extent possible.

 Explicitly outline and describe the likely uses to which such a measure might be employed.

 Explicitly outline and describe the primary proposed interpretations for specific scores or test results that are likely to be

suggested.

 Evidence-based or hypothesized theory is often the starting point.

(38)

 Testing specialists talk about different kinds of evidence to justify valid test

interpretations and uses.

 Validation is ultimately the responsibility of both the test developer and the test user, which in the case of faculty members may be the same person or people in most

instances.

 The test publisher is primarily responsible for identifying the uses to which a test can be put and for providing initial evidence to demonstrate that the test aids in accomplishing that particular test use.

(39)

 Oscar K. Buros
 Buros Center for Testing
 The Mental Measurements Yearbooks
The “consumer report” of testing
The review process
How to locate test reviews
 Tests in Print
 Pruebas Publicadas en Español
 Assessment Literacy Series

(40)

Activity #3: Validity Evidence

Read an MMY review of the GRE

Compare notes with someone who read the other review

Discuss:

 What is the purpose and audience for the test?

 What is similar in the reviews?

 What is different in the reviews?

 What technical evidence is presented?

 Has the review influenced your perception?

 How can this model be applied to your

(41)

 Differences across pre-existing groups in performance that are unrelated to the group’s actual ability to perform
 It is not “mean differences”
 I am avoiding statistical definitions (often defined using regression analyses)

(42)

 What does fairness mean in Saudi Arabia?
 At King Fahd University?
 In the USA?
Racial and ethnic groups
Gender groups
People with disabilities
Immigrants, people differing by country of origin
Language minorities (sometimes)
 Not just differences in averages

(43)

 We all want our examination scores to be considered valid

 To be valid, they need to be both reliable and fair

 Reliability and fairness are components of validity

 There are things we can do to make our examinations more reliable, fair and valid

(44)

 For course exams, and maybe even program exams, it is highly unlikely that there would be the time and resource investment to create a test manual and conduct full-fledged validation studies.

 So, what tools or approaches are possible for you to use to evaluate your examinations at the course level or the program level?

 Why would you, or should you, invest time and effort evaluating the quality of your examinations?

(45)

 Test/Exam blueprint
 Scoring rubric
 Classical Test Theory statistics

Descriptive statistics
 Mean
 Mode & score distribution
 Variance
 Standard Deviation

Test data
 Item variance
 Reliability
 Standard Errors of Measurement (SEM)
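Of these, the SEM is the least familiar to non-psychometricians; it follows directly from the standard deviation and the reliability. A one-line sketch, with hypothetical values:

import math

def standard_error_of_measurement(sd, reliability):
    # SEM = SD * sqrt(1 - reliability): the expected spread of observed
    # scores around an examinee's true score.
    return sd * math.sqrt(1 - reliability)

print(standard_error_of_measurement(10, 0.84))  # SD = 10, alpha = 0.84 -> 4.0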

(46)

 How to avoid “halo”:
Read anonymously
Read one question at a time
Do not read a single paper through
Have rubrics or scoring protocols
 For higher-stakes tests, use multiple raters

(47)

 May provide printouts of test data
 Check the reliability

 Look for items or questions that do not correlate with others

 Look for answers that are different but come from successful students

(48)

 Conditions: environment, equipment and materials
 Test security
 Clarity of Directions / Instructions
 Time expectations and Task relevance
 Item prompts and what information they generate:
Objective
Subjective
 Formatting consistency: item length, item structure, language
 Rubrics
