Cohen Based Summary


CHAPTER 1: PSYCHOLOGICAL TESTING AND ASSESSMENT

1. DVD: how would you respond to the events that take place in the video?

a) sexual harassment in the workplace

b) responding to various types of emergencies

c) diagnosis/treatment plan for clients on videotape

2. Thermometers, biofeedback, etc.

TEST DEVELOPER

 They are the ones who create tests.

 They conceive, prepare, and develop tests. They also find a way to disseminate their tests, by publishing them either commercially or through professional publications such as books or periodicals.

TEST USER

 They select or decide to take a specific test off the shelf and use it for some purpose. They may also participate in other roles, e.g., as examiners or scorers.

TEST TAKER

 Anyone who is the subject of an assessment

 Test takers may vary on a continuum with respect to numerous variables, including:

o The amount of anxiety they experience and the degree to which test anxiety might affect the results

o The extent to which they understand and agree with the rationale of the assessment

o Their capacity and willingness to cooperate

o The amount of physical pain or emotional distress they are experiencing

o The amount of physical discomfort

o The extent to which they are alert and wide awake

o The extent to which they are predisposed to agreeing or disagreeing when presented with stimulus statements

o The extent to which they have received prior coaching

o The importance they may attribute to portraying themselves in a good light

 Psychological autopsy – reconstruction of a deceased individual’s psychological profile on the basis of archival records, artifacts, and interviews previously conducted with the deceased assessee

TYPES OF SETTINGS

 EDUCATIONAL SETTING

o achievement test: evaluation of accomplishments or the degree of learning that has taken place, usually with regard to an academic area.

o diagnosis: a description or conclusion reached on the basis of evidence and opinion through a process of distinguishing the nature of something and ruling out alternative conclusions.

o diagnostic test: a tool used to make a diagnosis, usually to identify areas of deficit to be targeted for intervention

o informal evaluation: a typically nonsystematic, relatively brief, and “off the record” assessment leading to the formation of an opinion or attitude, conducted by any person in any way for any reason, in an unofficial context and not subject to the same ethics or standards as evaluation by a professional

 CLINICAL SETTING

o these tools are used to help screen for or diagnose

behavior problems

o group testing is used primarily for screening: identifying

those individuals who require further diagnostic evaluation.

 COUNSELING SETTING

o schools, prisons, and governmental or privately owned institutions

o ultimate objective: the improvement of the assessee in

terms of adjustment, productivity, or some related variable.

 GERIATRIC SETTING

o quality of life: in psychological assessment, an evaluation of variables such as perceived stress, loneliness, sources of satisfaction, personal values, quality of living conditions, and quality of friendships and other social support.

 BUSINESS AND MILITARY SETTINGS

 GOVERNMENTAL AND ORGANIZATIONAL CREDENTIALING

How are Assessments Conducted?

 protocol: the form, sheet, or booklet on which a testtaker’s responses are entered.

o term might also be used to refer to a description of a set of test- or assessment-related procedures, as in the sentence, “the examiner dutifully followed the complete protocol for the stress interview”

 rapport: working relationship between the examiner and the examinee

ASSESSMENT OF PEOPLE WITH DISABILITIES

 Define who requires alternate assessment, how such assessments are to be conducted, and how meaningful inferences are to be drawn from the data derived from such assessments

 Accommodation – adaptation of a test, procedure, or situation, or the substitution of one test for another, to make the assessment more suitable for an assessee with exceptional needs.

 Ex.) Translate the test into Braille and administer it in that form.

 Alternate assessment – evaluative or diagnostic procedure or process that varies from the usual, customary, or standardized way a measurement is derived, either by virtue of some special accommodation made to the assessee or by means of alternative methods

 Consider these four variables when deciding which of the many different types of accommodation should be employed:

o The capabilities of the assessee

o The purpose of the assessment

o The meaning attached to test scores

o The capabilities of the assessor

REFERENCE SOURCES

 TEST CATALOGUES – contain brief descriptions of the test

 TEST MANUALS – detailed information

 REFERENCE VOLUMES – one-stop shopping; provide detailed information for each test listed, including test publisher, author, purpose, intended test population, and test administration time

 JOURNAL ARTICLES – contain reviews of the test

 ONLINE DATABASES – most widely used bibliographic databases

TYPES OF TESTS

 INDIVIDUAL TEST – those given to only one person at a time

 GROUP TEST – administered to more than one person at a time by a single examiner

 ABILITY TESTS:

o ACHIEVEMENT TESTS – refers to previous learning (ex.

Spelling)

o APTITUDE/PROGNOSTIC – refers to the potential for

learning or acquiring a specific skill

o INTELLIGENCE TESTS – refers to a person’s general

potential to solve problems

 PERSONALITY TESTS: refers to overt and covert dispositions

o OBJECTIVE/STRUCTURED TESTS – usually self-report,

require the subject to choose between two or more alternative responses

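o PROJECTIVE/UNSTRUCTURED TESTS – the stimulus (e.g., an inkblot or photo) is ambiguous and the response requirements are unclear; the individual is assumed to project his or her own needs, fears, hopes, and motivations onto it

 PSYCHOLOGICAL TESTING – refers to all possible uses, applications, and underlying concepts of psychological and educational tests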


CHAPTER 2: HISTORICAL, CULTURAL AND LEGAL/ETHICAL CONSIDERATIONS

A HISTORICAL PERSPECTIVE

ANTIQUITY TO THE 19TH CENTURY

 Tests and testing programs first came into being in China, as early as 2200 B.C.E.

 Testing was instituted as a means of selecting who, of many applicants, would obtain government jobs (civil service)

 Job applicants were tested on proficiency in endeavors such as music, archery, knowledge, and skill

GRECO-ROMAN WRITINGS (Middle Ages)

 World of evilness

 Deficiency in some bodily fluid as a factor believed to influence personality

 Hippocrates and Galen

RENAISSANCE

 Christian von Wolff – anticipated psychology as a science and psychological measurement as a specialty within that science

CHARLES DARWIN AND INDIVIDUAL DIFFERENCES

 Tests were designed to measure individual differences in ability and personality among people

 “Origin of Species”: chance variation in species would be selected or rejected by nature according to adaptivity and survival value (“survival of the fittest”)

FRANCIS GALTON

 Explored and quantified individual differences between people

 Classified people “according to their natural gifts”

 Displayed the first anthropometric laboratory

KARL PEARSON

 Developed the product-moment correlation technique

 His work can be traced directly to Galton

WILHELM MAX WUNDT

 First experimental psychology laboratory, at the University of Leipzig

 Focused more on how people were similar, not different from each other

JAMES MCKEEN CATTELL

 Individual differences in reaction time

 Coined the term mental test

CHARLES SPEARMAN

 Originated the concept of test reliability and built the mathematical framework for the statistical technique of factor analysis

VICTOR HENRI

 Frenchman who collaborated with Binet on papers suggesting how mental tests could be used to measure higher mental processes

EMIL KRAEPELIN

 Early experimenter with the word association technique as a formal test

LIGHTNER WITMER

 “Little-known founder of clinical psychology”

 Founded the first psychological clinic in the U.S.

PSYCHE CATTELL

 Daughter of James Cattell

 Cattell Infant Intelligence Scale (CIIS) & Measurement of Intelligence in Infants and Young Children

RAYMOND CATTELL

 Believed in the lexical approach to defining personality, which examines human languages for descriptors of personality dimensions

20TH CENTURY

- Birth of the first formal tests of intelligence

- Testing shifted to be of more understandable relevance/meaning

A. THE MEASUREMENT OF INTELLIGENCE

o Binet created the first intelligence test to identify mentally retarded schoolchildren in Paris (individual test)

o The Binet-Simon test has been revised repeatedly

o Group intelligence tests emerged with the need to screen the intellect of WWI recruits

o David Wechsler designed a test to measure adult intelligence

 For him, intelligence is the global capacity of the individual to act purposefully, to think rationally, and to deal effectively with his environment

 Wechsler-Bellevue Intelligence Scale

 Wechsler Adult Intelligence Scale – revised several times; the Wechsler series extended the age range of testtakers from young children through senior adulthood

B. THE MEASUREMENT OF PERSONALITY

o The field of psychology was too test oriented

o Clinical psychology was synonymous with mental testing

o ROBERT WOODWORTH developed a measure of adjustment and emotional stability that could be administered quickly and efficiently to groups of recruits

 To disguise the true purpose of the test, the questionnaire was labeled the Personal Data Sheet

 He called it the Woodworth Psychoneurotic Inventory, the first widely used self-report test of personality

o Self-report test:

 Advantages:

 Respondents are best qualified to report on themselves

 Disadvantages:

 Poor insight into self

 One might honestly believe something about self that isn’t true

 Unwillingness to report seemingly negative qualities

o Projective test: individual is assumed to project onto some ambiguous stimulus (inkblot, photo, etc.) his or her own unique needs, fears, hopes, and motivations

 Ex.) Rorschach inkblot

C. THE ACADEMIC AND APPLIED TRADITIONS

Culture and Assessment

Culture: ‘the socially transmitted behavior patterns, beliefs, and products of work of a particular population, community, or group of people’

Evolving Interest in Culture-Related Issues

Goddard tested immigrants and found most to be feebleminded

-invalid; overestimated mental deficiency, even in native English speakers

Led to nature-nurture debate about what intelligence tests actually measure

Needed to “isolate” the cultural variable

Culture-specific tests: tests designed for use with people from one culture, but not from another

-minorities still scored abnormally low

ex.) loaf of bread vs. tortillas

Today tests undergo many steps to ensure they are suitable for the nation in which they will be used

-take testtakers’ reactions into account

Some Issues Regarding Culture and Assessment

 Verbal Communication

o Examiner and examinee must speak the same language

o Especially tricky when infrequently used vocabulary or unusual idioms are employed

o Translator may lose nuances in translation or give unintentional hints toward the more desirable answer

o Also requires understanding of culture

 Nonverbal Communication and Behavior

o Different between cultures

o Ex.) meaning of not making eye contact

o Body movement could even have physical cause

o Psychoanalysis: Freud’s theory of personality and

psychological treatment which stated that symbolic significance is assigned to many nonverbal acts.

o Timed tests pose a problem in cultures not obsessed with speed

o Lack of speaking could be reverence for elders

 Standards of Evaluation

o Acceptable roles for women differ across cultures

o “judgments as to who might be the best employee, manager, or leader may differ as a function of culture, as might judgments regarding intelligence, wisdom, courage, and other psychological variables”


o must ask ‘how appropriate are the norms or other standards that will be used to make this evaluation’

Tests and Group Membership

 ex.) must be 5’4” to be police officer - excludes cultures with short stature

 ex.) Jewish lifestyle not well suited for corporate America

 affirmative action: voluntary and mandatory efforts to combat discrimination and promote equal opportunity in education and employment for all

 Psychology, tests, and public policy

Legal and Ethical Considerations

Code of professional ethics: defines the standard of care expected of members of a given profession.

The Concerns of the Public

 Beginning in World War I, fear that tests were only testing the ability to take tests

 Legislation

o Minimum competency testing programs: formal testing

programs designed to be used in decisions regarding various aspects of students’ educations

o Truth-in-testing legislation: state laws to provide testtakers with a means of learning the criteria by which they are being judged

 Litigation

o Daubert ruling made federal judges the gatekeepers to

determining what expert testimony is admitted

o This overrode the Frye policy which only admitted scientific

testimony that had won general acceptance in the scientific community.

The Concerns of the Profession

 Test-user qualifications

o Who should be allowed to use psych tests

o Level A: tests or aids that can adequately be administered,

scored, and interpreted with the aid of the manual and a general orientation to the kind of institution or organization in which one is working

o Level B: tests or aids that require some technical knowledge

of test construction and use and of supporting psychological and educational fields

o Level C: tests and aids requiring substantial understanding of testing and supporting psychological fields, together with supervised experience in the use of these devices

 Testing people with disabilities

o Difficulty in transforming the test into a form that can be taken by the testtaker

o Transferring responses into a scorable form

o Meaningfully interpreting the test data

 Computerized test administration, scoring, and interpretation

o simple, convenient

o easily copied, duplicated

o insufficient research to compare it to pencil-and-paper

versions

o value of computer interpretation is questionable

o unprofessional, unregulated “psychological testing” online

The Rights of Testtakers

 the right of informed consent

o right to know why they are being evaluated, how test data

will be used and what information will be released to whom

o may be obtained from a parent or legal representative

o must be in written form and state:

 general purpose of the testing

 the specific reason it is being undertaken

 general type of instruments to be administered

o revealing this information before the test can contaminate the results

o deception only used if absolutely necessary

o don’t use deception if it will cause emotional distress

o fully debrief participants

 The right to be informed of test findings

o Formerly, test administrators were told to give participants only positive information

o No realistic information was required

o The advice was to tell test takers as little as possible about the nature of their performance on a particular test, so that the examinee would leave the test session feeling pleased and satisfied

o Test takers have the right also to know what

recommendations are being made as a consequence of the test data

 The right to privacy and confidentiality

o Privacy right: “recognizes the freedom of the individual to pick and choose for himself the time, circumstances, and particularly the extent to which he wishes to share or withhold from others his attitudes, beliefs, behaviors, and opinions”

o Privileged information: information protected by law from

being disclosed in legal proceeding. Protects clients from disclosure in judicial proceedings. Privilege belongs to the client not the psychologist.

o Confidentiality: concerns matters of communication

outside the courtroom

 Safekeeping of test data: It is not a good policy to maintain all records in perpetuity

 The right to the least stigmatizing label

o The Standards advise that the least stigmatizing labels should always be assigned when reporting test results


CHAPTER 3: A STATISTICS REFRESHER

Why We Need Statistics

- Statistics are important for purposes of education

o Numbers provide convenient summaries and allow us to

evaluate some observations relative to others

- We use statistics to make inferences, which are logical deductions

about events that cannot be observed directly

o Detective work of gathering and displaying clues –

exploratory data analysis

o Then confirmatory data analysis

- Descriptive statistics are methods used to provide a concise

description of a collection of quantitative information

- Inferential statistics are methods used to make inferences from

observations of a small group of people known as a sample to a larger group of individuals known as a population

SCALES OF MEASUREMENT

 MEASUREMENT – act of assigning numbers or symbols to characteristics of things according to rules. The rules serve as a guideline for representing the magnitude. It always involves error.

 SCALE – set of numbers whose properties model empirical properties of the objects to which the numbers are assigned.

 CONTINUOUS SCALE – interval/ratio. A scale used to measure a continuous variable. Always involves error

 DISCRETE SCALE – nominal/ordinal; used to measure a discrete variable (ex. female or male)

 ERROR – collective influence of all of the factors on a test score beyond those specifically measured by the test.

PROPERTIES OF SCALES

- Magnitude, equal intervals, and an absolute 0

Magnitude

- The property of “moreness”

- A scale has the property of magnitude if we can say that a particular

instance of the attribute represents more, less, or equal amounts of the given quantity than does another instance

Equal Intervals

- A scale has the property of equal intervals if the difference between

two points at any place on the scale has the same meaning as the difference between two other points that differ by the same number of scale units

- A psychological test rarely has the property of equal intervals

- When a scale has the property of equal intervals, the relationship between the measured units and some outcome can be described by a straight line or a linear equation in the form Y=a+bX

o Shows that an increase in equal units on a given scale reflects equal increases in the meaningful correlates of units

Absolute 0

- An Absolute 0 is obtained when nothing of the property being

measured exists

- This is extremely difficult/impossible for many psychological qualities

NOMINAL SCALE

 Simplest form of measurement

 Classification or categorization

 The only arithmetic that can legitimately be performed with nominal data is counting the cases in each category (and the resulting proportions or percentages)

 Ex.) Male or female

 Also includes test items

o Ex.) yes/no responses

ORDINAL SCALE

 Classifies in some kind of ranking order

 Individuals compared to others and assigned a rank

 Implies nothing about how much greater one ranking is than another

 Numbers/ranks do not indicate units of measure

 No absolute zero point

 Binet: believed that data derived from intelligence test are ordinal in nature

INTERVAL SCALE

 In addition to the features of nominal and ordinal scales, contains equal intervals between numbers

 No absolute zero point

 Can take average

RATIO SCALE

 In addition to all the properties of nominal, ordinal, and interval measurement, a ratio scale has a true zero point

 Equal intervals between numbers

 Ex.) measuring amount of pressure hand can exert

 True zero doesn’t mean someone will receive a score of 0, but means that 0 has meaning

NOTE:

Permissible Operations

- Level of measurement is important because it defines which

mathematical operations we can apply to numerical data

- For nominal data, each observation can be placed in only one

mutually exclusive category

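- Ordinal measurements can be manipulated using arithmetic, but the result is often difficult to interpret because ranks do not reflect the magnitudes of the property being measured

- With interval data, one can apply any arithmetic operation to the differences between scores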

o Cannot be used to make statements about ratios
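The point about interval data can be checked with a small sketch. Temperature is used here only as a familiar interval scale (it is an assumed illustration, not an example from these notes): the ratio of two readings changes when the unit changes, while the ratio of two differences does not.

```python
def celsius_to_fahrenheit(c: float) -> float:
    """Convert an interval-scale reading to another interval scale."""
    return c * 9 / 5 + 32

# Three hypothetical readings in Celsius and the same readings in Fahrenheit.
a_c, b_c, c_c = 10.0, 20.0, 40.0
a_f, b_f, c_f = (celsius_to_fahrenheit(x) for x in (a_c, b_c, c_c))

# Ratios of scores are NOT preserved: "twice as hot" depends on the unit.
ratio_c = b_c / a_c          # 2.0
ratio_f = b_f / a_f          # 68 / 50 = 1.36

# Ratios of differences ARE preserved, so arithmetic on differences is safe.
diff_ratio_c = (c_c - b_c) / (b_c - a_c)   # 20 / 10 = 2.0
diff_ratio_f = (c_f - b_f) / (b_f - a_f)   # 36 / 18 = 2.0

print(ratio_c, ratio_f, diff_ratio_c, diff_ratio_f)
```

Because neither temperature scale has an absolute zero, only the statements about differences survive the change of unit.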

DESCRIBING DATA

 Distribution: set of scores arrayed for recording or study

 Raw Score: straightforward, unmodified accounting of performance, usually numerical

Frequency Distributions

 Frequency Distribution: All scores listed alongside the number of times each score occurred

 Grouped Frequency Distribution: test-score intervals (class intervals) replace the actual test scores

o Highest and lowest class intervals = upper and lower limits of distribution

 Histogram: graph with vertical lines drawn at the true limits of each test score (or class interval), forming TOUCHING rectangles; midpoint in center of bar

 Bar Graph: rectangles DON’T touch

 Frequency Polygon: data illustrated with a continuous line connecting the points where test scores or class intervals meet frequencies

 A single test score means more if one relates it to other test scores

 A distribution of scores summarizes the scores for a group of individuals
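As a rough illustration of the simple and grouped frequency distributions described above, the sketch below tallies a small set of made-up raw scores. The scores, the interval width of 10, and the helper name `grouped_frequency` are all assumptions chosen for the example.

```python
from collections import Counter

# Hypothetical raw scores for ten testtakers.
scores = [52, 67, 67, 71, 74, 74, 74, 80, 88, 95]

# Simple frequency distribution: each score next to the number of times it occurred.
freq = Counter(scores)

def grouped_frequency(values, low, width, n_intervals):
    """Count values falling in consecutive class intervals of equal width."""
    counts = [0] * n_intervals
    for v in values:
        idx = (v - low) // width
        if 0 <= idx < n_intervals:
            counts[idx] += 1
    # Report each interval by its lowest and highest whole-number score.
    return [((low + i * width, low + (i + 1) * width - 1), c)
            for i, c in enumerate(counts)]

grouped = grouped_frequency(scores, low=50, width=10, n_intervals=5)

# A crude text histogram: one '#' per testtaker in the class interval.
for (lo, hi), count in grouped:
    print(f"{lo}-{hi}: {'#' * count}")
```

Changing `width` shows why the choice of class interval matters: wider intervals hide detail, narrower ones leave many intervals nearly empty.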

 Frequency distribution: displays scores on a variable or a measure to reflect how frequently each value was obtained

o One defines all the possible scores and determines how many people obtained each of those scores

 Income is an example of a variable that has a positive skew

 Whenever you draw a frequency distribution or a frequency polygon, you must decide on the width of the class interval

 Class interval: the unit on the horizontal axis (for example, inches of rainfall)

Measures of Central Tendency

 Measure of central tendency: statistic that indicates the average or midmost score between the extreme scores in a distribution.

 The Arithmetic Mean

o “X bar”

o sum of observations divided by number of observations

o X̄ = ΣX / n

o Used for interval or ratio data when distributions are relatively normal

 The Median

o The middle score

o Used for ordinal, interval, and ratio data

o Especially useful when few scores fall at extremes

 The Mode

o Most frequently-occurring score

o Bimodal distribution: 2 scores both have the highest frequency

o The only measure of central tendency appropriate for nominal data
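The three measures of central tendency can be compared on one small data set. This is a minimal sketch using Python's standard `statistics` module; the scores are invented, with one extreme value included to show how the mean and the median can differ.

```python
import statistics

# Hypothetical scores; the 40 is a deliberately extreme value.
scores = [2, 3, 3, 4, 5, 6, 40]

mean = statistics.mean(scores)      # sum of observations / number of observations
median = statistics.median(scores)  # the middle score once the data are ordered
mode = statistics.mode(scores)      # the most frequently occurring score

# A bimodal set: two scores share the highest frequency.
modes = statistics.multimode([1, 1, 2, 3, 3])

print(mean, median, mode, modes)
```

Here the single extreme score pulls the mean (9) well above the median (4), which is why the median is often preferred for skewed distributions.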


 Variability: indication of how scores in a distribution are scattered or dispersed

 The Range

o Difference between the highest and lowest scores

o Quick but gross description of the spread of scores

 The interquartile and semi-interquartile range

o Distribution is split up by 3 quartiles, thus making 4 quarters each representing 25% of the scores

o Q2 = median

o Interquartile range: measure of variability equal to the difference between Q3 and Q1

o Semi-interquartile range: interquartile range divided by 2

 Quartiles and Deciles

o Quartiles are points that divide the frequency distribution

into equal fourths

o First quartile is the 25th percentile; second quartile is the median, or 50th percentile; third quartile is the 75th percentile

o The interquartile range is bounded by the range of scores that represents the middle 50% of the distribution

o Deciles are similar but use points that mark 10% rather than 25% intervals

o Stanine system: converts any set of scores into a transformed scale, which ranges from 1 to 9

 The average deviation

o x = X − mean (the deviation score)

o Average deviation = (sum of the absolute values of all deviation scores) / total number of scores

o Tells us on average how far scores are from the mean

 The Standard Deviation

o Similar to average deviation

o But in order to overcome the (+/-) problem, each deviation is squared

o Standard deviation: a measure of variability equal to the square root of the average squared deviations about the mean

o Is the square root of the variance

o Variance: the mean of the squares of the differences between the scores in a distribution and their mean

 Found by squaring and summing all the deviation scores and then dividing by the total number of scores

o s = sample standard deviation

o sigma = population standard deviation
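The measures of variability listed above can be computed side by side. The following sketch uses invented scores and divides by the total number of scores, as the notes describe. The quartiles use the "inclusive" method of Python's `statistics.quantiles`, which is one of several conventions, so other textbooks or software may give slightly different quartile values.

```python
import statistics

# Hypothetical scores chosen so the arithmetic is easy to follow.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(scores)
mean = sum(scores) / n                                    # 5.0

score_range = max(scores) - min(scores)                   # highest minus lowest

# Deviation scores x = X - mean always sum to zero, hence the absolute values.
deviations = [x - mean for x in scores]
average_deviation = sum(abs(d) for d in deviations) / n

# Variance: mean of the squared deviations; standard deviation: its square root.
variance = sum(d ** 2 for d in deviations) / n
sd = variance ** 0.5

# Quartiles, interquartile range, and semi-interquartile range.
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1
semi_iqr = iqr / 2

print(score_range, average_deviation, variance, sd, (q1, q2, q3), iqr, semi_iqr)
```

`statistics.pstdev(scores)` gives the same standard deviation as `sd` because it also divides by n; `statistics.stdev(scores)` divides by n − 1 and is the usual choice when the scores are a sample used to estimate a population value.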

Skewness

 skewness: nature and extent to which symmetry is absent

 POSITIVE SKEW: relatively few scores fall at the high end. Ex.) test was too hard

 NEGATIVE SKEW: relatively few scores fall at the low end. Ex.) test was too easy

 can be gauged by examining the relative distances of quartiles from the median
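One way to read the quartile rule is that in a positively skewed distribution Q3 lies farther from the median than Q1 does, and the reverse holds for negative skew. The sketch below applies that comparison to invented scores; the function name `quartile_skew_direction` is hypothetical, and the quartiles again use the "inclusive" convention of `statistics.quantiles`.

```python
import statistics

def quartile_skew_direction(scores):
    """Gauge the direction of skew from the quartiles' distances to the median."""
    q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    upper, lower = q3 - q2, q2 - q1
    if upper > lower:
        return "positive"
    if upper < lower:
        return "negative"
    return "symmetric"

# Hypothetical results: a test that was too hard (most scores low, a few high).
hard_test = [10, 12, 13, 15, 18, 25, 40, 70, 95]
# Mirror image: a test that was too easy (most scores high, a few low).
easy_test = [100 - x for x in hard_test]
evenly_spread = [1, 2, 3, 4, 5, 6, 7, 8, 9]

print(quartile_skew_direction(hard_test))      # positive
print(quartile_skew_direction(easy_test))      # negative
print(quartile_skew_direction(evenly_spread))  # symmetric
```

This is only a rough gauge of direction; it does not quantify how strongly a distribution is skewed.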

Kurtosis

 steepness of a distribution in its center

 platykurtic: relatively flat

 leptokurtic: relatively peaked

 mesokurtic: somewhere in the middle

The Normal Curve

Normal curve: bell-shaped, smooth, mathematically defined curve, highest at center; both sides taper as it approaches the x-axis asymptotically

-symmetrical, and thus the mean, median, and mode are the same

Area under the Normal Curve

Tails and body

Standard Scores

Standard Score: raw score that has been converted from one scale to another scale, where the latter has an arbitrarily set mean and standard deviation

-used for comparison

Z-score

 conversion of a raw score into a number indicating how many standard deviation units the raw score is below or above the mean of the distribution.

 The difference between a particular raw score and the mean divided by the standard deviation

 Used to compare test scores with different scales

T-score

 Standard score system composed of a scale that ranges from 5 standard deviations below the mean to 5 standard deviations above the mean, with the mean set at 50 and the standard deviation at 10

 No negatives

Other Standard Scores

 SAT

 GRE

 Linear transformation: when a standard score retains a direct numerical relationship to the original raw score

 Nonlinear transformation: required when data are not normally distributed, yet comparisons with normal distributions need to be made

o Normalized Standard Scores

 When scores don’t fall on a normal distribution

 “normalizing a distribution involves ‘stretching’ the skewed curve into the shape of a normal curve and creating a corresponding scale of standard scores, a scale called a normalized standard score scale”
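The standard-score conversions in this section can be sketched in a few lines. The raw scores are invented, the T-score uses the conventional mean of 50 and standard deviation of 10, and the normalized score is only one possible approach (an assumption for illustration): it finds a score's percentile using a midpoint convention and looks up the matching point on the normal curve.

```python
from statistics import NormalDist, mean, pstdev

# Hypothetical raw scores with mean 5 and standard deviation 2.
raw_scores = [2, 4, 4, 4, 5, 5, 7, 9]
m = mean(raw_scores)
sd = pstdev(raw_scores)

def z_score(x, mean_, sd_):
    """Linear transformation: distance from the mean in standard deviation units."""
    return (x - mean_) / sd_

def t_score(z):
    """Linear transformation of z onto a scale with mean 50 and SD 10."""
    return 50 + 10 * z

def normalized_z(x, scores):
    """Nonlinear transformation: percentile of x mapped onto the normal curve.

    Assumes x is one of the observed scores, so the percentile is strictly
    between 0 and 1.
    """
    below = sum(1 for s in scores if s < x)
    equal = sum(1 for s in scores if s == x)
    percentile = (below + 0.5 * equal) / len(scores)
    return NormalDist().inv_cdf(percentile)

z = z_score(9, m, sd)   # (9 - 5) / 2 = 2.0
t = t_score(z)          # 50 + 10 * 2 = 70.0
print(z, t, round(normalized_z(9, raw_scores), 2))
```

The linear z and T scores keep the shape of the original distribution, while the normalized score reshapes it, which is why the two z values for the same raw score need not agree.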


CHAPTER 4: OF TESTS AND TESTING

Some Assumptions About Psychological Testing and Assessment

- Assumption 1: Psychological Traits and States Exist

o Trait: any distinguishable, relatively enduring way in which one

individual varies from another

o States: distinguish one person from another but are relatively

less enduring

 The trait term that an observer applies, as well as the strength or magnitude of the trait presumed present, is based on observing a sample of behavior

o Trait and state definitions also refer to individual variation: comparisons are made with respect to the hypothetical average person

o Samples of behavior:

 Direct observation

 Analysis of self-report statements

 Paper-and-pencil test answers

o Psychological trait covers a wide range of possible characteristics; ex:

 Intelligence

 Specific intellectual abilities

 Cognitive style

 Psychopathology

o Controversy regarding how psychological traits exist

 Psychological traits exist only as constructs: informed, scientific concepts developed or constructed to describe or explain behavior

 Can’t see, hear, or touch constructs; their existence is inferred from overt behavior: an observable action or the product of an observable action, including test- or assessment-related responses

o Traits not expected to be manifested in behavior 100% of the

time

 There seems to be rank-order stability in personality traits: relatively high correlations between trait scores at different time points

o Whether and to what degree a trait manifests itself is dependent on the strength and nature of the situation

- Assumption 2: Psychological Traits and States Can Be Quantified and Measured

o Once it is acknowledged that psychological traits and states do exist, the specific traits and states to be measured need to be defined

 What types of behaviors are assumed to be indicative of the trait?

 Test developer has to provide test users with a clear operational definition of the construct under study

o After being defined, test developer considers types of item

content that would provide insight into it

 Ex: behaviors that are indicative of a particular trait

o Should all questions be weighted the same?

 Weighting the comparative value of a test’s items comes about as the result of a complex interplay among many factors:

 Technical considerations

 The way a construct has been defined (for particular test)

 Value society (and test developer) attach to behaviors evaluated

o Need to find appropriate ways to score the test and interpret

results

 Cumulative scoring: a test score is presumed to represent the strength of the targeted ability, trait, or state

 The more the testtaker responds in a particular direction (as keyed by the test manual), the more of the targeted trait or ability the testtaker is presumed to possess
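Cumulative scoring with equally weighted items can be sketched as a count of responses that match the scoring key. The item names, the key, and the function `cumulative_score` are hypothetical and stand in for whatever a real test manual specifies.

```python
def cumulative_score(responses, key):
    """Count the responses given in the keyed direction (one point per item)."""
    return sum(1 for item, answer in responses.items() if key.get(item) == answer)

# Hypothetical scoring key: the response that indicates the targeted trait.
key = {"q1": True, "q2": False, "q3": True, "q4": True}

# One testtaker's responses to the same four items.
responses = {"q1": True, "q2": True, "q3": True, "q4": False}

score = cumulative_score(responses, key)
print(score)  # 2 of the 4 responses are in the keyed direction
```

Weighting items differently would replace the constant 1 with a per-item weight, which is the decision raised under "Should all questions be weighted the same?".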

- Assumption 3: Test-Related Behavior Predicts Non-Test-Related Behavior

o Objective of test is to provide some indication of some aspects

of the examinee’s behavior

 Tasks on some tests mimic the actual behaviors that the test user is attempting to understand

o Obtained behavior is usually used to predict future behavior

o Could also be used to postdict behavior, i.e., to aid in the understanding of behavior that has already taken place

o Tools of assessment, such as a diary or case history data, might be of great value in such an evaluation

- Assumption 4: Tests and Other Measurement Techniques Have Strengths and Weaknesses

o Competent test users understand a lot about the tests they use

 How it was developed

 Circumstances under which it is appropriate to administer the test

 How the test should be administered and to whom

 How results should be interpreted

o Understand and appreciate the limitations of the tests they use

- Assumption 5: Various Sources of Error Are Part of the Assessment Process

o Everyday error = mistakes and miscalculations

o Assessment error= a long-standing assumption that factors

other than what a test attempts to measure will influence performance on a test

o Error variance: component of a test score attributable to

sources other than the trait or ability measured

 Assessees themselves are sources of error variance

o Classical test theory (CTT)/ True score theory: assumption is

made that each testtaker has a true score on a test that would be obtained but for the action of measurement error
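The true score idea is often written as observed score = true score + error. The simulation below is a sketch under assumed conditions that are not stated in these notes: a fixed true score of 100 and random, normally distributed error with a standard deviation of 5. It shows individual observed scores scattering around the true score while their average stays close to it.

```python
import random

rng = random.Random(0)      # fixed seed so the run is repeatable

true_score = 100.0          # assumed true score for one hypothetical testtaker
error_sd = 5.0              # assumed spread of measurement error

# Each simulated administration yields observed = true + error.
observed = [true_score + rng.gauss(0, error_sd) for _ in range(10_000)]

mean_observed = sum(observed) / len(observed)
spread = max(observed) - min(observed)

print(round(mean_observed, 2), round(spread, 2))
```

In practice a testtaker cannot be retested thousands of times without practice or fatigue effects, so the true score remains a theoretical quantity rather than something observed directly.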

- Assumption 6: Testing and Assessment Can Be Conducted in a Fair and Unbiased Manner

o Court challenges to various tests and testing programs have sensitized test developers and users to the societal demand for fair tests used in a fair manner

 Publishers strive to develop instruments that are fair when used in strict accordance with guidelines in the test manual

o Fairness related problems/questions:

 Testtaker’s culture differs from that of the people for whom the test was intended

 Politics

- Assumption 7: Testing and Assessment Benefit Society

o Many critical decisions are based on testing and assessment

procedures

WHAT’S A “GOOD TEST”?

- Criteria

o Clear instruction for administration, scoring, and interpretation

- Reliability

o A “good test”/measuring tool is reliable

 Involves consistency: the precision with which the test measures and the extent to which error is present in measurements

 Unreliable measurement needs to be avoided

- Validity

o Test is considered valid if it does indeed measure what it purports to measure

o If there is controversy over the definition of a construct then the

validity is sure to be criticized as well

o Questions regarding validity focus on the items that collectively

make up the test

 Adequately sample range of areas to measure construct

 Individual items contribute to or take away from test’s validity

o Validity may also be questioned on grounds related to the interpretation of test results

- Other Considerations

o “Good test”: one that trained examiners can administer, score, and interpret with minimum difficulty

 Useful

 Yields actionable results that will ultimately benefit individual testtakers or society at large


o Purpose of the test is to compare the performance of a testtaker with the performance of other testtakers (test contains adequate norms: normative data)

 Normative data provide a standard with which measured results can be compared
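Comparing one testtaker with normative data can be as simple as asking what percentage of the normative sample scored lower. The sketch below uses an invented normative sample and a hypothetical helper, `percentile_rank`, that counts only scores strictly below the testtaker's score; published tests may define percentiles slightly differently.

```python
def percentile_rank(score, normative_scores):
    """Percentage of the normative sample scoring below the given score."""
    below = sum(1 for s in normative_scores if s < score)
    return 100 * below / len(normative_scores)

# Hypothetical normative sample of ten raw scores.
norm_sample = [35, 40, 42, 47, 50, 50, 53, 58, 61, 70]

rank = percentile_rank(53, norm_sample)
print(rank)  # 60.0: six of the ten normative scores fall below 53
```

The same raw score of 53 would earn a different percentile rank against a different normative sample, which is why the meaning of a score is relative to the norm group.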

NORMS

- Norm-referenced testing and assessment: method of evaluation and a way of deriving meaning from test scores by evaluating an individual testtaker’s score and comparing it to scores of a group of testtakers

- Meaning of individual score is relative to other scores on the same test

- Norms (scholarly context): usual, average, normal, standard, expected or typical

- Norms (psychometric context): the test performance data of a particular group of testtakers that are designed for use as a reference when evaluating or interpreting individual test scores

- Normative sample: group of people whose performance on a particular test is analyzed for reference in evaluating the performance of individual testtakers

o Yields a distribution of scores

- Norming: refers to the process of deriving norms; particular type of norm derivation

o Race norming: controversial practice of norming on the

basis of race or ethnic background

- Norming a test can be very expensive

- User norms/program norms: consist of descriptive statistics based on a group of testtakers in a given period of time rather than norms obtained by formal sampling methods

- Sampling to Develop Norms

- Standardization: process of administering a test to a representative sample of testtakers for the purpose of establishing norms

o Standardized when has clear, specified procedures

- Sampling

o Developer targets defined group as population test

designed for

 All have at least one common, observable characteristic

o To obtain distribution of scores:

 Test administered to everyone in targeted population

 Administer test to a sample of the population

 Sample: portion of universe of people deemed to be representative of whole population

 Sampling: process of selecting the portion of universe deemed to be representative of whole

o Subgroups within a defined population may differ with

respect to some characteristics and it is sometimes essential to have these differences proportionately represented in sample

 Stratified sampling: sample reflects statistics of whole population; helps prevent sampling bias and ultimately aids in interpretation of findings

 Purposive sampling: arbitrarily select a sample we believe to be representative of population

 Incidental/convenience sampling: sample that is convenient or available for use

 Very exclusive (contain exclusionary criteria)
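A minimal sketch of how a stratified sample can keep subgroups proportionately represented. The stratum names, the population counts, and the helper name `proportional_allocation` are hypothetical, chosen only for illustration.

```python
def proportional_allocation(stratum_sizes, sample_size):
    """Decide how many people to draw from each stratum so that the
    sample mirrors the population proportions."""
    total = sum(stratum_sizes.values())
    # Exact (possibly fractional) share of the sample for each stratum.
    exact = {name: sample_size * n / total for name, n in stratum_sizes.items()}
    # Start from the whole-number part of each share.
    alloc = {name: int(share) for name, share in exact.items()}
    leftover = sample_size - sum(alloc.values())
    # Give the remaining places to the strata with the largest fractional parts.
    by_fraction = sorted(exact, key=lambda name: exact[name] - alloc[name], reverse=True)
    for name in by_fraction[:leftover]:
        alloc[name] += 1
    return alloc


# Hypothetical population of 10,000 people split into three strata.
population = {"urban": 6000, "suburban": 3000, "rural": 1000}
print(proportional_allocation(population, 200))
```

With these made-up counts, 60% of the sample comes from the stratum that makes up 60% of the population, and so on for the others.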

- TYPES OF STANDARD ERROR:

o STANDARD ERROR OF MEASUREMENT – estimate the

extent to which an observed score deviates from a true score

o STANDARD ERROR OF ESTIMATE – In regression, an

estimate of the degree of error involved in predicting the value of one variable from another

o STANDARD ERROR OF THE MEAN – a measure of sampling

error

o STANDARD ERROR OF THE DIFFERENCE – estimate how large a difference between two scores should be before the difference is considered statistically significant

- Developing norms for a standardized test

o Establishing a standard set of instructions and conditions under which the test is given makes scores of normative sample more comparable with scores of future testtakers

o Once all data are collected and analyzed, test developer will summarize data using descriptive statistics (measures of central tendency and variability)

 Test developer needs to provide precise description of standardization sample itself

 Descriptions of normative samples vary widely in detail

Tracking

- Comparisons are usually with people of the same age

- Children at the same age level tend to go through different growth

patterns

- Pediatricians must know the child’s percentile within a given age

group

- This tendency to stay at about the same level relative to one’s peers is

known as tracking (ie height and weight)

- Diets may alter this “track”

- Faults: some believe there is an analogy between the rates of physical

growth and the rates of intellectual growth

o Some say that children learn at different rates

o This system discriminates against some children

TYPES OF NORMS

o Classification of norms ex: age, grade, national, local,

percentile, etc.

o PERCENTILES

 Median = 2nd quartile: the point at or below which 50% of the scores fell and above which the remaining 50% fell

 Might wish to divide distribution of scores into deciles (instead of quartiles): 10 equal parts

 The Xth percentile is equal to the score at or below which X% of scores fall

 Percentile: an expression of the percentage of people whose score on a test or measure falls below a particular raw score

 Percentage correct: the number of items answered correctly, multiplied by 100 and divided by the total number of items *not the same as percentile

 Percentile is a converted score that refers to a percentage of testtakers

 Percentiles are easily calculated and are a popular way of organizing test-related data

 With a normal distribution, real differences between raw scores may be minimized near the ends of the distribution and exaggerated in the middle (worsens with highly skewed data)
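The difference between percentage correct and percentile can be sketched as follows. The norm-group scores are invented, and the function names are hypothetical. The percentile here follows the "percentage of people whose score falls below a raw score" definition; other conventions count scores at or below.

```python
def percentage_correct(items_correct, total_items):
    """Share of the test's items answered correctly (refers to items)."""
    return 100 * items_correct / total_items


def percentile_rank(score, norm_scores):
    """Percentage of the norm group scoring below `score` (refers to people)."""
    below = sum(1 for s in norm_scores if s < score)
    return 100 * below / len(norm_scores)


# Hypothetical raw scores of a 10-person normative sample on a 40-item test.
norm_scores = [12, 15, 15, 18, 20, 22, 25, 27, 28, 30]

print(percentage_correct(27, 40))        # 67.5
print(percentile_rank(27, norm_scores))  # 70.0
```

The same raw score of 27 gives two different numbers because one compares it with the number of items and the other compares it with other testtakers.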

o AGE NORMS

 Age-equivalent scores/age norms: indicate the average performance of different samples of testtakers who were at various ages at the time the test was administered

 Age norm tables for physical characteristics

 “Mental” age vs. physical age (need to identify mental age)

o GRADE NORMS

 Grade norms: designed to indicate the average test performance of testtakers in a given school grade

 Developed by administering the test to representative samples of children over a range of consecutive grades

 Mean or median score for children at each grade level is calculated

 Great intuitive appeal

 Do not provide info as to the content or type of items that a student could or could not answer correctly

 Developmental norms: (ex: grade norms and age norms) term applied broadly to norms developed on the basis of any trait, ability, skill, or other

characteristic that is presumed to develop, deteriorate, or otherwise be affected by chronological age, school grade, or stage of life

o NATIONAL NORMS

 National norms: derived from a normative sample that was nationally representative of the population at the time the norming study was conducted

o NATIONAL ANCHOR NORMS

 Many different tests purporting to measure the same human characteristics or abilities

 National anchor norms: equivalency tables for scores on tests that purport to measure the same thing

 Could provide the tool for comparisons

 Provides stability to test scores by anchoring them to other test scores

 Begins with the computation of percentile norms for each test to be compared

 Equipercentile method: equivalency of scores on different tests is calculated with reference to corresponding percentile scores

o SUBGROUP NORMS

 Normative sample can be segmented by any criteria initially used in selecting subjects for sample

 Subgroup norms: result of segmentation; more narrowly defined

o LOCAL NORMS

 Local norms: provide normative info with respect to the local population’s performance on some test

 Typically developed by test users themselves

- Fixed Reference Group Scoring Systems

o Norms provide context for interpreting meaning of a test score

o Fixed reference group scoring system: distribution of scores obtained on the test from one group of testtakers (fixed reference group) is used as the basis for the calculation of test scores for future administrations of the test

 Ex: SAT (first administered in 1926)

NORM-REFERENCED VERSUS CRITERION-REFERENCED EVALUATION

- One way to derive meaning from a test score is to evaluate it in relation to other scores on the same test (norm-referenced)

- Criterion-referenced: derive meaning from a test score by evaluating it on the basis of whether or not some criterion has been met

o Criterion: a standard on which a judgment or decision may

be based

- Criterion-referenced testing and assessment: method of evaluation and way of deriving meaning from test scores by evaluating an individual’s score with reference to a set standard (ex: to drive, one must pass a driving test)

o Derives from values and standards of an individual or

organization

o Also called Domain/content-referenced testing and

assessment

o Critique: if followed strictly, important info about

individual’s performance relative to others can be potentially lost

Culture and Inference

- Culture is a factor in test administration, scoring and interpretation

- Test user should do research in advance on test’s available norms to check how appropriate they are for targeted testtaker population

o Helpful to know about the culture of the testtaker

CORRELATION AND INFERENCE

CORRELATION

 Degree and direction of correspondence between two things.

 Correlation coefficient (r) – expresses a linear relationship between two continuous variables

o Numerical index that tells us the extent to which X and Y

are “co-related”

 Positive correlation: high scores on Y are associated with high scores on X, and low scores on Y correspond to low scores on X

 Negative correlation: higher scores on Y are associated with lower scores on X, and vice versa

 No correlation: the variables are not related

 Range: -1 to 1

 Correlation does not imply causation.

o Ie weight, height, intelligence

PEARSON r

 Pearson Product Moment Correlation Coefficient

 Devised by Karl Pearson

 Used when the relationship between the two variables is linear and both variables are continuous

 Coefficient of Determination (r²) – indication of how much variance is shared by the X and the Y variables
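A short sketch of Pearson r and the coefficient of determination, computed from deviations around the means. The five pairs of scores are invented for illustration, and `pearson_r` is a hypothetical helper name.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of cross-products and sums of squares, all around the means.
    sp_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return sp_xy / sqrt(ss_x * ss_y)


# Hypothetical scores on two measures for five testtakers.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
print(round(r, 3))       # correlation coefficient
print(round(r ** 2, 3))  # coefficient of determination (shared variance)
```

For these made-up numbers r is about .77, so roughly 60% of the variance is shared between X and Y.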

SPEARMAN RHO

 Rank order correlation coefficient

 Developed by Charles Spearman

 Used when the sample size is small and when both sets of measurements are in ordinal form (ranking form)

BISERIAL CORRELATION

 expresses the relationship between a continuous variable and an artificial dichotomous variable

o If the dichotomous variable had been true then we would

use the point biserial correlation

o When both variables are dichotomous and at least one of

the dichotomies is true, then the association between them can be estimated using the phi coefficient

o If both dichotomous variables are artificial, we might use a special correlation coefficient – tetrachoric correlation

REGRESSION

 analysis of relationships among variables for the purpose of understanding how one variable may predict another

 SIMPLE REGRESSION: one IV (X) and one DV (Y)

- Regression line: defined as the best-fitting straight line through a set

of points in a scatter diagram

o Found by using the principle of least squares, which

minimizes the squared deviation around the regression line

 Primary use: To predict one score or variable from another

 Standard error of estimate: the higher the correlation between X and Y, the greater the accuracy of the prediction and the smaller the SEE.

 MULTIPLE REGRESSION: The use of more than one score to predict Y.

 Regression coefficient (b): slope of the regression line

o Ratio of the sum of squares for the covariance to the sum of squares for X

o Sum of squares is defined as the sum of the squared

deviations around the mean

o Covariance is used to express how much two measures

covary, or vary together

 Slope describes how much change is expected in Y each time X increases by one unit

 Intercept (a) is the value of Y when X is 0

o The point at which the regression line crosses the Y axis
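The slope and intercept described above can be sketched in a few lines. The data are invented, and `least_squares_line` is a hypothetical helper name; the slope is the sum of cross-products divided by the sum of squares for X, and the intercept follows from the two means.

```python
def least_squares_line(x, y):
    """Return (a, b) for the regression line Y' = a + bX."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sp_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    b = sp_xy / ss_x           # slope: expected change in Y per one-unit change in X
    a = mean_y - b * mean_x    # intercept: predicted Y when X is 0
    return a, b


# Hypothetical predictor (X) and criterion (Y) scores.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = least_squares_line(x, y)
print(round(a, 2), round(b, 2))
print(round(a + b * 6, 2))  # predicted Y for a new testtaker with X = 6
```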

THE BEST-FITTING LINE

 The difference between the observed and predicted score (Y- Y’) is called the residual

 The best-fitting line is most appropriately found by squaring each residual

 Best-fitting line is obtained by keeping these squared residuals as small as possible

o Principle of least squares:

 Correlation is a special case of regression in which the scores for both variables are in standardized, or Z, units

 In correlation, the intercept is always 0

 Pearson product moment correlation coefficient is a ratio used to determine the degree of variation in one variable that can be estimated from knowledge about variation in the other variable

Testing the Statistical Significance of a Correlation Coefficient

- Begin with the null hypothesis that there is no relationship between

variables

- Null hypothesis is rejected if there is evidence that the association between two variables is significantly different from 0

- t distribution is not a single distribution, but a family of distributions,

each with its own degrees of freedom

- Degrees of freedom are defined as the sample size minus 2, or N-2

- Two-tailed test
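A sketch of the test statistic, using the standard conversion of r to t with N-2 degrees of freedom. The values r = .50 and N = 27 are invented, and `t_for_correlation` is a hypothetical helper name. The resulting t is then compared with the tabled two-tailed critical value for that number of degrees of freedom.

```python
from math import sqrt


def t_for_correlation(r, n):
    """Convert a correlation r from a sample of size n into a t statistic.

    Returns (t, df), where df = n - 2.
    """
    df = n - 2
    t = r * sqrt(df) / sqrt(1 - r ** 2)
    return t, df


# Hypothetical result: r = .50 observed in a sample of 27 people.
t, df = t_for_correlation(0.50, 27)
print(round(t, 3), df)
```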

How to Interpret a Regression Plot

- Regression plots are pictures that show the relationship between

variables

- Common use of correlation is to determine the criterion validity

evidence for a test, or the relationship between a test score and some well-defined criterion

- Middle level of enjoyableness because it is the one observed most

frequently – normative because it uses info gained from representative groups

- Using the test as a predictor is not as good as perfect prediction, but

it is still better than using the normative info

- A regression line such as in 3.9 shows that the test score tells us nothing about the criterion beyond the normative info

TERMS AND ISSUES IN THE USE OF CORRELATION

Residual

- Difference between the predicted and the observed values is called

the residual

o Y-Y’

- Important property of residual is that the sum of the residuals always

equals 0

- Sum of the squared residuals is the smallest value according to the principle of least squares

Standard Error of Estimate

- Standard deviation of the residuals is the standard error of estimate

- A measure of the accuracy of prediction

- Prediction is most accurate when the standard error of estimate is relatively small

Coefficient of Determination

- Correlation coefficient squared is known as the coefficient of

determination

- Tells us the proportion of the total variation in scores on Y that we know as a function of information about X

Coefficient of Alienation

- Coefficient of alienation is a measure of nonassociation between two variables

- Square root of 1 - r², where r² is the coefficient of determination

- High value means there is a high degree of nonassociation between 2 variables

Shrinkage

- Tendency to overestimate the relationship, particularly if the sample

of subjects is small

- Shrinkage is the amount of decrease observed when a regression equation is created for one population and then applied to another

Cross Validation

- Use regression equation to predict performance in a group of subjects other than the ones on which the equation was developed

- Standard error of estimate obtained for relationship between the

values predicted by the equation and the values actually observed – called cross validation
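The terms defined in this section (residual, standard error of estimate, coefficient of determination, coefficient of alienation) can be sketched together on one small invented data set. The helper name `regression_diagnostics` is hypothetical, and dividing the squared residuals by N-2 is one common convention for the standard error of estimate, assumed here.

```python
from math import sqrt


def regression_diagnostics(x, y):
    """Fit Y' = a + bX by least squares and summarize the fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sp_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    b = sp_xy / ss_x
    a = mean_y - b * mean_x
    # Residual = observed minus predicted score (Y - Y').
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    ss_residual = sum(e ** 2 for e in residuals)
    r_squared = sp_xy ** 2 / (ss_x * ss_y)
    return {
        "residual_sum": sum(residuals),         # always 0, up to rounding
        "see": sqrt(ss_residual / (n - 2)),     # standard error of estimate
        "r_squared": r_squared,                 # coefficient of determination
        "alienation": sqrt(1 - r_squared),      # coefficient of alienation
    }


# Hypothetical predictor and criterion scores.
result = regression_diagnostics([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
for name, value in result.items():
    print(name, round(value, 3))
```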

The Correlation-Causation Problem

- Experiments are required to determine whether manipulation of one

variable causes changes in another variable

- A correlation alone does not prove causality, although it might lead to

other research that is designed to establish the causal relationships between variables

Third Variable Explanation

- Third variable, ie poor social adjustment, causes TV viewing and

aggression

- External influence is the third variable

Restricted Range

- Correlation and regression use variability on one variable to explain

variability on a second variable

- Restricted range problem: correlation requires variability; if the

variability is restricted, then significant correlations are difficult to find

Multivariate Analysis

- Multivariate analysis considers the relationship among combinations of three or more variables

General Approach

- Linear combination of variables is a weighted composite of the

original variables

CHAPTER 5: RELIABILITY

RELIABILITY

- Dependability and consistency

- Error implies that there will always be some inaccuracy in our

measurements

- Tests that are relatively free of measurement error are deemed to be

reliable

- Reliability estimates in the range of .70 and .80 are good enough for

most purposes in basic research

- Reliability coefficient: an index that indicates the ratio between the

true score variance on a test and the total variance

- HISTORY OF RELIABILITY:

o Charles Spearman (1904): The Proof and Measurement of

Association between Two Things

o Then Thorndike

o Item response theory has taken advantage of computer

technology to advance psychological measurement significantly

o Based on Spearman’s ideas

- X = T + E (observed score = true score + error)

CLASSICAL TEST THEORY

o assumes that each person has a true score that would be

obtained if there were no errors in measurement

o Difference between the true score and the observed score

results from measurement error

o Assumption here is that errors of measurement are

random

o Basic sampling theory tells us that the distribution of

random errors is bell-shaped

 The center of the distribution should represent the true score, and the dispersion around the mean of the distribution should display the distribution of sampling errors

o Classical test theory assumes that the true score for an

individual will not change with repeated applications of the same test

o Variance: standard deviation squared. It is useful because it can be broken into components:

o True variance: variance from true differences, which are assumed to be stable

o Error variance: random irrelevant sources

- Standard error of measurement: because we assume that the distribution of random errors will be the same for all people, classical test theory uses the standard deviation of errors as the basic measure of error

o Standard error of measurement tells us, on the average,

how much a score varies from the true score

o Standard deviation of the observed score and the

reliability of the test are used to estimate the standard error of measurement

- Reliability: proportion of the total variance attributed to true

variance.

o the greater portion of total variance attributed to true

variance, the more reliable the test
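Two of the ideas above can be sketched numerically: reliability as the proportion of total variance that is true variance, and the standard error of measurement estimated from the standard deviation of observed scores and the reliability, using the usual formula SEM = SD × √(1 − reliability). All numbers and function names are invented for illustration.

```python
from math import sqrt


def reliability_from_variances(true_variance, error_variance):
    """Proportion of total variance (true + error) that is true variance."""
    return true_variance / (true_variance + error_variance)


def standard_error_of_measurement(sd_observed, reliability):
    """Estimate the SEM from the observed-score SD and the reliability."""
    return sd_observed * sqrt(1 - reliability)


# Hypothetical test: true variance 80, error variance 20.
print(reliability_from_variances(80, 20))  # 0.8

# Hypothetical test with SD = 15 and reliability = .91.
print(round(standard_error_of_measurement(15, 0.91), 2))
```

The more reliable the test, the smaller the standard error of measurement; a perfectly reliable test would have an SEM of 0.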

- Measurement error: refers to collectively, all of the factors associated

with the process of measuring some variable, other than the variable being measured

o Random error: a source of error in measuring a targeted

variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process

 This source of error fluctuates from one testing situation to another with no discernible pattern that would systematically raise or lower scores

o Systematic Error:

 A source of error in measuring a variable that is typically constant or proportionate to what is presumed to be true value of the variable being measured

 Error is predictable and fixable

 Does not affect score consistency

SOURCES OF ERROR VARIANCE

- TEST CONSTRUCTION

o Item sampling or content sampling – refers to variation among items within a test as well as to variation among items between tests

 The extent to which a test taker’s score is affected by the content sampled on a test and by the way the content is sampled (that is, the way in which the item is constructed) is a source of error variance

- TEST ADMINISTRATION

o May influence the test taker’s attention or motivation

o Environment variables, test taker variables, examiner variables (e.g., level of professionalism)

- TEST SCORING AND INTERPRETATION

o Computer scoring and a growing reliance on objective,

computer-scorable items have virtually eliminated error variance caused by scorer differences

o However, other tools of assessment still require scoring by

trained personnel

o If subjectivity is involved in scoring, then the scorer can be

a source of error variance

o Despite rigorous scoring criteria set forth in many of the better-known tests of intelligence, examiners occasionally still are confronted by situations where an examinee’s response lies in a gray area

TEST-RETEST RELIABILITY

- Also known as time-sampling reliability

- Correlating pairs of scores from the same group on two different administrations of the same test

- Appropriate when measuring something that is relatively stable over time

- Sources of error variance:

o Passage of time: the longer the time that passes, the greater the likelihood that reliability coefficient will be lower.

o Coefficient of stability: name often given to the test-retest estimate when the interval between testing is greater than 6 months

- Consider possibility of carryover effect: occurs when first testing

session influences scores from the second session

- If something affects all the test takers equally, then the results are uniformly affected and no net error occurs

- Practice effects are one way this carryover can happen

- Practice can also affect tests of manual dexterity

- Time interval between testing sessions must be selected and evaluated carefully

- Poor test-retest correlations do not always mean that a test is unreliable; they may suggest that the characteristic under study has changed

PARALLEL-FORM OR ALTERNATE FORMS RELIABILITY

- compares two equivalent forms of a test that measure the same

attribute

- The two forms should be constructed equivalently (same format, etc.)

- When two forms of the test are available, one can compare performance on one form versus the other – equivalent forms reliability or parallel forms

- Coefficient of equivalence: degree of relationship between various forms of a test, evaluated by means of an alternate-forms or parallel-forms coefficient of reliability

- Parallel forms: for each form of the test, the means and variances of observed test scores are equal

- Alternate forms: different versions of a test that have been

constructed so as to be parallel

- (1) two test administrations with the same group are required

- (2) test scores may be affected by factors such as motivation etc.

- Problem: developing a new version of a test

INTERNAL CONSISTENCY

- How well does each item measure the content/construct under

consideration

- How consistent the items are with one another

- Used when tests are administered once

- If all items on a test measure the same construct, then it has a good

internal consistency

SPLIT-HALF RELIABILITY

- Correlating two pairs of scores obtained from equivalent halves of a

single test administered once.

- This is useful when it is impractical to assess reliability with two tests

or to administer test twice

- Results of one half of the test are then compared with the results of

the other

- Rules in splitting forms into half:

o Do not divide test in the middle because it would lower

the reliability

o Different amounts of anxiety and differences in item

difficulty shall also be considered

o Randomly assign items to one or the other half of the test

o Use the odd-even system: one subscore is obtained for the odd-numbered items in the test and another for the even-numbered items

- To correct for half-length, apply the Spearman-Brown formula, which

allows you to estimate what the correlation between the two halves would have been if each half had been the length of the whole test

o Can also be used to estimate the effect on reliability when a test user wishes to shorten a test

o Used to determine the number of items needed to attain a

desired level of reliability

- Reliability increases as the test length increases
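A sketch of the Spearman-Brown formula in its general form, r_new = n·r / (1 + (n − 1)·r), where n is the factor by which the test is lengthened (n = 2 corrects a split-half correlation to full length). The correlations used are invented, and both function names are hypothetical.

```python
def spearman_brown(r, n):
    """Estimated reliability when a test is lengthened by a factor of n.

    Use n = 2 to correct a half-test correlation to full length;
    n < 1 estimates the effect of shortening the test.
    """
    return n * r / (1 + (n - 1) * r)


def length_factor_needed(r_current, r_desired):
    """Factor by which the test must be lengthened to reach r_desired."""
    return r_desired * (1 - r_current) / (r_current * (1 - r_desired))


# Hypothetical: the two halves of a test correlate .70.
print(round(spearman_brown(0.70, 2), 3))         # full-length estimate

# Hypothetical: current reliability .70, target .90.
print(round(length_factor_needed(0.70, 0.90), 2))  # times the current length
```

Multiplying the length factor by the current number of items gives the number of items needed for the desired reliability, assuming the added items are comparable to the existing ones.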

KUDER-RICHARDSON FORMULAS OR KR20/KR21

- Kuder-Richardson technique simultaneously considers all possible

ways of splitting the items

- The formula for calculating the reliability of a test in which the items

are dichotomous, scored 0 or 1, is the Kuder-Richardson 20 (see p.114)

- KR21 was also introduced: uses an approximation of the sum of the pq products based on the mean test score

CRONBACH ALPHA

- Cronbach developed a formula that estimates the internal

consistency of tests in which the items are not scored as 0 or 1 – a more general reliability estimate, which he called coefficient alpha

- Sum the individual item variances

o Most general method of finding estimates of reliability

through internal consistency

- Domain sampling: define a domain that represents a single trait or

characteristic, and each item is an individual sample of this general characteristic

- Factor analysis deals with the situation in which a test apparently

measures several different characteristics

o Good for the process of test construction

- Most widely used as a measure of reliability because it requires only

one administration of the test

- Typically ranges from 0 to 1; “bigger is always better” is a myth, since a very high value (above .90) may indicate redundancy among the items
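A sketch of coefficient alpha from the item variances and the variance of total scores, alpha = k/(k − 1) × (1 − sum of item variances / variance of totals). The response matrix is invented (rows are testtakers, columns are items) and the function names are hypothetical. With items scored 0 or 1, each item variance equals pq, so the same calculation gives KR-20.

```python
def variance(values):
    """Population variance (divide by N); used consistently throughout."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def cronbach_alpha(scores):
    """Coefficient alpha for a list of rows, one row of item scores per person."""
    k = len(scores[0])  # number of items
    item_variances = [variance([row[j] for row in scores]) for j in range(k)]
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses of four testtakers to three right/wrong items.
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(cronbach_alpha(scores))  # 0.75
```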

Other Methods of Estimating Internal Consistencies

- Inter-item consistency: refers to the degree of correlation among all

the items on a scale

o A measure of inter-item consistency is calculated from a

single administration of a single form of a test

o An index of inter-item consistency, in turn, is useful in

assessing the homogeneity of the test

o Tests are said to be homogenous if they contain items that

measure a single trait

o Definition: the degree to which a test measures a single

factor

o Heterogeneity: degree to which a test measures different

factors

o Ex: homogeneous = test that assesses knowledge only of 3-D television repair skills vs. a general electronics repair test (heterogeneous)

o The more homogenous a test is, the more inter-item

consistency it can be expected to have

o Test homogeneity is desirable because it allows relatively

straightforward test-score interpretation

o Test takers with the same score on a homogenous test

probably have similar abilities in the area tested

o Test takers with the same score on a heterogeneous test

may have quite different abilities

o However, homogenous testing is often an insufficient tool for measuring multifaceted psychological variables such as intelligence or personality

Measures of Inter-Scorer Reliability

- In some types of tests under some conditions, the score may be more a

function of the scorer than of anything else

- Inter-scorer reliability: the degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure

- Coefficient of inter-scorer reliability: coefficient of correlation to

determine the degree of consistency among scorers in the scoring of a test

- Kappa statistic is the best method for assessing the level of agreement

among several observers

o Indicates the actual agreement as a proportion of the potential

agreement following the correction for chance agreement

o Cohen’s Kappa – 2 raters

o Fleiss’ Kappa – 3 or more raters
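A sketch of Cohen's kappa for two raters, kappa = (observed agreement − chance agreement) / (1 − chance agreement). The ratings are invented and `cohens_kappa` is a hypothetical helper name.

```python
from collections import Counter


def cohens_kappa(rater1, rater2):
    """Chance-corrected agreement between two raters' category labels."""
    n = len(rater1)
    # Observed agreement: proportion of cases given the same label.
    p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement: product of the raters' marginal proportions, summed over labels.
    counts1, counts2 = Counter(rater1), Counter(rater2)
    labels = set(rater1) | set(rater2)
    p_chance = sum(counts1[label] * counts2[label] for label in labels) / n ** 2
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical yes/no judgments by two raters on ten responses.
rater1 = ["y", "y", "y", "y", "y", "n", "n", "n", "n", "n"]
rater2 = ["y", "y", "y", "y", "n", "n", "n", "n", "n", "y"]
print(round(cohens_kappa(rater1, rater2), 2))  # 0.6
```

Here the raters agree on 8 of 10 cases (80%), but half of that agreement would be expected by chance, so kappa is lower than the raw agreement.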

HOMOGENEITY VS. HETEROGENEITY OF TEST ITEMS

- Tests with homogeneous items tend to have a high degree of internal-consistency reliability

DYNAMIC VS. STATIC CHARACTERISTICS

- Dynamic: trait, state, ability presumed to be ever-changing as a function of

situational and cognitive experiences

- Static: trait, state, ability relatively unchanging

RESTRICTION OR INFLATION OF RANGE

- If the range of scores is restricted, reliability tends to be lower.

- If it is inflated, reliability tends to be higher.

SPEED TESTS VS. POWER TESTS

- Speed test: items are homogeneous and easy, but the time limit is short

- Power test: few items, but more complex.

CRITERION-REFERENCED TESTS

- Provide an indication of where a testtaker stands with respect to some

variable or criterion.

- Tends to contain material that has been mastered in hierarchical fashion.

- Scores here tend to be interpreted in pass-fail terms.

- Measure of reliability depends on the variability of the test scores: how different the scores are from one another.

The Domain Sampling Model

- This model considers the problems created by using a limited number

of items to represent a larger and more complicated construct

- Our task in reliability analysis is to estimate how much error we would

make by using the score from the shorter test as an estimate of your true ability

- Conceptualizes reliability as the ratio of the variance of the observed

score on the shorter test and the variance of the long-run true score

- Reliability can be estimated from the correlation of the observed test score with the true score

Item Response Theory

- Classical test theory requires that exactly the same test items be

administered to each person – BAD

- Item response theory (IRT) is newer – computer is used to focus on

the range of item difficulty that helps assess an individual’s ability level

o More reliable estimate of ability is obtained using a

shorter test with fewer items

o Takes a lot of items and effort

Generalizability theory

- Based on the idea that a person’s test scores vary from testing to testing because of variables in the testing situation

- Instead of conceiving of all variability in a person’s scores as error, Cronbach encouraged test developers and researchers to describe the details of the particular test situation or universe leading to a specific test score

- This universe is described in terms of its facets: which include things like

the number of items in the test, the amount of training the test scorers have had, and the purpose of the test administration

- According to generalizability theory, given the exact same conditions of all

the facets in the universe, the exact same test score should be obtained

- Universe score: the test score obtained; it is analogous to a true score in the true score model

- Cronbach suggested that tests be developed with the aid of a

generalizability study followed by a decision study

- Generalizability study: examines how generalizable scores from a

particular test are if the test is administered in different situations

- How much of an impact different facets of the universe have on the test

score

- Ex: is the test score affected by group as opposed to individual

administration

- Coefficients of generalizability: represent the influence of particular facets on the test score. These coefficients are similar to reliability coefficients in the true score model

- Decision study: developers examine the usefulness of test scores in helping the test user make decisions

- The decision study is designed to tell the test user how test scores should

be used and how dependable those scores are as a basis for decisions, depending on the context of their use

What to Do About Low Reliability

- Two common approaches are to increase the length of the test and to

throw out items that run down the reliability

- Another procedure is to estimate what the true correlation would have been if the test did not have measurement error

Increase the Number of Items

- The larger the sample of items, the more likely that the test will represent the true characteristic

o This could entail a long and costly process however

- Prophecy formula

Factor and Item Analysis

- Reliability of a test depends on the extent to which all of the items

measure one common characteristic

- Factor analysis

o Tests are most reliable if they are unidimensional : one

factor should account for considerably more of the variance than any other factor

- Or examine the correlation between each item and the total score for

the test

o Called discriminability analysis: when the correlation between the performance on a single item and the total test score is low, the item is probably measuring something different from the other items on the test

Correction for Attenuation

- Potential correlations are attenuated, or diminished, by measurement error
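A sketch of the usual correction for attenuation, which estimates what the correlation between two measures would be without measurement error: corrected r = r_xy / √(r_xx × r_yy), where r_xx and r_yy are the reliabilities of the two measures. The values are invented and `correct_for_attenuation` is a hypothetical helper name.

```python
from math import sqrt


def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimate the correlation between X and Y if both were perfectly reliable.

    r_xy: observed correlation; r_xx, r_yy: reliabilities of the two measures.
    """
    return r_xy / sqrt(r_xx * r_yy)


# Hypothetical: observed r = .40, reliabilities .80 and .50.
print(round(correct_for_attenuation(0.40, 0.80, 0.50), 3))
```

The corrected value is always at least as large as the observed correlation, because unreliable measures can only weaken the correlation that is observed.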

CHAPTER 6: VALIDITY

The Concept of Validity

- Validity: as applied to a test, is a judgment or estimate of how well a test measures what it purports to measure in a particular context

o Judgment based on evidence about the appropriateness of

inferences drawn from test scores

o Validity of test must be shown from time to time to account for culture and advancement

- Inference: a logical result or deduction

- “Acceptable” or “weak” validity of tests and test scores

- Validation: process of gathering and evaluating evidence about validity

o Test user and testtaker both have roles in validation of test

o Test users may conduct their own validation studies: may yield insights regarding a particular population of testtakers as compared to the norming sample (in manual)

o Local validation studies: absolutely necessary when test user

plans to alter in some way the format, instructions, language, or content of the test

- Types of Validity (Trinitarian view) *not mutually exclusive all contribute to a unified picture of a test’s validity/ critiq ue approach is fragmented and incomplete

o Content validity: measure of validity based on an evaluation of the subjects, topics, or content covered by the items in the test

o Criterion-related validity: measure of validity obtained by evaluating the relationship of scores obtained on the test to scores on other tests or measures

o Construct validity: measure of validity that is arrived at by executing a comprehensive analysis of the following (umbrella validity: every other variety of validity falls under it):

 How scores on test relate to other test scores and measures

 How scores on test can be understood within some theoretical framework for understanding the construct that the test was designed to measure

- Strategies: ways of approaching the process of test validation

o Content validation strategies

o Criterion-related validation strategies

o Construct validation strategies

- Face Validity

o Face validity: relates more to what a test appears to measure to the person being tested than to what the test actually measures

o Judgment concerning how relevant the test items appear to be, usually from the testtaker, not the test user

o Lack of face validity = lack of confidence in the perceived effectiveness of the test, which decreases the testtaker’s motivation/cooperation *the test may still be useful

- Content validity

o Content validity: a judgment of how adequately a test samples behavior representative of the universe of behavior that the test was designed to sample

 Ideally, test developers have a clear vision of the construct being measured; that clarity is reflected in the content validity of the test

o Test blueprint: structure of the evaluation; a plan regarding the types of information to be covered by the items, the number of items tapping each area of coverage, the organization of the items in the test, etc.

 Behavior observation is a technique frequently used in test blueprinting

o The quantification of content validity

 Important in employment settings: tests used to hire and promote

 One method: gauging agreement among raters or judges regarding how essential a particular item is (C.H. Lawshe)

 “Is the skill or knowledge measured by this item…

o Essential

o Useful but not essential

o Not necessary

 …to the performance of the job?”

 Content validity ratio (CVR):

CVR = (n_e - N/2) / (N/2)

o CVR = content validity ratio

o n_e = number of panelists stating “essential”

o N = total number of panelists

 CVR is calculated for each item
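The CVR formula above can be sketched in a few lines. The function name and the panel of 10 raters are hypothetical, chosen only to show how the ratio behaves.

```python
def content_validity_ratio(n_essential, n_panelists):
    """Lawshe's content validity ratio for a single item.

    n_essential: number of panelists rating the item "essential" (n_e)
    n_panelists: total number of panelists (N)
    """
    half = n_panelists / 2
    return (n_essential - half) / half


# Hypothetical panel of 10 raters judging three items
for n_e in (9, 5, 2):
    print(n_e, content_validity_ratio(n_e, 10))
```

The ratio is positive when more than half of the panelists rate the item essential, zero when exactly half do, and negative when fewer than half do. It ranges from -1 to +1.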

o Culture and the relativity of content validity

 Tests are often thought of as either valid or invalid

 What constitutes historical fact depends to some extent on who is writing the history

 Culture relativity

 Politics (politically correct)

Criterion-Related Validity

- Criterion-related validity: judgment of how adequately a test score can be used to infer an individual’s most probable standing on some measure of interest (measure of interest being the criterion)

- 2 types:

o Concurrent validity: index of the degree to which a test score is related to some criterion measure obtained at the same time (concurrently)

o Predictive validity: index of the degree to which a test score predicts some criterion measure

- What Is a Criterion?

o Criterion: a standard on which a judgment or decision may be based; standard against which a test or a test score is evaluated (criterion-related validity)

o Characteristics of criterion

 Relevancy: pertinent or applicable to the matter at hand

 Validity (for the purpose for which it is being used)

 Uncontaminated: criterion contamination is the term applied to a criterion measure that has been based, at least in part, on predictor measures

- Concurrent Validity

o Test scores are obtained at about the same time as the criterion measures are obtained; measures of the relationship between the test scores and the criterion provide evidence of concurrent validity

o Indicate the extent to which test scores may be used to estimate an individual’s present standing on a criterion

o Once validity of inference from test scores is established = faster, less expensive way to offer a diagnosis or a classification decision

o Concurrent validity of a test can be explored with respect to another test

 Prior research must have satisfactorily demonstrated the first test’s validity

 The first test serves as the validating criterion

- Predictive validity

o Test scores may be obtained at one time and the criterion measures obtained at a future time, usually after some intervening event has taken place

 Intervening event: training, experience, therapy, medication, etc.

 Measures of relationship between the test scores and a criterion measure obtained at a future time provide an indication of the predictive validity of the test (how accurately scores on the test predict some criterion measure)

o Ex: SAT test score and freshman GPA

o Judgments of criterion-related validity are based on 2 types of statistical evidence:

 The validity coefficient

 Validity coefficient: correlation coefficient that provides a measure of the relationship between test scores and scores on the criterion measure

 Ex: Pearson correlation coefficient (r) used to determine validity between 2 measures

 Affected by restriction or inflation of range
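A small sketch of a validity coefficient computed as a Pearson r, and of how restriction of range can shrink it. The scores are made-up data, and the cutoff of 7 is an arbitrary stand-in for a selection rule (for example, only admitted applicants having criterion data).

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical test scores and criterion scores for 10 people
test_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
criterion = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]

r_full = pearson_r(test_scores, criterion)

# Restrict the range: keep only people who scored 7 or higher on the test
kept = [(t, c) for t, c in zip(test_scores, criterion) if t >= 7]
r_restricted = pearson_r([t for t, _ in kept], [c for _, c in kept])

print(round(r_full, 2), round(r_restricted, 2))
```

In this toy data the coefficient is about .94 for the full group but drops to .60 when only the high scorers are kept, even though the underlying relationship is the same.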