CHAPTER 1: PSYCHOLOGICAL TESTING AND ASSESSMENT
1. DVD: how would you respond to the events that take place in the video
a) sexual harassment in the workplace
b) respond to various types of emergencies
c) diagnosis/treatment plan for clients on videotape
2. thermometers, biofeedback, etc.
TEST DEVELOPER
They are the ones who create tests.
They conceive, prepare, and develop tests. They also find a way to disseminate their tests, by publishing them either commercially or through professional publications such as books or periodicals.
TEST USER
They select or decide to take a specific test off the shelf and use it for some purpose. They may also participate in other roles, e.g., as examiners or scorers.
TEST TAKER
Anyone who is the subject of an assessment
Test takers may vary on a continuum with respect to numerous variables, including:
o The amount of anxiety they experience & the degree to which the test anxiety might affect the results
o The extent to which they understand & agree with the rationale of the assessment
o Their capacity & willingness to cooperate
o Amount of physical pain/emotional distress they are experiencing
o Amount of physical discomfort
o Extent to which they are alert & wide awake
o Extent to which they are predisposed to agreeing or disagreeing when presented with stimulus
o The extent to which they have received prior coaching
o The importance they may attribute to portraying themselves in a good light
Psychological autopsy – reconstruction of a deceased individual’s psychological profile on the basis of archival records, artifacts, & interviews previously conducted with the deceased assessee
TYPES OF SETTINGS
EDUCATIONAL SETTING
o achievement test: evaluation of accomplishments or the
degree of learning that has taken place, usually with regard to an academic area.
o diagnosis: a description or conclusion reached on the basis of evidence and opinion through a process of distinguishing the nature of something and ruling out alternative conclusions.
o diagnostic test: a tool used to make a diagnosis, usually to
identify areas of deficit to be targeted for intervention
o informal evaluation: a typically nonsystematic, relatively brief, and “off the record” assessment leading to the formation of an opinion or attitude, conducted by any person in any way for any reason, in an unofficial context and not subject to the same ethics or standards as evaluation by a professional
CLINICAL SETTING
o these tools are used to help screen for or diagnose
behavior problems
o group testing is used primarily for screening: identifying
those individuals who require further diagnostic evaluation.
COUNSELING SETTING
o schools, prisons, and governmental or privately owned institutions
o ultimate objective: the improvement of the assessee in
terms of adjustment, productivity, or some related variable.
GERIATRIC SETTING
o quality of life: in psychological assessment, an evaluation of variables such as perceived stress, loneliness, sources of satisfaction, personal values, quality of living conditions, and quality of friendships and other social support.
BUSINESS AND MILITARY SETTINGS
GOVERNMENTAL AND ORGANIZATIONAL CREDENTIALING
How are Assessments Conducted?
protocol: the form or sheet or booklet on which a testtaker’s responses are entered.
o term might also be used to refer to a description of a set of test- or assessment-related procedures, as in the sentence, “the examiner dutifully followed the complete protocol for the stress interview”
rapport: working relationship between the examiner and the examinee
ASSESSMENT OF PEOPLE WITH DISABILITIES
Define who requires alternate assessment, how such assessments are to be conducted, and how meaningful inferences are to be drawn from the data derived from such assessments
Accommodation – adaptation of a test, procedure, or situation, or the substitution of one test for another, to make the assessment more suitable for an assessee with exceptional needs.
Ex.) Translate the test into Braille and administer it in that form.
Alternate assessment – evaluative or diagnostic procedure or process that varies from the usual, customary, or standardized way a measurement is derived, either by virtue of some special accommodation made to the assessee or by means of alternative methods
Consider these four variables when deciding which of many different types of accommodation should be employed:
o The capabilities of the assessee
o The purpose of the assessment
o The meaning attached to test scores
o The capabilities of the assessor
REFERENCE SOURCES
TEST CATALOGUES – contain brief descriptions of the tests
TEST MANUALS – detailed information
REFERENCE VOLUMES – one-stop shopping; provide detailed information for each test listed, including test publisher, author, purpose, intended test population, and test administration time
JOURNAL ARTICLES – contain reviews of the test
ONLINE DATABASES – most widely used bibliographic databases
TYPES OF TESTS
INDIVIDUAL TEST – given to only one person at a time
GROUP TEST – administered to more than one person at a time by a single examiner
ABILITY TESTS:
o ACHIEVEMENT TESTS – refers to previous learning (ex.
Spelling)
o APTITUDE/PROGNOSTIC – refers to the potential for
learning or acquiring a specific skill
o INTELLIGENCE TESTS – refers to a person’s general
potential to solve problems
PERSONALITY TESTS: refers to overt and covert dispositions
o OBJECTIVE/STRUCTURED TESTS – usually self-report,
require the subject to choose between two or more alternative responses
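o PROJECTIVE/UNSTRUCTURED TESTS – present an ambiguous stimulus (ex. an inkblot) onto which the subject is assumed to project his or her own needs, fears, hopes, and motivations
PSYCHOLOGICAL TESTING – refers to all possible uses, applications and underlying concepts of psychological and educational tests
CHAPTER 2: HISTORICAL, CULTURAL AND LEGAL/ETHICAL CONSIDERATIONS
A HISTORICAL PERSPECTIVE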
19TH CENTURY
Tests and testing programs first came into being in China
Testing was instituted as a means of selecting who, of many applicants, would obtain government jobs (civil service)
Job applicants were tested on proficiency in endeavors such as music, archery, knowledge and skill, etc.
GRECO-ROMAN WRITINGS (Middle Ages)
World of evilness
Deficiency in some bodily fluid as a factor believed to influence personality
Hippocrates and Galen
RENAISSANCE
Christian von Wolff – anticipated psychology as a science and psychological measurement as a specialty within that science
CHARLES DARWIN AND INDIVIDUAL DIFFERENCES
Tests were designed to measure individual differences in ability and personality among people
“Origin of Species”: chance variation in species would be selected or rejected by nature according to adaptivity and survival value.
“survival of the fittest”
FRANCIS GALTON
Explored and quantified individual differences between people.
Classified people “according to their natural gifts”
Displayed the first anthropometric laboratory
KARL PEARSON
Developed the product-moment correlation technique.
His work can be traced directly from Galton
WILHELM MAX WUNDT
First experimental psychology laboratory, at the University of Leipzig
Focused more on how people were similar, not different from each other.
JAMES MCKEEN CATTELL
Individual differences in reaction time
Coined the term mental test
CHARLES SPEARMAN
Originating the concept of test reliability as well as building the mathematical framework for the statistical technique of factor analysis
VICTOR HENRI
Frenchman who collaborated with Binet on papers suggesting how mental tests could be used to measure higher mental processes
EMIL KRAEPELIN
Early experimenter with the word association technique as a formal test
LIGHTNER WITMER
“Little-known founder of clinical psychology”
Founded the first psychological clinic in the U.S.
PSYCHE CATTELL
Daughter of James Cattell
Cattell Infant Intelligence Scale (CIIS) & Measurement of Intelligence in Infants and Young Children
RAYMOND CATTELL
Believed in lexical approach to defining personality, which examines human languages for descriptors of personality dimensions
20TH CENTURY
- Birth of the first formal tests of intelligence
- Testing shifted to be of more understandable relevance/meaning
A. THE MEASUREMENT OF INTELLIGENCE
o Binet created the first intelligence test to identify mentally retarded school children in Paris (individual)
o Binet-Simon Test has been revised many times
o Group intelligence tests emerged with the need to screen the intellect of WWI recruits
o David Wechsler – designed a test to measure adult intelligence
For him, intelligence is a global capacity of the individual to act purposefully, to think rationally, and to deal effectively with his environment.
Wechsler-Bellevue Intelligence Scale
Wechsler Adult Intelligence Scale – was revised several times and extended the age range of testtakers from young children through senior adulthood.
B. THE MEASUREMENT OF PERSONALITY
o Field of psychology was being too test oriented
o Clinical psychology was synonymous with mental testing
o ROBERT WOODWORTH – developed a measure of adjustment and emotional stability that could be administered quickly and efficiently to groups of recruits
To disguise the true purpose of the test, the questionnaire was labeled the Personal Data Sheet
He called it the Woodworth Psychoneurotic Inventory – first widely used self-report test of personality
o Self-report test:
Advantages:
Respondents best qualified
Disadvantages:
Poor insight into self
One might honestly believe something about self that isn’t true
Unwillingness to report seemingly negative qualities
o Projective test: individual is assumed to project onto some
ambiguous stimulus (inkblot, photo, etc.) his or her own unique needs, fears, hopes, and motivations
Ex.) Rorschach inkblot
C. THE ACADEMIC AND APPLIED TRADITIONS
Culture and Assessment
Culture: ‘the socially transmitted behavior patterns, beliefs, and products of work of a particular population, community, or group of people’
Evolving Interest in Culture-Related Issues
Goddard tested immigrants and found most to be feebleminded
-invalid; overestimated mental deficiency, even in native English-speakers
Led to nature-nurture debate about what intelligence tests actually measure
Needed to “isolate” the cultural variable
Culture-specific tests: tests designed for use with people from one culture, but not from another
-minorities still scored abnormally low
ex.) loaf of bread vs. tortillas
Today tests undergo many steps to ensure they are suitable for the nation in question
-take testtakers’ reactions into account
Some Issues Regarding Culture and Assessment
Verbal Communication
o Examiner and examinee must speak the same language
o Especially tricky with infrequently used vocabulary or unusual idioms
o Translator may lose nuances in translation or give unintentional hints toward a more desirable answer
o Also requires understanding of culture
Nonverbal Communication and Behavior
o Different between cultures
o Ex.) meaning of not making eye contact
o Body movement could even have physical cause
o Psychoanalysis: Freud’s theory of personality and
psychological treatment which stated that symbolic significance is assigned to many nonverbal acts.
o Timing tests in cultures not obsessed with speed
o Lack of speaking could be reverence for elders
Standards of Evaluation
o Acceptable roles for women differ across cultures
o “judgments as to who might be the best employee, manager, or leader may differ as a function of culture, as might judgments regarding intelligence, wisdom, courage, and other psychological variables”
o must ask ‘how appropriate are the norms or other standards that will be used to make this evaluation’
Tests and Group Membership
ex.) must be 5’4” to be police officer - excludes cultures with short stature
ex.) Jewish lifestyle not well suited for corporate America
affirmative action: voluntary and mandatory efforts to combat discrimination and promote equal opportunity in education and employment for all
Psychology, tests, and public policy
Legal and Ethical Considerations
Code of professional ethics: defines the standard of care expected of members of a given profession.
The Concerns of the Public
Beginning in World War I, the public feared that tests were only testing the ability to take tests
Legislation
o Minimum competency testing programs: formal testing
programs designed to be used in decisions regarding various aspects of students’ educations
o Truth-in-testing legislation: state laws to provide testtakers with a means of learning the criteria by which they are being judged
Litigation
o Daubert ruling made federal judges the gatekeepers in determining what expert testimony is admitted
o This overrode the Frye standard, which admitted only scientific testimony that had won general acceptance in the scientific community.
The Concerns of the Profession
Test-user qualifications
o Who should be allowed to use psych tests
o Level A: tests or aids that can adequately be administered,
scored, and interpreted with the aid of the manual and a general orientation to the kind of institution or organization in which one is working
o Level B: tests or aids that require some technical knowledge
of test construction and use and of supporting psychological and educational fields
o Level C: tests and aids requiring substantial understanding of testing and supporting psych fields with experience
Testing people with disabilities
o Difficulty in transforming the test into a form that can be taken by the testtaker
o Transferring responses to be scorable
o Meaningfully interpreting the test data
Computerized test administration, scoring, and interpretation
o simple, convenient
o easily copied, duplicated
o insufficient research to compare it to pencil-and-paper
versions
o value of computer interpretation is questionable
o unprofessional, unregulated “psychological testing” online
The Rights of Testtakers
the right of informed consent
o right to know why they are being evaluated, how test data
will be used and what information will be released to whom
o may be obtained from a parent or legal representative
o must be in written form:
general purpose of the testing
the specific reason it is being undertaken
general type of instruments to be administered
o revealing this information before the test can contaminate
the results
o deception only used if absolutely necessary
o don’t use deception if it will cause emotional distress
o fully debrief participants
The right to be informed of test findings
o Formerly, test administrators were told to give participants only positive information
o No realistic information was required
o They were advised to tell test takers as little as possible about the nature of their performance on a particular test, so that the examinee would leave the test session feeling pleased and satisfied
o Test takers have the right also to know what
recommendations are being made as a consequence of the test data
The right to privacy and confidentiality
o Privacy right: “recognizes the freedom of the individual to pick and choose for himself the time, circumstances, and particularly the extent to which he wishes to share or withhold from others his attitudes, beliefs, behaviors, and opinions”
o Privileged information: information protected by law from
being disclosed in legal proceeding. Protects clients from disclosure in judicial proceedings. Privilege belongs to the client not the psychologist.
o Confidentiality: concerns matters of communication
outside the courtroom
Safekeeping of test data: It is not a good policy to maintain all records in perpetuity
The right to the least stigmatizing label
o The standards advise that the least stigmatizing labels should always be assigned when reporting test results
CHAPTER 3: A STATISTICS REFRESHER
Why We Need Statistics
- Statistics are important for purposes of description
o Numbers provide convenient summaries and allow us to
evaluate some observations relative to others
- We use statistics to make inferences, which are logical deductions
about events that cannot be observed directly
o Detective work of gathering and displaying clues –
exploratory data analysis
o Then confirmatory data analysis
- Descriptive statistics are methods used to provide a concise
description of a collection of quantitative information
- Inferential statistics are methods used to make inferences from
observations of a small group of people known as a sample to a larger group of individuals known as a population
SCALES OF MEASUREMENT
MEASUREMENT – act of assigning numbers or symbols to characteristics of things according to rules. The rules serve as a guideline for representing the magnitude. It always involves error.
SCALE – set of numbers whose properties model empirical properties of the objects to which the numbers are assigned.
CONTINUOUS SCALE – interval/ratio. A scale used to measure a continuous variable. Always involves error
DISCRETE SCALE – nominal/ordinal. Used to measure a discrete variable (ex. female or male)
ERROR – collective influence of all of the factors on a test score beyond those specifically measured by the test.
PROPERTIES OF SCALES
- Magnitude, equal intervals, and an absolute 0
Magnitude
- The property of “moreness”
- A scale has the property of magnitude if we can say that a particular
instance of the attribute represents more, less, or equal amounts of the given quantity than does another instance
Equal Intervals
- A scale has the property of equal intervals if the difference between
two points at any place on the scale has the same meaning as the difference between two other points that differ by the same number of scale units
- A psychological test rarely has the property of equal intervals
- When a scale has the property of equal intervals, the relationship between the measured units and some outcome can be described by a straight line or a linear equation in the form Y=a+bX
o Shows that an increase in equal units on a given scale
reflects equal increases in the meaningful correlates of units
Absolute 0
- An Absolute 0 is obtained when nothing of the property being
measured exists
- This is extremely difficult/impossible for many psychological qualities
NOMINAL SCALE
Simplest form of measurement
Classification or categorization
Arithmetic operations cannot meaningfully be performed with nominal data
Ex.) Male or female
Also includes test items
o Ex.) yes/no responses
ORDINAL SCALE
Classifies in some kind of ranking order
Individuals compared to others and assigned a rank
Imply nothing about how much greater one ranking is than another
Numbers/ranks do not indicate units of measure
No absolute zero point
Binet: believed that data derived from intelligence test are ordinal in nature
INTERVAL SCALE
In addition to the features of nominal and ordinal scales, contain equal intervals between numbers
No absolute zero point
Can take average
RATIO SCALE
In addition to all the properties of nominal, ordinal, and interval measurement, ratio scale has true zero point
Equal intervals between numbers
Ex.) measuring amount of pressure hand can exert
True zero doesn’t mean someone will receive a score of 0, but means that 0 has meaning
NOTE:
Permissible Operations
- Level of measurement is important because it defines which
mathematical operations we can apply to numerical data
- For nominal data, each observation can be placed in only one
mutually exclusive category
- Ordinal measurements can be manipulated using arithmetic, but the results are often difficult to interpret
- With interval data, one can apply any arithmetic operation to the differences between scores
o Cannot be used to make statements about ratios
DESCRIBING DATA
Distribution: set of scores arrayed for recording or study
Raw Score: straightforward, unmodified accounting of performance, usually numerical
Frequency Distributions
Frequency Distribution: All scores listed alongside the number of times each score occurred
Grouped Frequency Distribution: test-score intervals (class intervals), replace the actual test scores
o Highest and lowest class intervals= upper and lower limits
of distribution
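The simple and grouped frequency distributions described above can be sketched in a few lines of Python. This is only an illustration: the scores are made up, the function names are hypothetical, and aligning class intervals to multiples of the interval width is an assumption of the sketch rather than a rule from these notes.

```python
from collections import Counter

def frequency_distribution(scores):
    """List each score alongside the number of times it occurred (highest score first)."""
    counts = Counter(scores)
    return sorted(counts.items(), reverse=True)

def grouped_frequency_distribution(scores, width):
    """Replace actual scores with class intervals of the given width.

    Assumes integer scores and intervals aligned to multiples of `width`.
    Returns ((lower limit, upper limit), frequency) pairs, highest interval first.
    """
    counts = Counter((s // width) * width for s in scores)
    return [((low, low + width - 1), counts[low])
            for low in sorted(counts, reverse=True)]

scores = [78, 85, 85, 92, 67, 78, 85, 90, 71, 67]  # hypothetical raw scores
simple = frequency_distribution(scores)
grouped = grouped_frequency_distribution(scores, width=10)

for interval, freq in grouped:
    print(interval, freq)
```

The grouped version trades detail for compactness: the individual scores inside each class interval are no longer recoverable.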
Histogram: graph with vertical lines drawn at the true limits of each test score (or class interval), forming TOUCHING rectangles; the midpoint is at the center of each bar
Bar Graph: rectangles DON’T touch
Frequency Polygon: data illustrated with a continuous line connecting the points where test scores or class intervals meet frequencies
A single test score means more if one relates it to other test scores
A distribution of scores summarizes the scores for a group of individuals
Frequency distribution: displays scores on a variable or a measure to reflect how frequently each value was obtained
o One defines all the possible scores and determines how many people obtained each of those scores
Income is an example of a variable that has a positive skew
Whenever you draw a frequency distribution or a frequency polygon, you must decide on the width of the class interval
Class interval: the unit on the horizontal axis (ex. inches of rainfall)
Measures of Central Tendency
Measure of central tendency: statistic that indicates the average or midmost score between the extreme scores in a distribution.
The Arithmetic Mean
o “X bar”
o sum of observations divided by number of observations
o X bar = ΣX / n
o Used for interval or ratio data when distributions are relatively normal
The Median
o The middle score
o Used for ordinal, interval, and ratio data
o Especially useful when few scores fall at extremes
The Mode
o Most frequently-occurring score
o Bimodal distribution: 2 scores both have the highest frequency
o The only measure of central tendency appropriate for nominal data
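The three measures of central tendency above can be compared on one small data set. The sketch below uses Python's standard `statistics` module; the scores are hypothetical and include one extreme value to show how the mean and median can diverge.

```python
from statistics import mean, median, multimode

scores = [2, 3, 3, 4, 5, 5, 20]  # hypothetical scores; 20 is an extreme score

m = mean(scores)            # sum of observations / number of observations
med = median(scores)        # the middle score of the sorted distribution
modes = multimode(scores)   # every score tied for the highest frequency

print(m, med, modes)
```

Here the single extreme score pulls the mean above the median, and two scores share the highest frequency, so the distribution is bimodal.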
Variability: indication of how scores in a distribution are scattered or dispersed
The Range
o Difference between the highest and lowest scores
o Quick but gross description of the spread of scores
The interquartile and semi-interquartile range
o Distribution is split up by 3 quartiles, thus making 4
quarters each representing 25% of the scores
o Q2= median
o Interquartile range: measure of variability equal to the difference between Q3 and Q1
o Semi-interquartile range: interquartile range divided by 2
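A minimal sketch of the range, interquartile range, and semi-interquartile range follows. The data are hypothetical, and the quartile convention is an assumption: several conventions exist, and this sketch uses the "inclusive" method of `statistics.quantiles`, which may give slightly different quartiles than a hand calculation from a textbook.

```python
from statistics import quantiles

scores = [10, 12, 14, 16, 18, 20, 22, 24, 26]  # hypothetical scores

score_range = max(scores) - min(scores)         # highest minus lowest score

# Three quartile points split the distribution into four quarters.
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")

iqr = q3 - q1         # interquartile range: spread of the middle 50% of scores
semi_iqr = iqr / 2    # semi-interquartile range

print(score_range, (q1, q2, q3), iqr, semi_iqr)
```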
Quartiles and Deciles
o Quartiles are points that divide the frequency distribution
into equal fourths
o First quartile is the 25th percentile; second quartile is the median, or 50th percentile; third quartile is the 75th percentile
o The interquartile range is bounded by the range of scores
that represents the middle 50% of the distribution
o Deciles are similar but use points that mark 10% rather
than 25% intervals
o Stanine system: converts any set of scores into a transformed scale, which ranges from 1 to 9
The average deviation
o Deviation score: x = X - mean
o Average deviation = (sum of the absolute values of all deviation scores) / total number of scores
o Tells us on average how far scores are from the mean
The Standard Deviation
o Similar to average deviation
o But in order to overcome the (+/-) problem, each deviation
is squared
o Standard deviation: a measure of variability equal to the
square root of the average squared deviations about the mean
o Is square root of variance
o Variance: the mean of the squares of the differences between the scores in a distribution and their mean
Found by squaring and summing all the deviation scores and then dividing by the total number of scores
o s = sample standard deviation
o sigma = population standard deviation
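The average deviation, variance, and standard deviation can be worked through on a small example. The sketch below follows the definitions above and divides by the total number of scores; estimates for a sample often divide by n - 1 instead, which is not shown here. The function names and scores are hypothetical.

```python
from math import sqrt

def average_deviation(scores):
    """Mean of the absolute deviations from the mean."""
    m = sum(scores) / len(scores)
    return sum(abs(x - m) for x in scores) / len(scores)

def variance(scores):
    """Mean of the squared deviations from the mean (divides by N)."""
    m = sum(scores) / len(scores)
    return sum((x - m) ** 2 for x in scores) / len(scores)

def standard_deviation(scores):
    """Square root of the variance."""
    return sqrt(variance(scores))

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical scores with a mean of 5

print(average_deviation(scores), variance(scores), standard_deviation(scores))
```

Without the absolute values (or the squaring), the positive and negative deviations would cancel and always sum to zero.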
Skewness
skewness: nature and extent to which symmetry is absent
POSITIVE SKEW: relatively few scores fall at the high end; ex.) test was too hard
NEGATIVE SKEW: relatively few scores fall at the low end; ex.) test was too easy
can be gauged by examining relative distances of quartiles from the median
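One way to turn the quartile comparison into a number is sketched below. The index used here is Bowley's quartile skewness coefficient, which these notes do not name; it is swapped in as an illustration, and the function name, data, and the "inclusive" quartile convention are all assumptions.

```python
from statistics import quantiles

def quartile_skew(scores):
    """Compare the distances of Q3 and Q1 from the median (Q2).

    Positive when Q3 lies farther from the median than Q1 (positive skew),
    negative when Q1 lies farther (negative skew), zero when they are equal.
    """
    q1, q2, q3 = quantiles(scores, n=4, method="inclusive")
    return ((q3 - q2) - (q2 - q1)) / (q3 - q1)

hard_test = [10, 12, 13, 15, 16, 20, 28, 35, 50]  # hypothetical: most scores low
easy_test = [50, 65, 72, 80, 84, 85, 87, 88, 90]  # hypothetical: most scores high

print(quartile_skew(hard_test), quartile_skew(easy_test))
```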
Kurtosis
steepness of distribution
platykurtic: relatively flat
leptokurtic: relatively peaked
mesokurtic: somewhere in the middle
The Normal Curve
Normal curve: bell-shaped, smooth, mathematically defined curve, highest at center; both sides taper as the curve approaches the x-axis asymptotically
-symmetrical, and thus the mean, median, and mode are the same
Area under the Normal Curve
Tails and body
Standard Scores
Standard Score: raw score that has been converted from one scale to another scale, where the latter has an arbitrarily set mean and standard deviation
-used for comparison
Z-score
conversion of a raw score into a number indicating how many standard deviation units the raw score is below or above the mean of the distribution.
The difference between a particular raw score and the mean divided by the standard deviation
Used to compare test scores with different scales
T-score
Standard score system with a mean of 50 and a standard deviation of 10; the scale ranges from 5 standard deviations below the mean to 5 standard deviations above the mean
No negatives
Other Standard Scores
SAT
GRE
Linear transformation: when a standard score retains a direct numerical relationship to the original raw score
Nonlinear transformation: required when data are not normally distributed, yet comparisons with normal distributions need to be made
o Normalized Standard Scores
When scores don’t fall on a normal distribution
“normalizing a distribution involves ‘stretching’ the skewed curve into the shape of a normal curve and creating a corresponding scale of standard scores, a scale called a normalized standard score scale”
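The z-score and T-score conversions described earlier in this section are both linear transformations, so they can be sketched as two one-line functions. The sketch assumes the conventional T-scale with a mean of 50 and a standard deviation of 10; the function names and the example test (mean 50, standard deviation 15) are hypothetical. Normalized standard scores are a nonlinear transformation and are not covered here.

```python
def z_score(raw, mean, sd):
    """How many standard deviation units the raw score lies above or below the mean."""
    return (raw - mean) / sd

def t_score(z):
    """Linear rescaling of a z-score to a scale with mean 50 and SD 10 (assumed convention)."""
    return 50 + 10 * z

# Hypothetical test with mean 50 and standard deviation 15
z_high = z_score(65, mean=50, sd=15)   # one SD above the mean
z_low = z_score(35, mean=50, sd=15)    # one SD below the mean

print(z_high, t_score(z_high))
print(z_low, t_score(z_low))
```

Because a score 5 standard deviations below the mean maps to a T-score of 0, the T-scale avoids the negative values that z-scores produce.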
CHAPTER 4: OF TESTS AND TESTING
Some Assumptions About Psychological Testing and Assessment
- Assumption 1: Psychological Traits and States Exist
o Trait: any distinguishable, relatively enduring way in which one
individual varies from another
o States: distinguish one person from another but are relatively
less enduring
Trait term that an observer applies, as well as strength or magnitude of the trait presumed present based on observing a sample of behavior
o Trait and state definitions also refer to individual variation
make comparisons with respect to the hypothetical average person
o Samples of behavior:
Direct observation
Analysis of self-report statements
Paper-and-pencil test answers
o Psychological trait covers wide range of possible characteristics; ex:
Intelligence
Specific intellectual abilities
Cognitive style
Psychopathology
o Controversy regarding how psychological traits exist
Psychological traits exist only as constructs: informed, scientific concepts developed or constructed to describe or explain behavior
Can’t see, hear, or touch constructs; we infer their existence from overt behavior: an observable action or the product of an observable action, including test- or assessment-related responses
o Traits not expected to be manifested in behavior 100% of the
time
Seems to be rank-order stability in personality traits: relatively high correlations between trait scores at different time points
o Whether and to what degree a trait manifests itself is dependent on the strength and nature of the situation
- Assumption 2: Psychological Traits and States Can Be Quantified and Measured
o Once it is acknowledged that psychological traits and states do exist, the specific traits and states to be measured need to be defined
What types of behaviors are assumed to be indicative of the trait?
Test developer has to provide test users with a clear operational definition of the construct under study
o After being defined, test developer considers types of item
content that would provide insight into it
Ex: behaviors that are indicative of a particular trait
o Should all questions be weighted the same?
Weighting the comparative value of a test’s items comes about as the result of a complex interplay among many factors:
Technical considerations
The way a construct has been defined (for particular test)
Value society (and test developer) attach to behaviors evaluated
o Need to find appropriate ways to score the test and interpret
results
Cumulative scoring: test score is presumed to represent the strength of the targeted ability or trait or state
The more the testtaker responds in a particular direction (as keyed by the test manual), the higher the testtaker is presumed to be on the targeted trait or ability
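Cumulative scoring as described above amounts to counting keyed responses. The sketch below is a simplified illustration with equally weighted items; the item names, the scoring key, and the responses are all hypothetical.

```python
def cumulative_score(responses, key):
    """Count the responses that match the keyed direction, one point per item."""
    return sum(1 for item, answer in responses.items() if key.get(item) == answer)

# Hypothetical scoring key: the response keyed as indicating the trait
key = {"q1": True, "q2": False, "q3": True, "q4": True}

# Hypothetical testtaker responses
responses = {"q1": True, "q2": True, "q3": True, "q4": False}

score = cumulative_score(responses, key)
print(score)
```

Unequal item weights, which the notes raise as a design question, would replace the constant 1 with a per-item weight.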
- Assumption 3: Test-Related Behavior Predicts Non-Test-Related Behavior
o Objective of test is to provide some indication of some aspects
of the examinee’s behavior
Tasks on some tests mimic the actual behaviors that the test user is attempting to understand
o Obtained behavior is usually used to predict future behavior
o Could also be used to postdict behavior, to aid in the understanding of behavior that has already taken place
o Tools of assessment, such as a diary, or case history data, might
be of great value in such an evaluation
- Assumption 4: Tests and Other Measurement Techniques Have Strengths and Weaknesses
o Competent test users understand a lot about the tests they use
How it was developed
Circumstances under which it is appropriate to administer the test
How test should be administered and to whom
How results should be interpreted
o Understand and appreciate the limitations of the tests they use
- Assumption 5: Various Sources of Error Are Part of the Assessment Process
o Everyday error= mistakes and miscalculations
o Assessment error= a long-standing assumption that factors
other than what a test attempts to measure will influence performance on a test
o Error variance: component of a test score attributable to
sources other than the trait or ability measured
Assessees themselves are sources of error variance
o Classical test theory (CTT)/ True score theory: assumption is
made that each testtaker has a true score on a test that would be obtained but for the action of measurement error
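In symbols, the true score model is commonly written as shown below, where X is the observed score, T the true score, and E the error component. The second line is the usual variance decomposition; it additionally assumes that error is uncorrelated with the true score, which these notes do not state explicitly.

```latex
X = T + E
\sigma_X^2 = \sigma_T^2 + \sigma_E^2
```

The second term on the right of the variance equation corresponds to the error variance defined above.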
- Assumption 6: Testing and Assessment Can Be Conducted in a Fair and Unbiased Manner
o Court challenges to various tests and testing programs have sensitized test developers and users to the societal demand for fair tests used in a fair manner
Publishers strive to develop instruments that are fair when used in strict accordance with guidelines in the test manual
o Fairness related problems/questions:
Culture is different from people whom the test was intended for
Politics
- Assumption 7: Testing and Assessment Benefit Society
o Many critical decisions are based on testing and assessment
procedures
WHAT’S A “GOOD TEST”?
- Criteria
o Clear instruction for administration, scoring, and interpretation
- Reliability
o A “good test”/measuring tool is reliable
Involves consistency: the precision with which the test measures and the extent to which error is present in measurements
Unreliable measurement needs to be avoided
- Validity
o Test is considered valid if it does indeed measure what it purports to measure
o If there is controversy over the definition of a construct then the
validity is sure to be criticized as well
o Questions regarding validity focus on the items that collectively
make up the test
Adequately sample range of areas to measure construct
Individual items contribute to or take away from test’s validity
o Validity may also be questioned on grounds related to the interpretation of test results
- Other Considerations
o “Good test” is one that trained examiners can administer, score and interpret with minimum difficulty
Useful
Yields actionable results that will ultimately benefit individual testtakers or society at large
o Purpose of test is to compare performance of testtaker with performance of other testtakers (contains adequate norms: normative data)
Normative data provide a standard with which measured results can be compared
NORMS
- Norm-referenced testing and assessment: method of evaluation and a way of deriving meaning from test scores by evaluating an individual testtaker’s score and comparing it to scores of a group of testtakers
- Meaning of individual score is relative to other scores on the same test
- Norms (scholarly context): usual, average, normal, standard, expected or typical
- Norms (psychometric context): the test performance data of a particular group of testtakers that are designed for use as a reference when evaluating or interpreting individual test scores
- Normative sample: group of people whose performance on a particular test is analyzed for reference in evaluating the performance of individual testtakers
o Yields a distribution of scores
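To illustrate how an individual score takes its meaning from the normative sample's distribution, the sketch below computes a percentile rank. The normative scores and function name are hypothetical, and the convention used (percentage of normative scores strictly below the individual's score) is one of several in use.

```python
def percentile_rank(score, normative_scores):
    """Percentage of scores in the normative sample that fall below the given score."""
    below = sum(1 for s in normative_scores if s < score)
    return 100 * below / len(normative_scores)

# Hypothetical normative sample of 10 testtakers
normative_scores = [40, 45, 50, 50, 55, 60, 65, 70, 75, 80]

rank = percentile_rank(65, normative_scores)
print(rank)
```

The same raw score of 65 would earn a different percentile rank against a different normative sample, which is why the composition of that sample matters.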
- Norming: refers to the process of deriving norms; particular type of norm derivation
o Race norming: controversial practice of norming on the
basis of race or ethnic background
- Norming a test can be very expensive
- User norms/program norms: consist of descriptive statistics based on a group of testtakers in a given period of time rather than norms obtained by formal sampling methods
- Sampling to Develop Norms
- Standardization: process of administering a test to a representative sample of testtakers for the purpose of establishing norms
o Standardized when has clear, specified procedures
- Sampling
o Developer targets a defined group as the population the test is designed for
All members have at least one common, observable characteristic
o To obtain a distribution of scores, either:
The test is administered to everyone in the targeted population, or
The test is administered to a sample of the population
Sample: portion of the universe of people deemed to be representative of the whole population
Sampling: process of selecting the portion of the universe deemed to be representative of the whole
o Subgroups within a defined population may differ with respect to some characteristics, and it is sometimes essential to have these differences proportionately represented in the sample
Stratified sampling: sample reflects statistics of the whole population; helps prevent sampling bias and ultimately aids in interpretation of findings
Purposive sampling: arbitrarily select a sample we believe to be representative of the population
Incidental/convenience sampling: sample that is convenient or available for use
Very exclusive (contains exclusionary criteria)
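As a rough illustration of proportionate representation in stratified sampling, the sketch below splits a planned sample across subgroups in proportion to their share of the population. The stratum names and counts are hypothetical, and simple rounding is an assumption that may not always add up exactly to the requested sample size.

```python
def proportional_allocation(stratum_sizes, sample_size):
    """Return how many testtakers to draw from each stratum.

    stratum_sizes: dict mapping a (hypothetical) stratum name to its population count.
    Rounding is a simplification; totals can be off by one or two in general.
    """
    total = sum(stratum_sizes.values())
    return {name: round(sample_size * count / total)
            for name, count in stratum_sizes.items()}


# Hypothetical population of 1,000 people in three strata, sample of 50
population = {"urban": 600, "suburban": 300, "rural": 100}
allocation = proportional_allocation(population, 50)
print(allocation)
```

Each stratum keeps the same share in the sample that it has in the population, which is the property the notes describe.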
- TYPES OF STANDARD ERROR:
o STANDARD ERROR OF MEASUREMENT – estimate of the extent to which an observed score deviates from a true score
o STANDARD ERROR OF ESTIMATE – in regression, an estimate of the degree of error involved in predicting the value of one variable from another
o STANDARD ERROR OF THE MEAN – a measure of sampling error
o STANDARD ERROR OF THE DIFFERENCE – estimate of how large a difference between two scores should be before the difference is considered statistically significant
- Developing norms for a standardized test
o Establish a standard set of instructions and conditions under which the test is given; this makes scores of the normative sample more comparable with scores of future testtakers
o Once all data are collected and analyzed, the test developer summarizes the data using descriptive statistics (measures of central tendency and variability)
Test developer needs to provide a precise description of the standardization sample itself
Descriptions of normative samples vary widely in detail
Tracking
- Comparisons are usually with people of the same age
- Children at the same age level tend to go through different growth
patterns
- Pediatricians must know the child’s percentile within a given age
group
- This tendency to stay at about the same level relative to one’s peers is known as tracking (e.g., height and weight)
- Diets may alter this “track”
- Faults: some believe there is an analogy between the rates of physical growth and the rates of intellectual growth
o Some say that children learn at different rates
o This system discriminates against some children
TYPES OF NORMS
o Classification of norms ex: age, grade, national, local,
percentile, etc.
o PERCENTILES
Median = 2nd quartile: the point at or below which 50% of the scores fall and above which the remaining 50% fall
Might wish to divide the distribution of scores into deciles (instead of quartiles): 10 equal parts
The Xth percentile is equal to the score at or below which X% of scores fall
Percentile: an expression of the percentage of people whose score on a test or measure falls below a particular raw score
Percentage correct: refers to the distribution of raw scores: the number of items answered correctly multiplied by 100 and divided by the total number of items (*not the same as percentile)
Percentile is a converted score that refers to a percentage of testtakers
Percentiles are easily calculated and are a popular way of organizing test-related data
Using percentiles with a normal distribution, real differences between raw scores may be minimized near the ends of the distribution and exaggerated in the middle (worsens with highly skewed data)
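The difference between percentage correct and a percentile can be sketched as follows. The function names and the normative sample are hypothetical, and the percentile here uses the "percentage of scores falling below" definition given in these notes.

```python
def percentage_correct(items_correct, total_items):
    """Items answered correctly, multiplied by 100, divided by total items."""
    return 100 * items_correct / total_items


def percentile_rank(score, norm_scores):
    """Percentage of people in the normative sample scoring below `score`."""
    below = sum(1 for s in norm_scores if s < score)
    return 100 * below / len(norm_scores)


# Hypothetical raw scores of a 10-person normative sample on a 40-item test
norm_sample = [12, 15, 15, 18, 20, 22, 25, 27, 30, 34]

pct_correct = percentage_correct(30, 40)      # describes the testtaker's own items
pct_rank = percentile_rank(30, norm_sample)   # describes standing relative to others
print(pct_correct, pct_rank)
```

The same raw score of 30 gives two different numbers because one refers to items and the other refers to testtakers.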
o AGE NORMS
Age-equivalent scores/age norms: indicate the average performance of different samples of testtakers who were at various ages at the time the test was administered
Age norm tables for physical characteristics
“Mental” age vs. physical age (need to identify mental age)
o GRADE NORMS
Grade norms: designed to indicate the average test performance of testtakers in a given school grade
Developed by administering the test to representative samples of children over a range of consecutive grades
Mean or median score for children at each grade level is calculated
Great intuitive appeal
Do not provide info as to the content or type of items that a student could or could not answer correctly
Developmental norms: (ex: grade norms and age norms) term applied broadly to norms developed on the basis of any trait, ability, skill, or other
characteristic that is presumed to develop, deteriorate, or otherwise be affected by chronological age, school grade, or stage of life
o NATIONAL NORMS
National norms: derived from a normative sample that was nationally representative of the population at the time the norming study was conducted
o NATIONAL ANCHOR NORMS
Many different tests purporting to measure the same human characteristics or abilities
National anchor norms: equivalency tables for scores on tests that purport to measure the same thing
Could provide the tool for comparisons
Provides stability to test scores by anchoring them to other test scores
Begins with the computation of percentile norms for each test to be compared
Equipercentile method: equivalency of scores on different tests is calculated with reference to corresponding percentile scores
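A minimal sketch of the equipercentile idea, assuming two small hypothetical normative samples: a score on test A is matched to the test B score that has the closest percentile rank. Real anchor-norm tables are built from much larger samples and usually involve smoothing, which is not shown here.

```python
def percentile_rank(score, norm_scores):
    """Percentage of scores in the normative sample that fall below `score`."""
    below = sum(1 for s in norm_scores if s < score)
    return 100 * below / len(norm_scores)


def equipercentile_equivalent(score_a, norms_a, norms_b):
    """Test-B score whose percentile rank is closest to that of score_a on test A."""
    target = percentile_rank(score_a, norms_a)
    candidates = sorted(set(norms_b))
    return min(candidates, key=lambda s: abs(percentile_rank(s, norms_b) - target))


# Hypothetical normative samples for two tests that purport to measure the same thing
norms_a = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
norms_b = [40, 45, 50, 55, 60, 65, 70, 75, 80, 85]

equivalent = equipercentile_equivalent(22, norms_a, norms_b)
print(equivalent)
```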
o SUBGROUP NORMS
Normative sample can be segmented by any criteria initially used in selecting subjects for the sample
Subgroup norms: result of segmentation; more narrowly defined
o LOCAL NORMS
Local norms: provide normative info with respect to the local population’s performance on some test
Typically developed by test users themselves
- Fixed Reference Group Scoring Systems
o Norms provide context for interpreting the meaning of a test score
o Fixed reference group scoring system: distribution of scores obtained on the test from one group of testtakers (fixed reference group) is used as the basis for the calculation of test scores for future administrations of the test
Ex: SAT (first administered in 1926)
NORM-REFERENCED VERSUS CRITERION-REFERENCED EVALUATION
- One way to derive meaning from a test score is to evaluate it in relation to other scores on the same test (norm-referenced)
- Criterion-referenced: derive meaning from a test score by evaluating it on the basis of whether or not some criterion has been met
o Criterion: a standard on which a judgment or decision may
be based
- Criterion-referenced testing and assessment: method of evaluation and way of deriving meaning from test scores by evaluating an individual’s score with reference to a set standard (ex: to drive, one must pass a driving test)
o Derives from values and standards of an individual or
organization
o Also called Domain/content-referenced testing and
assessment
o Critique: if followed strictly, important info about
individual’s performance relative to others can be potentially lost
Culture and Inference
- Culture is a factor in test administration, scoring and interpretation
- Test user should do research in advance on the test’s available norms to check how appropriate it is for the targeted testtaker population
o Helpful to know about the culture of the testtaker
CORRELATION AND INFERENCE
CORRELATION
Degree and direction of correspondence between two things.
Correlation coefficient (r) – expresses a linear relationship between two continuous variables
o Numerical index that tells us the extent to which X and Y are “co-related”
Positive correlation: high scores on Y are associated with high scores on X, and low scores on Y correspond to low scores on X
Negative correlation: higher scores on Y are associated with lower scores on X, and vice versa
No correlation: the variables are not related
Values of r range from -1 to +1
Correlation does not imply causation.
o E.g., weight, height, intelligence
PEARSON r
Pearson Product Moment Correlation Coefficient
Devised by Karl Pearson
Used when the relationship between the two variables is linear and both variables are continuous
Coefficient of Determination (r^2) – indication of how much variance is shared by the X and the Y variables
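For a concrete feel of Pearson r and the coefficient of determination, here is a small sketch using the deviation-score formula (sum of cross products divided by the square root of the product of the two sums of squares). The data are made up for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson product moment correlation from deviation scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sum_products = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return sum_products / math.sqrt(ss_x * ss_y)


# Hypothetical paired scores
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
r_squared = r ** 2   # coefficient of determination: proportion of shared variance
print(round(r, 4), round(r_squared, 4))
```

With these numbers, about 60% of the variance is shared between X and Y.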
SPEARMAN RHO
Rank order correlation coefficient
Developed by Charles Spearman
Used when the sample size is small and when both sets of measurements are in ordinal form (ranking form)
BISERIAL CORRELATION
Expresses the relationship between a continuous variable and an artificial dichotomous variable
o If the dichotomous variable had been true, then we would use the point biserial correlation
o When both variables are dichotomous and at least one of the dichotomies is true, then the association between them can be estimated using the phi coefficient
o If both dichotomous variables are artificial, we might use a special correlation coefficient – tetrachoric correlation
REGRESSION
Analysis of relationships among variables for the purpose of understanding how one variable may predict another
SIMPLE REGRESSION: one IV (X) and one DV (Y)
- Regression line: defined as the best-fitting straight line through a set
of points in a scatter diagram
o Found by using the principle of least squares, which
minimizes the squared deviation around the regression line
Primary use: To predict one score or variable from another
Standard error of estimate: the higher the correlation between X and Y, the greater the accuracy of the prediction and the smaller the SEE.
MULTIPLE REGRESSION: The use of more than one score to predict Y.
Regression coefficient (b): slope of the regression line
o Ratio of the sum of squares for the covariance to the sum of squares for X
o Sum of squares is defined as the sum of the squared
deviations around the mean
o Covariance is used to express how much two measures
covary, or vary together
Slope describes how much change is expected in Y each time X increases by one unit
Intercept (a) is the value of Y when X is 0
o The point at which the regression line crosses the Y axis
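The slope and intercept described above can be sketched directly from their definitions: b is the sum of cross products divided by the sum of squares for X, and a is the mean of Y minus b times the mean of X. The data and the function name are illustrative only.

```python
def fit_line(x, y):
    """Least-squares intercept (a) and slope (b) for predicting Y from X."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of cross products (numerator of the covariance)
    sp_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Sum of squared deviations of X around its mean
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    b = sp_xy / ss_x
    a = mean_y - b * mean_x   # value of Y' when X is 0
    return a, b


# Hypothetical paired scores
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = fit_line(x, y)
predicted_for_6 = a + b * 6   # Y' = a + bX
print(round(a, 2), round(b, 2), round(predicted_for_6, 2))
```

Here each one-unit increase in X is expected to raise Y by 0.6, starting from 2.2 when X is 0.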
THE BEST-FITTING LINE
The difference between the observed and predicted score (Y - Y’) is called the residual
The best-fitting line is found by squaring each residual and keeping the sum of these squared residuals as small as possible
o This is the principle of least squares
Correlation is a special case of regression in which the scores for both variables are in standardized, or Z, units
In correlation, the intercept is always 0
Pearson product moment correlation coefficient is a ratio used to determine the degree of variation in one variable that can be estimated from knowledge about variation in the other variable
Testing the Statistical Significance of a Correlation Coefficient
- Begin with the null hypothesis that there is no relationship between
variables
- Null hypothesis is rejected if there is evidence that the association between two variables is significantly different from 0
- t distribution is not a single distribution, but a family of distributions, each with its own degrees of freedom
- Degrees of freedom are defined as the sample size minus 2, or N-2
- Two-tailed test
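One common way to carry out this test (not spelled out in the notes, so treat it as an assumption) is to convert r to a t statistic with N - 2 degrees of freedom, t = r * sqrt(N - 2) / sqrt(1 - r^2), and compare it with a tabled critical value. The example values and the critical value of 2.060 (two-tailed, alpha = .05, df = 25) are supplied for illustration.

```python
import math


def t_for_r(r, n):
    """t statistic and degrees of freedom for testing H0: no relationship."""
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1 - r ** 2)
    return t, df


# Hypothetical study: r = .50 observed in a sample of 27 people
t_value, df = t_for_r(0.5, 27)

# Assumed critical value from a standard t table (two-tailed, alpha = .05, df = 25)
CRITICAL_T = 2.060
reject_null = abs(t_value) > CRITICAL_T
print(round(t_value, 3), df, reject_null)
```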
How to Interpret a Regression Plot
- Regression plots are pictures that show the relationship between
variables
- Common use of correlation is to determine the criterion validity
evidence for a test, or the relationship between a test score and some well-defined criterion
- Without test information, the best prediction is the middle level of the criterion (e.g., enjoyableness) because it is the one observed most frequently – this prediction is normative because it uses info gained from representative groups
- Using the test as a predictor is not as good as perfect prediction, but it is still better than using the normative info
- A flat regression line such as in 3.9 shows that the test score tells us nothing about the criterion beyond the normative info
TERMS AND ISSUES IN THE USE OF CORRELATION
Residual
- Difference between the predicted and the observed values is called
the residual
o Y-Y’
- Important property of residual is that the sum of the residuals always
equals 0
- Sum of the squared residuals is the smallest value according to the principle of least squares
Standard Error of Estimate
- Standard deviation of the residuals is the standard error of estimate
- A measure of the accuracy of prediction
- Prediction is most accurate when the standard error of estimate is relatively small
Coefficient of Determination
- Correlation coefficient squared is known as the coefficient of determination
- Tells us the proportion of the total variation in scores on Y that we know as a function of information about X
Coefficient of Alienation
- Coefficient of alienation is a measure of nonassociation between two variables
- Square root of 1 - r^2, where r^2 is the coefficient of determination
- High value means there is a high degree of nonassociation between 2 variables
Shrinkage
- Tendency to overestimate the relationship, particularly if the sample of subjects is small
- Shrinkage is the amount of decrease observed when a regression equation is created for one population and then applied to another
Cross Validation
- Use regression equation to predict performance in a group of subjects other than the ones on which the equation was originally developed
- A standard error of estimate can then be obtained for the relationship between the values predicted by the equation and the values actually observed; this process is called cross validation
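Several of the terms in this part (residual, standard error of estimate, coefficient of determination, coefficient of alienation) can be computed from one small data set. This is a sketch under stated assumptions: the data are hypothetical, and the standard error of estimate divides by N - 2, which is one common convention.

```python
import math

# Hypothetical paired scores
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least-squares slope and intercept
ss_x = sum((xi - mean_x) ** 2 for xi in x)
ss_y = sum((yi - mean_y) ** 2 for yi in y)
sp_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
b = sp_xy / ss_x
a = mean_y - b * mean_x

# Residuals: observed minus predicted (Y - Y')
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
ss_residual = sum(e ** 2 for e in residuals)

# Standard error of estimate (assumes the N - 2 denominator)
see = math.sqrt(ss_residual / (n - 2))

# Coefficient of determination and coefficient of alienation
r_squared = 1 - ss_residual / ss_y
alienation = math.sqrt(1 - r_squared)

print(round(sum(residuals), 10), round(see, 4), round(r_squared, 2), round(alienation, 4))
```

The residuals sum to zero (up to floating-point error), as the notes state, and the determination and alienation values move in opposite directions.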
The Correlation-Causation Problem
- Experiments are required to determine whether manipulation of one
variable causes changes in another variable
- A correlation alone does not prove causality, although it might lead to
other research that is designed to establish the causal relationships between variables
Third Variable Explanation
- A third variable, e.g., poor social adjustment, may cause both TV viewing and aggression
- External influence is the third variable
Restricted Range
- Correlation and regression use variability on one variable to explain
variability on a second variable
- Restricted range problem: correlation requires variability; if the
variability is restricted, then significant correlations are difficult to find
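The restricted range problem can be illustrated by computing r on a full set of scores and again on only the high scorers. The data below are invented so that the effect is easy to see; the size of the drop in real data will differ.

```python
import math


def pearson_r(x, y):
    """Pearson r from deviation scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sp = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return sp / math.sqrt(ss_x * ss_y)


# Hypothetical test scores (x) and criterion scores (y) for ten people
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]

r_full = pearson_r(x, y)

# Keep only people with x >= 7, e.g., only those who were selected
pairs = [(a, b) for a, b in zip(x, y) if a >= 7]
r_restricted = pearson_r([a for a, _ in pairs], [b for _, b in pairs])

print(round(r_full, 4), round(r_restricted, 4))
```

The same underlying relationship looks much weaker once the variability on X is cut down.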
Multivariate Analysis
- Multivariate analysis considers the relationship among combinations of three or more variables
General Approach
- Linear combination of variables is a weighted composite of the
original variables
CHAPTER 5: RELIABILITY
RELIABILITY
- Dependability and consistency
- Error implies that there will always be some inaccuracy in our
measurements
- Tests that are relatively free of measurement error are deemed to be
reliable
- Reliability estimates in the range of .70 to .80 are good enough for most purposes in basic research
- Reliability coefficient: an index that indicates the ratio between the
true score variance on a test and the total variance
- HISTORY OF RELIABILITY:
o Charles Spearman (1904): The Proof and Measurement of
Association between Two Things
o Then Thorndike
o Item response theory has taken advantage of computer
technology to advance psychological measurement significantly
o Based on Spearman’s ideas
CLASSICAL TEST THEORY
- X = T + E (observed score = true score + error)
o Assumes that each person has a true score that would be obtained if there were no errors in measurement
o Difference between the true score and the observed score
results from measurement error
o Assumption here is that errors of measurement are
random
o Basic sampling theory tells us that the distribution of
random errors is bell-shaped
The center of the distribution should represent the true score, and the dispersion around the mean of the distribution should display the distribution of sampling errors
o Classical test theory assumes that the true score for an
individual will not change with repeated applications of the same test
o Variance: standard deviation squared. It is useful because it can be broken into components:
o True variance: variance from true differences, which are assumed to be stable
o Error variance: variance from random, irrelevant sources
- Standard error of measurement: because we assume that the distribution of random errors will be the same for all people, classical test theory uses the standard deviation of errors as the basic measure of error
o Standard error of measurement tells us, on the average,
how much a score varies from the true score
o Standard deviation of the observed score and the
reliability of the test are used to estimate the standard error of measurement
- Reliability: proportion of the total variance attributed to true
variance.
o the greater portion of total variance attributed to true
variance, the more reliable the test
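The two ideas above (reliability as the share of total variance that is true variance, and the standard error of measurement estimated from the observed standard deviation and the reliability) can be sketched as below. The SEM formula used, SD * sqrt(1 - reliability), is the usual classical test theory expression; the numbers are hypothetical.

```python
import math


def reliability(true_variance, error_variance):
    """Proportion of total variance attributed to true variance."""
    return true_variance / (true_variance + error_variance)


def standard_error_of_measurement(sd_observed, reliability_coefficient):
    """SEM = SD of observed scores * sqrt(1 - reliability)."""
    return sd_observed * math.sqrt(1 - reliability_coefficient)


# Hypothetical variance components
r_xx = reliability(true_variance=80, error_variance=20)

# Hypothetical test with SD = 15 and reliability = .91
sem = standard_error_of_measurement(15, 0.91)
print(r_xx, round(sem, 2))
```

A more reliable test gives a smaller SEM, so observed scores are expected to fall closer to true scores.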
- Measurement error: refers to collectively, all of the factors associated
with the process of measuring some variable, other than the variable being measured
o Random error: a source of error in measuring a targeted
variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process
This source of error fluctuates from one testing situation to another with no discernible pattern that would systematically raise or lower scores
o Systematic Error:
A source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured
Error is predictable and fixable
Does not affect score consistency
SOURCES OF ERROR VARIANCE
- TEST CONSTRUCTION
o Item sampling or content sampling – refers to variation among items within a test as well as to variation among items between tests
The extent to which a test taker’s score is affected by the content sampled on a test and by the way the content is sampled (that is, the way in which the item is constructed) is a source of error variance
- TEST ADMINISTRATION
o May influence the test taker’s attention or motivation
o Environment variables, test taker’s variables, examiner variables. Level of professionalism
- TEST SCORING AND INTERPRETATION
o Computer scoring and a growing reliance on objective,
computer-scorable items have virtually eliminated error variance caused by scorer differences
o However, other tools of assessment still require scoring by
trained personnel
o If subjectivity is involved in scoring, then the scorer can be
a source of error variance
o Despite rigorous scoring criteria set forth in many of the better-known tests of intelligence, examiners occasionally are still confronted by situations where an examinee’s response lies in a gray area
TEST-RETEST RELIABILITY
- Also known as time-sampling reliability
- Correlating pairs of scores from the same group on two different administrations of the same test
- Appropriate when measuring something that is relatively stable over time
- Sources of Error variance:
o Passage of time: the longer the time that passes, the greater the likelihood that the reliability coefficient will be lower.
o Coefficient of stability: name given to the estimate of test-retest reliability when the interval between testing is greater than 6 months
- Consider possibility of carryover effect: occurs when first testing
session influences scores from the second session
- If something affects all the test takers equally, then the results are uniformly affected and no net error occurs
- Practice on the test may produce this effect
- Practice can also affect tests of manual dexterity
- Time interval between testing sessions must be selected and evaluated carefully
- Poor test-retest correlations do not always mean that a test is unreliable – they may suggest that the characteristic under study has changed
PARALLEL-FORM OR ALTERNATE FORMS RELIABILITY
- compares two equivalent forms of a test that measure the same
attribute
- Two forms should be equally constructed (same format, etc.)
- When two forms of the test are available, one can compare performance on one form versus the other – equivalent forms reliability or parallel forms
- Coefficient of equivalence: the degree of relationship between various forms of a test, evaluated by means of an alternate-forms or parallel-forms coefficient of reliability
- Parallel forms: for each form of the test, the means and variances of observed test scores are equal
- Alternate forms: different versions of a test that have been constructed so as to be parallel
- (1) two test administrations with the same group are required
- (2) test scores may be affected by factors such as motivation etc.
- Problem: developing a new version of a test
INTERNAL CONSISTENCY
- How well does each item measure the content/construct under
consideration
- How consistent the items are together
- Used when tests are administered once
- If all items on a test measure the same construct, then it has a good
internal consistency
SPLIT-HALF RELIABILITY
- Correlating two pairs of scores obtained from equivalent halves of a
single test administered once.
- This is useful when it is impractical to assess reliability with two tests
or to administer test twice
- Results of one half of the test are then compared with the results of
the other
- Rules in splitting forms into half:
o Do not divide test in the middle because it would lower
the reliability
o Different amounts of anxiety and differences in item
difficulty shall also be considered
o Randomly assign items to one or the other half of the test
o Use the odd-even system: one subscore is obtained for the odd-numbered items in the test and another for the even-numbered items
- To correct for half-length, apply the Spearman-Brown formula, which allows you to estimate what the correlation between the two halves would have been if each half had been the length of the whole test
o Can also be used if the test user wishes to shorten a test
o Used to determine the number of items needed to attain a desired level of reliability
- Reliability increases as the test length increases
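The Spearman-Brown formula mentioned above is r_SB = n * r / (1 + (n - 1) * r), where n is the factor by which the test length changes (n = 2 corrects a split-half correlation). The sketch below also rearranges the formula to estimate how much longer a test must be to reach a desired reliability. Function names and values are illustrative.

```python
def spearman_brown(r, length_factor=2):
    """Estimated reliability after changing test length by `length_factor`."""
    return length_factor * r / (1 + (length_factor - 1) * r)


def length_factor_needed(r_current, r_desired):
    """Factor by which the test must be lengthened to reach r_desired."""
    return r_desired * (1 - r_current) / (r_current * (1 - r_desired))


# Split-half correlation of .60 corrected to full length
full_length = spearman_brown(0.60)

# Shortening a test with reliability .75 to half its length
half_length = spearman_brown(0.75, length_factor=0.5)

# How much longer must a test with reliability .60 be to reach .75?
factor = length_factor_needed(0.60, 0.75)
print(round(full_length, 2), round(half_length, 2), round(factor, 2))
```

Multiplying the current number of items by the returned factor gives the approximate number of items needed, assuming the added items are comparable to the existing ones.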
KUDER-RICHARDSON FORMULAS OR KR20/KR21
- Kuder-Richardson technique simultaneously considers all possible
ways of splitting the items
- The formula for calculating the reliability of a test in which the items are dichotomous, scored 0 or 1, is the Kuder-Richardson 20 (see p.114)
- KR21 was also introduced – it uses an approximation of the sum of the pq products, based on the mean test score
CRONBACH ALPHA
- Cronbach developed a formula that estimates the internal
consistency of tests in which the items are not scored as 0 or 1 – a more general reliability estimate, which he called coefficient alpha
- Sum the individual item variances
o Most general method of finding estimates of reliability
through internal consistency
- Domain sampling: define a domain that represents a single trait or
characteristic, and each item is an individual sample of this general characteristic
- Factor analysis deals with the situation in which a test apparently
measures several different characteristics
o Good for the process of test construction
- Most widely used as a measure of reliability because it requires only
one administration of the test
- Ranges from 0 to 1 “bigger is always better”
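Coefficient alpha can be sketched as (k / (k - 1)) * (1 - sum of item variances / variance of total scores), where k is the number of items. With items scored 0 or 1 and population variances (an assumption made here), each item variance equals pq, so the same function reproduces KR20. The response matrix is hypothetical.

```python
def variance(values):
    """Population variance (divides by N); an assumed convention."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def cronbach_alpha(item_scores):
    """item_scores: one row per testtaker, one column per item."""
    k = len(item_scores[0])
    item_variances = [variance([row[i] for row in item_scores]) for i in range(k)]
    total_variance = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses of five testtakers to four dichotomous items
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(responses)
print(round(alpha, 2))
```

The same function accepts items that are not scored 0 or 1 (for example, 1 to 5 ratings), which is the more general case alpha was designed for.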
Other Methods of Estimating Internal Consistencies
- Inter-item consistency: refers to the degree of correlation among all
the items on a scale
o A measure of inter-item consistency is calculated from a
single administration of a single form of a test
o An index of inter-item consistency, in turn, is useful in
assessing the homogeneity of the test
o Tests are said to be homogenous if they contain items that
measure a single trait
o Definition: the degree to which a test measures a single
factor
o Heterogeneity: degree to which a test measures different
factors
o Ex: homogeneous = test that assesses knowledge only of 3-D television repair skills vs. a general electronics repair test (heterogeneous)
o The more homogenous a test is, the more inter-item
consistency it can be expected to have
o Test homogeneity is desirable because it allows relatively
straightforward test-score interpretation
o Test takers with the same score on a homogenous test
probably have similar abilities in the area tested
o Test takers with the same score on a heterogeneous test
may have quite different abilities
o However, homogeneous testing is often an insufficient tool for measuring multifaceted psychological variables such as intelligence or personality
Measures of Inter-Scorer Reliability
- In some types of tests under some conditions, the score may be more a
function of the scorer than of anything else
- Inter-scorer reliability: the degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure
- Coefficient of inter-scorer reliability: coefficient of correlation used to determine the degree of consistency among scorers in the scoring of a test
- Kappa statistic is the best method for assessing the level of agreement among several observers
o Indicates the actual agreement as a proportion of the potential agreement following the correction for chance agreement
o Cohen’s Kappa – 2 raters
o Fleiss’ Kappa – 3 or more raters
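For two raters, Cohen's kappa is (observed agreement - chance agreement) / (1 - chance agreement). The sketch below uses made-up pass/fail ratings (1 = pass, 0 = fail) from two hypothetical scorers.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same cases."""
    n = len(rater_a)
    # Proportion of cases on which the raters actually agree
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    # Agreement expected by chance, from each rater's category proportions
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of ten responses by two scorers
scorer_1 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
scorer_2 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]

kappa = cohens_kappa(scorer_1, scorer_2)
print(round(kappa, 2))
```

The raters agree on 80% of cases, but because 50% agreement would be expected by chance, kappa is noticeably lower than the raw agreement rate.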
HOMOGENEITY VS. HETEROGENEITY OF TEST ITEMS
- Homogeneous items have a high degree of reliability
DYNAMIC VS. STATIC CHARACTERISTICS
- Dynamic: trait, state, ability presumed to be ever-changing as a function of situational and cognitive experiences
- Static: trait, state, ability relatively unchanging
RESTRICTION OR INFLATION OF RANGE
- If it is restricted, reliability tends to be lower.
- If it is inflated, reliability tends to be higher.
SPEED TESTS VS. POWER TESTS
- Speed test: items are homogeneous and easy, but the time limit is short
- Power test: few items, but more complex.
CRITERION-REFERENCED TESTS
- Provide an indication of where a testtaker stands with respect to some variable or criterion.
- Tends to contain material that has been mastered in hierarchical fashion.
- Scores here tend to be interpreted in pass-fail terms.
- Measure of reliability depends on the variability of the test scores: how different the scores are from one another.
The Domain Sampling Model
- This model considers the problems created by using a limited number
of items to represent a larger and more complicated construct
- Our task in reliability analysis is to estimate how much error we would
make by using the score from the shorter test as an estimate of your true ability
- Conceptualizes reliability as the ratio of the variance of the observed
score on the shorter test and the variance of the long-run true score
- Reliability can be estimated from the correlation of the observed test score with the true score
Item Response Theory
- Classical test theory requires that exactly the same test items be
administered to each person – BAD
- Item response theory (IRT) is newer – computer is used to focus on
the range of item difficulty that helps assess an individual’s ability level
o More reliable estimate of ability is obtained using a
shorter test with fewer items
o Takes a lot of items and effort
Generalizability theory
- Based on the idea that a person’s test scores vary from testing to testing because of variables in the testing situation
- Instead of conceiving of all variability in a person’s scores as error, Cronbach encouraged test developers and researchers to describe the details of the particular test situation or universe leading to a specific test score
- This universe is described in terms of its facets: which include things like
the number of items in the test, the amount of training the test scorers have had, and the purpose of the test administration
- According to generalizability theory, given the exact same conditions of all
the facets in the universe, the exact same test score should be obtained
- Universe score: the test score obtained; it is analogous to a true score in the true score model
- Cronbach suggested that tests be developed with the aid of a
generalizability study followed by a decision study
- Generalizability study: examines how generalizable scores from a
particular test are if the test is administered in different situations
- How much of an impact different facets of the universe have on the test
score
- Ex: is the test score affected by group as opposed to individual
administration
- Coefficients of generalizability: represent the influence of particular facets on the test score. These coefficients are similar to reliability coefficients in the true score model
- Decision study: developers examine the usefulness of test scores in helping the test user make decisions
- The decision study is designed to tell the test user how test scores should
be used and how dependable those scores are as a basis for decisions, depending on the context of their use
What to Do About Low Reliability
- Two common approaches are to increase the length of the test and to
throw out items that run down the reliability
- Another procedure is to estimate what the true correlation would have been if the test did not have measurement error
Increase the Number of Items
- The larger the sample, the more likely that the test will represent the true characteristic
o This could entail a long and costly process, however
- Prophecy formula
Factor and Item Analysis
- Reliability of a test depends on the extent to which all of the items
measure one common characteristic
- Factor analysis
o Tests are most reliable if they are unidimensional : one
factor should account for considerably more of the variance than any other factor
- Or examine the correlation between each item and the total score for
the test
o Called discriminability analysis: when the correlation between the performance on a single item and the total test score is low, the item is probably measuring something different from the other items on the test
Correction for Attenuation
- Potential correlations are attenuated, or diminished, by measurement error
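The usual correction for attenuation (stated here as an assumption, since the notes do not give the formula) divides the observed correlation between two measures by the square root of the product of their reliabilities. The values below are hypothetical.

```python
import math


def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimated correlation between X and Y if both were measured without error.

    r_xy: observed correlation; r_xx, r_yy: reliabilities of the two measures.
    """
    return r_xy / math.sqrt(r_xx * r_yy)


# Hypothetical observed correlation and reliabilities
estimated_true_r = correct_for_attenuation(r_xy=0.36, r_xx=0.81, r_yy=0.64)
print(round(estimated_true_r, 2))
```

The less reliable the two measures are, the larger the gap between the observed correlation and the estimated true correlation.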
CHAPTER 6: VALIDITY
The Concept of Validity
- Validity: as applied to a test, is a judgment or estimate of how well a test measures what it purports to measure in a particular context
o Judgment based on evidence about the appropriateness of
inferences drawn from test scores
o Validity of test must be shown from time to time to account for culture and advancement
- Inference: a logical result or deduction
- “Acceptable” or “weak” validity of tests and test scores
- Validation: process of gathering and evaluating evidence about validity
o Test user and testtaker both have roles in validation of test
o Test users may conduct their own validation studies: may yield insights regarding a particular population of testtakers as compared to the norming sample (in manual)
o Local validation studies: absolutely necessary when test user plans to alter in some way the format, instructions, language, or content of the test
- Types of Validity (Trinitarian view) *not mutually exclusive; all contribute to a unified picture of a test’s validity / critique: approach is fragmented and incomplete
o Content validity: measure of validity based on an evaluation of
the subjects, topics, or content covered by the items in the test
o Criterion-related validity: measure of validity obtained by
evaluating the relationship of scores obtained on the test to scores on other tests or measures
o Construct validity: measure of validity that is arrived at by executing a comprehensive analysis of the following (umbrella validity: every other variety of validity falls under it):
How scores on test relate to other test scores and measures
How scores on test can be understood within some theoretical framework for understanding the construct that the test was designed to measure
- Strategies: ways of approaching the process of test validation
o Content validation strategies
o Criterion-related validation strategies
o Construct validation strategies
- Face Validity
o Face validity: relates more to what a test appears to measure to
the person being tested than to what the test actually measures
o Judgment concerning how relevant the test items appear to be, usually from testtaker, not test user
o Lack of face validity = lack of confidence in perceived effectiveness of test, which decreases testtaker’s motivation/cooperation *test may still be useful
- Content validity
o Content validity: a judgment of how adequately a test samples
behavior representative of the universe of behavior that the test was designed to sample
Ideally, test developers have a clear vision of the construct being measured; that clarity is reflected in the content validity of the test
o Test blueprint: structure of the evaluation; a plan regarding the
types of information to be covered by the items, the number of items tapping each area of coverage, the organization of the items in the test, etc.
Behavior observation is a technique frequently used in test blueprinting
o The quantification of content validity
Important in employment settings, where tests are used to hire and promote
One method: gauging agreement among raters or judges regarding how essential a particular item is (C.H. Lawshe)
“Is the skill or knowledge measured by this item…
o Essential
o Useful but not essential
o Not necessary
…to the performance of the job?”
Content validity ratio (CVR):
CVR = (ne – N/2) / (N/2)
o CVR: content validity ratio
o ne: number of panelists stating “essential”
o N: total number of panelists
CVR is calculated for each item
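The CVR formula above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name, the item labels, and the panel counts are made up for the example and do not come from Lawshe or from these notes.

```python
def content_validity_ratio(n_essential: int, n_panelists: int) -> float:
    """Return CVR = (ne - N/2) / (N/2) for a single item.

    n_essential: number of panelists rating the item "essential" (ne)
    n_panelists: total number of panelists (N)
    """
    if n_panelists <= 0:
        raise ValueError("n_panelists must be positive")
    if not 0 <= n_essential <= n_panelists:
        raise ValueError("n_essential must be between 0 and n_panelists")
    half = n_panelists / 2
    return (n_essential - half) / half


# Hypothetical panel of 10 judges; values are "essential" counts per item.
essential_counts = {"item_1": 9, "item_2": 5, "item_3": 2}
total_panelists = 10

# CVR is calculated separately for each item.
cvr_by_item = {
    item: content_validity_ratio(count, total_panelists)
    for item, count in essential_counts.items()
}

for item, cvr in cvr_by_item.items():
    print(f"{item}: CVR = {cvr:+.2f}")
```

With these made-up counts, CVR is positive when more than half of the panel rates the item "essential" (item_1), zero when exactly half does (item_2), and negative when fewer than half do (item_3).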
o Culture and the relativity of content validity
Tests are often thought of as either valid or invalid
What constitutes historical fact depends to some extent on who is writing the history
Cultural relativity
Politics (politically correct)
Criterion-Related Validity
- Criterion-related validity: judgment of how adequately a test score can be used to infer an individual’s most probable standing on some measure of interest (measure of interest being the criterion)
- 2 types:
o Concurrent validity: index of the degree to which a test score is
related to some criterion measure obtained at the same time (concurrently)
o Predictive validity: index of the degree to which a test score
predicts some criterion measure
- What Is a Criterion?
o Criterion: a standard on which a judgment or decision may be
based; standard against which a test or a test score is evaluated (criterion-related validity)
o Characteristics of criterion
Relevancy: pertinent or applicable to the matter at hand
Validity (for the purpose for which it is being used)
Uncontaminated
Criterion contamination: term applied to a criterion measure that has been based, at least in part, on predictor measures
- Concurrent Validity
o Test scores are obtained at about the same time as the criterion
measures are obtained; measures of the relationship between the test scores and the criterion provide evidence of concurrent validity
o Indicate the extent to which test scores may be used to
estimate an individual’s present standing on a criterion
o Once validity of inference from test scores is established, the test provides a faster,
less expensive way to offer a diagnosis or a classification decision
o Concurrent validity of a test can be explored with respect to
another test
Prior research must have satisfactorily demonstrated the first test’s validity
First test = validating criterion
- Predictive validity
o Test scores may be obtained at one time and the criterion
measures obtained at a future time, usually after some intervening event has taken place
Intervening event: training, experience, therapy, medication, etc.
Measures of the relationship between the test scores and a criterion measure obtained at a future time provide an indication of the predictive validity of the test (how accurately scores on the test predict some criterion measure)
o Ex: SAT score and freshman GPA
o Judgments of criterion-related validity are based on 2 types of statistical
evidence:
The validity coefficient
Validity coefficient: correlation
coefficient that provides a measure of the relationship between test scores and scores on the criterion measure
Ex: Pearson correlation coefficient (r) used as the validity coefficient relating 2 measures
Affected by restriction or inflation of range
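As a rough illustration of the validity coefficient and of restriction of range, the sketch below computes Pearson r between test scores and criterion scores, first for a full sample and then for only the high scorers. All scores, the cutoff of 7, and the function name are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical data: predictor (test) scores and criterion scores.
test_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
criterion_scores = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]

# Validity coefficient for the full range of test scores (about 0.94 here).
r_full = pearson_r(test_scores, criterion_scores)

# Restriction of range: keep only testtakers scoring at or above a cutoff,
# as happens when only selected applicants ever get a criterion score.
cutoff = 7
kept = [(t, c) for t, c in zip(test_scores, criterion_scores) if t >= cutoff]
r_restricted = pearson_r([t for t, _ in kept], [c for _, c in kept])  # 0.60 here

print(f"full range:       r = {r_full:.2f}")
print(f"restricted range: r = {r_restricted:.2f}")
```

In this made-up sample the same underlying relationship yields a noticeably smaller coefficient once the range of test scores is restricted, which is why a validity coefficient should be read alongside the range of scores it was computed on.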