PSYCHOLOGICAL ASSESSMENT
Conceptual Paradigm for Measurement and Evaluation
* Has absolute zero: weight – a value of 0 means there is no weight.
* Has no absolute zero: temperature – a value of 0 does not mean there is no temperature.
Normal distribution of scores – the mean, median, and mode (measures of central tendency) are all the same.
Non-normal distribution of scores – skewed.
Psychological Tests
Objective Tests
- Standardized: test administration, scoring, interpreting test scores
- Limited number of responses – multiple choice; true or false
- Group tests
- Norms – where we base the scores of the test takers; transform scores into a meaningful scale
  > Norm-referenced test (NRT) – ex. age norms
  > Criterion-referenced test (CRT) – ex. how would we know if a basketball player is skillful? -> sharpshooter; there is a certain criterion to be met
Projective Tests (WIDU)
- Wishes
- Intrapsychic conflict – conflict between desires and morals
- Desires
- Unconscious motives
- Subjectivity on test – interpretation / clinical judgment
- Self-administered / individual tests
- Unlimited number of responses
Medium of Psychological Tests
- Paper and pencil
- Objects: wooden blocks, puzzles
- Machine: galvanic skin response, EEG, CT scan
- Computer
Battery of Tests – sets of tests
Evaluation (RAP) – Recommendation, Action Plan, Program Development
Assessment – various techniques (DITO): Documents, Interview, Test, Observation
Battery Tests – series of tests
Test – single measure
Measurement – scales (IRON): Interval, Ratio, Ordinal, Nominal
Samples of Behavior:
- Mental Abilities
- Personality
Personality
- Traits
- States
- Types – MBTI
- Aptitude
- Interest
- Values
Mental Abilities
- General Intelligence (g) – IQ
- Specific Intelligence (s) – Non-verbal IQ
- Multiple Intelligence
Psychopathology
- Diagnosis: classification, severity
- Prognosis: predicting the development of the disorder
Measurement (IRON)
Parametric: normal distribution of scores (Pearson's r)
Non-parametric: non-normal distribution of scores (Spearman's rho; chi-square for nominal data)
Interval: temperature, time, (IQ) – has no absolute zero
Ratio: weight, height – has absolute zero
Ordinal: rank, positions, Likert scale, birth order
Nominal: sex, civil status – classifying
Psychological Tests
Ability Tests
- Intelligence Tests
  - Verbal intelligence
  - Non-verbal intelligence
  Ex. Wechsler Adult Intelligence Scale, Stanford-Binet Intelligence Scale, Culture Fair Intelligence Test
- Achievement Tests – measure the extent of one's knowledge in various academic subjects (what has been learned?)
  Ex. Stanford Achievement Test in reading
- Aptitude Tests (predicting) – various skills / competencies
  Ex. Differential Aptitude Test
Personality Tests
- Objective
- Traits / domains or factors
  Ex. Myers-Briggs Type Indicator
* Usually, no right or wrong answers
Results are integrated into a single score interpretation.
Assessment Techniques (DITO)
- Documents – records, protocols, collateral reports
- Interviews – interview responses, screening
- Tests – initial assessment > verification; forms: written, verbal, visual
- Observation – behavioral observation, observation checklist
Evaluation – Recommendation, Action Plan, Program Development
- Summarizing results of assessment
Test to SSCCRREEN
- Screen applicants
- Self-understanding
- Classify people
- Counsel individuals
- Retain, dismiss, or promote employees
- Research for programs, test construction
- Evaluate performance for decision-making
- Examine and gauge abilities
- Need for diagnosis and intervention
VALIDITY – the test measures what it purports to measure
Content Validity
- Degree to which the test items represent the essence, the topics, and the areas that the test is designed to measure (appropriate domain)
- Primary concern of test developers, because the content of the items is what really reflects the property intended to be measured
- Ex. achievement, aptitude, personality tests
- Table of specifications (TOS; blueprint) (under analysis)
TOS – generate items → checked/validated by (at least 3) experts, a.k.a. "raters" → domains
Item analysis – focuses on the items themselves
o Ability, aptitude tests (tests that have right and wrong answers)
Factor analysis – focuses on the domains (whether a factor really is a factor)
o Personality tests
o Uses Cronbach's alpha
Procedures on how to achieve a high degree of content validity
1. Pre-survey or Review of Related Literature
- Focus on the theoretical constructs that are related to the test you are planning to make, tests used, purpose of the said test, areas covered, format, scaling techniques, etc.
- This may start the development phase of the instrument you are to construct.
- Empirical research
2. Development of Table of Specifications (TOS)
- Determining the areas or concepts that will represent the nature of the variable being measured, and the relative emphasis of each area, is essentially judgmental
- A detailed TOS includes areas / concepts, objectives, and number of items in each area
3. Consultation with Experts (raters)
- After making your own judgments, consult your thesis adviser or someone with expertise in judging the representativeness / relevance of the entries made in your TOS
4. Item Writing
- At this stage, you should know what type of items you are supposed to construct: the type of instrument, format, scaling, and scoring techniques
- Every test item is based on the creative talent of the item writer and on the writer's background in the test content
Construct Validity
- Theoretical domains, factors / components
- Personality
1. Convergent validity – direct correlations between variables (X↑ Y↑)
- The measure correlates well with other tests believed to measure the same construct
2. Divergent (discriminant) validity – demonstrates that a test measures something different from what other available tests measure
- The test should have low correlations with measures of other constructs; this is evidence for what the test does not measure
Criterion-related Validity is estimated by correlating a subject's score on a test with their behavior on an independent real-life criterion. If the criterion is occurring now, you are assessing concurrent validity. If the criterion is to occur in the future, you are assessing predictive validity.
Construct Validity (a.k.a. true validity) is the extent to which there is evidence that a test measures a particular hypothetical construct. For example, are we really measuring intelligence with an IQ test when there are so many competing theories regarding what intelligence actually is?
Coefficient value – estimated value
Variability – margin of error (because we're human beings)
Unsystematic error can result from varied assessment implementation, e.g. scoring via raters.
RELIABILITY – consistency
Observed Test Score = True Score + Measurement Error
X = T + e
In theory, the reliability coefficient (rxx) gives us an index of the influence of true scores and error scores on any given test. It is the ratio of true score variance to the total variance of the test.
In actuality, rxx is very similar to a correlation (r). The two identical subscripts tell us that this r represents a reliability coefficient.
This suggests that the scores you gather on psychological tests are not in fact true or real scores. Rather, those scores represent a combination of many factors.
Constructs (ex.): Depression – suicidal ideation, self-harm
Convergent: X = Optimism ↑, Y = Optimism ↑
Divergent: X = Optimism ↑, Y = Pessimism ↓
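The true score model X = T + e and the variance-ratio definition of reliability can be illustrated with a small sketch. The true scores and errors below are made-up numbers, chosen so that the errors average zero and are uncorrelated with the true scores. In practice true scores are never observed directly, so this is only a demonstration of the definition, not a way to estimate reliability from real data.

```python
from statistics import pvariance

# Hypothetical true scores (T) and measurement errors (e) for four test takers.
# The errors average zero and are uncorrelated with the true scores,
# as the true score model assumes.
true_scores = [10, 20, 30, 40]
errors = [1, -1, -1, 1]

# Observed score: X = T + e
observed = [t + e for t, e in zip(true_scores, errors)]

# Reliability: true score variance divided by total (observed) variance.
reliability = pvariance(true_scores) / pvariance(observed)

print(observed)               # [11, 19, 29, 41]
print(round(reliability, 3))  # 0.992
```

With larger errors the observed variance grows while the true variance stays the same, so the ratio (and the reliability) drops.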
Models / Types of Reliability (the type depends on what test you are going to measure)
1. Test-Retest Reliability – Pearson's r
- Gives the same test to the same group of test takers on 2 different occasions
- Scores on the 1st administration are compared to scores on the 2nd administration using r
- Interval: 15 days or a month
- Too early – familiarity (carryover effect); too long – maturation
- Researchers often consider this to be a better measure of temporal stability (consistency of test scores..)
- Assumption: people don't change between the 2 administrations
- PROBLEM: practice or carryover effects (beneficial to the test takers)
2. Alternate Forms Reliability – r
- To eliminate the practice effects and other problems with the test-retest method (i.e. reactivity), test developers often give 2 highly similar forms of the test to the same people at different times.
- Reliability, in this case, is again assessed at different times.
- The alternate form must be equivalent in terms of content, response, and statistical characteristics.
- PROBLEM: difficulty of developing another (equivalent; same difficulty) form of the test
3. Split-half Reliability – Spearman-Brown prophecy formula
- Measures the internal consistency of the test: the two halves of the test are correlated
- Eliminates / reduces the following problems:
  1. The need for 2 administrations of a test
  2. The difficulty of developing another form
  3. Carryover or reactivity effect
- Related internal consistency coefficients, in which every item is compared to one another:
  1. KR-20 (Kuder & Richardson, 1937, 1939) – for tests in which questions can be scored either 0 or 1 (binary; dichotomous)
  2. Coefficient alpha (Cronbach, 1951) – rating scales that have 2 or more possible answers
- Problem: whether the test being split is homogeneous (i.e. measuring one characteristic) or heterogeneous (i.e. measuring many characteristics)
4. Scorer Reliability (inter-rater reliability) – judgments or ratings made by different scorers are often compared using correlation to see how much they agree.
If tests are being used to make important final decisions about people, then the reliability of the test should be high (0.95).
Lower reliability levels may be acceptable when:
- Making preliminary decisions
- Sorting people into groups
- Conducting research, etc.
Standard Error of Measurement (SEM; the standard deviation of error scores)
- Index of measurement inconsistency, or the amount of expected error in an individual score (i.e. how much the score is likely to differ from the true score)
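The notes do not give a formula for the SEM. A minimal sketch, assuming the standard classical test theory formula SEM = SD × √(1 − rxx), is shown below. The function names, the example values (SD = 15, rxx = 0.91), and the 1.96 multiplier for an approximate 95% band are illustrative assumptions, not taken from the notes.

```python
import math


def standard_error_of_measurement(sd, rxx):
    """SEM = SD * sqrt(1 - rxx), the usual classical test theory formula."""
    if not 0.0 <= rxx <= 1.0:
        raise ValueError("rxx must be between 0 and 1")
    return sd * math.sqrt(1.0 - rxx)


def score_band(observed, sem, z=1.96):
    """Approximate band around an observed score (z = 1.96 for about 95%)."""
    return (observed - z * sem, observed + z * sem)


# Hypothetical test: standard deviation of 15, reliability of 0.91.
sem = standard_error_of_measurement(sd=15, rxx=0.91)
low, high = score_band(observed=100, sem=sem)
print(round(sem, 2), round(low, 2), round(high, 2))
```

As reliability approaches 1 the SEM shrinks toward 0, which matches the idea that less error means more consistent scores.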
Factors that can affect reliability
1. Errors that can increase or decrease an individual score:
- the test itself
- the test administrator
- the test scoring
- the test taker
2. Test length – as a rule, adding more homogeneous items will increase the reliability of the test.
3. Method used to estimate reliability – split-half reliability methods yield higher reliability estimates than test-retest or alternate forms methods
Psychometric properties:
- reliability (consistency)
- validity (measures what it intends to measure)
- norming
- standardization
The goal is to increase the probability of getting the true score and to minimize the standard error of measurement.
The observed score (actual score) is composed of the true score (a reflection of what you really know) and the error score (the difference between the observed score and the true score).
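Spearman-Brown Formula
$$r_{xx} = \frac{k\,r}{1 + (k - 1)\,r}$$
where rxx – reliability coefficient (estimated reliability of the full or lengthened test)
r – correlation coefficient (e.g. the correlation between the two halves)
k – factor by which the test length is changed (k = 2 for a split-half correlation)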
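The Spearman-Brown formula can be sketched as a short function. Here `k` is taken to be the factor by which the test is lengthened (2 when stepping a half-test correlation up to the full test), which is the usual reading of the formula. The function name and the example correlation of 0.60 are illustrative.

```python
def spearman_brown(r, k=2.0):
    """Estimated reliability after changing test length by a factor of k.

    r is the starting correlation (e.g. between two half-tests);
    k = 2 gives the usual split-half correction.
    """
    return (k * r) / (1.0 + (k - 1.0) * r)


# Hypothetical correlation of 0.60 between the two halves of a test.
full_test_reliability = spearman_brown(0.60)
print(round(full_test_reliability, 2))  # 0.75
```

With k = 1 the formula returns r unchanged, and larger k values illustrate the rule that adding more homogeneous items raises reliability.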
Standard Deviation:
high – heterogeneous (more spread)
low – homogeneous (less spread)
Observed Score = true score + error score
trait error – sources of error that reside within the individual taking the test (excuses: hungry, headache, unprepared, etc.)
method error – sources of error that reside in the testing situation (lousy instructions, too warm/cold room, missing pages, etc.)
$$\text{Reliability} = \frac{\text{True Score}}{\text{True Score} + \text{Error Score}}$$
$$\text{Interrater reliability} = \frac{\text{Number of agreements}}{\text{Number of possible agreements}}$$
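A minimal sketch of interrater reliability as simple percent agreement follows, assuming the usual definition: the number of agreements divided by the number of possible agreements (the total number of cases rated by both raters). The function name and the ratings are made up for illustration.

```python
def percent_agreement(rater_a, rater_b):
    """Number of agreements divided by the number of possible agreements."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same, non-empty set of cases")
    agreements = sum(a == b for a, b in zip(rater_a, rater_b))
    return agreements / len(rater_a)


# Hypothetical ratings of five observations by two raters.
rater_a = ["yes", "no", "yes", "yes", "no"]
rater_b = ["yes", "no", "no", "yes", "no"]
agreement = percent_agreement(rater_a, rater_b)
print(agreement)  # 0.8
```

Dividing by possible agreements keeps the index between 0 and 1, whereas a ratio of agreements to disagreements would be undefined whenever the raters agree on every case.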
error ↓ reliability ↑
Stability – the same results are obtained over repeated administrations of the instrument.
- test-retest reliability
- parallel, equivalent, or alternate forms
Homogeneity – internal consistency (unidimensional)
- item-total correlations; split-half reliability; Kuder-Richardson coefficient; Cronbach's alpha
Item-total correlations – each item on an instrument is correlated with the total score; an item with a low correlation may be deleted. The highest and lowest correlations are usually reported.
- only important if homogeneity of items is desired
Kuder-Richardson coefficient – when items have a dichotomous response, e.g. yes/no (binary)
Cronbach's alpha – Likert scale or linear graphic response format
- compares the consistency of responses of all items on the scale (may need to be computed for each sample)
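The notes name Cronbach's alpha without giving its formula. The sketch below assumes the standard form, alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores), where k is the number of items. The response matrix is made-up Likert data (rows are respondents, columns are items). When the items are scored 0/1 and population variances are used, the same computation gives KR-20.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Coefficient alpha for a respondents-by-items matrix of scores."""
    k = len(scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item (column) across respondents.
    item_variances = [pvariance([row[i] for row in scores]) for i in range(k)]
    # Variance of each respondent's total score.
    total_variance = pvariance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores have no variance")
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses of four people to three 5-point Likert items.
responses = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [1, 2, 1],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 3))
```

Because alpha depends on the variances in the particular sample, it may need to be recomputed for each sample, as noted above.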
Equivalence – consistency of agreement among observers using the same measure, or among alternate forms of a tool
- parallel or alternate forms (described under stability)
- interrater reliability
TEST CONSTRUCTION (has rudiments, process)
Test Planning
Decision to develop a Standard Test
(1) No test exists for a particular purpose, or (2) the tests existing for a certain purpose are not adequate for one reason or another.
Wechsler's idea for the WAIS originated from the Army Alpha (for literate soldiers) and Army Beta (for illiterate soldiers) tests, which is why it has vocabulary and performance tests.
Wechsler – covers both fluid and crystallized intelligence; Culture Fair Intelligence Test – looks into specific intelligence. The difference between the two lies in how they define intelligence.
Subject Matter Experts – the test developer must seek the help of experts in evaluating the test items and even the identified constructs or components of the test
Writing Items – depends on whether the scale is to assess an attitude, content knowledge, ability, or personality traits; stick to the pattern (ex. don't shift from declarative to interrogative statements)
Guidelines
1. Deal only with one central thought; an item with more than one is called double-barreled.
Poor item: My instructor grades fairly and quickly.
Better item: My instructor grades fairly.
2. Be precise.
Poor item: I received good customer service from Y Company.
Better item: A member of the sales staff at Y Company asked me if he could assist me within a minute of my entering the store.
3. Be brief.
4. Avoid awkward wording or dangling constructs.
Poor item: Being clear is the overall guiding principle in writing items.
Better item: The overall guiding principle in writing items is to be clear.
* Active voice is preferred over passive voice.
5. Avoid irrelevant information.
6. Present items in positive language.
* If using 'not' is unavoidable, italicize or CAPITALIZE it.
7. Avoid double negatives.
8. Avoid terms like all and none.
Poor item: Which of the following never occurs …
9. Avoid indeterminate terms like frequently or sometimes.
10. Have someone else review your items.
Table of Specifications (blueprint)
Cognitive Domain – factual knowledge, ideas, and intellectual abilities
Affective Domain – deals mostly with the values of a learner, including his interests, appreciation, and attitudes
Psychomotor Domain – readiness for a particular action that may be mental, physical, or emotional
Item Analysis
- Way of measuring the quality of questions – seeing how appropriate they were for the respondents and how well they measured their ability / trait
- Way of measuring items over and over again in different tests with prior knowledge of how they are going to perform, creating a population of questions with known properties (e.g. test bank)
- At least 3 or 4 times more
Level of Difficulty – proportion or percent of examinees that answered the item correctly.
To determine the difficulty level, tabulate the number of examinees with the correct answer on the item and then apply the formula.
$$P = \frac{N_u}{N} \times 100$$
where: P = % of students who answered the item correctly
Nu = number of examinees who answered the item correctly
N = total examinees comprising the 2 groups
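The difficulty formula P = Nu / N × 100 can be sketched as a small function. The function name and the example counts (18 correct out of 24 examinees) are made up for illustration.

```python
def difficulty_index(num_correct, num_examinees):
    """P = Nu / N x 100: percent of examinees who answered the item correctly."""
    if num_examinees <= 0:
        raise ValueError("num_examinees must be positive")
    if not 0 <= num_correct <= num_examinees:
        raise ValueError("num_correct must be between 0 and num_examinees")
    return num_correct / num_examinees * 100


# Hypothetical item: 18 of 24 examinees answered correctly.
p = difficulty_index(18, 24)
print(p)  # 75.0
```

A higher P means an easier item, since more examinees answered it correctly.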
Level of Difficulty Using Upper and Lower Groups
1. Score the papers after checking
2. Arrange the papers from highest to lowest score
3. Determine the size of the upper and lower groups by multiplying the total number of examinees by 27%.
4. The top 27% of the examinees is considered the upper group, while the bottom 27% comprises the lower group
5. Get the frequencies of the examinees that answered the item correctly from both groups
6. Determine the difficulty level and the discriminating power
Discriminating Power determines the difference between examinees who have done well and those who did poorly on a particular item. To determine the discriminating level, perform the steps for the difficulty level, then determine the difference between the 2 groups and divide the difference by half of the total examinees… (? not finished)
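The upper and lower group steps above can be sketched as follows. The notes on discriminating power are incomplete, so this sketch assumes the common reading: the difference between the upper and lower group counts is divided by half of the examinees in the two groups (that is, by the size of one group). The function name, the rounding of the 27% group size, and the data are illustrative assumptions.

```python
def upper_lower_analysis(records, fraction=0.27):
    """Item difficulty (%) and discrimination from upper and lower groups.

    records: list of (total_score, item_correct) pairs, one per examinee.
    """
    # Step 2: arrange papers from highest to lowest total score.
    ranked = sorted(records, key=lambda rec: rec[0], reverse=True)
    # Steps 3-4: top and bottom 27% (rounded, at least one examinee each).
    group_size = max(1, round(len(ranked) * fraction))
    upper = ranked[:group_size]
    lower = ranked[-group_size:]
    # Step 5: frequency of correct answers in each group.
    upper_correct = sum(1 for _, correct in upper if correct)
    lower_correct = sum(1 for _, correct in lower if correct)
    # Step 6: difficulty over both groups; discrimination over half of them.
    n_both = 2 * group_size
    difficulty = (upper_correct + lower_correct) / n_both * 100
    discrimination = (upper_correct - lower_correct) / (n_both / 2)
    return difficulty, discrimination


# Hypothetical data: ten examinees, total score and whether the item was correct.
records = [
    (10, True), (9, True), (8, True), (7, True), (6, False),
    (5, True), (4, False), (3, False), (2, True), (1, False),
]
difficulty, discrimination = upper_lower_analysis(records)
print(round(difficulty, 1), round(discrimination, 2))  # 66.7 0.67
```

A positive discrimination value means the item was answered correctly more often by high scorers than by low scorers.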
Discriminability
Item/Total Correlation – every item is correlated with the total score; the point-biserial method is best used
Point-Biserial Method – for dichotomously scored items / items with a correct answer: one dichotomous variable (correct/incorrect) correlated with one continuous variable (total score) is a point-biserial correlation
– correlate the proportion of people getting each item correct with the total test score
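A sketch of the point-biserial correlation between one dichotomous item (1 = correct, 0 = incorrect) and the total score is given below. It assumes the standard formula r_pb = (M1 − M0) / s × √(p q), where M1 and M0 are the mean total scores of those who passed and failed the item, s is the population standard deviation of total scores, p is the proportion passing, and q = 1 − p. The function name and data are made up.

```python
import math


def point_biserial(item, total):
    """Point-biserial correlation between a 0/1 item and total scores."""
    n = len(item)
    passed = [t for i, t in zip(item, total) if i == 1]
    failed = [t for i, t in zip(item, total) if i == 0]
    if not passed or not failed:
        raise ValueError("the item needs both correct and incorrect responses")
    p = len(passed) / n
    q = 1 - p
    mean_all = sum(total) / n
    # Population standard deviation of the total scores.
    sd = math.sqrt(sum((t - mean_all) ** 2 for t in total) / n)
    if sd == 0:
        raise ValueError("total scores have no variance")
    mean_pass = sum(passed) / len(passed)
    mean_fail = sum(failed) / len(failed)
    return (mean_pass - mean_fail) / sd * math.sqrt(p * q)


# Hypothetical data: five examinees, item result and total test score.
item = [1, 1, 1, 0, 0]
total = [9, 8, 7, 4, 2]
r_pb = point_biserial(item, total)
print(round(r_pb, 3))  # 0.939
```

This value is the same as the Pearson r computed between the 0/1 item scores and the totals, which is why the item/total correlation for dichotomous items is called point-biserial.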
CTT vs. LTM
CTT
- gauges the performance itself but not the trait it derives from
- has the test as its basis
- statistics are often generalized to similar students taking a similar test
LTM
- aims to look beyond performance at the underlying traits which produce the test performance
- measured at item level and provides sample-free measurement
CTT – "true score model" (X = T + e)
– easiest and most widely used form of analysis
– performed on the test as a whole rather than on the item; although item statistics can be generated, they apply only to that group of students on that collection of items
– only applies to those students taking that test
– a set of psychometric procedures used to test items and scales: reliability, difficulty, discrimination, etc.
– assumes that every person has a true score on an item or a scale
Item Analysis
- Classical Test Theory (CTT)
- Latent Trait Models: Item Response Theory (IRT), Rasch Models (1P, 2P, 3P, 4P; similar)
Table of % in level of difficulty
Percent correct    Level of difficulty              Interpretation
91% and above      Very easy                        Unacceptable
79% - 90%          Easy                             Acceptable
26% - 78%          Optimum difficulty / moderate    Highly acceptable
11% - 25%          Difficult                        Acceptable
10% and below      Very difficult                   Unacceptable
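The table of % in level of difficulty can be turned into a lookup, sketched below. The bands come from the table; how to treat values that fall between the whole-number bands (for example 90.5%) is not stated in the notes, so the lower-bound thresholds used here are an assumption, as is the function name.

```python
def classify_difficulty(percent_correct):
    """Return (level, interpretation) for an item's percent-correct value."""
    if not 0 <= percent_correct <= 100:
        raise ValueError("percent_correct must be between 0 and 100")
    # Each band is checked by its lower bound, from easiest to hardest.
    if percent_correct >= 91:
        return ("Very easy", "Unacceptable")
    if percent_correct >= 79:
        return ("Easy", "Acceptable")
    if percent_correct >= 26:
        return ("Optimum difficulty / moderate", "Highly acceptable")
    if percent_correct >= 11:
        return ("Difficult", "Acceptable")
    return ("Very difficult", "Unacceptable")


print(classify_difficulty(75.0))  # ('Optimum difficulty / moderate', 'Highly acceptable')
print(classify_difficulty(8))     # ('Very difficult', 'Unacceptable')
```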
Latent Trait Models (LTM)
– developed in the 1940s but widely used only from the 1960s
– practically unfeasible to use without specialized software
Item Response Theory (IRT) – family of latent trait models used to establish psychometric properties of items and scales
– sometimes referred to as modern psychometrics because … has largely replaced CTT
– can predict if one has guessed an item
3 Basic Components (ex. individual differences on a construct)
1. Item Response Function (IRF) – mathematical function that relates the latent trait to the probability of endorsing an item.
2. Item Information Function – an indication of item quality; an item's ability to differentiate among respondents.
3. Invariance – item characteristics
Item Response Theory (IRT) – the relationship between examinee trait level, item properties, and the probability of endorsing the item.
– IRFs can be converted into Item Characteristic Curves (ICC), which are graphical functions that plot the probability of endorsing the item against the respondent's ability.
Item Parameters: Location (b) – an item's location is defined as the amount of the latent trait needed to have a 0.5 probability of endorsing the item.
Item Parameters: Discrimination (a) – indicates the steepness of the IRF at the item's location; how strongly related the item is to the latent trait, like loadings in a factor analysis
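The roles of the location (b) and discrimination (a) parameters can be illustrated with a two-parameter logistic item response function. The notes do not say which functional form is meant, so the logistic form P(θ) = 1 / (1 + e^(−a(θ − b))) used here is an assumption, as are the function name and the parameter values.

```python
import math


def irf_2pl(theta, a, b):
    """Probability of endorsing an item under a two-parameter logistic model.

    theta: respondent's latent trait level
    a: discrimination (steepness of the curve at the item's location)
    b: location (trait level at which the probability is 0.5)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Points on the item characteristic curve for a hypothetical item (a = 1.5, b = 0.5).
for theta in (-2.0, -1.0, 0.0, 0.5, 1.0, 2.0):
    print(theta, round(irf_2pl(theta, a=1.5, b=0.5), 3))
```

At θ = b the probability is exactly 0.5 regardless of a, and a larger a makes the curve rise more steeply around that point.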