Psychological Assessment


Conceptual Paradigm for Measurement and Evaluation

* has absolute zero: weight – a value of 0 means there is no weight at all
* has no absolute zero: temperature (in °C or °F) – a reading of 0 does not mean there is no temperature

normal distribution of scores – the mean, median, and mode (measures of central tendency) are all the same

non-normal distribution of scores – skewed (the mean, median, and mode differ)

Psychological Tests

Objective Tests
- Standardized: test administration, scoring, interpreting test scores
- Limited number of responses – multiple choice; true or false
- Group tests
- Norms – norm-referenced test (NRT), criterion-referenced test (CRT)

Projective Tests (WIDU)
- Wishes
- Intrapsychic conflict – conflict between desires and morals
- Desires
- Unconscious motives
- Subjectivity on test – interpretation / clinical judgment
- Self-administered / individual tests
- Unlimited number of responses

Norms – where we base the scores of the test takers; transform scores into a meaningful scale
> NRT – ex. age norms
> CRT – ex. how would we know if a basketball player is skillful? -> sharpshooter; there’s a certain criterion to be met

Battery of Tests – sets of tests

Medium of Psychological Tests
- Paper and pencil
- Objects: wooden blocks, puzzles
- Machine: ex. galvanic skin response, EEG, CT scan
- Computer

Evaluation (RAP)
- Recommendation
- Action Plan
- Program Development

Various Techniques (DITO)
- Documents
- Interview
- Test
- Observation

[Diagram labels: Measurement – scales (IRON); Test – single measure; Battery of Tests – series of tests; Assessment; Evaluation]

Scales (IRON)
- Interval
- Ratio
- Ordinal
- Nominal

Samples of Behavior:
- Mental Abilities
- Personality
- Psychopathology

Personality
- Traits
- States
- Types – MBTI
- Aptitude
- Interest
- Values

Mental Abilities
- General Intelligence (g) – IQ
- Specific Intelligence (s) – Non-verbal IQ
- Multiple Intelligence

Psychopathology
- Diagnosis
  - classification
  - severity
- Prognosis
  - predicting the development of the disorder

Measurement (IRON)

Parametric: normal distribution of scores (Pearson’s r)
Non-parametric: non-normal distribution of scores (Spearman; chi-square for nominal data)

Interval: Temperature, Time, (IQ) – has no absolute zero
Ratio: Weight, Height – has absolute zero
Ordinal: Rank, Positions, Likert Scale, Birth Order
Nominal: Sex, Civil Status – classifying
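To make the parametric / non-parametric distinction concrete, here is a minimal sketch that computes Pearson’s r on raw scores and Spearman’s rho on ranks. The function names and the sample data are illustrative assumptions, not part of the notes; ties receive the average rank, which is the usual convention.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's r: covariance of x and y over the product of their spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson's r applied to the ranks (ordinal information only)."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical data: y rises with x, but not in a straight line.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(round(pearson_r(x, y), 4))   # below 1: the relationship is not linear
print(spearman_rho(x, y))          # the rank order agrees perfectly
```

Spearman only uses rank order, which is why it suits ordinal data or skewed score distributions, while Pearson assumes interval or ratio data.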


Psychological Tests

Ability Tests

Intelligence Tests
- Verbal Intelligence
- Non-verbal Intelligence
Ex. Wechsler Adult Intelligence Scale, Stanford-Binet Intelligence Scale, Culture Fair Intelligence Test

Achievement Tests (what has been learned?)
- measure the extent of one’s knowledge
- various academic subjects
Ex. Stanford Achievement Test in reading

Aptitude Tests (predicting)
- various skills / competencies
Ex. Differential Aptitude Test

Personality Tests
- Traits / Domains or Factors
- Objective
Ex. Myers-Briggs Type Indicator
* Usually, no right or wrong answers

Results are integrated into a single score interpretation

Assessment Techniques (DITO)

Documents
- records, protocols, collateral reports

Interviews
- interview responses, screening

Tests
- initial assessment > verification
- forms: written, verbal, visual

Observation
- behavioral observation
- observation checklist

Evaluation – Recommendation, Action Plan, Program Development
- summarizing results of assessment

Test to SSCCRREEN
- Screen applicants
- Self-understanding
- Classify people
- Counsel individuals
- Retain, dismiss, or promote employees
- Research for programs, test construction
- Evaluate performance for decision-making
- Examine and gauge abilities
- Need for diagnosis and intervention

VALIDITY – measures what it purports to measure


Content Validity

- degree to which the test items represent the essence, the topics, and the areas that the test is designed to measure (appropriate domain)

- primary concern of test developers, because it is the content of the items that really reflects the whatness of the property intended to be measured

- Ex. achievement, aptitude, personality tests

- table of specifications (TOS; blueprint) (under analysis)

TOS – generate items → checked / validated by (at least 3) experts, a.k.a. “raters” → domains

Procedures on how to achieve a high degree of content validity

1. Pre-survey or Review of Related Literature

- Focus on the theoretical constructs related to the test you are planning to make: tests used, purpose of the said tests, areas covered, format, scaling techniques, etc.

- This may start the development phase of the instrument you are to construct.

- Item analysis – focuses on the items themselves
  o Ability, aptitude tests (tests that have right and wrong answers)

- Factor analysis – focuses on the domains (whether a factor really is a factor)
  o Personality tests
  o Uses Cronbach’s alpha

- Empirical research

2. Development of Table of Specification (TOS)

- Determining the areas or concepts that will represent the nature of the variable being measured, and the relative emphasis of each area, is essentially judgmental.

- A detailed TOS includes areas / concepts, objectives, and the number of items in each area.

3. Consultation with Experts (raters)

- After making your own judgments, you need to consult your thesis adviser or someone who has the expertise in making judgment about the representativeness / relevance of the entries made in your TOS

4. Item Writing

- At this stage, you should know what type of items you are supposed to construct: the type of instrument, format, scaling, and scoring techniques

- Every test item is based on the creative talent of the item writer and on the background on the test content

Construct Validity

- Theoretical domains, factors / components
- Personality

1. Convergent Validity – direct correlation between variables (X↑ Y↑)

- A measure that correlates well with other tests believed to measure the same construct

2. Divergent (Discriminant) Validity – demonstrates that a test measures something different from what other available tests measure

- The test should have low correlations with measures of other constructs, which is evidence for what the test does not measure

Criterion-related Validity is estimated by correlating a subject’s score on a test with an analysis of their behavior on an independent real-life criterion. If the criterion you need to assess and correlate is occurring now, you are assessing concurrent validity. If the criterion is to occur in the future, you are assessing predictive validity.

Construct Validity (a.k.a true validity) is the extent to which there is evidence that a test measures a particular hypothetical construct. For example, are we really measuring intelligence with an IQ test where there are so many competing theories regarding what intelligence actually is?

Coefficient value – estimate value

Variability – margin of errors (because we’re human beings)

Unsystematic error can result from varied assessment implementation, e.g. scoring by raters.

RELIABILITY – consistency

Observed Test Score = True Score + Measurement Error

X = T + e

In theory, the reliability coefficient (rxx) gives us an index of the influence of true scores and error scores on any given test. It is the ratio of true score variance to the total variance of the test.

In practice, rxx is very similar to a correlation (r). The 2 identical subscripts tell us that this r represents an rxx, i.e. the test correlated with itself.
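As a small illustration of reliability as a variance ratio, the sketch below builds observed scores from hypothetical true scores and errors (X = T + e) and divides the true score variance by the observed score variance. The numbers are made up, and the errors were chosen to be uncorrelated with the true scores so that the variances add up exactly. In real testing, true scores are never observed directly, so this only illustrates the definition.

```python
from statistics import pvariance

# Hypothetical true scores (T) and measurement errors (e) for four examinees.
true_scores = [40, 40, 60, 60]
errors = [-2, 2, -2, 2]

# Observed score model: X = T + e
observed = [t + e for t, e in zip(true_scores, errors)]


def reliability(true_scores, observed_scores):
    """rxx = true score variance / total (observed) score variance."""
    return pvariance(true_scores) / pvariance(observed_scores)


rxx = reliability(true_scores, observed)
print(observed)        # [38, 42, 58, 62]
print(round(rxx, 4))   # 100 / 104, i.e. about 0.9615
```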

[Figure: related constructs – Depression, Suicidal Ideation, Self-harm]

Constructs X and Y:
- Convergent: X = optimism ↑, Y = optimism ↑
- Divergent: X = optimism ↑, Y = pessimism ↓

The model X = T + e suggests that the scores you gather on psychological tests are not in fact true or real scores. Rather, those scores represent a combination of many factors.


Models / Types of Reliability (the type depends on what test you are going to measure)

1. Test-Retest Reliability – Pearson’s r

- Gives the same test to the same group of test takers on 2 different occasions
- Scores on the 1st administration are compared to scores on the 2nd administration using r
- Interval: 15 days or a month
- Too early – familiarity (carryover effect); too long – maturation
- Researchers often consider this to be a better measure of temporal stability (consistency of test scores over time)
- Assumption: people don’t change between the 2 administrations
- PROBLEM: practice or carryover effects (beneficial to the test takers)

2. Alternate-Forms Reliability – r

- To eliminate practice effects and other problems with the test-retest method (i.e. reactivity), test developers often give 2 highly similar forms of the test to the same people at different times.
- Reliability, in this case, is again assessed by correlating the scores from the 2 administrations.
- The alternate form must be equivalent in terms of content, response, and statistical characteristics.
- PROBLEM: difficulty of developing another (equivalent; same difficulty) form of the test

3. Split-half Reliability – Spearman-Brown prophecy formula

- Measures the internal consistency of the test
- The 2 halves of the test are scored separately and correlated
- Eliminates / reduces the following problems:
  1. The need for 2 administrations of a test
  2. The difficulty of developing another form
  3. Carryover or reactivity effects
- Problem: whether the test being split is homogeneous (i.e. measuring one characteristic) or heterogeneous (i.e. measuring many characteristics)
- Split-half reliability is closely related to internal consistency

Internal consistency coefficients (every item is compared to one another):
1. KR-20 (Kuder & Richardson, 1937, 1939) – for tests whose questions can be scored either 0 or 1 (binary; dichotomous)
2. Coefficient alpha (Cronbach, 1951) – rating scales that have 2 or more possible answers

4. Scorer Reliability (inter-rater reliability) – judgments or ratings made by different scorers are often compared using correlation to see how much they agree.

- If tests are being used to make important final decisions about people, then the reliability of the test should be high (0.95)
- Lower reliability levels may be acceptable when:
  - making preliminary decisions,
  - sorting people into groups,
  - conducting research, etc.

Standard Error of Measurement (SEM; the standard deviation of error scores)

- Index of measurement inconsistency, or the amount of expected error in an individual score (i.e. how much the observed score is likely to differ from the true score)
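The notes do not give a formula for the SEM. A commonly used one is SEM = SD × sqrt(1 − rxx), and the sketch below assumes it. The function names and the example values (SD = 15, rxx = 0.91) are illustrative assumptions.

```python
from math import sqrt


def standard_error_of_measurement(sd, rxx):
    """SEM = SD * sqrt(1 - rxx); assumes rxx is a reliability between 0 and 1."""
    if not 0.0 <= rxx <= 1.0:
        raise ValueError("rxx must be between 0 and 1")
    return sd * sqrt(1.0 - rxx)


def score_band(observed, sd, rxx, z=1.0):
    """Band of +/- z SEMs around an observed score."""
    sem = standard_error_of_measurement(sd, rxx)
    return observed - z * sem, observed + z * sem


sem = standard_error_of_measurement(sd=15, rxx=0.91)
low, high = score_band(observed=100, sd=15, rxx=0.91)
print(round(sem, 2))                   # about 4.5 score points
print(round(low, 1), round(high, 1))   # roughly 95.5 to 104.5
```

As reliability rises toward 1, the SEM shrinks toward 0, which matches the idea that less error means a more dependable individual score.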

Factors that can affect reliability

1. Errors that can increase or decrease an individual score:
- the test itself
- the test administrator
- the test scoring
- the test taker

2. Test length – as a rule, adding more homogenous items will increase the reliability of the test.

3. Method used to estimate reliability – split-half reliability methods yield higher reliability estimates than test-retest or alt. forms methods

Psychometric properties:
- reliability (consistency)
- validity (measures what it intends to measure)
- norming
- standardization

- The goal is to increase the probability of getting the true score and to minimize the standard error of measurement.
- A test score involves the observed score (actual score), the true score (reflection of what you really know), and the error score (difference between the true score and the actual score).

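Spearman-Brown Formula

$$r_{xx} = \frac{k\,r}{1 + (k - 1)\,r}$$

where
rxx – reliability coefficient
r – correlation coefficient (e.g. between the 2 halves of the test)
k – factor by which the test is lengthened (k = 2 for a split-half estimate)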
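A minimal sketch of the Spearman-Brown formula, assuming k is the factor by which the test length changes (k = 2 when stepping a half-test correlation up to the full test). The half-test correlation of 0.60 is a made-up value.

```python
def spearman_brown(r, k=2.0):
    """Predicted reliability when test length is multiplied by k: k*r / (1 + (k - 1)*r)."""
    return (k * r) / (1.0 + (k - 1.0) * r)


half_test_r = 0.60                      # hypothetical correlation between the 2 halves
print(spearman_brown(half_test_r))      # full-length estimate, about 0.75
print(spearman_brown(half_test_r, 1))   # k = 1 leaves r unchanged
```

The correction is needed because each half is only half as long as the real test, and shorter tests tend to be less reliable.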

Standard Deviation:
- high – heterogeneous (more spread)
- low – homogeneous (less spread)


Observed Score = true score + error score

Trait error – sources of error that reside within the individual taking the test (excuses: hungry, headache, unprepared, etc.)

Method error – sources of error that reside in the testing situation (lousy instructions, too warm / cold room, missing pages, etc.)

$$\text{Reliability} = \frac{\text{True Score}}{\text{True Score} + \text{Error Score}}$$

$$\text{Interrater reliability} = \frac{\text{Number of agreements}}{\text{Number of agreements} + \text{Number of disagreements}}$$

error ↓ reliability ↑
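The interrater index can be sketched as simple percent agreement, assuming it is computed as agreements over all paired ratings (agreements plus disagreements). The two raters and their codes are hypothetical.

```python
def percent_agreement(rater_a, rater_b):
    """Share of paired ratings on which the two raters gave the same code."""
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must rate the same observations")
    agreements = sum(a == b for a, b in zip(rater_a, rater_b))
    return agreements / len(rater_a)


# Hypothetical codes for 10 observation periods ("on" = on-task, "off" = off-task).
rater_a = ["on", "on", "off", "on", "off", "on", "on", "off", "on", "on"]
rater_b = ["on", "on", "off", "off", "off", "on", "on", "on", "on", "on"]
print(percent_agreement(rater_a, rater_b))   # 8 agreements out of 10 -> 0.8
```

Percent agreement does not correct for agreement that would occur by chance, so it is best read as a rough first check.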

Stability – the same results are obtained over repeated administrations of the instrument.
- test-retest reliability
- parallel, equivalent, or alternate forms

Homogeneity – internal consistency (unidimensional)
- item-total correlations; split-half reliability; Kuder-Richardson coefficient; Cronbach’s alpha

Item-total correlations – each item on an instrument is correlated to total score – an item with low correlation may be deleted. Highest and lowest correlations are usually reported

- only important if homogeneity of items is desired

Kuder-Richardson coefficient – when items have a dichotomous response, e.g. yes / no (binary)

Cronbach’s alpha – Likert scale or linear graphic response format
- compares the consistency of responses of all items on the scale (may need to be computed for each sample)
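The two internal consistency coefficients can be sketched with their usual textbook formulas, which the notes do not spell out: alpha = k/(k−1) × (1 − sum of item variances / variance of total scores), and KR-20 = k/(k−1) × (1 − sum of p×q / variance of total scores). Population variances are assumed throughout, and the response matrix is hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(rows):
    """rows: one list of item scores per respondent (any numeric scoring)."""
    k = len(rows[0])
    item_variances = [pvariance([row[i] for row in rows]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def kr20(rows):
    """rows: one list of 0/1 item scores per respondent (dichotomous items only)."""
    k = len(rows[0])
    n = len(rows)
    sum_pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in rows) / n   # proportion answering item i correctly
        sum_pq += p * (1 - p)
    total_variance = pvariance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum_pq / total_variance)


# Hypothetical right/wrong answers: 5 respondents x 4 items.
answers = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(answers), 3))             # about 0.8
print(round(cronbach_alpha(answers), 3))   # same value for 0/1 items
```

For 0/1 items the variance of an item equals p×q, so KR-20 is the special case of coefficient alpha for dichotomous scoring.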

Equivalence – consistency of agreement among observers using the same measure, or among alternate forms of a tool
- parallel or alternate forms (described under stability)
- interrater reliability

TEST CONSTRUCTION (has rudiments, process)

Test Planning

Decision to develop a Standard Test

(1) No test exists for a particular purpose, or (2) the tests existing for a certain purpose are not adequate for one reason or another.

- Wechsler’s idea for the WAIS originated from the Army Alpha (literate soldiers) and Army Beta (illiterate soldiers); that’s why there are vocabulary and performance tests.
- Wechsler – covers both fluid and crystallized intelligence (note the difference between the two)
- Culture Fair Intelligence Test – looks into specific intelligence in terms of defining intelligence

Subject Matter Experts – the test developer must seek the help of experts in evaluating the test items and even the identified constructs or components of the test

Writing Items – depends on whether the scale is to assess an attitude, content knowledge, ability, or personality traits; stick to the pattern (ex. don’t shift from declarative to interrogative statements)

Guidelines

1. Deal with only one central thought; an item with more than one is called double-barreled.
Poor item: My instructor grades fairly and quickly.
Better item: My instructor grades fairly.

2. Be precise
Poor item: I received good customer service from Y Company.
Better item: A member of the sales staff at Y Company asked me if he could assist me within a minute of entering the store.

3. Be brief

4. Avoid awkward wording or dangling constructs.
Poor item: Being clear is the overall guiding principle in writing items.
Better item: The overall guiding principle in writing items is to be clear.
* Active voice is preferred over passive voice.

5. Avoid irrelevant information

6. Present items in positive language
* If using ‘not’ is inevitable, italicize or CAPITALIZE it.

7. Avoid double negatives

8. Avoid terms like all and none
Poor item: Which of the following never occurs …

9. Avoid indeterminate terms like frequently or sometimes

10. Have someone else review your items

Table of Specifications (blueprint)

- Cognitive Domain – factual knowledge, ideas, and intellectual abilities
- Affective Domain – deals mostly with the values of a learner, including his interests, appreciation, and attitudes
- Psychomotor Domain – readiness for a particular action that may be mental, physical, or emotional

Item Analysis

- Way of measuring the quality of questions – seeing how appropriate they were for the respondents and how well they measured their ability / trait

- Way of measuring items over and over again in different tests with prior knowledge of how they are going to perform, creating a population of questions with known properties (e.g. test bank)

- At least 3 or 4 times more

Level of Difficulty – proportion or percent of examinees that answered the item correctly.

To determine the difficulty level, tally the number of examinees with the correct answer to the item and then apply the formula:

$$P = \frac{N_u}{N} \times 100$$

where:
P = % of students who answered the item correctly
N_u = number of examinees who answered the item correctly
N = total number of examinees in the 2 groups
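The difficulty formula is a one-line computation. The sketch below assumes each response is already scored 1 (correct) or 0 (incorrect); the function name and the responses are illustrative.

```python
def difficulty_percent(item_responses):
    """P = (number correct / number of examinees) x 100 for one item."""
    if not item_responses:
        raise ValueError("need at least one response")
    return 100 * sum(item_responses) / len(item_responses)


# Hypothetical scored responses of 10 examinees to a single item.
responses = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
print(difficulty_percent(responses))   # 7 of 10 correct -> 70.0
```

A higher P means an easier item, because more examinees answered it correctly.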

Level of Difficulty Using Upper and Lower Groups

1. Score the papers after checking
2. Arrange the papers from highest to lowest score
3. Determine the size of the upper and lower groups by multiplying the total number of examinees by 27%
4. The top 27% of the examinees is considered the upper group, while the bottom 27% comprises the lower group
5. Get the frequencies of the examinees that answered the item correctly in each of the 2 groups
6. Determine the difficulty level and the discriminating power

Discriminating Power determines the difference between examinees who did well and those who did poorly on a particular item. To determine the discriminating power, perform the steps for the difficulty level, then take the difference between the 2 groups and divide it by half of the total examinees…. (? not finished)
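The upper / lower group procedure can be sketched as follows. Since the notes break off mid-formula, this assumes the common form D = (U − L) / (N / 2), where U and L are the numbers answering correctly in the upper and lower groups and N is the combined size of the two groups. The function name, rounding rule for the 27% cut, and data are illustrative assumptions.

```python
def upper_lower_analysis(total_scores, item_correct, fraction=0.27):
    """Return (difficulty %, discrimination index) for one item.

    total_scores: total test score per examinee
    item_correct: 1/0 for the same examinees on the item being analyzed
    """
    # Step 2: arrange examinees from highest to lowest total score.
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i], reverse=True)
    # Steps 3-4: top and bottom 27% form the upper and lower groups.
    group_size = max(1, round(len(order) * fraction))
    upper, lower = order[:group_size], order[-group_size:]
    # Step 5: frequency of correct answers in each group.
    u = sum(item_correct[i] for i in upper)
    l = sum(item_correct[i] for i in lower)
    # Step 6: difficulty and discriminating power over the 2 groups combined.
    n = 2 * group_size
    difficulty = 100 * (u + l) / n
    discrimination = (u - l) / (n / 2)
    return difficulty, discrimination


# Hypothetical class of 10 examinees.
totals = [50, 48, 45, 40, 38, 35, 30, 28, 25, 20]
item = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0]
difficulty, discrimination = upper_lower_analysis(totals, item)
print(round(difficulty, 1), round(discrimination, 2))   # about 66.7 and 0.67
```

A positive index means high scorers got the item right more often than low scorers; a value near zero or negative flags an item worth reviewing.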

Discriminability

Item/Total Correlation – every item will be correlated to the total score – point biserial method is best used

Point Biserial Method – for dichotomously scored items / items with a correct answer

– one dichotomous variable (correct / incorrect) correlated with one continuous variable (total score) is a point biserial correlation

– correlate whether each person got the item correct with the total test score
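A minimal sketch of the point biserial correlation, using the textbook form r_pb = (M1 − M0) / SD × sqrt(p × q), where M1 and M0 are the mean total scores of those who got the item right and wrong, SD is the population standard deviation of total scores, and p and q are the proportions right and wrong. The formula is not given in the notes, and the data are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def point_biserial(item, totals):
    """Correlation between a 0/1 item and the continuous total score."""
    right = [t for i, t in zip(item, totals) if i == 1]
    wrong = [t for i, t in zip(item, totals) if i == 0]
    if not right or not wrong:
        raise ValueError("item must have both correct and incorrect responses")
    p = len(right) / len(item)   # proportion answering correctly
    q = 1 - p
    return (mean(right) - mean(wrong)) / pstdev(totals) * sqrt(p * q)


# Hypothetical data: item result and total score for 5 examinees.
item = [1, 1, 1, 0, 0]
totals = [9, 8, 7, 4, 2]
print(round(point_biserial(item, totals), 3))   # about 0.939
```

This gives the same number as Pearson’s r computed on the 0/1 item scores and the totals, which is why it serves as an item-total correlation.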

CTT vs. LTM

CTT
- gauges the performance itself, but not the trait that drives it
- has the test as its basis
- statistics are often generalized to similar students taking a similar test

LTM
- aims to look beyond performance at the underlying traits that produce the test performance
- measured at item level and provides sample-free measurement

CTT – “true score model” (X = T + e)
– easiest and most widely used form of analysis
– performed on the test as a whole rather than on the item; although item statistics can be generated, they apply only to that group of students on that collection of items
– a set of psychometric procedures used to test items and scales: reliability, difficulty, discrimination, etc.
– assumes that every person has a true score on an item or a scale of

[Diagram: Item Analysis → Classical Test Theory (CTT) and Latent Trait Models; Latent Trait Models → Item Response Theory (IRT: 1P, 2P, 3P, 4P) and Rasch Models (similar)]

Table of % in level of difficulty

% answering correctly | Level of difficulty | Interpretation
91% and above | Very easy | Unacceptable
79% - 90% | Easy | Acceptable
26% - 78% | Optimum difficulty / moderate | Highly Acceptable
11% - 25% | Difficult | Acceptable
10% and below | Very difficult | Unacceptable
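The difficulty table can be applied mechanically. The sketch below maps a percent-correct value to its row; the function name is illustrative, and how fractional values between rows (e.g. 90.5%) are handled is an assumption, since the table only lists whole percentages.

```python
def classify_difficulty(percent_correct):
    """Return (level of difficulty, interpretation) for a percent-correct value.

    Assumption: a value is assigned to the highest row whose lower bound it reaches.
    """
    if not 0 <= percent_correct <= 100:
        raise ValueError("percent_correct must be between 0 and 100")
    if percent_correct >= 91:
        return "Very easy", "Unacceptable"
    if percent_correct >= 79:
        return "Easy", "Acceptable"
    if percent_correct >= 26:
        return "Optimum difficulty / moderate", "Highly Acceptable"
    if percent_correct >= 11:
        return "Difficult", "Acceptable"
    return "Very difficult", "Unacceptable"


for p in (95, 80, 50, 20, 5):
    print(p, classify_difficulty(p))
```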

- CTT: only applies to those students taking that test

Latent Trait Models (LTM)
– developed in the 1940s but widely used only from the 1960s
– practically unfeasible to use without specialized software

Item Response Theory (IRT)
– family of latent trait models used to establish psychometric properties of items and scales
– sometimes referred to as modern psychometrics because … has completely replaced CTT
– can predict if one has guessed an item
– describes the relationship between examinee trait level, item properties, and the probability of endorsing the item
– can be converted into Item Characteristic Curves (ICC), which are graphical functions that plot the probability of endorsing an item against the respondent’s ability

3 Basic Components (ex. individual differences on a construct)

1. Item Response Function (IRF) – mathematical function that relates the latent trait to the probability of endorsing an item
2. Item Information Function – an indication of item quality; an item’s ability to differentiate among respondents (marks a good item)
3. Invariance – item characteristics

Item Parameters
- Location (b) – the amount of the latent trait needed to have a 0.5 probability of endorsing the item
- Discrimination (a) – indicates the steepness of the IRF at the item’s location; how strongly related the item is to the latent trait, like loadings in a factor analysis
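To show how the location (b) and discrimination (a) parameters shape an item response function, here is a minimal sketch that assumes the common two-parameter logistic (2PL) form, P(θ) = 1 / (1 + exp(−a(θ − b))). The notes do not name a specific model, so the logistic form, the function name, and the parameter values are assumptions.

```python
from math import exp


def irf_2pl(theta, a, b):
    """Probability of endorsing an item under an assumed 2PL model.

    theta: respondent's latent trait level
    a: discrimination (steepness of the curve at the item's location)
    b: location (trait level where the probability is 0.5)
    """
    return 1.0 / (1.0 + exp(-a * (theta - b)))


# At theta == b the probability is 0.5, whatever the discrimination.
print(irf_2pl(theta=0.5, a=1.2, b=0.5))

# A more discriminating item separates respondents around b more sharply.
for a in (0.5, 2.0):
    below = irf_2pl(theta=-0.5, a=a, b=0.0)
    above = irf_2pl(theta=0.5, a=a, b=0.0)
    print(a, round(below, 3), round(above, 3))
```

Evaluating the function over a range of theta values and plotting the result gives the item characteristic curve.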
