Steps in Scale Construction: Techniques and Guidelines


Scale Construction

Psychological and Psychometric Testing

Session 6

Prof. Swati Dhir

Guidelines in Scale Development

• Step 1: Determine clearly what it is you want to measure
• Step 2: Generate an item pool
• Step 3: Determine the format for measurement
• Step 4: Have the initial item pool reviewed by experts
• Step 5: Consider inclusion of validation items
• Step 6: Administer items to a development sample
• Step 7: Evaluate the items
• Step 8: Optimize scale length

Purpose


To design a questionnaire that provides a quantitative measurement of an abstract theoretical variable

Not all surveys are scales; decide whether it really is a scale

Good scales possess both validity and reliability

Constructs and Measurement

Should the scale be based in theory, or should you strike out in a new intellectual direction?

Figuring out how to measure what you want to measure

Should some aspects of the phenomenon be emphasized more than others?

Construct development: A construct is a hypothetical variable composed of different elements that are thought to be related (e.g., 5 questions tapping job satisfaction)

Theory as an aid to clarity: Boundaries of the phenomenon must be recognized so that the content of the scale does not drift into unintended domains

Specificity as an aid to clarity: Locus of control (LOC) is a widely used concept that concerns who or what influences important outcomes in one's life

Multidimensional LOC: oneself, powerful others, and chance or fate. The appropriate choice depends largely on what level of locus relates to the questions being asked

What to include in a measure: Items that cross over into a related construct can be problematic

Creating Items

• Writing good items for a scale is definitely an art rather than a science
• Think creatively about the construct you seek to measure
• Make the questions simple, specific and straightforward
• Avoid biased language (emotional words, emphasized text)
• Avoid double-barreled questions
  – Do you think that the technical service department is prompt and helpful?
• Avoid nonmonotonic questions
  – Only people in the military should be allowed to personally own assault rifles.


Creating Items

Redundancy: Reliability = f(no. of items)

• I will do almost anything to ensure my child’s success
• No sacrifice is too great if it helps my child achieve success

No. of items: 2:1

Avoid exceptionally lengthy items; watch the reading difficulty level

Reverse code a number of your items

• Reversed score = highest value + lowest value - selected response

Use a common structure; items should be self-contained, with no dependency between items
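The reverse-coding rule above (highest value + lowest value minus the selected response) can be sketched in a few lines. The function name and the default 1 to 5 range are illustrative assumptions, not part of the slides.

```python
def reverse_code(response, lowest=1, highest=5):
    """Reverse-score one response on a scale running from lowest to highest.

    The 1-5 default range is an assumption for illustration; pass the
    actual endpoints of your scale.
    """
    if not lowest <= response <= highest:
        raise ValueError("response is outside the scale range")
    return highest + lowest - response


# On a 1-5 agreement scale, 5 becomes 1 and the midpoint 3 stays 3.
reversed_scores = [reverse_code(r) for r in [5, 4, 3, 2, 1]]
print(reversed_scores)  # [1, 2, 3, 4, 5]
```

The same function works for a 1 to 7 scale by passing `highest=7`.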

Three Components of Attitudes

Cognitive component

• How a person thinks about an attitude object (product, issue, candidate, idea)

Affective component

• How a person feels about an attitude object

Behavioral component

• A person’s behavioral predisposition to respond to an attitude object in a certain way

On the Importance of Attitudes

I believe both candidates

bring strengths to the table

Measurement

• The term questionnaire item is used to denote a single question on a survey, corresponding to a single column in a dataset.

• Scales typically denote sets of questions which become mathematical combinations of survey items.


Measurement/Scaling Properties

• Assignment
  – You can assign objects to categories
• Order (Magnitude)
  – You can order objects in terms of having more or less of some quality
• Distance (Equal Intervals)
  – The distance between adjacent points on the scale is identical
• Origin (Absolute Zero Point)
  – Zero “means something” (absence of a given quality)

Types of Scales

• Nominal Scale

• Has Assignment Only (What is Your Gender?)

• Ordinal

• Has Assignment, Order (Education)

– What is your income? (5-10k; 11-15k; 16-20k; 21-25k; 26-30k)

• Interval

• Has Assignment, Order, Equal Intervals (Temperature)

• Hybrid Ordinally-Interval Scale

• Like an ordinal scale, but researcher “pretends” it is an interval scale (e.g., assumes 1 to 7 scale is an interval scale); commonly used in questionnaires

• Ratio

• Has Assignment, Order, Equal Intervals, Absolute Zero (Number of Cars, weight)


Formats for Measurement

Thurstone Scaling

• Different intensities of the attribute, spaced to represent equal intervals
• Could be formatted with agree-disagree response options

Guttman Scaling

• Series of items tapping progressively higher levels of an attribute
  – Do you smoke?
  – Do you smoke more than 10 cigarettes in a day?
  – Do you smoke more than a pack?

Semantic Differential

• A list of adjective pairs, either unipolar or bipolar
• e.g., friendly or not friendly (unipolar); friendly or hostile (bipolar)

Likert Scale

• The item is prepared as a declarative sentence, followed by response options indicating varying degree of agreement

• Widely used in measuring opinions, beliefs and attitudes

Issues in Designing Verbal Rating Scales

• Many measures taken by researchers are verbal ratings

• What do we need to consider when we develop verbal rating scales?
  – Number of categories
  – Forced vs. unforced scale
  – Balanced or unbalanced scale
  – Extent of verbal description
  – Should response categories be numbered or not
  – Comparative vs. noncomparative scale
  – Scale direction

Number of Response Categories?

• To what extent are you satisfied with your current Laptop ?

• Most researchers suggest between 5 and 7 categories; for example:

1 = Extremely Dissatisfied
2 = Dissatisfied
3 = Somewhat Dissatisfied
4 = Neither
5 = Somewhat Satisfied
6 = Satisfied
7 = Extremely Satisfied

• Too few does not give you enough information

• Too many and it will be hard for people to discriminate between the options (e.g., a 100-point scale)

Forced vs. Unforced Scale?

• How likely would you be to buy a car manufactured in Brazil?

• Forced scale (even number of options forces the respondent to lean one way or the other):

1 = Very Unlikely
2 = Unlikely
3 = Somewhat Unlikely
4 = Somewhat Likely
5 = Likely
6 = Very Likely

• Unforced scale gives people a neutral option:

1 = Very Unlikely
2 = Unlikely
3 = Somewhat Unlikely
4 = Neither
5 = Somewhat Likely
6 = Likely
7 = Very Likely

Balanced vs. Unbalanced Scale?

• How satisfied are you with your current hair stylist?

• Balanced scale (same number of positive and negative options):

1 = Extremely Dissatisfied
2 = Dissatisfied
3 = Somewhat Dissatisfied
4 = Neither
5 = Somewhat Satisfied
6 = Satisfied
7 = Extremely Satisfied

• Unbalanced scale (here all options are positive):

1 (Somewhat Satisfied) 2 3 4 5 6 7 (Very Satisfied)

• An unbalanced scale can give biased results; unless the distribution is naturally skewed to one side of the scale, use a balanced scale

Extent of Verbal Description?

• India should invest in Infrastructure.

• Label endpoints or label all options?

Endpoints only:

1 (Strongly Disagree) 2 3 4 5 6 7 (Strongly Agree)

All options labeled:

1 = Strongly Disagree
2 = Moderately Disagree
3 = Slightly Disagree
4 = Neither Agree or Disagree
5 = Slightly Agree
6 = Moderately Agree
7 = Strongly Agree


Should Categories be Numbered?

• Toyota is an Environment Friendly Company

1 2 3 4 5 6 7

-3 -2 -1 0 1 2 3

• Numbers can help respondents understand scale

• 1 to 7 scale quite common

• But -3 to +3 can help interpretation of the scale (disagree is negative, agree is positive); however, it may overemphasize negativity

• Judgment call; pretesting both scales could help identify problems

Should we have numbers here?

Comparative vs. Noncomparative?

• Noncomparative question
  – How would you evaluate Pepsodent toothpaste?
• Comparative question
  – Compared to your current brand, how would you evaluate Pepsodent toothpaste?

• Comparative questions establish the referent and can be useful if you need to know how your product compares to a specific competitor or the customer’s current brand

• Noncomparative have the advantage of allowing the respondent to create their own referent, which can potentially improve accuracy

Direction of Scale?

• Typical direction (lower values, negative connotation on left):

1 2 3 4 5 6 7

• Some scales are not valenced, so must be careful about positioning. For a semantic differential scale, with amusing positioning:

Unpleasant  -2 -1 0 1 2  Pleasant
Flimsy      -2 -1 0 1 2  Sturdy
Male        -2 -1 0 1 2  Female

• This arrangement suggests that males are to be evaluated negatively; must be careful in designing scales so as not to bias results

Single-items adequate for measurement?

• Suppose an instructor had single-question exams?

• Suppose the CAT (or GMAT) had only 5 possible scores (similar to A, B, C, D, F grades)?

Composite, or Multiple-Item Scales

Capture the sensitivity to the continuous nature of many subtle differences among consumers

Simultaneously address concerns of accuracy and consistency

All relate to the larger issue of measurement error

Formative and Reflective Items

Formative items: Can be combined to measure the multiple aspects of a construct, though it is not necessary that respondents answer each item similarly

Reflective items: Measure a single trait, and respondents should answer each item similarly

Items within a scale are typically interchangeable for reflective items but not for formative items


Formative Scale Items: Satisfaction

• My last flight on JA departed on-time.
• An airline could always be on-time if they made that their priority.
• JA has competitive fares.
• It upsets me to know others on the same flight have paid a lower price for their seat.
• JA ticketing personnel are polite.
• JA has friendly reservation operators.
• I know it’s not the airline’s fault when a flight is cancelled.
• The two-item restriction on carry-on luggage is insensitive to the needs of today’s passengers.
• JA has ample leg-room for me in coach seating.
• JA did not lose my luggage on my last trip.
• I have not been “bumped” from a JA flight in the last two years.

Aspects covered: Timeliness; Pricing; Staff; Travelling Comfort; Service

Reflective Items: Materialism

I admire people who own expensive homes, cars, and clothes.

Some of the most important achievements in life include acquiring material possessions.

I don’t place much emphasis on the amount of material objects people own as a sign of success.*

The things I own say a lot about how well I’m doing in life.

I don’t pay much attention to the material objects other people own.*

• * Reverse coded

Reviewed By Experts


• Ask a panel of experts to rate how relevant they think each item is to what you intend to measure

• Provide the experts with the working definition of the construct

• Experts can evaluate the items' clarity and conciseness (and rate relevance as high, moderate or low)

• Experts can point out ways of tapping the phenomenon that you have failed to include

Validity and Reliability

• Internal Validity (No confounds)

• External Validity (generalizes to your target population)

• Content-related evidence: Face validity

• Criterion-related evidence: Predictive validity, Concurrent validity

• Construct-related evidence: Convergent validity, Discriminant validity

• Reliability: Test-retest method, Alternate-forms method and Split-halves method

Consider inclusion of validation items

• Social desirability
  – Social desirability scale (Strahan and Gerbasi, 1972)
  – Undesirable response tendencies and response biases can be detected with the MMPI (Minnesota Multiphasic Personality Inventory)

Administer Items to a Development Sample

• Administer the validation items along with the pool of new items to a sample of subjects

• The subject sample should be large enough to eliminate subject variance as a significant concern

• If a single scale is to be extracted from a pool of about 20 items, fewer than 300 subjects may suffice

• Entering the data

– Using Computer software

• www.surveymonkey.com

• http://www.qualtrics.com


Why a large sample?

In a small sample, patterns of covariation among the items may not be stable

The development sample may not represent the population for which the scale is intended

• Level of the attribute present in the sample vs. the intended population

• A sample that is qualitatively rather than quantitatively different from the target population (the relationships among items or constructs may differ from those in the population)

Evaluate the items

• An item should have a high correlation with the true score of the latent variable
  – Inspect the correlation matrix
  – The higher the correlations among items, the higher the individual item reliabilities

• Reverse scoring

• Item-scale correlation: an uncorrected item-total correlation makes good conceptual sense, but in reality the item’s inclusion in the scale total can inflate the correlation coefficient
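To illustrate the inflation noted above, here is a minimal sketch that compares the uncorrected item-total correlation with a corrected one, in which the item is removed from the total before correlating. The response data and function names are hypothetical, made up only for this illustration.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def item_total_correlations(responses, item):
    """Return (uncorrected, corrected) item-total correlations for one item.

    responses: one row per respondent, one column per item.
    """
    item_scores = [row[item] for row in responses]
    totals = [sum(row) for row in responses]
    # Corrected total: the scale total with the item itself left out.
    rest = [t - s for t, s in zip(totals, item_scores)]
    return pearson(item_scores, totals), pearson(item_scores, rest)


# Hypothetical data: 5 respondents answering 3 items on a 1-5 scale.
responses = [
    [4, 5, 4],
    [2, 1, 2],
    [5, 4, 5],
    [3, 3, 2],
    [1, 2, 1],
]
uncorrected, corrected = item_total_correlations(responses, item=0)
print(round(uncorrected, 3), round(corrected, 3))
```

With these made-up numbers the uncorrected value comes out higher than the corrected one, which is the inflation the slide warns about.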

Evaluate the items

• Item variance: a relatively high variance is a valuable attribute for a scale item

• Item means: a mean close to the center of the range of possible scores is also desirable; otherwise the item might fail to detect certain values of the construct

• Coefficient alpha: an indication of the proportion of variance in the scale scores that is attributable to the true score
  – A noncentral mean, poor variability, negative correlations among items, low item-scale correlations and weak inter-item correlations will tend to reduce alpha
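The slides do not give a formula for coefficient alpha. A minimal sketch, assuming the standard Cronbach's alpha formula $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_i s_i^2}{s_T^2}\right)$ (k items, item variances $s_i^2$, total-score variance $s_T^2$) and hypothetical response data:

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for rows of respondents and columns of items."""
    k = len(responses[0])
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    # Sample variance of the summed scale score.
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical data: 5 respondents, 3 items on a 1-5 scale.
responses = [
    [4, 5, 4],
    [2, 1, 2],
    [5, 4, 5],
    [3, 3, 2],
    [1, 2, 1],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 3))
```

Because the three made-up items move together closely, the item variances are small relative to the total-score variance and alpha is high.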

Optimize Scale length

• Effect of scale length on reliability
  – Scale alpha depends on the covariation among the items and the number of items
  – If a scale's reliability is too low, then brevity is of no value

• Effects of dropping bad items: if an item has a sufficiently lower-than-average correlation with the other items, dropping it will raise alpha

• Tinkering with scale length: the item whose omission has the least negative or most positive effect on alpha is the best one to drop first

• Split samples
  – If the development sample is sufficiently large, split it into two subsamples; one can serve as the primary development sample and the other can be used to cross-validate the findings
  – Splitting provides valuable information about scale stability
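The "drop the item whose omission helps alpha most" idea can be sketched as an alpha-if-item-deleted table. This assumes the standard Cronbach's alpha formula and uses hypothetical data in which the fourth item was deliberately made inconsistent with the rest.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for rows of respondents and columns of items."""
    k = len(responses[0])
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def alpha_if_item_deleted(responses):
    """Map each item index to the alpha of the scale without that item."""
    k = len(responses[0])
    result = {}
    for drop in range(k):
        reduced = [[v for i, v in enumerate(row) if i != drop]
                   for row in responses]
        result[drop] = cronbach_alpha(reduced)
    return result


# Hypothetical data: items 0-2 agree with each other, item 3 does not.
responses = [
    [4, 5, 4, 3],
    [2, 1, 2, 4],
    [5, 4, 5, 2],
    [3, 3, 2, 1],
    [1, 2, 1, 3],
]
full_alpha = cronbach_alpha(responses)
deleted = alpha_if_item_deleted(responses)
best_to_drop = max(deleted, key=deleted.get)
print(round(full_alpha, 3), best_to_drop, round(deleted[best_to_drop], 3))
```

Here dropping item 3 raises alpha, while dropping any of the other items lowers it, so item 3 is the first candidate for removal.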

Psychological and Psychometric Testing

Session 8: Item Analysis

Prof. Swati Dhir


In constructing a new test (or shortening or lengthening an existing one), the final set of items is usually identified through a process known as item analysis.

- Linda Crocker

Both the validity and the reliability of any test depend ultimately on the characteristics of its items.

Item Analysis - Outline

1. Types of test items

• Selected response items

• Constructed response items

2. Parts of test items

3. Guidelines for writing test items

4. Item Analysis

• Distracter measures

• Item difficulty measures

• Item discrimination measures

1. Types of test items

Selected response

• Multiple choice

• Likert scale

• Q-sort

Constructed response

• Free response

• Fill-in-the-blank

• Essay tests

• Portfolios

• In-basket technique

A. Selected response

• Multiple choice or forced choice
  – Task is to choose between set answers
  – Advantage: ease of scoring; scoring requires little skill
  – Disadvantage: may test memory rather than comprehension
  – Correct response must be distinct
  – Distracters should not be obvious or ambiguous

• Likert format
  – Test-taker chooses a point on a scale that expresses their attitude or belief
  – Data lend themselves to factor analysis

• Q-sort
  – A large set of cards, each with a statement referring to a “target”
  – Test-taker sorts cards into piles in terms of how accurate the statements are as a description of the target
  – Generally 9 piles


B. Constructed response items

Free response
• Test-taker responds without constraint
• Describes what is important to him/her

Fill-in-the-blank
• Used to test for knowledge or to find out about beliefs and attitudes

Essay tests
• Preferred when you want to assess test-taker’s ability to think analytically, integrate ideas, and express himself/herself

Portfolios
• Not really a test
• Collections of things the person being evaluated has produced

In-basket technique
• Used in business; job candidate gets a set of “everyday” problems and says how he or she would deal with those problems
• Requires expert raters to grade the response

B. Constructed response items

Strengths
• Assess higher-order skills
• More useful feedback to test-taker
• Positive influence on study habits
• Easier to create items

Weaknesses
• Time consuming to use
• Possible subjectivity in scoring

2. Parts of test items

Stimulus or item stem
• What the subject responds to

Response format or method
• Typically multiple choice, Likert or constructed response

Conditions governing the response
• Time limits; allowing probes for ambiguous responses; how response is recorded

Procedures for scoring the response
• Particularly important for constructed response items

3. Writing test items – guidelines

A. Define clearly

B. Generate a pool of potential items

C. Monitor reading level

D. Use unitary items

E. Avoid long items

F. Break any response “set”

4. Item analysis

• Multiple choice distracter analysis
• Item difficulty measure P
• Discrimination index D
• Item-total correlation

A. Multiple choice – distracter measures

• How many people choose each distracter?
• Distracters should be equally attractive
• Correct choice should be based on knowledge
• Where knowledge is lacking, choice should be random


B. Item Difficulty Measure Pi

The item difficulty for item i, $p_i$, is defined as the proportion of examinees who get that item correct.

$$p_i = \frac{\text{number who got item } i \text{ correct}}{\text{number taking the test}}$$

Though the proportion of examinees passing an item traditionally has been called the item difficulty, this proportion logically should be called item easiness, because the proportion increases as the item becomes easier.

Estimation Methods
• Method for dichotomously scored items
• Method for polytomously scored items
• Grouping method

Method for Dichotomously Scored Items

Difficulty Factor

$$P = \frac{R}{N}$$

P is the difficulty of a certain item.
R is the number of examinees who get that item correct.
N is the total number of examinees.

Difficulty Factor

Range 0-1; optimal level is .5

The HIGHER the difficulty factor, the easier the question is, so a value of 1 would mean all the students got the question correct and it may be too easy

If you want the subjects to master the topic area, high difficulty values should be expected

Example 1

There are 80 high school students taking a science achievement test; 61 students pass item 1 and 32 students pass item 10. Please calculate the difficulty for items 1 and 10 separately.

$P_1 = 61/80 \approx 0.76$; $P_{10} = 32/80 = 0.40$

Guided Practice

What is the P for Items 1-3?

Student  Raw score  Item 1  Item 2  Item 3  Item 4  Item 5
A        8          a       b       a       d       e
B        6          c       b       e       c       e
C        6          a       c       e       c       b
D        4          a       b       e       a       c
E        2          c       a       b       d       c
F        8          a       b       c       c       e
G        10         a       b       a       c       e
H        6          a       b       c       d       e
I        8          a       c       a       c       e
J        4          a       c       a       d       b


Difficulty Factor

What does it mean?

• Item # 1 = .8 may be too easy

• Item # 2 = .6 good

• Item # 3 = .4 may be slightly difficult

• Item # 4 = 0.5 Optimum

• Item # 5 = 0.6 Good
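The difficulty values listed above can be reproduced from the guided-practice response table with a short script. The slides do not state the answer key; the key used here (a, b, a, c, e) is an assumption, taken from student G, who has the highest raw score, and it happens to reproduce the five values on this slide.

```python
# Responses to Items 1-5 for students A-J, copied from the guided-practice table.
responses = {
    "A": "abade", "B": "cbece", "C": "acecb", "D": "abeac", "E": "cabdc",
    "F": "abcce", "G": "abace", "H": "abcde", "I": "acace", "J": "acadb",
}

# ASSUMPTION: the key is not given in the slides; this is student G's answer sheet.
answer_key = "abace"


def item_difficulty(responses, answer_key):
    """Proportion of examinees answering each item correctly (P = R / N)."""
    n = len(responses)
    return [
        sum(answers[i] == correct for answers in responses.values()) / n
        for i, correct in enumerate(answer_key)
    ]


difficulty = item_difficulty(responses, answer_key)
print(difficulty)  # [0.8, 0.6, 0.4, 0.5, 0.6]
```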

Method for Polytomously Scored Items

$$P = \frac{\bar{X}}{X_{\max}}$$

$\bar{X}$, the mean of all examinees’ scores on one item
$X_{\max}$, the perfect score of that item

Example: The perfect score of one open-ended item is 20 points, and the average score of all examinees on this item is 11 points. What is the item difficulty?

P = 11/20 = .55

Grouping Method (Use of Extreme Groups) (T. L. Kelley, 1939)

Upper (U) and Lower (L) criterion groups are selected from the extremes of the distribution of test scores or job ratings.

$$P = \frac{P_U + P_L}{2}$$

$P_U$ is the proportion of examinees in the upper group who get the item correct.
$P_L$ is the proportion of examinees in the lower group who get the item correct.

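Example 3

There are 371 examinees taking a language test. It is known that 64 examinees in the 27% upper extreme group pass item 5, and 33 examinees in the 27% lower extreme group pass the same item. Please compute the difficulty of item 5.

Key: each extreme group has 371 × 0.27 ≈ 100 examinees, so $P_U = 0.64$, $P_L = 0.33$ and $P = (0.64 + 0.33)/2 \approx 0.49$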
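A minimal sketch of the extreme-groups calculation for this example. The function name is hypothetical, and rounding the 27% group size to the nearest whole examinee is an assumption; the slides do not say how fractional group sizes are handled.

```python
def extreme_group_difficulty(n_total, upper_correct, lower_correct, fraction=0.27):
    """Item difficulty as the average of upper- and lower-group pass rates.

    ASSUMPTION: both groups have round(n_total * fraction) examinees.
    """
    group_size = round(n_total * fraction)
    p_upper = upper_correct / group_size
    p_lower = lower_correct / group_size
    return group_size, (p_upper + p_lower) / 2


group_size, p = extreme_group_difficulty(371, upper_correct=64, lower_correct=33)
print(group_size, round(p, 3))  # 100 0.485
```

The unrounded value 0.485 corresponds to the key of 0.49 after rounding to two decimals.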

Correct Chance Effects on Item Difficulty for Multiple-Choice Item

$$CP = \frac{KP - 1}{K - 1}$$

CP, corrected item difficulty
P, item difficulty
K, the number of choices for that item

The difficulty of one five-choice item is .50; the difficulty of another four-choice item is .53. Which item is more difficult?

ANSWER

$$CP_1 = \frac{5 \times 0.50 - 1}{5 - 1} \approx 0.38$$

$$CP_2 = \frac{4 \times 0.53 - 1}{4 - 1} \approx 0.37$$

So, the four-choice item is more difficult.
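The chance correction CP = (KP - 1)/(K - 1) for the two items in this example can be checked with a few lines; the function name is an illustrative assumption.

```python
def corrected_difficulty(p, k):
    """Chance-corrected difficulty CP = (K*P - 1) / (K - 1) for a K-choice item."""
    return (k * p - 1) / (k - 1)


cp_five_choice = corrected_difficulty(0.50, 5)
cp_four_choice = corrected_difficulty(0.53, 4)
print(round(cp_five_choice, 3), round(cp_four_choice, 3))  # 0.375 0.373
```

The four-choice item has the lower corrected value, so it is the more difficult of the two even though its uncorrected P is higher.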


C. Item Discrimination Measures

Measures: discrimination index D; item-total correlation

Item discrimination refers to the degree to which an item differentiates correctly among test takers in the behavior that the test is designed to measure

To be able to discriminate between different levels of achievement, the difficulty factor should be between .3 and .7

Item Discrimination

Discrimination Index D (used for dichotomously scored items)

• Extreme groups method
  – U = # getting item correct in ‘top’ group
  – L = # getting item correct in ‘bottom’ group
  – $n_U$ = # in top group
  – $n_L$ = # in bottom group

$$D = \frac{U}{n_U} - \frac{L}{n_L}$$

Values of D may range from -1.00 to 1.00.

Example 1

There are 141 students taking a world history test.

(1) If we use the ratio 27% to determine the upper and lower groups, how many examinees are there in the upper and lower groups separately?

(2) If 18 examinees in the upper group answer item 5 correctly, and 6 examinees in the lower group answer it correctly, calculate the discrimination index for item 5.

Answer: (1) 141 × 0.27 ≈ 38 in each group; (2) D = 18/38 - 6/38 = 12/38 ≈ 0.316

Guidelines for Interpretation of D Value

• D ≥ .40: the item is functioning quite satisfactorily
• .30 ≤ D ≤ .39: little or no revision is required
• .20 ≤ D ≤ .29: the item is marginal and needs revision
• D ≤ .19: the item should be eliminated or completely revised
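A minimal sketch that computes D for the world history example and applies the interpretation guidelines. Function names are hypothetical. The guideline bands are written to two decimals, so treating them as lower thresholds (0.40, 0.30, 0.20) for unrounded values is an assumption.

```python
def discrimination_index(upper_correct, lower_correct, n_upper, n_lower):
    """D = U / n_U - L / n_L for a dichotomously scored item."""
    return upper_correct / n_upper - lower_correct / n_lower


def interpret_d(d):
    """Map D onto the guideline bands (ASSUMPTION: bands used as lower thresholds)."""
    if d >= 0.40:
        return "functioning quite satisfactorily"
    if d >= 0.30:
        return "little or no revision is required"
    if d >= 0.20:
        return "marginal and needs revision"
    return "should be eliminated or completely revised"


# 27% of 141 examinees, rounded to a whole number, in each extreme group.
group_size = round(141 * 0.27)
d = discrimination_index(18, 6, group_size, group_size)
print(group_size, round(d, 3), interpret_d(d))
```

For item 5 this gives a group size of 38 and a D of about 0.316, which falls in the "little or no revision is required" band.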

Item Total Correlation

Good item

High correlation

People who get item correct have high score on the test

People who get item wrong have low score on the test

Poor item

Low correlation: look at wording – may be testing reading skill


Choice Analysis

Whether more examinees choose the correct choice than choose the wrong choices

Whether a lot of examinees choose the wrong choices

Whether more examinees in the upper group than in the lower group choose the correct choice

Whether more examinees in the upper group than in the lower group choose a wrong choice

Whether there is any item for which quite a number of examinees make no choice

Psychological and Psychometric

Testing

Session 8&9

Prof. Swati Dhir

Excel Add-ins

• Use the Analysis ToolPak to perform complex data analysis

• If the Data Analysis command is not available, load the add-in

• Command: File_Options_Add-Ins_Manage_select Analysis ToolPak (check box and OK)

Literature Review (Home work)

Research Methodology

• Item Generation

• Content validation

• Adding some criterion related construct

• Context of the study

• Interitem Analysis

• Exploratory Factor Analysis

• Construct validity (Convergent and Divergent)

• External Validity

• Sampling Adequacy

• Reliability

• Criterion Validity (Predictive and Concurrent)

Content Validity

• Rating by experts

• 80% consensus

• Drop items whose ratings are not consistent

• Items may be reworded

• Command: Analyze_Descriptive Statistics_Crosstabs

• Select rater 1 as row and rater 2 as column

• Click Statistics_select Kappa_Continue


Example

Content Validity

Kappa might be interpreted as follows (Landis & Koch, 1977):

Kappa          Interpretation
< 0            Poor agreement
0.0 – 0.20     Slight agreement
0.21 – 0.40    Fair agreement
0.41 – 0.60    Moderate agreement
0.61 – 0.80    Substantial agreement
0.81 – 1.00    Almost perfect agreement
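For readers without SPSS, a minimal sketch of the kappa statistic for two raters, assuming the standard Cohen's kappa formula (observed agreement corrected for chance agreement). The expert ratings below are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater1, rater2):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    counts1, counts2 = Counter(rater1), Counter(rater2)
    # Chance agreement: product of the two raters' marginal proportions per category.
    categories = set(counts1) | set(counts2)
    expected = sum(counts1[c] * counts2[c] for c in categories) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical relevance ratings of 10 items by two experts.
rater1 = ["high", "high", "low", "moderate", "high",
          "low", "moderate", "high", "low", "high"]
rater2 = ["high", "high", "low", "moderate", "moderate",
          "low", "moderate", "high", "high", "high"]
kappa = cohens_kappa(rater1, rater2)
print(round(kappa, 3))
```

With these made-up ratings kappa is about 0.68, which the Landis & Koch bands would label substantial agreement.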

Data Entry

• Files export

• Variable view

• Missing Values (Analyze_Missing Value)

Descriptive Statistics (DS):

– Frequency (Analyze_DS_Frequency)

• Data cleaning

Interitem Analysis

• Selection of closely associated items, thereby increasing the reliability of the scale

• Mean, standard deviation and intercorrelations

• Though there is no definite cutoff score for adequate variability, an SD of 1 represents an adequate amount of variability for an item to be useful

• Any item that correlates at less than 0.40 with all other items should be dropped

• Too high a mean for a particular item: check for outliers

Command: Analyze_Correlate_Bivariate

Exploratory Factor Analysis

• Validity coefficient: the relationship between a test and a criterion is usually expressed as a correlation called a validity coefficient

• Principal Axis Factor analysis with Varimax rotation

• Factor loading > 0.5

• Square of factor loading is the percentage of variation in the criterion we can know from the test scores

• Command: Analyze_Dimension Reduction_Factor


• Most widely used of all factor number rules

• For any matrix of correlations, it is possible to compute a set of numerical values called eigenvalues.

• They reflect the variance accounted for by principal components,
  – with the first value reflecting the variance explained by the strongest component,
  – the second value the variance explained by the second strongest component, and so on.
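To make the eigenvalue idea concrete, here is a minimal pure-Python sketch that computes the eigenvalues of a small correlation matrix with the classical Jacobi rotation method. The matrix is hypothetical, and the slide does not spell out the retention rule; the sketch assumes the common eigenvalue-greater-than-one (Kaiser) criterion. In practice a statistics package would do this computation.

```python
import math


def jacobi_eigenvalues(matrix, tol=1e-12, max_sweeps=100):
    """Eigenvalues of a symmetric matrix via Jacobi rotations, largest first."""
    n = len(matrix)
    a = [row[:] for row in matrix]  # work on a copy
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Rotation angle that zeroes the (p, q) off-diagonal entry.
                theta = 0.5 * math.atan2(2 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


# Hypothetical correlation matrix: three items that all correlate 0.5.
correlations = [
    [1.0, 0.5, 0.5],
    [0.5, 1.0, 0.5],
    [0.5, 0.5, 1.0],
]
eigenvalues = jacobi_eigenvalues(correlations)
# ASSUMPTION: retain components with eigenvalue > 1 (Kaiser criterion).
n_retained = sum(value > 1 for value in eigenvalues)
print([round(v, 3) for v in eigenvalues], n_retained)
```

The eigenvalues sum to the number of items (3), and the first component alone accounts for 2 of those 3 units of variance, so one factor would be retained under the assumed rule.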