Reliability and Validity

(1)

Introduction to Study Skills & Research Methods (HL10040)

Dr James Betts

(2)

Lecture Outline:

•Definition of Terms

•Types of Validity

•Threats to Validity

•Types of Reliability

•Threats to Reliability

•Introduction to Measurement Error.

(3)

Commonly used terms…

“She has a valid point”

“My car is unreliable”

…in science…

“The conclusion of the study was not valid”

“The findings of the study were not reliable”.

(4)

Some definitions…

• Validity

“The soundness or appropriateness of a test or instrument in measuring what it is designed to measure”

(Vincent 1999)

(5)

Some definitions…

• Validity

“Degree to which a test or instrument measures what it purports to measure”

(Thomas & Nelson 1996)

(6)

Some definitions…

• Reliability

“…the degree to which a test or measure produces the same scores when applied in the same circumstances…”

(Nelson 1997)

(7)

Some definitions…

• Objectivity

“…the degree to which different observers agree on measurements…”

(Atkinson & Nevill 1998)

(8)

Types of Experimental Validity

• Internal

– Is the experimenter measuring the effect of the independent variable on the dependent variable?

• External

– Can the results be generalised to the wider population?

(9)

Validity

– Logical: Face, Content

– Statistical (AKA Criterion): Concurrent, Predictive

– Construct

Reliability

– Consistency

– Objectivity

(10)

Logical Validity

• Face Validity

– Implies that a test is valid by definition

– It is clear that the test measures what it is supposed to

e.g.

If you want to assess reaction time, measuring how long it takes an individual to react to a given stimulus would have face validity.

Externally valid?

(11)

Logical Validity

• Face Validity

– Implies that a test is valid by definition

– It is clear that the test measures what it is supposed to

Assessing face validity is therefore a subjective process.

e.g.

Would assessing 15 m sprint time be a valid means of assessing reaction time?

(12)

Logical Validity

• Content Validity

– Implies that the test measures all aspects contributing to the variable of interest

…also a subjective process.

e.g.

Who is the most physically fit?

VO₂ max test?

Wingate test?

1 RM?

(13)

Overall:

A logically valid test simply appears to measure the right variable in its entirety?

(14)

Statistical Validity

• Concurrent Validity

– Implies that the test produces similar results to a previously validated test

e.g.

VO₂ max

Incremental Treadmill Protocol with expired gas analysis versus Multi-Stage Fitness (Beep) Test

(15)

Statistical Validity

• Predictive Validity

– Implies that the test provides a valid reflection of future performance using a similar test

e.g.

Can performance during test A be used to predict future performance in test B?

http://www.youtube.com/watch?v=vdPQ3QxDZ1s

(16)

Overall:

A statistically valid test produces results that agree with other similar tests?

(17)

Logical/Statistical Validity

• Construct Validity

– Implies not only that the test is measuring what it is supposed to, but also that it is capable of detecting what should exist, theoretically

– Therefore relates to hypothetical or intangible constructs

e.g.

Team Rivalry

Sportsmanship.

(18)

Logical/Statistical Validity

• Construct Validity

– Implies not only that the test is measuring what it is supposed to, but also that it is capable of detecting what should exist, theoretically

– Therefore relates to hypothetical or intangible constructs

– This makes assessment difficult, i.e. if what should exist cannot be detected, this could mean:

a) Test Invalid?

b) Theory Incorrect?

c) Sensitivity/Specificity Issues?

(19)

Interesting Example: Breast Cancer

• Prevalence: ~1 % (0.8 %)

(i.e. approximately 1 in every 100 women tested actually has breast cancer)

• Sensitivity: ~90 % (87 %)

(the mammogram is sensitive enough that approximately 90 in every 100 breast cancer patients will receive a positive result)

• Specificity: ~90 % (93 %)

(the mammogram is specific enough that approximately 90 in every 100 healthy women will receive a negative result).

Data from Kerlikowske et al. (1996)

(20)

Quick Test

• What is the probability that a patient receiving a positive result actually has breast cancer?
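One way to check an answer to the Quick Test is Bayes' theorem. The following is a minimal sketch rather than part of the original slides: the function name `positive_predictive_value` and its parameter names are illustrative assumptions, and the inputs are the rounded (~1 %, ~90 %, ~90 %) and bracketed (0.8 %, 87 %, 93 %) figures from the breast cancer example, with the first figure taken as the proportion of women tested who have the disease.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Return P(disease | positive test) using Bayes' theorem.

    All three arguments are proportions between 0 and 1.
    """
    # Proportion of everyone tested who is ill AND tests positive.
    true_positives = prevalence * sensitivity
    # Proportion of everyone tested who is healthy BUT tests positive.
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)


# Rounded figures from the slide: ~1 %, ~90 %, ~90 %.
rounded = positive_predictive_value(0.01, 0.90, 0.90)
# Bracketed figures from the slide: 0.8 %, 87 %, 93 %.
exact = positive_predictive_value(0.008, 0.87, 0.93)

print(f"Rounded figures: {rounded:.1%}")
print(f"Bracketed figures: {exact:.1%}")
```

With either set of figures, fewer than 1 in 10 positive results corresponds to an actual cancer, because false positives from the much larger healthy group outnumber the true positives.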

(22)

Threats to Validity

(and possible solutions?)

(23)

Threats to Internal Validity

• Maturation

– Changes in the DV over time irrespective of the IV

(24)

Threats to Internal Validity

• Maturation

e.g. One Group Pre-test Post-test

O₁ T O₂

(O = observation, T = treatment)

(25)

Threats to Internal Validity

• Maturation (possible solution)

Time series

O₁ O₂ O₃ T O₄ O₅ O₆

(26)

Threats to Internal Validity

• Maturation (possible solution)

Pre-test Post-test Randomised Group Comparison

R   O₁  T  O₂
R   O₃  P  O₄

n.b. RCT (randomised controlled trial)
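The "R" in this design stands for random assignment to the treatment (T) and placebo (P) groups. As a minimal sketch of what that step could look like in practice (the function name `randomise_groups`, the participant IDs and the seed are all hypothetical, chosen only for illustration):

```python
import random


def randomise_groups(participants, seed=None):
    """Randomly split participants into treatment (T) and placebo (P) groups."""
    rng = random.Random(seed)      # a seed makes the allocation reproducible
    shuffled = list(participants)  # copy so the original list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"T": shuffled[:half], "P": shuffled[half:]}


# Hypothetical participant IDs S01 to S12.
participants = [f"S{i:02d}" for i in range(1, 13)]
groups = randomise_groups(participants, seed=1)

print("Treatment:", groups["T"])
print("Placebo:  ", groups["P"])
```

Because allocation depends only on the shuffle, any maturation between the two observations should on average affect both groups equally.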

(27)

Threats to Internal Validity

• Maturation (possible solution)

Repeated measures designs can occasionally be an inappropriate solution, even when randomised and counterbalanced

e.g.

Muscle Damage (repeated bout effect)

Vitamin Supplementation (wash-out period)

In which case independent measures designs could be used.

(28)

Threats to Internal Validity

• History

– Unplanned events between measurements

(29)

Threats to Internal Validity

• History

O₁ T O₂

e.g. exercise?

Therefore, solution = control extraneous variables!

(30)

Threats to Internal/External Validity

• Pre-testing

– Interactive effects due to the pre-test (e.g. learning, sensitisation, etc.)

– Also influences External Validity

(31)

Threats to Internal/External Validity

• Pre-testing

e.g.

R   O₁  T  O₂
R   O₃  P  O₄

Assessing muscle mass at the pre-test (O₁ and O₃) could make them train harder in both trials…

…but then respond better to the T than the P…

…so it is actually T+O₁ that is better than P, not T alone.

(32)

Threats to Internal/External Validity

• Pre-testing (possible solution)

Solomon Four-Group Design

R   O₁  T  O₂
R   O₃  P  O₄
R       T  O₅
R       P  O₆

(33)

Threats to Internal Validity

• Statistical Regression

– AKA regression to the mean

– An initial extreme score is likely to be followed by less extreme subsequent scores

e.g.

Training has the greatest effect on untrained individuals.

Therefore, solution = effective sampling.

Sophomore Slump & Sports Illustrated (SI) ‘Cover Jinx’
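Regression to the mean can be demonstrated with a small simulation. This is a sketch under stated assumptions, not an analysis from the lecture: each observed score is modelled as a stable "true" ability plus independent random noise, there is no intervention between the two tests, and the means, standard deviations and sample size are arbitrary illustrative values.

```python
import random


def simulate_regression_to_mean(n=10_000, seed=0):
    """Select the top 10 % on test 1 and compare their means on tests 1 and 2."""
    rng = random.Random(seed)
    # Assumed model: observed score = true ability + random error.
    true_scores = [rng.gauss(50, 10) for _ in range(n)]
    test1 = [t + rng.gauss(0, 10) for t in true_scores]
    test2 = [t + rng.gauss(0, 10) for t in true_scores]

    # Pick the individuals with the most extreme (highest) test 1 scores.
    cutoff = sorted(test1)[int(0.9 * n)]
    selected = [i for i in range(n) if test1[i] >= cutoff]

    mean1 = sum(test1[i] for i in selected) / len(selected)
    mean2 = sum(test2[i] for i in selected) / len(selected)
    return mean1, mean2


mean1, mean2 = simulate_regression_to_mean()
print(f"Top 10 % on test 1, mean at test 1: {mean1:.1f}")
print(f"Same individuals, mean at test 2:   {mean2:.1f}")
```

The selected group scores lower on the second test even though nothing changed between tests, which is why a group chosen for extreme initial scores can appear to respond to a treatment that had no effect.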

(34)

Threats to Internal Validity

• Instrumentation

– A difference in the way 2 comparable variables were measured

e.g.

Uncalibrated equipment

Therefore, solution = calibrate!

(35)

Threats to Internal Validity

• Selection Bias

– The groups for comparison are not equivalent

(36)

Threats to Internal Validity

• Selection Bias

e.g. Groups not randomly assigned

Static Group Comparison

T O₁

P O₂

i.e.

Group T were resistance trained to start with

(37)

Threats to Internal Validity

• Selection Bias (possible solution)

T O₁

P O₂

Either:

– Randomise group assignment,

– Pre-test and post-test difference,

– Repeated Measures Design.

(38)

Threats to Internal/External Validity

• Experimental Mortality

– Missing data due to subject drop-out

– Reduced n = reduced statistical power

– Not only challenges quality of data gathered (Internal Validity) but also our ability to generalise (External Validity).

Therefore, solution = recruit sufficient participants (young?)
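The link between drop-out and statistical power can be illustrated numerically. The sketch below uses a normal (z-test) approximation for comparing two equal-sized groups; the function name `approx_power`, the standardised effect size of 0.8 and the group sizes are hypothetical values chosen for illustration, and an exact t-test calculation would give slightly lower power at small n.

```python
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test with equal groups."""
    z = NormalDist()  # standard normal distribution
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the alternative hypothesis.
    shift = effect_size * (n_per_group / 2) ** 0.5
    # Probability that the statistic falls in either rejection region.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


# Hypothetical study: 20 per group recruited, then progressive drop-out.
for n in (20, 15, 10):
    print(f"n = {n} per group: power = {approx_power(0.8, n):.2f}")
```

Power falls steadily as participants are lost, so recruiting more than the minimum required n leaves a margin for drop-out.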

(39)

Threats to External Validity

• Inadequate description

– 5th characteristic of research: it should be replicable

If nobody can replicate the methods of a given study, then it is irrefutable and therefore lacks external validity.

Therefore, solution = comprehensive methodology

(40)

Threats to External Validity

• Biased sampling

– Linked to statistical regression

– Sample does not reflect target population

– n ≠ N

e.g. Results generalised across gender

Therefore, solution = random sample (of target population).

(41)

Threats to External Validity

• Hawthorne Effect

– DV is influenced by the fact that it is being recorded

e.g.

Fastest sprint when professor enters lab

Therefore, solution = control the lab environment.

(42)

Threats to External Validity

• Demand Characteristics

– Participants detect the purpose of the study and behave accordingly

e.g.

Sports Science students already know that the carbohydrate drink is supposedly superior (CHO vs H₂O)

Therefore, solution = double or single blinding.

(43)

Threats to External Validity

• Operationalisation

– AKA Ecological Validity

– The DV must have some relevance in the ‘real world’

e.g.

Time to exhaustion (TTE) has no Olympic equivalent

Therefore, solution = choose your DV carefully.

(44)

Reliability

• Reliability is a pre-requisite of validity

e.g. Direct versus Indirect measures of VO₂ max

Direct: Gold Standard (i.e. valid and reliable), Expensive, Complex

Indirect: Predictive, Cheap, Easy

(45)

Reliability

Subject 1: 60 ml·kg⁻¹·min⁻¹, 60 ml·kg⁻¹·min⁻¹, 60 ml·kg⁻¹·min⁻¹

Subject 2: 55 ml·kg⁻¹·min⁻¹, 55 ml·kg⁻¹·min⁻¹, 55 ml·kg⁻¹·min⁻¹

Valid and Reliable

(46)

Reliability

Not Valid but Reliable

5 ml·kg⁻¹·min⁻¹ correction?

(47)

Reliability

Not Valid and not Reliable

i.e. a test can never be valid without being reliable?

(48)

Types of Reliability

• Relative

• Absolute

• Rater reliability (Objectivity)

– Intrarater reliability

– Interrater reliability.

(49)

Relative Reliability

Relatively Reliable

i.e. Individuals maintain position in the group

(50)

Absolute Reliability

Not Absolutely Reliable

i.e. Test-Retest within individuals
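The difference between relative and absolute reliability can be shown with a small worked example. This is a sketch using hypothetical test and retest scores (not data from the lecture): every individual scores 5 units higher on the retest, so rank order is perfectly preserved while the individual values are not reproduced.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical VO2 max scores (ml/kg/min) for five individuals.
test = [60, 55, 50, 45, 40]
retest = [65, 60, 55, 50, 45]  # everyone scores 5 higher on the retest

r = pearson_r(test, retest)                              # relative reliability
mean_change = mean(b - a for a, b in zip(test, retest))  # absolute reliability

print(f"Test-retest correlation: {r:.2f}")
print(f"Mean within-individual change: {mean_change}")
```

A perfect correlation here coexists with a consistent 5-unit shift within every individual, so a high correlation alone does not show that repeated scores agree.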

(51)

Rater Reliability

• Intrarater reliability

– The consistency of a given observer or measurement tool on more than one occasion

(52)

Rater Reliability

• Interrater reliability

– The consistency of a given measurement from more than one observer or measurement tool

e.g.

Score for the American Gymnast:

British Judge = 9.9

French Judge = 4.4

Japanese Judge = 7.0
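The disagreement between the three judges can be summarised numerically. The sketch below uses the scores from the example; the choice of range and standard deviation as summaries is an assumption made for illustration, and formal agreement statistics are not covered here.

```python
from statistics import mean, stdev

# Judges' scores for the same performance, taken from the example.
scores = {"British": 9.9, "French": 4.4, "Japanese": 7.0}

values = list(scores.values())
score_range = max(values) - min(values)  # largest disagreement between judges

print(f"Mean score: {mean(values):.1f}")
print(f"Range: {score_range:.1f}")
print(f"Standard deviation: {stdev(values):.2f}")
```

A spread this large relative to the scoring scale indicates poor interrater reliability, i.e. low objectivity.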

(53)

Threats to Reliability

• Fatigue

e.g. repeated tests at 8 am, 9 am and 10 am

Therefore, solution = increase time between tests.

(54)

Threats to Reliability

• Habituation

Therefore, solution = familiarise prior to test.

(55)

Threats to Reliability

• Standardisation of Procedures

– Control of extraneous variables

• Precision of Measurements

– i.e. if we are happy to measure VO₂ max to the nearest 10 ml·kg⁻¹·min⁻¹, then it could probably be reliably predicted from your training volume and age.

(56)

Measurement Errors

• Ultimately, reliability is dependent on the degree of measurement error in a given study

• The overall error in any measurement comprises both systematic and random error

• We will address measurement error further next week…
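The two components of measurement error can be separated in a simple simulation. This is a sketch under an assumed additive model (observed = true value + systematic error + random error); the function name `measure`, the true value of 60, the bias of -5 and the random error SD of 2 are hypothetical values chosen for illustration.

```python
import random
from statistics import mean, stdev


def measure(true_value, systematic_error, random_sd, n, seed=0):
    """Simulate n repeated measurements of one true value.

    Assumed model: observed = true + systematic error + random error.
    """
    rng = random.Random(seed)
    return [true_value + systematic_error + rng.gauss(0, random_sd)
            for _ in range(n)]


# Hypothetical instrument that reads 5 units low with a random scatter of SD 2.
readings = measure(true_value=60.0, systematic_error=-5.0, random_sd=2.0, n=1000)

bias_estimate = mean(readings) - 60.0  # reflects the systematic error
spread = stdev(readings)               # reflects the random error

print(f"Estimated bias: {bias_estimate:.2f}")
print(f"Estimated random error (SD): {spread:.2f}")
```

Averaging many readings reduces the influence of random error but leaves the systematic error untouched, which is why a consistent offset calls for a correction or recalibration instead of more repeats.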

(57)

Literature Search Assignment

• The handout lists 8 questions which can be answered by retrieving the corresponding source articles

• Answer as many as possible and bring them to next week’s lecture

• DO NOT contact authors or order articles.

(58)

Selected Reading

• Atkinson, G. and A. M. Nevill. Statistical methods for assessing measurement error (reliability) in variables relevant to sports medicine. Sports Medicine. 26:217-238, 1998.

• Holmes, T. H. Ten categories of statistical errors: a guide for research in endocrinology and metabolism. American Journal of Physiology. 286:E495-501.

• Thomas, J. R. and J. K. Nelson. Research Methods in Physical Activity, 4th edition. Champaign, Illinois: Human Kinetics, 2001.
