Reliability and Validity
Introduction to Study Skills & Research Methods (HL10040)
Dr James Betts
Lecture Outline:
•Definition of Terms
•Types of Validity
•Threats to Validity
•Types of Reliability
•Threats to Reliability
•Introduction to Measurement Error.
Commonly used terms…
“She has a valid point”
“My car is unreliable”
…in science…
“The conclusion of the study was not valid”
“The findings of the study were not reliable”.
Some definitions…
• Validity
“The soundness or appropriateness of a test or instrument in measuring what it is designed to measure”
(Vincent 1999)
Some definitions…
• Validity
“Degree to which a test or instrument measures what it purports to measure”
(Thomas & Nelson 1996)
Some definitions…
• Reliability
“…the degree to which a test or measure produces the same scores when applied in the same circumstances…”
(Nelson 1997)
Some definitions…
• Objectivity
“…the degree to which different observers agree on measurements…”
(Atkinson & Nevill 1998)
Types of Experimental Validity
• Internal
– Is the experimenter measuring the effect of the independent variable on the dependent variable?
• External
– Can the results be generalised to the wider population?
Validity
• Logical: Face, Content
• Statistical (AKA Criterion): Concurrent, Predictive
• Construct (Logical/Statistical)
Consistency
• Reliability
• Objectivity
Logical Validity
• Face Validity
– Implies that a test is valid by definition
– It is clear that the test measures what it is supposed to
e.g.
If you want to assess reaction time, measuring how long it takes an individual to react to a given stimulus would have face validity
Externally valid?
Logical Validity
• Face Validity
Assessing face validity is therefore a subjective process.
i.e.
Would assessing 15 m sprint time be a valid means of assessing reaction time?
Logical Validity
• Content Validity
– Implies that the test measures all aspects contributing to the variable of interest
…also a subjective process.
e.g.
Who is the most physically fit?
VO2 max test?
Wingate test?
1 RM?
Overall:
A logically valid test simply appears to measure the right variable in its entirety?
Statistical Validity
• Concurrent Validity
– Implies that the test produces similar results to a previously validated test
e.g.
VO2 max:
Incremental Treadmill Protocol with expired gas analysis
versus
Multi-Stage Fitness (Beep) Test
Statistical Validity
• Predictive Validity
– Implies that the test provides a valid reflection of future performance using a similar test
e.g.
Can performance during test A be used to predict future performance in test B?
A B
http://www.youtube.com/watch?v=vdPQ3QxDZ1s
Overall:
A statistically valid test produces results that agree with other similar tests?
Logical/Statistical Validity
• Construct Validity
– Implies not only that the test is measuring what it is supposed to, but also that it is capable of detecting what should exist, theoretically
– Therefore relates to hypothetical or intangible constructs
e.g.
Team Rivalry
Sportsmanship.
Logical/Statistical Validity
• Construct Validity
– This makes assessment difficult,
i.e. if what should exist cannot be detected, this could mean:
a) Test Invalid? b) Theory Incorrect? c) Sensitivity/Specificity Issues?
Interesting Example: Breast Cancer
• Incidence: ~1 % (0.8 %)
(i.e. approximately 1 in every 100 women tested actually has breast cancer, so ideally only they would receive a positive result)
• Sensitivity: ~90 % (87 %)
(the mammogram is sensitive enough that approximately 90 in every 100 breast cancer patients will receive a positive result)
• Specificity: ~90 % (93 %)
(the mammogram is specific enough that approximately 90 in every 100 healthy patients will receive a negative result).
Data from Kerlikowske et al. (1996)
Quick Test
• What is the probability that a patient receiving a positive result actually has breast cancer?
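One way to work through the Quick Test is Bayes' theorem: of all positive results, what fraction are true positives? The sketch below is a minimal illustration using the incidence, sensitivity and specificity quoted on the previous slide; the function name `positive_predictive_value` is chosen here for illustration and does not come from the lecture.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(has the condition | positive result), via Bayes' theorem."""
    # Proportion of everyone tested who is a true positive
    true_positives = prevalence * sensitivity
    # Proportion of everyone tested who is healthy but still tests positive
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)


# Rounded figures from the slide: ~1 %, ~90 %, ~90 %
rounded = positive_predictive_value(0.01, 0.90, 0.90)
# Bracketed figures from the slide: 0.8 %, 87 %, 93 %
reported = positive_predictive_value(0.008, 0.87, 0.93)

print(f"Rounded figures:  {rounded:.1%}")   # 8.3%
print(f"Reported figures: {reported:.1%}")  # 9.1%
```

Because healthy women vastly outnumber patients, the roughly 10 % of healthy women who receive a false positive swamp the true positives, so fewer than 1 in 10 positive results corresponds to an actual case.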
Threats to Validity
(and possible solutions?)
Threats to Internal Validity
• Maturation
– Changes in the DV over time irrespective of the IV
Threats to Internal Validity
• Maturation
e.g. One Group Pre-test Post-test
O1 T O2
Threats to Internal Validity
• Maturation (possible solution) Time series
O1 O2 O3 T O4 O5 O6
Threats to Internal Validity
• Maturation (possible solution)
Pre-test Post-test Randomised Group Comparison
R  O1 T O2
   O3 P O4
n.b. RCT
Threats to Internal Validity
• Maturation (possible solution)
Repeated measures designs can occasionally be an inappropriate solution, even when randomised and counterbalanced
e.g.
Muscle Damage (repeated bout effect)
Vitamin Supplementation (wash-out period)
In which case independent measures designs could be used.
Threats to Internal Validity
• History
– Unplanned events between measurements
Threats to Internal Validity
• History
O1 T O2
e.g. exercise?
Therefore, solution = control extraneous variables!
Threats to Internal/External Validity
• Pre-testing
– Interactive effects due to the pre-test (e.g. learning, sensitisation, etc.)
– Also influences External Validity
Threats to Internal/External Validity
• Pre-testing
e.g.
R  O1 T O2
   O3 P O4
Assessing muscle mass here could make them train harder in both trials…
…but then respond better to the T than the P…
…so it is actually T+O1 that is better than P, not T alone.
Threats to Internal/External Validity
• Pre-testing (possible solution)
Solomon Four-Group Design
R  O1 T O2
   O3 P O4
      T O5
      P O6
Threats to Internal Validity
• Statistical Regression
– AKA regression to the mean
– An initial extreme score is likely to be followed by less extreme subsequent scores
e.g.
Training has the greatest effect on untrained individuals.
Therefore, solution = effective sampling.
cf. the ‘Sophomore Slump’ and the Sports Illustrated (SI) ‘Cover Jinx’
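Regression to the mean can be demonstrated with a short simulation. The sketch below is illustrative only: it assumes each observed score is a stable "true" score plus random noise, and the population mean of 50, the standard deviations of 5 and the sample of 10,000 are hypothetical values chosen for the demonstration, not data from the lecture.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Assumed model: observed score = stable true score + random noise
true_scores = [rng.gauss(50, 5) for _ in range(10_000)]
test_1 = [t + rng.gauss(0, 5) for t in true_scores]
test_2 = [t + rng.gauss(0, 5) for t in true_scores]

# Select the top 10 % of performers using the first test only
cutoff = sorted(test_1)[int(0.9 * len(test_1))]
selected = [i for i, score in enumerate(test_1) if score >= cutoff]

first_mean = mean(test_1[i] for i in selected)
retest_mean = mean(test_2[i] for i in selected)
group_mean = mean(test_1)

print(f"Whole group, test 1:    {group_mean:.1f}")
print(f"Top performers, test 1: {first_mean:.1f}")
print(f"Top performers, test 2: {retest_mean:.1f}")
```

No treatment is applied between the two tests, yet the selected group scores closer to the overall mean on retest. Extreme first scores are partly good luck, and the luck does not repeat.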
Threats to Internal Validity
• Instrumentation
– A difference in the way 2 comparable variables were measured
e.g.
Uncalibrated equipment
Therefore, solution = calibrate!
Threats to Internal Validity
• Selection Bias
– The groups for comparison are not equivalent
Threats to Internal Validity
• Selection Bias
e.g. Groups not randomly assigned
Static Group Comparison
T O1
P Oa
i.e.
Group T were resistance trained to start with
Threats to Internal Validity
• Selection Bias (possible solution)
T O1
P Oa
Either:
- Randomise group assignment,
- Compare the pre-test to post-test difference,
- Use a repeated measures design.
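Randomised group assignment is simple to carry out in practice. A minimal sketch, assuming 20 participants identified by hypothetical codes P01 to P20 and two equal groups (treatment and placebo); the helper name `randomise_groups` is illustrative:

```python
import random


def randomise_groups(participants, seed=None):
    """Shuffle the participants and split them into two equal-sized groups."""
    rng = random.Random(seed)      # pass a seed to make the allocation repeatable
    shuffled = list(participants)  # copy so the original list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


participants = [f"P{i:02d}" for i in range(1, 21)]  # hypothetical ID codes
treatment, placebo = randomise_groups(participants, seed=1)

print("Treatment:", sorted(treatment))
print("Placebo:  ", sorted(placebo))
```

Because chance alone decides who goes where, pre-existing differences (e.g. training status) should be spread evenly across the groups on average, although a pre-test is still useful for checking this in small samples.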
Threats to Internal/External Validity
• Experimental Mortality
– Missing Data due to subject drop-out
– Reduced n = reduced statistical Power
– Not only challenges quality of data gathered (Internal Validity) but also our ability to generalise (External Validity).
Therefore, solution = recruit sufficient participants (young?)
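The link between drop-out and statistical power can be sketched numerically. The example below uses a normal approximation for a two-sided comparison of two independent groups; the effect size of 0.5 and the group sizes are hypothetical, and an exact t-test calculation would give slightly lower values.

```python
from statistics import NormalDist


def approx_power(n_per_group, effect_size, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison.

    Normal approximation only; the negligible chance of a significant
    result in the wrong direction is ignored.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected test statistic for a standardised effect with n per group
    expected_z = effect_size * (n_per_group / 2) ** 0.5
    return z.cdf(expected_z - z_crit)


recruited = approx_power(64, 0.5)  # hypothetical: 64 per group complete the study
completed = approx_power(40, 0.5)  # hypothetical: 24 per group drop out

print(f"Power with 64 per group: {recruited:.2f}")
print(f"Power with 40 per group: {completed:.2f}")
```

Recruiting more participants than the minimum required leaves a margin so that drop-out does not push power below an acceptable level.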
Threats to External Validity
• Inadequate description
– 5th characteristic of research… …should be replicable
If nobody can replicate the methods of a given study, then it is irrefutable and therefore lacks external validity.
Therefore, solution = comprehensive methodology
Threats to External Validity
• Biased sampling
– Linked to statistical regression
– Sample does not reflect target population
– n ≠ N
e.g. Results generalised across gender
Therefore, solution = random sample (of target population).
Threats to External Validity
• Hawthorne Effect
– DV is influenced by the fact that it is being recorded
e.g.
Fastest sprint when professor enters lab
Therefore, solution = control the lab environment.
Threats to External Validity
• Demand Characteristics
– Participants detect the purpose of the study and behave accordingly
e.g.
Sports Science students already know that the carbohydrate drink is supposedly superior (CHO vs. H2O)
Therefore, solution = double or single blinding.
Threats to External Validity
• Operationalisation
– AKA Ecological Validity
– The DV must have some relevance in the ‘real world’
e.g.
Time to exhaustion (TTE) has no Olympic equivalent
Therefore, solution = choose your DV carefully.
Reliability
• Reliability is a pre-requisite of validity
e.g. Direct versus Indirect measures of VO2 max
Direct: Gold Standard (i.e. valid and reliable), Expensive, Complex
Indirect: Predictive, Cheap, Easy
Reliability
Subject 1 60 ml.kg-1.min-1 60 ml.kg-1.min-1 60 ml.kg-1.min-1
Subject 2 55 ml.kg-1.min-1 55 ml.kg-1.min-1 55 ml.kg-1.min-1
Subject 3 70 ml.kg-1.min-1 70 ml.kg-1.min-1 70 ml.kg-1.min-1
Valid and Reliable
Reliability
Subject 1 60 ml.kg-1.min-1 65 ml.kg-1.min-1 65 ml.kg-1.min-1
Subject 2 55 ml.kg-1.min-1 60 ml.kg-1.min-1 60 ml.kg-1.min-1
Subject 3 70 ml.kg-1.min-1 75 ml.kg-1.min-1 75 ml.kg-1.min-1
Not Valid but Reliable
5 ml.kg-1.min-1 correction?
Reliability
Subject 1 60 ml.kg-1.min-1 72 ml.kg-1.min-1 57 ml.kg-1.min-1
Subject 2 55 ml.kg-1.min-1 61 ml.kg-1.min-1 52 ml.kg-1.min-1
Subject 3 70 ml.kg-1.min-1 40 ml.kg-1.min-1 84 ml.kg-1.min-1
Not Valid and not Reliable
i.e. a test can never be valid without being reliable?
Types of Reliability
• Relative
• Absolute
• Rater reliability (Objectivity)
– Intrarater reliability – Interrater reliability.
Relative Reliability
Subject 1 60 ml.kg-1.min-1 63 ml.kg-1.min-1 57 ml.kg-1.min-1
Subject 2 55 ml.kg-1.min-1 56 ml.kg-1.min-1 48 ml.kg-1.min-1
Subject 3 70 ml.kg-1.min-1 65 ml.kg-1.min-1 66 ml.kg-1.min-1
Relatively Reliable
i.e. Individuals maintain position in the group
Absolute Reliability
Subject 1 60 ml.kg-1.min-1 63 ml.kg-1.min-1 57 ml.kg-1.min-1
Subject 2 55 ml.kg-1.min-1 56 ml.kg-1.min-1 48 ml.kg-1.min-1
Subject 3 70 ml.kg-1.min-1 65 ml.kg-1.min-1 66 ml.kg-1.min-1
Not Absolutely Reliable
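The same three sets of scores can be relatively reliable yet not absolutely reliable. A minimal sketch using the values from the table above (ml.kg-1.min-1): relative reliability is checked here by whether the rank order of subjects is preserved across trials, and absolute reliability by how far each individual's own scores spread. Using rank order and range is a simplification chosen for illustration; it is not the statistical method set out in the lecture.

```python
# Three trials per subject, values in ml.kg-1.min-1 (from the table above)
scores = {
    "Subject 1": [60, 63, 57],
    "Subject 2": [55, 56, 48],
    "Subject 3": [70, 65, 66],
}


def ranking(trial):
    """Subjects ordered from highest to lowest score on one trial."""
    return sorted(scores, key=lambda s: scores[s][trial], reverse=True)


# Relative reliability: does everyone keep the same position in the group?
rankings = [ranking(trial) for trial in range(3)]
rank_order_kept = all(r == rankings[0] for r in rankings)

# Absolute reliability: how much do scores vary within each individual?
ranges = {s: max(v) - min(v) for s, v in scores.items()}

print("Rank order kept across trials:", rank_order_kept)  # True
print("Within-subject range:", ranges)  # 6, 8 and 5 ml.kg-1.min-1
```

The ranking never changes, but an individual's score can move by up to 8 ml.kg-1.min-1 between trials, which would mask any real change smaller than that.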
i.e. Test-Retest within individuals
Rater Reliability
• Intrarater reliability
– The consistency of a given observer or measurement tool on more than one occasion
Rater Reliability
• Interrater reliability
– The consistency of a given measurement from more than one observer or measurement tool
e.g.
Score for the American Gymnast
British Judge = 9.9
French Judge = 4.4
Japanese Judge = 7.0
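The disagreement between the judges can be put into numbers. A minimal sketch using the three scores from the example; the range and standard deviation are simple descriptive choices made here for illustration, not a formal index of interrater reliability.

```python
from statistics import mean, stdev

# Scores awarded to the same routine by three different judges
judge_scores = {"British": 9.9, "French": 4.4, "Japanese": 7.0}

values = list(judge_scores.values())
average = mean(values)
spread = max(values) - min(values)  # gap between the most and least generous judge
sd = stdev(values)                  # sample standard deviation across judges

print(f"Mean score: {average:.1f}")
print(f"Range:      {spread:.1f}")
print(f"SD:         {sd:.2f}")
```

A gap of more than five points on the same performance suggests the measurement depends heavily on who is observing, i.e. poor objectivity.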
Threats to Reliability
• Fatigue
Subject 1: 60 ml.kg-1.min-1 (8 am), 55 ml.kg-1.min-1 (9 am), 50 ml.kg-1.min-1 (10 am)
Therefore, solution = increase time between tests.
Threats to Reliability
• Habituation
Subject 1 60 ml.kg-1.min-1 65 ml.kg-1.min-1 70 ml.kg-1.min-1
Therefore, solution = familiarise prior to test.
Threats to Reliability
• Standardisation of Procedures
– Control of extraneous variables
• Precision of Measurements
– i.e. if we are happy to measure VO2 max to the nearest 10 ml.kg-1.min-1, then it could probably be reliably predicted from your training volume and age.
Measurement Errors
• Ultimately, reliability is dependent on the degree of measurement error in a given study
• The overall error in any measurement comprises both systematic and random error
• We will address measurement error further next week…
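The split into systematic and random error can be previewed with the VO2 max values from the earlier Reliability slides. The sketch below assumes that the first column on those slides is the criterion ("true") value and the second column is a test measurement; that reading of the columns is an assumption, as is the helper name `error_components`. Systematic error is estimated as the mean difference from the criterion and random error as the standard deviation of those differences.

```python
from statistics import mean, stdev


def error_components(criterion, measured):
    """Return (systematic error, random error) of a test against a criterion."""
    diffs = [m - c for c, m in zip(criterion, measured)]
    return mean(diffs), stdev(diffs)


criterion = [60, 55, 70]     # assumed criterion values, ml.kg-1.min-1
offset_test = [65, 60, 75]   # from the "Not Valid but Reliable" slide
erratic_test = [72, 61, 40]  # from the "Not Valid and not Reliable" slide

offset_bias, offset_sd = error_components(criterion, offset_test)
erratic_bias, erratic_sd = error_components(criterion, erratic_test)

print(f"Offset test:  bias = {offset_bias}, random error = {offset_sd}")
print(f"Erratic test: bias = {erratic_bias}, random error = {erratic_sd:.1f}")
```

The first test is wrong by a constant 5 ml.kg-1.min-1 with no scatter, so a simple correction would fix it. The second has a small average bias but very large scatter, which no constant correction can remove.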
Literature Search Assignment
• The handout lists 8 questions which can be answered through retrieving the corresponding source articles
• Answer as many as possible and bring them to next week’s lecture
• DO NOT contact author or order articles.
Selected Reading
• Atkinson, G. and A. M. Nevill. Statistical methods for assessing measurement error (reliability) in variables relevant to sports medicine. Sports Medicine. 26:217-238, 1998.
• Holmes, T. H. Ten categories of statistical errors: a guide for research in endocrinology and metabolism. American Journal of Physiology. 286: E495-501.
• Thomas J. R. & Nelson J. K. (2001) Research Methods in Physical Activity, 4th edition. Champaign, Illinois: Human Kinetics