Assumptions of linear models
Assumptions
• Apply to response variable
– within each group if predictor categorical
• Apply to error terms from linear model
– check by analysing residuals
• Normality
• Homogeneity of variance
• Independence
Data exploration
• Describe distribution of data
– transform if required and appropriate: logs, square/fourth root
• Check assumptions of analysis
• Evaluate fit of model
• Find patterns in multivariate data
Boxplot
[Figure: boxplot of Length (axis 50 to 350), annotated with the largest value, smallest value, median, and two segments each containing 25% of values]
[Figure: count (0 to 70) plotted against limpet numbers per quadrat (0 to 90), with outliers marked]
[Figure: four boxplot panels: 1. symmetrical, equal variances; 2. skewed; 3. outliers; 4. unequal variances]
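The quantities a boxplot displays can be computed directly. The following is a minimal Python sketch using made-up limpet counts (not data from these slides). It assumes the common 1.5 × IQR convention for flagging outliers, which the slides do not specify; `boxplot_summary` is a hypothetical helper name.

```python
import statistics

# Hypothetical limpet counts per quadrat (illustrative values only).
counts = [2, 3, 4, 5, 6, 7, 8, 9, 10, 60]

def boxplot_summary(values, k=1.5):
    """Return the five-number summary and the points beyond k * IQR from the box."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr   # assumed convention: 1.5 * IQR below the box
    upper_fence = q3 + k * iqr   # assumed convention: 1.5 * IQR above the box
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "min": min(values), "q1": q1, "median": median,
        "q3": q3, "max": max(values), "outliers": outliers,
    }

summary = boxplot_summary(counts)
print(summary)
```

Different software uses slightly different quartile definitions, so the box edges may not match another package exactly.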
Scatterplots
• Plotting bivariate data
• Value of two variables recorded for each observation
• Each variable plotted on one axis (X or Y)
• Symbols represent each observation
• Assess relationship between two variables
[Figure: scatterplot of SPECIES (y-axis, 0 to 30) against AREA (x-axis, 0 to 30000)]
Model residuals
• Residual is difference between observed and predicted value of response variable
– regression model: $y_i - \hat{y}_i$
– ANOVA model: $y_{ij} - \bar{y}_i$
• Standardised (studentised) residuals
– residual / SE of residual
– follow a t-distribution
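As an illustration, both kinds of residual can be computed by hand. This is a sketch with made-up data and hypothetical function names. The standardisation shown (residual divided by its estimated standard error, using leverage) is one common definition; the slides do not spell out which version they intend.

```python
import math
import statistics

def regression_residuals(x, y):
    """Residuals (observed minus predicted y) from a least-squares line,
    together with their standardised versions."""
    n = len(x)
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in resid) / (n - 2)            # residual variance
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    # Assumed definition: residual divided by its estimated standard error.
    standardised = [e / math.sqrt(mse * (1 - h)) for e, h in zip(resid, leverage)]
    return resid, standardised

def anova_residuals(groups):
    """Residuals for a single factor ANOVA: each observation minus its group mean."""
    return [[yij - statistics.fmean(g) for yij in g] for g in groups]

# Hypothetical data.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
resid, standardised = regression_residuals(x, y)
group_resid = anova_residuals([[4.0, 5.0, 6.0], [9.0, 10.0, 14.0]])
print(resid)
print(group_resid)
```

In both models the residuals sum to zero (within each group for the ANOVA model), which is a quick sanity check on any implementation.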
Normality
Y normally distributed at each value of X:
– boxplots of Y, separate for each group if appropriate, should be symmetrical - watch out for outliers and skewness
– transformations of Y often help
– regression and ANOVA tests are robust to violations of this assumption
Homogeneity of variance
Variance (spread) of Y should be constant for each value of $x_i$ (homogeneity of variance):
– skewed populations or outliers produce unequal variances
– transformations that improve normality of Y will also usually make variance of Y more constant
Plots of residuals in regression
[Figure: two panels, each pairing a scatterplot of y against x with a plot of residuals (negative to positive, centred on 0) against predicted $y_i$]
ANOVA checks
• Plot residuals (or variances) against group means
• Tests for equal variances – Bartlett’s, Cochran’s, Levene’s tests
• ANOVA reliable if group n’s are equal and variances not too different:
• ratio of largest to smallest variance ≤ 3:1
[Figure: variance and residuals plotted against group mean]
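The 3:1 guideline for the ratio of largest to smallest variance is simple to check. A minimal sketch, using hypothetical data with equal group sizes; `variance_ratio` is an illustrative name, not a library function.

```python
import statistics

def variance_ratio(groups):
    """Ratio of the largest to the smallest sample variance across groups."""
    variances = [statistics.variance(g) for g in groups]
    return max(variances) / min(variances)

# Hypothetical data: three groups, equal n.
groups = [
    [10.0, 12.0, 14.0, 16.0, 18.0],
    [20.0, 23.0, 26.0, 29.0, 32.0],
    [5.0, 6.0, 7.0, 8.0, 9.0],
]
ratio = variance_ratio(groups)
print(f"largest:smallest variance = {ratio:.1f}:1")
print("within 3:1 guideline" if ratio <= 3 else "exceeds 3:1 guideline")
```

Here the group variances are 10, 22.5 and 2.5, so the ratio exceeds the guideline and a transformation would be worth trying.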
Independence
Values of Y are independent of each other:
– no replicate used more than once
– observations independent within and between groups
– watch out for data which are a time series on same experimental or sampling units
– should be considered at design stage
Repeated measures analyses
– suitable for some non-independent designs
Linearity (regression)
True population relationship between Y and X is linear:
– scatterplot of Y against X
– watch out for asymptotic or exponential patterns
– transformations of Y or Y and X often help
Transformations
• Transform variables to new scale
– e.g. degrees Fahrenheit to degrees Celsius
• Statistical transformations
– non-linear (changes shape of distribution)
– monotonic (retains rank order of values)
• If Y (therefore error terms) skewed:
– log or power transformation of Y
– improves homogeneity of variance
– can reduce influence of outliers
• If nonlinear relationship:
– linearise by transformation of Y and/or X
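Linearising by transformation can be sketched as follows. The example assumes a hypothetical exponential relationship with no noise, chosen only so the effect is easy to see; `fit_line` is an illustrative helper.

```python
import math
import statistics

def fit_line(x, y):
    """Least-squares slope, intercept and r^2 for y = intercept + slope * x."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar, sxy ** 2 / (sxx * syy)

# Hypothetical exponential relationship: y = 2 * exp(0.3 * x).
x = [float(i) for i in range(1, 11)]
y = [2.0 * math.exp(0.3 * xi) for xi in x]

_, _, r2_raw = fit_line(x, y)                       # curved on the raw scale
slope, intercept, r2_log = fit_line(x, [math.log(v) for v in y])

print(f"raw scale: r^2 = {r2_raw:.3f}")
print(f"log scale: r^2 = {r2_log:.3f}, slope = {slope:.3f}")
```

On the log scale the slope recovers the exponent (0.3) and the intercept recovers log(2), because log y = log 2 + 0.3x is a straight line.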
Data transformations
• Common transformations for biological data
– log, square root or 4th root for skewed continuous distributions
– arcsin√ for proportions and %
• Transformed variables must make biological sense
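To see how these transformations act on a skewed variable, the sketch below compares skewness before and after transforming. The abundance values are invented for illustration. The moment-based skewness formula is an assumption, since the slides do not name a particular measure.

```python
import math
import statistics

def skewness(values):
    """Moment-based sample skewness (0 for a symmetrical distribution)."""
    m = statistics.fmean(values)
    m2 = statistics.fmean([(v - m) ** 2 for v in values])
    m3 = statistics.fmean([(v - m) ** 3 for v in values])
    return m3 / m2 ** 1.5

def arcsin_sqrt(p):
    """Arcsine square-root transformation for a proportion between 0 and 1."""
    return math.asin(math.sqrt(p))

# Hypothetical right-skewed abundance data (no zeros, so the log is defined).
abundance = [1, 2, 2, 3, 4, 5, 8, 13, 40, 120]

raw_skew = skewness(abundance)
root4_skew = skewness([v ** 0.25 for v in abundance])
log_skew = skewness([math.log10(v) for v in abundance])

print(f"raw: {raw_skew:.2f}, 4th root: {root4_skew:.2f}, log: {log_skew:.2f}")
```

For these values the log transformation reduces skewness most, followed by the 4th root.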
Transformation issues
• Zeros in skewed distributions
– log (y + constant) or power transformation
• Power transformations
– 4th root useful for abundance data with large range
• Base [10 or natural (e)] for log transformations
– makes no difference to result
• Arcsin for % or proportions
– little effect unless close to zero or 100
• Presentation of results
– back transformation of means and errors
• Generalised linear models
– non-normal error distributions
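Two of these points can be checked numerically: the base of the logarithm only rescales the transformed values, and a back-transformed mean is not the arithmetic mean of the raw data. A sketch with hypothetical counts; the added constant of 1 is an assumption, not a recommendation from the slides.

```python
import math
import statistics

# Hypothetical counts, including a zero.
y = [0, 1, 3, 7, 20, 55, 150]
c = 1  # assumed constant, added so that log(0) is avoided

log10_y = [math.log10(v + c) for v in y]
ln_y = [math.log(v + c) for v in y]

# The two bases differ only by a constant factor, ln(10).
# The first value is skipped because log(0 + 1) = 0 in either base.
ratios = [a / b for a, b in zip(ln_y[1:], log10_y[1:])]

# Back-transforming the mean gives the same answer whichever base was used.
back_from_log10 = 10 ** statistics.fmean(log10_y) - c
back_from_ln = math.exp(statistics.fmean(ln_y)) - c

print(f"arithmetic mean of y  = {statistics.fmean(y):.2f}")
print(f"back-transformed mean = {back_from_log10:.2f}")
```

The back-transformed mean is smaller than the arithmetic mean because it is based on a geometric mean, so it should be labelled as a back-transformed value when results are presented.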
Mussel clumps
[Figure: two scatterplots of SPECIES (y-axis) against AREA (x-axis)]
Other regression diagnostics
• Check assumptions
• Check fit of model
• Warn about influential observations and outliers
Anscombe (1973) data set
[Figure: scatterplots of the four Anscombe data sets]
$R^2 = 0.667$, $y = 3.0 + 0.5x$, $t = 4.24$, $P = 0.002$
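The near-identical regression results can be reproduced from the data. The values below are the four data sets as published by Anscombe (1973); they are not listed on these slides, and `fit_line` is an illustrative helper.

```python
import statistics

# Anscombe's (1973) four data sets (values from the published paper).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def fit_line(x, y):
    """Least-squares slope, intercept and r^2 for y = intercept + slope * x."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar, sxy ** 2 / (sxx * syy)

results = {name: fit_line(x, y) for name, (x, y) in quartet.items()}
for name, (slope, intercept, r2) in results.items():
    print(f"{name}: y = {intercept:.2f} + {slope:.3f}x, r^2 = {r2:.3f}")
```

The summary statistics agree to about two decimal places, yet only a plot shows that the four data sets have quite different patterns. That is the case for always plotting the data and the residuals.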
Outliers
• Unusual sample values very different from rest of sample
– detect using boxplots
• Sample values a long way from fitted model
– detect by analysing residuals from fitted model
• Solutions
– if impossible values, delete and adjust df
– run analysis twice, outliers in and outliers omitted
• if result changes – problems!
Influence
• Cook’s D statistic:
– calculated for each observation
– measures change in regression slope if observation omitted
– observations with large D have large influence on estimated slope
• also large residual
• Observation 1 is X and Y outlier but not influential
• Observation 2 has large residual – outlier
• Observation 3 is very influential (large Cook’s D) - also outlier
[Figure: scatterplot of Y against X with observations 1, 2 and 3 labelled]
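Cook's D can be computed for a simple linear regression from the residuals and leverages. The sketch below uses the standard textbook formula, D_i = e_i² / (p × MSE) × h_i / (1 − h_i)², with p = 2 fitted parameters. The data are hypothetical, with the last observation placed far from the others in X and well off the trend.

```python
import statistics

def cooks_distance(x, y):
    """Cook's D for each observation in a simple linear regression of y on x."""
    n, p = len(x), 2                       # p = number of fitted parameters
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in resid) / (n - p)
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [e ** 2 / (p * mse) * h / (1 - h) ** 2 for e, h in zip(resid, leverage)]

# Hypothetical data: the last point is extreme in X and off the trend.
x = [1, 2, 3, 4, 5, 6, 7, 8, 20]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.1, 7.9, 5.0]

d = cooks_distance(x, y)
most_influential = max(range(len(d)), key=d.__getitem__)
print(f"largest Cook's D = {d[most_influential]:.1f} at index {most_influential}")
```

The last observation has by far the largest D because it combines high leverage with a sizeable residual, which matches the pattern described for observation 3.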
Assumptions not met - regression
• Transformations useful
• Non-parametric tests
– robust regression
• LAD, ranks
– randomisation tests
• randomise observations or residuals
• Smoothing functions
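A randomisation test for a regression slope can be sketched by shuffling the observations. This version randomises Y relative to X. The number of randomisations (999), the seed and the data are all arbitrary choices made for illustration.

```python
import random
import statistics

def slope(x, y):
    """Least-squares slope of y on x."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx

def randomisation_test(x, y, n_perm=999, seed=1):
    """Two-sided P-value for the slope, obtained by shuffling y relative to x."""
    rng = random.Random(seed)
    observed = abs(slope(x, y))
    shuffled = list(y)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(slope(x, shuffled)) >= observed:
            extreme += 1
    # The observed arrangement counts as one of the possible randomisations.
    return (extreme + 1) / (n_perm + 1)

# Hypothetical data with a strong linear trend.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.3, 2.9, 4.1, 4.8, 6.2, 6.8, 8.1, 9.3, 9.9, 11.2]
p_value = randomisation_test(x, y)
print(f"randomisation P = {p_value:.3f}")
```

The P-value is the proportion of randomised slopes at least as extreme as the observed one, so no distributional assumption about the error terms is needed.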
Smoothers
• Nonparametric description of relationship between Y and X
– unconstrained by specific model structure
• Useful exploratory technique:
– is linear model appropriate?
– are particular observations influential?
• Used in generalized additive modeling (GAM)
Smoothers
• Each observation replaced by value reflecting neighbouring observations
– mean or median or predicted value of regression model through neighbouring observations
• Window size determines neighbouring observations
– size of window (number of observations) determined by smoothing parameter
• Adjacent windows overlap
– resulting line is smooth
– smoothness controlled by smoothing parameter (size of windows)
• Any section of line robust to values in other windows
Types of smoothers
• Running (moving) means or averages:
– means or medians within each window
• Lo(w)ess:
– locally weighted regression scatterplot smoothing
– observations within window weighted differently
– observations replaced by predicted values from local regression line
[Figure: scatterplot of SPECIES (y-axis) against AREA (x-axis)]
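The simplest smoothers in this list, running means and running medians, can be sketched in a few lines. The helper below is hypothetical. It truncates windows at the ends of the series, which is one of several possible edge treatments, and it assumes the Y values are already ordered by X. Lo(w)ess is not implemented here.

```python
import statistics

def running_smoother(y, window=3, stat=statistics.fmean):
    """Replace each value by a summary of the values in a window centred on it.

    Windows are truncated at the ends of the series; `window` should be odd.
    """
    half = window // 2
    smoothed = []
    for i in range(len(y)):
        lo, hi = max(0, i - half), min(len(y), i + half + 1)
        smoothed.append(stat(y[lo:hi]))
    return smoothed

# Hypothetical y values, ordered by x, with one aberrant value (30).
y = [2, 4, 3, 30, 5, 7, 6]
running_mean = running_smoother(y, window=3)
running_median = running_smoother(y, window=3, stat=statistics.median)

print("running mean:  ", [round(v, 2) for v in running_mean])
print("running median:", running_median)
```

A larger `window` gives a smoother line. The aberrant value only affects the windows that contain it, and the running median is less affected by it than the running mean.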
Assumptions not met - ANOVA
• Robust if equal n
• Transformations useful
• Non-parametric tests
– rank transform tests
• Kruskal-Wallis for single factor designs
• ranks inappropriate for testing interaction terms
– randomisation tests
• randomises observations or residuals
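The Kruskal-Wallis test replaces the observations with their ranks across all groups and then compares the rank sums. A minimal sketch of the H statistic follows. It omits the correction for ties, the helper names are hypothetical, and the data are invented so that the groups do not overlap.

```python
def average_ranks(values):
    """Rank values from 1 to N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1          # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (without the correction for ties)."""
    pooled = [v for g in groups for v in g]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)

# Hypothetical single factor design: three groups that do not overlap.
groups = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]]
h = kruskal_wallis_h(groups)
print(f"H = {h:.2f}")
```

H is compared with a chi-square distribution with (number of groups − 1) degrees of freedom. With three groups the 5% critical value is 5.991, so these groups clearly differ.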
Generalized linear models
• Select distribution for response variable
– poisson, binomial, lognormal
• Logistic models
– binary data
• Log-linear models
– count data in contingency tables
Outliers
• Observations further from fitted model than remaining observations
– might be different from sample outliers in boxplots