Assumptions. Assumptions of linear models. Boxplot. Data exploration. Apply to response variable. Apply to error terms from linear model

Assumptions of linear models

Assumptions

• Apply to response variable

– within each group if predictor categorical

• Apply to error terms from linear model

– check by analysing residuals

• Normality

• Homogeneity of variance

• Independence

Data exploration

• Describe distribution of data

– transform if required and appropriate, e.g. log, square root or fourth root

• Check assumptions of analysis

• Evaluate fit of model

• Find patterns in multivariate data

Boxplot

[Figure: annotated boxplot of Length (axis 50 to 350) marking the smallest value, largest value and median, with sections each holding 25% of values; histogram (Count) and boxplot of limpet numbers per quadrat with outliers marked; four example boxplots labelled 1. Symmetrical, equal variances; 2. Skewed; 3. Outliers; 4. Unequal variances.]
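The quantities a boxplot displays can be computed directly. The sketch below is one possible version: the data are hypothetical counts per quadrat, and the 1.5 × IQR fence for flagging outliers is an assumed (Tukey-style) convention that the slides do not state.

```python
import statistics

def boxplot_summary(values, whisker=1.5):
    """Five-number summary plus values flagged as outliers.

    The whisker multiplier (1.5 x IQR) is an assumed convention,
    not something taken from the slides.
    """
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": data[-1],
        "outliers": outliers,
    }

# Hypothetical right-skewed counts with one extreme value.
counts = [0, 1, 1, 2, 2, 3, 3, 4, 5, 6, 8, 40]
summary = boxplot_summary(counts)
print(summary)
```

A median sitting off-centre in the box, or a long upper whisker, points to skewness; values beyond the fences are the candidates to inspect as outliers.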

Scatterplots

• Plotting bivariate data

• Value of two variables recorded for each observation

• Each variable plotted on one axis (X or Y)

• Symbols represent each observation

• Assess relationship between two variables

[Figure: scatterplot of SPECIES against AREA.]

Model residuals

• Residual is difference between observed and predicted value of response variable

– regression model: $y_i - \hat{y}_i$

– ANOVA model: $y_{ij} - \bar{y}_i$

• Standardised (studentised) residuals

– residual / SE of residuals

– follow a t-distribution
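The two residual definitions above can be sketched as follows. The data are hypothetical, and the standardisation shown is the simplest form (residual divided by the residual standard deviation); studentised residuals that also adjust for leverage are not attempted here.

```python
import math
import statistics

def regression_residuals(x, y):
    """Residuals y_i - yhat_i from a least-squares straight line."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

def anova_residuals(groups):
    """Residuals y_ij - ybar_i: each value minus its own group mean."""
    return {
        name: [v - statistics.fmean(vals) for v in vals]
        for name, vals in groups.items()
    }

def standardise(residuals, n_params):
    """Divide residuals by the residual SD (simple version, no leverage)."""
    df = len(residuals) - n_params
    s = math.sqrt(sum(e ** 2 for e in residuals) / df)
    return [e / s for e in residuals]

# Hypothetical data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
reg_res = regression_residuals(x, y)
std_res = standardise(reg_res, n_params=2)  # slope and intercept

groups = {"A": [4.0, 5.0, 6.0], "B": [8.0, 9.0, 13.0]}
aov_res = anova_residuals(groups)
print(reg_res, std_res, aov_res)
```

In both models the residuals sum to zero (within each group for ANOVA), so it is their spread and pattern, not their mean, that carries diagnostic information.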

Normality

Y normally distributed at each value of X:

– boxplots of Y, separate for each group if appropriate, should be symmetrical - watch out for outliers and skewness

– transformations of Y often help

– regression and ANOVA tests robust to this assumption

Homogeneity of variance

Variance (spread) of Y should be constant for each value of $x_i$ (homogeneity of variance):

– skewed populations or outliers produce unequal variances

– transformations that improve normality of Y will also usually make variance of Y more constant

Plots of residuals in regression

[Figure: two panels, each pairing a scatterplot of y against x with a plot of residuals (-ve, 0, +ve) against predicted $y_i$.]

ANOVA checks

• Plot residuals (or variances) against group means

• Tests for equal variances – Bartlett’s, Cochran’s, Levene’s tests

• ANOVA reliable if group n’s are equal and variances not too different:

– ratio of largest to smallest variance ≤ 3:1

[Figure: variances or residuals plotted against group means.]
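The 3:1 rule of thumb for group variances can be checked with a few lines. This is a minimal sketch on hypothetical groups of equal size; it assumes every group has a non-zero variance.

```python
import statistics

def variance_ratio(groups):
    """Return (largest / smallest sample variance, per-group variances)."""
    variances = {name: statistics.variance(vals) for name, vals in groups.items()}
    ratio = max(variances.values()) / min(variances.values())
    return ratio, variances

# Hypothetical groups with equal n.
groups = {
    "A": [10, 12, 11, 13, 14],
    "B": [20, 24, 22, 26, 28],
    "C": [15, 16, 18, 17, 19],
}
ratio, variances = variance_ratio(groups)
means = {name: statistics.fmean(vals) for name, vals in groups.items()}
print(means, variances, ratio)
```

Here group B has both the largest mean and the largest variance, and the ratio exceeds 3:1, which is the kind of mean-variance pattern a transformation of Y may remove.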

Independence

Values of Y are independent of each other:

– no replicate used more than once

– observations independent within and between groups

– watch out for data which are a time series on same experimental or sampling units

– should be considered at design stage

Repeated measures analyses

– suitable for some non-independent designs

Linearity (regression)

True population relationship between Y and X is linear:

– scatterplot of Y against X

– watch out for asymptotic or exponential patterns

– transformations of Y or Y and X often help

Transformations

• Transform variables to new scale – e.g. degrees Fahrenheit to degrees Celsius

• Statistical transformations

– non-linear (changes shape of distribution)

– monotonic (retains rank order of values)

• If Y (therefore error terms) skewed:

– log or power transformation of Y

– improves homogeneity of variance

– can reduce influence of outliers

• If nonlinear relationship:

– linearise by transformation of Y and/or X

Data transformations

• Common transformations for biological data

– log, square root or fourth root for skewed continuous distributions

– arcsin√ for proportions and %

• Transformed variables must make biological sense

Transformation issues

• Zeros in skewed distributions

– log (y + constant) or power transformation

• Power transformations

– 4th root useful for abundance data with large range

• Base [10 or natural (e)] for log transformations

– makes no difference to result

• Arcsin for % or proportions

– little effect unless close to zero or 100

• Presentation of results

– back transformation of means and errors

• Generalised linear models – non-normal error distributions
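The transformation points above can be illustrated with a short sketch. The abundance values are hypothetical, and the added constant of 1 and the base of 10 are assumptions chosen for illustration, not values prescribed by the slides.

```python
import math
import statistics

def log_transform(y, constant=1.0, base=10):
    """log(y + constant); the constant handles zeros."""
    return [math.log(v + constant, base) for v in y]

def fourth_root(y):
    """Power transformation often used for abundances with a large range."""
    return [v ** 0.25 for v in y]

def arcsine_sqrt(proportions):
    """arcsin(sqrt(p)) for proportions between 0 and 1 (result in radians)."""
    return [math.asin(math.sqrt(p)) for p in proportions]

def back_transformed_mean(y, constant=1.0):
    """Mean on the log10 scale, returned to the original scale."""
    logs = [math.log10(v + constant) for v in y]
    return 10 ** statistics.fmean(logs) - constant

# Hypothetical skewed abundance data containing a zero.
counts = [0, 3, 8, 15, 120]
log10_vals = log_transform(counts, base=10)
ln_vals = log_transform(counts, base=math.e)

# The two bases differ only by a constant factor, ln(10),
# so the shape of the transformed distribution is the same.
same_shape = all(
    abs(a - b * math.log(10)) < 1e-9 for a, b in zip(ln_vals, log10_vals)
)

raw_mean = statistics.fmean(counts)
bt_mean = back_transformed_mean(counts)
print(same_shape, raw_mean, bt_mean, fourth_root(counts), arcsine_sqrt([0.5]))
```

The back-transformed mean is smaller than the arithmetic mean of the raw data because it is far less affected by the single large value, which is worth stating when presenting results.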

Mussel clumps

[Figure: two scatterplots of SPECIES against AREA.]

Other regression diagnostics

• Check assumptions

• Check fit of model

• Warn about influential observations and outliers

Anscombe (1973) data set

[Figure: the four Anscombe (1973) scatterplots, which share the same fitted regression: R² = 0.667, y = 3.0 + 0.5x, t = 4.24, P = 0.002.]
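The point of the Anscombe data can be checked numerically. The sketch below fits a least-squares line to each of the four published data sets using a hand-written helper (`fit_line` is a name introduced here for illustration) and shows that the summary statistics agree even though the scatterplots look very different.

```python
import statistics

def fit_line(x, y):
    """Least-squares slope, intercept and R-squared for one data set."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

# Anscombe (1973) quartet.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    1: (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

fits = {k: fit_line(x, y) for k, (x, y) in quartet.items()}
for k, (slope, intercept, r2) in fits.items():
    print(f"set {k}: y = {intercept:.2f} + {slope:.3f}x, R2 = {r2:.3f}")
```

Identical fitted lines from a curved relationship, a single outlier, and a single high-leverage point are the reason diagnostics and plots should accompany any regression summary.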

Outliers

• Unusual sample values very different from rest of sample

– detect using boxplots

• Sample values a long way from fitted model

– detect by analysing residuals from fitted model

• Solutions

– if impossible values, delete and adjust df

– run analysis twice, outliers in and outliers omitted

• if result changes – problems!

Influence

• Cook’s D statistic:

– calculated for each observation

– measures change in regression slope if observation omitted

– observations with large D have large influence on estimated slope

• also large residual

• Observation 1 is X and Y outlier but not influential

• Observation 2 has large residual – outlier

• Observation 3 is very influential (large Cook’s D) - also outlier

[Figure: scatterplot of Y against X with observations 1, 2 and 3 labelled.]
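Cook's D for a simple linear regression can be sketched from the residuals and leverages. The version below uses the standard formula $D_i = \frac{e_i^2}{p\,s^2}\cdot\frac{h_i}{(1-h_i)^2}$ with $p = 2$ parameters; the data are hypothetical, with the last observation placed at an extreme X value and well off the trend of the others.

```python
import statistics

def cooks_distance(x, y):
    """Cook's D for each observation in a simple linear regression."""
    n = len(x)
    p = 2  # intercept and slope
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    # Leverage: how far each x is from the mean of x.
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [
        (e ** 2 / (p * mse)) * h / (1 - h) ** 2
        for e, h in zip(residuals, leverage)
    ]

# Hypothetical data: nine points near y = x, plus one extreme-X point off the trend.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [1.2, 1.9, 3.1, 4.2, 4.8, 6.1, 7.0, 8.2, 8.9, 5.0]
d = cooks_distance(x, y)
most_influential = max(range(len(d)), key=d.__getitem__)
print(most_influential, [round(v, 3) for v in d])
```

The last observation combines high leverage with a large residual, so its D dwarfs the others; an observation with an extreme X that sits on the trend would have high leverage but a small D.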

Assumptions not met - regression

• Transformations useful

• Non-parametric tests

– robust regression

• LAD, ranks

– randomisation tests

• randomise observations or residuals

• Smoothing functions
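A randomisation test for a regression slope can be sketched as below. This version randomises observations (shuffling Y across X) rather than residuals; the number of permutations, the seed and the data are all illustrative assumptions.

```python
import random
import statistics

def slope(x, y):
    """Least-squares slope of y on x."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx

def randomisation_test_slope(x, y, n_perm=2000, seed=1):
    """Two-sided P value for the slope by shuffling y against x."""
    rng = random.Random(seed)
    observed = slope(x, y)
    y_shuffled = list(y)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(y_shuffled)
        if abs(slope(x, y_shuffled)) >= abs(observed):
            extreme += 1
    # Count the observed arrangement as one of the possible outcomes.
    p_value = (extreme + 1) / (n_perm + 1)
    return observed, p_value

# Hypothetical data with a strong positive relationship.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.3, 2.9, 4.1, 4.8, 6.2, 6.8, 8.1, 9.3, 9.7, 11.2]
observed, p_value = randomisation_test_slope(x, y)
print(observed, p_value)
```

The P value is the proportion of shuffled data sets whose slope is at least as extreme as the observed one, so no distributional assumption about the error terms is needed.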

Smoothers

• Nonparametric description of relationship between Y and X

– unconstrained by specific model structure

• Useful exploratory technique:

– is linear model appropriate?

– are particular observations influential?

• Used in generalized additive modeling (GAM)

Smoothers

• Each observation replaced by value reflecting neighbouring observations

– mean or median or predicted value of regression model through neighbouring observations

• Window size determines neighbouring observations

– size of window (number of observations) determined by smoothing parameter

• Adjacent windows overlap – resulting line is smooth

– smoothness controlled by smoothing parameter (size of windows)

• Any section of line robust to values in other windows

Types of smoothers

• Running (moving) means or averages:

– means or medians within each window

• Lo(w)ess:

– locally weighted regression scatterplot smoothing

– observations within window weighted differently

– observations replaced by predicted values from local regression line

[Figure: scatterplot of SPECIES against AREA.]
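The idea behind Lo(w)ess can be sketched as a locally weighted linear regression. This is a simplified illustration, not a full implementation: it uses tricube weights (an assumed, commonly used choice), takes the window as a fraction `frac` of the observations, and omits the robustness iterations that full lowess routines usually add.

```python
def lowess_sketch(x, y, frac=0.5):
    """Fitted value at each x from a tricube-weighted local straight line."""
    n = len(x)
    k = max(3, int(round(frac * n)))  # number of observations in each window
    fitted = []
    for x0 in x:
        # Bandwidth: distance from x0 to its k-th nearest observation.
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[min(k, n) - 1] or 1e-12
        # Tricube weights: near 1 close to x0, falling to 0 at the window edge.
        w = [
            (1 - (abs(xi - x0) / h) ** 3) ** 3 if abs(xi - x0) < h else 0.0
            for xi in x
        ]
        sw = sum(w)
        xw = sum(wi * xi for wi, xi in zip(w, x)) / sw
        yw = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - xw) ** 2 for wi, xi in zip(w, x))
        if sxx == 0:
            fitted.append(yw)  # no spread in x inside the window
        else:
            sxy = sum(wi * (xi - xw) * (yi - yw) for wi, xi, yi in zip(w, x, y))
            fitted.append(yw + (sxy / sxx) * (x0 - xw))
    return fitted

# Check on exactly linear hypothetical data: the smoother should reproduce the line.
x = list(range(10))
y = [2 * xi + 1 for xi in x]
smooth = lowess_sketch(x, y, frac=0.5)
print(smooth)
```

A larger `frac` widens each window and gives a smoother line; a smaller one follows local wiggles more closely.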

Assumptions not met - ANOVA

• Robust if equal n

• Transformations useful

• Non-parametric tests

– rank transform tests

• Kruskal-Wallis for single factor designs

• ranks inappropriate for testing interaction terms

– randomisation tests

• randomises observations or residuals
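The Kruskal-Wallis statistic for a single factor design can be computed from pooled ranks. The sketch below uses $H = \frac{12}{N(N+1)}\sum_j \frac{R_j^2}{n_j} - 3(N+1)$ with average ranks for ties but without the tie correction factor; the groups are hypothetical.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H from pooled ranks (no tie correction applied)."""
    pooled = sorted(v for g in groups for v in g)
    # Assign each distinct value the average of the ranks it occupies.
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    n_total = len(pooled)
    rank_term = sum(
        sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups
    )
    return 12 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)

# Hypothetical groups with no overlap between them.
groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
h = kruskal_wallis_h(groups)
print(h)
```

H is usually compared with a chi-square distribution with (number of groups − 1) degrees of freedom, although with groups this small an exact or randomisation P value would be safer.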

Generalized linear models

• Select distribution for response variable

– Poisson, binomial, lognormal

• Logistic models

– binary data

• Log-linear models

– count data in contingency tables

Outliers

• Observations further from fitted model than remaining observations

– might be different from sample outliers in boxplots
