An Introduction to
Statistical Methods
in GenStat
Alex Glaser
VSN International, 5 The Waterhouse,
Waterhouse Street, Hemel Hempstead, UK
email:
[email protected]
[email protected]
Many thanks to Roger Payne for the original slides
Programme
•
Day 1
•
Introduction to GenStat
•
From t-test to one-way anova
•
Basic principles of design and blocking
•
Treatment structure
−
factorials & interactions
and checking the assumptions
•
Day 2
•
Simple linear regression
•
Multiple linear regression
•
GLM –
counts and binomial data
Aim of course
To give you an overall
introduction to the
GenStat 13th Edition
system.
Learning
objectives
•
By the end of the course, you will be able to
•Navigate the GenStat interface
•Obtain help from the system where necessary
•Input and manage data
•Analyse data through GenStat menus
•
What happens when you select
input log
in
the
window navigator
?
•
Can you see yourself using this feature in your
work? If so, how?
•
What happens to the status bar when you click the
button?
•
Resize the
input log
and
output
window so
that you can see both simultaneously
•
What happens when you click the button?
•
Use the
tools|customize toolbar
menu to
add or remove buttons from the toolbar to suit
your needs.
Exercise 1.2
•
What happens to the text in right hand corner
of the
status bar
if you press the
insert
key?
•
What do you think this part of the
status bar
means?
•
Open a new text window using the
button.
In this window, type the following GenStat
command
PRINT 'This is my first time using GenStat'
•
Execute the command using the
Run|Submit
Line
menu option. Now select the
Window|
Event Log
entry for this action. Is there an
GenStat Client
Menus Commands
Exercise 2
•
Find help for what’s new in the 13th edition of
GenStat
•
Find help on the GenStat spreadsheet
•
Open the
Tools|Options
menu and find help
about the
ECHO COMMANDS
setting on the
AUDIT TRAIL
tab.
•
Open a new text window and type in the word
FIT
. Place the cursor in the word and press the
F1
key. What is
FIT
? Type in a statistical term
and press the
F1
key.
•
View the Introduction to GenStat guide (pdf
format)
•
View an example program for a two-sample
t-test.
[Diagram: routes for data into and out of GenStat's central data core through the Spread, File and Data | Load menus. Items shown: blank spreadsheet / typed data, data in GenStat to edit, clipboard, Excel, ODBC queries (new and saved), DDE links, ASCII files, spreadsheets, database files, other statistics packages, GIS, GenStat save, GenStat session.]
Exercise 3.1
•
Clear all the data from GenStat and use the
file|open
menu to read the data from the
file
sulphur.xls
from installsets\Data
•
Clear all the data from GenStat. Go to the
tools|spreadsheet options|file
menu and
uncheck the
use excel import wizard on
file open
option. Repeat part 1 using the
file|open
menu. Which approach best suits
your way of working?
•
The file
bacteria.xls
, that you met earlier,
contains data from a second experiment in
the worksheet called Bacteria Counts. The
data are not stored in standard format; the
data can be found in the range of cells
D3:E13
. Clear the
data core
. Read the
data into GenStat using the
Excel import
Exercise 3.2 & 3.3
•
Using the data in the
iris.gsh
file:
•
Produce a scatter plot of
Sepal Width
versus
Petal
Width
. There is one point in this plot that stands
alone. What are the coordinates of this point? Can
you suggest a method of easily identifying to which
species of iris this unusual point belongs?
•
Produce a scatter plot of
Sepal Length
versus
Petal
Length
. Give each factor a different symbol and
colours. Experiment with labelling.
•
Produce a histogram of
Petal lengths
versus
Petal
widths.
•
Using your own data, experiment with the
different aspects of the
graphics
window.
That is, explore the different menus and
toolbars. If you have not brought your own data
sets, experiment with any of the course data
files.
Exercise 4.1
•
Using the Excel Import Wizard,
load in the file
Traffic.xls
•
On the second screen enter B3:D43 in
the Specified Range box.
•
Click OK on the Select Columns to
Convert to Factors menu
•
Convert Day and Month to factors
using the methods of your choice.
Exercise 4.2
•
Continue using the file
Traffic.xls
•
Select a cell in the
Day
column.
Delete the value, type ‘F’
and then
press return. Repeat the process but
with the value ‘G’. What property of
the GenStat spreadsheet do you think
this illustrates?
•
Select the
Tools|Spreadsheet
Options |Conversions
menu. Check
the
Allow new factor levels in Edit
box. Now repeat the above question.
What happens now?
Exercise 4.3
•
Continue using the file
Traffic.xls
•
Create a new variate which contains
the log of the Counts.
•
Sort the columns in descending order
of the Counts.
•
Use the
Spread| Manipulate|
Unstack
to create separate variables
for each day of the week.
•
Experiment with the
Calculate
menu
with your own data.
1 From t‐test to one‐way anova
• In this session you will learn
• how to use the t-test to compare two treatments
• the T-Test menu
• how to use one-way ANOVA to compare several treatments
• the model fitted in one-way anova
• the statistical philosophy behind one-way anova
• the relationship between one-way anova and the t-test for two treatments
• how to use the One- and two-way ANOVA menu for one-way anova
• how to plot the means from one-way anova
• how to do multiple comparisons
t‐test
• suppose we have 2 sets of units that have received 2 different treatments:
• animals that have been fed two different diets
• plots that have been given different fertilisers
• subjects with different drugs
• plants with different fungicides
• assume the units do not have any special structure e.g.
• the animals are all of the same breed
• the plots are in a fairly uniform field
• the subjects are of similar ages, weights and heights
• with 2 treatments we may then do a t-test
• assume each group from a Normal distribution
• usually assume distributions have the same variance (can check)
• data sets for the examples and practicals can be accessed using the Example Data Sets menu
• filter by the course Guide to Anova and Design
• select the file
• click on Open data
t‐test
•
experiment to study yields from 2
manufacturing methods
•
data in
Manufacture.gsh
•
do yields differ more than we
would expect from the random
variation?
•
can we estimate mean yields from
each method?
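The two questions above are what the two-sample t-test answers. As a rough sketch of the arithmetic (not GenStat's own code), the Python below computes the pooled-variance t statistic. The function name `pooled_t` and the yields are hypothetical illustrations, not the values stored in Manufacture.gsh.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(x, y):
    """Two-sample t statistic, assuming both groups share the same variance."""
    n1, n2 = len(x), len(y)
    # Pooled variance: the two sample variances weighted by their d.f.
    s2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    # Standard error of the difference between the two means (s.e.d.)
    sed = sqrt(s2 * (1 / n1 + 1 / n2))
    t = (mean(x) - mean(y)) / sed
    return t, n1 + n2 - 2, sed


# Hypothetical yields for two manufacturing methods (made-up numbers)
method_a = [10.0, 12.0, 11.0, 13.0]
method_b = [14.0, 15.0, 13.0, 16.0]

t, df, sed = pooled_t(method_a, method_b)
print(f"means {mean(method_a):.2f} and {mean(method_b):.2f}")
print(f"t = {t:.3f} on {df} d.f. (s.e.d. {sed:.3f})")
```

The t value would then be compared with a t distribution on the stated degrees of freedom, which is the step the T-Test menu carries out.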
t‐test menu
Practical 1.2
•
spreadsheet
Pots.gsh
stores
data from a fertilizer
experiment
•
7 plants grown in pots with no fertilizer
•
8 plants grown in similar conditions with fertilizer
•
do a two-sample t-test to
assess whether fertilizer has
an effect
One‐way analysis of variance
• linear model $y_{ij} = \mu + a_i + \varepsilon_{ij}$
• represent each mean by
• grand mean $\mu$
• + effect $a_i$
• observations described by
• fitted value $\mu + a_i$
Residual variation
•
may arise from many different causes:
•
the units may not be absolutely identical (discuss
later how to allocate units to treatments to take
account of this)
•
they may experience slightly different conditions
during the experiment
•
there may be measurement errors
•
they may be being dealt with by different people
during the experiment
•
and you can no doubt think of others!
•
so estimation is not exact
•
analysis must estimate the amount of variation
One‐way anova
• linear model $y_{ij} = \mu + a_i + \varepsilon_{ij}$
• if treatments have no effect
• $a_1 = a_2 = 0$
• $y_{ij} = \mu + \varepsilon_{ij}$
• estimate grand mean by average of all data values
• assess lack of fit of model by sum of squared residuals ($RSS_0$)
• degrees of freedom (d.f.) is $n_1 + n_2 - 1$ (fitted 1 parameter $\mu$)
• fit full model
• estimate $a_i$ by average for group $i$ minus grand mean
• assess lack of fit of model by sum of squared residuals ($RSS_1$)
• this has $n_1 + n_2 - 2$ d.f. (2 parameters as $(n_1 a_1 + n_2 a_2)/(n_1 + n_2) = 0$)
• assess treatments
• sum of squares due to treatments is $TSS = RSS_0 - RSS_1$ on 1 d.f.
• assess underlying variation by residual from full model $RSS_1$
• variance ratio is treatment mean square / residual mean square
Output
• aov table
• tables of means
• s.e.'s for differences between means
(m1 – m2)/sed = t
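The model-comparison steps for one-way anova can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than GenStat's algorithm: the function name `one_way_anova` and the data are hypothetical. With two groups, the variance ratio should equal the square of t = (m1 – m2)/sed.

```python
from math import sqrt
from statistics import mean


def one_way_anova(groups):
    """Variance ratio from residual sums of squares of the null and full models."""
    values = [v for g in groups for v in g]
    n, k = len(values), len(groups)
    grand = mean(values)
    # Null model y = mu + e: residuals are deviations from the grand mean
    rss0 = sum((v - grand) ** 2 for v in values)
    # Full model y = mu + a_i + e: residuals are deviations from group means
    rss1 = sum((v - mean(g)) ** 2 for g in groups for v in g)
    treat_df, resid_df = k - 1, n - k
    resid_ms = rss1 / resid_df
    f = ((rss0 - rss1) / treat_df) / resid_ms
    return f, treat_df, resid_df, resid_ms


# Hypothetical data for two treatments (made-up numbers)
groups = [[10.0, 12.0, 11.0, 13.0], [14.0, 15.0, 13.0, 16.0]]
f, df1, df2, resid_ms = one_way_anova(groups)

# Two-group check: t = (m1 - m2)/sed, with sed built from the residual m.s.
sed = sqrt(resid_ms * (1 / len(groups[0]) + 1 / len(groups[1])))
t = (mean(groups[0]) - mean(groups[1])) / sed

print(f"variance ratio = {f:.3f} on {df1} and {df2} d.f.")
print(f"t squared      = {t * t:.3f}")
```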
ANOVA Options menu
ANOVA Further Output menu
•
Further Output
menu provides more output
ANOVA Means Plots menu
• Means Plots menu plots means
• as points
• or joined by lines
• or with original data points too
Practical 1.4
•
spreadsheet
Pots.gsh
stores
data from a fertilizer
experiment used in Practical
1.2
•
7 plants grown in pots with no fertilizer
•
8 plants grown in similar conditions with fertilizer
•
do a one-way analysis of
variance to assess if fertilizer
has an effect
•
compare results with t-test
from Practical 1.2
One‐way anova
with >2 treatments
•
spreadsheet Rat.gsh
has data
from an experiment to study
effect of dietary supplements
on gain in weight of rats
•
5 diet
treatments (a-e)
•
20 rats allocated at random, 4
per treatment
•
can use One-and two-way
ANOVA
menu, and plot means,
Output
• aov table
• means
Plot of means
•
suppose a-e
represent
amounts 0-4
of supplement
•
might want to
assess linear
(& quadratic?)
effects of
supplement
Multiple comparison tests
• in favour
• there may be many possible comparisons between pairs of treatment means (with t treatments there are t×(t–1)/2)
• so some researchers feel their significance levels should be adjusted to take account of all the tests that they might make
• against
• multiple comparisons are unnecessary if you have only a small number of comparisons to make – either because there are few treatments, or because you should have identified beforehand the comparisons that you feel are likely to be of interest
• they are also inappropriate if the treatments have any sort of structure e.g. levels may represent different amounts of a substance like a fertiliser or a drug, then illogical to assume that only some of the amounts might have an effect
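To see how quickly the number of pairwise comparisons grows, here is a small Python sketch of the count t×(t–1)/2 together with a Bonferroni-style adjustment (dividing the significance level by the number of tests). The function names are hypothetical, and the adjustment shown is the textbook Bonferroni rule, which may differ in detail from what GenStat reports.

```python
def n_pairwise(t):
    """Number of distinct pairs among t treatment means: t(t-1)/2."""
    return t * (t - 1) // 2


def bonferroni_level(alpha, n_tests):
    """Per-comparison significance level so the overall level is at most alpha."""
    return alpha / n_tests


for t in (2, 3, 5, 10):
    m = n_pairwise(t)
    print(f"{t:2d} treatments: {m:2d} pairs, "
          f"per-test level {bonferroni_level(0.05, m):.4f}")
```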
Multiple comparisons
•
check that they
are enabled on
the Menus tab
of the Options
menu
Multiple comparisons
•
the
Multiple Comparisons
button will then be available to
click on the ANOVA Further Output menu
• check Multiple Comparisons
• select Treatment and type of Test
Practical 1.9
•
spreadsheet
Octane.gsh
stores
data from an experiment to study
the effect of different additives A-E
on the octane level of gasoline,
used in Practical 1.7
•
do a one-way analysis of variance
to assess if
Gasoline
has an effect
•
do a Bonferroni
multiple-comparison test to compare the
types of gasoline
2 Blocking structures
• In this session you will learn
• how to improve the precision of an experiment by grouping the units into similar sets called "blocks"
• how randomization can avoid bias by guarding against unforeseen differences amongst the units
• how to design and analyse a complete randomized block design
• how to recognise situations that may require more than one type of blocking
• how to design and analyse a Latin square design
Completely‐randomized design
• design used for all examples so far
• no formal structure is imposed on the units
• assumes units effectively identical e.g.
• in a field experiment, no systematic differences in underlying fertility, drainage etc of the plots
• in a glasshouse, assumes that light and temperature are the same for each row of pots
• in a factory, that workforce behaves in essentially the same way at different times of day, days of the week etc
• in educational studies, that children in different schools are approximately the same, or students studying different subjects at Universities, or in different year groups etc
Non‐uniform units
• for example field experiment on a slope
• best plots may be at top of slope
• random allocation of treatments to plots may not seem "fair"
• e.g. replicates of treatment A mainly on "good" plots & replicates of treatment B mainly on "bad" plots − if no actual difference between A & B, could lead to A appearing to be much better than B
• systematic differences between plots increase the residual sum of squares, & hence the estimate of random variability
• treatment differences must be larger to give a significant F-test
• standard errors of differences between treatments will be larger i.e. experiment will give less precise results
• if you know there are differences between units
• avoid bias & improve precision by grouping (blocking) units into similar sets (blocks)
Randomized block design
• single grouping factor usually known as blocks
• within each block
• same number of units for each treatment (one per treatment in a randomized-complete-block design)
• treatments are allocated randomly to the units
• in analysis block-effects are estimated and removed, leading to more-precise estimates
•
e.g.
One‐way anova
with blocks
•
another experiment to
study effect of dietary
supplements on gain in
weight of rats
•
8 litters of 5 rats
•
assume rats from same
litter more similar than
those from different litters
•
5 Diet
treatments (A-E),
allocated at random to
rats within each litter
No blocking
• residual m.s. 206.8
• variance ratio 0.42
With litters as blocks
• differences between litters
• residual m.s. 40.63 (c.f. 206.8)
• variance ratio 2.13 (c.f. 0.42)
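The drop in residual mean square when litters are fitted as blocks can be reproduced in miniature. The Python sketch below is an illustration only: the function name `residual_ms` and the small data table are hypothetical (they are not the rat data), and the design is assumed to have one unit per treatment in every block.

```python
from statistics import mean


def residual_ms(y, use_blocks):
    """Residual mean square for y[b][t]: response to treatment t in block b."""
    b, t = len(y), len(y[0])
    grand = mean(v for row in y for v in row)
    total_ss = sum((v - grand) ** 2 for row in y for v in row)
    treat_ss = b * sum((mean(row[j] for row in y) - grand) ** 2 for j in range(t))
    block_ss = t * sum((mean(row) - grand) ** 2 for row in y)
    if use_blocks:
        # Block differences are removed before estimating random variation
        return (total_ss - treat_ss - block_ss) / ((b - 1) * (t - 1))
    # Ignoring blocks: block differences stay in the residual
    return (total_ss - treat_ss) / (b * t - t)


# Hypothetical responses: 3 blocks (rows) x 3 treatments (columns),
# with large differences between blocks
y = [
    [10.0, 12.0, 14.0],
    [20.0, 23.0, 24.0],
    [30.0, 31.0, 35.0],
]

print(f"residual m.s. without blocks: {residual_ms(y, False):.2f}")
print(f"residual m.s. with blocks:    {residual_ms(y, True):.2f}")
```

Because the treatment mean square is the same in both analyses, a smaller residual mean square gives a larger variance ratio, as in the comparison on the slide.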
Practical 2.3
•
spreadsheet
Wheatstrains.gsh
contains
the results from a
randomized block design to
assess 4 strains of wheat
•
analyse
the experiment
•
give your assessment of
whether the blocking was
worthwhile
Blocking in 2 directions
• e.g. experiment on pot plants in a glasshouse
• door in east wall which may cause temperature differences
• sunlight mainly from the south
• other e.g.
• weekday × time-of-day
• school × year-group
• factory × weekday
Latin square design
•
a design for t treatments
•
arranged in t rows and t columns (i.e. t² units)
•
each treatment occurs exactly once in each row
and once in each column
•
randomized by randomly permuting rows &
columns
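The randomization described above can be sketched in Python: start from a standard (cyclic) square, then randomly permute its rows and columns. The function names are hypothetical, and this is not GenStat's design-generation code. Permuting a single starting square does not reach every possible Latin square, which is an accepted simplification in this sketch.

```python
import random


def random_latin_square(treatments, seed=None):
    """Randomize a cyclic Latin square by permuting its rows and columns."""
    rng = random.Random(seed)
    t = len(treatments)
    # Cyclic starting square: row i is the treatment list shifted by i places
    square = [[treatments[(i + j) % t] for j in range(t)] for i in range(t)]
    rng.shuffle(square)                      # random permutation of rows
    cols = list(range(t))
    rng.shuffle(cols)                        # random permutation of columns
    return [[row[c] for c in cols] for row in square]


def is_latin(square):
    """True if every symbol occurs exactly once in each row and each column."""
    t = len(square)
    symbols = set(square[0])
    rows_ok = all(len(row) == t and set(row) == symbols for row in square)
    cols_ok = all({row[c] for row in square} == symbols for c in range(t))
    return len(symbols) == t and rows_ok and cols_ok


design = random_latin_square(list("ABCDEF"), seed=1)
for row in design:
    print(" ".join(row))
print("valid Latin square:", is_latin(design))
```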
Latin square example
•
experiment to assess the
(in?)consistency
of 6
samplers in assessing the
heights of wheat plants
•
6 areas of wheat to assess
•
may also be ordering
effects (accuracy of
samplers may vary during
experiment)
•
so 6×6 Latin square used
with blocking factors Areas
and Orders
Analysis of Variance menu
Output
• between Areas
• between Orders
• Samplers more precisely estimated (residual m.s. 3.328 c.f. 5.801)
Practical 2.5
•
spreadsheet
Fabric.gsh
contains
the results from
a Latin square design to
assess wear resistance of
rubber-covered fabrics
•
column factor is 4
different runs
•
row factor is four
positions on testing
machine used to
generate wear under
simulated natural
conditions
3 Treatment structure
• In this session you will learn how to
• recognise the need for more than one treatment factor
• analyse designs with two treatment factors using the One- and two-way ANOVA menu
• define and interpret interactions between factors
• analyse designs with two treatment factors using the general Analysis of Variance menu
• use the Anova Contrasts menu
• estimate comparisons between levels of treatments
• interpret interactions between treatment contrasts
• use model formulae to define the treatment terms to be fitted
• include control treatments in a factorial experiment
• use covariates to improve precision by using additional background information about the experimental units (not used for blocking)
Types of treatment
• experiments may study different types of treatment e.g.
• several different drugs at a range of different doses
• several different types of fertiliser
• varieties of wheat and types of fungicide
• represent each type of treatment by a different treatment factor, with levels to represent the various possibilities e.g.
• Drug − levels Morphine, Amidone, Phenadoxone, Pethidine;
• Dose − levels 2.5, 5, 10, 15;
• Nitrogen − levels 0, 50, 100, 150;
• Phosphate − levels 50, 100;
• Fungicide − levels Carbendazim, Prochloraz;
Two treatment factors
• experiment on canola (oil-seed rape)
• 2 treatment factors
• N (nitrogen) 0, 180, 230
• S (sulphur) 0, 10, 20, 40
• randomized-block design
• with 3 blocks (factor block)
One- and two-way ANOVA menu
• Two-way analysis (Treatment factors N & S)
Output
• line for each term: N & S main effects, and N.S interaction
• table of means for each treatment term
• s.e.d. for each table of means
Linear model
• $y_{ijk} = \mu + \beta_i + n_j + s_k + ns_{jk} + \varepsilon_{ijk}$
• $\beta_i$ represent the block effects (block stratum in the aov)
• $\varepsilon_{ijk}$ are the residuals
• $n_j$ represent the main effect of nitrogen (N)
• $s_k$ represent the main effect of sulphur (S)
• $ns_{jk}$ represent the interaction between nitrogen & sulphur (N.S)
• analysis fits each term in turn, so you can decide how complicated a model is required
• analysis-of-variance table has a line for each term, so you can assess whether its parameters are needed in the model
Without interaction
•
lines are parallel
•
can decide on best level of
S without considering N
•
or best level of N without
considering S
•
need present only one-way
tables of means
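"Parallel lines" means that every interaction effect in the two-way table of means is zero. A minimal Python sketch of that check follows; the function name and both tables of means are hypothetical, not taken from the canola experiment. Each interaction effect is the cell mean minus its row mean, minus its column mean, plus the grand mean.

```python
from statistics import mean


def interaction_effects(table):
    """Interaction effects for table[j][k]: mean at level j of N, level k of S."""
    n_rows, n_cols = len(table), len(table[0])
    grand = mean(v for row in table for v in row)
    row_means = [mean(row) for row in table]
    col_means = [mean(row[k] for row in table) for k in range(n_cols)]
    return [
        [table[j][k] - row_means[j] - col_means[k] + grand for k in range(n_cols)]
        for j in range(n_rows)
    ]


# Hypothetical additive table: the rows differ by a constant, so lines are parallel
additive = [[1.0, 2.0, 4.0],
            [3.0, 4.0, 6.0]]
# Hypothetical table where the effect of S depends on the level of N
non_additive = [[1.0, 2.0, 4.0],
                [3.0, 6.0, 12.0]]

for name, table in (("additive", additive), ("non-additive", non_additive)):
    largest = max(abs(e) for row in interaction_effects(table) for e in row)
    print(f"{name}: largest interaction effect = {largest:.3f}")
```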
General Analysis of Variance menu
•
Design:
Two-way ANOVA (in Randomized Blocks)
•
click on
Contrasts
button to fit comparisons (or
other contrasts)
Comparison
contrasts
•
1 comparison between levels of N
•
clicking
OK
opens matrix spreadsheet
Cont
General Analysis of Variance menu
•
notice function Comp in Treatment 1
(1 comparison of N
defined by Cont)
Output
• extra line for N assesses the comparison
• also extra line for N.S to assess interaction of comparison with S
Practical 3.3
•
spreadsheet
Ratfactorial.gsh
contains
the results from an
experiment to study the effect
of 6 different diets on the gain
in weight of rats
•
treatment factors concern the
protein in the diet
• Amount (High or Low)
• Source (Beef, Cereal or Pork)
• analyse the data as a two-way factorial
•
fit 2 comparison contrasts
between levels of Source
•
Animal vs
Vegetable
Model formula
• define a model to be fitted in an analysis
• formed automatically by the menus – or can define your own
• list of model terms, linked by operator "+"
• e.g. A + B
• 2 terms representing main effects of factors A & B
• higher-order terms specified as series of factors separated by dots (e.g. interactions): meaning depends on contents of formula
• e.g. N + S + N.S: N.S is an interaction
• e.g. Block + Block.Plot: Block.Plot represents plot-within-block effects, i.e. differences between individual plots after removing the overall similarity between plots in same block
Operators for formulae
•
crossing operator
* specifies factorial
structures
e.g. N * S
is expanded automatically to become N + S + N.S
•
nesting operator
/ occurs most often in block
formulae
e.g. Block / Plot
Several operators
•
3-factor factorial model
A * B * C
becomes A + B + C + A.B + A.C + B.C + A.B.C
•
3 nested factors (e.g. block model of split-plot)
block / wplot / subplot
becomes block + block.wplot + block.wplot.subplot
•
factorial-plus-added-control
treatment structure Control / (Drug * Dose)
expands to Control + Control.Drug + Control.Dose + Control.Drug.Dose
•
NB: many commands and menus have a
FACTORIAL
option to control the number of factors/variates in the
terms to fit
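The expansions listed above follow simple rules, which the Python sketch below imitates for the two plain cases: a pure crossing of factors and a pure nesting of factors. It is an illustration only; the function names are hypothetical, and it does not parse GenStat formulae or handle mixed expressions such as Control / (Drug * Dose).

```python
from itertools import combinations


def expand_crossing(factors):
    """A * B * C: all main effects and interactions, lowest order first."""
    return [
        ".".join(combo)
        for order in range(1, len(factors) + 1)
        for combo in combinations(factors, order)
    ]


def expand_nesting(factors):
    """block / wplot / subplot: each term is nested within the ones before it."""
    return [".".join(factors[:i]) for i in range(1, len(factors) + 1)]


print(" + ".join(expand_crossing(["N", "S"])))
print(" + ".join(expand_crossing(["A", "B", "C"])))
print(" + ".join(expand_nesting(["block", "wplot", "subplot"])))
```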
Factorial plus added control
• 4 different fumigants to control nematodes
• CN, CS, CM and CK
• 2 levels of dose
• single and double
• also include a control treatment
• none (no fumigant at any dose)
• randomized-block design
• 4 blocks
• 12 plots per block
• (4 replicates of control treatment in each block)
• effects proportional
Analysis of Variance menu
• select Design to be General Treatment Structure (in Randomized Blocks)
Factorial plus added control
• treatment structure Fumigant / ( Level * Type )
• Fumigant represents the overall effect of any fumigant at any (non-zero) dose
• Fumigant.Level represents comparison between single and double doses (averaged over different types)
• Fumigant.Type represents overall differences between types (averaged over single and double doses)
• Fumigant.Level.Type represents the interaction between Level and Type (given that some sort of fumigant has been applied)
Output
• notice different sed's according to the replication of the means
Covariates
• provide additional background information
• often measurements made before expt (not used for blocking)
• e.g. (log) prior nematode counts
• incorporated in model as linear (regression) terms
• $y_{ijkl} = \mu + \beta_i + f_j + ft_{jk} + fl_{jl} + ftl_{jkl} + b \times (x_{ijkl} - \bar{x}) + \varepsilon_{ijkl}$
• improve precision
• remove potential biases caused by non-uniformity of units
• in aov table
• extra line(s) to assess effect of covariate(s) on y-variate, after removing effects of treatments
• treatment s.s. (and effects) adjusted to take account of the fact that the plots with the various treatments have different covariate values
• cov.ef. for treatment is efficiency remaining after adjustment
Output
• regression coefficient for adjustment in Blocks stratum
• regression coefficient for adjustment within Blocks
• combined estimate
Practical 3.7
• spreadsheet Ratmuscles.gsh contains data from an experiment to study the effect of electrical stimulation in preventing the wasting away of denervated muscles of rats
• 3 treatment factors
• length of each treatment
• number of treatment periods per day
• type of current
• randomized block design with 2 blocks
• denervated muscles were gastrocnemius muscles on one side of each rat
• the normal muscle on the other side of each rat was also measured, for use as a covariate in the analysis
4 Checking the assumptions
• In this session you will learn
• what assumptions are needed to ensure validity of an aov
• why the variance must be homogeneous (e.g. variability of residuals should be the same at high as low response values)
• how to assess whether the variance is homogeneous
• that residuals should come from identical and independent Normal distributions
• how to assess the Normality of the residuals
• why the model must be additive (i.e. differences between treatment effects must remain the same however large or small the underlying size of the response variable)
• how to identify outliers
• how transforming the response variate may correct for failures in the assumptions
• how to print back-transformed tables of means
• how to do a random permutation test
Homogeneity of variance
•
random variation must be similar over all units
•
beware: it may change with the size of response
•
assess by plotting residuals against fitted values
Non‐homogeneity of variance
• if variation increases with size of response
• s.e.d.'s between treatment means will be
• over-estimated for differences between low means
• under-estimated for differences between larger means
• this could lead you to the wrong conclusions!
• if plot of residuals against fitted values indicates non-homogeneity of variances
• consider transforming the response variate
• (or using a generalized linear model; see Guide to Linear, Nonlinear and Generalized Linear Models in GenStat)
Normality of residuals
• histogram – should be "bell-shaped"
• Normal plot
• residuals in ascending order plotted against Normal quantiles
• should give an approximately straight line
• half-Normal plot
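The points of a Normal plot can be computed by hand, as in the Python sketch below. The function name and residuals are hypothetical. The plotting positions (i − 0.375)/(n + 0.25) are one common convention and are an assumption here; GenStat may use a different formula.

```python
from statistics import NormalDist


def normal_plot_points(residuals):
    """Pair each ordered residual with its expected Normal quantile."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Assumed plotting positions; other conventions give similar plots
    quantiles = [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    return list(zip(quantiles, ordered))


# Hypothetical residuals from a fitted model
residuals = [0.3, -1.2, 0.8, -0.1, 0.2]

for q, r in normal_plot_points(residuals):
    print(f"Normal quantile {q:6.3f}   residual {r:5.2f}")
```

If the residuals are roughly Normal, plotting the second column against the first should give an approximately straight line.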
Additivity
• differences between treatment effects remain the same however large or small the underlying size of the response
• e.g. in randomized-block design, assume that theoretical value of difference between two treatments remains the same within a block where responses are low, as in one where they are high
• fitting an additive model when non-additivity is present
• often leads to detection of (spurious) interactions
• analysis will be harder to interpret
• predictions will be unreliable
• but take care – genuine interactions may also occur e.g. if one treatment modifies the mode of action of another
• data that show signs of non-additivity often also violate other assumptions
• use background knowledge of the process
• if a multiplicative model is appropriate, take a log transformation
Outliers
• are extreme observations, leading to very large residuals
• look for warnings in ANOVA Information Summary
• or for extreme points in histogram of residuals
• or high or low points in plot of residuals against fitted values
• or points away from line at end of Normal or half-Normal plot
• outliers may arise from
• errors in recording or punching data
• the wrong treatment having been applied to a unit
• a problem in the experimental procedure
• outliers
• distort treatment means
• inflate the error variance, decreasing the precision of estimates
• if you have outliers investigate to see if errors have occurred
• if you find an error try to recover the correct data value
• if you cannot find the correct data value, insert a missing value
• if you cannot find any possible source of error, perhaps the outlier might be a true data value – is your model wrong?
Transformations
• can correct failures of assumptions
• e.g. to stabilize variance
• counts: square root
• binomial percentages: angular, i.e. arcsine(sqrt(p/100))
• s.e. proportional to mean: log
• e.g. non-additivity
• multiplicative effects: log, e.g. log10(n+1) for counts
• percentages: logit = log(p/(100−p)), with p = 100×(r+½)/(n+1) for binomial
• note: must make inferences on transformed scale
• but can present back-transformed means using Save and
Log transformed data
•
study of plankton numbers
•
4 types of plankton (treatments)
•
sampled in 12 hauls (blocks)
•
compare analyses for
untransformed and log10
transformed numbers
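For reference, the common variance-stabilizing transformations used in this session can be written as short Python functions. This is a sketch under stated assumptions: the function names are hypothetical, the angular transformation is returned in radians, and the logit uses natural logarithms; GenStat's own functions may use different units or bases.

```python
from math import asin, log, log10, sqrt


def sqrt_count(n):
    """Square root, often used for counts."""
    return sqrt(n)


def angular(p):
    """Angular transformation of a percentage p, in radians (assumed unit)."""
    return asin(sqrt(p / 100))


def log_count(n):
    """log10(n + 1), which copes with zero counts."""
    return log10(n + 1)


def back_log_count(z):
    """Back-transform a value from the log10(n + 1) scale to the count scale."""
    return 10 ** z - 1


def empirical_logit(r, n):
    """Logit of p = 100(r + 1/2)/(n + 1) for r successes out of n (natural log assumed)."""
    p = 100 * (r + 0.5) / (n + 1)
    return log(p / (100 - p))


counts = [0, 9, 99, 999]  # hypothetical counts
transformed = [log_count(c) for c in counts]
print("log10(n+1):      ", [round(z, 3) for z in transformed])
print("back-transformed:", [round(back_log_count(z), 3) for z in transformed])
print("angular(50%):    ", round(angular(50), 4))
print("logit, 5 of 10:  ", round(empirical_logit(5, 10), 4))
```

Note that a back-transformed mean of logged data is not the arithmetic mean on the original scale, which is one reason inferences are made on the transformed scale.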
Practical 4.6
•
spreadsheet
Wine.gsh
contains results from an
experiment to assess the %
alcohol of wine
•
5 types of wine A-E
•
3 bottles of each type were
tested in a random order
•
analyse
the percentages &
plot residuals against fitted
values
•
transform the percentages
using a logit
transformation,
re-analyse
the data & replot
residuals against fitted values
Permutation tests
• if the distributional assumptions are not satisfied, you might use a random permutation test as an alternative way to assess the significance of the terms in the analysis
• model must still be additive for results to be meaningful
• but residuals need no longer follow Normal distributions with equal variances
• click on Permutation Test in ANOVA Further Output menu to open ANOVA Permutation Test menu
• specify Number of permutations
• select Seed (0 automatic)
• click on Run
• probability for each treatment term is now determined from its distribution over the randomly permuted data sets
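The idea behind the permutation test can be sketched in Python for the simplest case, a completely randomized one-way design. This is an illustration, not GenStat's procedure: the function names and data are hypothetical, and designs with blocks would need the permutations restricted to respect the block structure.

```python
import random
from statistics import mean


def variance_ratio(values, labels):
    """One-way anova variance ratio for values grouped by treatment labels."""
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)
    n, k = len(values), len(groups)
    grand = mean(values)
    treat_ss = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    resid_ss = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    return (treat_ss / (k - 1)) / (resid_ss / (n - k))


def permutation_p(values, labels, n_perm=999, seed=1):
    """Proportion of random relabellings with a variance ratio >= the observed one."""
    rng = random.Random(seed)
    observed = variance_ratio(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)            # randomly reallocate treatments to units
        if variance_ratio(values, shuffled) >= observed:
            hits += 1
    # The observed allocation counts as one of the possible permutations
    return (hits + 1) / (n_perm + 1)


# Hypothetical data: two treatments, four units each
values = [10.0, 12.0, 11.0, 13.0, 14.0, 15.0, 13.0, 16.0]
labels = ["A"] * 4 + ["B"] * 4

p = permutation_p(values, labels, n_perm=999, seed=1)
print(f"permutation probability = {p:.3f}")
```

Fixing the seed makes the result reproducible, which is the role of the Seed setting in the menu.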