December 2007
Question 1
(a) Acceptable analyses: Paired t-test, 2-factor ANOVA without interaction, sign test, simple linear regression.
Acceptable graphs: dot, box or stem-leaf plot of differences; interaction plot; scatter plot with fitted line. (Graph 2, suitable analysis 2, correct calculation 4, assumptions 2, conclusions 2)
Paired t-test
Assumptions: Differences independent (can’t test), normal (looks OK on dot plot or stem & leaf plot).
Conclusions: There is no evidence (P = 0.53) from these data of a difference in mean iron content between the two methods.
> iron
    dich  thio  diff
1  28.22 28.27 -0.05
2  33.95 33.99 -0.04
3  38.25 38.20  0.05
4  42.52 42.42  0.10
5  37.62 37.64 -0.02
6  36.84 36.85 -0.01
7  36.12 36.21 -0.09
8  35.11 35.20 -0.09
9  34.45 34.40  0.05
10 52.83 52.86 -0.03
> t.test(iron$dich,iron$thio,pair=T)
Paired t-test
data: iron$dich and iron$thio
t = -0.6591, df = 9, p-value = 0.5263
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.05761671  0.03161671
sample estimates:
mean of the differences
                 -0.013
> vardiff <- var(iron$diff)
> vardiff
[1] 0.00389
> stem(iron$diff)
The decimal point is 1 digit(s) to the left of the |
  -0 | 995
  -0 | 4321
   0 |
   0 | 55
   1 | 0
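As a cross-check on the t.test output, the paired t statistic can be rebuilt by hand from the ten differences listed in the iron data. This is only an illustrative sketch; the names d, t_stat and p_value are not part of the original session.

```r
# Differences dich - thio, copied from the diff column of the iron data
d <- c(-0.05, -0.04, 0.05, 0.10, -0.02, -0.01, -0.09, -0.09, 0.05, -0.03)
n <- length(d)

# Paired t statistic: mean difference divided by its standard error
t_stat <- mean(d) / sqrt(var(d) / n)

# Two-sided P-value from the t distribution on n - 1 = 9 df
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

print(c(t = t_stat, P = p_value))
```

The result should agree with t = -0.6591 and p-value = 0.5263 reported by t.test.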
Sign test
Assumptions: Differences are independent (can’t test).
Conclusions: There are 3 positive differences out of 10 non-zero differences, so P = 0.34 and there is no evidence from these data of a difference in the median iron content between the two methods.
> 2*pbinom(3,10,.5)
[1] 0.34375
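The same sign test can also be run with R's built-in exact binomial test, counting the positive differences directly from the data. A minimal sketch, assuming the differences are entered by hand; the names d, n_pos, n_nonzero and sign_test are illustrative.

```r
# Differences dich - thio, copied from the diff column of the iron data
d <- c(-0.05, -0.04, 0.05, 0.10, -0.02, -0.01, -0.09, -0.09, 0.05, -0.03)

n_pos     <- sum(d > 0)    # number of positive differences
n_nonzero <- sum(d != 0)   # zero differences (none here) would be dropped

# Exact two-sided binomial test of P(positive difference) = 0.5
sign_test <- binom.test(n_pos, n_nonzero, p = 0.5)
print(sign_test$p.value)
```

Because the null distribution Bin(10, 0.5) is symmetric, this two-sided P-value coincides with 2*pbinom(3,10,.5).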
(b) Acceptable analyses: Two-sample t-test or 1-factor ANOVA.
Acceptable graphs: Comparative dot, box or stem-leaf plots. (Graph 2, suitable analysis 2, correct calculation 3, F-test of variances 2, assumptions 2, conclusions 2)
Assumptions: Independence (can’t test), normality (maybe OK by plot), homoscedasticity (graph looks heteroscedastic; F-test gives evidence (P = 0.0012) that the variances are not equal).
Conclusions: There is some evidence (P = 0.06) from these data that the mean sugar content differs between day 1 and day 2. There is strong evidence (P = 0.0012) that variances on the 2 days are not equal; the variance of sugar content is higher on day 2.
> broth
   sugar day
1    5.0   1
2    4.8   1
3    5.1   1
4    5.1   1
5    4.8   1
6    5.1   1
7    4.8   1
8    4.8   1
9    5.0   1
10   5.2   1
11   4.9   1
12   4.9   1
13   5.0   1
14   5.8   2
15   4.7   2
16   4.7   2
17   4.9   2
18   5.1   2
19   4.9   2
20   5.4   2
21   5.3   2
22   5.3   2
23   4.8   2
24   5.7   2
25   5.1   2
26   5.7   2
> boxplot(sugar~day, broth, horizontal=T, xlab="sugar", ylab="day")
[Figure: horizontal boxplots of sugar by day; x-axis sugar (4.8 to 5.8), y-axis day (1, 2)]
> t.test(sugar~day,broth,var.eq=T)
Two Sample t-test
data: sugar by day
t = -1.9664, df = 24, p-value = 0.06092
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
sample estimates:
mean in group 1 mean in group 2
       4.961538        5.184615
> varbroth<-sapply(split(broth$sugar,broth$day),var)
> varbroth
         1          2
0.01923077 0.14807692
> varbroth[2]/varbroth[1]
  2
7.7
> 2*(1-pf(varbroth[2]/varbroth[1],12,12))
          2
0.001271843
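The F-test of variances can also be obtained in one call with var.test, which should reproduce the hand calculation. A sketch, assuming the broth data are re-entered as shown and that day is stored as a factor; f_test and f_ratio are illustrative names.

```r
# Sugar content for 13 samples on each of two days (values from the broth listing)
broth <- data.frame(
  sugar = c(5.0, 4.8, 5.1, 5.1, 4.8, 5.1, 4.8, 4.8, 5.0, 5.2, 4.9, 4.9, 5.0,
            5.8, 4.7, 4.7, 4.9, 5.1, 4.9, 5.4, 5.3, 5.3, 4.8, 5.7, 5.1, 5.7),
  day   = factor(rep(1:2, each = 13))
)

# Two-sided F-test for equality of the two variances
f_test <- var.test(sugar ~ day, data = broth)

# var.test reports var(day 1) / var(day 2); invert it to get day 2 / day 1
f_ratio <- 1 / unname(f_test$estimate)

print(c(ratio = f_ratio, P = f_test$p.value))
```

The two-sided P-value does not depend on which variance is placed in the numerator, so it matches the value from 2*(1-pf(...)).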
Not Required: If we don’t want to assume homoscedasticity we can use the unequal-variance (Welch) t-test to compare the means, but the P-value (0.068) is almost the same as before (0.061).
> t.test(sugar~day,broth,var.eq=F)
Welch Two Sample t-test
data: sugar by day
t = -1.9664, df = 15.065, p-value = 0.06796
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.46478864  0.01863479
sample estimates:
mean in group 1 mean in group 2
       4.961538        5.184615
Question 2
(a) William Sealy Gosset, + 3 interesting facts. (4 marks)
(b) Definitions (4 marks)
The variance of a random variable is the expected squared deviation from the mean.
The correlation coefficient is a dimensionless measure of linear association between two random variables. Pearson's correlation coefficient is their covariance divided by the product of their standard deviations. It ranges from +1 (perfect linear relationship with positive slope) to -1 (perfect linear relationship with negative slope), with 0 indicating no linear relationship.
Autocorrelation is the correlation between observations in a time series that are a fixed lag apart, most commonly consecutive observations (lag 1). A sequence of independent observations has zero autocorrelation.
A pivotal quantity is a function of the data and the parameter of interest whose distribution is a known standard distribution that does not depend on any unknown parameters.
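In symbols, these definitions can be written as follows. The notation is an assumption made for illustration (X and Y are random variables with means \mu_X, \mu_Y and standard deviations \sigma_X, \sigma_Y). The last expression is one familiar example of a pivotal quantity, for a sample of size n from a normal distribution with mean \mu, sample mean \bar{X} and sample standard deviation S.

```latex
\operatorname{Var}(X) = E\!\left[(X-\mu_X)^2\right], \qquad
\rho_{XY} = \frac{\operatorname{Cov}(X,Y)}{\sigma_X\,\sigma_Y}, \qquad
T = \frac{\bar{X}-\mu}{S/\sqrt{n}} \sim t_{n-1}
```

T depends on the unknown \mu, but its t distribution on n - 1 degrees of freedom does not, which is what makes it pivotal.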
(c) Let X be the number of damaged fibres out of the 80000 in a square metre of carpet. Assuming that damaged fibres occur independently of each other, X ~ Bin(80000, 0.0001) and the probability of 2 or more damaged fibres is 0.99698. For hand calculation, the Poisson approximation to the Binomial applies because n = 80000 is large and p = 0.0001 is small; the mean number of damaged fibres is 8 and this approximation also gives 0.99698. (6 marks for either calculation)
> 1 - pbinom(1, 80000, 0.0001)
[1] 0.9969818
> 1 - (.9999^80000 + 80000*0.0001*.9999^79999)
[1] 0.9969818
> 1 - ppois(1, 8)
[1] 0.9969808
> 1 - (1+8)*exp(-8)
[1] 0.9969808
(d) Assuming that the harmful bacteria are distributed independently and randomly in the sample (i.e. as a Poisson process), the number of bacteria in a sample will follow a Poisson distribution with mean = 1 at low contamination and mean = 5 at high contamination. (11 marks)
Let C be the event that contamination is high, C’ its complement (contamination is low), and X the number of harmful bacteria.
P(C|X=1) = P(X=1|C)P(C)/[P(X=1|C)P(C) + P(X=1|C’)P(C’)]
P(C’|X=5) = P(X=5|C’)P(C’)/[P(X=5|C)P(C) + P(X=5|C’)P(C’)]
> dpois(1,5)*.01/(dpois(1,5)*.01+dpois(1,1)*.99)
[1] 0.0009241774
> dpois(5,1)*.99/(dpois(5,5)*.01+dpois(5,1)*.99)
[1] 0.6336553
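The two Bayes calculations share the same structure, so they can be wrapped in a small helper. This is a sketch only: posterior_high is a hypothetical function name, and its defaults (prior probability 0.01 of high contamination, Poisson means 1 and 5) are the values appearing in the R lines for part (d).

```r
# Posterior probability that contamination is high, given x bacteria counted.
# Defaults are taken from the R lines for part (d); the function name is illustrative.
posterior_high <- function(x, prior_high = 0.01, mean_low = 1, mean_high = 5) {
  num <- dpois(x, mean_high) * prior_high               # P(X = x | C) P(C)
  den <- num + dpois(x, mean_low) * (1 - prior_high)    # total probability of X = x
  num / den
}

p_high_given_1 <- posterior_high(1)       # P(C  | X = 1)
p_low_given_5  <- 1 - posterior_high(5)   # P(C' | X = 5)

print(c(p_high_given_1, p_low_given_5))
```

Writing P(C’|X=5) as 1 - P(C|X=5) gives the same value as applying Bayes' theorem to C’ directly, because the two posteriors share a denominator.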
(e) Remark: In this example the coefficient of variation is small and the data do not span more than one order of magnitude, so there is very little difference between the normal and the lognormal distributions. You should plot the pdfs to demonstrate this. (8 marks for the calculations)
> theta <- log(120.87)-.5*omega^2
> theta
[1] 4.790682
> 1-plnorm(140,theta,omega)
[1] 0.04640791
> 1-pnorm(log(140),theta,omega)
[1] 0.04640791
> 1-pnorm((log(140)-theta)/omega)
[1] 0.04640791
> 1-pnorm(140, 120.87, 0.09*120.87)
[1] 0.03932726
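The transcript does not show how omega was defined. The sketch below assumes the standard lognormal moment relation omega^2 = log(1 + CV^2) with CV = 0.09 (the coefficient of variation used in the normal calculation). That assumption appears to reproduce theta = 4.790682, but it is a reconstruction rather than the original code.

```r
m  <- 120.87   # mean on the original scale (from the transcript)
cv <- 0.09     # coefficient of variation (from the normal calculation)

# ASSUMPTION: omega defined through the lognormal relation CV^2 = exp(omega^2) - 1
omega <- sqrt(log(1 + cv^2))
theta <- log(m) - 0.5 * omega^2    # mean of log(X), as in the transcript

p_lognormal <- 1 - plnorm(140, theta, omega)   # P(X > 140), lognormal model
p_normal    <- 1 - pnorm(140, m, cv * m)       # P(X > 140), normal model

print(c(theta = theta, lognormal = p_lognormal, normal = p_normal))
```

Under this assumption the two tail probabilities differ by less than 0.01, in line with the remark that the normal and lognormal models are close here.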
Question 3
The analysis is a 2-factor ANOVA with a test for interaction.
Model: (14-1); ANOVA identity: (14-3, 14-4). (6 marks)
Assumptions: Normality, independence, homoscedasticity. (3 marks)
Conclusions: There is strong evidence (P < 0.001) from these data of an interaction between travel speed and accelerating voltage. This means that both travel speed and accelerating voltage affect hardness but the effect of accelerating voltage is different at different travel speeds. Because the interaction is significant, we should not test the main effects.
The interaction plot shows that travel speed does not make much difference at low voltage. Raising the voltage gives slightly reduced hardness at high travel speed and much reduced hardness at low travel speed.
The plot of Residuals vs Fitted is very straight and flat, indicating that the model is correct. The spread of the residuals above and below the line is reasonably uniform along the line, validating the assumption of homoscedasticity. The Normal QQ plot of Residuals is adequately straight, validating the assumption of normality. (9 marks)
Residual variance: The estimate of residual variance is MSE = 234 on 9 df, so a 95% confidence interval is (9*234/19.02, 9*234/2.70) = (111, 780) (3 marks)
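The interval uses the fact that df*MSE/sigma^2 follows a chi-squared distribution on df degrees of freedom, so the limits are df*MSE divided by the upper and lower chi-squared quantiles. A minimal sketch with exact quantiles from qchisq instead of table values; var_ci is a hypothetical helper name.

```r
# Confidence interval for a residual variance from an MSE on df degrees of freedom.
# var_ci is an illustrative helper, not a function from any package.
var_ci <- function(mse, df, level = 0.95) {
  alpha <- 1 - level
  c(lower = df * mse / qchisq(1 - alpha / 2, df),   # divide by the upper quantile
    upper = df * mse / qchisq(alpha / 2, df))       # divide by the lower quantile
}

ci_q3 <- var_ci(234, 9)
print(round(ci_q3))
```

The same helper applies to any MSE and residual degrees of freedom, for example var_ci(3.2, 10).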
R code: (4 marks)
> vickers
   hard speed volt
1   875    10   10
2   896    10   10
3   712    10   25
4   719    10   25
5   568    10   50
6   546    10   50
7   876    20   10
8   835    20   10
9   889    20   25
10  876    20   25
11  756    20   50
12  732    20   50
13  901    30   10
14  926    30   10
15  789    30   25
16  801    30   25
17  792    30   50
18  786    30   50
> anova(lm(hard~speed*volt, vickers))
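The printed data frame does not show whether speed and volt are stored as factors. The 9 residual degrees of freedom quoted in the answer require that they are, since numeric columns would give a regression with 14 residual df instead. The sketch below re-enters the data with both columns as factors; treating them this way is an assumption about the original session.

```r
# Vickers hardness data re-entered from the listing; speed and volt as factors
vickers <- data.frame(
  hard  = c(875, 896, 712, 719, 568, 546,
            876, 835, 889, 876, 756, 732,
            901, 926, 789, 801, 792, 786),
  speed = factor(rep(c(10, 20, 30), each = 6)),
  volt  = factor(rep(rep(c(10, 25, 50), each = 2), times = 3))
)

# Two-factor ANOVA with interaction
tab <- anova(lm(hard ~ speed * volt, data = vickers))
print(tab)

# Residual mean square, the estimate of the residual variance
mse <- tab["Residuals", "Mean Sq"]
```

With factors, the Residuals row has 9 df and a mean square of about 234, matching the values used for the confidence interval.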
Question 4
The first analysis is a simple linear regression ANOVA.
Model: (11-1); ANOVA identity: (11-24)
Assumptions: Independence, normality, homoscedasticity, linearity.
Conclusions: There is no evidence (P = 0.74) from these data that the slope is not zero. However, the plot of the fitted line and the plot of residuals vs fitted both indicate that a straight line fit is not appropriate so this analysis should not be used.
R code:
> anova(lm(yield~temp, react))
> plot(yield~temp, react)
> abline(lm(yield~temp, react))
> plot(lm(yield~temp, react))
The second analysis is a simple linear regression using the repeated x-values to get a lack-of-fit test.
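Model:
$$Y_{ij} = \beta_0 + \beta_1 x_i + \delta_i + \varepsilon_{ij}$$
ANOVA identity:
$$\sum_{i=1}^{a}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{++})^2 = \sum_{i=1}^{a} n_i (\hat{y}_i - \bar{y}_{++})^2 + \sum_{i=1}^{a} n_i (\bar{y}_{i+} - \hat{y}_i)^2 + \sum_{i=1}^{a}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i+})^2$$
(6 marks)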
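The lack-of-fit decomposition splits the total sum of squares into regression, lack-of-fit and pure-error parts, and it can be checked numerically. The react data are not listed here, so the sketch below uses small hypothetical data (not the exam's values) purely to show the mechanics: the straight-line fit is compared with a model that has a separate mean at each x value.

```r
# HYPOTHETICAL data for illustration only (not the exam's react data):
# three replicate responses at each of four x values
x <- rep(c(60, 70, 80, 90), each = 3)
y <- c(51, 53, 52, 60, 62, 61, 66, 64, 65, 58, 57, 59)

line_fit <- lm(y ~ x)           # straight-line model
cell_fit <- lm(y ~ factor(x))   # separate mean at each x value

ss_total <- sum((y - mean(y))^2)                          # total SS
ss_reg   <- sum((fitted(line_fit) - mean(y))^2)           # regression SS
ss_lof   <- sum((fitted(cell_fit) - fitted(line_fit))^2)  # lack-of-fit SS
ss_pe    <- sum((y - fitted(cell_fit))^2)                 # pure-error SS

# F-test for lack of fit: nested comparison of the two models
lof_test <- anova(line_fit, cell_fit)
print(lof_test)
```

In the comparison table, the Sum of Sq entry is the lack-of-fit sum of squares and the RSS of the larger model is the pure-error sum of squares.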
Assumptions: Independence, normality, homoscedasticity. (3 marks)
Conclusions: There is strong evidence (P = 0.002) from these data of lack of fit, i.e. that the relationship is not a straight line, so the test for slope has no meaning. Temperature affects yield in a non-linear manner. The plot of yield vs temperature indicates that yield is maximized when temperature is around 80 degrees.
The plot of Residuals vs Fitted is very straight and flat, indicating that the model is correct. The spread of the residuals above and below the line is reasonably uniform along the line, validating the assumption of homoscedasticity. The Normal QQ plot of Residuals is adequately straight, validating the assumption of normality. (9 marks)
Residual variance: The estimate of residual variance is MSE = 3.2 on 10 df, so a 95% confidence interval is (10*3.2/20.48, 10*3.2/3.25) = (1.56, 9.85) (3 marks)
R code: (4 marks)