LMs fit a number of explanatory variables against a continuous response variable. The explanatory variables can be discrete or continuous. In many places you will see recommendations to have more continuous than discrete variables, but this is not necessary.

The assumptions of a LM have been discussed earlier in the chapter, but as a recap they are linearity, independence, residual normality, and consistent error variation.

Example 7.1 will walk through a LM example in R, looking at whether concentration “Conc” has an effect on yield “Yield.”

In the interest of space when plotting the graphs in any of the examples in this chapter, only the basic ggplot() code will be shown; however the plots I create will include more detail than the basic code produces, and this detail will be explained in Chapter 9. In addition, usually you would plot basic graphs for EDA at the beginning of the analysis along with well-presented graphs to visually complement the output summary; to save space I will only be creating the final plot. To plot the graphs we will be using the R package ggplot2, which has a lot of dependencies.

**EXAMPLE 7.1**

We have a response variable of Yield and an explanatory variable of Concentration;

both the response and explanatory variables are continuous. We are looking to see if Concentration has an effect on Yield.

*# Input the data*

**Yield = c(498,480.3,476.4,546,715.4,666,741.2,522,683.6,574,804,637,**
** 700,750,600,650,590)**

**Conc = c(3.9,3.8,3.6,4.2,5.7,5,5.5,3.7,4.9,4,6,5,5.2,5.9,4.8,4.7,4.3)**
**data18 = data.frame(Yield, Conc)**

*# Fit the model and print the output*
**mod = lm(Yield ~ Conc, data = data18)**
**summary(mod)**

**Chapter 7 | Statistical Modeling**


Call:
lm(formula = Yield ~ Conc, data = data18)

Residuals:
    Min      1Q  Median      3Q     Max
-35.737 -23.542   5.458  19.435  37.481

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   40.425     40.338   1.002    0.332
Conc         124.023      8.442  14.692  2.6e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 26.43 on 15 degrees of freedom
Multiple R-squared:  0.935, Adjusted R-squared:  0.9307
F-statistic: 215.8 on 1 and 15 DF,  p-value: 2.602e-10

The model output shows that Concentration has a significant positive effect on Yield at the 99% confidence level, with a p-value of 2.6e-10. For every 1-unit increase in Concentration there is a 124-unit increase in Yield (the estimate for Conc is 124.023).

It also shows that the model accounts for 93.1% of the total variation, which is very good (adjusted R^{2} of 0.9307).
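As a quick sketch of the slope interpretation (reusing the Example 7.1 data), the difference in predicted Yield between two concentrations one unit apart recovers the Conc estimate; the specific concentrations 4 and 5 are just an illustrative choice:

```r
# Example 7.1 data
Yield = c(498,480.3,476.4,546,715.4,666,741.2,522,683.6,574,804,637,
          700,750,600,650,590)
Conc = c(3.9,3.8,3.6,4.2,5.7,5,5.5,3.7,4.9,4,6,5,5.2,5.9,4.8,4.7,4.3)
data18 = data.frame(Yield, Conc)
mod = lm(Yield ~ Conc, data = data18)

# Predicted Yield at Conc = 4 and Conc = 5; the difference is the slope
preds = predict(mod, newdata = data.frame(Conc = c(4, 5)))
diff(preds)  # equals the Conc estimate, roughly 124
```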

*# Check the diagnostics*
**plot(mod, which = 2)**
**plot(mod, which = 1)**

The diagnostic plots show that we can roughly assume normality of the residuals (it’s not great, but it is good enough) and that there is consistent error variation, as the points are roughly scattered.

**Translating Statistics to Make Decisions**^{175}

*# Calculate confidence intervals for the estimates*
**confint(mod)**

                2.5 %   97.5 %
(Intercept) -45.55406 126.4039
Conc        106.03008 142.0167

The confidence intervals highlight that for every unit increase in Concentration the increase in Yield could be between 106 and 142. The confidence intervals for the intercept also include 0, which is good as we would expect a Yield of 0 at a Concentration of 0.

*# Plot the data*
**library(ggplot2)**

**ggplot(data18, aes(x = Conc, y = Yield)) + theme_bw() +**
** geom_point() + geom_smooth(method = "lm")**

Finally the scatter plot highlights the linearity of the data and shows the line of best fit with confidence intervals. This clearly emphasises the positive relationship between Concentration and Yield.


Once the data has been input, you assign a model to the data and give it a name, in this case “mod.” The lm() stands for linear model; then the response variable is specified followed by the explanatory variable(s), and finally the data set to be used is included.

The next few commands are carried out on this “mod” object, so the summary() gives a summary of the model output. Next the diagnostic plots are drawn; specifying the “which =” argument allows us to define which diagnostic plot we would like drawn. The confint() calculates 95% confidence intervals of the model estimates by default; to change this level you would amend the code as follows: confint(mod, level = 0.99). Finally, using ggplot() I plotted a scatter plot including the line of best fit from the linear model with 95% confidence interval shading.

As an aside, a model can be fitted without an intercept term. This just means that the slope of the fit is forced through the origin, (0,0), which in some cases makes sense but in others may not.
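As a sketch of this in R, the intercept is dropped by adding `- 1` (or equivalently `+ 0`) to the model formula; here reusing the Example 7.1 data:

```r
# Example 7.1 data
Yield = c(498,480.3,476.4,546,715.4,666,741.2,522,683.6,574,804,637,
          700,750,600,650,590)
Conc = c(3.9,3.8,3.6,4.2,5.7,5,5.5,3.7,4.9,4,6,5,5.2,5.9,4.8,4.7,4.3)
data18 = data.frame(Yield, Conc)

# "- 1" removes the intercept, forcing the fitted line through the origin
mod0 = lm(Yield ~ Conc - 1, data = data18)
summary(mod0)  # only a Conc row is reported; there is no intercept term
```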

### ANOVA

Many people have heard of ANOVA; it is simply an extension of a two-sample t test. As with LMs, ANOVAs are used to fit a number of explanatory variables against a continuous response variable. Again, the explanatory variables can be discrete or continuous. However, in many places you will see recommendations to have more discrete variables than continuous variables; this should be adhered to, otherwise you can just run a LM instead.

The key thing to note here is that there isn’t much difference between fitting a LM or an ANOVA. You can in fact get the same ANOVA results by computing an ANOVA table of the LM results. The main difference between the two is that with a LM it doesn’t matter what order you put the explanatory variables into the model, whereas it does matter for the general ANOVA model.
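A small sketch of the order point, using hypothetical unbalanced data (the names y, f1, and f2 are invented for illustration): with an unbalanced design the sequential ANOVA table depends on the order of terms, while the LM coefficients do not. Note that anova(lm(...)) is exactly the “ANOVA table of the LM results” mentioned above.

```r
# Hypothetical unbalanced two-factor data
set.seed(1)
d = data.frame(y = rnorm(30),
               f1 = factor(sample(c("a", "b"), 30, replace = TRUE)),
               f2 = factor(sample(c("x", "y", "z"), 30, replace = TRUE)))

# The sequential (Type I) sums of squares change with term order...
anova(lm(y ~ f1 + f2, data = d))
anova(lm(y ~ f2 + f1, data = d))

# ...whereas the fitted LM coefficients are identical either way
coef(lm(y ~ f1 + f2, data = d))
coef(lm(y ~ f2 + f1, data = d))
```

With a perfectly balanced design, such as data19 in Example 7.2, the sequential sums of squares coincide whichever order is used, which is why order is mainly a concern for unbalanced data.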

The LM will use t-values to calculate the p-values; the t-values test the marginal impact of the levels of the explanatory variables given that all the other variables are present. The ANOVA will use F-values to calculate the p-values; the F-values test whether the explanatory variable as a whole reduces the residual sum of squares (SS) compared to the previous explanatory variable(s). So explanatory variable one will be tested against the response, then explanatory variable two will be tested given that explanatory variable one is present, and so forth.

An advantage of the ANOVA table is that it can tidy up high-level explanatory variables. A LM will show the estimated parameters, that is, one row per explanatory variable level, whereas an ANOVA will show the variables, that is, one row per explanatory variable. For example, a LM output of one explanatory variable with 10 levels will show 9 rows, whereas an ANOVA will only show 1 row. It can be simpler to use an ANOVA table to simplify the model; however I would compare the two along the way, and always use a LM output for the final results.
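A minimal sketch of that row-count difference, using a hypothetical 10-level factor f (data invented for illustration):

```r
# Hypothetical response with a 10-level factor, 5 observations per level
set.seed(2)
d = data.frame(y = rnorm(50),
               f = factor(rep(letters[1:10], each = 5)))

# LM coefficient table: intercept plus 9 level rows
coef(summary(lm(y ~ f, data = d)))

# ANOVA table: a single row for f (plus the residuals row)
summary(aov(y ~ f, data = d))
```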

The assumptions of the ANOVA are similar to those of the LM, including independence, consistent error variation, and residual normality. It also includes the assumption that the levels of the explanatory variable have similar variation. The recommended test to use for this is Bartlett’s, along with plotting a box plot; recall Chapter 6.

Example 7.2 will look at an ANOVA example in R, looking at whether either of the two materials “Material” or any of the four methods of implementation “Method” have an effect on the total volume produced “Volume.” It will also show a LM output of the same model along with how you can calculate an ANOVA table from that output.

As there are multiple “Methods” we may need to carry out multiple comparisons; if so we will need to use the R package lsmeans. Least-squares means can be calculated for each explanatory variable combination along with their contrasts to determine if there are significant differences between multiple groups.

**EXAMPLE 7.2**

We have a response variable of Volume and two explanatory variables of Material and Method. The response variable is continuous and both explanatory variables are discrete. We are looking to see if Material, Method, or an interaction between the two has an effect on Volume.

*# Input the data*

**Volume = c(28.756,29.305,28.622,30.195,27.736,17.093,17.076,17.354,**
** 16.353,15.880, 36.833,35.653,34.583,35.504,35.236,30.333,**
** 30.030,28.339,28.748,29.020,32.591,30.572,32.904,31.942,**
** 33.653, 20.725,22.198,21.988,22.403,21.324,38.840,40.137,**
** 39.295,39.006,40.731,32.136,33.209,34.558,32.782,31.460)**
**Material = rep(c("A","B"), each = 20)**

**Method = rep(c("I","II","III","IV"), each = 5, 2)**
**data19 = data.frame(Volume, Material, Method)**

*# Check for equal variances*

**bartlett.test(Volume ~ interaction(Material,Method), data = data19)**
Bartlett test of homogeneity of variances

data: Volume by interaction(Material, Method)

Bartlett's K-squared = 2.6181, df = 7, p-value = 0.9179

The Bartlett’s test gave a p-value of 0.918, which suggested no evidence to reject the null hypothesis; this means we can assume equal variances, which was also confirmed by the roughly equal sizes of the boxes and whiskers on the box plot at the end of Example 7.2.


*# Fit the full model and print the output*

**mod = aov(Volume ~ Material*Method, data = data19)**
**summary(mod)**

The output of the full ANOVA model that contains the interaction showed that the interaction was not significant to Volume with a p-value of 0.215. This is backed by the plot of the data that shows a similar pattern and gradient for Material and Method.

As such the model can be simplified by removing the interaction term.

*# Simplify the model and print the output*

**mod2 = aov(Volume ~ Material + Method, data = data19)**
**summary(mod2)**

Once simplified the model output showed that both Material and Method had a significant effect on Volume at the 99% confidence level, again it’s clear to see a difference between the two Methods and you also can see that there is a consistent difference between the Materials using the final plot. This means that the model cannot be simplified any further.

The ANOVA output showed us that there is a difference between Material A and B, and looking at the box plot we can see that Material B gives a higher Volume. However, we don’t know if all Methods are different from each other, and this is what the lsmeans output will show us.

*# Check for differences between all Methods*
**library(lsmeans)**
**lsmeans(mod2, pairwise ~ Method)**


Results are averaged over the levels of: Material
Confidence level used: 0.95
P-value adjustment: tukey method for comparing a family of 4 estimates

Using the second section of the lsmeans output we can see that all Methods are significantly different from each other at the 99% confidence level except Method I and Method IV, which are not significantly different from each other (p-value of 0.7183).

Using the first section of the lsmeans output and/or the box plot shown at the end of the example, you also can order the Methods accordingly, from highest Volume to lowest: Method III, Method IV and Method I, then Method II.

*# Compare the two models*
**anova(mod, mod2)**

Analysis of Variance Table

Model 1: Volume ~ Material * Method
Model 2: Volume ~ Material + Method
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     32 25.777
2     35 29.575 -3   -3.7984 1.5718 0.2154

To check whether the model could be simplified, an ANOVA comparing the two models was run, comparing the simpler model to the more complex model. This showed a p-value of 0.215, which suggests no significant difference between the two models; hence it is fine to use the simpler model.

*# Check the diagnostics*
**plot(mod2, which = 2)**
**plot(mod2, which = 1)**


The diagnostic plots show that we can roughly assume normality of the residuals and that there is consistent error variation, as the points are roughly scattered. These plots would be exactly the same if they were run on the LM output.

*# Calculate confidence intervals for the estimates*
**confint(mod2)**

                  2.5 %    97.5 %
(Intercept)  27.9726891  29.29226
MaterialB     3.4001196   4.58038
MethodII    -12.2227704 -10.55363
MethodIII     6.1196296   7.78877
MethodIV     -0.4006704   1.26847

The downside of the ANOVA summary output is that it doesn’t show the estimates for each level of the explanatory variables; however, these can be calculated from the confidence intervals output, as the intervals are symmetrical. For example, the estimate for the intercept (Material A with Method I) would be (27.97 + ((29.29 – 27.97)/2)) = 28.63, which you will see matches the LM output next.
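That midpoint arithmetic can be sketched generically: for any lm or aov fit, each estimate sits exactly halfway between its confidence limits (toy data here purely for illustration):

```r
# Toy fit to illustrate recovering estimates from confint() output
y = c(1.2, 2.3, 2.9, 4.1, 5.2, 5.8)
x = c(1, 2, 3, 4, 5, 6)
fit = lm(y ~ x)

ci = confint(fit)
# lower + (upper - lower) / 2 gives back each coefficient estimate
midpoints = ci[, 1] + (ci[, 2] - ci[, 1]) / 2
all.equal(unname(midpoints), unname(coef(fit)))  # TRUE
```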

*# Fit linear model and print results*

**mod3 = lm(Volume ~ Material + Method, data = data19)**
**summary(mod3)**

Call:
lm(formula = Volume ~ Material + Method, data = data19)

Residuals:
     Min       1Q   Median       3Q      Max
-2.05073 -0.59837 -0.03905  0.69276  1.56253

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  28.6325     0.3250  88.100  < 2e-16 ***
MaterialB     3.9903     0.2907  13.727 1.18e-15 ***
MethodII    -11.3882     0.4111 -27.702  < 2e-16 ***
MethodIII     6.9542     0.4111  16.916  < 2e-16 ***
MethodIV      0.4339     0.4111   1.055    0.298
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9192 on 35 degrees of freedom
Multiple R-squared:  0.9847, Adjusted R-squared:  0.9829
F-statistic: 562.6 on 4 and 35 DF,  p-value: < 2.2e-16

By fitting the LM we can see that the Volume for Material A and Method I is 28.63, the Volume for Material A and Method II is (28.63 – 11.39) = 17.24, through to the Volume for Material B and Method IV, which is (28.63 + 3.99 + 0.43) = 33.05. We also can see that the model explains 98.3% of the variation.
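The coefficient arithmetic above can be handed to predict() instead of being done by hand; a sketch, rebuilding data19 from Example 7.2:

```r
# Example 7.2 data
Volume = c(28.756,29.305,28.622,30.195,27.736,17.093,17.076,17.354,
           16.353,15.880,36.833,35.653,34.583,35.504,35.236,30.333,
           30.030,28.339,28.748,29.020,32.591,30.572,32.904,31.942,
           33.653,20.725,22.198,21.988,22.403,21.324,38.840,40.137,
           39.295,39.006,40.731,32.136,33.209,34.558,32.782,31.460)
Material = rep(c("A", "B"), each = 20)
Method = rep(c("I", "II", "III", "IV"), each = 5, 2)
data19 = data.frame(Volume, Material, Method)
mod3 = lm(Volume ~ Material + Method, data = data19)

# Fitted Volume for the three combinations worked through above
newdat = data.frame(Material = c("A", "A", "B"),
                    Method = c("I", "II", "IV"))
predict(mod3, newdata = newdat)
```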

*# Calculate confidence intervals – only first section of output shown*
**lsmeans(mod2, pairwise ~ Method*Material)**

If we wanted confidence intervals on each group level, we could use the output from a least-squares means output, which includes the interaction term. So the confidence interval values we would get for the three examples above are 27.97 to 29.29, 16.58 to 17.90, and 32.40 to 33.72, respectively.

*# Create ANOVA table from linear model results*
**anova(mod3)**

By creating an ANOVA table from the LM output we can see that it gives exactly the same results as running the original ANOVA model on the data.

*# Plot data*

**ggplot(data19, aes(x = Method, y = Volume)) + theme_bw() +**

** facet_wrap( ~ Material) + stat_boxplot(geom = "errorbar") +**
** geom_boxplot()**


Finally, the box plot highlights the differences found between the explanatory variables in terms of Volume: all four boxes in the Material B panel give a higher Volume than their corresponding boxes in the Material A panel, and the order of the Materials is the same regardless of the Method. It also emphasises the similar variation mentioned previously; in fact the boxes and whiskers are so small that there was very little variation at all.

Once the data has been input, you check for equal variances using the Bartlett’s test, bartlett.test(), then assign a model to the data and give it a name, in this case “mod.” The aov() stands for ANOVA; then the response variable is specified followed by the explanatory variables and interaction, and finally the data set to be used is included.

The next few commands are carried out on this “mod” object or on the “mod2” object, which was our simplified model; the summary() gives a summary of the model output. The summary can show us differences between the “Materials” as there were only two, but it cannot show us differences between the “Methods.” Therefore lsmeans() does that, by specifying the model along with which explanatory variable is of interest, in this case “Method.”