The R markdown document for this section is available here⁶⁷.
⁶⁷https://github.com/genomicsclass/labs/tree/master/matrixalg/matrix_operations.Rmd
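$$
\begin{aligned}
a + b + c &= 6\\
3a - 2b + c &= 2\\
2a + b - c &= 1
\end{aligned}
$$
We described how this system can be rewritten and solved using matrix algebra:
$$
\begin{pmatrix} 1 & 1 & 1\\ 3 & -2 & 1\\ 2 & 1 & -1 \end{pmatrix}
\begin{pmatrix} a\\ b\\ c \end{pmatrix} =
\begin{pmatrix} 6\\ 2\\ 1 \end{pmatrix}
\implies
\begin{pmatrix} a\\ b\\ c \end{pmatrix} =
\begin{pmatrix} 1 & 1 & 1\\ 3 & -2 & 1\\ 2 & 1 & -1 \end{pmatrix}^{-1}
\begin{pmatrix} 6\\ 2\\ 1 \end{pmatrix}
$$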
Having described matrix notation, we will now explain the operations we perform with matrices. For example, above we have matrix multiplication and we also have a symbol representing the inverse of a matrix.
The importance of these operations and others will become clear once we present specific examples related to data analysis.
Multiplying by a scalar
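We start with one of the simplest operations: scalar multiplication. If a is a scalar and X is an N × p matrix, then:
$$
X = \begin{pmatrix} x_{1,1} & \dots & x_{1,p}\\ \vdots & & \vdots\\ x_{N,1} & \dots & x_{N,p} \end{pmatrix}
\implies
aX = \begin{pmatrix} a x_{1,1} & \dots & a x_{1,p}\\ \vdots & & \vdots\\ a x_{N,1} & \dots & a x_{N,p} \end{pmatrix}
$$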
R automatically follows this rule when we multiply a number by a matrix using *:
X <- matrix(1:12,4,3)
a <- 2
print(a*X)
The transpose
The transpose is an operation that simply changes columns to rows. We use a ⊤ to denote a transpose.
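The technical definition is as follows: if X is an N × p matrix with entries x_{i,j}, its transpose X⊤ is the p × N matrix:
$$
X^\top = \begin{pmatrix} x_{1,1} & \dots & x_{N,1}\\ \vdots & & \vdots\\ x_{1,p} & \dots & x_{N,p} \end{pmatrix}
$$
In R we simply use t:
X <- matrix(1:12,4,3)
t(X)
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 9 10 11 12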
Matrix multiplication
We start by describing the matrix multiplication shown in the original system of equations example:
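$$
\begin{aligned}
a + b + c &= 6\\
3a - 2b + c &= 2\\
2a + b - c &= 1
\end{aligned}
$$
What we are doing is multiplying the rows of the first matrix by the columns of the second. Since the second matrix only has one column, we perform this multiplication by doing the following:
$$
\begin{pmatrix} 1 & 1 & 1\\ 3 & -2 & 1\\ 2 & 1 & -1 \end{pmatrix}
\begin{pmatrix} a\\ b\\ c \end{pmatrix} =
\begin{pmatrix} a + b + c\\ 3a - 2b + c\\ 2a + b - c \end{pmatrix}
$$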
Here is a simple example. We can check to see if abc=c(3,2,1) is a solution:
X <- matrix(c(1,3,2,1,-2,1,1,1,-1),3,3)
abc <- c(3,2,1) #use as an example
rbind( sum(X[1,]*abc), sum(X[2,]*abc), sum(X[3,]*abc))
## [,1]
## [1,] 6
## [2,] 6
## [3,] 7
We can use the %*% operator to perform the matrix multiplication and make this much more compact:
X%*%abc
## [,1]
## [1,] 6
## [2,] 6
## [3,] 7
We can see that c(3,2,1) is not a solution as the answer here is not the required c(6,2,1). To get the solution, we will need to invert the matrix on the left, a concept we learn about below.
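Here is the general definition of matrix multiplication of an M × N matrix A and an N × p matrix X:
$$
AX =
\begin{pmatrix} a_{1,1} & a_{1,2} & \dots & a_{1,N}\\ \vdots & & & \vdots\\ a_{M,1} & a_{M,2} & \dots & a_{M,N} \end{pmatrix}
\begin{pmatrix} x_{1,1} & \dots & x_{1,p}\\ \vdots & & \vdots\\ x_{N,1} & \dots & x_{N,p} \end{pmatrix}
=
\begin{pmatrix} \sum_{i=1}^N a_{1,i} x_{i,1} & \dots & \sum_{i=1}^N a_{1,i} x_{i,p}\\ \vdots & & \vdots\\ \sum_{i=1}^N a_{M,i} x_{i,1} & \dots & \sum_{i=1}^N a_{M,i} x_{i,p} \end{pmatrix}
$$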
You can only take the product if the number of columns of the first matrix A equals the number of rows of the second one X. Also, the final matrix has the same number of rows as the first matrix A and the same number of columns as the second matrix X. After you study the example below, you may want to come back and re-read the sections above.
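The definition can be checked directly in R. The following is a minimal sketch for illustration, not how R computes products internally: the helper matmul_by_definition is a hypothetical name introduced here, and the matrices A and X are small made-up examples.

```r
# Hypothetical helper: multiply two matrices using the row-by-column definition.
matmul_by_definition <- function(A, X) {
  # The product exists only if ncol(A) equals nrow(X).
  stopifnot(ncol(A) == nrow(X))
  out <- matrix(0, nrow(A), ncol(X))
  for (i in seq_len(nrow(A))) {
    for (j in seq_len(ncol(X))) {
      # Entry (i, j) is the sum over k of A[i, k] * X[k, j].
      out[i, j] <- sum(A[i, ] * X[, j])
    }
  }
  out
}

A <- matrix(1:6, 2, 3)    # 2 x 3
X <- matrix(1:12, 3, 4)   # 3 x 4
P <- matmul_by_definition(A, X)
dim(P)                    # rows come from A, columns come from X
all.equal(P, A %*% X)     # compare with the built-in operator
```

Swapping the arguments, as in matmul_by_definition(X, A), stops with an error because X has 4 columns while A has 2 rows.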
The identity matrix
The identity matrix is analogous to the number 1: if you multiply the identity matrix by another matrix, you get the same matrix. For this to happen, we need it to be like this:
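$$
I = \begin{pmatrix}
1 & 0 & 0 & \dots & 0 & 0\\
0 & 1 & 0 & \dots & 0 & 0\\
0 & 0 & 1 & \dots & 0 & 0\\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\
0 & 0 & 0 & \dots & 1 & 0\\
0 & 0 & 0 & \dots & 0 & 1
\end{pmatrix}
$$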
By this definition, the identity always has to have the same number of rows as columns or be what we call a square matrix.
If you follow the matrix multiplication rule above, you notice this works out:
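$$
XI =
\begin{pmatrix} x_{1,1} & \dots & x_{1,p}\\ \vdots & & \vdots\\ x_{N,1} & \dots & x_{N,p} \end{pmatrix}
\begin{pmatrix} 1 & 0 & \dots & 0\\ 0 & 1 & \dots & 0\\ \vdots & & \ddots & \vdots\\ 0 & 0 & \dots & 1 \end{pmatrix}
=
\begin{pmatrix} x_{1,1} & \dots & x_{1,p}\\ \vdots & & \vdots\\ x_{N,1} & \dots & x_{N,p} \end{pmatrix}
$$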
In R you can form an identity matrix this way:
n <- 5 #pick dimensions
diag(n)
The inverse
The inverse of matrix X, denoted with X⁻¹, has the property that, when multiplied, gives you the identity: X⁻¹X = I. Of course, not all matrices have inverses. For example, a 2 × 2 matrix with 1s in all its entries does not have an inverse.
As we will see when we get to the section on applications to linear models, being able to compute the inverse of a matrix is quite useful. A very convenient aspect of R is that it includes a predefined function solve to do this. Here is how we would use it to solve the linear system of equations:
X <- matrix(c(1,3,2,1,-2,1,1,1,-1),3,3)
solve(X) %*% c(6,2,1) #equivalent to solve(X, c(6,2,1))
Please note that solve is a function that should be used with caution as it is not generally numerically stable. We explain this in much more detail in the QR factorization section.
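As a small sketch of the points above, using the system from the start of this section: solve with two arguments solves the system without forming the inverse explicitly, and a matrix of all 1s makes solve signal an error. The names y, via_inverse, direct, ones and result are introduced here for illustration only.

```r
X <- matrix(c(1,3,2,1,-2,1,1,1,-1),3,3)
y <- c(6,2,1)

# Two ways to obtain (a, b, c): explicit inverse versus solving directly.
via_inverse <- solve(X) %*% y
direct <- solve(X, y)
all.equal(as.vector(via_inverse), direct)
X %*% direct   # multiplying back reproduces the right-hand side

# A 2 x 2 matrix of 1s is singular, so solve() signals an error.
ones <- matrix(1, 2, 2)
result <- tryCatch(solve(ones), error = function(e) "not invertible")
result
```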
Exercises
1. Suppose X is a matrix in R. Which of the following is not equivalent to X?
• A) t( t(X) )
• B) X %*% matrix(1,ncol(X) )
• C) X*1
• D) X%*%diag(ncol(X))
2. Solve the following system of equations using R:
$$
\begin{aligned}
3a + 4b - 5c + d &= 10\\
2a + 2b + 2c - d &= 5\\
a - b + 5c - 5d &= 7\\
5a + d &= 4
\end{aligned}
$$
What is the solution for c?
3. Load the following two matrices into R:
a <- matrix(1:12, nrow=4)
b <- matrix(1:15, nrow=3)
Note the dimension of a and the dimension of b.
In the question below, we will use the matrix multiplication operator in R, %*%, to multiply these two matrices.
What is the value in the 3rd row and the 2nd column of the matrix product of a and b?
4. Multiply the 3rd row of a with the 2nd column of b, using the element-wise vector multiplication with *.
What is the sum of the elements in the resulting vector?
Examples
The R markdown document for this section is available here⁶⁸.
Now we are ready to see how matrix algebra can be useful when analyzing data. We start with some simple examples and eventually arrive at the main one: how to write linear models with matrix algebra notation and solve the least squares problem.
⁶⁸https://github.com/genomicsclass/labs/tree/master/matrixalg/matrix_algebra_examples.Rmd
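To compute the sample average and variance of our data, we use these formulas:
$$
\bar{Y} = \frac{1}{N} \sum_{i=1}^N Y_i
\quad \mbox{and} \quad
\mbox{var}(Y) = \frac{1}{N} \sum_{i=1}^N (Y_i - \bar{Y})^2
$$
We can represent these with matrix multiplication. First, define this N × 1 matrix made just of 1s:
$$
A = \begin{pmatrix} 1\\ 1\\ \vdots\\ 1 \end{pmatrix}
$$
This implies that:
$$
\frac{1}{N} A^\top Y = \frac{1}{N}
\begin{pmatrix} 1 & 1 & \dots & 1 \end{pmatrix}
\begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_N \end{pmatrix}
= \frac{1}{N} \sum_{i=1}^N Y_i = \bar{Y}
$$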
Note that we are multiplying by the scalar 1/N. In R, we multiply matrices using %*%:
library(UsingR)
y <- father.son$sheight
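The formula for the average can be sketched as follows. This is an illustration only: the heights are simulated with rnorm as a stand-in for the real data (an assumption made so the block runs without extra packages, so its numbers will not match the output printed in this section), and the names N, Y, A and barY are chosen to mirror those used in the surrounding code.

```r
set.seed(1)
y <- rnorm(1000, mean = 68, sd = 3)  # simulated heights, not the real data
N <- length(y)
Y <- matrix(y, N, 1)                 # N x 1 matrix of observations
A <- matrix(1, N, 1)                 # N x 1 matrix of 1s
barY <- t(A) %*% Y / N               # (1/N) times the product of t(A) and Y
dim(barY)                            # the result is a 1 x 1 matrix
all.equal(as.vector(barY), mean(y))  # compare with the usual average
```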
As we will see later, multiplying the transpose of a matrix with another is very common in statistics.
In fact, it is so common that there is a function in R:
barY=crossprod(A,Y) / N
print(barY)
## [,1]
## [1,] 68.68407
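For the variance, we note that if:
$$
r \equiv \begin{pmatrix} Y_1 - \bar{Y}\\ \vdots\\ Y_N - \bar{Y} \end{pmatrix},
\quad
\frac{1}{N} r^\top r = \frac{1}{N} \sum_{i=1}^N (Y_i - \bar{Y})^2
$$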
In R, if you only send one matrix into crossprod, it computes r⊤r, so we can simply type:
r <- y - barY
crossprod(r)/N
## [,1]
## [1,] 7.915196
Which is almost equivalent to:
library(rafalib)
popvar(y)
## [1] 7.915196
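One way to see the comparison with base R only is the sketch below, which uses simulated data (an assumption, so that no extra packages are needed). It shows that crossprod(r)/N returns a 1 × 1 matrix and divides by N, whereas base R's var returns a plain number and divides by N − 1. The name popvar_matrix is introduced here for illustration.

```r
set.seed(1)
y <- rnorm(1000, mean = 68, sd = 3)   # simulated heights, not the real data
N <- length(y)
r <- y - mean(y)                      # deviations from the average
popvar_matrix <- crossprod(r) / N     # 1 x 1 matrix holding (1/N) r'r
dim(popvar_matrix)
# var() uses N - 1 in the denominator, so rescale it before comparing.
all.equal(as.vector(popvar_matrix), var(y) * (N - 1) / N)
```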
Linear models
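Now we are ready to put all this to use. Let’s start with Galton’s example. If we define these matrices:
$$
Y = \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_N \end{pmatrix},\quad
X = \begin{pmatrix} 1 & x_1\\ 1 & x_2\\ \vdots & \vdots\\ 1 & x_N \end{pmatrix},\quad
\beta = \begin{pmatrix} \beta_0\\ \beta_1 \end{pmatrix}
\quad \mbox{and} \quad
\varepsilon = \begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_N \end{pmatrix}
$$
Then we can write the model:
$$
Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, \dots, N
$$
as:
$$
Y = X\beta + \varepsilon
$$
which is a much simpler way to write it.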
The least squares equation becomes simpler as well since it is the following cross-product:
(Y− Xβ)⊤(Y− Xβ)
So now we are ready to determine which values of β minimize the above, which we can do using calculus to find the minimum.
Advanced: Finding the minimum using calculus
There is a series of rules that permit us to compute partial derivative equations in matrix notation. The only one we need here tells us that the derivative of the above expression with respect to β is −2X⊤(Y − Xβ). Equating the derivative to 0 and solving for β gives:
$$
2X^\top (Y - X\hat{\beta}) = 0
$$
$$
X^\top X \hat{\beta} = X^\top Y
$$
$$
\hat{\beta} = (X^\top X)^{-1} X^\top Y
$$
and we have our solution. We usually put a hat on the β that solves this, β̂, as it is an estimate of the “real” β that generated the data.
Remember that the least squares criterion is like a square (multiplying something by itself) and that this formula is similar to the derivative of f(x)² being 2f(x)f′(x).
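For readers who want the intermediate step, the derivative can be obtained by expanding the quadratic form. This is a standard derivation written in the notation of this section; it is offered as a sketch and is not necessarily the rule the text has in mind:

```latex
$$
(Y - X\beta)^\top (Y - X\beta)
  = Y^\top Y - 2\beta^\top X^\top Y + \beta^\top X^\top X \beta
$$
$$
\frac{\partial}{\partial \beta}
\left[ Y^\top Y - 2\beta^\top X^\top Y + \beta^\top X^\top X \beta \right]
  = -2X^\top Y + 2X^\top X \beta
  = -2X^\top (Y - X\beta)
$$
```

The middle term uses the fact that Y⊤Xβ is a scalar and therefore equals its transpose β⊤X⊤Y, and the last term uses the symmetry of X⊤X.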
Finding LSE in R
Let’s see how it works in R:
library(UsingR)
x=father.son$fheight
y=father.son$sheight
X <- cbind(1,x)
betahat <- solve( t(X) %*% X ) %*% t(X) %*% y
###or
betahat <- solve( crossprod(X) ) %*% crossprod( X, y )
Now we can see the results of this by computing the estimated β̂0 + β̂1x for any value of x:
newx <- seq(min(x),max(x),len=100)
X <- cbind(1,newx)
fitted <- X%*%betahat
plot(x,y,xlab="Father's height",ylab="Son's height")
lines(newx,fitted,col=2)
Galton’s data with fitted regression line.
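Forming the inverse explicitly is not the only way to evaluate the least squares formula. The sketch below uses simulated father and son heights (made-up parameters, an assumption so the block runs without the UsingR data) and compares three base R routes: the explicit inverse, solve with two arguments applied to the normal equations, and qr.solve, which works from a QR decomposition of X.

```r
set.seed(1)
n <- 500
x <- rnorm(n, mean = 68, sd = 3)        # simulated father heights
y <- 34 + 0.5 * x + rnorm(n, sd = 2)    # simulated son heights
X <- cbind(1, x)

betahat_inv <- solve(crossprod(X)) %*% crossprod(X, y)  # explicit inverse
betahat_ne  <- solve(crossprod(X), crossprod(X, y))     # solve X'X b = X'y
betahat_qr  <- qr.solve(X, y)                           # least squares via QR

all.equal(as.vector(betahat_inv), as.vector(betahat_ne))
all.equal(as.vector(betahat_ne), as.vector(betahat_qr))
```

On a small, well-behaved problem like this one the three routes should agree to within rounding error; differences matter mainly when the columns of X are nearly collinear.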
This β̂ = (X⊤X)⁻¹X⊤Y is one of the most widely used results in data analysis. One of the advantages of this approach is that we can use it in many different situations. For example, in our falling object problem:
g <- 9.8 #meters per second squared
n <- 25
tt <- seq(0,3.4,len=n) #time in secs, t is a base function
d <- 56.67 - 0.5*g*tt^2 + rnorm(n,sd=1)
Notice that we are using almost the same exact code:
X <- cbind(1,tt,tt^2)
y <- d
betahat <- solve(crossprod(X))%*%crossprod(X,y)
newtt <- seq(min(tt),max(tt),len=100)
X <- cbind(1,newtt,newtt^2)
fitted <- X%*%betahat
plot(tt,y,xlab="Time",ylab="Height")
lines(newtt,fitted,col=2)
Fitted parabola to simulated data for distance travelled versus time of falling object measured with error.
And the resulting estimates are what we expect:
betahat
## [,1]
## 56.5317368
## tt 0.5013565
## -5.0386455
The Tower of Pisa is about 56 meters high. Since we are just dropping the object there is no initial velocity, and half the constant of gravity is 9.8/2=4.9 meters per second squared.
The lm Function
R has a very convenient function that fits these models. We will learn more about this function later, but here is a preview:
X <- cbind(tt,tt^2)
summary(lm(y ~ X))
##
## Call:
## lm(formula = y ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5295 -0.4882 0.2537 0.6560 1.5455
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.5317 0.5451 103.701 <2e-16 ***
## Xtt 0.5014 0.7426 0.675 0.507
## X -5.0386 0.2110 -23.884 <2e-16 ***
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9822 on 22 degrees of freedom
## Multiple R-squared: 0.9973, Adjusted R-squared: 0.997
## F-statistic: 4025 on 2 and 22 DF, p-value: < 2.2e-16
Note that we obtain the same values as above.
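The same fit can also be requested with a formula instead of a prebuilt matrix. The sketch below regenerates falling-object data with a fixed seed (the seed is an assumption added here, so the numbers will differ from those printed above) and checks that lm and the matrix formula agree. Wrapping tt^2 in I() makes R treat it as arithmetic rather than as formula syntax.

```r
set.seed(1)                       # assumed seed, for reproducibility only
g <- 9.8
n <- 25
tt <- seq(0, 3.4, length.out = n)
y <- 56.67 - 0.5 * g * tt^2 + rnorm(n, sd = 1)

fit <- lm(y ~ tt + I(tt^2))       # the intercept is added automatically
X <- cbind(1, tt, tt^2)
betahat <- solve(crossprod(X)) %*% crossprod(X, y)

all.equal(unname(coef(fit)), as.vector(betahat))
```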
We have shown how to write linear models using linear algebra. We are going to do this for several examples, many of which are related to designed experiments. We also demonstrated how to obtain least squares estimates. Nevertheless, it is important to remember that because y is a random variable, these estimates are random as well. In a later section, we will learn how to compute standard error for these estimates and use this to perform inference.
Exercises
1. Suppose we are analyzing a set of 4 samples. The first two samples are from a treatment group A and the second two samples are from a treatment group B. This design can be represented with a model matrix like so:
X <- matrix(c(1,1,1,1,0,0,1,1),nrow=4)
rownames(X) <- c("a","a","b","b")
X
## [,1] [,2]
## a 1 0
## a 1 0
## b 1 1
## b 1 1
Suppose that the fitted parameters for a linear model give us:
beta <- c(5, 2)
Use the matrix multiplication operator,%*%, in R to answer the following questions:
What is the fitted value for the A samples? (The fitted Y values.)
2. What is the fitted value for the B samples? (The fitted Y values.)
3. Suppose now we are comparing two treatments B and C to a control group A, each with two samples. This design can be represented with a model matrix like so:
X <- matrix(c(1,1,1,1,1,1,0,0,1,1,0,0,0,0,0,0,1,1),nrow=6)
rownames(X) <- c("a","a","b","b","c","c")
X
Suppose that the fitted parameters for the linear model are given by:
beta <- c(10,3,-3)
What is the fitted value for the B samples?
4. What is the fitted value for the C samples?
Many of the models we use in data analysis can be presented using matrix algebra. We refer to these types of models as linear models. “Linear” here does not refer to lines, but rather to linear combinations. The representations we describe are convenient because we can write models more succinctly and we have the matrix algebra mathematical machinery to facilitate computation. In this chapter, we will describe in some detail how we use matrix algebra to represent and fit linear models.
In this book, we focus on linear models that represent dichotomous groups: treatment versus control, for example. The effect of diet on mice weights is an example of this type of linear model. Here we describe slightly more complicated models, but continue to focus on dichotomous variables.
As we learn about linear models, we need to remember that we are still working with random variables. This means that the estimates we obtain using linear models are also random variables.
Although the mathematics is more complex, the concepts we learned in previous chapters apply here. We begin with some exercises to review the concept of random variables in the context of linear models.
Exercises
The standard error of an estimate is the standard deviation of the sampling distribution of an estimate. In previous chapters, we saw that our estimate of the mean of a population changed depending on the sample that we took from the population. If we repeatedly sampled from the population and each time estimated the mean, the collection of mean estimates would form the sampling distribution of the estimate. When we took the standard deviation of those estimates, that was the standard error of our mean estimate.
In the case of a linear model written as:
Yi = β0+ β1Xi+ εi, i = 1, . . . , n
ε is considered random. Every time we re-run the experiment, we will see a different ε.
This implies that in different applications ε represents different things: measurement error or variability between individuals, for example.
If we were to re-run the experiment many times and estimate the linear model terms β̂ each time, the distribution of these β̂ is called the sampling distribution of the estimates. If we take the standard deviation of all of these estimates from repetitions of the experiment, this is called the standard error of the estimate. While we are not necessarily sampling individuals, you can think about the repetition of the experiment as “sampling” new errors in our observation of Y.
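The idea of a sampling distribution can be sketched with a small simulation. Everything below is illustrative: the true coefficients, the error standard deviation, and the helper name one_slope are assumptions introduced here, and the model is a simple line rather than any of the models used in the exercises.

```r
set.seed(1)
n <- 50
x <- seq(0, 10, length.out = n)   # fixed design, reused in every repetition
X <- cbind(1, x)
beta <- c(2, 0.5)                 # hypothetical true intercept and slope

# One repetition of the experiment: new errors, new slope estimate.
one_slope <- function() {
  y <- X %*% beta + rnorm(n, sd = 1)
  betahat <- solve(crossprod(X)) %*% crossprod(X, y)
  betahat[2]
}

slopes <- replicate(2000, one_slope())
mean(slopes)   # should sit close to the hypothetical true slope
sd(slopes)     # Monte Carlo estimate of the standard error of the slope
```

A histogram of slopes, for example hist(slopes), gives a picture of the sampling distribution itself.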
1. We have shown how to find the least squares estimates with matrix algebra. These estimates are random variables as they are linear combinations of the data. For these estimates to be useful, we also need to compute the standard errors. Here we review standard errors in the context of linear models. To see this, we can run a Monte Carlo simulation to imitate the collection of falling object data. Specifically, we will generate the data repeatedly and compute the estimate for the quadratic term each time.
g = 9.8
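The exercise refers to h0, v0 and a response y whose definitions are not shown here. A minimal sketch of the simulation follows, assuming the same tower height (56.67), zero initial velocity, n = 25 time points and unit error standard deviation as in the falling-object example earlier in the text; these values are assumptions carried over from that example.

```r
g <- 9.8     # meters per second squared
h0 <- 56.67  # assumed initial height, taken from the earlier example
v0 <- 0      # assumed initial velocity (the object is dropped)
n <- 25
tt <- seq(0, 3.4, length.out = n)  # time in seconds
# Observed heights: the physical model plus measurement error from rnorm.
y <- h0 + v0 * tt - 0.5 * g * tt^2 + rnorm(n, sd = 1)
```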
Now we act as if we didn’t know h0, v0 and -0.5*g and use regression to estimate these. We can rewrite the model as y = β0 + β1 t + β2 t² + ε and obtain the LSE we have used in this class. Note that g = −2β2.
To obtain the LSE in R we could write:
X = cbind(1,tt,tt^2)
A = solve(crossprod(X))%*%t(X)
Given how we have defined A, which of the following is the LSE of g, the acceleration due to gravity? Hint: try the code in R.
• A) 9.8
• B) A %*% y
• C) -2 * (A %*% y) [3]
• D) A[3,3]
2. In the lines of code above, the function rnorm introduced randomness. This means that each time the lines of code above are repeated, the estimate of g will be different.
Use the code above in conjunction with the function replicate to generate 100,000 Monte Carlo simulated datasets. For each dataset, compute an estimate of g. (Remember to multiply by -2.)
What is the standard error of this estimate?
3. In the father and son height examples, we have randomness because we have a random sample of father and son pairs. For the sake of illustration, let’s assume that this is the entire population:
library(UsingR)
x = father.son$fheight
y = father.son$sheight
n = length(y)
again. Here is how we obtain one sample:
Use the function replicate to take 10,000 samples.
What is the standard error of the slope estimate? That is, calculate the standard deviation of the estimate from the observed values obtained from many random samples.
4. Later in this chapter we will introduce a new concept: covariance. The covariance of two lists of numbers X = x1, ..., xn and Y = y1, ..., yn is:
n <- 100
Y <- rnorm(n)
X <- rnorm(n)
mean( (Y - mean(Y))*(X-mean(X) ) )
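The expression above divides by n. Base R's cov divides by n − 1 instead, so the two differ by a factor of (n − 1)/n. Here is a small check on simulated data; the variable names manual and builtin are introduced here for illustration.

```r
set.seed(1)
n <- 100
Y <- rnorm(n)
X <- rnorm(n)
manual <- mean((Y - mean(Y)) * (X - mean(X)))   # divides by n
builtin <- cov(X, Y)                            # divides by n - 1
all.equal(manual, builtin * (n - 1) / n)
```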
Which of the following is closest to the covariance between father heights and son heights?
• A) 0
• B) -4
• C) 4
• D) 0.5