SF2940: Probability theory
Lecture 8: Multivariate Normal Distribution
Timo Koski
24.09.2014
Learning outcomes
Random vectors, mean vector, covariance matrix, rules of transformation
Multivariate normal R.V., moment generating functions, characteristic function, rules of transformation
Density of a multivariate normal RV
Joint PDF of bivariate normal RVs
Conditional distributions in a multivariate normal distribution
PART 1: Mean vector, Covariance matrix, MGF, Characteristic function
Vector Notation: Random Vector
A random vector $X$ is a column vector
$$X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{pmatrix} = (X_1, X_2, \ldots, X_n)^T.$$
Each $X_i$ is a random variable.
Sample Value Random Vector
A column vector
$$x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = (x_1, x_2, \ldots, x_n)^T.$$
We can think of $x_i$ as an outcome of $X_i$.
Joint CDF, Joint PDF
The joint CDF (cumulative distribution function) of a continuous random vector $X$ is
$$F_X(x) = F_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = P(X \le x) = P(X_1 \le x_1, \ldots, X_n \le x_n).$$
The joint probability density function (PDF) is
$$f_X(x) = \frac{\partial^n}{\partial x_1 \cdots \partial x_n} F_{X_1,\ldots,X_n}(x_1, \ldots, x_n).$$
Mean Vector
$$\mu_X = E[X] = \begin{pmatrix} E[X_1] \\ E[X_2] \\ \vdots \\ E[X_n] \end{pmatrix},$$
a column vector of means (expectations) of the components of $X$.
Matrix, Scalar Product
If $X^T$ is the transposed column vector (a row vector), then $XX^T$ is an $n \times n$ matrix, and
$$X^T X = \sum_{i=1}^{n} X_i^2$$
is a scalar product, a real-valued R.V.
Covariance Matrix of A Random Vector
The covariance matrix is
$$C_X := E\left[(X - \mu_X)(X - \mu_X)^T\right],$$
where the element $(i, j)$,
$$C_X(i, j) = E\left[(X_i - \mu_i)(X_j - \mu_j)\right],$$
is the covariance of $X_i$ and $X_j$.
A Quadratic Form
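$$x^T C_X x = \sum_{i=1}^{n} \sum_{j=1}^{n} x_i x_j C_X(i, j).$$
We see that
$$x^T C_X x = \sum_{i=1}^{n} \sum_{j=1}^{n} x_i x_j E\left[(X_i - \mu_i)(X_j - \mu_j)\right] = E\left[\sum_{i=1}^{n} \sum_{j=1}^{n} x_i x_j (X_i - \mu_i)(X_j - \mu_j)\right]. \qquad (\ast)$$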
Properties of a Covariance Matrix
A covariance matrix is nonnegative definite, i.e., for all $x$ we have
$$x^T C_X x \ge 0.$$
Hence
$$\det C_X \ge 0.$$
The covariance matrix is symmetric, $C_X = C_X^T$, since
$$C_X(i, j) = E\left[(X_i - \mu_i)(X_j - \mu_j)\right] = E\left[(X_j - \mu_j)(X_i - \mu_i)\right] = C_X(j, i).$$
Properties of a Covariance Matrix
A covariance matrix is positive definite, i.e., $x^T C_X x > 0$ for all $x \ne 0$, iff
$$\det C_X > 0$$
(i.e., $C_X$ is invertible).
Properties of a Covariance Matrix
Proposition
$x^T C_X x \ge 0$ for all $x$.
Pf: By (∗) above,
$$x^T C_X x = x^T E\left[(X - \mu_X)(X - \mu_X)^T\right] x = E\left[x^T (X - \mu_X)(X - \mu_X)^T x\right] = E\left[x^T w \cdot w^T x\right],$$
where we have set $w = X - \mu_X$. Then by linear algebra $x^T w = w^T x = \sum_{i=1}^{n} w_i x_i$. Hence
$$E\left[x^T w w^T x\right] = E\left[\left(\sum_{i=1}^{n} w_i x_i\right)^2\right] \ge 0.$$
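The identity used in the proof can be checked numerically. The sketch below is only an illustration under assumptions that are not part of the lecture: the toy data, the helper names `mean_vector`, `covariance_matrix` and `quad_form`, and the use of the empirical covariance matrix (dividing by the number of samples). It compares $x^T C x$ with the empirical variance of the scalar $x^T X$.

```python
import random

def mean_vector(samples):
    """Componentwise average of a list of equal-length vectors."""
    n = len(samples[0])
    return [sum(s[i] for s in samples) / len(samples) for i in range(n)]

def covariance_matrix(samples):
    """Empirical covariance matrix (divides by the number of samples)."""
    mu = mean_vector(samples)
    n, m = len(mu), len(samples)
    return [[sum((s[i] - mu[i]) * (s[j] - mu[j]) for s in samples) / m
             for j in range(n)] for i in range(n)]

def quad_form(x, C):
    """x^T C x written as the double sum over i and j."""
    n = len(x)
    return sum(x[i] * x[j] * C[i][j] for i in range(n) for j in range(n))

rng = random.Random(0)
# Hypothetical 3-dimensional data with dependent components.
samples = []
for _ in range(500):
    u, v = rng.gauss(0, 1), rng.gauss(0, 1)
    samples.append([u, u + 0.5 * v, v - u])

C = covariance_matrix(samples)
x = [1.0, -2.0, 0.5]

# Empirical variance of the scalar x^T X, computed directly.
proj = [sum(xi * si for xi, si in zip(x, s)) for s in samples]
proj_mean = sum(proj) / len(proj)
var_proj = sum((p - proj_mean) ** 2 for p in proj) / len(proj)

q = quad_form(x, C)  # agrees with var_proj up to rounding, hence >= 0
```

Because the quadratic form is a variance, it cannot be negative, whatever vector `x` is chosen.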
Properties of a Covariance Matrix
In terms of the entries $c_{i,j}$ of a covariance matrix $C = (c_{i,j})_{i=1,j=1}^{n,n}$ there are the following necessary properties.
1. $c_{i,j} = c_{j,i}$ (symmetry).
2. $c_{i,i} = \mathrm{Var}(X_i) = \sigma_i^2 \ge 0$ (the elements in the main diagonal are the variances, and thus all elements in the main diagonal are nonnegative).
3. $c_{i,j}^2 \le c_{i,i} \cdot c_{j,j}$ (Cauchy-Schwarz inequality).
Coefficient of Correlation
The coefficient of correlation $\rho$ of $X$ and $Y$ is defined as
$$\rho := \rho_{X,Y} := \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X) \cdot \mathrm{Var}(Y)}},$$
where $\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$. This is normalized:
$$-1 \le \rho_{X,Y} \le 1.$$
For random variables $X$ and $Y$, $\mathrm{Cov}(X, Y) = 0$ (equivalently $\rho_{X,Y} = 0$) does not always mean that $X$, $Y$ are independent.
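A standard counterexample makes the last remark concrete. The choice below ($X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$) is an illustrative assumption, not taken from the lecture; the computation uses exact fractions and the identity $\mathrm{Cov}(X, Y) = E[XY] - E[X]E[Y]$.

```python
from fractions import Fraction

# Assumed example: X is uniform on {-1, 0, 1} and Y = X**2.
support = [-1, 0, 1]
p = Fraction(1, 3)

EX = sum(p * x for x in support)            # E[X]
EY = sum(p * x**2 for x in support)         # E[Y] = E[X^2]
EXY = sum(p * x * x**2 for x in support)    # E[XY] = E[X^3]
cov = EXY - EX * EY                         # Cov(X, Y) = E[XY] - E[X]E[Y]

# Independence would require P(X=0, Y=0) = P(X=0) * P(Y=0).
p_joint = p          # X = 0 forces Y = 0, so the joint probability is 1/3
p_product = p * p    # P(X=0) = 1/3 and P(Y=0) = 1/3
```

Here the covariance vanishes while the joint probability differs from the product of the marginals, so $X$ and $Y$ are uncorrelated but dependent.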
Special case: Covariance Matrix of A Bivariate Vector
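Let $X = (X_1, X_2)^T$. Then
$$C_X = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix},$$
where $\rho$ is the coefficient of correlation of $X_1$ and $X_2$, and $\sigma_1^2 = \mathrm{Var}(X_1)$, $\sigma_2^2 = \mathrm{Var}(X_2)$. $C_X$ is invertible iff $\rho^2 \ne 1$; for proof we note that
$$\det C_X = \sigma_1^2 \sigma_2^2 \left(1 - \rho^2\right).$$
Special case: Covariance Matrix of A Bivariate Vector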
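$$\Lambda = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}.$$
If $\rho^2 \ne 1$, the inverse exists and
$$\Lambda^{-1} = \frac{1}{\sigma_1^2 \sigma_2^2 (1 - \rho^2)} \begin{pmatrix} \sigma_2^2 & -\rho\sigma_1\sigma_2 \\ -\rho\sigma_1\sigma_2 & \sigma_1^2 \end{pmatrix}.$$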
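The inverse formula can be sanity-checked by multiplying it with $\Lambda$. This is a minimal sketch; the function names and the parameter values $\sigma_1 = 2$, $\sigma_2 = 0.5$, $\rho = -0.3$ are arbitrary assumptions made for illustration.

```python
def bivariate_cov(s1, s2, rho):
    """Covariance matrix of a bivariate vector with std devs s1, s2."""
    return [[s1**2, rho * s1 * s2], [rho * s1 * s2, s2**2]]

def bivariate_cov_inverse(s1, s2, rho):
    """Closed-form inverse; only valid when rho**2 != 1."""
    if rho**2 == 1:
        raise ValueError("covariance matrix is singular when rho**2 == 1")
    k = 1.0 / (s1**2 * s2**2 * (1 - rho**2))
    return [[k * s2**2, -k * rho * s1 * s2],
            [-k * rho * s1 * s2, k * s1**2]]

def matmul2(A, B):
    """Product of two 2x2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

s1, s2, rho = 2.0, 0.5, -0.3           # assumed example values
Lam = bivariate_cov(s1, s2, rho)
Lam_inv = bivariate_cov_inverse(s1, s2, rho)
P = matmul2(Lam, Lam_inv)              # should be close to the identity
det = Lam[0][0] * Lam[1][1] - Lam[0][1] * Lam[1][0]
det_formula = s1**2 * s2**2 * (1 - rho**2)
```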
Y = BX + b
Proposition
$X$ is a random vector with mean vector $\mu_X$ and covariance matrix $C_X$. $B$ is an $m \times n$ matrix. If $Y = BX + b$, then
$$E[Y] = B\mu_X + b, \qquad C_Y = B C_X B^T.$$
Pf: For simplicity of writing, take $b = 0$ and $\mu_X = 0$. Then
$$C_Y = E\left[YY^T\right] = E\left[BX(BX)^T\right] = E\left[BXX^TB^T\right] = B\,E\left[XX^T\right]B^T = B C_X B^T.$$
Moment Generating and Characteristic Functions
Definition
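The moment generating function of $X$ is defined as
$$\psi_X(t) \overset{\text{def}}{=} E\left[e^{t^T X}\right] = E\left[e^{t_1 X_1 + t_2 X_2 + \cdots + t_n X_n}\right].$$
Definition
The characteristic function of $X$ is defined as
$$\varphi_X(t) \overset{\text{def}}{=} E\left[e^{i t^T X}\right] = E\left[e^{i(t_1 X_1 + t_2 X_2 + \cdots + t_n X_n)}\right].$$
Special case: take $t_2 = t_3 = \ldots = t_n = 0$, i.e., $t = (t_1, 0, \ldots, 0)^T$; then
$$\varphi_X(t) = \varphi_{X_1}(t_1).$$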
PART 2: Def I of a multivariate normal distribution
We recall first some of the properties of univariate normal distribution
Normal (Gaussian) One-dimensional RVs
$X$ is a normal random variable if
$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2\sigma^2}(x - \mu)^2},$$
where $\mu$ is real and $\sigma > 0$.
Notation: $X \in N(\mu, \sigma^2)$.
Properties: $E(X) = \mu$, $\mathrm{Var}(X) = \sigma^2$.
Normal (Gaussian) One-dimensional RVs
[Figure: the density $f_X(x)$ plotted against $x$ for (a) $\mu = 2$, $\sigma = 1/2$ and (b) $\mu = 2$, $\sigma = 2$.]
Linear Transformation
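$X \in N(\mu_X, \sigma_X^2) \Rightarrow Y = aX + b$ is $N(a\mu_X + b, a^2\sigma_X^2)$. Thus
$$Z = \frac{X - \mu_X}{\sigma_X} \in N(0, 1)$$
and
$$P(X \le x) = P\left(\frac{X - \mu_X}{\sigma_X} \le \frac{x - \mu_X}{\sigma_X}\right),$$
or
$$F_X(x) = P\left(Z \le \frac{x - \mu_X}{\sigma_X}\right) = \Phi\left(\frac{x - \mu_X}{\sigma_X}\right),$$
where $\Phi$ denotes the CDF of $N(0, 1)$.
Normal (Gaussian) One-dimensional RVs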
If $X \in N(\mu, \sigma^2)$, then the moment generating function is
$$\psi_X(t) = E\left[e^{tX}\right] = e^{t\mu + \frac{1}{2}t^2\sigma^2},$$
and the characteristic function is
$$\varphi_X(t) = E\left[e^{itX}\right] = e^{it\mu - \frac{1}{2}t^2\sigma^2},$$
as found in previous lectures.
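As a plausibility check of the moment generating function, one can integrate $e^{tx} f_X(x)$ numerically and compare with the closed form. The sketch below assumes a plain trapezoidal rule on $\mu \pm 12\sigma$ and the example values $\mu = 2$, $\sigma = 1/2$, $t = 0.7$; the function names are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

def mgf_numeric(t, mu, sigma, half_width=12.0, steps=20000):
    """Trapezoidal approximation of E[exp(tX)] over mu +/- half_width*sigma."""
    a = mu - half_width * sigma
    b = mu + half_width * sigma
    h = (b - a) / steps
    total = 0.0
    for k in range(steps + 1):
        x = a + k * h
        w = 0.5 if k in (0, steps) else 1.0   # end points get half weight
        total += w * math.exp(t * x) * normal_pdf(x, mu, sigma)
    return total * h

def mgf_closed(t, mu, sigma):
    """Closed form exp(t*mu + t^2*sigma^2/2)."""
    return math.exp(t * mu + 0.5 * t**2 * sigma**2)

mu, sigma, t = 2.0, 0.5, 0.7      # assumed example values
approx = mgf_numeric(t, mu, sigma)
exact = mgf_closed(t, mu, sigma)
```

The truncation to a finite interval is harmless here because the integrand is itself a Gaussian bump centred well inside the interval.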
Multivariate Normal Def. I
Definition
An $n \times 1$ random vector $X$ has a normal distribution iff for every $n \times 1$ vector $a$ the one-dimensional random variable $a^T X$ has a normal distribution.
We write $X \in N(\mu, \Lambda)$, when $\mu$ is the mean vector and $\Lambda$ is the covariance matrix.
Consequences of Def. I (1)
An $n \times 1$ vector $X \in N(\mu, \Lambda)$ iff the one-dimensional random variable $a^T X$ has a normal distribution for every $n$-vector $a$.
Now we know that (take $B = a^T$ in the preceding)
$$E\left[a^T X\right] = a^T \mu, \qquad \mathrm{Var}\left[a^T X\right] = a^T \Lambda a.$$
Consequences of Def. I (2)
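Hence, if $Y = a^T X$, then $Y \in N\left(a^T\mu, a^T\Lambda a\right)$ and the moment generating function of $Y$ is
$$\psi_Y(t) = E\left[e^{tY}\right] = e^{t a^T\mu + \frac{1}{2}t^2 a^T\Lambda a}.$$
Therefore
$$\psi_X(a) = E\left[e^{a^T X}\right] = \psi_Y(1) = e^{a^T\mu + \frac{1}{2}a^T\Lambda a}.$$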
Consequences of Def. I (3)
Hence we have shown that if $X \in N(\mu, \Lambda)$, then
$$\psi_X(t) = E\left[e^{t^T X}\right] = e^{t^T\mu + \frac{1}{2}t^T\Lambda t}$$
is the moment generating function of $X$.
Consequences of Def. I (4)
In the same way we can find that
$$\varphi_X(t) = E\left[e^{i t^T X}\right] = e^{i t^T\mu - \frac{1}{2}t^T\Lambda t}$$
is the characteristic function of $X \in N(\mu, \Lambda)$.
Consequences of Def. I (5)
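Let $\Lambda$ be a diagonal covariance matrix with $\lambda_i^2$ on the main diagonal, i.e.,
$$\Lambda = \begin{pmatrix} \lambda_1^2 & 0 & 0 & \cdots & 0 \\ 0 & \lambda_2^2 & 0 & \cdots & 0 \\ 0 & 0 & \lambda_3^2 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & \lambda_n^2 \end{pmatrix}.$$
Proposition
If $X \in N(\mu, \Lambda)$ with $\Lambda$ diagonal as above, then $X_1, X_2, \ldots, X_n$ are independent normal variables.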
Consequences of Def. I (6)
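Pf: Since $\Lambda$ is diagonal, the quadratic form becomes a single sum of squares:
$$\varphi_X(t) = e^{i t^T\mu - \frac{1}{2}t^T\Lambda t} = e^{i\sum_{k=1}^{n}\mu_k t_k - \frac{1}{2}\sum_{k=1}^{n}\lambda_k^2 t_k^2} = e^{i\mu_1 t_1 - \frac{1}{2}\lambda_1^2 t_1^2}\, e^{i\mu_2 t_2 - \frac{1}{2}\lambda_2^2 t_2^2} \cdots e^{i\mu_n t_n - \frac{1}{2}\lambda_n^2 t_n^2}.$$
This is the product of the characteristic functions of $X_k \in N\left(\mu_k, \lambda_k^2\right)$, which are thus seen to be independent $N\left(\mu_k, \lambda_k^2\right)$.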
Kac’s theorem: Thm 8.1.3. in LN
Theorem
Let $X = (X_1, X_2, \cdots, X_n)'$, where $'$ denotes the transpose. The components $X_1, X_2, \cdots, X_n$ are independent if and only if
$$\phi_X(s) = E\left[e^{i s' X}\right] = \prod_{i=1}^{n} \phi_{X_i}(s_i),$$
where $\phi_{X_i}(s_i)$ is the characteristic function of $X_i$.
Further properties of the multivariate normal
Let $X \in N(\mu, \Lambda)$.
Every component $X_k$ is one-dimensional normal. To prove this we take
$$a = (0, 0, \ldots, \underbrace{1}_{\text{position } k}, 0, \ldots, 0)^T$$
and the conclusion follows by Def. I.
$X_1 + X_2 + \cdots + X_n$ is one-dimensional normal (take $a = (1, 1, \ldots, 1)^T$ in Def. I). Note: The terms in the sum need not be independent.
Properties of multivariate normal
Let $X \in N(\mu, \Lambda)$.
Every marginal distribution of $k$ variables ($1 \le k < n$) is normal. To prove this we consider any $k$ variables $X_{i_1}, X_{i_2}, \ldots, X_{i_k}$, take $a$ such that $a_j = 0$ for $j \ne i_1, \ldots, i_k$, and then apply Def. I.
Properties of multivariate normal
Proposition
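$X \in N(\mu, \Lambda)$ and $Y = BX + b$. Then
$$Y \in N\left(B\mu + b, B\Lambda B^T\right).$$
Pf:
$$\psi_Y(s) = E\left[e^{s^T Y}\right] = E\left[e^{s^T(b + BX)}\right] = e^{s^T b}\, E\left[e^{s^T BX}\right] = e^{s^T b}\, E\left[e^{(B^T s)^T X}\right],$$
and
$$E\left[e^{(B^T s)^T X}\right] = \psi_X\left(B^T s\right).$$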
Properties of multivariate normal
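Since $X \in N(\mu, \Lambda)$,
$$\psi_X\left(B^T s\right) = e^{(B^T s)^T\mu + \frac{1}{2}(B^T s)^T\Lambda(B^T s)}.$$
Here
$$\left(B^T s\right)^T\mu = s^T B\mu, \qquad \left(B^T s\right)^T\Lambda\left(B^T s\right) = s^T B\Lambda B^T s,$$
so that
$$\psi_X\left(B^T s\right) = e^{s^T B\mu + \frac{1}{2}s^T B\Lambda B^T s}.$$
Hence
$$\psi_Y(s) = e^{s^T b}\,\psi_X\left(B^T s\right) = e^{s^T b}\, e^{s^T B\mu + \frac{1}{2}s^T B\Lambda B^T s} = e^{s^T(b + B\mu) + \frac{1}{2}s^T B\Lambda B^T s},$$
which proves the claim as asserted.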
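A small worked instance of the rule $Y \in N(B\mu + b, B\Lambda B^T)$ is sketched below. All numbers ($\mu$, $\Lambda$, $B$, $b$) and the helper names are assumptions chosen for illustration; the first row of $B$ sums the components, so its variance is the sum of all entries of $\Lambda$.

```python
def matvec(M, x):
    """Matrix-vector product M x."""
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in M]

def matmul(A, B):
    """Matrix product A B for nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    """Transpose of a nested-list matrix."""
    return [list(col) for col in zip(*A)]

def affine_params(B, b, mu, Lam):
    """Mean vector and covariance matrix of Y = B X + b."""
    mean = [m + bi for m, bi in zip(matvec(B, mu), b)]
    cov = matmul(matmul(B, Lam), transpose(B))
    return mean, cov

# Assumed example: a 3-dimensional X mapped to a 2-dimensional Y.
mu = [1.0, -1.0, 0.0]
Lam = [[2.0, 0.5, 0.0],
       [0.5, 1.0, 0.3],
       [0.0, 0.3, 1.5]]
B = [[1.0, 1.0, 1.0],    # Y_1 = X_1 + X_2 + X_3
     [1.0, -1.0, 0.0]]   # Y_2 = X_1 - X_2 + 2 (after adding b)
b = [0.0, 2.0]

mean_Y, cov_Y = affine_params(B, b, mu, Lam)
# By hand: mean_Y = (0, 4), cov_Y = [[6.1, 0.7], [0.7, 2.0]]
```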
PART 3: Multivariate normal, Def. II: characteristic function, Def. III: density
Multivariate normal, Def. II: char. fnctn
Definition
A random vector $X$ with mean vector $\mu$ and a covariance matrix $\Lambda$ is $N(\mu, \Lambda)$ if its characteristic function is
$$\varphi_X(t) = E\left[e^{i t^T X}\right] = e^{i t^T\mu - \frac{1}{2}t^T\Lambda t}.$$
Multivariate normal, Def. II implies Def. I
We need to show that the one-dimensional random variable $Y = a^T X$ has a normal distribution.
$$\varphi_Y(t) = E\left[e^{itY}\right] = E\left[e^{it\sum_{k=1}^{n} a_k X_k}\right] = E\left[e^{i t a^T X}\right] = \varphi_X(ta) = e^{i t a^T\mu - \frac{1}{2}t^2 a^T\Lambda a},$$
and this is the characteristic function of $N\left(a^T\mu, a^T\Lambda a\right)$.
Multivariate normal, Def. III: joint PDF
Definition
A random vector $X$ with mean vector $\mu$ and an invertible covariance matrix $\Lambda$ is $N(\mu, \Lambda)$ if the density is
$$f_X(x) = \frac{1}{(2\pi)^{n/2}\sqrt{\det(\Lambda)}}\, e^{-\frac{1}{2}(x - \mu)^T\Lambda^{-1}(x - \mu)}.$$
Multivariate normal
It can be checked by a computation (complete the square) that
$$e^{i t^T\mu - \frac{1}{2}t^T\Lambda t} = \int_{\mathbb{R}^n} e^{i t^T x}\, \frac{1}{(2\pi)^{n/2}\sqrt{\det(\Lambda)}}\, e^{-\frac{1}{2}(x - \mu)^T\Lambda^{-1}(x - \mu)}\, dx.$$
Hence Def. III implies the property in Def. II. The three definitions are equivalent in the case where the inverse of the covariance matrix exists.
PART 4: Bivariate normal with density
Multivariate Normal: the bivariate case
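As soon as $\rho^2 \ne 1$, the matrix
$$\Lambda = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}$$
is invertible, and the inverse is
$$\Lambda^{-1} = \frac{1}{\sigma_1^2 \sigma_2^2 (1 - \rho^2)} \begin{pmatrix} \sigma_2^2 & -\rho\sigma_1\sigma_2 \\ -\rho\sigma_1\sigma_2 & \sigma_1^2 \end{pmatrix}.$$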
Multivariate Normal: the bivariate case
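If $\rho^2 \ne 1$ and $X = (X_1, X_2)^T$, then
$$f_X(x) = \frac{1}{2\pi\sqrt{\det\Lambda}}\, e^{-\frac{1}{2}(x - \mu_X)^T\Lambda^{-1}(x - \mu_X)} = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - \rho^2}}\, e^{-\frac{1}{2}Q(x_1, x_2)},$$
where
$$Q(x_1, x_2) = \frac{1}{1 - \rho^2}\left[\left(\frac{x_1 - \mu_1}{\sigma_1}\right)^2 - \frac{2\rho(x_1 - \mu_1)(x_2 - \mu_2)}{\sigma_1\sigma_2} + \left(\frac{x_2 - \mu_2}{\sigma_2}\right)^2\right].$$
To verify this, invert the matrix $\Lambda$ and expand the quadratic form.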
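The two expressions for the bivariate density can be compared numerically. The following is a minimal sketch with hypothetical function names and arbitrarily chosen parameter values; it evaluates the density once through $Q(x_1, x_2)$ and once through the matrix form with the explicit $2 \times 2$ inverse.

```python
import math

def Q(x1, x2, mu1, mu2, s1, s2, rho):
    """Quadratic form in the exponent of the bivariate normal density."""
    z1 = (x1 - mu1) / s1
    z2 = (x2 - mu2) / s2
    return (z1**2 - 2 * rho * z1 * z2 + z2**2) / (1 - rho**2)

def pdf_via_Q(x1, x2, mu1, mu2, s1, s2, rho):
    """Density written with Q(x1, x2)."""
    norm = 2 * math.pi * s1 * s2 * math.sqrt(1 - rho**2)
    return math.exp(-0.5 * Q(x1, x2, mu1, mu2, s1, s2, rho)) / norm

def pdf_via_matrix(x1, x2, mu1, mu2, s1, s2, rho):
    """Density written with det(Lambda) and the explicit inverse of Lambda."""
    det = s1**2 * s2**2 * (1 - rho**2)
    inv = [[s2**2 / det, -rho * s1 * s2 / det],
           [-rho * s1 * s2 / det, s1**2 / det]]
    d = [x1 - mu1, x2 - mu2]
    quad = sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))

params = (1.0, -0.5, 1.5, 0.8, 0.9)   # assumed mu1, mu2, s1, s2, rho
p_q = pdf_via_Q(0.3, -0.2, *params)
p_m = pdf_via_matrix(0.3, -0.2, *params)
# Standard case rho = 0, unit variances, evaluated at the mean: 1 / (2*pi).
p_std = pdf_via_Q(0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0)
```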
[Figures: surface plots of the bivariate normal density for $\rho = 0$, $\rho = 0.9$ and $\rho = -0.9$; both horizontal axes run from −3 to 3 and the vertical axis from 0 to 0.35.]
Conditional densities for the bivariate normal
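Complete the square in the exponent to write $f_{X,Y}(x, y) = f_X(x)\, f_{Y|X}(y)$, where
$$f_X(x) = \frac{1}{\sigma_1\sqrt{2\pi}}\, e^{-\frac{1}{2\sigma_1^2}(x - \mu_1)^2}, \qquad f_{Y|X}(y) = \frac{1}{\tilde\sigma_2\sqrt{2\pi}}\, e^{-\frac{1}{2\tilde\sigma_2^2}(y - \tilde\mu_2(x))^2},$$
$$\tilde\mu_2(x) = \mu_2 + \rho\frac{\sigma_2}{\sigma_1}(x - \mu_1), \qquad \tilde\sigma_2 = \sigma_2\sqrt{1 - \rho^2}.$$
Bivariate normal properties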
$E(X) = \mu_1$.
Given $X = x$, $Y$ is Gaussian.
Conditional mean of $Y$ given $X = x$:
$$\tilde\mu_2(x) = \mu_2 + \rho\frac{\sigma_2}{\sigma_1}(x - \mu_1) = E(Y \mid X = x).$$
Conditional variance of $Y$ given $X = x$:
$$\mathrm{Var}(Y \mid X = x) = \sigma_2^2\left(1 - \rho^2\right).$$
Check Section 3.7.3 and Exercise 3.8.4.6. From these it is seen that the conditional mean of $Y$ given $X$ in a bivariate normal distribution is also the best linear predictor of $Y$ based on $X$, and the conditional variance is the variance of the estimation error.
Marginal PDFs
Proof of conditional pdf
Consider
$$\frac{f_{X,Y}(x, y)}{f_X(x)} = \frac{\sigma_1\sqrt{2\pi}}{2\pi\sigma_1\sigma_2\sqrt{1 - \rho^2}}\, e^{-\frac{1}{2}Q(x, y) + \frac{1}{2\sigma_1^2}(x - \mu_1)^2}.$$
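Write the exponent as
$$-\frac{1}{2}Q(x, y) + \frac{1}{2\sigma_1^2}(x - \mu_1)^2 = -\frac{1}{2}H(x, y),$$
where
$$H(x, y) = \frac{1}{1 - \rho^2}\left[\left(\frac{x - \mu_1}{\sigma_1}\right)^2 - \frac{2\rho(x - \mu_1)(y - \mu_2)}{\sigma_1\sigma_2} + \left(\frac{y - \mu_2}{\sigma_2}\right)^2\right] - \left(\frac{x - \mu_1}{\sigma_1}\right)^2.$$
Collecting the terms in $(x - \mu_1)^2$, and using $\frac{1}{1 - \rho^2} - 1 = \frac{\rho^2}{1 - \rho^2}$, gives
$$H(x, y) = \frac{\rho^2}{1 - \rho^2}\,\frac{(x - \mu_1)^2}{\sigma_1^2} - \frac{2\rho(x - \mu_1)(y - \mu_2)}{\sigma_1\sigma_2(1 - \rho^2)} + \frac{(y - \mu_2)^2}{\sigma_2^2(1 - \rho^2)},$$
which is a perfect square:
$$H(x, y) = \frac{\left(y - \mu_2 - \rho\frac{\sigma_2}{\sigma_1}(x - \mu_1)\right)^2}{\sigma_2^2(1 - \rho^2)}.$$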
Conditional pdf
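$$\frac{f_{X,Y}(x, y)}{f_X(x)} = \frac{1}{\sqrt{1 - \rho^2}\,\sigma_2\sqrt{2\pi}}\, \exp\left(-\frac{1}{2}\,\frac{\left(y - \mu_2 - \rho\frac{\sigma_2}{\sigma_1}(x - \mu_1)\right)^2}{\sigma_2^2(1 - \rho^2)}\right).$$
This establishes the bivariate normal properties claimed above.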
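The factorization $f_{X,Y}(x, y) = f_X(x)\, f_{Y|X}(y)$ can also be confirmed at a sample point. The sketch below assumes arbitrary parameter values and hypothetical helper names; it multiplies the marginal density of $X$ by a normal density with the conditional mean and standard deviation derived above.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

def joint_pdf(x, y, mu1, mu2, s1, s2, rho):
    """Bivariate normal density written with the quadratic form Q."""
    z1 = (x - mu1) / s1
    z2 = (y - mu2) / s2
    q = (z1**2 - 2 * rho * z1 * z2 + z2**2) / (1 - rho**2)
    return math.exp(-0.5 * q) / (2 * math.pi * s1 * s2 * math.sqrt(1 - rho**2))

def cond_mean(x, mu1, mu2, s1, s2, rho):
    """E(Y | X = x) = mu2 + rho * (s2 / s1) * (x - mu1)."""
    return mu2 + rho * s2 / s1 * (x - mu1)

def cond_sd(s2, rho):
    """Standard deviation of Y given X = x (does not depend on x)."""
    return s2 * math.sqrt(1 - rho**2)

mu1, mu2, s1, s2, rho = 0.5, -1.0, 2.0, 1.5, -0.7   # assumed example values
x, y = 1.2, -0.4

lhs = joint_pdf(x, y, mu1, mu2, s1, s2, rho)
rhs = normal_pdf(x, mu1, s1) * normal_pdf(
    y, cond_mean(x, mu1, mu2, s1, s2, rho), cond_sd(s2, rho))
```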
Bivariate normal properties : ρ
Proposition
If $(X, Y)$ is bivariate normal, then the parameter $\rho$ equals the coefficient of correlation: $\rho = \rho_{X,Y}$.
Proof: By conditioning on $X$ (double expectation),
$$E[(X - \mu_1)(Y - \mu_2)] = E\left(E\left[(X - \mu_1)(Y - \mu_2) \mid X\right]\right) = E\left((X - \mu_1)\, E\left[Y - \mu_2 \mid X\right]\right) = E\left((X - \mu_1)\left[E(Y \mid X) - \mu_2\right]\right)$$
$$= E\left((X - \mu_1)\left[\mu_2 + \rho\frac{\sigma_2}{\sigma_1}(X - \mu_1) - \mu_2\right]\right) = \rho\frac{\sigma_2}{\sigma_1}\, E\left[(X - \mu_1)^2\right] = \rho\frac{\sigma_2}{\sigma_1}\,\sigma_1^2 = \rho\sigma_2\sigma_1.$$
In other words we have checked that
$$\rho = \frac{E[(X - \mu_1)(Y - \mu_2)]}{\sigma_2\sigma_1}.$$
$\rho = 0 \Leftrightarrow$ bivariate normal $X$, $Y$ are independent.
PART 5: Generating a multivariate normal variable
Standard Normal Vector: definition
$Z \in N(0, I)$ is a standard normal vector.
$I$ is the $n \times n$ identity matrix.
$$f_Z(z) = \frac{1}{(2\pi)^{n/2}\sqrt{\det(I)}}\, e^{-\frac{1}{2}(z - 0)^T I^{-1}(z - 0)} = \frac{1}{(2\pi)^{n/2}}\, e^{-\frac{1}{2}z^T z}.$$
Distribution of X = AZ + b
If $X = AZ + b$ and $Z$ is standard Gaussian, then $X \in N\left(b, AA^T\right)$ (this follows by the rule for $Y = BX + b$ in the preceding).
Multivariate Normal: the bivariate case
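If
$$\Lambda = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix},$$
then $\Lambda = AA^T$, where
$$A = \begin{pmatrix} \sigma_1 & 0 \\ \rho\sigma_2 & \sigma_2\sqrt{1 - \rho^2} \end{pmatrix}.$$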
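The factor $A$ gives a direct recipe for simulating a bivariate normal vector as $X = AZ + \mu$ with $Z$ standard normal. The sketch below is one possible illustration using only the standard library; the function names, the seed, the sample size and the parameter values are all assumptions.

```python
import math
import random

def factor_2x2(s1, s2, rho):
    """Lower triangular A with A A^T equal to the bivariate covariance matrix."""
    return [[s1, 0.0], [rho * s2, s2 * math.sqrt(1 - rho**2)]]

def sample_bivariate(mu, A, rng):
    """One draw of X = A Z + mu, where Z has independent N(0, 1) components."""
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    return (mu[0] + A[0][0] * z1 + A[0][1] * z2,
            mu[1] + A[1][0] * z1 + A[1][1] * z2)

s1, s2, rho = 1.0, 2.0, 0.6     # assumed example values
mu = (1.0, -1.0)
A = factor_2x2(s1, s2, rho)

# A A^T should reproduce [[s1^2, rho*s1*s2], [rho*s1*s2, s2^2]].
AAT = [[sum(A[i][k] * A[j][k] for k in range(2)) for j in range(2)]
       for i in range(2)]

rng = random.Random(2014)
xs = [sample_bivariate(mu, A, rng) for _ in range(20000)]

# Empirical correlation of the simulated pairs.
n = len(xs)
m1 = sum(x for x, _ in xs) / n
m2 = sum(y for _, y in xs) / n
c11 = sum((x - m1) ** 2 for x, _ in xs) / n
c22 = sum((y - m2) ** 2 for _, y in xs) / n
c12 = sum((x - m1) * (y - m2) for x, y in xs) / n
rho_hat = c12 / math.sqrt(c11 * c22)   # should be close to rho
```

The empirical correlation only approximates $\rho$; the agreement improves as the number of simulated pairs grows.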
Standard Normal Vector
$X \in N(\mu_X, \Lambda)$, and $A$ is such that $\Lambda = AA^T$. (An invertible matrix $A$ with this property always exists if $\Lambda$ is positive definite; we need the symmetry of $\Lambda$, too.) Then
$$Z = A^{-1}(X - \mu_X)$$
is a standard Gaussian vector.
Proof: We give the first idea of this proof, a rule of transformation.
Rule of transformation
If $X$ has density $f_X(x)$, $Y = AX + b$, and $A$ is invertible, then
$$f_Y(y) = \frac{1}{|\det A|}\, f_X\left(A^{-1}(y - b)\right).$$
Note that if $\Lambda = AA^T$, then
$$\det\Lambda = \det A \cdot \det A^T = \det A \cdot \det A = (\det A)^2,$$
so that $|\det A| = \sqrt{\det\Lambda}$.
Johann Carl Friedrich Gauss (30 April 1777 - 23 February 1855)
Diagonalizable Matrices
An $n \times n$ matrix $A$ is orthogonally diagonalizable if there is an orthogonal matrix $P$ (i.e., $P^T P = PP^T = I$) such that
$$P^T A P = \Lambda,$$
where $\Lambda$ is a diagonal matrix.
Diagonalizable Matrices
Theorem
If A is a real n × n matrix, then the following are equivalent:
(i) A is orthogonally diagonalizable.
(ii) A has an orthonormal set of eigenvectors.
(iii) A is symmetric.
Since covariance matrices are symmetric, we have by the theorem above
that all covariance matrices are orthogonally diagonalizable.
Diagonalizable Matrices
Theorem
If A is a symmetric matrix, then
(i) Eigenvalues of A are all real numbers.
(ii) Eigenvectors from different eigenspaces are orthogonal.
That is, all eigenvalues of a covariance matrix are real.
Diagonalizable Matrices
Hence we have for any covariance matrix the spectral decomposition C =
n
∑
i=1