Statistical Learning. Lecture 0 Mathematical Tools: Algebra and Probabilities


Mário A. T. Figueiredo

Instituto Superior Técnico & Instituto de Telecomunicações, University of Lisbon, Portugal


Linear Algebra (Informally)

Linear algebra provides (among many other things) a compact way of representing, studying, and solving linear systems of equations

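Example: the system

$$4 x_1 - 5 x_2 = -13, \qquad -2 x_1 + 3 x_2 = 9$$

can be written compactly as $Ax = b$, where

$$A = \begin{bmatrix} 4 & -5 \\ -2 & 3 \end{bmatrix}, \qquad b = \begin{bmatrix} -13 \\ 9 \end{bmatrix},$$

and can be solved as

$$x = A^{-1} b = \begin{bmatrix} 1.5 & 2.5 \\ 1 & 2 \end{bmatrix} \begin{bmatrix} -13 \\ 9 \end{bmatrix} = \begin{bmatrix} 3 \\ 5 \end{bmatrix}.$$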
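The example can be checked with a minimal Python sketch. It inverts the 2×2 matrix with the closed-form adjugate formula and recovers $x$. The helper name `solve_2x2` is an illustrative assumption, not part of the lecture; exact rationals from `fractions.Fraction` avoid rounding.

```python
from fractions import Fraction


def solve_2x2(A, b):
    """Return (A^{-1}, x) for a 2x2 system A x = b (hypothetical helper)."""
    (a11, a12), (a21, a22) = A
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("A is singular, so A^{-1} does not exist")
    # Closed-form inverse of a 2x2 matrix: adjugate divided by determinant.
    A_inv = [[a22 / det, -a12 / det],
             [-a21 / det, a11 / det]]
    x = [A_inv[0][0] * b[0] + A_inv[0][1] * b[1],
         A_inv[1][0] * b[0] + A_inv[1][1] * b[1]]
    return A_inv, x


A = [[Fraction(4), Fraction(-5)], [Fraction(-2), Fraction(3)]]
b = [Fraction(-13), Fraction(9)]
A_inv, x = solve_2x2(A, b)
print(A_inv)
print(x)
```

For larger systems one would normally use a factorization-based solver rather than forming the inverse explicitly.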

(5)

Notation: Matrices and Vectors

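$A \in \mathbb{R}^{m \times n}$ is a matrix with $m$ rows and $n$ columns:

$$A = \begin{bmatrix} A_{1,1} & \cdots & A_{1,n} \\ \vdots & \ddots & \vdots \\ A_{m,1} & \cdots & A_{m,n} \end{bmatrix}.$$

$x \in \mathbb{R}^n$ is a vector with $n$ components:

$$x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}.$$

A (column) vector is a matrix with $n$ rows and 1 column.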

(6)

Matrix Transpose and Products

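Given a matrix $A \in \mathbb{R}^{m \times n}$, its transpose $A^T$ is such that $(A^T)_{i,j} = A_{j,i}$.

A matrix $A$ is symmetric if $A^T = A$.

Given matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$, their product is

$$C = A B \in \mathbb{R}^{m \times p}, \quad \text{where } C_{i,j} = \sum_{k=1}^{n} A_{i,k} B_{k,j}.$$

Inner product between vectors $x, y \in \mathbb{R}^n$:

$$\langle x, y \rangle = x^T y = y^T x = \sum_{i=1}^{n} x_i y_i \in \mathbb{R}.$$

Outer product of $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^m$: $x y^T \in \mathbb{R}^{n \times m}$, where $(x y^T)_{i,j} = x_i y_j$.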

(7)

Properties of Matrix Products and Transposes

Given matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$, their product is

$$C = A B \in \mathbb{R}^{m \times p}, \quad \text{where } C_{i,j} = \sum_{k=1}^{n} A_{i,k} B_{k,j}.$$

Matrix product is associative: $(AB)C = A(BC)$.

In general, matrix product is not commutative: $AB \neq BA$.

Transpose of product: $(AB)^T = B^T A^T$.
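The properties above can be checked on small matrices with a short Python sketch. The helpers `matmul` and `transpose` and the matrices `A`, `B`, `C` are illustrative choices, not taken from the lecture.

```python
def matmul(A, B):
    """Matrix product C[i][j] = sum_k A[i][k] * B[k][j] on nested lists."""
    n = len(B)
    if any(len(row) != n for row in A):
        raise ValueError("inner dimensions do not match")
    p = len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
            for i in range(len(A))]


def transpose(A):
    """Transpose: (A^T)[i][j] = A[j][i]."""
    return [list(col) for col in zip(*A)]


A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
C = [[2, 0], [1, 3]]

associative = matmul(matmul(A, B), C) == matmul(A, matmul(B, C))
commutes = matmul(A, B) == matmul(B, A)
transpose_rule = transpose(matmul(A, B)) == matmul(transpose(B), transpose(A))
print(associative, commutes, transpose_rule)
```

Here `B` swaps columns when applied on the right and rows when applied on the left, which is why `A B` and `B A` differ.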

(8)

Special Matrices

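The identity matrix $I \in \mathbb{R}^{n \times n}$ is a square matrix such that

$$I_{i,j} = \begin{cases} 1, & i = j \\ 0, & i \neq j \end{cases}, \qquad I = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}.$$

Neutral element of matrix product: $A I = I A = A$.

Diagonal matrix: $(i \neq j) \Rightarrow A_{i,j} = 0$.

Upper triangular matrix: $(j < i) \Rightarrow A_{i,j} = 0$. Lower triangular matrix: $(j > i) \Rightarrow A_{i,j} = 0$.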

(9)

Eigenvalues, eigenvectors, determinant, trace

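A nonzero vector $x \in \mathbb{R}^n$ is an eigenvector of matrix $A \in \mathbb{R}^{n \times n}$ if

$$A x = \lambda x,$$

where $\lambda \in \mathbb{R}$ is the corresponding eigenvalue.

The eigenvalues of a diagonal matrix are the elements in the diagonal. (quiz: what are the eigenvectors?)

Matrix trace: $\operatorname{trace}(A) = \sum_i A_{i,i} = \sum_i \lambda_i$.

Matrix determinant: $|A| = \det(A) = \prod_i \lambda_i$.

Properties of the determinant: $|AB| = |A|\,|B|$, $|A^T| = |A|$, $|\alpha A| = \alpha^n |A|$.

Properties of the trace: $\operatorname{trace}(A + B) = \operatorname{trace}(A) + \operatorname{trace}(B)$,

$\operatorname{trace}(ABC) = \operatorname{trace}(CAB) = \operatorname{trace}(BCA)$ (cyclic permutations).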

(10)

Matrix Inverse

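Matrix $A \in \mathbb{R}^{n \times n}$ is invertible if there is $B \in \mathbb{R}^{n \times n}$ such that

$$AB = BA = I.$$

The matrix $B$ such that $AB = BA = I$ is denoted $B = A^{-1}$.

Matrix $A \in \mathbb{R}^{n \times n}$ is invertible $\Leftrightarrow \det(A) \neq 0$.

Determinant of inverse: $\det(A^{-1}) = \dfrac{1}{\det(A)}$.

Solving the system $Ax = b$, if $A$ is invertible: $x = A^{-1} b$.

Properties: $(A^{-1})^{-1} = A$, $(A^{-1})^T = (A^T)^{-1}$, $(AB)^{-1} = B^{-1} A^{-1}$.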

There are many algorithms to compute A−1; general case,

(11)

Quadratic Forms and Positive (Semi-)Definite Matrices

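Given a matrix $A \in \mathbb{R}^{n \times n}$ and a vector $x \in \mathbb{R}^n$,

$$x^T A x = \sum_{i=1}^{n} \sum_{j=1}^{n} A_{i,j}\, x_i x_j \in \mathbb{R}$$

is called a quadratic form.

A symmetric matrix $A \in \mathbb{R}^{n \times n}$ is positive semi-definite (PSD) if, for any $x \in \mathbb{R}^n$, $x^T A x \geq 0$.

A symmetric matrix $A \in \mathbb{R}^{n \times n}$ is positive definite (PD) if, for any $x \in \mathbb{R}^n$, $(x \neq 0) \Rightarrow x^T A x > 0$.

A symmetric matrix $A \in \mathbb{R}^{n \times n}$ is PSD $\Leftrightarrow$ all $\lambda_i(A) \geq 0$.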
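The link between quadratic forms, eigenvalues, trace, and determinant can be illustrated on a symmetric 2×2 matrix, where the eigenvalues have a closed form. This is only a sketch: the matrix `A` and the helper names are illustrative assumptions, and the closed-form formula applies to the symmetric 2×2 case only.

```python
import math


def quadratic_form(A, x):
    """x^T A x as the double sum over A[i][j] * x[i] * x[j]."""
    n = len(x)
    return sum(A[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def eigenvalues_sym_2x2(A):
    """Closed-form eigenvalues of a symmetric 2x2 matrix [[a, b], [b, d]]."""
    a, b, d = A[0][0], A[0][1], A[1][1]
    center = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b ** 2)
    return center - radius, center + radius


A = [[2.0, -1.0], [-1.0, 2.0]]  # illustrative symmetric matrix
lam_min, lam_max = eigenvalues_sym_2x2(A)
is_pd = lam_min > 0  # all eigenvalues positive => positive definite
trace_A = A[0][0] + A[1][1]
det_A = A[0][0] * A[1][1] - A[0][1] * A[1][0]
q = quadratic_form(A, [1.0, 1.0])
print(lam_min, lam_max, is_pd, trace_A, det_A, q)
```

The trace equals the sum of the two eigenvalues and the determinant equals their product, matching the identities stated earlier.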

(12)

A Bit More Formal: Vector Spaces

A vector space over a field $F$ (e.g., $\mathbb{R}$) is a set $V$ and a pair of operations, $+ : V \times V \to V$ and $\cdot : F \times V \to V$, that satisfy the following axioms, $\forall x, y, z \in V$ and $\forall \alpha, \beta \in F$:

- $+$ is associative and commutative;

- $\exists\, 0 \in V$ such that $0 + x = x$;

- $\exists\, {-x} \in V$ such that $-x + x = 0$;

- $\alpha \cdot (\beta \cdot x) = (\alpha \cdot \beta) \cdot x$;

- $1 \cdot x = x$, where $1 \in F$ is such that $1 \cdot \alpha = \alpha$;

- $\alpha \cdot (x + y) = \alpha \cdot x + \alpha \cdot y$;

- $(\alpha + \beta) \cdot x = \alpha \cdot x + \beta \cdot x$.

Elements of $V$ are called vectors; elements of $F$ are scalars.

(13)

Vector Space: Examples

“Usual vectors” $(\mathbb{R}^n, +, \cdot)$ over field $\mathbb{R}$:

- $x = (x_1, ..., x_n)$, $y = (y_1, ..., y_n)$, $x + y = (x_1 + y_1, ..., x_n + y_n)$;

- $x = (x_1, ..., x_n)$, $\alpha x = (\alpha x_1, ..., \alpha x_n)$.

Real matrices $(\mathbb{R}^{m \times n}, +, \cdot)$ over field $\mathbb{R}$:

- usual matrix addition and multiplication by a scalar.

Complex matrices $(\mathbb{C}^{m \times n}, +, \cdot)$ over field $\mathbb{C}$ (complex numbers).

Binary vectors $(\{0, 1\}^n, +, \cdot)$ over $GF(2) = \{0, 1\}$ (Galois field):

- $+$ is addition modulo 2: $0 + 0 = 0$, $0 + 1 = 1 + 0 = 1$, $1 + 1 = 0$;

- $\cdot$ is standard multiplication: $0 \cdot 0 = 0 \cdot 1 = 1 \cdot 0 = 0$, $1 \cdot 1 = 1$.

The set of all functions $f : \Omega \to \mathbb{R}$, with point-wise addition and multiplication by a scalar, is a vector space over $\mathbb{R}$.

(14)

Norm on Vector Space V

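A norm is a function $\|\cdot\| : V \to \mathbb{R}_+$ satisfying, $\forall x, y \in V$ and $\forall \alpha \in \mathbb{R}$,

- homogeneity, $\|\alpha x\| = |\alpha|\, \|x\|$;

- triangle inequality, $\|x + y\| \leq \|x\| + \|y\|$;

- definiteness, $\|x\| = 0 \Leftrightarrow x = 0$.

A seminorm may not satisfy definiteness.

Classical example in $\mathbb{R}^n$: Euclidean norm, $\|x\| = \sqrt{\sum_{i=1}^{n} x_i^2}$.

Two norms $\|\cdot\|$ and $\|\cdot\|'$ are equivalent if $\exists\, \alpha, \beta > 0$ such that

$$\forall x \in V, \quad \alpha \|x\| \leq \|x\|' \leq \beta \|x\|.$$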

(15)

Other Norms

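The $\ell_p$ norm of a vector $x \in \mathbb{R}^n$, where $p \geq 1$:

$$\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}.$$

Notable cases:

- $\ell_2$ (Euclidean) norm.

- $\ell_1$ norm, $\|x\|_1 = \sum_i |x_i|$.

- $\ell_\infty$ norm, $\lim_{p \to \infty} \|x\|_p = \max\{|x_1|, ..., |x_n|\} \equiv \|x\|_\infty$.

- $\ell_0$ “norm” (not a norm), $\lim_{p \to 0} \|x\|_p^p = \#\{i : x_i \neq 0\} \equiv \|x\|_0$.

Some equivalences:

$$\|x\|_2 \leq \|x\|_1 \leq \sqrt{n}\, \|x\|_2, \qquad \|x\|_\infty \leq \|x\|_2 \leq \sqrt{n}\, \|x\|_\infty, \qquad \|x\|_\infty \leq \|x\|_1 \leq n\, \|x\|_\infty.$$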
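The norms and the three equivalence chains above can be evaluated directly. The following is a minimal sketch; the function names and the test vector are illustrative assumptions.

```python
import math


def lp_norm(x, p):
    """l_p norm for p >= 1; p = math.inf gives the max norm."""
    if p == math.inf:
        return max(abs(v) for v in x)
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


def l0_count(x):
    """Number of nonzero entries (the l_0 "norm", which is not a norm)."""
    return sum(1 for v in x if v != 0)


x = [3.0, -4.0, 0.0]  # illustrative vector, not from the lecture
n = len(x)
l1, l2, linf = lp_norm(x, 1), lp_norm(x, 2), lp_norm(x, math.inf)

# The three chains of inequalities between the l_1, l_2 and l_inf norms.
chain_12 = l2 <= l1 <= math.sqrt(n) * l2
chain_inf2 = linf <= l2 <= math.sqrt(n) * linf
chain_inf1 = linf <= l1 <= n * linf
print(l1, l2, linf, l0_count(x))
print(chain_12, chain_inf2, chain_inf1)
```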

(16)

Inner Product on Vector Space V Over R

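An inner product is a function $\langle \cdot, \cdot \rangle : V \times V \to \mathbb{R}$, satisfying, $\forall x, y, z \in V$ and $\forall \alpha, \beta \in \mathbb{R}$,

- symmetry, $\langle x, y \rangle = \langle y, x \rangle$;

- (bi)linearity, $\langle \alpha x + \beta z, y \rangle = \alpha \langle x, y \rangle + \beta \langle z, y \rangle$;

- definiteness, $\langle x, x \rangle \geq 0$ and $\langle x, x \rangle = 0 \Leftrightarrow x = 0$.

Standard inner product in $\mathbb{R}^n$: $\langle x, y \rangle = \sum_{i=1}^{n} x_i y_i = x^T y$.

Also an inner product in $\mathbb{R}^n$: $\langle x, y \rangle = x^T M y$, where $M$ is PD.

Norm (is it?) induced by an inner product: $\|x\| = \sqrt{\langle x, x \rangle}$.

- For $\langle x, y \rangle = x^T y$, $\|x\|^2 = x^T x = \sum_{i=1}^{n} x_i^2$ (Euclidean norm).

- For $\langle x, y \rangle = x^T M y$, $\|x\|_M^2 = x^T M x$ (Mahalanobis norm).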

(17)

Key Properties of Inner Products

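If $\|\cdot\|$ is induced by inner product $\langle \cdot, \cdot \rangle$ (that is, $\|x\|^2 = \langle x, x \rangle$), then

$$\|x + y\|^2 = \langle x + y, x + y \rangle = \|x\|^2 + \|y\|^2 + 2 \langle x, y \rangle.$$

Cauchy-Schwarz inequality: $|\langle x, y \rangle| \leq \|x\|\, \|y\|$.

“Des démonstrations qui font boum!” (“Proofs that make boom!”) [Jean-Baptiste Hiriart-Urruty]

$$0 \leq \frac{1}{2} \left\| \frac{x}{\|x\|} \pm \frac{y}{\|y\|} \right\|^2 = 1 \pm \frac{\langle x, y \rangle}{\|x\|\, \|y\|} \;\Leftrightarrow\; \begin{cases} \langle x, y \rangle \leq \|x\|\, \|y\| \\ -\langle x, y \rangle \leq \|x\|\, \|y\| \end{cases}$$

Corollary: $\|\cdot\|$ is indeed a norm, as it satisfies the triangle inequality:

$$\|x + y\|^2 \leq \|x\|^2 + \|y\|^2 + 2 |\langle x, y \rangle| \leq \|x\|^2 + \|y\|^2 + 2 \|x\|\, \|y\| = \left( \|x\| + \|y\| \right)^2.$$

Hilbert space: complete vector space equipped with an inner product.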

(18)

Basis and Dimension of a Vector Space V

Basis: collection of vectors $B = \{b_1, b_2, ...\} \subset V$ satisfying:

- linear independence: for any finite linear combination, $\alpha_1 b_1 + ... + \alpha_m b_m = 0 \Rightarrow \alpha_1 = \cdots = \alpha_m = 0$;

- spanning ability: any vector $v \in V$ can be written as $v = \alpha_1 b_1 + ... + \alpha_m b_m$; in other words, $V = \operatorname{span}(B)$.

Dimension of $V$: $\dim(V) = \# B$.

Orthogonal basis: $i \neq j \Rightarrow \langle b_i, b_j \rangle = 0$.

(19)

Rank, Range, and Null Space

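Consider some real matrix $A \in \mathbb{R}^{m \times n}$.

Range of $A$: $R(A) = \{ y \in \mathbb{R}^m : \exists\, x \in \mathbb{R}^n \text{ such that } y = Ax \} \subseteq \mathbb{R}^m$.

Null space of $A$: $N(A) = \{ x \in \mathbb{R}^n : Ax = 0 \} \subseteq \mathbb{R}^n$.

Both $R(A)$ and $N(A)$ are vector spaces.

Dimension theorem: $\dim R(A) + \dim N(A) = n$.

Rank: $\operatorname{rank}(A) = \dim R(A) \leq \min\{m, n\}$.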

(20)

Singular Value Decomposition (SVD)

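Any rank-$r$ matrix $A \in \mathbb{R}^{m \times n}$ can be written as $A = U \Lambda V^T$, where:

- the columns of $U \in \mathbb{R}^{m \times r}$ are an orthonormal basis of $R(A)$;

- the columns of $V \in \mathbb{R}^{n \times r}$ are an orthonormal basis of $R(A^T)$;

- $\Lambda = \operatorname{diag}(\sigma_1, ..., \sigma_r)$ is an $r \times r$ diagonal matrix;

- $\sigma_1, ..., \sigma_r$ are called singular values;

- $\sigma_1, ..., \sigma_r$ are the square roots of the nonzero eigenvalues of $A^T A$ or $A A^T$.

Orthonormality of $U$ and $V$: $U^T U = I$ and $V^T V = I$.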

(21)

Singular Value Decomposition (SVD)

$A = U \Lambda V^T$, where $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n \times r}$.


Probability theory

“Essentially, all models are wrong, but some are useful”; G. Box, 1987

The study of probability has roots in games of chance (dice, cards, ...).

Great names in science: Cardano, Fermat, Pascal, Laplace, Gauss, Huygens, Legendre, Poisson, Kolmogorov, Bernoulli, Cauchy, Gibbs, Boltzmann, de Finetti, ...

Natural tool to model uncertainty, information, knowledge, belief, ...

(29)

What is probability?

Classical definition: $P(A) = \dfrac{N_A}{N}$

...with $N$ mutually exclusive equally likely outcomes, $N_A$ of which result in the occurrence of $A$. Laplace, 1814

Example: $P(\text{randomly drawn card is } \clubsuit) = 13/52$.

Example: $P(\text{getting 1 in throwing a fair die}) = 1/6$.

Frequentist definition: $P(A) = \lim_{N \to \infty} \dfrac{N_A}{N}$

...relative frequency of occurrence of $A$ in an infinite number of trials.

Subjective probability: $P(A)$ is a degree of belief. de Finetti, 1930s

(30)

“The mathematics of probability can be developed on an entirely

(31)

Key concepts: Sample space and events

Sample space $\mathcal{X}$ = set of possible outcomes of a random experiment.

Examples:

- Tossing two coins: $\mathcal{X} = \{HH, TH, HT, TT\}$

- Roulette: $\mathcal{X} = \{1, 2, ..., 36\}$

- Draw a card from a shuffled deck: $\mathcal{X} = \{A\clubsuit, 2\clubsuit, ..., Q\diamondsuit, K\diamondsuit\}$.

An event $A$ is a subset of $\mathcal{X}$: $A \subseteq \mathcal{X}$ (also written $A \in 2^{\mathcal{X}}$).

Examples:

- “exactly one H in 2-coin toss”: $A = \{TH, HT\} \subset \{HH, TH, HT, TT\}$.

- “odd number in the roulette”: $B = \{1, 3, ..., 35\} \subset \{1, 2, ..., 36\}$.

(32)

Key concepts: Sample space and events

Sample space X = set of possible outcomes of a random experiment.

(More delicate) examples:

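- Time until you receive the next email: $\mathcal{X} = \mathbb{R}_+$

- Location of the next rain drop: $\mathcal{X} = \mathbb{R}^2$

An event $A$ is a measurable subset of $\mathcal{X}$, i.e., $A \in \sigma(F)$, $F \subseteq 2^{\mathcal{X}}$.

$\sigma$-algebra $\sigma(F)$:

- $A \in \sigma(F) \Rightarrow A^c \in \sigma(F)$

- $A_1, A_2, ... \in \sigma(F) \Rightarrow \bigcup_{i=1}^{\infty} A_i \in \sigma(F)$

Classical example in $\mathbb{R}^n$: the collection of Lebesgue measurable sets constitutes a $\sigma$-algebra.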

(33)

Kolmogorov’s Axioms for Probability

Probability is a function that maps events $A$ into the interval $[0, 1]$.

Kolmogorov’s axioms (1933) for probability $P : \sigma(F) \to [0, 1]$:

- For any $A \in \sigma(F)$, $P(A) \geq 0$

- $P(\mathcal{X}) = 1$

- If $A_1, A_2, ... \subseteq \mathcal{X}$ are disjoint events, then $P\left( \bigcup_i A_i \right) = \sum_i P(A_i)$

From these axioms, many results can be derived. Examples:

- $P(\emptyset) = 0$

- $C \subset D \Rightarrow P(C) \leq P(D)$

- $P(A \cup B) = P(A) + P(B) - P(A \cap B)$

- $P(A \cup B) \leq P(A) + P(B)$ (union bound)

(34)

Conditional Probability and Independence

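If $P(B) > 0$, $P(A|B) = \dfrac{P(A \cap B)}{P(B)}$ (conditional probability of $A$ given $B$)

...satisfies all of Kolmogorov’s axioms:

- For any $A \subseteq \mathcal{X}$, $P(A|B) \geq 0$

- $P(\mathcal{X}|B) = 1$

- If $A_1, A_2, ... \subseteq \mathcal{X}$ are disjoint, then $P\left( \bigcup_i A_i \,\Big|\, B \right) = \sum_i P(A_i|B)$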

(35)

Conditional Probability and Independence

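If $P(B) > 0$, $P(A|B) = \dfrac{P(A \cap B)}{P(B)}$.

Events $A$, $B$ are independent ($A \perp\!\!\!\perp B$) $\Leftrightarrow P(A \cap B) = P(A)\, P(B)$.

Relationship with conditional probabilities:

$$A \perp\!\!\!\perp B \;\Leftrightarrow\; P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A)\, P(B)}{P(B)} = P(A).$$

Example: $\mathcal{X}$ = “52 cards”, $A = \{3\heartsuit, 3\clubsuit, 3\diamondsuit, 3\spadesuit\}$, and $B = \{A\heartsuit, 2\heartsuit, ..., K\heartsuit\}$; then, $P(A) = 1/13$, $P(B) = 1/4$,

$$P(A \cap B) = P(\{3\heartsuit\}) = \frac{1}{52}, \qquad P(A)\, P(B) = \frac{1}{13} \cdot \frac{1}{4} = \frac{1}{52}, \qquad P(A|B) = P(\text{“3”}\,|\,\text{“}\heartsuit\text{”}) = \frac{1}{13} = P(A).$$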
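The card example can be reproduced by enumerating the deck and applying the classical definition $P(A) = N_A / N$. This is a minimal sketch; the encoding of cards as `(rank, suit)` tuples and the helper names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "clubs", "diamonds", "spades"]
deck = set(product(RANKS, SUITS))  # sample space: 52 equally likely cards


def prob(event):
    """Classical probability N_A / N for an event (a subset of the deck)."""
    return Fraction(len(event), len(deck))


def cond_prob(a, b):
    """P(a | b) = P(a and b) / P(b), assuming P(b) > 0."""
    return prob(a & b) / prob(b)


A = {card for card in deck if card[0] == "3"}       # "the card is a 3"
B = {card for card in deck if card[1] == "hearts"}  # "the card is a heart"

print(prob(A), prob(B), prob(A & B), cond_prob(A, B))
print(prob(A & B) == prob(A) * prob(B))  # independence check
```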

(36)

Bayes Theorem

Law of total probability: if $A_1, ..., A_n$ are a partition of $\mathcal{X}$,

$$P(B) = \sum_i P(B|A_i)\, P(A_i) = \sum_i P(B \cap A_i).$$

Bayes’ theorem: if $\{A_1, ..., A_n\}$ is a partition of $\mathcal{X}$,

$$P(A_i|B) = \frac{P(B \cap A_i)}{P(B)} = \frac{P(B|A_i)\, P(A_i)}{P(B)} = \frac{P(B|A_i)\, P(A_i)}{\sum_j P(B|A_j)\, P(A_j)}.$$
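A small numerical sketch of the law of total probability and Bayes’ theorem follows. The priors and likelihoods are hypothetical numbers chosen for illustration, and `bayes_posterior` is an assumed helper name.

```python
from fractions import Fraction


def bayes_posterior(prior, likelihood):
    """Return (posterior, P(B)) given P(A_i) and P(B | A_i) over a partition."""
    # Law of total probability: P(B) = sum_i P(B | A_i) P(A_i).
    evidence = sum(l * p for l, p in zip(likelihood, prior))
    # Bayes' theorem applied to each element of the partition.
    posterior = [l * p / evidence for l, p in zip(likelihood, prior)]
    return posterior, evidence


# Hypothetical numbers for a partition {A_1, A_2, A_3}.
prior = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]
likelihood = [Fraction(1, 10), Fraction(1, 5), Fraction(1, 2)]
posterior, evidence = bayes_posterior(prior, likelihood)
print(evidence)
print(posterior)
```

Note how the least likely event a priori ends up with the largest posterior because its likelihood is the highest.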

(37)

Random Variables

A (real) random variable (RV) is a function $X : \mathcal{X} \to \mathbb{R}$.

- Discrete RV: the range of $X$ is countable (e.g., $\mathbb{N}$ or $\{0, 1\}$)

- Continuous RV: the range of $X$ is uncountable (e.g., $\mathbb{R}$ or $[0, 1]$)

- Example: number of heads in tossing two coins, $\mathcal{X} = \{HH, HT, TH, TT\}$,

  $X(HH) = 2$, $X(HT) = X(TH) = 1$, $X(TT) = 0$. Range of $X$ = $\{0, 1, 2\}$.

(38)

Random Variables: Distribution Function

Distribution function: $F_X(x) = P(\{\omega \in \mathcal{X} : X(\omega) \leq x\})$

Example: number of heads in tossing 2 coins; $\operatorname{range}(X) = \{0, 1, 2\}$.

Probability mass function (discrete RV): $f_X(x) = P(X = x)$,

$$F_X(x) = \sum_{x_i \leq x} f_X(x_i).$$
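For the two-coin example, the pmf and the distribution function can be built by enumerating the sample space. This is a minimal sketch assuming fair, independent coins (equally likely outcomes); the names `sample_space`, `pmf`, and `cdf` are illustrative.

```python
from fractions import Fraction
from itertools import product

sample_space = list(product("HT", repeat=2))  # (H,H), (H,T), (T,H), (T,T)


def X(omega):
    """Random variable: number of heads in the outcome omega."""
    return omega.count("H")


# pmf f_X(x) = P(X = x), accumulating equally likely outcomes.
pmf = {}
for omega in sample_space:
    pmf[X(omega)] = pmf.get(X(omega), 0) + Fraction(1, len(sample_space))


def cdf(x):
    """F_X(x) = P(X <= x): sum of the pmf over values not exceeding x."""
    return sum(prob for value, prob in pmf.items() if value <= x)


print(sorted(pmf.items()))
print([cdf(x) for x in (-1, 0, 1.5, 2)])
```

The distribution function is a staircase: it is flat between the values 0, 1, 2 and jumps by $f_X(x)$ at each of them.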

(39)

Properties of Distribution Functions

$F_X : \mathbb{R} \to [0, 1]$ is the distribution function of some r.v. $X$ iff:

- it is non-decreasing: $x_1 < x_2 \Rightarrow F_X(x_1) \leq F_X(x_2)$;

- $\lim_{x \to -\infty} F_X(x) = 0$;

- $\lim_{x \to +\infty} F_X(x) = 1$;

- it is right semi-continuous: $\lim_{x \to z^+} F_X(x) = F_X(z)$.

Further properties:

- $P(X = x) = f_X(x) = F_X(x) - \lim_{z \to x^-} F_X(z)$;

- $P(z < X \leq y) = F_X(y) - F_X(z)$;

- $P(X > x) = 1 - F_X(x)$.

(40)

Important Discrete Random Variables

Uniform: $X \in \{x_1, ..., x_K\}$, pmf $f_X(x_i) = 1/K$.

Bernoulli RV: $X \in \{0, 1\}$, pmf

$$f_X(x) = \begin{cases} p, & x = 1 \\ 1 - p, & x = 0 \end{cases}$$

Can be written compactly as $f_X(x) = p^x (1 - p)^{1 - x}$.

Binomial RV: $X \in \{0, 1, ..., n\}$ (sum of $n$ independent Bernoulli RVs),

$$f_X(x) = \operatorname{Binomial}(x; n, p) = \binom{n}{x} p^x (1 - p)^{n - x}.$$

Binomial coefficients (“$n$ choose $x$”): $\binom{n}{x} = \dfrac{n!}{(n - x)!\, x!}$.
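The binomial pmf can be evaluated exactly with `math.comb` (Python 3.8+). In the sketch below, the values of `n` and `p` are illustrative; the sum of the pmf and its mean are computed as sanity checks.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(x, n, p):
    """Binomial(x; n, p) = C(n, x) p^x (1 - p)^(n - x)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


n, p = 5, Fraction(1, 3)  # illustrative parameters
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]
total = sum(pmf)                                 # should be exactly 1
mean = sum(x * fx for x, fx in enumerate(pmf))   # should equal n * p
print(total, mean)
```

With `n = 1` the same function reduces to the Bernoulli pmf $p^x (1 - p)^{1 - x}$.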

(41)

More Important Discrete Random Variables

Geometric($p$): $X \in \mathbb{N}$, pmf $f_X(x) = p (1 - p)^{x - 1}$ (e.g., number of trials until the first success).

Poisson($\lambda$): $X \in \mathbb{N} \cup \{0\}$, pmf $f_X(x) = \dfrac{e^{-\lambda} \lambda^x}{x!}$.

Notice that $\sum_{x=0}^{\infty} \dfrac{\lambda^x}{x!} = e^{\lambda}$, thus $\sum_{x=0}^{\infty} f_X(x) = 1$.

“...probability of the number of independent occurrences in a fixed (time/space) interval if these occurrences have known average rate”

(42)

Random Variables: Distribution Function

Distribution function: $F_X(x) = P(\{\omega \in \mathcal{X} : X(\omega) \leq x\})$

Example: continuous RV with uniform distribution on $[a, b]$.

Probability density function (pdf, continuous RV): $f_X(x)$,

$$F_X(x) = \int_{-\infty}^{x} f_X(u)\, du, \qquad P(X \in [c, d]) = \int_{c}^{d} f_X(x)\, dx.$$

(43)

Important Continuous Random Variables

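Uniform (previous slide):

$$f_X(x) = \operatorname{Uniform}(x; a, b) = \begin{cases} \dfrac{1}{b - a}, & x \in [a, b] \\ 0, & x \notin [a, b] \end{cases}$$

Gaussian:

$$f_X(x) = \mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}}\, e^{-\frac{(x - \mu)^2}{2 \sigma^2}}$$

Exponential:

$$f_X(x) = \operatorname{Exp}(x; \lambda) = \begin{cases} \lambda e^{-\lambda x}, & x \geq 0 \\ 0, & x < 0 \end{cases}$$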

(44)

Expectation of Random Variables

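Expectation:

$$E(X) = \begin{cases} \sum_i x_i f_X(x_i), & X \in \{x_1, ..., x_K\} \subset \mathbb{R} \\[4pt] \int_{-\infty}^{\infty} x\, f_X(x)\, dx, & X \text{ continuous} \end{cases}$$

Example: Bernoulli, $f_X(x) = p^x (1 - p)^{1 - x}$, for $x \in \{0, 1\}$.

$E(X) = 0 \cdot (1 - p) + 1 \cdot p = p$.

Example: Binomial, $f_X(x) = \binom{n}{x} p^x (1 - p)^{n - x}$, for $x \in \{0, ..., n\}$.

$E(X) = n p$.

Example: Gaussian, $f_X(x) = \mathcal{N}(x; \mu, \sigma^2)$. $E(X) = \mu$.

Linearity of expectation: $E(\alpha X + \beta Y) = \alpha E(X) + \beta E(Y)$.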

(45)

Expectation of Functions of Random Variables

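$$E(g(X)) = \begin{cases} \sum_i g(x_i) f_X(x_i), & X \text{ discrete},\; g(x_i) \in \mathbb{R} \\[4pt] \int_{-\infty}^{\infty} g(x)\, f_X(x)\, dx, & X \text{ continuous} \end{cases}$$

Example: variance, $\operatorname{var}(X) = E\left[ (X - E(X))^2 \right] = E(X^2) - E(X)^2$.

Example: Bernoulli variance, $E(X^2) = E(X) = p$, thus $\operatorname{var}(X) = p (1 - p)$.

Example: Gaussian variance, $E\left[ (X - \mu)^2 \right] = \sigma^2$.

Probability as expectation of indicator, $\mathbf{1}_A(x) = \begin{cases} 1, & x \in A \\ 0, & x \notin A \end{cases}$:

$$P(X \in A) = \int_A f_X(x)\, dx = \int \mathbf{1}_A(x)\, f_X(x)\, dx = E(\mathbf{1}_A(X)).$$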
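The discrete case of $E(g(X))$ is a one-line sum. The sketch below applies it to a Bernoulli pmf to recover the mean, both variance formulas, and a probability written as the expectation of an indicator. The value of `p` and the helper `expectation` are illustrative assumptions.

```python
from fractions import Fraction


def expectation(pmf, g=lambda x: x):
    """E(g(X)) = sum_i g(x_i) f_X(x_i) for a discrete pmf {x_i: f_X(x_i)}."""
    return sum(g(x) * fx for x, fx in pmf.items())


p = Fraction(3, 10)  # illustrative Bernoulli parameter
bernoulli = {0: 1 - p, 1: p}

mean = expectation(bernoulli)
second_moment = expectation(bernoulli, lambda x: x ** 2)
variance = second_moment - mean ** 2                            # E(X^2) - E(X)^2
variance_direct = expectation(bernoulli, lambda x: (x - mean) ** 2)
# P(X in A) as the expectation of the indicator of A = {1}.
prob_one = expectation(bernoulli, lambda x: 1 if x in {1} else 0)
print(mean, variance, variance_direct, prob_one)
```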

(46)

Moments of Random Variables

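Non-central moments of order $k$:

$$E(|X|^k) = \begin{cases} \sum_i |x_i|^k f_X(x_i), & X \text{ discrete} \\[4pt] \int_{-\infty}^{\infty} |x|^k f_X(x)\, dx, & X \text{ continuous} \end{cases}$$

Central moments of order $k \in \mathbb{N}$: $E\left( |X - E(X)|^k \right)$,

...if the integral/sum exists.

If the $k$-th moment exists, then the $j$-th moment exists, for $j \leq k$:

$$\int_{-\infty}^{\infty} |x|^j f_X(x)\, dx = \int_{|x| \leq 1} |x|^j f_X(x)\, dx + \int_{|x| > 1} |x|^j f_X(x)\, dx \leq \int_{|x| \leq 1} f_X(x)\, dx + \int_{|x| > 1} |x|^k f_X(x)\, dx \leq 1 + E\left( |X|^k \right) < \infty.$$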

(47)

Two (or More) Random Variables

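Joint pmf of two discrete RVs: $f_{X,Y}(x, y) = P(X = x \wedge Y = y)$.

Extends trivially to more than two RVs.

Joint pdf of two continuous RVs: $f_{X,Y}(x, y)$, such that

$$P\left( (X, Y) \in A \right) = \iint_A f_{X,Y}(x, y)\, dx\, dy, \qquad A \in \sigma(\mathbb{R}^2).$$

Extends trivially to more than two RVs.

Marginalization:

$$f_Y(y) = \begin{cases} \sum_x f_{X,Y}(x, y), & \text{if } X \text{ is discrete} \\[4pt] \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dx, & \text{if } X \text{ is continuous} \end{cases}$$

Independence: $X \perp\!\!\!\perp Y \Leftrightarrow f_{X,Y}(x, y) = f_X(x)\, f_Y(y) \Rightarrow \ldots$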

(48)

Conditionals and Bayes’ Theorem

Conditional pmf (discrete RVs):

$$f_{X|Y}(x|y) = P(X = x \,|\, Y = y) = \frac{P(X = x \wedge Y = y)}{P(Y = y)} = \frac{f_{X,Y}(x, y)}{f_Y(y)}.$$

Conditional pdf (continuous RVs): $f_{X|Y}(x|y) = \dfrac{f_{X,Y}(x, y)}{f_Y(y)}$ ...the meaning is technically delicate.

Bayes’ theorem: $f_{X|Y}(x|y) = \dfrac{f_{Y|X}(y|x)\, f_X(x)}{f_Y(y)}$ (pdf or pmf).

(49)

Joint, Marginal, and Conditional Probabilities: An Example

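A pair of binary variables $X, Y \in \{0, 1\}$, with joint pmf (as implied by the marginal sums below):

             Y = 0    Y = 1
    X = 0    1/5      2/5
    X = 1    1/10     3/10

Marginals: $f_X(0) = \frac{1}{5} + \frac{2}{5} = \frac{3}{5}$, $f_X(1) = \frac{1}{10} + \frac{3}{10} = \frac{4}{10}$,

$f_Y(0) = \frac{1}{5} + \frac{1}{10} = \frac{3}{10}$, $f_Y(1) = \frac{2}{5} + \frac{3}{10} = \frac{7}{10}$.

Conditional probabilities: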
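The marginal sums above imply the joint pmf $f_{X,Y}(0,0) = 1/5$, $f_{X,Y}(0,1) = 2/5$, $f_{X,Y}(1,0) = 1/10$, $f_{X,Y}(1,1) = 3/10$. Under that reading, the following sketch recomputes the marginals and derives the conditional pmfs via $f_{X|Y}(x|y) = f_{X,Y}(x,y) / f_Y(y)$. The dictionary layout and variable names are illustrative assumptions.

```python
from fractions import Fraction as F

# Joint pmf keyed by (x, y), inferred from the marginal sums in the text.
joint = {(0, 0): F(1, 5), (0, 1): F(2, 5),
         (1, 0): F(1, 10), (1, 1): F(3, 10)}

# Marginalization: sum the joint pmf over the other variable.
f_X = {x: sum(f for (xx, _), f in joint.items() if xx == x) for x in (0, 1)}
f_Y = {y: sum(f for (_, yy), f in joint.items() if yy == y) for y in (0, 1)}

# Conditionals: key (x, y) holds f_{X|Y}(x|y); key (y, x) holds f_{Y|X}(y|x).
f_X_given_Y = {(x, y): joint[(x, y)] / f_Y[y] for (x, y) in joint}
f_Y_given_X = {(y, x): joint[(x, y)] / f_X[x] for (x, y) in joint}

print(f_X, f_Y)
print(f_X_given_Y)
print(f_Y_given_X)
```

Each conditional pmf sums to one over its first argument, for every fixed value of the conditioning variable.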

(50)

An Important Multivariate RV: Multinomial

Multinomial: $X = (X_1, ..., X_K)$, $X_i \in \{0, ..., n\}$, such that $\sum_i X_i = n$,

$$f_X(x_1, ..., x_K) = \begin{cases} \dbinom{n}{x_1\, x_2 \cdots x_K}\, p_1^{x_1} p_2^{x_2} \cdots p_K^{x_K}, & \sum_i x_i = n \\[6pt] 0, & \sum_i x_i \neq n \end{cases}$$

$$\binom{n}{x_1\, x_2 \cdots x_K} = \frac{n!}{x_1!\, x_2! \cdots x_K!}$$

Parameters: $p_1, ..., p_K \geq 0$, such that $\sum_i p_i = 1$.

Generalizes the binomial from binary to $K$ classes.

Example: tossing $n$ independent fair dice, $p_1 = \cdots = p_6 = 1/6$.

(51)

Covariance, Correlation, and all that...

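Covariance between two RVs, $X, Y \in \mathbb{R}$:

$$\operatorname{cov}(X, Y) = E\left[ (X - E(X))\, (Y - E(Y)) \right] = E(XY) - E(X)\, E(Y).$$

Relationship with variance: $\operatorname{var}(X) = \operatorname{cov}(X, X)$.

Correlation: $\operatorname{corr}(X, Y) = \rho(X, Y) = \dfrac{\operatorname{cov}(X, Y)}{\sqrt{\operatorname{var}(X)}\, \sqrt{\operatorname{var}(Y)}} \in [-1, 1]$.

$X \perp\!\!\!\perp Y \Leftrightarrow f_{X,Y}(x, y) = f_X(x)\, f_Y(y) \;\Rightarrow\; \operatorname{cov}(X, Y) = 0$, but the converse implication ($\Leftarrow$) does not hold (example).

Covariance matrix of multivariate RV, $X \in \mathbb{R}^n$:

$$\operatorname{cov}(X) = E\left[ (X - E(X))\, (X - E(X))^T \right].$$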
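One standard counterexample showing that zero covariance does not imply independence takes $X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$. This example is an illustrative choice and not necessarily the one used in the lecture.

```python
from fractions import Fraction

third = Fraction(1, 3)
# X uniform on {-1, 0, 1} and Y = X^2, so Y is a function of X.
joint = {(-1, 1): third, (0, 0): third, (1, 1): third}


def expect(g):
    """E(g(X, Y)) under the joint pmf."""
    return sum(g(x, y) * f for (x, y), f in joint.items())


cov = expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)

f_X0 = sum(f for (x, _), f in joint.items() if x == 0)   # P(X = 0)
f_Y1 = sum(f for (_, y), f in joint.items() if y == 1)   # P(Y = 1)
f_XY01 = joint.get((0, 1), Fraction(0))                  # P(X = 0, Y = 1)

print(cov)                   # covariance
print(f_XY01, f_X0 * f_Y1)   # joint vs product of marginals
```

The covariance vanishes by symmetry, yet the joint pmf does not factor into the product of its marginals, so $X$ and $Y$ are dependent.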

(52)

An Important Multivariate RV: Gaussian

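Multivariate Gaussian: $X \in \mathbb{R}^n$,

$$f_X(x) = \mathcal{N}(x; \mu, C) = \frac{1}{\sqrt{\det(2 \pi C)}} \exp\left( -\frac{1}{2} (x - \mu)^T C^{-1} (x - \mu) \right)$$

Parameters: vector $\mu \in \mathbb{R}^n$ and matrix $C \in \mathbb{R}^{n \times n}$.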

(53)

Key Properties of Multivariate Gaussian

Marginals are Gaussian. Conditionals are Gaussian.

(54)

Multivariate Gaussian Conditionals

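$$f_X(x) = \mathcal{N}(x; \mu, C) = \frac{1}{\sqrt{\det(2 \pi C)}} \exp\left( -\frac{1}{2} (x - \mu)^T C^{-1} (x - \mu) \right)$$

Partition $X = [\underbrace{X_1, ..., X_i}_{X_a}, \underbrace{X_{i+1}, ..., X_n}_{X_b}]^T$ and $\mu = [\underbrace{\mu_1, ..., \mu_i}_{\mu_a}, \underbrace{\mu_{i+1}, ..., \mu_n}_{\mu_b}]^T$.

Corresponding partition of covariance: $C = \begin{bmatrix} C_{aa} & C_{ab} \\ C_{ba} & C_{bb} \end{bmatrix}$.

Marginals: $f_{X_a}(\cdot) = \mathcal{N}(\cdot\,; \mu_a, C_{aa})$; $f_{X_b}(\cdot) = \mathcal{N}(\cdot\,; \mu_b, C_{bb})$.

Conditional:

$$f_{X_b|X_a}(x_b|x_a) = \mathcal{N}\left( x_b;\; \mu_b + C_{ba} C_{aa}^{-1} (x_a - \mu_a),\; C_{bb} - C_{ba} C_{aa}^{-1} C_{ab} \right)$$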
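In the bivariate case ($X_a$ and $X_b$ both scalar) the conditional formulas reduce to scalar arithmetic. The sketch below evaluates them for hypothetical parameter values; the function name `gaussian_conditional` is an assumption.

```python
def gaussian_conditional(mu_a, mu_b, c_aa, c_ab, c_bb, x_a):
    """Mean and variance of X_b | X_a = x_a for a bivariate Gaussian.

    Scalar special case of the partitioned formulas, with C_ba = C_ab.
    """
    if c_aa <= 0 or c_aa * c_bb - c_ab ** 2 <= 0:
        raise ValueError("covariance matrix must be positive definite")
    cond_mean = mu_b + (c_ab / c_aa) * (x_a - mu_a)
    cond_var = c_bb - c_ab * c_ab / c_aa
    return cond_mean, cond_var


# Hypothetical parameters: mu = (0, 1), C = [[4, 2], [2, 3]], observed x_a = 2.
cond_mean, cond_var = gaussian_conditional(
    mu_a=0.0, mu_b=1.0, c_aa=4.0, c_ab=2.0, c_bb=3.0, x_a=2.0)
print(cond_mean, cond_var)
```

The conditional variance does not depend on the observed value `x_a` and is never larger than the marginal variance `c_bb`.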

(55)

More on Expectations and Covariances

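Let $A \in \mathbb{R}^{n \times n}$ be a matrix and $a \in \mathbb{R}^n$ a vector.

If $E(X) = \mu$ and $Y = AX$, then $E(Y) = A\mu$;

If $\operatorname{cov}(X) = C$ and $Y = AX$, then $\operatorname{cov}(Y) = A C A^T$;

If $\operatorname{cov}(X) = C$ and $Y = a^T X \in \mathbb{R}$, then $\operatorname{var}(Y) = a^T C a \geq 0$;

If $\operatorname{cov}(X) = C$ and $Y = C^{-1/2} X$, then $\operatorname{cov}(Y) = I$;

If $f_X(x) = \mathcal{N}(x; 0, I)$ and $Y = \mu + C^{1/2} X$, then $f_Y(y) = \mathcal{N}(y; \mu, C)$;

If $f_X(x) = \mathcal{N}(x; \mu, C)$ and $Y = C^{-1/2} (X - \mu)$, then $f_Y(y) = \mathcal{N}(y; 0, I)$.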

(56)

Transformations

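$X \sim f_X$ and $Y = g(X) \Rightarrow f_Y = ?$

Discrete case:

$$f_Y(y) = P(g(X) = y) = P\left( \{x : g(x) = y\} \right) = P\left( g^{-1}(y) \right)$$

Continuous case (for $g$ strictly monotonic, thus invertible):

$$f_Y(y) = f_X\left( g^{-1}(y) \right) \left| \frac{d\, g^{-1}(y)}{d y} \right|$$

Continuous multivariate case (invertible):

$$f_Y(y) = f_X\left( g^{-1}(y) \right) \left| \det J_{g^{-1}}(y) \right|$$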

(57)

Central Limit Theorem

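Take $n$ independent r.v. $X_1, ..., X_n$ such that $E[X_i] = \mu_i$ and $\operatorname{var}(X_i) = \sigma_i^2$.

Their sum, $Y_n = \sum_{i=1}^{n} X_i$, satisfies:

$$E[Y_n] = \sum_{i=1}^{n} \mu_i \equiv \mu_{(n)}, \qquad \operatorname{var}(Y_n) = \sum_i \sigma_i^2 \equiv \sigma_{(n)}^2$$

...thus, if $Z_n = \dfrac{Y_n - \mu_{(n)}}{\sigma_{(n)}}$, then $E[Z_n] = 0$ and $\operatorname{var}(Z_n) = 1$.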

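Central limit theorem (CLT): under some mild conditions on $X_1, ..., X_n$, the distribution of $Z_n$ converges to the standard Gaussian $\mathcal{N}(0, 1)$ as $n \to \infty$.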

(58)


Exponential Families

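Consider a pdf or pmf $f_X(x|\theta)$, with parameter(s) $\theta \in \mathbb{R}^c$, for $X \in \mathcal{X}^n$.

$f_X(x|\theta)$ is in an exponential family if it has the form

$$f_X(x|\theta) = \frac{1}{Z(\theta)}\, h(x) \exp\left( \eta(\theta)^T \phi(x) \right) = h(x) \exp\left( \eta(\theta)^T \phi(x) - A(\theta) \right),$$

where $A(\theta) = \log Z(\theta)$ and

$$Z(\theta) = \int_{\mathcal{X}^n} h(x) \exp\left( \eta(\theta)^T \phi(x) \right) dx.$$

Canonical parameter(s): $\eta(\theta) \in \mathbb{R}^d$; that is, $\eta : \mathbb{R}^c \to \mathbb{R}^d$

Sufficient statistics: $\phi(x) \in \mathbb{R}^d$; that is, $\phi : \mathcal{X}^n \to \mathbb{R}^d$

Partition function: $Z(\theta)$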

(60)

Exponential Families

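$$f_X(x|\theta) = \frac{1}{Z(\theta)}\, h(x) \exp\left( \eta(\theta)^T \phi(x) \right)$$

Examples:

Bernoulli: $X \in \{0, 1\}$, with $P(X = 1) = \theta$; $f_X(x) = \theta^x (1 - \theta)^{1 - x}$,

$$f_X(x) = \exp\left( x \log \theta + (1 - x) \log(1 - \theta) \right) = \exp\left( x \log \frac{\theta}{1 - \theta} + \log(1 - \theta) \right),$$

thus $\eta(\theta) = \log \dfrac{\theta}{1 - \theta}$, $\phi(x) = x$, $Z(\theta) = \dfrac{1}{1 - \theta}$, and $h(x) = 1$.

Gaussian: $X \in \mathbb{R}$, $f_X(x) = \mathcal{N}(x; \mu, \sigma^2)$; $\theta = (\mu, \sigma^2)$,

$$f_X(x) = \frac{\exp\left( -\frac{(x - \mu)^2}{2 \sigma^2} \right)}{\sqrt{2 \pi \sigma^2}} = \frac{\exp\left( -\frac{x^2}{2 \sigma^2} + \frac{\mu x}{\sigma^2} - \frac{\mu^2}{2 \sigma^2} \right)}{\sqrt{2 \pi \sigma^2}},$$

thus $\eta(\mu, \sigma^2) = \left[ \mu / \sigma^2,\; -1 / (2 \sigma^2) \right]^T$, $\phi(x) = [x, x^2]^T$, $Z(\mu, \sigma^2) = \sqrt{2 \pi \sigma^2}\, \exp\left( \dfrac{\mu^2}{2 \sigma^2} \right)$, and $h(x) = 1$.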

(61)

More on Exponential Families

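Assume $\eta(\theta) = \theta \in \mathbb{R}^d$; canonical parameterization.

Independent identically distributed (i.i.d.) observations:

$$X_1, ..., X_m \sim f_X(x) = \frac{1}{Z(\theta)}\, h(x) \exp\left( \theta^T \phi(x) \right),$$

then

$$f_{X_1, ..., X_m}(x_1, ..., x_m) = \frac{1}{Z(\theta)^m} \left( \prod_{j=1}^{m} h(x_j) \right) \exp\left( \theta^T \sum_{j=1}^{m} \phi(x_j) \right)$$

Expected sufficient statistics:

$$\nabla_\theta \log Z(\theta) = \frac{\nabla_\theta Z(\theta)}{Z(\theta)} = \frac{1}{Z(\theta)} \int \phi(x)\, h(x) \exp\left( \theta^T \phi(x) \right) dx = E\left[ \phi(X) \right]$$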
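For the Bernoulli family in canonical form, the partition function is $Z(\eta) = 1 + e^{\eta}$ (which equals $1 / (1 - \theta)$ when $\eta = \log \frac{\theta}{1 - \theta}$). The sketch below checks numerically that the exponential-family form matches $\theta^x (1 - \theta)^{1 - x}$ and that the derivative of $\log Z$ equals $E[\phi(X)] = \theta$. The central finite difference and the value of `theta` are illustrative choices.

```python
import math


def log_partition(eta):
    """A(eta) = log Z(eta) = log(1 + e^eta) for the Bernoulli family."""
    return math.log(1.0 + math.exp(eta))


def bernoulli_expfam(x, eta):
    """pmf in canonical form: exp(eta * x - A(eta)), with h(x) = 1."""
    return math.exp(eta * x - log_partition(eta))


theta = 0.3  # illustrative value of P(X = 1)
eta = math.log(theta / (1.0 - theta))

direct = [theta ** x * (1.0 - theta) ** (1 - x) for x in (0, 1)]
expfam = [bernoulli_expfam(x, eta) for x in (0, 1)]

# Central finite difference approximating d/d(eta) log Z(eta).
eps = 1e-6
grad = (log_partition(eta + eps) - log_partition(eta - eps)) / (2.0 * eps)
print(direct, expfam, grad)
```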

(62)

Important Inequalities

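Markov’s inequality: if $X \geq 0$ is an RV with expectation $E(X)$, then, for $t > 0$,

$$P(X > t) \leq \frac{E(X)}{t}$$

Simple proof:

$$t\, P(X > t) = \int_t^{\infty} t\, f_X(x)\, dx \leq \int_t^{\infty} x\, f_X(x)\, dx = E(X) - \underbrace{\int_0^t x\, f_X(x)\, dx}_{\geq 0} \leq E(X)$$

Chebyshev’s inequality: if $\mu = E(Y)$ and $\sigma^2 = \operatorname{var}(Y)$, then

$$P(|Y - \mu| \geq s) \leq \frac{\sigma^2}{s^2}$$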
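Both bounds can be checked exactly on a small discrete distribution. The sketch below uses a fair six-sided die; the distribution and the thresholds `t` and `s` are illustrative assumptions.

```python
from fractions import Fraction

die = {k: Fraction(1, 6) for k in range(1, 7)}  # fair die, X >= 0


def prob(pred):
    """P(pred(X)) for the die pmf."""
    return sum(f for x, f in die.items() if pred(x))


mean = sum(x * f for x, f in die.items())
var = sum((x - mean) ** 2 * f for x, f in die.items())

t, s = 5, 2  # illustrative thresholds
markov_lhs, markov_rhs = prob(lambda x: x > t), mean / t
cheb_lhs, cheb_rhs = prob(lambda x: abs(x - mean) >= s), var / s ** 2
print(markov_lhs, markov_rhs)
print(cheb_lhs, cheb_rhs)
```

Both inequalities hold with room to spare here, which is typical: they are general-purpose bounds that use only the mean (Markov) or the mean and variance (Chebyshev).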

(63)

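Other Important Inequalities: Cauchy-Schwarz

Cauchy-Schwarz inequality for RVs:

$$|E(XY)| \leq \sqrt{E(X^2)\, E(Y^2)}$$

...why? Because $\langle X, Y \rangle \equiv E[XY]$ is a valid inner product.

Important corollary: let $E[X] = \mu$ and $E[Y] = \nu$,

$$|\operatorname{cov}(X, Y)| = \left| E[(X - \mu)(Y - \nu)] \right| \leq \sqrt{E[(X - \mu)^2]\, E[(Y - \nu)^2]} = \sqrt{\operatorname{var}(X)\, \operatorname{var}(Y)}$$

Implication for correlation:

$$\operatorname{corr}(X, Y) = \rho(X, Y) = \frac{\operatorname{cov}(X, Y)}{\sqrt{\operatorname{var}(X)}\, \sqrt{\operatorname{var}(Y)}} \in [-1, 1]$$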

(64)

Other Important Inequalities: Jensen

Recall that a real function $g$ is convex if, for any $x$, $y$, and $\alpha \in [0, 1]$,

$$g(\alpha x + (1 - \alpha) y) \leq \alpha g(x) + (1 - \alpha) g(y)$$

Jensen’s inequality: if $g$ is a real convex function, then $g(E(X)) \leq E(g(X))$.

Examples: $E(X)^2 \leq E(X^2) \Rightarrow \operatorname{var}(X) = E(X^2) - E(X)^2 \geq 0$.

(65)

Entropy and all that...

Entropy of a discrete RV $X \in \{1, ..., K\}$:

$$H(X) = -\sum_{x=1}^{K} f_X(x) \log f_X(x)$$

Positivity: $H(X) \geq 0$;

$H(X) = 0 \Leftrightarrow f_X(i) = 1$, for exactly one $i \in \{1, ..., K\}$.

Upper bound: $H(X) \leq \log K$;

$H(X) = \log K \Leftrightarrow f_X(x) = 1/K$, for all $x \in \{1, ..., K\}$.

Measure of uncertainty/randomness of $X$.

Continuous RV $X$, differential entropy:

$$h(X) = -\int f_X(x) \log f_X(x)\, dx$$

$h(X)$ can be positive or negative. Example: if $f_X(x) = \operatorname{Uniform}(x; a, b)$, then $h(X) = \log(b - a)$.

If $f_X(x) = \mathcal{N}(x; \mu, \sigma^2)$, then $h(X) = \frac{1}{2} \log(2 \pi e \sigma^2)$.

If $\operatorname{var}(Y) = \sigma^2$, then $h(Y) \leq \frac{1}{2} \log(2 \pi e \sigma^2)$.
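The two extreme cases of the discrete entropy (a point mass and the uniform pmf) can be verified numerically. This is a minimal sketch using natural logarithms; the pmfs are illustrative, and the convention $0 \log 0 = 0$ is implemented by skipping zero-probability terms.

```python
import math


def entropy(pmf):
    """H(X) = -sum_x f(x) log f(x), in nats, with 0 log 0 taken as 0."""
    return -sum(p * math.log(p) for p in pmf if p > 0)


K = 4
uniform = [1.0 / K] * K             # maximal entropy: log K
point_mass = [1.0, 0.0, 0.0, 0.0]   # minimal entropy: 0
skewed = [0.7, 0.1, 0.1, 0.1]       # somewhere in between

h_uniform = entropy(uniform)
h_point = entropy(point_mass)
h_skewed = entropy(skewed)
print(h_point, h_skewed, h_uniform, math.log(K))
```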

(66)

Kullback-Leibler divergence

Kullback-Leibler divergence (KLD) between two pmf:

$$D(f_X \| g_X) = \sum_{x=1}^{K} f_X(x) \log \frac{f_X(x)}{g_X(x)}$$

Positivity: $D(f_X \| g_X) \geq 0$;

$D(f_X \| g_X) = 0 \Leftrightarrow f_X(x) = g_X(x)$, for $x \in \{1, ..., K\}$.

KLD between two pdf:

$$D(f_X \| g_X) = \int f_X(x) \log \frac{f_X(x)}{g_X(x)}\, dx$$

Positivity: $D(f_X \| g_X) \geq 0$;

$D(f_X \| g_X) = 0 \Leftrightarrow f_X(x) = g_X(x)$, almost everywhere.

(67)

Mutual information

Mutual information (MI) between two random variables:

$$I(X; Y) = D\left( f_{X,Y} \,\|\, f_X f_Y \right)$$

Positivity: $I(X; Y) \geq 0$;

$I(X; Y) = 0 \Leftrightarrow X, Y$ are independent.
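Since mutual information is a KL divergence between the joint pmf and the product of its marginals, both quantities can be computed with the same helper. The sketch below uses illustrative joint pmfs over binary variables; it assumes $g(x) > 0$ wherever $f(x) > 0$, and the function names are assumptions.

```python
import math


def kl_divergence(f, g):
    """D(f || g) for pmfs given as dicts; assumes g[x] > 0 wherever f[x] > 0."""
    return sum(fx * math.log(fx / g[x]) for x, fx in f.items() if fx > 0)


def mutual_information(joint):
    """I(X; Y) = D(f_XY || f_X f_Y) for a joint pmf {(x, y): prob}."""
    f_X, f_Y = {}, {}
    for (x, y), f in joint.items():
        f_X[x] = f_X.get(x, 0.0) + f
        f_Y[y] = f_Y.get(y, 0.0) + f
    product = {(x, y): f_X[x] * f_Y[y] for x in f_X for y in f_Y}
    return kl_divergence(joint, product)


# Illustrative joint pmfs: one dependent, one built as a product of marginals.
joint = {(0, 0): 0.2, (0, 1): 0.4, (1, 0): 0.1, (1, 1): 0.3}
independent = {(x, y): fx * fy
               for x, fx in {0: 0.6, 1: 0.4}.items()
               for y, fy in {0: 0.3, 1: 0.7}.items()}

mi_dependent = mutual_information(joint)
mi_independent = mutual_information(independent)
kl_self = kl_divergence(joint, joint)
print(mi_dependent, mi_independent, kl_self)
```

The dependent pair has strictly positive mutual information, while the product-form joint gives zero up to floating-point rounding.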
