• No results found

Statistical Learning. Lecture 0 Mathematical Tools: Algebra and Probabilities

N/A
N/A
Protected

Academic year: 2021

Share "Statistical Learning. Lecture 0 Mathematical Tools: Algebra and Probabilities"

Copied!
68
0
0

Loading.... (view fulltext now)

Full text

(1)

Statistical Learning

Lecture 0

Mathematical Tools: Algebra and Probabilities

M´ario A. T. Figueiredo

Instituto Superior T´ecnico & Instituto de Telecomunica¸c˜oes University of Lisbon, Portugal

(2)
(3)
(4)

Linear Algebra (Informally)

Linear algebra provides (among many other things) a compact way of representing, studying, and solving linear systems of equations

Example: the system

4 x1− 5 x2= −13

−2 x1+ 3 x2= 9

can be written compactly as Ax = b , where

A =  4 −5 −2 3  , b =  −13 9  , and can be solved as

x = A−1b =  1.5 2.5 1 2   −13 9  =  3 5  .

(5)

Notation: Matrices and Vectors

A ∈ Rm×n is a matrixwith m rows and n columns.

A =    A1,1 · · · A1,n .. . . .. ... Am,1 · · · Am,n   .

x ∈ Rn is a vector with n components,

x =    x1 .. . xn   .

A(column) vector is a matrix with n rows and 1 column.

(6)

Matrix Transpose and Products

Given matrix A ∈ Rm×n, itstransposeAT is such that (AT)i,j = Aj,i.

A matrix A issymmetricif AT = A.

Given matrices A ∈ Rm×n and B ∈ Rn×p, their product is

C = A B ∈ Rm×p where Ci,j =

n

X

k=1

Ai,kBk,j

Inner product between vectors x, y ∈ Rn:

hx, yi = xTy = yTx =

n

X

i=1

xiyi ∈ R.

Outer product: x ∈ Rn and y ∈ Rm: x yT ∈ Rn×m, where

(7)

Properties of Matrix Products and Transposes

Given matrices A ∈ Rm×n and B ∈ Rn×p, their product is

C = A B ∈ Rm×p where Ci,j =

n

X

k=1

Ai,kBk,j

Matrix product isassociative: (AB)C = A(BC).

In general, matrix product is not commutative: AB 6= BA.

Transpose of product: (A B)T = BTAT.

(8)

Special Matrices

Theidentity matrix I ∈ Rn×n is a square matrix such that

Iij =  1, i = j 0, i 6= j I =      1 0 · · · 0 0 1 · · · 0 .. . ... . .. ... 0 0 · · · 1     

Neutral element of matrix product: A I = I A = A.

Diagonal matrix: (i 6= j) ⇒ Ai, j = 0.

Upper triangular matrix: (j < i) ⇒ Ai, j = 0. Lower triangularmatrix: (j > i) ⇒ Ai, j = 0.

(9)

Eigenvalues, eigenvectors, determinant, trace

A vector x ∈ Rn is an eigenvector of matrix A ∈ Rn×n if

A x = λ x,

where λ ∈ R is the corresponding eigenvalue.

The eigenvalues of a diagonal matrix are the elements in the diagonal. (quiz: what are the eigenvectors?)

Matrix trace: trace(A) =P

iAi,i =

P

iλi

Matrix determinant: |A| = det(A) =Q

iλi

Properties of determinant: |AB| = |A||B|, |AT| = |A|,

|α A| = αn|A|

Properties of the trace: trace(A + B) = trace(A) + trace(B),

trace(ABC) = trace(CAB) = trace(BCA) (cyclic permutations)

(10)

Matrix Inverse

Matrix A ∈ Rn×n in invertible if there is B ∈ Rn×n s.t.

AB = BA = I.

...matrix B, such that AB = BA = I, denoted B = A−1.

Matrix A ∈ Rn×n is invertible ⇔ det(A) 6= 0.

Determinant of inverse: det(A−1) = 1

det(A).

Solving system Ax = b, if A is invertible: x = A−1b.

Properties: (A−1)−1 = A, (A−1)T = (AT)−1, (A B)−1 = B−1A−1

There are many algorithms to compute A−1; general case,

(11)

Quadratic Forms and Positive (Semi-)Definite Matrices

Given matrix A ∈ Rn×n and vector x ∈ Rn,

xTA x = n X i=1 n X j=1 Ai, jxixj ∈ R

is called aquadratic form.

A symmetric matrix A ∈ Rn×n ispositive semi-definite (PSD) if, for

any x ∈ Rn, xTA x ≥ 0.

A symmetric matrix A ∈ Rn×n ispositive definite (PD) if, for any

x ∈ Rn, (x 6= 0) ⇒ xTA x > 0.

Matrix A ∈ Rn×n is PSD ⇔ all λi(A) ≥ 0.

(12)

A Bit More Formal: Vector Spaces

Avector space over a field F (e.g., R) is a set V and a pair of

operations, +: V × V → V and ·: F × V → V, that satisfy the

following axioms, ∀x, y, z ∈ V and ∀α, β ∈ F:

X + is associative and commutative;

X ∃0 ∈ V, such that 0 + x = x;

X ∃ − x ∈ V, such that −x + x = 0;

X α · (β · x) = (α · β) · x;

X 1 · x = x, where 1 ∈ F is such that 1 · α = α;

X α · (x + y) = α · x + α · y;

X (α + β) · x = α · x + β · x.

Elements of V are calledvectors; elements of F arescalars.

(13)

Vector Space: Examples

“Usual vectors” (Rn, +, ·) over field R

X x = (x1, ..., xn), y = (y1, ..., yn), x + y = (x1+ y1, ..., xn+ yn);

X x = (x1, ..., xn), α x = (α x1, ..., α xn)

Real matrices (Rm×n, +, ·) over field R

X usual matrix addition and multiplication by scalar;

Complex matrices (Cm×n, +, ·) over field C (complex numbers).

Binary vectors ({0, 1}n, +, ·) over GF (2) = {0, 1} (Galois field),

X + is addition modulo 2 : 0 + 0 = 0, 0 + 1 = 1 + 0 = 1, 1 + 1 = 0.

X · is standard multiplication: 0 · 0 = 0 · 1 = 1 · 0 = 0, 1 · 1 = 1.

Set of all functions f : Ω → R, with point-wise addition and multiplication, is a vector space over R.

(14)

Norm on Vector Space V

Anorm is a function k · k : V → R+ satisfying,

∀x, y ∈ V and ∀α ∈ R,

X homogeneity, kα xk = |α| kxk;

X triangle inequality, kx + yk ≤ kxk + kyk;

X definiteness, kxk = 0 ⇔ x = 0.

Aseminorm may not satisfy definiteness.

Classical example in Rn: Euclidean norm: kxk =

s n P i=1 x2 i.

Two norms k · k and k · k0 are equivalentif ∃α, β > 0 such that

∀x ∈ V, αkxk ≤ kxk0 ≤ βkxk;

(15)

Other Norms

The `p norm of a vector x ∈ Rn, where p ≥ 1,

kxkp = n X i=1 |xi|p !1/p . Notable cases: I `2 (Euclidean) norm. I `1 norm, kxk1=Pi|xi|.

I `∞ norm, limp→∞kxkp= max{|x1|, ..., |xn|} ≡ kxk∞

I `0 “norm”(not a norm), limp→0kxkp= #{i : xi6= 0} ≡ kxk0

Some equivalences:kxk2 ≤ kxk1 ≤ √ n kxk2 kxk∞ ≤ kxk2 ≤ √ n kxk∞ kxk∞ ≤ kxk1 ≤ n kxk∞

(16)

Inner Product on Vector Space V Over R

An inner product is a function h·, ·i : V × V → R, satisfying, ∀x, y ∈ V and ∀α ∈ R,

X symmetry, hx, yi = hy, xi

X (bi)linearity, hαx + βz, yi = αhx, yi + βhz, yi;

X definiteness, hx, xi ≥ 0 and hx, xi = 0 ⇔ x = 0.

Standard inner product in Rn: hx, yi =Pn

i=1xiyi= xTy

Also an inner product in Rn: hx, yi = xTM y, where M is PD

Norm (is it?) induced by an inner product kxk =phx, xi

X for hx, yi = xTy, then kxk2= xTx =Pn i=1x 2 i (Euclidean norm) X for hx, yi = xTM y, then kxk2 M = xTM x (Mahalanobis norm)

(17)

Key Properties of Inner Products

If k · k is induced by inner product h·, ·i (that is kxk2= hx, xi), then

kx + yk2= hx + y, x + yi = kxk2+ kyk2+ 2hx, yi

Cauchy-Schwarz inequality: |hx, yi| ≤ kxk kyk

“Des d´emonstrations qui font boum!” (“Proofs that make boom!”)

[Jean-Baptiste Hiriart-Urruty] 0 ≤ 1 2 x kxk± y kyk 2 = 1 ± hx, yi kxk kyk ⇔  hx, yi ≤ kxk kyk −hx, yi ≤ kxk kyk

Corollary: k · k is indeed a norm, as it satisfies the triangle inequality:

kx + yk2≤ kxk2

+ kyk2+ 2|hx, yi| ≤ kxk2+ kyk2+ 2kxk kyk = kxk + kyk2 Hilbert space: complete vector space equipped with an inner product.

(18)

Basis and Dimension of a Vector Space V

Basis: collection of vectors B = {b1, b2, ...} ⊂ V satisfying:

X linear independence: for any finite linear combination α1b1+ ... + αmbm= 0 ⇒ α1= · · · = αm= 0

X spanningability: any vector v ∈ V can be written as v = α1b1+ ... + αmbn; in other words, V = span(B).

Dimension of V: dim(V) = #B Orthogonal basis: i 6= j ⇒ hbi, bji = 0

(19)

Rank, Range, and Null Space

Consider some real matrix A ∈ Rm×n

Range of A: R(A) =y ∈ Rm: ∃ x ∈ Rnsuch that y = Ax ⊆ Rm Null space of A: N (A) =x ∈ Rn: Ax = 0 ⊆ Rn.

Both R(A) and N (A) are vector spaces.

Dimension theorem: dim R(A) + dim N (A) = n

Rank: rank(A) = dim R(A) ≤ min{m, n}

(20)

Singular Value Decomposition (SVD)

Any rank-r matrix A ∈ Rm×n can be written as A = U ΛVT

X columns of U ∈ Rm×r are an orthonormal basis of R(A);

X columns of V ∈ Rn×r are an orthonormal basis of R(AT);

X Λ = diag(σ1, ..., σr) is a r × r diagonal matrix;

X σ1, ..., σr are calledsingular values.

X σ1, ..., σr are square roots of the eigenvalues of ATA or AAT.

Orthonormality of U and V : UTU = I and VTV = I .

(21)

Singular Value Decomposition (SVD)

A = U ΛVT, where U ∈ Rm×r and V ∈ Rn×r.

(22)

Singular Value Decomposition (SVD)

A = U ΛVT, where U ∈ Rm×r and V ∈ Rn×r.

(23)
(24)
(25)
(26)

Recommended Reading

Z. Kolter and C. Do, “Linear Algebra Review and Reference”, Stanford University, 2015 (https://tinyurl.com/44x2qj4)

(27)
(28)

Probability theory

“Essentially, all models are wrong, but some are useful”;G. Box, 1987

The study of probability has roots in games of chance (dice, cards, ...)

Great names in science:Cardano, Fermat, Pascal, Laplace, Gauss,

Huygens, Legendre, Poisson, Kolmogorov, Bernoulli, Cauchy, Gibbs, Boltzman, de Finetti, ...

Natural tool to modeluncertainty, information, knowledge, belief, ...

(29)

What is probability?

Classical definition: P(A) = NA

N

...with N mutually exclusive equally likely outcomes,

NA of which result in the occurrence of A. Laplace, 1814

Example: P(randomly drawn card is ♣) = 13/52.

Example: P(getting 1 in throwing a fair die) = 1/6.

Frequentist definition: P(A) = lim

N →∞

NA

N

...relative frequency of occurrence of A in infinite number of trials.

Subjective probability: P(A) is a degree of belief. de Finetti, 1930s

(30)

“The mathematics of probability can be developed on an entirely

(31)

Key concepts: Sample space and events

Sample space X = set of possible outcomes of a random experiment.

Examples:

I Tossing two coins: X = {HH, T H, HT, T T }

I Roulette: X = {1, 2, ..., 36}

I Draw a card from a shuffled deck: X = {A♣, 2♣, ..., Q♦, K♦}.

An event A is a subset of X : A ⊆ X (also written A ∈ 2X).

Examples:

I “exactly one H in 2-coin toss”:

A = {T H, HT } ⊂ {HH, T H, HT, T T }.

I “odd number in the roulette”: B = {1, 3, ..., 35} ⊂ {1, 2, ..., 36}.

(32)

Key concepts: Sample space and events

Sample space X = set of possible outcomes of a random experiment.

(More delicate) examples:

I Time until you receive the next email X = R+ I Location of the next rain drop X = R2

An event A is a measurablesubset of X , i.e., A ∈ σ(F ), F ⊆ 2X. σ-algebra: σ(F ) I A ∈ σ(F ) ⇒ Ac∈ σ(F ) I A1, A2, ... ∈ σ(F ) ⇒ ∞ [ i=1 Ai∈ σ(F )

Classical example in Rn: the collection of Lebesgue measurable sets

constitute a σ−algebra.

(33)

Kolmogorov’s Axioms for Probability

Probability is a function that maps events A into the interval [0, 1]. Kolmogorov’s axioms(1933) for probability P : σ(F ) → [0, 1]

I For any A ∈ σ(F ), P(A) ≥ 0 I P(X ) = 1

I If A1, A2... ⊆ X are disjoint events, then P [ i Ai  =X i P(Ai)

From these axioms, many results can be derived. Examples:

I P(∅) = 0

I C ⊂ D ⇒ P(C) ≤ P(D)

I P(A ∪ B) = P(A) + P(B) − P(A ∩ B) I P(A ∪ B) ≤ P(A) + P(B) (union bound)

(34)

Conditional Probability and Independence

If P(B) > 0, P(A|B) = P(A ∩ B)

P(B) (conditional prob. of A given B) ...satisfies all of Kolmogorov’s axioms:

I For any A ⊆ X , P(A|B) ≥ 0 I P(X |B) = 1

I If A1, A2, ... ⊆ X are disjoint, then P [ i Ai B  =X i P(Ai|B)

(35)

Conditional Probability and Independence

If P(B) > 0, P(A|B) = P(A ∩ B)

P(B)

Events A, B are independent (A ⊥⊥ B) ⇔ P(A ∩ B) = P(A) P(B).

Relationship with conditional probabilities:

A ⊥⊥ B ⇔ P(A|B) = P(A ∩ B)

P(B) =

P(A) P(B)

P(B) = P(A)

Example: X = “52 cards”, A = {3♥, 3♣, 3♦, 3♠}, and B = {A♥, 2♥, ..., K♥}; then, P(A) = 1/13, P(B) = 1/4

P(A ∩ B) = P({3♥}) = 1 52 P(A) P(B) = 1 13 1 4 = 1 52 P(A|B) = P(“3”|“♥”) = 1 13 = P(A)

(36)

Bayes Theorem

Law of total probability: if A1, ..., An are a partition of X

P(B) = X i P(B|Ai)P(Ai) =X i P(B ∩ Ai)

Bayes’ theorem: if {A1, ..., An} is a partition of X

P(Ai|B) = P(B ∩ Ai ) P(B) = P(B|Ai) P(Ai) X i P(B|Ai)P(Ai) = P(B|Ai) P(Ai) P(B)

(37)

Random Variables

A (real)random variable (RV) is a function: X : X → R

I Discrete RV: range of X is countable (e.g., N or {0, 1})

I Continuous RV: range of X is uncountable (e.g., R or [0, 1])

I Example: number of head in tossing two coins, X = {HH, HT, T H, T T },

X(HH) = 2, X(HT ) = X(T H) = 1, X(T T ) = 0. Range of X = {0, 1, 2}.

(38)

Random Variables: Distribution Function

Distribution function: FX(x) = P({ω ∈ X : X(ω) ≤ x})

Example: number of heads in tossing 2 coins; range(X) = {0, 1, 2}.

Probability mass function(discrete RV): fX(x) = P(X = x),

FX(x) =

X

x≤x

(39)

Properties of Distribution Functions

FX : R → [0, 1] is the distribution function of some r.v. X iff:

it is non-decreasing: x1 < x2⇒ FX(x1) ≤ FX(x2);

lim

x→−∞FX(x) = 0;

lim

x→+∞FX(x) = 1;

it is right semi-continuous: lim

x→z+FX(x) = FX(z) Further properties: P(X = x) = fX(x) = FX(x) − lim z→x−FX(z); P(z < X ≤ y) = FX(y) − FX(z); P(X > x) = 1 − FX(x).

(40)

Important Discrete Random Variables

Uniform: X ∈ {x1, ..., xK}, pmf fX(xi) = 1/K. Bernoulli RV: X ∈ {0, 1}, pmf fX(x) =  p ⇐ x = 1 1 − p ⇐ x = 0

Can be written compactly as fX(x) = px(1 − p)1−x.

Binomial RV: X ∈ {0, 1, ..., n} (sum on n Bernoulli RVs)

fX(x) = Binomial(x; n, p) = n x  px(1 − p)(n−x) Binomial coefficients (“n choose x”): n x  = n! (n − x)! x!

(41)

More Important Discrete Random Variables

Geometric(p): X ∈ N, pmf fX(x) = p(1 − p)x−1.

(e.g., number of trials until the first success). Poisson(λ): X ∈ N ∪ {0}, pmf fX(x) = e−λλx x! Notice thatP∞ x=0λ x x! = e λ, thus P∞ x=0fX(x) = 1.

“...probability of the number of independent occurrences in a fixed (time/space) interval if these occurrences have known average rate”

(42)

Random Variables: Distribution Function

Distribution function: FX(x) = P({ω ∈ X : X(ω) ≤ x})

Example: continuous RV with uniform distribution on [a, b].

Probability density function(pdf, continuous RV): fX(x)

FX(x) = Z x −∞ fX(u) du, P(X ∈ [c, d]) = Z d c fX(x) dx,

(43)

Important Continuous Random Variables

Uniform: fX(x) = Uniform(x; a, b) =  1 b−a ⇐ x ∈ [a, b] 0 ⇐ x 6∈ [a, b] (previous slide). Gaussian: fX(x) = N (x; µ, σ2) = 1 √ 2 π σ2e −(x−µ)2 2 σ2 Exponential: fX(x) = Exp(x; λ) =  λe−λ x ⇐ x ≥ 0 0 ⇐ x < 0

(44)

Expectation of Random Variables

Expectation: E(X) =        X i xifX(xi) X ∈ {x1, ..., xK} ⊂ R Z ∞ −∞ x fX(x) dx X continuous

Example: Bernoulli, fX(x) = px(1 − p)1−x, for x ∈ {0, 1}.

E(X) = 0 (1 − p) + 1 p = p.

Example: Binomial, fX(x) = nxpx(1 − p)n−x, for x ∈ {0, ..., n}.

E(X) = n p.

Example: Gaussian, fX(x) = N (x; µ, σ2). E(X) = µ. Linearity of expectation:

(45)

Expectation of Functions of Random Variables

E(g(X)) =        X i g(xi)fX(xi) X discrete, g(xi) ∈ R Z ∞ −∞ g(x) fX(x) dx X continuous

Example: variance, var(X) = E X − E(X)2= E(X2) − E(X)2

Example: Bernoulli variance, E(X2

) = E(X) = p , thus var(X) = p(1 − p).

Example: Gaussian variance, E (X − µ)2 = σ2.

Probability as expectation of indicator, 1A(x) =

 1 ⇐ x ∈ A 0 ⇐ x 6∈ A P(X ∈ A) = Z A fX(x) dx = Z 1A(x) fX(x) dx = E(1A(X))

(46)

Moments of Random Variables

Non-central moments of order k:

E(|X|k) =        X i |xi|kfX(xi) X discrete, g(xi) ∈ R Z ∞ −∞ |x|kfX(x) dx X continuous

Central moments of order k ∈ N: E |X − E(X)|k.

...if the integral/sum exits.

If k-th moment exists, then the j-th moment exists, for j ≤ k.

Z ∞ −∞ |x|jfX(x) dx = Z |x|≤1 |x|jfX(x) dx + Z |x|>1 |x|jfX(x) dx ≤ Z |x|≤1 fX(x) dx + Z |x|>1 |x|kfX(x) dx ≤ 1 + E |X|k < ∞

(47)

Two (or More) Random Variables

Joint pmf of two discrete RVs: fX,Y(x, y) = P(X = x ∧ Y = y).

Extends trivially to more than two RVs.

Joint pdf of two continuous RVs: fX,Y(x, y), such that

P (X, Y ) ∈ A =

Z Z

A

fX,Y(x, y) dx dy, A ∈ σ(R2)

Extends trivially to more than two RVs.

Marginalization: fY(y) =        X x

fX,Y(x, y), if X is discrete

Z ∞ −∞ fX,Y(x, y) dx, if X continuous Independence: X ⊥⊥ Y ⇔ fX,Y(x, y) = fX(x) fY(y) ⇒

(48)

Conditionals and Bayes’ Theorem

Conditional pmf(discrete RVs):

fX|Y(x|y) = P(X = x|Y = y) = P(X = x ∧ Y = y)

P(Y = y) =

fX,Y(x, y)

fY(y)

.

Conditional pdf(continuous RVs): fX|Y(x|y) =

fX,Y(x, y)

fY(y) ...the meaning is technically delicate.

Bayes’ theorem: fX|Y(x|y) = fY |X(y|x) fX(x) fY(y)

(pdf or pmf).

(49)

Joint, Marginal, and Conditional Probabilities: An Example

A pair of binary variables X, Y ∈ {0, 1}, with jointpmf:

Marginals: fX(0) = 15 +25 = 35, fX(1) = 101 + 103 = 104,

fY(0) = 15 +101 = 103, fY(1) = 25 +103 = 107 . Conditionalprobabilities:

(50)

An Important Multivariate RV: Multinomial

Multinomial: X = (X1, ..., XK), Xi ∈ {0, ..., n}, such that

P iXi= n, fX(x1, ..., xK) =  n x1x2··· xK p x1 1 p x2 2 · · · p xK k ⇐ P ixi = n 0 ⇐ P ixi 6= n  n x1 x2 · · · xK  = n! x1! x2! · · · xK!

Parameters: p1, ..., pK ≥ 0, such thatPipi = 1.

Generalizes the binomial from binary to K-classes.

Example: tossing n independent fair dice, p1 = · · · = p6 = 1/6.

(51)

Covariance, Correlation, and all that...

Covariance between two RVs, X, Y ∈ R :

cov(X, Y ) = Eh X − E(X) Y − E(Y )i = E(X Y ) − E(X) E(Y )

Relationship with variance: var(X) = cov(X, X).

Correlation: corr(X, Y ) = ρ(X, Y ) = √ cov(X,Y )

var(X)√var(Y ) ∈ [−1, 1]

X ⊥⊥ Y ⇔ fX,Y(x, y) = fX(x) fY(y)

6⇐ cov(X, Y ) = 0(example)

Covariance matrix of multivariate RV, X ∈ Rn:

cov(X) = Eh X − E(X)

X − E(X)Ti

(52)

An Important Multivariate RV: Gaussian

Multivariate Gaussian: X ∈ Rn, fX(x) = N (x; µ, C) = 1 pdet(2 π C)exp  −1 2(x − µ) TC−1(x − µ) 

Parameters: vector µ ∈ Rnand matrix C ∈ Rn×n.

(53)

Key Properties of Multivariate Gaussian

Marginals are Gaussian. Conditionals are Gaussian.

(54)

Multivariate Gaussian Conditionals

fX(x) = N (x; µ, C) = 1 pdet(2 π C)exp  −1 2(x − µ) TC−1(x − µ)  Partition X = [X1, ..., Xi | {z } Xa , Xi+1, ..., Xn | {z } Xb ]T and µ = [µ1, ..., µi | {z } µa , µi+1, ..., µn | {z } µb ]T;

Corresponding partition of covariance: C =  Caa Cab Cba Cbb  fXa(·) = N (·; µa, Caa); fXb(·) = N (·; µb, Cbb); fXb|Xa(xb|xa) = N xb; µb+ CbaC −1 aa(xa− µa), Cbb− CbaCaa−1Cab 

(55)

More on Expectations and Covariances

Let A ∈ Rn×n be a matrix and a ∈ Rn a vector.

If E(X) = µ and Y = AX, then E(Y ) = Aµ;

If cov(X) = C and Y = AX, then cov(Y ) = ACAT;

If cov(X) = C and Y = aTX ∈ R, then var(Y ) = aTCa ≥ 0;

If cov(X) = C and Y = C−1/2X, then cov(Y ) = I;

If fX(x) = N (x; 0, I) and Y = µ + C1/2X, then

fY(y) = N (y; µ, C);

If fX(x) = N (x;µ, C) and Y = C−1/2(X − µ), then

(56)

Transformations

X ∼ fX and Y = g(X) ⇒ fY =?

Discrete case:

fY(y) = P(g(X) = y) = P {x : g(x) = y} = P g−1(y)



Continuous case (for g strictly monotonic, thus invertible):

fY(y) = fX g−1(y)  d g−1(y) d y

Continuous multivariate case (invertible):

fY(y) = fX g−1(y)



detJg−1(y)

(57)

Central Limit Theorem

Take n independent r.v. X1, ..., Xn such that E[Xi] = µi and

var(Xi) = σi2 Their sum, Yn= n X i=1 Xi satisfies: E[Yn] = n X i=1 µi ≡ µ(n) var(Yn) = X i σi2 ≡ σ(n)2 ...thus, if Zn= Yn− µ(n) σ(n) E[Zn] = 0 var(Zn) = 1

Central limit theorem (CLT): under some mild conditions on X1, ..., Xn

lim

(58)

Central Limit Theorem

(59)

Exponential Families

Consider a pdf or pmf fX(x|θ), with parameter(s) θ ∈ Rc, for X ∈ Xn

fX(x|θ) is in anexponential family if it has the form

fX(x|θ) =

1

Z(θ)h(x) exp η(θ)

Tφ(x) = h(x) exp η(θ)Tφ(x) − A(θ),

where A(θ) = log Z(θ) and Z(θ) =

Z

Xn

h(x) exp η(θ)Tφ(x) dx.

Canonical parameter(s): η(θ) ∈ Rd; that is, η : Rc→ Rd

Sufficient statistics: φ(x) ∈ Rd; that is, φ : Xn→ Rd

Partition function: Z(θ)

(60)

Exponential Families

fX(x|θ) = 1 Z(θ)h(x) exp η(θ) Tφ(x) Examples: Bernoulli: X ∈ {0, 1}, with P(X = 1) = θ; fX(x) = θx(1 − θ)1−x,

fX(x) = exp x log θ+(1−x) log(1−θ) = exp x log1−θθ +log(1−θ),

thus η(θ) = log1−θθ , φ(x) = x, Z(θ) = 1−θ1 , and h(x) = 1.

Gaussian: X ∈ R, fX(x) = N (x; µ, σ2); θ = (µ, σ2) fX(x) = exp −(x−µ)22  √ 2πσ2 = exp −x22 + µx σ2 − µ2 2σ2  √ 2πσ2 , thus η(µ, σ2) = [µ/σ2, −1/(2σ2)]T, φ(x) = [x, x2]T, Z(µ, σ2) =√2πσ2exp µ2 2σ2, and h(x) = 1.

(61)

More on Exponential Families

Assume η(θ) = θ ∈ Rd; canonical parameterization.

Independent identically distributed (i.i.d.) observations: X1, ..., Xm∼ fX(x) = 1 Z(θ)h(x) exp θ Tφ(x) then fX1,...,Xm(x1, ..., xm) = 1 Z(θ)m   m Y j=1 h(xj)  exp  θT m X j=1 φ(xj) 

Expected sufficient statistics:

θlog Z(θ) = ∇θZ(θ) Z(θ) = 1 Z(θ) Z φ(x)h(x) exp θTφ(x) dx = Eφ(X)

(62)

Important Inequalities

Markov’s ineqality: if X ≥ 0 is an RV with expectation E(X), then

P(X > t) ≤ E(X) t Simple proof: t P(X > t) = Z ∞ t t fX(x) dx ≤ Z ∞ t x fX(x) dx = E(X)− Z t 0 x fX(x) dx | {z } ≥0 ≤ E(X)

Chebyshev’s inequality: µ = E(Y ) and σ2= var(Y ), then

P(|Y − µ| ≥ s) ≤ σ

2

s2

(63)

Other Important Inequalities: Cauchy-Schwartz

Cauchy-Schwartz’s inequalityfor RVs:

|E(X Y )| ≤pE(X2) E(Y2)

...why? Because hX, Y i ≡ E[XY ] is a valid inner product Important corollary: let E[X] = µ and E[X] = ν,

|cov(X, Y )| = E[(X − µ)(Y − ν)]

≤pE[(X − µ)2] E[(Y − ν)2]

=pvar(X) var(Y )

Implication for correlation:

corr(X, Y ) = ρ(X, Y ) = cov(X, Y )

(64)

Other Important Inequalities: Jensen

Recall that a real function g is convex if, for any x, y, and α ∈ [0, 1] g(αx + (1 − α)y) ≤ αg(x) + (1 − α)g(y)

Jensen’s inequality: if g is a real convex function, then g(E(X)) ≤ E(g(X))

Examples: E(X)2 ≤ E(X2) ⇒ var(X) = E(X2) − E(X)2≥ 0.

(65)

Entropy and all that...

Entropy of a discrete RV X ∈ {1, ..., K}: H(X) = − K X x=1 fX(x) log fX(x) Positivity: H(X) ≥ 0 ;

H(X) = 0 ⇔ fX(i) = 1, for exactly one i ∈ {1, ..., K}.

Upper bound: H(X) ≤ log K ;

H(X) = log K ⇔ fX(x) = 1/k, for all x ∈ {1, ..., K}

Measure ofuncertainty/randomness of X

Continuous RV X,differential entropy: h(X) = −

Z

fX(x) log fX(x) dx

h(X) can be positive or negative. Example, if

fX(x) = Uniform(x; a, b), h(X) = log(b − a).

If fX(x) = N (x; µ, σ2), then h(X) = 12log(2πeσ2).

If var(Y ) = σ2, then h(Y ) ≤ 12log(2πeσ2)

(66)

Kullback-Leibler divergence

Kullback-Leibler divergence(KLD) between two pmf:

D(fXkgX) = K X x=1 fX(x) log fX(x) gX(x) Positivity: D(fXkgX) ≥ 0 D(fXkgX) = 0 ⇔ fX(x) = gX(x), for x ∈ {1, ..., K} KLD between two pdf: D(fXkgX) = Z fX(x) log fX(x) gX(x) dx Positivity: D(fXkgX) ≥ 0 D(fXkgX) = 0 ⇔ fX(x) = gX(x), almost everywhere

(67)

Mutual information

Mutual information (MI) between two random variables:

I(X; Y ) = D fX,YkfXfY



Positivity: I(X; Y ) ≥ 0

I(X; Y ) = 0 ⇔ X, Y are independent.

(68)

Recommended Reading

A. Maleki and T. Do, “Review of Probability Theory”, Stanford University, 2017 (https://tinyurl.com/pz7p9g5)

K. Murphy, “Machine Learning: A Probabilistic Perspective”, MIT Press, 2012.

L. Wasserman, “All of Statistics: A Concise Course in Statistical Inference”, Springer, 2004.

References

Related documents

Proposition 3.1 If x and y are eigen values of the real symmetric matrix A, corresponding to different eigen values λ and µ, then x and y are orthogonal.. The proof is

A_reconstructed = np.matmul(U, np.matmul(Sigma, Vt)) print('Are all entries equal up to machine precision?') print('Yes' if np.allclose(A, A_reconstructed) else

–  Probability &amp; Statistics –  Information Theory –  Linear Algebra.

Click here is being made beyond for cc south africa, a solution vector spaces and lecture notes linear algebra mit courses at least squares projection matrix sums and linear

With this program, when &lt;condition&gt; takes a matrix value, the program is continuously executed if all the entries are non-zero, and the program exits the loop statement if

The mathematical abstraction to understand these situations is the concept of a vector space, which consists of vector like objects on which a set of scalars, to be called as

Every symmetric (hermitian) matrix of dimension n has a set of (not necessarily unique) n orthogonal eigenvectors... Eigendecomposition is

Nalmefene is recommended as an option for reducing alcohol consumption, for people with alcohol dependence, who have a high drinking risk level (defined as alcohol.. consumption