Statistical Learning
Lecture 0
Mathematical Tools: Algebra and Probabilities
M´ario A. T. Figueiredo
Instituto Superior T´ecnico & Instituto de Telecomunica¸c˜oes University of Lisbon, Portugal
Linear Algebra (Informally)
Linear algebra provides (among many other things) a compact way of representing, studying, and solving linear systems of equations
Example: the system
4 x1− 5 x2= −13
−2 x1+ 3 x2= 9
can be written compactly as Ax = b , where
A = 4 −5 −2 3 , b = −13 9 , and can be solved as
x = A−1b = 1.5 2.5 1 2 −13 9 = 3 5 .
Notation: Matrices and Vectors
A ∈ Rm×n is a matrixwith m rows and n columns.
A = A1,1 · · · A1,n .. . . .. ... Am,1 · · · Am,n .
x ∈ Rn is a vector with n components,
x = x1 .. . xn .
A(column) vector is a matrix with n rows and 1 column.
Matrix Transpose and Products
Given matrix A ∈ Rm×n, itstransposeAT is such that (AT)i,j = Aj,i.
A matrix A issymmetricif AT = A.
Given matrices A ∈ Rm×n and B ∈ Rn×p, their product is
C = A B ∈ Rm×p where Ci,j =
n
X
k=1
Ai,kBk,j
Inner product between vectors x, y ∈ Rn:
hx, yi = xTy = yTx =
n
X
i=1
xiyi ∈ R.
Outer product: x ∈ Rn and y ∈ Rm: x yT ∈ Rn×m, where
Properties of Matrix Products and Transposes
Given matrices A ∈ Rm×n and B ∈ Rn×p, their product is
C = A B ∈ Rm×p where Ci,j =
n
X
k=1
Ai,kBk,j
Matrix product isassociative: (AB)C = A(BC).
In general, matrix product is not commutative: AB 6= BA.
Transpose of product: (A B)T = BTAT.
Special Matrices
Theidentity matrix I ∈ Rn×n is a square matrix such that
Iij = 1, i = j 0, i 6= j I = 1 0 · · · 0 0 1 · · · 0 .. . ... . .. ... 0 0 · · · 1
Neutral element of matrix product: A I = I A = A.
Diagonal matrix: (i 6= j) ⇒ Ai, j = 0.
Upper triangular matrix: (j < i) ⇒ Ai, j = 0. Lower triangularmatrix: (j > i) ⇒ Ai, j = 0.
Eigenvalues, eigenvectors, determinant, trace
A vector x ∈ Rn is an eigenvector of matrix A ∈ Rn×n if
A x = λ x,
where λ ∈ R is the corresponding eigenvalue.
The eigenvalues of a diagonal matrix are the elements in the diagonal. (quiz: what are the eigenvectors?)
Matrix trace: trace(A) =P
iAi,i =
P
iλi
Matrix determinant: |A| = det(A) =Q
iλi
Properties of determinant: |AB| = |A||B|, |AT| = |A|,
|α A| = αn|A|
Properties of the trace: trace(A + B) = trace(A) + trace(B),
trace(ABC) = trace(CAB) = trace(BCA) (cyclic permutations)
Matrix Inverse
Matrix A ∈ Rn×n in invertible if there is B ∈ Rn×n s.t.
AB = BA = I.
...matrix B, such that AB = BA = I, denoted B = A−1.
Matrix A ∈ Rn×n is invertible ⇔ det(A) 6= 0.
Determinant of inverse: det(A−1) = 1
det(A).
Solving system Ax = b, if A is invertible: x = A−1b.
Properties: (A−1)−1 = A, (A−1)T = (AT)−1, (A B)−1 = B−1A−1
There are many algorithms to compute A−1; general case,
Quadratic Forms and Positive (Semi-)Definite Matrices
Given matrix A ∈ Rn×n and vector x ∈ Rn,
xTA x = n X i=1 n X j=1 Ai, jxixj ∈ R
is called aquadratic form.
A symmetric matrix A ∈ Rn×n ispositive semi-definite (PSD) if, for
any x ∈ Rn, xTA x ≥ 0.
A symmetric matrix A ∈ Rn×n ispositive definite (PD) if, for any
x ∈ Rn, (x 6= 0) ⇒ xTA x > 0.
Matrix A ∈ Rn×n is PSD ⇔ all λi(A) ≥ 0.
A Bit More Formal: Vector Spaces
Avector space over a field F (e.g., R) is a set V and a pair of
operations, +: V × V → V and ·: F × V → V, that satisfy the
following axioms, ∀x, y, z ∈ V and ∀α, β ∈ F:
X + is associative and commutative;
X ∃0 ∈ V, such that 0 + x = x;
X ∃ − x ∈ V, such that −x + x = 0;
X α · (β · x) = (α · β) · x;
X 1 · x = x, where 1 ∈ F is such that 1 · α = α;
X α · (x + y) = α · x + α · y;
X (α + β) · x = α · x + β · x.
Elements of V are calledvectors; elements of F arescalars.
Vector Space: Examples
“Usual vectors” (Rn, +, ·) over field R
X x = (x1, ..., xn), y = (y1, ..., yn), x + y = (x1+ y1, ..., xn+ yn);
X x = (x1, ..., xn), α x = (α x1, ..., α xn)
Real matrices (Rm×n, +, ·) over field R
X usual matrix addition and multiplication by scalar;
Complex matrices (Cm×n, +, ·) over field C (complex numbers).
Binary vectors ({0, 1}n, +, ·) over GF (2) = {0, 1} (Galois field),
X + is addition modulo 2 : 0 + 0 = 0, 0 + 1 = 1 + 0 = 1, 1 + 1 = 0.
X · is standard multiplication: 0 · 0 = 0 · 1 = 1 · 0 = 0, 1 · 1 = 1.
Set of all functions f : Ω → R, with point-wise addition and multiplication, is a vector space over R.
Norm on Vector Space V
Anorm is a function k · k : V → R+ satisfying,
∀x, y ∈ V and ∀α ∈ R,
X homogeneity, kα xk = |α| kxk;
X triangle inequality, kx + yk ≤ kxk + kyk;
X definiteness, kxk = 0 ⇔ x = 0.
Aseminorm may not satisfy definiteness.
Classical example in Rn: Euclidean norm: kxk =
s n P i=1 x2 i.
Two norms k · k and k · k0 are equivalentif ∃α, β > 0 such that
∀x ∈ V, αkxk ≤ kxk0 ≤ βkxk;
Other Norms
The `p norm of a vector x ∈ Rn, where p ≥ 1,
kxkp = n X i=1 |xi|p !1/p . Notable cases: I `2 (Euclidean) norm. I `1 norm, kxk1=Pi|xi|.
I `∞ norm, limp→∞kxkp= max{|x1|, ..., |xn|} ≡ kxk∞
I `0 “norm”(not a norm), limp→0kxkp= #{i : xi6= 0} ≡ kxk0
Some equivalences:kxk2 ≤ kxk1 ≤ √ n kxk2 kxk∞ ≤ kxk2 ≤ √ n kxk∞ kxk∞ ≤ kxk1 ≤ n kxk∞
Inner Product on Vector Space V Over R
An inner product is a function h·, ·i : V × V → R, satisfying, ∀x, y ∈ V and ∀α ∈ R,
X symmetry, hx, yi = hy, xi
X (bi)linearity, hαx + βz, yi = αhx, yi + βhz, yi;
X definiteness, hx, xi ≥ 0 and hx, xi = 0 ⇔ x = 0.
Standard inner product in Rn: hx, yi =Pn
i=1xiyi= xTy
Also an inner product in Rn: hx, yi = xTM y, where M is PD
Norm (is it?) induced by an inner product kxk =phx, xi
X for hx, yi = xTy, then kxk2= xTx =Pn i=1x 2 i (Euclidean norm) X for hx, yi = xTM y, then kxk2 M = xTM x (Mahalanobis norm)
Key Properties of Inner Products
If k · k is induced by inner product h·, ·i (that is kxk2= hx, xi), then
kx + yk2= hx + y, x + yi = kxk2+ kyk2+ 2hx, yi
Cauchy-Schwarz inequality: |hx, yi| ≤ kxk kyk
“Des d´emonstrations qui font boum!” (“Proofs that make boom!”)
[Jean-Baptiste Hiriart-Urruty] 0 ≤ 1 2 x kxk± y kyk 2 = 1 ± hx, yi kxk kyk ⇔ hx, yi ≤ kxk kyk −hx, yi ≤ kxk kyk
Corollary: k · k is indeed a norm, as it satisfies the triangle inequality:
kx + yk2≤ kxk2
+ kyk2+ 2|hx, yi| ≤ kxk2+ kyk2+ 2kxk kyk = kxk + kyk2 Hilbert space: complete vector space equipped with an inner product.
Basis and Dimension of a Vector Space V
Basis: collection of vectors B = {b1, b2, ...} ⊂ V satisfying:
X linear independence: for any finite linear combination α1b1+ ... + αmbm= 0 ⇒ α1= · · · = αm= 0
X spanningability: any vector v ∈ V can be written as v = α1b1+ ... + αmbn; in other words, V = span(B).
Dimension of V: dim(V) = #B Orthogonal basis: i 6= j ⇒ hbi, bji = 0
Rank, Range, and Null Space
Consider some real matrix A ∈ Rm×n
Range of A: R(A) =y ∈ Rm: ∃ x ∈ Rnsuch that y = Ax ⊆ Rm Null space of A: N (A) =x ∈ Rn: Ax = 0 ⊆ Rn.
Both R(A) and N (A) are vector spaces.
Dimension theorem: dim R(A) + dim N (A) = n
Rank: rank(A) = dim R(A) ≤ min{m, n}
Singular Value Decomposition (SVD)
Any rank-r matrix A ∈ Rm×n can be written as A = U ΛVT
X columns of U ∈ Rm×r are an orthonormal basis of R(A);
X columns of V ∈ Rn×r are an orthonormal basis of R(AT);
X Λ = diag(σ1, ..., σr) is a r × r diagonal matrix;
X σ1, ..., σr are calledsingular values.
X σ1, ..., σr are square roots of the eigenvalues of ATA or AAT.
Orthonormality of U and V : UTU = I and VTV = I .
Singular Value Decomposition (SVD)
A = U ΛVT, where U ∈ Rm×r and V ∈ Rn×r.
Singular Value Decomposition (SVD)
A = U ΛVT, where U ∈ Rm×r and V ∈ Rn×r.
Recommended Reading
Z. Kolter and C. Do, “Linear Algebra Review and Reference”, Stanford University, 2015 (https://tinyurl.com/44x2qj4)
Probability theory
“Essentially, all models are wrong, but some are useful”;G. Box, 1987
The study of probability has roots in games of chance (dice, cards, ...)
Great names in science:Cardano, Fermat, Pascal, Laplace, Gauss,
Huygens, Legendre, Poisson, Kolmogorov, Bernoulli, Cauchy, Gibbs, Boltzman, de Finetti, ...
Natural tool to modeluncertainty, information, knowledge, belief, ...
What is probability?
Classical definition: P(A) = NA
N
...with N mutually exclusive equally likely outcomes,
NA of which result in the occurrence of A. Laplace, 1814
Example: P(randomly drawn card is ♣) = 13/52.
Example: P(getting 1 in throwing a fair die) = 1/6.
Frequentist definition: P(A) = lim
N →∞
NA
N
...relative frequency of occurrence of A in infinite number of trials.
Subjective probability: P(A) is a degree of belief. de Finetti, 1930s
“The mathematics of probability can be developed on an entirely
Key concepts: Sample space and events
Sample space X = set of possible outcomes of a random experiment.
Examples:
I Tossing two coins: X = {HH, T H, HT, T T }
I Roulette: X = {1, 2, ..., 36}
I Draw a card from a shuffled deck: X = {A♣, 2♣, ..., Q♦, K♦}.
An event A is a subset of X : A ⊆ X (also written A ∈ 2X).
Examples:
I “exactly one H in 2-coin toss”:
A = {T H, HT } ⊂ {HH, T H, HT, T T }.
I “odd number in the roulette”: B = {1, 3, ..., 35} ⊂ {1, 2, ..., 36}.
Key concepts: Sample space and events
Sample space X = set of possible outcomes of a random experiment.
(More delicate) examples:
I Time until you receive the next email X = R+ I Location of the next rain drop X = R2
An event A is a measurablesubset of X , i.e., A ∈ σ(F ), F ⊆ 2X. σ-algebra: σ(F ) I A ∈ σ(F ) ⇒ Ac∈ σ(F ) I A1, A2, ... ∈ σ(F ) ⇒ ∞ [ i=1 Ai∈ σ(F )
Classical example in Rn: the collection of Lebesgue measurable sets
constitute a σ−algebra.
Kolmogorov’s Axioms for Probability
Probability is a function that maps events A into the interval [0, 1]. Kolmogorov’s axioms(1933) for probability P : σ(F ) → [0, 1]
I For any A ∈ σ(F ), P(A) ≥ 0 I P(X ) = 1
I If A1, A2... ⊆ X are disjoint events, then P [ i Ai =X i P(Ai)
From these axioms, many results can be derived. Examples:
I P(∅) = 0
I C ⊂ D ⇒ P(C) ≤ P(D)
I P(A ∪ B) = P(A) + P(B) − P(A ∩ B) I P(A ∪ B) ≤ P(A) + P(B) (union bound)
Conditional Probability and Independence
If P(B) > 0, P(A|B) = P(A ∩ B)
P(B) (conditional prob. of A given B) ...satisfies all of Kolmogorov’s axioms:
I For any A ⊆ X , P(A|B) ≥ 0 I P(X |B) = 1
I If A1, A2, ... ⊆ X are disjoint, then P [ i Ai B =X i P(Ai|B)
Conditional Probability and Independence
If P(B) > 0, P(A|B) = P(A ∩ B)
P(B)
Events A, B are independent (A ⊥⊥ B) ⇔ P(A ∩ B) = P(A) P(B).
Relationship with conditional probabilities:
A ⊥⊥ B ⇔ P(A|B) = P(A ∩ B)
P(B) =
P(A) P(B)
P(B) = P(A)
Example: X = “52 cards”, A = {3♥, 3♣, 3♦, 3♠}, and B = {A♥, 2♥, ..., K♥}; then, P(A) = 1/13, P(B) = 1/4
P(A ∩ B) = P({3♥}) = 1 52 P(A) P(B) = 1 13 1 4 = 1 52 P(A|B) = P(“3”|“♥”) = 1 13 = P(A)
Bayes Theorem
Law of total probability: if A1, ..., An are a partition of X
P(B) = X i P(B|Ai)P(Ai) =X i P(B ∩ Ai)
Bayes’ theorem: if {A1, ..., An} is a partition of X
P(Ai|B) = P(B ∩ Ai ) P(B) = P(B|Ai) P(Ai) X i P(B|Ai)P(Ai) = P(B|Ai) P(Ai) P(B)
Random Variables
A (real)random variable (RV) is a function: X : X → R
I Discrete RV: range of X is countable (e.g., N or {0, 1})
I Continuous RV: range of X is uncountable (e.g., R or [0, 1])
I Example: number of head in tossing two coins, X = {HH, HT, T H, T T },
X(HH) = 2, X(HT ) = X(T H) = 1, X(T T ) = 0. Range of X = {0, 1, 2}.
Random Variables: Distribution Function
Distribution function: FX(x) = P({ω ∈ X : X(ω) ≤ x})
Example: number of heads in tossing 2 coins; range(X) = {0, 1, 2}.
Probability mass function(discrete RV): fX(x) = P(X = x),
FX(x) =
X
x≤x
Properties of Distribution Functions
FX : R → [0, 1] is the distribution function of some r.v. X iff:
it is non-decreasing: x1 < x2⇒ FX(x1) ≤ FX(x2);
lim
x→−∞FX(x) = 0;
lim
x→+∞FX(x) = 1;
it is right semi-continuous: lim
x→z+FX(x) = FX(z) Further properties: P(X = x) = fX(x) = FX(x) − lim z→x−FX(z); P(z < X ≤ y) = FX(y) − FX(z); P(X > x) = 1 − FX(x).
Important Discrete Random Variables
Uniform: X ∈ {x1, ..., xK}, pmf fX(xi) = 1/K. Bernoulli RV: X ∈ {0, 1}, pmf fX(x) = p ⇐ x = 1 1 − p ⇐ x = 0Can be written compactly as fX(x) = px(1 − p)1−x.
Binomial RV: X ∈ {0, 1, ..., n} (sum on n Bernoulli RVs)
fX(x) = Binomial(x; n, p) = n x px(1 − p)(n−x) Binomial coefficients (“n choose x”): n x = n! (n − x)! x!
More Important Discrete Random Variables
Geometric(p): X ∈ N, pmf fX(x) = p(1 − p)x−1.
(e.g., number of trials until the first success). Poisson(λ): X ∈ N ∪ {0}, pmf fX(x) = e−λλx x! Notice thatP∞ x=0λ x x! = e λ, thus P∞ x=0fX(x) = 1.
“...probability of the number of independent occurrences in a fixed (time/space) interval if these occurrences have known average rate”
Random Variables: Distribution Function
Distribution function: FX(x) = P({ω ∈ X : X(ω) ≤ x})
Example: continuous RV with uniform distribution on [a, b].
Probability density function(pdf, continuous RV): fX(x)
FX(x) = Z x −∞ fX(u) du, P(X ∈ [c, d]) = Z d c fX(x) dx,
Important Continuous Random Variables
Uniform: fX(x) = Uniform(x; a, b) = 1 b−a ⇐ x ∈ [a, b] 0 ⇐ x 6∈ [a, b] (previous slide). Gaussian: fX(x) = N (x; µ, σ2) = 1 √ 2 π σ2e −(x−µ)2 2 σ2 Exponential: fX(x) = Exp(x; λ) = λe−λ x ⇐ x ≥ 0 0 ⇐ x < 0Expectation of Random Variables
Expectation: E(X) = X i xifX(xi) X ∈ {x1, ..., xK} ⊂ R Z ∞ −∞ x fX(x) dx X continuousExample: Bernoulli, fX(x) = px(1 − p)1−x, for x ∈ {0, 1}.
E(X) = 0 (1 − p) + 1 p = p.
Example: Binomial, fX(x) = nxpx(1 − p)n−x, for x ∈ {0, ..., n}.
E(X) = n p.
Example: Gaussian, fX(x) = N (x; µ, σ2). E(X) = µ. Linearity of expectation:
Expectation of Functions of Random Variables
E(g(X)) = X i g(xi)fX(xi) X discrete, g(xi) ∈ R Z ∞ −∞ g(x) fX(x) dx X continuousExample: variance, var(X) = E X − E(X)2= E(X2) − E(X)2
Example: Bernoulli variance, E(X2
) = E(X) = p , thus var(X) = p(1 − p).
Example: Gaussian variance, E (X − µ)2 = σ2.
Probability as expectation of indicator, 1A(x) =
1 ⇐ x ∈ A 0 ⇐ x 6∈ A P(X ∈ A) = Z A fX(x) dx = Z 1A(x) fX(x) dx = E(1A(X))
Moments of Random Variables
Non-central moments of order k:
E(|X|k) = X i |xi|kfX(xi) X discrete, g(xi) ∈ R Z ∞ −∞ |x|kfX(x) dx X continuous
Central moments of order k ∈ N: E |X − E(X)|k.
...if the integral/sum exits.
If k-th moment exists, then the j-th moment exists, for j ≤ k.
Z ∞ −∞ |x|jfX(x) dx = Z |x|≤1 |x|jfX(x) dx + Z |x|>1 |x|jfX(x) dx ≤ Z |x|≤1 fX(x) dx + Z |x|>1 |x|kfX(x) dx ≤ 1 + E |X|k < ∞
Two (or More) Random Variables
Joint pmf of two discrete RVs: fX,Y(x, y) = P(X = x ∧ Y = y).
Extends trivially to more than two RVs.
Joint pdf of two continuous RVs: fX,Y(x, y), such that
P (X, Y ) ∈ A =
Z Z
A
fX,Y(x, y) dx dy, A ∈ σ(R2)
Extends trivially to more than two RVs.
Marginalization: fY(y) = X x
fX,Y(x, y), if X is discrete
Z ∞ −∞ fX,Y(x, y) dx, if X continuous Independence: X ⊥⊥ Y ⇔ fX,Y(x, y) = fX(x) fY(y) ⇒
Conditionals and Bayes’ Theorem
Conditional pmf(discrete RVs):
fX|Y(x|y) = P(X = x|Y = y) = P(X = x ∧ Y = y)
P(Y = y) =
fX,Y(x, y)
fY(y)
.
Conditional pdf(continuous RVs): fX|Y(x|y) =
fX,Y(x, y)
fY(y) ...the meaning is technically delicate.
Bayes’ theorem: fX|Y(x|y) = fY |X(y|x) fX(x) fY(y)
(pdf or pmf).
Joint, Marginal, and Conditional Probabilities: An Example
A pair of binary variables X, Y ∈ {0, 1}, with jointpmf:
Marginals: fX(0) = 15 +25 = 35, fX(1) = 101 + 103 = 104,
fY(0) = 15 +101 = 103, fY(1) = 25 +103 = 107 . Conditionalprobabilities:
An Important Multivariate RV: Multinomial
Multinomial: X = (X1, ..., XK), Xi ∈ {0, ..., n}, such that
P iXi= n, fX(x1, ..., xK) = n x1x2··· xK p x1 1 p x2 2 · · · p xK k ⇐ P ixi = n 0 ⇐ P ixi 6= n n x1 x2 · · · xK = n! x1! x2! · · · xK!
Parameters: p1, ..., pK ≥ 0, such thatPipi = 1.
Generalizes the binomial from binary to K-classes.
Example: tossing n independent fair dice, p1 = · · · = p6 = 1/6.
Covariance, Correlation, and all that...
Covariance between two RVs, X, Y ∈ R :
cov(X, Y ) = Eh X − E(X) Y − E(Y )i = E(X Y ) − E(X) E(Y )
Relationship with variance: var(X) = cov(X, X).
Correlation: corr(X, Y ) = ρ(X, Y ) = √ cov(X,Y )
var(X)√var(Y ) ∈ [−1, 1]
X ⊥⊥ Y ⇔ fX,Y(x, y) = fX(x) fY(y)
⇒
6⇐ cov(X, Y ) = 0(example)
Covariance matrix of multivariate RV, X ∈ Rn:
cov(X) = Eh X − E(X)
X − E(X)Ti
An Important Multivariate RV: Gaussian
Multivariate Gaussian: X ∈ Rn, fX(x) = N (x; µ, C) = 1 pdet(2 π C)exp −1 2(x − µ) TC−1(x − µ)Parameters: vector µ ∈ Rnand matrix C ∈ Rn×n.
Key Properties of Multivariate Gaussian
Marginals are Gaussian. Conditionals are Gaussian.
Multivariate Gaussian Conditionals
fX(x) = N (x; µ, C) = 1 pdet(2 π C)exp −1 2(x − µ) TC−1(x − µ) Partition X = [X1, ..., Xi | {z } Xa , Xi+1, ..., Xn | {z } Xb ]T and µ = [µ1, ..., µi | {z } µa , µi+1, ..., µn | {z } µb ]T;Corresponding partition of covariance: C = Caa Cab Cba Cbb fXa(·) = N (·; µa, Caa); fXb(·) = N (·; µb, Cbb); fXb|Xa(xb|xa) = N xb; µb+ CbaC −1 aa(xa− µa), Cbb− CbaCaa−1Cab
More on Expectations and Covariances
Let A ∈ Rn×n be a matrix and a ∈ Rn a vector.
If E(X) = µ and Y = AX, then E(Y ) = Aµ;
If cov(X) = C and Y = AX, then cov(Y ) = ACAT;
If cov(X) = C and Y = aTX ∈ R, then var(Y ) = aTCa ≥ 0;
If cov(X) = C and Y = C−1/2X, then cov(Y ) = I;
If fX(x) = N (x; 0, I) and Y = µ + C1/2X, then
fY(y) = N (y; µ, C);
If fX(x) = N (x;µ, C) and Y = C−1/2(X − µ), then
Transformations
X ∼ fX and Y = g(X) ⇒ fY =?
Discrete case:
fY(y) = P(g(X) = y) = P {x : g(x) = y} = P g−1(y)
Continuous case (for g strictly monotonic, thus invertible):
fY(y) = fX g−1(y) d g−1(y) d y
Continuous multivariate case (invertible):
fY(y) = fX g−1(y)
detJg−1(y)
Central Limit Theorem
Take n independent r.v. X1, ..., Xn such that E[Xi] = µi and
var(Xi) = σi2 Their sum, Yn= n X i=1 Xi satisfies: E[Yn] = n X i=1 µi ≡ µ(n) var(Yn) = X i σi2 ≡ σ(n)2 ...thus, if Zn= Yn− µ(n) σ(n) E[Zn] = 0 var(Zn) = 1
Central limit theorem (CLT): under some mild conditions on X1, ..., Xn
lim
Central Limit Theorem
Exponential Families
Consider a pdf or pmf fX(x|θ), with parameter(s) θ ∈ Rc, for X ∈ Xn
fX(x|θ) is in anexponential family if it has the form
fX(x|θ) =
1
Z(θ)h(x) exp η(θ)
Tφ(x) = h(x) exp η(θ)Tφ(x) − A(θ),
where A(θ) = log Z(θ) and Z(θ) =
Z
Xn
h(x) exp η(θ)Tφ(x) dx.
Canonical parameter(s): η(θ) ∈ Rd; that is, η : Rc→ Rd
Sufficient statistics: φ(x) ∈ Rd; that is, φ : Xn→ Rd
Partition function: Z(θ)
Exponential Families
fX(x|θ) = 1 Z(θ)h(x) exp η(θ) Tφ(x) Examples: Bernoulli: X ∈ {0, 1}, with P(X = 1) = θ; fX(x) = θx(1 − θ)1−x,fX(x) = exp x log θ+(1−x) log(1−θ) = exp x log1−θθ +log(1−θ),
thus η(θ) = log1−θθ , φ(x) = x, Z(θ) = 1−θ1 , and h(x) = 1.
Gaussian: X ∈ R, fX(x) = N (x; µ, σ2); θ = (µ, σ2) fX(x) = exp −(x−µ)2σ22 √ 2πσ2 = exp −2σx22 + µx σ2 − µ2 2σ2 √ 2πσ2 , thus η(µ, σ2) = [µ/σ2, −1/(2σ2)]T, φ(x) = [x, x2]T, Z(µ, σ2) =√2πσ2exp µ2 2σ2, and h(x) = 1.
More on Exponential Families
Assume η(θ) = θ ∈ Rd; canonical parameterization.
Independent identically distributed (i.i.d.) observations: X1, ..., Xm∼ fX(x) = 1 Z(θ)h(x) exp θ Tφ(x) then fX1,...,Xm(x1, ..., xm) = 1 Z(θ)m m Y j=1 h(xj) exp θT m X j=1 φ(xj)
Expected sufficient statistics:
∇θlog Z(θ) = ∇θZ(θ) Z(θ) = 1 Z(θ) Z φ(x)h(x) exp θTφ(x) dx = Eφ(X)
Important Inequalities
Markov’s ineqality: if X ≥ 0 is an RV with expectation E(X), then
P(X > t) ≤ E(X) t Simple proof: t P(X > t) = Z ∞ t t fX(x) dx ≤ Z ∞ t x fX(x) dx = E(X)− Z t 0 x fX(x) dx | {z } ≥0 ≤ E(X)
Chebyshev’s inequality: µ = E(Y ) and σ2= var(Y ), then
P(|Y − µ| ≥ s) ≤ σ
2
s2
Other Important Inequalities: Cauchy-Schwartz
Cauchy-Schwartz’s inequalityfor RVs:
|E(X Y )| ≤pE(X2) E(Y2)
...why? Because hX, Y i ≡ E[XY ] is a valid inner product Important corollary: let E[X] = µ and E[X] = ν,
|cov(X, Y )| = E[(X − µ)(Y − ν)]
≤pE[(X − µ)2] E[(Y − ν)2]
=pvar(X) var(Y )
Implication for correlation:
corr(X, Y ) = ρ(X, Y ) = cov(X, Y )
Other Important Inequalities: Jensen
Recall that a real function g is convex if, for any x, y, and α ∈ [0, 1] g(αx + (1 − α)y) ≤ αg(x) + (1 − α)g(y)
Jensen’s inequality: if g is a real convex function, then g(E(X)) ≤ E(g(X))
Examples: E(X)2 ≤ E(X2) ⇒ var(X) = E(X2) − E(X)2≥ 0.
Entropy and all that...
Entropy of a discrete RV X ∈ {1, ..., K}: H(X) = − K X x=1 fX(x) log fX(x) Positivity: H(X) ≥ 0 ;H(X) = 0 ⇔ fX(i) = 1, for exactly one i ∈ {1, ..., K}.
Upper bound: H(X) ≤ log K ;
H(X) = log K ⇔ fX(x) = 1/k, for all x ∈ {1, ..., K}
Measure ofuncertainty/randomness of X
Continuous RV X,differential entropy: h(X) = −
Z
fX(x) log fX(x) dx
h(X) can be positive or negative. Example, if
fX(x) = Uniform(x; a, b), h(X) = log(b − a).
If fX(x) = N (x; µ, σ2), then h(X) = 12log(2πeσ2).
If var(Y ) = σ2, then h(Y ) ≤ 12log(2πeσ2)
Kullback-Leibler divergence
Kullback-Leibler divergence(KLD) between two pmf:
D(fXkgX) = K X x=1 fX(x) log fX(x) gX(x) Positivity: D(fXkgX) ≥ 0 D(fXkgX) = 0 ⇔ fX(x) = gX(x), for x ∈ {1, ..., K} KLD between two pdf: D(fXkgX) = Z fX(x) log fX(x) gX(x) dx Positivity: D(fXkgX) ≥ 0 D(fXkgX) = 0 ⇔ fX(x) = gX(x), almost everywhere
Mutual information
Mutual information (MI) between two random variables:
I(X; Y ) = D fX,YkfXfY
Positivity: I(X; Y ) ≥ 0
I(X; Y ) = 0 ⇔ X, Y are independent.
Recommended Reading
A. Maleki and T. Do, “Review of Probability Theory”, Stanford University, 2017 (https://tinyurl.com/pz7p9g5)
K. Murphy, “Machine Learning: A Probabilistic Perspective”, MIT Press, 2012.
L. Wasserman, “All of Statistics: A Concise Course in Statistical Inference”, Springer, 2004.