(1)

Ising Models with Latent Continuous Variables

Joachim Giesen (joint work with Frank Nussbaum)

Friedrich-Schiller-Universität Jena

(2)

Ising models

Ising models are probability distributions on the sample space $\mathcal{X} = \{0,1\}^n$ of the form

$$p(x) \propto \exp\left(x^\top S x\right), \qquad S \in \mathrm{Sym}(n).$$

Typically only a few of the variables interact with each other, and thus S is sparse.

Ising models are also known in machine learning as Boltzmann machines.
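
As an illustration (not part of the original slides), the density can be evaluated by brute force for small n, since the normalization constant is a sum over all $2^n$ states. A minimal Python sketch, assuming n is small enough to enumerate $\{0,1\}^n$:

# Minimal sketch (not from the talk): evaluating an Ising model p(x) ∝ exp(x^T S x)
# by brute-force enumeration, feasible only for small n.
import itertools
import numpy as np

def ising_probabilities(S):
    """Return all states of {0,1}^n and their probabilities under p(x) ∝ exp(x^T S x)."""
    n = S.shape[0]
    states = np.array(list(itertools.product([0, 1], repeat=n)))  # 2^n rows
    vals = np.einsum('bi,ij,bj->b', states, S, states)            # x^T S x per state
    weights = np.exp(vals - vals.max())                           # stabilized exponentials
    return states, weights / weights.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 4
    A = rng.normal(size=(n, n))
    S = (A + A.T) / 2                       # a symmetric interaction matrix
    S[np.abs(S) < 0.8] = 0.0                # make it sparse
    states, probs = ising_probabilities(S)
    print(probs.sum())                      # 1.0 up to floating point error

For realistic n the normalization constant is intractable, which is one reason the later slides work with regularized likelihood formulations.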

(5)

Multivariate Gaussians

Multivariate Gaussians are probability distributions on the sample space $\mathcal{Y} = \mathbb{R}^m$ of the form

$$p(y) \propto \exp\left(-\tfrac{1}{2}\,(y - \mu)^\top \Sigma^{-1} (y - \mu)\right), \qquad \Sigma \in \mathrm{PD}(m).$$

Typically only a few of the variables interact with each other, and thus $\Lambda = \Sigma^{-1}$ is sparse (Gaussian graphical model).
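
A small illustrative sketch (not from the talk): the interaction structure lives in the precision matrix $\Lambda$, while the covariance $\Sigma$ is generally dense. The chain-graph example below is hypothetical:

# Illustrative sketch: a Gaussian graphical model specified through a sparse
# precision matrix Λ = Σ^{-1}; zeros in Λ encode missing interactions, while
# the covariance Σ itself is generally dense.
import numpy as np

m = 5
Lam = np.eye(m)                       # precision matrix Λ
for i in range(m - 1):                # chain graph: only neighbors interact
    Lam[i, i + 1] = Lam[i + 1, i] = -0.4

Sigma = np.linalg.inv(Lam)            # covariance Σ = Λ^{-1} (dense)
rng = np.random.default_rng(0)
y = rng.multivariate_normal(mean=np.zeros(m), cov=Sigma, size=1000)

print(np.count_nonzero(Lam), "non-zeros in the precision vs.", np.count_nonzero(Sigma), "in the covariance")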

(7)

CG distributions

Restricted CG (conditional Gaussian) distributions, defined on the sample space $\mathcal{X} \times \mathcal{Y} = \{0,1\}^n \times \mathbb{R}^m$, are of the form

$$p(x, y) \propto \exp\left(x^\top S x + y^\top R x - \tfrac{1}{2}\, y^\top \Lambda y\right), \qquad R \in \mathbb{R}^{m \times n}.$$

(8)

Ising models with latent continuous variables

Marginalizing out the continuous variables from a CG distribution gives the marginal distribution on $\mathcal{X} = \{0,1\}^n$ of the form

$$p(x) \propto \exp\left(x^\top \left(S + \tfrac{1}{2}\, R^\top \Lambda^{-1} R\right) x\right).$$

If $m \ll n$ (the number of continuous variables is much smaller than the number of Bernoulli variables), then the PSD matrix $L = \tfrac{1}{2}\, R^\top \Lambda^{-1} R$ is of small rank.
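
For completeness, a short derivation of this marginal (a standard Gaussian integral over $y$; constants are absorbed into $\propto$):

\begin{align*}
p(x) &\propto \int_{\mathbb{R}^m} \exp\!\Big(x^\top S x + y^\top R x - \tfrac{1}{2}\, y^\top \Lambda y\Big)\, dy \\
     &= \exp\!\big(x^\top S x\big) \int_{\mathbb{R}^m}
        \exp\!\Big(-\tfrac{1}{2}\,(y - \Lambda^{-1} R x)^\top \Lambda\, (y - \Lambda^{-1} R x)
        + \tfrac{1}{2}\, x^\top R^\top \Lambda^{-1} R x\Big)\, dy \\
     &\propto \exp\!\Big(x^\top \big(S + \tfrac{1}{2}\, R^\top \Lambda^{-1} R\big)\, x\Big).
\end{align*}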

(10)

Likelihood function for latent variable Ising model

Given data points

$$x^{(1)}, \ldots, x^{(k)} \in \mathcal{X} = \{0,1\}^n,$$

we want to estimate the model parameters $S \in \mathrm{Sym}(n)$ (sparse) and $L \in \mathrm{Sym}(n)$ (PSD and low rank).

The log-likelihood function for the latent variable Ising model is

$$\ell(S + L) = \sum_{i=1}^{k} x^{(i)\top} (S + L)\, x^{(i)} - k\, a(S + L),$$

where $a$ is the normalization (log-partition) function.
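
A hedged sketch (not the authors' code) of this likelihood for small n, with the normalization $a(M) = \log \sum_{x \in \{0,1\}^n} \exp(x^\top M x)$ computed by enumeration:

# Sketch: log-likelihood of the latent variable Ising model for small n,
# with the log-partition function a(.) computed by brute force over {0,1}^n.
import itertools
import numpy as np

def log_partition(M):
    """a(M) = log sum_{x in {0,1}^n} exp(x^T M x)."""
    n = M.shape[0]
    states = np.array(list(itertools.product([0, 1], repeat=n)))
    vals = np.einsum('bi,ij,bj->b', states, M, states)
    vmax = vals.max()
    return vmax + np.log(np.exp(vals - vmax).sum())   # numerically stable log-sum-exp

def log_likelihood(S, L, X):
    """ell(S+L) = sum_i x_i^T (S+L) x_i - k * a(S+L) for data rows X (k x n)."""
    M = S + L
    k = X.shape[0]
    return np.einsum('bi,ij,bj->', X, M, X) - k * log_partition(M)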

(12)

Promoting sparse + low rank solutions

Regularized log-likelihood problem:

$$\max_{S, L}\; \ell(S + L) - c\,\|S\|_1 - \lambda\, \mathrm{Tr}(L) \quad \text{s.t.}\quad L \succeq 0,$$

where $c, \lambda > 0$ are regularization parameters.

Question: under which conditions can we recover $S$ and $L$ individually from a solution of the regularized log-likelihood problem?
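
One simple way to compute a solution numerically, shown here only as an illustrative sketch and not as the method used in the talk, is proximal gradient ascent: a gradient step on $\ell$, followed by entrywise soft-thresholding of $S$ (the proximal map of $c\,\|\cdot\|_1$) and eigenvalue shrinkage of $L$ (the proximal map of $\lambda\,\mathrm{Tr}(\cdot)$ restricted to the PSD cone). The gradient of $\ell$ is computed by brute force, so this only runs for small n:

# Illustrative sketch: proximal gradient ascent for
#   max_{S, L PSD}  ell(S+L) - c*||S||_1 - lambda*Tr(L),
# with the gradient of ell computed by enumerating {0,1}^n (small n only).
import itertools
import numpy as np

def moments(M, states):
    """Model expectation E_M[x x^T] under p(x) ∝ exp(x^T M x)."""
    vals = np.einsum('bi,ij,bj->b', states, M, states)
    w = np.exp(vals - vals.max())
    p = w / w.sum()
    return np.einsum('b,bi,bj->ij', p, states, states)

def soft_threshold(A, t):
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def psd_trace_prox(A, t):
    """Prox of t*Tr(.) plus the indicator of the PSD cone (A assumed symmetric)."""
    eigval, eigvec = np.linalg.eigh(A)
    return (eigvec * np.maximum(eigval - t, 0.0)) @ eigvec.T

def fit(X, c, lam, step=0.05, iters=500):
    k, n = X.shape
    states = np.array(list(itertools.product([0, 1], repeat=n)))
    Phi = X.T @ X                                 # sum_i x_i x_i^T
    S, L = np.zeros((n, n)), np.zeros((n, n))
    for _ in range(iters):
        grad = Phi - k * moments(S + L, states)   # gradient of ell at S+L
        S = soft_threshold(S + step * grad, step * c)
        L = psd_trace_prox(L + step * grad, step * lam)
    return S, L

Whether the computed pair (S, L) recovers the true decomposition is exactly the identifiability question addressed below.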

(14)

Alternative derivation of the problem

Maximum Entropy Principle [Jaynes 1957]

From the probability distributions that represent the current state of knowledge, choose the one with the largest entropy.

(15)

Maximum entropy principle

Current state of knowledge:

1. Sample points $x^{(1)}, \ldots, x^{(k)}$ drawn from the sample space $\mathcal{X} = \{0,1\}^n$.

2. Functions on $\mathcal{X}$ (sufficient statistics). Here we consider $\varphi_{ij} \colon x = (x_1, \ldots, x_n) \mapsto x_i x_j$ for $i, j \in [n]$.

(16)

Entropy maximization problem

$$\max_{p \in \mathcal{P}} H(p) \quad \text{s.t.}\quad \mathbb{E}[\varphi_{ij}] = \frac{1}{k} \sum_{l=1}^{k} \varphi_{ij}\big(x^{(l)}\big)$$

Or more compactly,

$$\max_{p \in \mathcal{P}} H(p) \quad \text{s.t.}\quad \mathbb{E}[\Phi] = \Phi_k,$$

if we collect the functions $\varphi_{ij}$ in the $n \times n$ matrix $\Phi$ and set $\Phi_k = \frac{1}{k} \sum_{l=1}^{k} x^{(l)} x^{(l)\top}$.
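
A small sketch (illustrative, with made-up data) of the empirical moment matrix $\Phi_k$ that appears in the constraint:

# Illustrative: the empirical moment matrix Phi_k for binary sample points.
import numpy as np

X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [1, 0, 1]])             # k = 4 hypothetical samples from {0,1}^3

Phi_k = X.T @ X / X.shape[0]          # Phi_k = (1/k) sum_l x_l x_l^T
print(Phi_k)                          # entry (i, j) is the empirical mean of x_i * x_j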

(18)

Relaxed entropy maximization and its dual

The problem with relaxed constraint reads as

$$\max_{p \in \mathcal{P}} H(p) \quad \text{s.t.}\quad \|\mathbb{E}[\Phi] - \Phi_k\|_\infty \le c,$$

whose Lagrangian dual is given as

$$\max_{S}\; \ell(S) - c\,\|S\|_1 \quad \text{with } S \in \mathrm{Sym}(n).$$

Maximum entropy – maximum likelihood duality
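
A sketch of this duality, under the convention that $\ell$ denotes the log-likelihood rescaled by $1/k$, i.e., $\ell(S) = \langle \Phi_k, S \rangle - a(S)$ with $a$ the log-partition function (the rescaling only changes the scale of $c$). Writing the relaxed constraint as $\mathbb{E}_p[\Phi] - \Phi_k = \Delta$ with $\|\Delta\|_\infty \le c$ and introducing a multiplier $S$ for the equality, the dual function is

\begin{align*}
g(S) &= \sup_{p \in \mathcal{P},\ \|\Delta\|_\infty \le c}
        \Big( H(p) + \big\langle S,\ \mathbb{E}_p[\Phi] - \Phi_k - \Delta \big\rangle \Big) \\
     &= \underbrace{\sup_{p \in \mathcal{P}} \Big( H(p) + \mathbb{E}_p\big[x^\top S x\big] \Big)}_{=\,a(S)}
        \;-\; \langle S, \Phi_k \rangle
        \;+\; \underbrace{\sup_{\|\Delta\|_\infty \le c} \langle S, -\Delta \rangle}_{=\,c\,\|S\|_1} \\
     &= a(S) - \langle S, \Phi_k \rangle + c\,\|S\|_1,
\end{align*}

so minimizing the dual function over $S \in \mathrm{Sym}(n)$ is the same as maximizing $\ell(S) - c\,\|S\|_1$.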

(21)

Spectral norm relaxation

Alternatively/additionally we can impose the constraint

$$\|\mathbb{E}[\Phi] - \Phi_k\|_2 \le \lambda,$$

which gives

$$\max_{p \in \mathcal{P}} H(p) \quad \text{s.t.}\quad \|\mathbb{E}[\Phi] - \Phi_k\|_\infty \le c, \quad \|\mathbb{E}[\Phi] - \Phi_k\|_2 \le \lambda.$$

(23)

Dual of spectral norm relaxed problem ...

... is the regularized maximum likelihood problem

$$\max_{S, L_1, L_2}\; \ell(S - L_1 + L_2) - c\,\|S\|_1 - \lambda\, \mathrm{Tr}(L_1 + L_2) \quad \text{s.t.}\quad L_1, L_2 \succeq 0,$$

where $S, L_1, L_2 \in \mathrm{Sym}(n)$.

The regularization term $\mathrm{Tr}(L_1 + L_2)$ promotes a low rank of $L_1 + L_2$, and thus also of $L_2 - L_1$.

Hence, the interaction matrix $S - L_1 + L_2$ has a sparse ($S$) + low rank ($L_2 - L_1$) decomposition.

(26)

Weakening of the spectral norm constraint

The spectral norm constraint

$$\|\mathbb{E}[\Phi] - \Phi_k\|_2 \le \lambda$$

can also be written as

$$\mathbb{E}[\Phi] - \Phi_k \preceq \lambda\,\mathrm{Id} \quad\text{and}\quad \Phi_k - \mathbb{E}[\Phi] \preceq \lambda\,\mathrm{Id}.$$

Skipping the first constraint gives

$$\max_{p \in \mathcal{P}} H(p) \quad \text{s.t.}\quad \|\mathbb{E}[\Phi] - \Phi_k\|_\infty \le c, \quad \Phi_k - \mathbb{E}[\Phi] \preceq \lambda\,\mathrm{Id},$$

whose dual is our marginal model

$$\max_{S, L}\; \ell(S + L) - c\,\|S\|_1 - \lambda\, \mathrm{Tr}(L) \quad \text{s.t.}\quad L \succeq 0.$$
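
The duality calculation sketched earlier extends to this problem: the semidefinite constraint $\Phi_k - \mathbb{E}[\Phi] \preceq \lambda\,\mathrm{Id}$ receives a matrix multiplier $L \succeq 0$, which shifts the natural parameter from $S$ to $S + L$ and contributes the term $\lambda\,\mathrm{Tr}(L)$ (same conventions and the same hedging as before):

\begin{align*}
g(S, L) &= \sup_{p \in \mathcal{P},\ \|\Delta\|_\infty \le c}
           \Big( H(p) + \big\langle S,\ \mathbb{E}_p[\Phi] - \Phi_k - \Delta \big\rangle
                 + \big\langle L,\ \lambda\,\mathrm{Id} - \Phi_k + \mathbb{E}_p[\Phi] \big\rangle \Big) \\
        &= a(S + L) - \langle S + L,\ \Phi_k \rangle + \lambda\,\mathrm{Tr}(L) + c\,\|S\|_1,
\end{align*}

and minimizing over $S$ and $L \succeq 0$ is the same as maximizing $\ell(S + L) - c\,\|S\|_1 - \lambda\,\mathrm{Tr}(L)$ subject to $L \succeq 0$.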

(29)

Consistency guarantees

We consider the slightly reformulated problem

$$\max_{S, L}\; \ell(S + L) - \lambda_k \big( \gamma\,\|S\|_1 + \mathrm{Tr}(L) \big) \quad \text{s.t.}\quad L \succeq 0,$$

where the likelihood function $\ell$ depends on the sample points $x^{(1)}, \ldots, x^{(k)}$ through the covariance matrix $\Phi_k$, and $\lambda_k$ goes to zero with growing $k$.

Assume that the sample points are drawn from a distribution with interaction parameters $S^\star, L^\star \in \mathrm{Sym}(n)$, where $S^\star$ is sparse and $L^\star$ is of low rank.

1. Can we approximate $S^\star$ and $L^\star$ from a solution to the regularized likelihood problem?

2. Does the solution recover the sparsity of $S^\star$ and the rank of $L^\star$?

(33)

Problem: non-identifiability

The matrix

$$M = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{pmatrix}$$

is sparse and of low rank.

In general, we cannot distinguish $(S^\star, L^\star)$ from $(S^\star + M, L^\star - M)$, because both have the same compound matrix $S^\star + L^\star$.

(35)

Transversality assumption

Idea: Uniqueness of the solution of

$$\max_{S, L}\; \ell(S + L) - \lambda_k \big( \gamma\,\|S\|_1 + \mathrm{Tr}(L) \big)$$

is necessary for individual recovery.

Optimality can be characterized geometrically: the gradient $\nabla \ell(S, L)$ needs to be normal to the tangent space of the variety of sparse matrices at $S$, and also normal to the tangent space of the variety of low-rank matrices at $L$.

For uniqueness we need the two tangent spaces to be transversal, i.e., to intersect only in the origin.
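
A small numerical illustration (not from the talk; all function names are ours) of the transversality condition: build spanning sets for the two tangent spaces and check that they intersect only in the origin, which holds exactly when the rank of the concatenated spanning matrices equals the sum of their individual ranks.

# Illustrative sketch: checking transversality of the tangent space
# Omega(S) = {symmetric M with support inside supp(S)} and the tangent space
# T(L) = {U X^T + X U^T : X in R^{n x r}} at a rank-r matrix L = U D U^T.
import numpy as np

def sparse_tangent_basis(S, tol=1e-12):
    """One symmetric basis matrix per (upper-triangular) non-zero position of S.
    Assumes S has at least one non-zero entry."""
    n = S.shape[0]
    basis = []
    for i in range(n):
        for j in range(i, n):
            if abs(S[i, j]) > tol:
                E = np.zeros((n, n))
                E[i, j] = E[j, i] = 1.0
                basis.append(E.ravel())
    return np.array(basis).T                       # columns span Omega(S)

def lowrank_tangent_span(L, tol=1e-10):
    """Spanning set {u_j e_i^T + e_i u_j^T} for the tangent space of rank-r matrices at L."""
    eigval, eigvec = np.linalg.eigh(L)
    U = eigvec[:, np.abs(eigval) > tol]            # column space of L
    n, r = U.shape
    span = []
    for j in range(r):
        for i in range(n):
            E = np.outer(U[:, j], np.eye(n)[i])    # u_j e_i^T
            span.append((E + E.T).ravel())
    return np.array(span).T                        # columns span T(L)

def transversal(S, L):
    A, B = sparse_tangent_basis(S), lowrank_tangent_span(L)
    rk = np.linalg.matrix_rank
    return rk(np.hstack([A, B])) == rk(A) + rk(B)  # trivial intersection?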

(38)

Gap assumption

We also need to require that

$$s_{\min} \ge c_S\, \lambda_k \quad\text{and}\quad \sigma_{\min} \ge c_L\, \lambda_k,$$

where $s_{\min}$ is the smallest magnitude of any non-zero entry of $S^\star$ and $\sigma_{\min}$ is the smallest non-zero eigenvalue of $L^\star$. Furthermore, $c_S$ and $c_L$ are positive constants.

(39)

Consistency theorem

Theorem. Let $(S^\star, L^\star)$ be the true model parameters and $(S_k, L_k)$ be the solution to the regularized likelihood problem. Let

$$k > c_1 \cdot t \cdot n \log n \quad\text{and}\quad \lambda_k = c_2 \sqrt{\frac{t \cdot n \log n}{k}}$$

for constants $c_1, c_2, c_3, t > 0$. Then with probability at least $1 - k^{-t}$,

1. $\max\big\{ \|S_k - S^\star\|_\infty,\ \|L_k - L^\star\|_2 \big\} \le c_3\, \lambda_k$, and

2. $S_k$ and $S^\star$ have the same support, and $L_k$ and $L^\star$ have the same rank.

Proof. Similar to the consistency proof for Gaussian latent variable graphical models by Chandrasekaran, Parrilo, and Willsky.
