Change point estimation in high dimensional Markov random-field models

(1)

Supplement to the paper “Change-Point Estimation in

High-Dimensional Markov Random Field Models”

Sandipan Roy†, Yves Atchad ´e† and George Michailidis†

University of Michigan, Ann Arbor, USA.

Although our main motivation is in discrete graphical models, the proposed methodology can be applied more broadly for model-based change-point estimation. With this in mind, we shall prove a more general result that can be useful with other high-dimensional change-point estimation problems. Theorem 1 follows as a special case.

S1. High-dimensional model-based change-point detection

Let {X(t), 1 ≤ t ≤ T } be a sequence of Rp-valued independent random variables. Let Θ ⊆ Rd be an open, non-empty convex parameter space equipped with the Euclidean inner product h·, ·i, and normk · k2. We will also use the `1-norm kθk1 def=

Pd

j=1|θj|, and the `∞-norm kθk∞ def= max1≤j≤d|θj|. We assume that there exists

a change point τ? ∈ {1, . . . , T − 1}, parameters θ(1)? , θ(2)? ∈ Θ, such that for t =

1, . . . , τ?, X(t) ∼ g_θ(t)(1) ? , and for t = τ? + 1, . . . , T , X(t) ∼ g(t)_θ(2) ? , where g(t) θ(1)? and g(t) θ?(2)

are probability densities on Rp. The goal is to estimate τ?, θ(1)? , θ?(2). This setting

includes the Markov random field setting (our main motivation), where g(t)

θ?(1)

and g(t)

θ?(2)

does not depend t. It also includes regression models where the index t in the distributions g(t)

θ(1)?

and g(t)

θ?(2)

accounts for the covariates of subject t.

For t = 1 . . . , T , let (θ, x) 7→ φt(θ, x) be jointly measurable functions on Θ × Rp,

such that θ 7→ φt(θ, x) is convex and continuously differentiable for all x ∈ Rp. We

define `T(τ ; θ1, θ2) def = 1 T τ X t=1 φt(θ1, X(t)) + 1 T T X t=τ +1 φt(θ2, X(t)),

†Address for correspondence: Department of Statistics, 439 West Hall, University of Michigan, Ann Arbor, MI 48109-1107, USA

(2)

and we consider the change-point estimator τ? given by

b

τ =Argmin

τ ∈T

`T(τ ; bθ1,τ, bθ2,τ), (S1)

for a non-empty search domain T ⊂ {1, . . . , T }, where for each τ ∈ T , bθ1,τ and bθ2,τ

are defined as b θ1,τ def = Argmin θ∈Θ " 1 T τ X t=1 φt(θ, X(t)) + λ1,τkθk1 # , and b θ2,τ def = Argmin θ∈Θ " 1 T T X t=τ +1 φt(θ, X(t)) + λ2,τkθk1 # ,

for some positive penalty parameters λ1,τ, λ2,τ. Note that by allowing the use of

user-defined learning functions φt, our framework can be used to analyze maximum

likelihood and maximum pseudo-likelihood change-point estimators. For τ ∈ {1, . . . , T − 1}, we set G_τ1def= 1 T τ X t=1 ∇φt(θ(1)? , X(t)), and Gτ2 def = 1 T T X t=τ +1 ∇φt(θ(2)? , X(t)),

where ∇φt(θ, x) denotes the partial derivative of u 7→ φt(u, x) at θ. Also for τ ∈

{1, . . . , T − 1}, and for θ ∈ Θ, we define,

L1(τ, θ) def = 1 T τ X t=1 h φt(θ, X(t)) − φt(θ(1)? , X(t)) − D ∇φt(θ(1)? , X(t)), θ − θ(1)? Ei , and L2(τ, θ)def= 1 T T X t=τ +1 h φt(θ, X(t)) − φt(θ(2)? , X(t)) − D ∇φt(θ(2)? , X(t)), θ − θ (2) ? Ei . For j = 1, 2, define Aj def = n 1 ≤ k ≤ d : θ(j)_?k 6= 0o, sj = |Aj| , and Cj def=    θ ∈ Θ : X k∈Ac j |θ_k(j)| ≤ 3 X k∈Aj |θ(j)_k |    . (S2)

The curvature of the function Lj(τ, ·) is not always best described with the

usual quadratic function θ 7→ kθ − θ(j)? k22. We will need a more flexible framework,

in order to handle Lj(τ, ·) in the case of discrete Markov random fields. Let r :

(3)

and limx↓0r(x)/x = 0. We call r a rate function, and for a > 0, we define Ψr(a) def

= inf{x > 0 : r(x)/x ≥ a} (inf ∅ = +∞). For τ ∈ {1, . . . , T − 1}, λ > 0, a rate functionr, c > 0, and for j = 1, 2 we work with the event

E_τj(λ,r, c)def=    kGj_τk∞≤ λ 2, θ6=θ(j) inf ? , θ−θ?(j)∈Cj Lj(τ, θ) rkθ − θ_?(j)k₂ ≥ τ T, sup θ6=θ(j)? , θ−θ (j) ? ∈Cj L_j(τ, θ) kθ − θ_?(j)k2 2 ≤ τ T c 2 ) . Define κ(t)₀ def=    E h φt(θ?(2), X(t)) − φt(θ?(1), X(t)) i if t ≤ τ? E h φt(θ?(1), X(t)) − φt(θ?(2), X(t)) i if t > τ? , and U(t) def= ( φt(θ(2)? , X(t)) − φt(θ?(1), X(t)) − κ(t)0 if t ≤ τ? φt(θ(1)? , X(t)) − φt(θ?(2), X(t)) − κ(t)₀ if t > τ? . We make the following assumption.

A 1. There exist finite constants σ0t > 0 such that

E exU(t)≤ ex2σ20tkθ (2) ? −θ (1) ? k22/2_, _{for all} _{x > 0.}

Furthermore, there exist B0> 0, ¯σ20 > 0, ¯κ0 > 0 such that for all integer k ≥ B0,

min 1 k τ? X t=τ?−k+1 κ(t)₀ , 1 k τ?+k X t=τ?+1 κ(t)₀ ! ≥ ¯κ0kθ?(2)− θ (1) ? k22, (S3) and max 1 k τ? X t=τ?−k+1 σ2_0t,1 k τ?+k X t=τ?+1 σ2_0t ! ≤ ¯σ₀2. (S4)

Theorem S1. Assume A1, and θ(1)? 6= θ(2)? . Suppose that ˆτ is defined over a

search domain T 3 τ?, and with penalty λj,τ > 0 (for j = 1, 2). For j = 1, 2,

take a rate function rj, constant cj > 0, and define E def = ∩τ ∈TEτ1(λ1,τ,r1, c1) ∩ E2 τ (λ2,τ,r2, c2). Set δ(τ )def= Ψr1 6 T τ s1/2₁ λ1,τ 2s1/2₁ T λ1,τ + τ Ψr1 6 T τ s1/2₁ λ1,τ + Ψr2 6 T T − τ s1/2₂ λ2,τ 2s1/2₂ T λ2,τ + (T − τ )Ψr2 6 T T − τ s1/2₂ λ2,τ ,

(4)

δdef= sup_{τ ∈T} δ(τ ), and Bdef= max B0,_¯_κ 4δ 0kθ (2) ? −θ (1) ? k22

, with B0 as in A1. Then

P (|ˆτ − τ?| > B) ≤ 2P(Ec) + 4 exp −κ¯20δ 2¯σ2 0 1 − exp −¯κ20kθ (2) ? −θ?(1)k22 8¯σ2 0 . (S5)

Proof. The starting point of the proof is the following variant of a result due to Neghaban et al. (2010).

Lemma 1. Fix τ ∈ {1, 2, . . . , T − 1}. On Eτ1(λ1,τ,r1, c1) ∩ Eτ2(λ2,τ,r2, c2), ˆθj,τ −

θ?(j)∈ Cj, (j = 1, 2), where Cj is defined in (S2), and

kˆθ1,τ− θ (1) ? k2≤ Ψr1 6 T τ s1/2₁ λ1,τ , and kˆθ2,τ − θ(2)? k2 ≤ Ψr2 6 T T − τ s1/2₂ λ2,τ . (S6) Proof. We prove the first inequality. The second follows similarly. We set

U (θ)def= 1 T τ X t=1 φt(θ, X(t)) + λ1,τkθk1− 1 T τ X t=1 φt(θ?(1), X(t)) + λ1,τkθ(1)? k1 ! . Since ˆθ1,τ =Argminθ∈Θ ₁ T Pτ

t=1φt(θ, X(t)) + λ1,τkθk1, and using the convexity of

the functions φt we have

0 ≥ U (ˆθ1,τ) ≥ D G1_τ, ˆθ1,τ− θ (1) ? E + λ1,τ kˆθ1,τk1− kθ (1) ? k1 .

On E_τ1(λ1,τ,r1, c1), kG1τk∞≤ λ1,τ/2. Using this and some easy algebra as in

Negha-ban et al. (2010), shows that ˆθ1,τ − θ (1) ? ∈ C1. Set b = Ψr1 6 T_τ s1/2₁ λ1,τ . We will show that for all θ ∈ Rd such that θ − θ?(1) ∈ C1, and kθ − θ?(1)k2 > b, we have

U (θ) > 0. Since U (ˆθ1,τ) ≤ 0, and ˆθ1,τ − θ (1)

? ∈ C1, the claim that kθ − θ (1) ? k2 ≤ b

follows. On the event E_τ1(λ1,τ,r1, c1), and for θ − θ?(1)∈ C1, we have

U (θ) = DG1_τ, θ − θ?(1) E + L1(τ, θ) + λ1,τ kθk₁− kθ(1)_? k₁ ≥ τ Tr1(kθ − θ (1) ? k2) − 3λ1,τ 2 kθ − θ (1) ? k1 ≥ τ T r1(kθ − θ?(1)k2) − 6 T τ s1/2₁ λ1,τkθ − θ?(1)k2 . Using the definition of Ψr1, we then see that U (θ) > 0 for kθ − θ

(1)

? k2 > b. This ends

the proof. 2

(5)

Lemma 2. Fix τ ∈ {1, 2, . . . , T − 1}. On Eτ1(λ1,τ,r1, c1) ∩ Eτ2(λ2,τ,r2, c2), `T(τ, ˆθ1,τ, ˆθ2,τ) − `T(τ, θ (1) ? , θ (2) ? ) ≤ δ(τ ) T , where δ(τ )def= Ψr1 6 T τ s1/2₁ λ1,τ 2s1/2₁ T λ1,τ + τ c1 2 Ψr1 6 T τ s1/2₁ λ1,τ +Ψr2 6 T T − τ s1/2₂ λ2,τ 2s1/2₂ T λ2,τ+ (T − τ )c2 2 Ψr2 6 T T − τ s1/2₂ λ2,τ . Proof. `T(τ, ˆθ1,τ, ˆθ2,τ) − `T(τ, θ(1)? , θ(2)? ) = 1 T τ X t=1 h φt(ˆθ1,τ, X(t)) − φt(θ(1)? , X(t)) i + 1 T T X t=τ +1 h φt(ˆθ2,τ, X(t)) − φt(θ?(2), X(t)) i .

From the definition 1 T τ X t=1 h φt(ˆθ1,τ, X(t)) − φt(θ(1)? , X(t)) i = D G1_τ, ˆθ1,τ− θ(1)? E + L1(τ, ˆθ1,τ).

On E_τ1(λ1,τ,r1, c1), and using Lemma 1, we have

D G1_τ, ˆθ1,τ− θ?(1) E ≤ λ1,τ 2 kˆθ1,τ− θ (1) ? k1≤ 2s1/21 λ1,τΨr1 6 T τ s1/2₁ λ1,τ , and L1(τ, ˆθ1,τ) ≤ τ T c1 2kˆθ1,τ − θ (1) ? k22 ≤ τ c1 2TΨr1 6 T τ s1/2₁ λ1,τ 2 . Hence 1 T τ X t=1 h φt(ˆθ1,τ, X(t)) − φt(θ(1)? , X(t)) i ≤ 1 TΨr1 6 T τ s1/2₁ λ1,τ 2s1/2₁ T λ1,τ+ τ c1 2 Ψr1 6 T τ s1/2₁ λ1,τ . A similar bound holds for the second term, and the lemma follows easily. 2 We are now in position to prove Theorem S1. We have

(6)

We bound the first term P (ˆτ > τ?+ B). The second term follows similarly by

working with the reversed sequence X(T ), . . . , X(1). For τ > τ?, we shall use `T (τ ) instead of `T

τ ; ˆθ1,τ, ˆθ2,τ

for notational conve-nience, and we define rT (τ )def= `T(τ ) − `T

τ, θ(1)? , θ(2)? . We have `T (τ ) = `T τ, θ?(1), θ(2)? + rT(τ ), = h `T τ, θ(1)? , θ?(2) − `T τ?, θ?(1), θ(2)? i + `T τ?, θ(1)? , θ?(2) + rT(τ ). Hence `T(τ ) − `T(τ?) = h `T τ, θ(1)? , θ?(2) − `T τ?, θ?(1), θ(2)? i + rT(τ ) − rT(τ?). (S7)

It is straightforward to check that for τ > τ?,

`T τ, θ(1)? , θ (2) ? − `_T τ?, θ (1) ? , θ (2) ? = 1 T τ X t=τ?+1 φt(θ (1) ? , X(t)) − φt(θ (2) ? , X(t)) .

Therefore, and using the definition of U(t) and κ(t)₀ , (S7) becomes

`T(τ ) − `T(τ?) = 1 T τ X t=τ?+1 κ(t)₀ + 1 T τ X t=τ?+1 U(t)+ rT(τ ) − rT(τ?). (S8)

We conclude from Lemma 2 that on the event E ,

`T(τ ) − `T(τ?) = 1 T τ X t=τ?+1 κ(t)₀ + 1 T τ X t=τ?+1 U(t)+ T(τ ), where |T(τ )| ≤ 2 sup_{τ T} |δ(τ )| T = 2δ T . (S9) Therefore, P (ˆτ > τ + B) ≤ P(Ec) + X j≥0, τ?+dBe+j∈T P (E , ˆτ = τ?+ dBe + j) . Using (S9), we have

P (E , ˆτ = τ?+ dBe + j) ≤ P (E, `T(τ?+ dBe + j) ≤ `T(τ?))

≤ P   τ?+dBe+j X t=τ?+1 U(t) > τ?+dBe+j X t=τ?+1 κ(t)₀ − 2δ  .

(7)

However, since B > B0, by Assumption A1, τ?+dBe+j X t=τ?+1 κ(t)₀ − 2δ ≥ (dBe + j) ¯κ0kθ?(2)− θ(1)? k22− 2δ ≥ 1 2(dBe + j) ¯κ0kθ (2) ? − θ?(1)k22.

The first part of A1 implies that the random variables Z(t) are sub-Gaussian, and by standard exponential bounds for sub-Gaussian random variables, we then have

P [E , `T(τ?+ dBe + j) ≤ `T(τ?)] ≤ 2 exp − (dBe + j)2¯κ2₀kθ_?(2)− θ_?(1)k4 2 8kθ?(2)− θ?(1)k2₂Pτ_t=τ?+dBe+j_?₊₁ σ2_0t ! , ≤ 2 exp −(dBe + j) ¯κ 2 0kθ (2) ? − θ(1)? k22 8¯σ2 0 ! ,

where the last inequality uses (S4). We can conclude that

P [ˆτ > τ?+ B] ≤ P(Ec) + 2 X j≥0 exp −(dBe + j) ¯κ 2 0kθ (2) ? − θ(1)? k22 8¯σ2 0 ! ≤ P(Ec) + 2 exp −B ¯κ20kθ (2) ? −θ(1)? k22 8¯σ2 0 1 − exp −κ¯20kθ (2) ? −θ(1)? k22 8¯σ2 0 , (S10) as claimed. 2 S2. Proof of Theorem 1

We will deduce Theorem 1 from Theorem S1. We take Θ as Mp, the set of all p × p

real symmetric matrices, equipped with the (modified) Frobenius inner product hθ, ϑi_F def= P

k≤jθjkϑjk, and the associated norm kθkF def

= phθ, θi. With this inner product, we identify Mp with the Euclidean space Rd, with d = p(p + 1)/2. This

puts us in the setting of Theorem S1.

We will use the following notation. If u ∈ Rq, for some integer q ≥ 1, and A is an ordered subset of {1, . . . , q}, we define uA

def

= (uj, j ∈ A), and u−j is a

shortcut for u{1,...,q}\{j}. We define the function Bjk(x, y) = B0(x) if j = k, and

Bjk(x, y) = B(x, y) if j 6= k.

In the present case, the function φtis φ as given in (5), and does not depend on t.

The following properties of the conditional distribution (3) will be used below. It is well known (and easy to prove using Fisher’s identity) that the function θ 7→ φ(θ, x)

(8)

is Lispchitz and

|φ(θ, x) − φ(ϑ, x)| ≤ 2c0kθ − ϑk1, θ, ϑ ∈ Mp, x ∈ Xp, (S11)

where c0 is as in (9). From the expression (3) of the conditional densities, using

straightforward algebra, it is easy to show that the negative log-pseudo-likelihood function φ(θ, x) satisfies the following. For all θ, ∆ ∈ Mp, and x ∈ Xp,

φ(θ + ∆, x) − φ(θ, x) − h∇θφ(θ, x), ∆iF = p X j=1 " log Z_θ+∆(j) (x) − log Z_θ(j)(x) − p X k=1 ∆jk ∂ ∂θjk log Z_θ(j)(x) # . (S12)

Furthermore by Taylor expansion, we have

log Z_θ+∆(j) (x) − log Z_θ(j)(x) − p X k=1 ∆jk ∂ ∂θjk log Z_θ(j)(x) = Z 1 0 (1 − t)Varθ+t∆ p X k=1 ∆jkBjk(Xj, Xk)|X−j ! dt ≤ c 2 0 2 p X k=1 |∆_jk| !2 . (S13)

By the self-concordant bound derived in Atchad´e (2014) Lemma A2, we have

log Z_θ+∆(j) (x) − log Z_θ(j)(x) − p X k=1 ∆jk ∂ ∂θjk log Z_θ(j)(x) ≥ 1 2 + c0Pp_k=1|∆jk| Varθ p X k=1 ∆jkBjk(Xj, Xk)|X−j ! . (S14)

Proof (Proof of Theorem 1). Let us first show that under assumption H3 of Theorem 1, A1 holds. Since in this case φtdoes not actually depend on t, we can

take B0 = 1 in A1, and (S3) follows automatically from H3 with ¯κ0 = κ/kθ(2)? −

θ?(1)k22. Also, (S11) implies that |U(t)| ≤ 4c0kθ?(2)− θ?(1)k1 ≤ 4c0s1/2kθ(2)? − θ?(1)k2,

where s denotes the number of non-zero entries of θ(2)− θ?(1). Hence for all x > 0,

E

exU(t)≤ exp8x2c₀2skθ(2)? − θ(1)? k22

.

This establishes the sub-Gaussian condition of A1, and (S4) holds with ¯σ₀2 = 16c2₀s.

For j = 1, 2, let λ1,τ, λ2,τ as in (8). We will apply Theorem S1 with cj =

64c0sj, the rate function rj(x) = ρjx

2

2+4c0s 1/2 j x

(9)

T

τ ∈T Eτ1(λ1,τ,r1, c1) ∩ Eτ2(λ2,τ,r2, c2), where the search domain T satisfies (15),

(16), and (18). Notice that ifr(x) = ρx2_{/(2 + bx), ρ, b > 0, is a rate function, then}

for a > 0, Ψr(a)def= inf{x > 0 : r(x) ≥ ax} ≤ 4a/ρ, provided that 2ba ≤ ρ. Hence

Ψr1 6 T τ s1/2₁ λ1,τ ≤ 4 ρ1 6 T τ s1/2₁ λ1,τ = 24 × 32c2 s1/2₁ ρ1 r log(dT ) τ , provided that τ ≥ (48 × 32)2c2₀s1 ρ1 2

log(dT ). Therefore, given that all τ ∈ T satisfies (18), with some simple algebra we see that there exists a universal constant a that we can take as a = (24 × 32 × 64)2, such that for all τ ∈ T ,

δ(τ ) ≤ δ = ac2₀M log(dT ), where M = s1 ρ1 1 + c0 s1 ρ1 + s2 ρ2 1 + c0 s2 ρ2 . Therefore in Theorem S1, we can take B = 4ac20M log(dT )

κ , and by the conclusion of

Theorem S1, P [|ˆτ − τ?| > B] ≤ 2P(Ec) + 4 exp − δ 32c2 0s κ kθ(2) ? −θ(1)? k22 2 1 − exp− κ2 27_c2 0skθ (2) ? −θ(1)? k22 .

We show in Lemma 3 and Lemma 4 below that P(Ec) ≤ 8/d, and this ends the proof.

2 Lemma 3. Let λ1,τ, λ2,τ be as in equation (8). Suppose that the search domain

T is such that (15)-(16) hold. Then P max τ ∈T λ −1 1,τ G1_τ ∞> 1 2 ≤ 2 d, and P max τ ∈T λ −1 2,τ G2_τ ∞> 1 2 ≤ 2 d, where d = p(p + 1)/2.

Proof. We carry the details for the first bound. The second is done similarly by working with the reversed sequence X(T ), . . . , X(1). Fix 1 ≤ j ≤ i ≤ p, t ∈ T , and define V_ij(t) def= _∂θ∂

ijφ(θ (1) ? , X(t)). We calculate that V_ij(t) =    −B₀(X_i(t)) + E_θ(1) ? (B0(Xi|X (t) −i) if i = j −2B(X_i(t), X_j(t)) + E_θ(1) ? B(Xi, X_j(t))|X−i(t) + E_θ(1) ? B(Xi, X_j(t))|X−j(t) if j < i.

(10)

In the above display the notation E_θ(1) ? B(Xi, Xj(t))|X (t) −i

is defined as the function z 7→ E_θ(1)

? (B(Xi, zj)|X−i= z−i) evaluated on X

(t)_{. Since X}(1:τ?) i.i.d_{∼ g}

θ(1)? , it follows

that E(Vij(t)) = 0 for t = 1, . . . , τ?. We set µij def = E(V(τ?+1) ij ) = E(V (t) ij ) for t = τ?+ 1, . . . , T . We also set ¯V_ij(t) def = V_ij(t)− EV_ij(t)

. It is easy to see that | ¯V_ij(t)| ≤ 4c0,

where c0 is defined in (9) . With these notations, for τ ∈ T , we can write

(G1_τ)ij = 1 T τ X t=1 ¯ V_ij(t)+(τ − τ?)+µij T , where a+ def

= max(a, 0). For t > τ?, Lemma 5 can be used to write

E h B(X_i(t), X_j(t)) − E_θ(1) ? B(Xi, Xj(t))|X (t) −i i = E Z X B(u, X_j(t))f_θ(2) ? (u|X (t) −i)du − Z X B(u, X_j(t))f_θ(1) ? (u|X (t) −i)du ≤ c2₀ p X j=1 |θ_?,ij(2) − θ(1)_?,ij| ≤ bc2₀, where b is as in (17). Hence |µij| ≤ 2 max j≤i Eθ(2)? h B(X_i(t), X_j(t)) − E_θ(1) ? B(X_i(t), X_j(t))|X_−j(t) i ≤ 2bc 2 0. Set λτ def = (A√τ /T ), where Adef= 32c0 p log(dT ). By a union-bound argument, P max τ ∈T 2λ −1 τ kG1τk∞> 1 ≤X τ ∈T X i,j P " 1 A√τ τ X t=1 ¯ V_ij(t) +2bc 2 0(τ − τ?)+ A√τ > 1 2 # . (S15)

Since A = 32c0plog(dT ), for τ ∈ T , and using (15) we see that maxτ ∈T 2bc

2 0(τ −τ?)+ A√τ ≤ 1/4. Hence P max τ ∈T 2λ −1 τ kG1τk∞> 1 ≤ X τ ∈T X i,j P " τ X t=1 ¯ V_ij(t) > A √ τ 4 # , (S16) ≤ 2X τ ∈T X i,j exp − A 2 83_c2 0 ≤ 2 d.

(11)

Remark 1. The log(dT ) term that appears in the convergence rate of Theorem 1 follows from the union bound and the exponential bound used in (S15), and (S16) respectively. Alternatively, it is easy to see that one could also write

P max τ ∈T 2λ −1 τ kG1τk∞> 1 ≤X i,j P " max τ ∈T 1 √ τ τ X t=1 ¯ V_ij(t) > A 4 # .

Hence whether one can remote the log(T ) term hinges on the existence of an ex-ponential bound for the term maxτ ∈T

τ −1/2Pτ t=1V¯ (t) ij

. Unfortunately we are not aware of any such result in the literature. The closest results available deal with the unweighted sums: maxτ ∈T

Pτ t=1V¯ (t) ij

(see for instance pinelis (2006) for some of the best bounds available).

Lemma 4. Assume H1 and H2. Let λ1,τ and λ2,τ as in Equation (8), and let

the search domain T be such that Equations (15)-(16) hold. Take c1 = 64c0s1,

c2= 64c0s2 and r1(x) = ρ1x2 2 + 4c0s 1/2 1 x , and r2(x) = ρ2x2 2 + 4c0s 1/2 2 x , x ≥ 0. Then the eventT

τ ∈T Eτ1(λ1,τ,r1, c1) ∩ Eτ2(λ2,τ,r2, c2) holds with probability at least

1 −8_d.

Proof. We have seen in Lemma 3 that with λ1,τ and λ2,τ as in equation (8),

the event ∩τ ∈T {kG1τk∞≤ λ1,τ/2} ∩ {kG1τk∞≤ λ2,τ/2} holds with probability at

least 1 − 2/d. We have L1(τ, θ)def= 1 T τ X t=1 h φ(θ, X(t)) − φ(θ(1)? , X(t)) − D ∇φ(θ?(1), X(t)), θ − θ?(1) Ei . (S13) then implies that for all τ ∈ T , and θ − θ?(1)∈ C1,

L₁(τ, θ) ≤ τ T 4c2₀ 2 kθ − θ (1) ? k21 ≤ τ T 64c2₀s1 2 kθ − θ (1) ? k22.

A similar bound holds for j = 2. Hence ∩τ ∈T∩2j=1

n sup_θ6=θ(j) ? , θ−θ(j)? ∈Cj Lj(τ,θ) kθ−θ(j) ? k22 ≤ _Tτcj 2 o

holds with probability one. Using (S14), we have L₁(τ, θ) ≥ τ T 1 2 + 4c0s 1/2 1 kθ − θ (1) ? k2 ×1 τ τ X t=1 p X j=1 Var_θ(1) ? p X k=1 Bkj(Xj(t), X (t) k ) θkj − θ(1)_?,kj |X_−j(t) ! . (S17)

(12)

We will now show that for all τ ∈ T , and all θ − θ?(1)∈ C1, with probability at least 1 − 2/d, we have 1 τ τ X t=1 p X j=1 Var_θ(1) ? p X k=1 Bkj(X_j(t), X_k(t)) θkj− θ_?,kj(1) |X_−j(t) ! ≥ ρ1kθ − θ?(1)k22.

Given (S17), this assertion will implies that L1(τ, θ) ≥ _Tτr1(kθ − θ(1)? k2) for all

θ − θ(1)? ∈ C1 with probability at least 1 − 2/d, where r1(x) = ρ1x2/(2 + 4c0s1/21 x).

The lemma will then follow easily. For ∆ ∈ Mp, we define V1_{(τ, ∆)}def₌ 1 τ τ X t=1 p X j=1 Var_θ(1) ? p X k=1 Bkj(Xj(t), X (t) k )∆kj|X (t) −j ! , and W_jkk(t)0 def = Cov_θ(1) ? B(X_j(t), X_k(t)), B(X_j(t), X_k(t)0 )|X (t) −j − EhCov_θ(1) ? B(X_j(t), X_k(t)), B(X_j(t), X_k(t)0 )|X (t) −j i . Then for ∆ ∈ C1\ {0}, V1(τ, ∆) = 1 τ τ X t=1 p X j=1 p X k,k0₌₁ ∆jk∆jk0_E h Cov_θ(1) ? B(X_j(t), X_k(t)), B(X_j(t), X_k(t)0 )|X (t) −j i . +1 τ τ X t=1 p X j=1 p X k,k0₌₁ ∆jk∆jk0W(t) jkk0 (S18)

Using H1, we deduce that

V1_{(τ, ∆) ≥ 2ρ} 1k∆k22+ 1 τ τ X t=1 p X j=1 p X k,k0₌₁ ∆jk∆jk0W(t) jkk0 +(τ − τ?)+ τ p X j=1 E_θ(2) ? " Var_θ(1) ? p X k=1 ∆jkBik(Xj, Xk)|X−j !# − (τ − τ?)+ τ p X j=1 E_θ(1) ? " Var_θ(1) ? p X k=1 ∆jkBik(Xj, Xk)|X−j !# . (S19)

(13)

By the comparison Lemma 5 E_θ(2) ? " Var_θ(1) ? p X k=1 ∆jkBik(Xj, Xk)|X−j !# − E_θ(1) ? " Var_θ(1) ? p X k=1 ∆jkBik(Xj, Xk)|X−j !# ≤ c3 0 p X k=1 |∆_jk| !2 _p X k=1 |θ(1)_?jk− θ_?jk(2)| ≤ c3 0b p X k=1 |∆_jk| !2 ,

which implies that V1(τ, ∆) ≥ 2ρ1− 64 τ (τ − τ?)+s1c 3 0b k∆k2₂+ 1 τ τ X t=1 p X j=1 p X k,k0₌₁ ∆jk∆jk0W(t) jkk0.

Given that on T+, 128(τ − τ?)s1c30b ≤ ρ1τ , it follows that for all τ ∈ T ,

V1(τ, ∆) ≥ 3 2ρ1k∆k 2 2+ 1 τ τ X t=1 p X j=1 p X k,k0₌₁ ∆jk∆jk0W(t) jkk0 (S20) Set Z_jkkτ 0 def = _τ1 τ X t=1

W_jkk(t)0. We conclude from equation (S20) that if for some ∆ ∈

C1\ {0}, and for some τ ∈ T ,

V1(τ, ∆) ≤ ρ1k∆k22 (S21) then p X j=1 p X k,k0₌₁ ∆jk∆jk0Z(τ ) jkk0 ≤ − ρ1 2 k∆k 2 2.

Therefore if there exists a non-zero ∆ ∈ C1 and τ ∈ T such that equation (S21)

holds then sup

j,k,k0

|Z_jkk(τ )0|

!

(14)

union-sum bound, P " sup j,k,k0 |Z_jkk(τ )0| ≥ ρ1 128s1 # ≤ 2 exp 3 log p − τ ρ 2 1 29_c2 0s21 ≤ 2 p,

since for τ ∈ T , τ ≥ 211c2₀s2₁ρ−2₁ log p. 2

Lemma 5. Let (Y, A, ν) be a measure space where ν is a finite measure. Let g1, g2, f1, f2 : Y → R be bounded measurable functions. Set Zgi

def = R Ye gi(y)_ν(dy), i ∈ {1, 2}. Then 1 Zg1 Z

f1(y)eg1(y)ν(dy) −

1 Zg2

Z

f2(y)eg2(y)ν(dy)

≤ kf2− f1k∞+ 1

2osc(g2− g1) (osc(f1) +osc(f2)) ,

where kf k∞= supx∈Y|f (x)|, and osc(f ) def

= sup_x,y∈Y|f (x) − f (y)| is the oscillation of f .

Proof. The proof follows from Atchad´e (2014) Lemma 3.4.

S3. Different Methods of Missing Data Imputation for the Real Data

Appli-cation

In the main paper we replaced the missing votes by the value (yes/no) of that member’s party majority position on that particular vote. Here we employed two other missing data imputation techniques viz. (i) replacing all missing values by the value (yes/no) representing the winning majority on that bill and (ii) replacing the missing value of a Senator by the value that the majority of the opposite party voted on that particular bill. The estimated change-point obtained following these two imputation methods are not much different . The imputation technique (i) results in a estimated change-point at January 19, 1995 and the technique (ii) yields estimated change-point at January 17, 1995 respectively. The change-point estimate we obtained in the main paper was January 17, 1995. Clearly there is not much difference between the different imputation techniques and Fig. S1 also conveys the same message.

(15)

Fig. S1: Estimated Change-points via imputation technique (i) and (ii) respectively

References

Atchad´e, F. Y. (2014). Estimation of Network Structures from partially observed markov random field. Electron. J. Statist., 8, 2242-2263

Bach, F. (2010). Self-concordant analysis for logistic regression. Electronic Journal of Statistics, 4, 384-414.

Bai, J.(2010). Estimation of a change-point in multiple regression models. The Review of Economics and Statistics, 4, 551-563.

Banerjee, O., El Ghaoui, L. and d’Aspremont, A. (2008) Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J.Mach.Learn.Res., 9, 485-516.

Basu, S. and Michailidis, G. (2015). Estimation in high dimensional vector autore-gressive models. Ann. Statist. To Appear.

Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. J. R. Stat. Soc. Ser. B., 36, 192-236.

Bhattcharya, K. P. (1974). Maximum likelihood estimation of a change-point in the distribution of the independent random variables: General Multiparameter CaseJ. Mult. Anls., 23, 183-208.

Bickel, P.J. and Levina, E. (2008). Regularized estimation of large covariance ma-trices. Ann. Statist., 36, 199-227.

Bickel, P. J., Ritov, Y. and Tsybakov, A.B. (2009) Simultaneous analysis of lasso and Dantzig selector. Ann. Statist., 37, 1705-1732.

(16)

Buja, A., Hastie, T. and Tibshirani, R. (1989). Linear smoothers and additive models. Ann. Statist., 17, 453-510.

Carlstein, E. (1988). Nonparametric change-point estimation. Ann. Statist., 16, 188-197.

Drton, M. and Perlman, M.D. (2004). Model selection for Gaussian concentration graphs. Biometrika., 91, 591-602.

Friedman, J., Hastie, T. and Tibshirani, R. (2010). Regularized paths for generalized linear models via coordinate descent. J. Statist. Softwr., 33, 1-22

Guo, J., Levina, E., Michailidis, G. and Zhu, J. (2010). Joint structure estimation for categorical markov networks. Tech. rep., Univ. of Michigan.

Han, F. and Liu, H. (2006). A direct estimation of high dimensional stationary vector autoregressionsarXiv:1307.0293v2 [stat.ML] .

Hanneke, S. and Xing, P. E. (2006). Discrete temporal models of social networks. Lecture Notes in Computer Science., 4503, 115-125.

Hinkley, V. D. (1970). Inference about the change-point in a sequence of random variables. Biometrika., 57, 1-17.

Hinkley, V. D. (1972). Time-ordered classification. Biometrika., 59, 509-523. Hoefling, H.(2010). BMN: The pseudo-likelihood method for pairwise binary markov

networks. R package version 1.02, http://CRAN.R-project.org/package=BMN. H¨ofling, H. and Tibshirani, R. (2009). Estimation of Sparse Binary Pairwise Markov

Networks using Pseudo-likelihoods. J. Mach. Learn. Res. 10, 883-906.

Hurvich, M. C., Simonoff, S.J. and Tsai, C. (1998). Smoothing Parameter Selection in Nonparametric Regression Using an Improved Akaike Information Criterion. J. R. Stat. Soc. Ser. B., 60, 271-293.

Kolar, M., Song, L., Ahmed, A. and Xing, P. E. (2010). Estimating Time varying Networks. Ann. App. Statist., 4, 94-123.

Kolar, M. and Xing, P. E. (2012). Estimating networks with jumps. Electron. J. Statist., 6, 2069-2106.

(17)

Kosorok, R. M. (2012). Introduction to empirical processes and semiparametric inference. Springer Series in Statistics.

Lam, C. and Fan, J. (2009). Sparsistency and rates of convergence in large covari-ance matrix. Ann. Statist., 37, 4254-4278.

Lan, Y., Banerjee M. and Michailidis G. (2009). Change-point estimation under adaptive sampling. Ann. Statist., 37, 1752-1791.

Loader C. (1996). Change-point estimation using nonparametric regression. Ann. Statist., 24, 1667-1678.

Massart, P. (2007). Concentration inequalities and model selection. Springer Verlag. Meinshausen, N. and B¨uhlmann, P.(2006) High dimensional graphs and variable

selection with the lasso. Ann. Statist., 34, 1436-1462.

Meinshausen, N. and B¨uhlmann, P.(2010). Stability selection. J. R. Statist. Soc. B, 72, 417-473.

Moody, J. and Mucha, P.(2013). Portrait of political party polarization Network Science, 1, 119-121.

Muller, H.(1992). Change-points in nonparametric regression analysis. Ann. Statist., 20, 737-761.

Nadaraya, E. A.(1965) On non-parametric estimation of density functions and regression curves. Theory Prob. Applic., 10, 186-190.

Neghaban, S., Ravikumar, P., Wainwright, M. and Yu, B. (2010). A unified frame-work for high-dimensional analysis of M-estimators with decomposable regulariz-ers. Statist. Sci., 27, 538-557.

Pinelis, I. (2006). On the normal domination of (super) martingales. Electronic Journal of Probability, 39, 1049-1070.

Raimondo, M. (1998). Minimax estimation of sharp change-points. Ann. Statist., 26, 1379-1397.

Ravikumar, P., Wainwright, J. M. and Lafferty, D. J. (2010). High-dimensional ising model selection using l1-regularized logistic regression. Ann. Statist., 38,

(18)

Rothman, A. J., Bickel P. J., Levina, E. and Zhu, J. (2008). Sparse permutation invariant covariance estimation. Electron. J. Stat., 2, 494-515.

Schwarz, G.(1978). Estimating the dimension of a model. Ann. Statist., 6, 461-464. Van de Geer, S. A. and B¨uhlmann, P. (2009). On the conditions used to prove

oracle results for the lasso. Electron. J. Stat., 3, 1360-1392.

Van der Vaart, A. and Wellner, J. (1996). Weak convergence and empirical pro-cesses. Springer Series in Statistics.

Wainwright, J. M. and Jordan, I. M.(2008) Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning. 1, 1-305. Xue, L., Zou, H. and Cai, T. (2012). Non-concave penalized composite likelihood

estimation of sparse ising models. Ann. Statist., 40, 1403-1429.

Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika., 94, 19-35.

Zhou, S., Lafferty, J. and Wasserman, L. (2010). Time-varying undirected graphs. Machine Learning. , 80, 295-319.