Distributed Hypothesis Testing in Networks


(1)

Distributed Hypothesis Testing in Networks

Angelia Nedić

Collaborative work with Alexander Olshevsky and Cesar Uribe

Industrial and Enterprise Systems Engineering Department and Coordinated Science Laboratory

University of Illinois at Urbana-Champaign

October 2, 2015

(2)

Learning/Decision Making

Decision Making with Learning can be thought of as a process through which we refine our understanding of a "state" of a certain phenomenon of interest:

[Timeline: state estimates x_{k-1}, x_k, x_{k+1} at times t_{k-1}, t_k, t_{k+1}, each refined by a new observation]

Our focus will be on a Bayesian "Decision Process" - Hypothesis Testing.

(3)

Decision Process in Networks

The process takes place in a system of networked agents with different learning capabilities.

• A system of interconnected agents observes a "phenomenon"

• Each agent has "learning" capability

• Each agent has its own observations of the phenomenon

• Agents are willing to share their knowledge

• If the agents collaborate (share their knowledge), can they learn better, faster?

• If so, can we quantify how much better, how much faster?

We answer some of those questions for a system of Bayesian "learners"

(4)

Related Literature

The literature spans social and economic networks as well as wired/wireless sensor networks

• Learning and opinion aggregation in social networks: Gale and Kariv 2003, Epstein et al. 2010, Acemoglu et al. 2011, Jadbabaie et al. 2012, 2013, Mueller-Frank 2013, Shahrampour et al. 2014, Lalitha et al. 2014, 2015, N. O. Uribe 2014, 2015, Sarwate and Javidi 2015

• Detection/estimation in sensor networks: Olfati-Saber et al. 2006, Saligrama et al. 2006, Rahnama-Rad and Tahbaz-Salehi 2010, Barbarossa et al. 2013, Battistelli and Chisci 2014

• Consensus literature: DeGroot 1974, Aumann 1976, Borkar and Varaiya 1982, Tsitsiklis 1984, Jadbabaie, Lin, and Morse 2003, . . .

(5)

Bayesian Learning: Single Agent

An agent observes a certain phenomenon, and uses the observations to refine its understanding of the state of the phenomenon.

The agent receives observations s_k of the true state, which are random and form an i.i.d. process with a distribution f over a finite set S of the possible states.

Agent learning model: a set of probability distributions ℓ(·|θ) on S parametrized by θ, where θ takes a finite number of possible values: the set of hypotheses Θ = {θ_1, . . . , θ_m}.

At time k, the agent has a belief µ_k (a probability distribution on Θ) that best explains the observations s_1, . . . , s_k collected up to that time. At time k + 1, it observes s_{k+1} and updates its belief to µ_{k+1}:

[Timeline: beliefs µ_{k-1}, µ_k, µ_{k+1} at times k-1, k, k+1, updated after the observations s_k, s_{k+1}]

Bayesian update: for all θ ∈ Θ,

µ_{k+1}(θ) = µ_k(θ) ℓ(s_{k+1}|θ) / ∑_{p=1}^m µ_k(θ_p) ℓ(s_{k+1}|θ_p),   written compactly as µ_{k+1} = BU(µ_k, s_{k+1})

∗ ℓ(·|θ) are the likelihood functions
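A minimal numerical sketch of this update, in Python with NumPy (the two hypotheses, likelihoods, and observation stream below are illustrative assumptions, not taken from the slides):

    import numpy as np

    def bayesian_update(mu, lik, s):
        # One step of mu_{k+1} = BU(mu_k, s_{k+1}):
        #   mu  : belief over the m hypotheses, shape (m,)
        #   lik : likelihoods, lik[r] = ell(. | theta_r), shape (m, |S|)
        #   s   : index of the observed signal in S
        post = mu * lik[:, s]              # mu_k(theta) * ell(s_{k+1} | theta)
        return post / post.sum()           # normalize over the hypotheses

    rng = np.random.default_rng(0)
    lik = np.array([[0.5, 0.5],            # ell(. | theta_1)
                    [0.7, 0.3]])           # ell(. | theta_2)
    mu = np.array([0.5, 0.5])              # uniform prior
    for _ in range(200):
        s = rng.choice(2, p=lik[1])        # i.i.d. draws from f = ell(. | theta_2)
        mu = bayesian_update(mu, lik, s)
    print(mu)                              # belief concentrates on theta_2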

(6)

Learning on Graphs

Learning in a system of networked agents with different learning capabilities.

• A system of interconnected agents observes a "phenomenon"

• Each agent has "learning" capability and maintains a belief about the true state of the phenomenon

• Agents receive noisy private observations of the phenomenon and use them to update their beliefs

• Agents also talk to their neighbors and exchange their beliefs, which can also be used to update the beliefs

• How should agents aggregate their beliefs so that every agent learns the true state?

(7)

Agents can do better by collaborating

• Nature chooses a coin, which could be biased or not

• Over time, the coin is tossed and the outcomes can be observed (either H or T)

• Assume the toss outcomes are i.i.d.

• Coin model: let θ = 1 if the coin is biased and θ = 0 otherwise

• The outcome of the k-th toss is s_k ∈ {H, T}

• Two agents observe the toss outcomes and want to learn if the coin is biased or not

• Both agents have some prior beliefs µ^1_0 and µ^2_0 about the biasedness of the coin (probability distributions on the θ-values)

• Agent learning models captured through likelihoods:

ℓ^1(· | θ = 1) = [0.5, 0.5],   ℓ^1(· | θ = 0) = [0.55, 0.45]

ℓ^2(· | θ = 1) = [0.55, 0.45],   ℓ^2(· | θ = 0) = [0.5, 0.5]

• Independently (by Bayesian updates), neither of them can learn on its own

• Collaboratively, both can learn no matter which coin nature chooses
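Each agent's solo Bayesian limit is the hypothesis whose model has the smallest KL divergence from the true outcome frequencies. A quick sketch of that comparison (the true outcome distributions, [0.55, 0.45] for the biased coin and [0.5, 0.5] for the fair one, are assumptions made for illustration):

    import numpy as np

    def dkl(p, q):
        # D_KL(p || q) = sum_r p_r log(p_r / q_r)
        return float(np.sum(p * np.log(p / q)))

    lik1 = {1: np.array([0.50, 0.50]), 0: np.array([0.55, 0.45])}  # agent 1's models
    lik2 = {1: np.array([0.55, 0.45]), 0: np.array([0.50, 0.50])}  # agent 2's models
    truth = {1: np.array([0.55, 0.45]),  # assumed (H, T) frequencies, biased coin
             0: np.array([0.50, 0.50])}  # fair coin

    for coin, f in truth.items():
        best1 = min((0, 1), key=lambda t: dkl(f, lik1[t]))
        best2 = min((0, 1), key=lambda t: dkl(f, lik2[t]))
        print(f"true coin {coin}: agent 1 alone -> {best1}, agent 2 alone -> {best2}")
    # true coin 1: agent 1 alone -> 0 (wrong), agent 2 alone -> 1 (right)
    # true coin 0: agent 1 alone -> 1 (wrong), agent 2 alone -> 0 (right)

Whichever coin nature picks, one of the agents alone would settle on the wrong hypothesis; the point of the slides is that exchanging beliefs removes this ambiguity.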

(8)

Known:

• A fully Bayesian approach may not be possible in general, since full knowledge of the network structure and of the other agents' likelihood functions may not be available; Gale and Kariv 2003.

• Non-Bayesian methods have been shown in the literature to be successful in learning; Epstein, Noor, and Sandroni 2010

• An aggregation method for local Bayes estimates leading to a global learning in work of Jadbabaie, Molavi, Sandroni, and Tahbaz-Salehi 2012, Jadbabaie, Molavi, and Tahbaz-Salehi 2013, Bandyopadhyay and Chung 2014

We will start with a slightly more general model akin to that of Jadbabaie et al. 2012

(9)

Learning Model

• There are n agents, all with the same set of hypotheses: θ ∈ Θ = {θ_1, . . . , θ_m}

• At each time k, agent i has a belief µ^i_k, a probability distribution on the hypotheses

• Each agent i receives a noisy private observation s^i_k ∈ {s^i_1, . . . , s^i_{m_i}}

• We assume that the observations {s^i_k} are i.i.d. for each agent, and independent across agents

• Each agent i has likelihood functions (probability distributions) capturing the agent's learning capability: ℓ^i(· | θ) for all θ ∈ Θ

• Agents also exchange beliefs with neighbors and use them when updating their beliefs

• The neighbor structure (static for the moment) is captured by a directed graph G = (V, E), where E is the set of directed edges, and (i, j) ∈ E if agent i receives information from agent j

• Upon observing the outcome s^i_{k+1}, agent i updates its belief based on the prior belief µ^i_k, the outcome, and its likelihood functions

(10)

An update of the following form was proposed and analyzed in Jadbabaie et al. 2012:

µ^i_{k+1} = A_ii BU(µ^i_k, s^i_{k+1}) + ∑_{j∈N_i} A_ij µ^j_k,

where N_i = {j ∈ [n] | (j, i) ∈ E}, with a row-stochastic matrix A satisfying A_ij > 0 for (j, i) ∈ E.

• The update makes sense, since it preserves "probability distributions".

• Nice results were shown (convergence with probability 1 to the true state of the "world").

• What motivates this update rule?

• Why not

µ^i_{k+1} = BU(µ̂^i_k, s^i_{k+1})   with   µ̂^i_k = ∑_{j∈N_i∪{i}} A_ij µ^j_k ?

• Can the update rule be related to some network optimization objective?

(11)

Agent Update: Inference Via Minimization Rule

• When learning alone: the standard Bayesian update of agent i minimizes a cost function composed of two terms (Walker 2006):

• the Maximum Likelihood Estimation (MLE) term for a state given the observed data, and

• a regularization function: the KL divergence from the current prior,

µ^i_{k+1} = argmin_{π∈Π(Θ)} { −E_π[ log ℓ^i(s^i_{k+1} | ·) ] + D_KL(π ‖ µ^i_k) }

where the first term is the MLE term and Π(Θ) is the set of all probability distributions over the set Θ.

• When learning with neighbors:

µ^i_{k+1} = argmin_{π∈Π(Θ)} { −E_π[ log ℓ^i(s^i_{k+1} | ·) ] + ∑_{j=1}^n A_ij D_KL(π ‖ µ^j_k) }

with a row-stochastic matrix A compatible with the graph G: A_ij > 0 when (j, i) ∈ E, and A_ii > 0.

• There is an explicit expression for µ^i_{k+1} (the unique minimizer)

(12)

Learning with Neighbors

The minimizer is given by

µ^i_{k+1}(θ) = ∏_{j=1}^n [µ^j_k(θ)]^{A_ij} ℓ^i(s^i_{k+1}|θ) / ∑_{r=1}^m ∏_{j=1}^n [µ^j_k(θ_r)]^{A_ij} ℓ^i(s^i_{k+1}|θ_r)   for all θ ∈ Θ

Proof: Consider the Lagrangian function associated with the problem (the constraint π'1 = 1 is penalized):

L(π, λ) = −∑_{r=1}^m π_r log(y_r) + ∑_{j=1}^n A_ij ∑_{r=1}^m π_r log( π_r / z^j_r ) + λ( ∑_{r=1}^m π_r − 1 )

Set the gradient to zero:

−log(y_r) + ∑_{j=1}^n A_ij log( π_r / z^j_r ) + 1 + λ = 0   for all r = 1, . . . , m

Since ∑_{j=1}^n A_ij = 1 (row-stochasticity), this collapses to (with c = 1 + λ)

−log(y_r) + log(π_r) − log( ∏_{j=1}^n (z^j_r)^{A_ij} ) + c = 0

(13)

log( π_r / ( y_r ∏_{j=1}^n (z^j_r)^{A_ij} ) ) = −c,   i.e.,   π_r = C y_r ∏_{j=1}^n (z^j_r)^{A_ij}   for all r = 1, . . . , m,

with C > 0 just a normalization constant ensuring that π ∈ Π(Θ). With y_r = ℓ^i(s^i_{k+1} | θ_r) and z^j_r = µ^j_k(θ_r) we obtain

µ^i_{k+1}(θ) = ∏_{j=1}^n [µ^j_k(θ)]^{A_ij} ℓ^i(s^i_{k+1}|θ) / ∑_{r=1}^m ∏_{j=1}^n [µ^j_k(θ_r)]^{A_ij} ℓ^i(s^i_{k+1}|θ_r)   for all θ ∈ Θ

• If N_i = ∅ (corresponding to A_ii = 1), the preceding rule reduces to the standard Bayesian update

µ^i_{k+1}(θ) = µ^i_k(θ) ℓ^i(s^i_{k+1}|θ) / ∑_{r=1}^m µ^i_k(θ_r) ℓ^i(s^i_{k+1}|θ_r)   for all θ ∈ Θ
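A sketch of this distributed update for a small network, computed in log space (the 3-agent chain, the row-stochastic A, the likelihoods, and the true signal distribution f below are all illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(1)
    n, m = 3, 2                               # agents, hypotheses
    A = np.array([[0.50, 0.50, 0.00],         # row-stochastic weights on a chain
                  [0.25, 0.50, 0.25],
                  [0.00, 0.50, 0.50]])
    # lik[i, r] = ell^i(. | theta_r), a distribution over the binary signal {0, 1}
    lik = np.array([[[0.5, 0.5], [0.5, 0.5]],   # agent 1: uninformative models
                    [[0.7, 0.3], [0.5, 0.5]],   # agent 2: identifies theta_1
                    [[0.5, 0.5], [0.3, 0.7]]])  # agent 3: also prefers theta_1
    f = np.array([0.7, 0.3])                  # true signal distribution
    mu = np.full((n, m), 1.0 / m)             # uniform priors

    for _ in range(300):
        s = rng.choice(2, p=f, size=n)               # private i.i.d. observations
        geo = np.exp(A @ np.log(mu))                 # prod_j [mu^j_k]^{A_ij}
        post = geo * lik[np.arange(n), :, s]         # times ell^i(s^i_{k+1} | .)
        mu = post / post.sum(axis=1, keepdims=True)  # per-agent normalization
    print(mu)        # every agent's belief concentrates on theta_1

Note that agent 1 learns θ_1 even though its own likelihoods carry no information; the geometric mixing transports its neighbors' evidence.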

(14)

Belief Update

The preceding belief update can be related to opinion aggregation, also referred to as "opinion pooling". A general form of opinion pooling (Gilardoni and Clayton 1993) consists of a Bayesian update step

b^i_{k+1} = BU(µ^i_k, s^i_{k+1})

followed by mixing based on g-opinion pools:

µ^i_{k+1}(θ) = g^{−1}( ∑_{j=1}^n A_ij g(b^j_{k+1}(θ)) ) / ∑_{r=1}^m g^{−1}( ∑_{j=1}^n A_ij g(b^j_{k+1}(θ_r)) )   for all θ ∈ Θ

• It is shown by Gilardoni and Clayton 1993 that a consensus can be reached only for g(x) = x or g(x) = log(x) .

Our rule is the logarithmic pool, g(x) = log(x):

µ^i_{k+1}(θ) = ∏_{j=1}^n [µ^j_k(θ)]^{A_ij} ℓ^i(s^i_{k+1}|θ) / ∑_{r=1}^m ∏_{j=1}^n [µ^j_k(θ_r)]^{A_ij} ℓ^i(s^i_{k+1}|θ_r)   for all θ ∈ Θ

Recent work by Lalitha et al. 2014 also uses g(x) = log(x), while the rule used in Jadbabaie et al. 2012 is more related to the linear pool, i.e., g(x) = x; the two mixings are contrasted in the sketch below.
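For contrast, one mixing step under the two admissible pools, applied to the same pair of Bayesian posteriors (the beliefs and weights are illustrative):

    import numpy as np

    b = np.array([[0.9, 0.1],        # agent 1's Bayesian posterior b^1_{k+1}
                  [0.2, 0.8]])       # agent 2's Bayesian posterior b^2_{k+1}
    A = np.array([[0.5, 0.5],
                  [0.5, 0.5]])       # equal-weight mixing

    linear = A @ b                               # g(x) = x: arithmetic mean
    log_pool = np.exp(A @ np.log(b))             # g(x) = log x: geometric mean
    log_pool /= log_pool.sum(axis=1, keepdims=True)
    print(linear[0], log_pool[0])                # [0.55 0.45] vs [0.6 0.4]

The linear pool needs no renormalization (rows of A·b already sum to one), while the logarithmic pool does, which is exactly the role of the denominator above.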

(15)

Learning Objective

µ^i_{k+1} = argmin_{π∈Π(Θ)} { −E_π[ log ℓ^i(s^i_{k+1} | ·) ] + D_KL(π ‖ µ^i_k) }

can be interpreted as a gradient method for minimizing

D_KL( f^i ‖ ℓ^i(· | θ) )   over θ ∈ Θ,   where D_KL(p‖q) = ∑_{i=1}^d p_i log(p_i / q_i)

• Learning Objective: the update corresponds to determining a hypothesis θ ∈ Θ that best explains the agent system's data:

min_{θ∈Θ} ∑_{i=1}^n D_KL( f^i ‖ ℓ^i(· | θ) )

• Let

Θ^i_∗ = argmin_{θ∈Θ} D_KL( f^i ‖ ℓ^i(· | θ) )   and   Θ^∗ = ∩_{i=1}^n Θ^i_∗

• Consistency: all agents should agree on the hypotheses in Θ^∗ that best describe the observations of all.

• Definition: the agents' beliefs are said to provide a distributed consistent estimator if, with probability 1, for all agents i: lim_{k→∞} µ^i_k(θ) = 0 for all θ ∉ Θ^∗.
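Both Θ^i_∗ and Θ^∗ are directly computable once f^i and the ℓ^i are fixed; a small sketch with hypothetical distributions for two agents and two hypotheses:

    import numpy as np

    def dkl(p, q):
        # D_KL(p || q) = sum_i p_i log(p_i / q_i)
        return float(np.sum(p * np.log(p / q)))

    f = [np.array([0.7, 0.3]), np.array([0.4, 0.6])]       # f^1, f^2
    lik = [  # lik[i][r] = ell^i(. | theta_r)
        [np.array([0.7, 0.3]), np.array([0.7, 0.3])],      # agent 1: both models tie
        [np.array([0.4, 0.6]), np.array([0.5, 0.5])],      # agent 2: theta_1 is exact
    ]
    cost = np.array([[dkl(f[i], lik[i][r]) for r in range(2)]
                     for i in range(2)])   # cost[i, r] = D_KL(f^i || ell^i(.|theta_r))
    print(cost.argmin(axis=1))       # one minimizer per agent; agent 1's costs tie,
                                     # so Theta^1_* is all of Theta
    print(cost.sum(axis=0).argmin()) # network minimizer: Theta^* = {theta_1} (index 0)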

(16)

Assumptions: Graphs, Likelihoods and Initial Beliefs

Assumption (Graphs) The graph sequence {G_k} and the matrix sequence {A_k} are such that:

1. A_k is row-stochastic with [A_k]_ij > 0 if (j, i) ∈ E_k, and [A_k]_ii > 0.

2. If [A_k]_ij > 0, then [A_k]_ij > η for some positive scalar η.

3. {G_k} is B-strongly connected, i.e., there is an integer B ≥ 1 such that the graph (V, ∪_{i=kB}^{(k+1)B−1} E_i) is strongly connected for all k ≥ 0.

Assumption (Likelihood Models)

Θ^∗ ≜ ∩_{i=1}^n argmin_{θ∈Θ} D_KL( f^i(·) ‖ ℓ^i(·|θ) )   is nonempty.

Assumption (Initial Beliefs) For all agents i = 1, . . . , n:

1. The prior beliefs on all θ ∈ Θ^∗ are positive, i.e., µ^i_0(θ) > 0 for all θ ∈ Θ^∗.

2. There exists an α > 0 such that ℓ^i(s^i|θ) > α for all outcomes s^i and all θ ∈ Θ.

(17)

Consistency

Proposition 1 Under these assumptions, the update rule

µ^i_{k+1} = argmin_{π∈Π(Θ)} { −E_π[ log ℓ^i(s^i_{k+1}|·) ] + ∑_{j=1}^n [A_k]_ij D_KL(π ‖ µ^j_k) },

or explicitly

µ^i_{k+1}(θ) = ∏_{j=1}^n µ^j_k(θ)^{[A_k]_ij} ℓ^i(s^i_{k+1}|θ) / ∑_{r=1}^m ∏_{j=1}^n µ^j_k(θ_r)^{[A_k]_ij} ℓ^i(s^i_{k+1}|θ_r),

generates belief sequences {µ^i_k}, i = 1, . . . , n, such that with probability 1, for all agents i:

lim_{k→∞} µ^i_k(θ) = 0   for all θ ∉ Θ^∗

Proof: Choose a θ^∗ ∈ Θ^∗ and define

φ^i_k(θ) = log( µ^i_k(θ) / µ^i_k(θ^∗) )

(18)

From the update rule we obtain, for all i = 1, . . . , n and all θ ∈ Θ:

φ^i_{k+1}(θ) = ∑_{j=1}^n [A_k]_ij φ^j_k(θ) + log( ℓ^i(s^i_{k+1}|θ) / ℓ^i(s^i_{k+1}|θ^∗) )

Stacking the φ^i_{k+1}, i = 1, . . . , n, in a vector:

φ_{k+1}(θ) = A_k φ_k(θ) + L_k(θ)   for all θ ∈ Θ

After some manipulations akin to "consensus"-type analysis, we find that, almost surely, coordinate-wise,

lim_{k→∞} (1/k) φ_{k+1}(θ) ≤ −(δ/n) ‖H(θ)‖_1 1   for all θ ∉ Θ^∗

where

[H(θ)]_i = D_KL( f^i ‖ ℓ^i(·|θ) ) − D_KL( f^i ‖ ℓ^i(·|θ^∗) )   for i = 1, . . . , n

and δ is a bound on the "influence imbalance" of the chain {A_k}:

• δ = inf_{t≥0} min_{i∈[n]} [ 1_n' A_t · · · A_0 ]_i

• when the matrices A_k are doubly stochastic, δ = 1

(19)

Non-Asymptotic Learning Rate

Proposition 2 Let Assumptions 1-3 hold. Also, let ρ ∈ (0, 1) and consider the update rule

µ^i_{k+1} = argmin_{π∈Π(Θ)} { −E_π[ log ℓ^i(s^i_{k+1}|·) ] + ∑_{j=1}^n [A_k]_ij D_KL(π ‖ µ^j_k) }

Then the following property is true: for any θ ∉ Θ^∗, there is an integer N(ρ) such that, with probability 1 − ρ, for all k ≥ N(ρ) there holds

µ^i_k(θ) ≤ exp( −(k/2) γ_2 + γ_1 )   for all i = 1, . . . , n,

where

N(ρ) ≜ ⌈ 8 (log α)^2 log(1/ρ) / γ_2^2 ⌉ + 1,

α is a lower bound on the likelihoods ℓ^i (see Assumption 3),

γ_1 ≜ max_{θ∈Θ∖Θ^∗, θ^∗∈Θ^∗} { max_{1≤i≤n} log( µ^i_0(θ) / µ^i_0(θ^∗) ) + (C/(1−λ)) ‖H(θ)‖_1 },

γ_2 ≜ (δ/n) min_{θ∈Θ∖Θ^∗} ‖H(θ)‖_1,   [H(θ)]_i = D_KL( f^i ‖ ℓ^i(·|θ) ) − D_KL( f^i ‖ ℓ^i(·|θ^∗) ).

The constants C, δ and λ are related to the graphs and satisfy the following relations:

(20)

• For general B-connected graph sequences {G_k}: C = 2, λ ≤ (1 − η^{nB})^{1/B}, δ ≥ η^{nB}.

• If every matrix A_k is doubly stochastic: C = √2, λ = (1 − η/(4n^2))^{1/B}, δ = 1.

• If each G_k is an undirected graph and each A_k is the lazy Metropolis matrix, i.e., the stochastic matrix which satisfies

[A_k]_ij = 1 / (2 max(d(i), d(j)))   for all {i, j} ∈ E_k,

then C = √2, λ = 1 − 1/Θ(n^2), δ = 1.
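A small sketch of assembling the lazy Metropolis matrix from an undirected edge list (the 4-cycle below is an illustrative graph, not one from the slides):

    import numpy as np

    def lazy_metropolis(n, edges):
        # [A]_ij = 1 / (2 max(d_i, d_j)) on edges; the diagonal absorbs the slack.
        deg = np.zeros(n, dtype=int)
        for i, j in edges:
            deg[i] += 1
            deg[j] += 1
        A = np.zeros((n, n))
        for i, j in edges:
            w = 1.0 / (2 * max(deg[i], deg[j]))
            A[i, j] = A[j, i] = w
        A[np.diag_indices(n)] = 1.0 - A.sum(axis=1)   # make rows sum to one
        return A

    A = lazy_metropolis(4, [(0, 1), (1, 2), (2, 3), (3, 0)])   # 4-cycle
    print(A)   # symmetric and row-stochastic, hence doubly stochastic; diagonal >= 1/2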

(21)

Simulation result

n = 20, Θ = {θ_1, θ_2, θ_3}, S^i = {0, 1} for all agents. The graph is static and undirected; A is the lazy Metropolis matrix.

[Plot: agents' beliefs on θ_2 (vertical axis, 0 to 1) versus time (0 to 50)]

(22)

Allowing Conflicting Models

Relaxing the assumption that

∩_{i=1}^n argmin_{θ∈Θ} D_KL( f^i(·) ‖ ℓ^i(·|θ) )   is nonempty.

Note that

Θ^∗ = argmin_{θ∈Θ} ∑_{i=1}^n D_KL( f^i ‖ ℓ^i(· | θ) )

is nonempty, and the agents' beliefs will vanish for all θ ∉ Θ^∗ as long as the matrices A_k satisfy some additional properties. Three different settings:

1. Time-varying undirected graphs: A_k is doubly stochastic with [A_k]_ij > 0 if {i, j} ∈ E_k.

2. Time-varying directed graphs:

[A_k]_ij = 1/d^j_k if j ∈ N^i_k, and 0 otherwise,

where d^j_k is the out-degree of node j at time k and N^i_k is the set of in-neighbors of node i.

3. Acceleration in static graphs:

Ā_ij = 1/max{d_i, d_j} if {i, j} ∈ E, and 0 otherwise,

where d_i is the degree of node i, and A = (I + Ā)/2 is the corresponding lazy matrix.

(23)

Learning Rules

Time-varying undirected graphs:

µ^i_{k+1}(θ) = (1/Z^i_{k+1}) ∏_{j=1}^n µ^j_k(θ)^{[A_k]_ij} ℓ^i(s^i_{k+1}|θ),   (1)

where Z^i_{k+1} is a normalization factor, i.e.,

Z^i_{k+1} = ∑_{p=1}^m ∏_{j=1}^n µ^j_k(θ_p)^{[A_k]_ij} ℓ^i(s^i_{k+1}|θ_p).

Time-varying directed graphs:

µ^i_{k+1}(θ) = (1/Z^i_{k+1}) ( ∏_{j=1}^n µ^j_k(θ)^{[A_k]_ij y^j_k} ℓ^i(s^i_{k+1}|θ) )^{1/y^i_{k+1}}   (2)

Z^i_{k+1} = ∑_{p=1}^m ( ∏_{j=1}^n µ^j_k(θ_p)^{[A_k]_ij y^j_k} ℓ^i(s^i_{k+1}|θ_p) )^{1/y^i_{k+1}}

y^i_{k+1} = ∑_{j=1}^n [A_k]_ij y^j_k
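A log-space sketch of the directed-graph rule (2), where each agent also propagates the scalar y^i_k (initialized at 1) to correct the influence imbalance. The 3-node directed graph with self-loops, the likelihoods, and the true distribution f are illustrative assumptions; node j assigns weight 1/d^j to each of its out-links, which makes A column-stochastic:

    import numpy as np

    rng = np.random.default_rng(2)
    n, m = 3, 2
    # Node 0 sends to {0, 1, 2} (d_0 = 3); node 1 to {1, 2}; node 2 to {2, 0}.
    A = np.array([[1/3, 0.0, 1/2],
                  [1/3, 1/2, 0.0],
                  [1/3, 1/2, 1/2]])     # columns sum to one
    lik = np.array([[[0.5, 0.5], [0.5, 0.5]],
                    [[0.7, 0.3], [0.5, 0.5]],
                    [[0.5, 0.5], [0.3, 0.7]]])
    f = np.array([0.7, 0.3])
    mu = np.full((n, m), 1.0 / m)
    y = np.ones(n)                      # push-sum weights, y^i_0 = 1

    for _ in range(300):
        s = rng.choice(2, p=f, size=n)
        y_next = A @ y                  # y^i_{k+1} = sum_j [A]_ij y^j_k
        log_post = (A @ (y[:, None] * np.log(mu))        # sum_j A_ij y^j_k log mu^j_k
                    + np.log(lik[np.arange(n), :, s])) / y_next[:, None]
        mu = np.exp(log_post)
        mu /= mu.sum(axis=1, keepdims=True)              # normalization Z^i_{k+1}
        y = y_next
    print(mu)    # beliefs concentrate on theta_1 despite the asymmetric graph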

(24)

Acceleration in static graphs, based on a recent paper by Olshevsky 2014:

µ^i_{k+1}(θ) = (1/Z^i_{k+1}) ∏_{j=1}^n µ^j_k(θ)^{(1+σ)A_ij} ℓ^i(s^i_{k+1}|θ) / ∏_{j=1}^n ( µ^j_{k−1}(θ) ℓ^j(s^j_k|θ) )^{σA_ij}   (3)

Z^i_{k+1} = ∑_{p=1}^m ∏_{j=1}^n µ^j_k(θ_p)^{(1+σ)A_ij} ℓ^i(s^i_{k+1}|θ_p) / ∏_{j=1}^n ( µ^j_{k−1}(θ_p) ℓ^j(s^j_k|θ_p) )^{σA_ij}   (4)

Acceleration in Static Graphs

Theorem 1 Let the assumptions for Case 3 hold and let ρ ∈ (0, 1). Furthermore, let U ≥ n and let σ = 1 − 2/(9U + 1). Then the update rule of Eq. (3) with this σ, initial condition µ^i_{−1}(θ) = µ^i_0(θ), and β^i_{−1} fixed to zero, has the following property: there is an integer N(ρ) such that, with probability 1 − ρ, for all k ≥ N(ρ) and for all θ_v ∉ Θ^∗, there holds

µ^i_k(θ_v) ≤ exp( −(k/2) γ_2 + γ_1 )   for all i = 1, . . . , n,

(25)

where

N(ρ) ≜ ⌈ 72 (log α)^2 n log(1/ρ) / γ_2^2 ⌉,

γ_1 ≜ max_{θ_v∉Θ^∗} max_{θ_w∈Θ̂^∗} { (√2/(1−λ)) ∑_{i=1}^n | C_i(q_v) − C_i(q_w) | },

γ_2 = (1/n) min_{θ_v∉Θ^∗} ( C(q) − C(q_v) ),

with α from the assumption on likelihoods and λ = 1 − 1/(18U).
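A log-space sketch of the accelerated rule (3) under the stated choice σ = 1 − 2/(9U + 1). The 3-node path, likelihoods, and true distribution f are illustrative; neighbors are assumed to share both their previous beliefs and the likelihood values of their own last observations, and the k = 0 correction term is zeroed out, in the spirit of the β^i_{−1} = 0 initialization:

    import numpy as np

    rng = np.random.default_rng(3)
    n, m = 3, 2
    A = np.array([[0.75, 0.25, 0.00],     # lazy Metropolis weights on a 3-path
                  [0.25, 0.50, 0.25],
                  [0.00, 0.25, 0.75]])
    lik = np.array([[[0.5, 0.5], [0.5, 0.5]],
                    [[0.7, 0.3], [0.5, 0.5]],
                    [[0.5, 0.5], [0.3, 0.7]]])
    f = np.array([0.7, 0.3])
    U = n                                  # any upper bound U >= n works
    sigma = 1.0 - 2.0 / (9 * U + 1)

    log_mu = np.log(np.full((n, m), 1.0 / m))
    log_prev = log_mu.copy()               # mu^i_{-1} = mu^i_0
    log_lik_prev = np.zeros((n, m))        # no correction term at k = 0

    for _ in range(300):
        s = rng.choice(2, p=f, size=n)
        log_lik = np.log(lik[np.arange(n), :, s])
        new = ((1 + sigma) * (A @ log_mu) + log_lik
               - sigma * (A @ (log_prev + log_lik_prev)))
        new -= new.max(axis=1, keepdims=True)                    # numerical safety
        new -= np.log(np.exp(new).sum(axis=1, keepdims=True))    # normalize
        log_prev, log_lik_prev, log_mu = log_mu, log_lik, new
    print(np.exp(log_mu))    # beliefs concentrate on theta_1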

(26)

[Plots: mean number of iterations (10^1 to 10^5, log scale) versus number of nodes for (a) Path Graph, (b) Circle Graph, (c) Grid Graph]

Figure 1: Empirical mean over 50 Monte Carlo runs of the number of iterations required for µ^i_k(θ) < ε for all agents and all θ ∉ Θ^∗. All agents but one have all their hypotheses observationally equivalent. Dotted line: the algorithm proposed in [Jadbabaie 2012]; dashed line: the procedure described in Eq. (1); solid line: the procedure described in Eq. (3).

(27)

Distributed Source Localization

(28)

[Figure panels: (a) Network of Agents: 3 agents and the source placed on a grid of hypothesis locations θ_1, . . . , θ_9 in the (x, y)-plane; (b) Hypothesis Distributions: ℓ^2(·|θ_5), ℓ^2(·|θ_3) and f^2(·) plotted against distance]

Figure 2: Figure (a) shows a group of 3 agents in a grid of 3 × 3 hypotheses. Each hypothesis corresponds to a possible location of the source. For example, hypothesis θ_2 locates the source at the point (−10, 0) in the plane. Figure (b) shows the likelihood functions for θ_3 and θ_5 and the distribution of observations f^2 for agent 2.

(29)

Figure 3: Belief distribution of one agent over the hypothesis grid. Darker shades of gray indicate higher beliefs on the corresponding hypothesis.

(30)

[Figure panels: (a) Network of Agents on the hypothesis grid (x- and y-positions from −10 to 10), with markers for normal, no-sensor, and conflicting agents and for the source; (b) beliefs µ(θ^∗) on the optimal hypothesis versus time (10^0 to 10^3) for Eq. (1), Eq. (2), [34] and [13]]

Figure 4: Figure (a) shows a network of heterogeneous agents. △ indicates agents whose observations have been modified such that the optimal hypothesis is the point (0, 0) in the grid. □ indicates agents for whom all hypotheses are observationally equivalent (i.e., no data is measured). ◦ indicates regular agents with correct observation models and informative hypotheses. Figure (b) shows the beliefs on the optimal hypothesis over time.

(31)

Closely Related

• Shahrampour, Rakhlin, and Jadbabaie 2014

• Lalitha, Javidi, and Sarwate 2014, 2015

• N. O. Uribe 2014, 2015
