Nonmyopic ϵ-Bayes-Optimal Active Learning of Gaussian Processes


Trong Nghia Hoang† [email protected]

Kian Hsiang Low† [email protected]

Patrick Jaillet§ [email protected]

Mohan Kankanhalli† [email protected]

† Department of Computer Science, National University of Singapore, Republic of Singapore

§ Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, USA

Abstract

A fundamental issue in active learning of Gaussian processes is that of the exploration-exploitation trade-off. This paper presents a novel nonmyopic ϵ-Bayes-optimal active learning (ϵ-BAL) approach that jointly and naturally optimizes the trade-off. In contrast, existing works have primarily developed myopic/greedy algorithms or performed exploration and exploitation separately. To perform active learning in real time, we then propose an anytime algorithm based on ϵ-BAL with performance guarantee and empirically demonstrate using synthetic and real-world datasets that, with limited budget, it outperforms the state-of-the-art algorithms.

1. Introduction

Active learning has become an increasingly important focal theme in many environmental sensing and monitoring applications (e.g., precision agriculture, mineral prospecting (Low et al., 2007), monitoring of ocean and freshwater phenomena like harmful algal blooms (Dolan et al., 2009; Podnar et al., 2010), forest ecosystems, or pollution) where a high-resolution in situ sampling of the spatial phenomenon of interest is impractical due to prohibitively costly sampling budget requirements (e.g., number of deployed sensors, energy consumption, mission time). For such applications, it is thus desirable to select and gather the most informative observations/data for modeling and predicting the spatially varying phenomenon subject to some budget constraints, which is the goal of active learning and is also known as the active sensing problem.

To elaborate, solving the active sensing problem amounts to deriving an optimal sequential policy that plans/decides the most informative locations to be observed for minimizing the predictive uncertainty of the unobserved areas of a phenomenon given a sampling budget. To achieve this, many existing active sensing algorithms (Cao et al., 2013; Chen et al., 2012; 2013b; Krause et al., 2008; Low et al., 2008; 2009; 2011; 2012; Singh et al., 2009) have modeled the phenomenon as a Gaussian process (GP), which allows its spatial correlation structure to be formally characterized and its predictive uncertainty to be formally quantified (e.g., based on mean-squared error, entropy, or mutual information criterion). However, they have assumed the spatial correlation structure (specifically, the parameters defining it) to be known, which is often violated in real-world applications, or estimated crudely using sparse prior data. So, though they aim to select sampling locations that are optimal with respect to the assumed or estimated parameters, these locations tend to be sub-optimal with respect to the true parameters, thus degrading the predictive performance of the learned GP model.

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

In practice, the spatial correlation structure of a phenomenon is usually not known. Then, the predictive performance of the GP modeling the phenomenon depends on how informative the gathered observations/data are for both parameter estimation as well as spatial prediction given the true parameters. Interestingly, as revealed in previous geostatistical studies (Martin, 2001; Müller, 2007), policies that are efficient for parameter estimation are not necessarily efficient for spatial prediction with respect to the true model. Thus, the active sensing problem involves a potential trade-off between sampling the most informative locations for spatial prediction given the current, possibly incomplete knowledge of the model parameters (i.e., exploitation) vs. observing locations that gain more information about the parameters (i.e., exploration).

How then does an active sensing algorithm trade off between these two possibly conflicting sampling objectives? To tackle this question, one principled approach is to frame active sensing as a sequential decision problem that jointly and naturally optimizes the above exploration-exploitation trade-off while maintaining a Bayesian belief over the model parameters. This intuitively means a policy that biases towards observing informative locations for spatial prediction given the current model prior may be penalized if it entails a highly dispersed posterior over the model parameters. So, the resulting induced policy is guaranteed to be optimal in the expected active sensing performance. Unfortunately, such a nonmyopic Bayes-optimal policy cannot be derived exactly due to an uncountable set of candidate observations and unknown model parameters (Solomon & Zacks, 1970). As a result, most existing works (Diggle, 2006; Houlsby et al., 2012; Park & Pillow, 2012; Zimmerman, 2006; Ouyang et al., 2014) have circumvented the trade-off by resorting to the use of myopic/greedy (hence, sub-optimal) policies.

To the best of our knowledge, the only notable nonmyopic active sensing algorithm for GPs (Krause & Guestrin, 2007) advocates tackling exploration and exploitation separately, instead of jointly and naturally optimizing their trade-off, to sidestep the difficulty of solving the Bayesian sequential decision problem. Specifically, it performs a probably approximately correct (PAC)-style exploration until it can verify that the performance loss of greedy exploitation lies within a user-specified threshold. But, such an algorithm is sub-optimal in the presence of budget constraints due to the following limitations: (a) It is unclear how an optimal threshold for exploration can be determined given a sampling budget, and (b) even if such a threshold is available, the PAC-style exploration is typically designed to satisfy a worst-case sample complexity rather than to be optimal in the expected active sensing performance, thus resulting in an overly-aggressive exploration (Section 4.1).

This paper presents an efficient decision-theoretic planning approach to nonmyopic active sensing/learning that can still preserve and exploit the principled Bayesian sequential decision problem framework for jointly and naturally optimizing the exploration-exploitation trade-off (Section 3.1) and consequently does not incur the limitations of the algorithm of Krause & Guestrin (2007). In particular, although the exact Bayes-optimal policy to the active sensing problem cannot be derived (Solomon & Zacks, 1970), we show that it is in fact possible to solve for a nonmyopic ϵ-Bayes-optimal active learning (ϵ-BAL) policy (Sections 3.2 and 3.3) given a user-defined bound ϵ, which is the main contribution of our work here. In other words, our proposed ϵ-BAL policy can approximate the optimal expected active sensing performance arbitrarily closely (i.e., within an arbitrary loss bound ϵ). In contrast, the algorithm of Krause & Guestrin (2007) can only yield a sub-optimal performance bound¹. To meet the real-time requirement in time-critical applications, we then propose an asymptotically ϵ-optimal, branch-and-bound anytime algorithm based on ϵ-BAL with performance guarantee (Section 3.4). We empirically demonstrate using both synthetic and real-world datasets that, with limited budget, our proposed approach outperforms state-of-the-art algorithms (Section 4).

¹ Its induced policy is guaranteed not to achieve worse than the optimal performance by more than a factor of 1/e.

2. Modeling Spatial Phenomena with Gaussian Processes (GPs)

The GP can be used to model a spatial phenomenon of interest as follows: The phenomenon is defined to vary as a realization of a GP. Let $\mathcal{X}$ denote a set of sampling locations representing the domain of the phenomenon such that each location $x \in \mathcal{X}$ is associated with a realized (random) measurement $z_x$ ($Z_x$) if $x$ is observed/sampled (unobserved). Let $Z_{\mathcal{X}} \triangleq \{Z_x\}_{x \in \mathcal{X}}$ denote a GP, that is, every finite subset of $Z_{\mathcal{X}}$ has a multivariate Gaussian distribution (Chen et al., 2013a; Rasmussen & Williams, 2006). The GP is fully specified by its prior mean $\mu_x \triangleq \mathbb{E}[Z_x]$ and covariance $\sigma_{xx'|\lambda} \triangleq \mathrm{cov}[Z_x, Z_{x'} \mid \lambda]$ for all $x, x' \in \mathcal{X}$, the latter of which characterizes the spatial correlation structure of the phenomenon and can be defined using a covariance function parameterized by $\lambda$. A common choice is the squared exponential covariance function:

$$\sigma_{xx'|\lambda} \triangleq (\sigma_s^\lambda)^2 \exp\left(-\frac{1}{2}\sum_{i=1}^{P}\left(\frac{[s_x]_i - [s_{x'}]_i}{\ell_i^\lambda}\right)^2\right) + (\sigma_n^\lambda)^2\,\delta_{xx'}$$

where $[s_x]_i$ ($[s_{x'}]_i$) is the $i$th component of the $P$-dimensional feature vector $s_x$ ($s_{x'}$), the set of realized parameters $\lambda \triangleq \{\sigma_n^\lambda, \sigma_s^\lambda, \ell_1^\lambda, \ldots, \ell_P^\lambda\} \in \Lambda$ are, respectively, the square root of noise variance, square root of signal variance, and length-scales, and $\delta_{xx'}$ is a Kronecker delta that is 1 if $x = x'$ and 0 otherwise.
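The squared exponential covariance above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `se_covariance` and the `same_points` flag are assumptions introduced here (they are not from the paper), and the flag stands in for the Kronecker delta under the assumption that the locations in a set are distinct.

```python
import numpy as np

def se_covariance(S1, S2, sigma_s, sigma_n, lengthscales, same_points=False):
    """Squared exponential covariance between the rows of S1 (m x P) and S2 (k x P).

    Each row is a P-dimensional feature vector s_x. Set same_points=True when
    S1 and S2 are the same set of distinct locations, so that the noise term
    (sigma_n^2 times the Kronecker delta) is added on the diagonal.
    """
    S1 = np.atleast_2d(np.asarray(S1, dtype=float))
    S2 = np.atleast_2d(np.asarray(S2, dtype=float))
    ell = np.asarray(lengthscales, dtype=float)
    # Pairwise differences scaled per dimension by the length-scales: m x k x P.
    diff = (S1[:, None, :] - S2[None, :, :]) / ell
    K = sigma_s**2 * np.exp(-0.5 * np.sum(diff**2, axis=-1))
    if same_points:
        K = K + sigma_n**2 * np.eye(S1.shape[0])
    return K
```

For one-dimensional locations, the feature vectors should be passed as a column (shape m x 1), for example `se_covariance(S, S, 10.0, 0.25, [1.0], same_points=True)` with `S = np.array([[0.0], [1.0], [3.0]])`.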

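Supposing $\lambda$ is known and a set $z_D$ of realized measurements is available for some set $D \subset \mathcal{X}$ of observed locations, the GP can exploit these observations to predict the measurement for any unobserved location $x \in \mathcal{X} \setminus D$ as well as provide its corresponding predictive uncertainty using the Gaussian predictive distribution $p(z_x \mid z_D, \lambda) \sim \mathcal{N}(\mu_{x|D,\lambda}, \sigma_{xx|D,\lambda})$ with the following posterior mean and variance, respectively:

$$\mu_{x|D,\lambda} \triangleq \mu_x + \Sigma_{xD|\lambda}\Sigma_{DD|\lambda}^{-1}(z_D - \mu_D) \tag{1}$$

$$\sigma_{xx|D,\lambda} \triangleq \sigma_{xx|\lambda} - \Sigma_{xD|\lambda}\Sigma_{DD|\lambda}^{-1}\Sigma_{Dx|\lambda} \tag{2}$$

where, with a slight abuse of notation, $z_D$ is to be perceived as a column vector in (1), $\mu_D$ is a column vector with mean components $\mu_{x'}$ for all $x' \in D$, $\Sigma_{xD|\lambda}$ is a row vector with covariance components $\sigma_{xx'|\lambda}$ for all $x' \in D$, $\Sigma_{Dx|\lambda}$ is the transpose of $\Sigma_{xD|\lambda}$, and $\Sigma_{DD|\lambda}$ is a covariance matrix with components $\sigma_{x'x''|\lambda}$ for all $x', x'' \in D$. When the spatial correlation structure (i.e., $\lambda$) is not known, a probabilistic belief $b_D(\lambda) \triangleq p(\lambda \mid z_D)$ can be maintained/tracked over all possible $\lambda$ and updated using Bayes' rule to the posterior belief $b_{D\cup\{x\}}(\lambda)$ given a newly available measurement $z_x$:

$$b_{D\cup\{x\}}(\lambda) \propto p(z_x \mid z_D, \lambda)\, b_D(\lambda) . \tag{3}$$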

Using belief $b_D$, the predictive distribution $p(z_x \mid z_D)$ can be obtained by marginalizing out $\lambda$:

$$p(z_x \mid z_D) = \sum_{\lambda \in \Lambda} p(z_x \mid z_D, \lambda)\, b_D(\lambda) . \tag{4}$$
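To make the belief update (3) and the marginal predictive distribution (4) concrete, the following is a minimal sketch under simplifying assumptions that are not part of the paper's general setting: one-dimensional locations, a zero prior mean, known signal and noise parameters, and a finite candidate set of length-scales only. All function names are hypothetical.

```python
import numpy as np

def se_cov(a, b, ell, sigma_s=1.0):
    """Noise-free squared exponential covariance between 1-D location arrays."""
    d = (np.asarray(a, float)[:, None] - np.asarray(b, float)[None, :]) / ell
    return sigma_s**2 * np.exp(-0.5 * d**2)

def gp_posterior(x, D, z_D, ell, sigma_s=1.0, sigma_n=0.1):
    """Posterior mean (1) and variance (2) at unobserved x, assuming zero prior mean."""
    prior_var = sigma_s**2 + sigma_n**2
    if len(D) == 0:
        return 0.0, prior_var
    K_DD = se_cov(D, D, ell, sigma_s) + sigma_n**2 * np.eye(len(D))
    k_xD = se_cov([x], D, ell, sigma_s)[0]
    w = np.linalg.solve(K_DD, k_xD)  # Sigma_DD^{-1} Sigma_Dx
    return float(w @ np.asarray(z_D, float)), float(prior_var - w @ k_xD)

def gauss_pdf(z, mean, var):
    return np.exp(-0.5 * (z - mean)**2 / var) / np.sqrt(2.0 * np.pi * var)

def predictive_density(z, x, D, z_D, candidates, belief, **hyp):
    """Mixture predictive p(z_x | z_D) of (4) over a finite candidate set."""
    comps = [gauss_pdf(z, *gp_posterior(x, D, z_D, ell, **hyp)) for ell in candidates]
    return float(np.dot(belief, comps))

def update_belief(z, x, D, z_D, candidates, belief, **hyp):
    """Bayes' rule (3): weight each candidate by its likelihood of z at x, then renormalize."""
    lik = np.array([gauss_pdf(z, *gp_posterior(x, D, z_D, ell, **hyp)) for ell in candidates])
    post = lik * np.asarray(belief, float)
    return post / post.sum()
```

In this sketch, a new measurement that stays close to a nearby earlier one shifts the belief towards longer length-scales, since those candidates assign it a higher likelihood.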

3. Nonmyopic ϵ-Bayes-Optimal Active Learning (ϵ-BAL)

3.1. Problem Formulation

To cast active sensing as a Bayesian sequential decision problem, let us first define a sequential active sensing/learning policy $\pi$ given a budget of $N$ sampling locations: Specifically, the policy $\pi \triangleq \{\pi_n\}_{n=1}^{N}$ is structured to sequentially decide the next location $\pi_n(z_D) \in \mathcal{X} \setminus D$ to be observed at each stage $n$ based on the current observations $z_D$ over a finite planning horizon of $N$ stages. Recall from Section 1 that the active sensing problem involves planning/deciding the most informative locations to be observed for minimizing the predictive uncertainty of the unobserved areas of a phenomenon. To achieve this, we use the entropy criterion (Cover & Thomas, 1991) to measure the informativeness and predictive uncertainty. Then, the value under a policy $\pi$ is defined to be the joint entropy of its selected observations when starting with some prior observations $z_{D_0}$ and following $\pi$ thereafter:

$$V_1^{\pi}(z_{D_0}) \triangleq \mathbb{H}[Z_\pi \mid z_{D_0}] \triangleq -\int p(z_\pi \mid z_{D_0}) \log p(z_\pi \mid z_{D_0})\, \mathrm{d}z_\pi \tag{5}$$

where $Z_\pi$ ($z_\pi$) is the set of random (realized) measurements taken by policy $\pi$ and $p(z_\pi \mid z_{D_0})$ is defined in a similar manner to (4).

To solve the active sensing problem, the notion of Bayes-optimality² is exploited for selecting observations of largest possible joint entropy with respect to all possible induced sequences of future beliefs (starting from initial prior belief $b_{D_0}$) over candidate sets of model parameters $\lambda$, as detailed next. Formally, this entails choosing a sequential policy $\pi$ to maximize $V_1^{\pi}(z_{D_0})$ (5), which we call the Bayes-optimal active learning (BAL) policy $\pi^*$. That is, $V_1^*(z_{D_0}) \triangleq V_1^{\pi^*}(z_{D_0}) = \max_\pi V_1^{\pi}(z_{D_0})$. When $\pi^*$ is plugged into (5), the following $N$-stage Bellman equations result from the chain rule for entropy:

$$\begin{aligned}
V_n^*(z_D) &= \mathbb{H}[Z_{\pi_n^*(z_D)} \mid z_D] + \mathbb{E}[V_{n+1}^*(z_D \cup \{Z_{\pi_n^*(z_D)}\}) \mid z_D] = \max_{x \in \mathcal{X} \setminus D} Q_n^*(z_D, x) \\
Q_n^*(z_D, x) &\triangleq \mathbb{H}[Z_x \mid z_D] + \mathbb{E}[V_{n+1}^*(z_D \cup \{Z_x\}) \mid z_D] \\
\mathbb{H}[Z_x \mid z_D] &\triangleq -\int p(z_x \mid z_D) \log p(z_x \mid z_D)\, \mathrm{d}z_x
\end{aligned} \tag{6}$$

for stage $n = 1, \ldots, N$ where $p(z_x \mid z_D)$ is defined in (4) and the expectation terms are omitted from the right-hand side (RHS) expressions of $V_N^*$ and $Q_N^*$ at stage $N$. At each stage, the belief $b_D(\lambda)$ is needed to compute $Q_n^*(z_D, x)$ in (6) and can be uniquely determined from initial prior belief $b_{D_0}$ and observations $z_{D \setminus D_0}$ using (3). To understand how the BAL policy $\pi^*$ jointly and naturally optimizes the exploration-exploitation trade-off, note that its selected location $\pi_n^*(z_D) = \arg\max_{x \in \mathcal{X} \setminus D} Q_n^*(z_D, x)$ at each stage $n$ affects both the immediate payoff $\mathbb{H}[Z_{\pi_n^*(z_D)} \mid z_D]$ given current belief $b_D$ (i.e., exploitation) as well as the posterior belief $b_{D \cup \{\pi_n^*(z_D)\}}$, the latter of which influences expected future payoff $\mathbb{E}[V_{n+1}^*(z_D \cup \{Z_{\pi_n^*(z_D)}\}) \mid z_D]$ and builds in the information gathering option (i.e., exploration).

Interestingly, the work of Low et al. (2009) has revealed that the above recursive formulation (6) can be perceived as the sequential variant of the well-known maximum entropy sampling problem (Shewry & Wynn, 1987) and established an equivalence result that the maximum-entropy observations selected by $\pi^*$ achieve a dual objective of minimizing the posterior joint entropy (i.e., predictive uncertainty) remaining in the unobserved locations of the phenomenon.

Unfortunately, the BAL policy $\pi^*$ cannot be derived exactly because the stage-wise entropy and expectation terms in (6) cannot be evaluated in closed form due to an uncountable set of candidate observations and unknown model parameters $\lambda$ (Section 1). To overcome this difficulty, we show in the next subsection how it is possible to solve for an ϵ-BAL policy $\pi^\epsilon$, that is, the joint entropy of its selected observations closely approximates that of $\pi^*$ within an arbitrary loss bound $\epsilon > 0$.

² Bayes-optimality is previously studied in reinforcement learning whose developed theories (Poupart et al., 2006; Hoang & Low, 2013) cannot be applied here because their assumptions of discrete-valued observations and Markov property do not hold.

3.2. ϵ-BAL Policy

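The key idea underlying the design and construction of our proposed nonmyopic ϵ-BAL policy $\pi^\epsilon$ is to approximate the entropy and expectation terms in (6) at every stage using a form of truncated sampling to be described next:

Definition 1 ($\tau$-Truncated Observation) Define random measurement $\hat{Z}_x$ by truncating $Z_x$ at $-\hat{\tau}$ and $\hat{\tau}$ as follows:

$$\hat{Z}_x \triangleq \begin{cases} -\hat{\tau} & \text{if } Z_x \le -\hat{\tau}, \\ Z_x & \text{if } -\hat{\tau} < Z_x < \hat{\tau}, \\ \hat{\tau} & \text{if } Z_x \ge \hat{\tau}. \end{cases}$$

Then, $\hat{Z}_x$ has a distribution of mixed type with its continuous component defined as $f(\hat{Z}_x = z_x \mid z_D) \triangleq p(Z_x = z_x \mid z_D)$ for $-\hat{\tau} < z_x < \hat{\tau}$ and its discrete component defined as $f(\hat{Z}_x = \hat{\tau} \mid z_D) \triangleq P(Z_x \ge \hat{\tau} \mid z_D) = \int_{\hat{\tau}}^{\infty} p(Z_x = z_x \mid z_D)\, \mathrm{d}z_x$ and $f(\hat{Z}_x = -\hat{\tau} \mid z_D) \triangleq P(Z_x \le -\hat{\tau} \mid z_D) = \int_{-\infty}^{-\hat{\tau}} p(Z_x = z_x \mid z_D)\, \mathrm{d}z_x$. Let $\overline{\mu}(D, \Lambda) \triangleq \max_{x \in \mathcal{X} \setminus D, \lambda \in \Lambda} \mu_{x|D,\lambda}$, $\underline{\mu}(D, \Lambda) \triangleq \min_{x \in \mathcal{X} \setminus D, \lambda \in \Lambda} \mu_{x|D,\lambda}$, and

$$\hat{\tau} \triangleq \max\left(\left|\underline{\mu}(D, \Lambda) - \tau\right|, \left|\overline{\mu}(D, \Lambda) + \tau\right|\right) \tag{7}$$

for some $0 \le \tau \le \hat{\tau}$. Then, a realized measurement of $\hat{Z}_x$ is said to be a $\tau$-truncated observation for location $x$.

Specifically, given that a set $z_D$ of realized measurements is available, a finite set of $S$ $\tau$-truncated observations $\{z_x^i\}_{i=1}^{S}$ can be generated for every candidate location $x \in \mathcal{X} \setminus D$ at each stage $n$ by identically and independently sampling from $p(z_x \mid z_D)$ (4) and then truncating each of them according to $z_x^i \leftarrow z_x^i \min(|z_x^i|, \hat{\tau}) / |z_x^i|$. These generated $\tau$-truncated observations can be exploited for approximating $V_n^*$ (6) through the following Bellman equations:

$$\begin{aligned}
V_n^\epsilon(z_D) &\triangleq \max_{x \in \mathcal{X} \setminus D} Q_n^\epsilon(z_D, x) \\
Q_n^\epsilon(z_D, x) &\triangleq \frac{1}{S}\sum_{i=1}^{S}\left(-\log p(z_x^i \mid z_D) + V_{n+1}^\epsilon(z_D \cup \{z_x^i\})\right)
\end{aligned} \tag{8}$$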

for stage $n = 1, \ldots, N$ such that there is no $V_{N+1}^\epsilon$ term on the RHS expression of $Q_N^\epsilon$ at stage $N$. Like the BAL policy $\pi^*$ (Section 3.1), the location $\pi_n^\epsilon(z_D) = \arg\max_{x \in \mathcal{X} \setminus D} Q_n^\epsilon(z_D, x)$ selected by our ϵ-BAL policy $\pi^\epsilon$ at each stage $n$ also jointly and naturally optimizes the trade-off between exploitation (i.e., by maximizing immediate payoff $S^{-1}\sum_{i=1}^{S} -\log p(z^i_{\pi_n^\epsilon(z_D)} \mid z_D)$ given the current belief $b_D$) vs. exploration (i.e., by improving posterior belief $b_{D \cup \{\pi_n^\epsilon(z_D)\}}$ to maximize average future payoff $S^{-1}\sum_{i=1}^{S} V_{n+1}^\epsilon(z_D \cup \{z^i_{\pi_n^\epsilon(z_D)}\})$). Unlike the deterministic BAL policy $\pi^*$, our ϵ-BAL policy $\pi^\epsilon$ is stochastic due to its use of the above truncated sampling procedure.

3.3. Theoretical Analysis

The main difficulty in analyzing the active sensing performance of our stochastic ϵ-BAL policy $\pi^\epsilon$ (i.e., relative to that of BAL policy $\pi^*$) lies in determining how its ϵ-Bayes optimality can be guaranteed by choosing appropriate values of the truncated sampling parameters $S$ and $\tau$ (Section 3.2). To achieve this, we have to formally understand how $S$ and $\tau$ can be specified and varied in terms of the user-defined loss bound $\epsilon$, budget of $N$ sampling locations, domain size $|\mathcal{X}|$ of the phenomenon, and properties/parameters characterizing the spatial correlation structure of the phenomenon (Section 2), as detailed below.

The first step is to show that $Q_n^\epsilon$ (8) is in fact a good approximation of $Q_n^*$ (6) for some chosen values of $S$ and $\tau$. There are two sources of error arising in such an approximation: (a) In the truncated sampling procedure (Section 3.2), only a finite set of $\tau$-truncated observations is generated for approximating the stage-wise entropy and expectation terms in (6), and (b) computing $Q_n^\epsilon$ does not involve utilizing the values of $V_{n+1}^*$ but that of its approximation $V_{n+1}^\epsilon$ instead.
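The truncated sampling procedure of Section 3.2 and the recursion (8) whose approximation error is analyzed here can be sketched as follows. This is a toy illustration rather than the authors' implementation: it assumes one-dimensional locations, a zero prior mean, known $\sigma_s$ and $\sigma_n$, and a uniform prior over a small hypothetical set `LAMBDA` of candidate length-scales. The names `v_eps`, `belief`, and `posterior` are introduced only for this sketch, and the cost grows as $(S|\mathcal{X}|)^N$, so it is meant for tiny inputs.

```python
import numpy as np

# Hypothetical toy setup (not from the paper): 1-D locations, zero prior mean,
# known sigma_s and sigma_n, unknown length-scale with a uniform prior over LAMBDA.
SIGMA_S, SIGMA_N = 1.0, 0.1
LAMBDA = [0.5, 2.0]

def cov(a, b, ell):
    d = (np.asarray(a, float)[:, None] - np.asarray(b, float)[None, :]) / ell
    return SIGMA_S**2 * np.exp(-0.5 * d**2)

def posterior(x, D, z, ell):
    """Posterior mean and variance of Z_x given z_D and one length-scale."""
    var0 = SIGMA_S**2 + SIGMA_N**2
    if len(D) == 0:
        return 0.0, var0
    K = cov(D, D, ell) + SIGMA_N**2 * np.eye(len(D))
    k = cov([x], D, ell)[0]
    w = np.linalg.solve(K, k)
    return float(w @ np.asarray(z, float)), float(var0 - w @ k)

def log_norm(v, m, s2):
    return -0.5 * (np.log(2.0 * np.pi * s2) + (v - m)**2 / s2)

def belief(D, z):
    """b_D(lambda): uniform prior updated one observation at a time, as in (3)."""
    logb = np.zeros(len(LAMBDA))
    for j, ell in enumerate(LAMBDA):
        for t in range(len(D)):
            m, s2 = posterior(D[t], D[:t], z[:t], ell)
            logb[j] += log_norm(z[t], m, s2)
    b = np.exp(logb - logb.max())
    return b / b.sum()

def truncate(v, tau_hat):
    """z <- z * min(|z|, tau_hat) / |z|, which clips z to [-tau_hat, tau_hat]."""
    return float(np.clip(v, -tau_hat, tau_hat))

def v_eps(n, N, D, z, X, S, tau, rng):
    """V_n^eps(z_D) = max_x Q_n^eps(z_D, x), using S truncated samples per location.

    D and z are Python lists of observed locations and their measurements.
    """
    cand = [x for x in X if x not in D]
    b = belief(D, z)
    post = {x: [posterior(x, D, z, ell) for ell in LAMBDA] for x in cand}
    means = [m for x in cand for m, _ in post[x]]
    tau_hat = max(abs(min(means) - tau), abs(max(means) + tau))  # (7)
    best = -np.inf
    for x in cand:
        total = 0.0
        for _ in range(S):
            j = rng.choice(len(LAMBDA), p=b)  # draw lambda from b_D, then z from its Gaussian
            m, s2 = post[x][j]
            zi = truncate(rng.normal(m, np.sqrt(s2)), tau_hat)
            dens = sum(bj * np.exp(log_norm(zi, mj, sj)) for bj, (mj, sj) in zip(b, post[x]))
            total += -np.log(dens)  # immediate payoff term of (8)
            if n < N:  # no V_{N+1} term at the last stage
                total += v_eps(n + 1, N, D + [x], z + [zi], X, S, tau, rng)
        best = max(best, total / S)
    return best

rng = np.random.default_rng(0)
print(v_eps(1, 2, [], [], [0.0, 1.0, 2.0], S=5, tau=5.0, rng=rng))
```

With no prior observations and a one-stage horizon, every candidate location in this toy setup has the same Gaussian predictive distribution, so the estimate should be close to the entropy of that Gaussian for large $S$.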

To facilitate capturing the error due to finite truncated sampling described in (a), the following intermediate function is introduced:

$$W_n^*(z_D, x) \triangleq \frac{1}{S}\sum_{i=1}^{S}\left(-\log p(z_x^i \mid z_D) + V_{n+1}^*(z_D \cup \{z_x^i\})\right) \tag{9}$$

for stage $n = 1, \ldots, N$ such that there is no $V_{N+1}^*$ term on the RHS expression of $W_N^*$ at stage $N$. The first lemma below reveals that if the error $|Q_n^*(z_D, x) - W_n^*(z_D, x)|$ due to finite truncated sampling can be bounded for all tuples $(n, z_D, x)$ generated at stage $n = n', \ldots, N$ by (8) to compute $V_{n'}^\epsilon$ for $1 \le n' \le N$, then $Q_{n'}^\epsilon$ (8) can approximate $Q_{n'}^*$ (6) arbitrarily closely:

Lemma 1 Suppose that a set $z_{D'}$ of observations, a budget of $N - n' + 1$ sampling locations for $1 \le n' \le N$, $S \in \mathbb{Z}^+$, and $\gamma > 0$ are given. If

$$|Q_n^*(z_D, x) - W_n^*(z_D, x)| \le \gamma \tag{10}$$

for all tuples $(n, z_D, x)$ generated at stage $n = n', \ldots, N$ by (8) to compute $V_{n'}^\epsilon(z_{D'})$, then, for all $x' \in \mathcal{X} \setminus D'$,

$$|Q_{n'}^*(z_{D'}, x') - Q_{n'}^\epsilon(z_{D'}, x')| \le (N - n' + 1)\gamma . \tag{11}$$

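Its proof is given in Appendix A.1. The next two lemmas show that, with high probability, the error $|Q_n^*(z_D, x) - W_n^*(z_D, x)|$ due to finite truncated sampling can indeed be bounded from above by $\gamma$ (10) for all tuples $(n, z_D, x)$ generated at stage $n = n', \ldots, N$ by (8) to compute $V_{n'}^\epsilon$ for $1 \le n' \le N$:

Lemma 2 Suppose that a set $z_{D'}$ of observations, a budget of $N - n' + 1$ sampling locations for $1 \le n' \le N$, $S \in \mathbb{Z}^+$, and $\gamma > 0$ are given. For all tuples $(n, z_D, x)$ generated at stage $n = n', \ldots, N$ by (8) to compute $V_{n'}^\epsilon(z_{D'})$,

$$P\left(|Q_n^*(z_D, x) - W_n^*(z_D, x)| \le \gamma\right) \ge 1 - 2\exp\left(-\frac{2S\gamma^2}{T^2}\right)$$

where $T \triangleq O\left(\dfrac{N^2\kappa^{2N}\tau^2}{\sigma_n^2} + N\log\dfrac{\sigma_o}{\sigma_n} + \log|\Lambda|\right)$ by setting

$$\tau = O\left(\sigma_o\sqrt{\log\left(\frac{\sigma_o^2}{\gamma}\left(\frac{N^2\kappa^{2N} + \sigma_o^2}{\sigma_n^2} + N\log\frac{\sigma_o}{\sigma_n} + \log|\Lambda|\right)\right)}\right)$$

with $\kappa$, $\sigma_n^2$, and $\sigma_o^2$ defined as follows:

$$\kappa \triangleq 1 + 2\max_{x', u \in \mathcal{X} \setminus D : x' \ne u,\ \lambda \in \Lambda,\ D} \left|\sigma_{x'u|D,\lambda}\right| / \sigma_{uu|D,\lambda} , \tag{12}$$

$$\sigma_n^2 \triangleq \min_{\lambda \in \Lambda} (\sigma_n^\lambda)^2 , \ \text{and}\ \ \sigma_o^2 \triangleq \max_{\lambda \in \Lambda} (\sigma_s^\lambda)^2 + (\sigma_n^\lambda)^2 . \tag{13}$$

Refer to Appendix A.2 for its proof.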

Remark 1. Deriving such a probabilistic bound in Lemma 2 typically involves the use of concentration inequalities for the sum of independent bounded random variables like the Hoeffding's, Bennett's, or Bernstein's inequalities. However, since the originally Gaussian distributed observations are unbounded, sampling from $p(z_x \mid z_D)$ (4) without truncation will generate unbounded versions of $\{z_x^i\}_{i=1}^{S}$ and consequently make each summation term $-\log p(z_x^i \mid z_D) + V_{n+1}^*(z_D \cup \{z_x^i\})$ on the RHS expression of $W_n^*$ (9) unbounded, hence invalidating the use of these concentration inequalities. To resolve this complication, our trick is to exploit the truncated sampling procedure (Section 3.2) to generate bounded $\tau$-truncated observations (Definition 1) (i.e., $|z_x^i| \le \hat{\tau}$ for $i = 1, \ldots, S$), thus resulting in each summation term $-\log p(z_x^i \mid z_D) + V_{n+1}^*(z_D \cup \{z_x^i\})$ being bounded (Appendix A.2). This enables our use of Hoeffding's inequality to derive the probabilistic bound.

Remark 2. It can be observed from Lemma 2 that the amount of truncation has to be reduced (i.e., higher chosen value of $\tau$) when (a) a tighter bound $\gamma$ on the error $|Q_n^*(z_D, x) - W_n^*(z_D, x)|$ due to finite truncated sampling is desired, (b) a greater budget of $N$ sampling locations is available, (c) a larger space $\Lambda$ of candidate model parameters is preferred, (d) the spatial phenomenon varies with more intensity and less noise (i.e., assuming all candidate signal and noise variance parameters, respectively, $(\sigma_s^\lambda)^2$ and $(\sigma_n^\lambda)^2$, are specified close to the true large signal and small noise variances), and (e) its spatial correlation structure yields a bigger $\kappa$. To elaborate on (e), note that Lemma 2 still holds for any value of $\kappa$ larger than that set in (12): Since $|\sigma_{x'u|D,\lambda}|^2 \le \sigma_{x'x'|D,\lambda}\,\sigma_{uu|D,\lambda}$ for all $x' \ne u \in \mathcal{X} \setminus D$ due to the symmetric positive-definiteness of $\Sigma_{(\mathcal{X} \setminus D)(\mathcal{X} \setminus D)|D,\lambda}$, $\kappa$ can be set to $1 + 2\max_{x', u \in \mathcal{X} \setminus D,\ \lambda \in \Lambda,\ D}\sqrt{\sigma_{x'x'|D,\lambda} / \sigma_{uu|D,\lambda}}$. Then, supposing all candidate length-scale parameters are specified close to the true length-scales, a phenomenon with extreme length-scales tending to 0 (i.e., with white-noise process measurements) or $\infty$ (i.e., with constant measurements) will produce highly similar $\sigma_{x'x'|D,\lambda}$ for all $x' \in \mathcal{X} \setminus D$, thus resulting in smaller $\kappa$ and hence smaller $\tau$.

Remark 3. Alternatively, it can be proven that Lemma 2 and the subsequent results hold by setting $\kappa = 1$ if a certain structural property of the spatial correlation structure (i.e., for all $z_D$ ($D \subseteq \mathcal{X}$) and $\lambda \in \Lambda$, $\Sigma_{DD|\lambda}$ is diagonally dominant) is satisfied, as shown in Lemma 9 (Appendix B). Consequently, the $\kappa$ term can be removed from $T$ and $\tau$.

Lemma 3 Suppose that a set $z_{D'}$ of observations, a budget of $N - n' + 1$ sampling locations for $1 \le n' \le N$, $S \in \mathbb{Z}^+$, and $\gamma > 0$ are given. The probability that $|Q_n^*(z_D, x) - W_n^*(z_D, x)| \le \gamma$ (10) holds for all tuples $(n, z_D, x)$ generated at stage $n = n', \ldots, N$ by (8) to compute $V_{n'}^\epsilon(z_{D'})$ is at least $1 - 2(S|\mathcal{X}|)^N \exp(-2S\gamma^2/T^2)$ where $T$ is previously defined in Lemma 2.

Its proof is found in Appendix A.3. The first step is concluded with our first main result, which follows from Lemmas 1 and 3. Specifically, it chooses the values of $S$ and $\tau$ such that the probability of $Q_n^\epsilon$ (8) approximating $Q_n^*$ (6) poorly (i.e., $|Q_n^*(z_D, x) - Q_n^\epsilon(z_D, x)| > N\gamma$) can be bounded from above by a given $0 < \delta < 1$:

Theorem 1 Suppose that a set $z_D$ of observations, a budget of $N - n + 1$ sampling locations for $1 \le n \le N$, $\gamma > 0$, and $0 < \delta < 1$ are given. The probability that $|Q_n^*(z_D, x) - Q_n^\epsilon(z_D, x)| \le N\gamma$ holds for all $x \in \mathcal{X} \setminus D$ is at least $1 - \delta$ by setting

$$S = O\left(\frac{T^2}{\gamma^2}\left(N\log\frac{N|\mathcal{X}|T^2}{\gamma^2} + \log\frac{1}{\delta}\right)\right)$$

where $T$ is previously defined in Lemma 2. By assuming $N$, $|\Lambda|$, $\sigma_o$, $\sigma_n$, $\kappa$, and $|\mathcal{X}|$ as constants, $\tau = O(\sqrt{\log(1/\gamma)})$ and hence

$$S = O\left(\frac{(\log(1/\gamma))^2}{\gamma^2}\log\left(\frac{\log(1/\gamma)}{\gamma\delta}\right)\right).$$

Its proof is provided in Appendix A.4.
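As a rough numerical illustration of how the bounds in Lemmas 2 and 3 drive the choice of $S$, the sketch below evaluates them for given inputs. It makes one loud assumption: $T$ is only specified up to its order in Lemma 2, so here it is treated as a user-supplied number. The function names are hypothetical, and the search simply finds the smallest $S$ for which the Lemma 3 failure term $2(S|\mathcal{X}|)^N\exp(-2S\gamma^2/T^2)$ drops to $\delta$ or below.

```python
import math

def lemma2_bound(S, gamma, T):
    """Lower bound 1 - 2 exp(-2 S gamma^2 / T^2) for a single tuple (n, z_D, x)."""
    return 1.0 - 2.0 * math.exp(-2.0 * S * gamma**2 / T**2)

def lemma3_log_failure(S, gamma, T, N, num_locations):
    """Logarithm of the failure term 2 (S |X|)^N exp(-2 S gamma^2 / T^2) of Lemma 3."""
    return math.log(2.0) + N * math.log(S * num_locations) - 2.0 * S * gamma**2 / T**2

def smallest_sample_size(gamma, T, N, num_locations, delta):
    """Smallest S (by doubling, then bisection) whose Lemma 3 failure term is <= delta.

    T is assumed to be a known number here; the paper only gives its order.
    """
    target = math.log(delta)
    hi = 1
    while lemma3_log_failure(hi, gamma, T, N, num_locations) > target:
        hi *= 2
    if hi == 1:
        return 1
    lo = hi // 2  # the failure term at lo is known to exceed delta
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if lemma3_log_failure(mid, gamma, T, N, num_locations) <= target:
            hi = mid
        else:
            lo = mid
    return hi
```

The failure term first rises with $S$ (through the $(S|\mathcal{X}|)^N$ factor) and then decays exponentially, which is why the search doubles past the peak before bisecting.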

Remark. It can be observed from Theorem 1 that the number of generated $\tau$-truncated observations has to be increased (i.e., higher chosen value of $S$) when (a) a lower probability $\delta$ of $Q_n^\epsilon$ (8) approximating $Q_n^*$ (6) poorly (i.e., $|Q_n^*(z_D, x) - Q_n^\epsilon(z_D, x)| > N\gamma$) is desired, and (b) a larger domain $\mathcal{X}$ of the phenomenon is given. The influence of $\gamma$, $N$, $|\Lambda|$, $\sigma_o$, $\sigma_n$, and $\kappa$ on $S$ is similar to that on $\tau$, as previously reported in Remark 2 after Lemma 2.

Thus far, we have shown in the first step that, with high probability, $Q_n^\epsilon$ (8) approximates $Q_n^*$ (6) arbitrarily closely for some chosen values of $S$ and $\tau$ (Theorem 1). The next step uses this result to probabilistically bound the performance loss in terms of $Q_n^*$ by observing location $\pi_n^\epsilon(z_D)$ selected by our ϵ-BAL policy $\pi^\epsilon$ at stage $n$ and following the BAL policy $\pi^*$ thereafter:

Lemma 4 Suppose that a set $z_D$ of observations, a budget of $N - n + 1$ sampling locations for $1 \le n \le N$, $\gamma > 0$, and $0 < \delta < 1$ are given. $Q_n^*(z_D, \pi_n^*(z_D)) - Q_n^*(z_D, \pi_n^\epsilon(z_D)) \le 2N\gamma$ holds with probability at least $1 - \delta$ by setting $S$ and $\tau$ according to that in Theorem 1.

See Appendix A.5 for its proof. The final step extends Lemma 4 to obtain our second main result. In particular, it bounds the expected active sensing performance loss of our stochastic ϵ-BAL policy $\pi^\epsilon$ relative to that of BAL policy $\pi^*$, that is, policy $\pi^\epsilon$ is ϵ-Bayes-optimal:

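Theorem 2 Given a set $z_{D_0}$ of prior observations, a budget of $N$ sampling locations, and $\epsilon > 0$, $V_1^*(z_{D_0}) - \mathbb{E}_{\pi^\epsilon}[V_1^{\pi^\epsilon}(z_{D_0})] \le \epsilon$ by setting and substituting $\gamma = \epsilon/(4N^2)$ and $\delta = \epsilon/(2N(N\log(\sigma_o/\sigma_n) + \log|\Lambda|))$ into $S$ and $\tau$ in Theorem 1 to give $\tau = O(\sqrt{\log(1/\epsilon)})$ and

$$S = O\left(\frac{(\log(1/\epsilon))^2}{\epsilon^2}\log\left(\frac{\log(1/\epsilon)}{\epsilon^2}\right)\right).$$

Its proof is given in Appendix A.6.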
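The substitutions named in Theorem 2 are simple enough to compute directly. The helper below is a hypothetical convenience (its name and argument names are not from the paper) that returns the $\gamma$ and $\delta$ to plug into Theorem 1, with $\sigma_o$ and $\sigma_n$ as defined in (13).

```python
import math

def theorem2_settings(eps, N, sigma_o, sigma_n, num_params):
    """Return (gamma, delta) as set in Theorem 2 for a loss bound eps.

    num_params is |Lambda|; the denominator of delta is assumed to be positive,
    which holds when sigma_o > sigma_n or |Lambda| > 1.
    """
    gamma = eps / (4.0 * N**2)
    delta = eps / (2.0 * N * (N * math.log(sigma_o / sigma_n) + math.log(num_params)))
    return gamma, delta

# Illustrative values in the spirit of Section 4.1: sigma_s = 10, sigma_n = 0.25, |Lambda| = 7.
sigma_o = math.sqrt(10.0**2 + 0.25**2)
gamma, delta = theorem2_settings(eps=1.0, N=3, sigma_o=sigma_o, sigma_n=0.25, num_params=7)
```

Both quantities shrink linearly with $\epsilon$, and $\gamma$ shrinks quadratically with the budget $N$, which is what makes the required $S$ grow quickly for long horizons.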

Remark 1. The number of generated $\tau$-truncated observations and the amount of truncation have to be, respectively, increased and reduced (i.e., higher chosen values of $S$ and $\tau$) when a tighter user-defined loss bound $\epsilon$ is desired.

Remark 2. The deterministic BAL policy $\pi^*$ is Bayes-optimal among all candidate stochastic policies $\pi$ since $\mathbb{E}_\pi[V_1^{\pi}(z_{D_0})] \le V_1^*(z_{D_0})$, as proven in Appendix A.7.

3.4. Anytime ϵ-BAL (⟨α, ϵ⟩-BAL) Algorithm

Unlike the BAL policy $\pi^*$, our ϵ-BAL policy $\pi^\epsilon$ can be derived exactly because its time complexity is independent of the size of the set of all possible originally Gaussian distributed observations, which is uncountable. But, the cost of deriving $\pi^\epsilon$ is exponential in the length $N$ of planning horizon since it has to compute the values $V_n^\epsilon(z_D)$ (8) for all $(S|\mathcal{X}|)^N$ possible states $(n, z_D)$. To ease this computational burden, we propose an anytime algorithm based on ϵ-BAL that can produce a good policy fast and improve its approximation quality over time, as discussed next.

The key intuition behind our anytime ϵ-BAL algorithm (⟨α, ϵ⟩-BAL of Algo. 1) is to focus the simulation of greedy exploration paths through the most uncertain regions of the state space (i.e., in terms of the values $V_n^\epsilon(z_D)$) instead of evaluating the entire state space like $\pi^\epsilon$. To achieve this, our ⟨α, ϵ⟩-BAL algorithm maintains both lower and upper heuristic bounds (respectively, $\underline{V}_n^\epsilon(z_D)$ and $\overline{V}_n^\epsilon(z_D)$) for each encountered state $(n, z_D)$, which are exploited for representing the uncertainty of its corresponding value $V_n^\epsilon(z_D)$ to be used in turn for guiding the greedy exploration (or, put differently, pruning unnecessary, bad exploration of the state space while still guaranteeing policy optimality).

To elaborate, each simulated exploration path (EXPLORE of Algo. 1) repeatedly selects a sampling location $x$ and its corresponding $\tau$-truncated observation $z_x^i$ at every stage until the last stage $N$ is reached. Specifically, at each stage $n$ of the simulated path, the next states $(n + 1, z_D \cup \{z_x^i\})$ with uncertainty $|\overline{V}_{n+1}^\epsilon(z_D \cup \{z_x^i\}) - \underline{V}_{n+1}^\epsilon(z_D \cup \{z_x^i\})|$ exceeding $\alpha$ (line 6) are identified (lines 7-8), among which the one with largest lower bound $\underline{V}_{n+1}^\epsilon(z_D \cup \{z_x^i\})$ (line 10) is prioritized/selected for exploration (if more than one exists, ties are broken by choosing the one with most uncertainty, that is, largest upper bound $\overline{V}_{n+1}^\epsilon(z_D \cup \{z_x^i\})$ (line 11)) while the remaining unexplored ones are placed in the set $U$ (line 12) to be considered for future exploration (lines 3-6 in ⟨α, ϵ⟩-BAL). So, the simulated path terminates if the uncertainty of every next state is at most $\alpha$; the uncertainty of a state at the last stage $N$ is guaranteed to be zero (14). Then, the algorithm backtracks up the path to update/tighten the bounds of previously visited states (line 7 in ⟨α, ϵ⟩-BAL and line 14 in EXPLORE) as follows:

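$$\begin{aligned}
\overline{V}_n^\epsilon(z_D) &\leftarrow \min\left(\overline{V}_n^\epsilon(z_D),\ \max_{x \in \mathcal{X} \setminus D} \overline{Q}_n^\epsilon(z_D, x)\right) \\
\underline{V}_n^\epsilon(z_D) &\leftarrow \max\left(\underline{V}_n^\epsilon(z_D),\ \max_{x \in \mathcal{X} \setminus D} \underline{Q}_n^\epsilon(z_D, x)\right) \\
\overline{Q}_n^\epsilon(z_D, x) &\triangleq \frac{1}{S}\sum_{i=1}^{S}\left(-\log p(z_x^i \mid z_D) + \overline{V}_{n+1}^\epsilon(z_D \cup \{z_x^i\})\right) \\
\underline{Q}_n^\epsilon(z_D, x) &\triangleq \frac{1}{S}\sum_{i=1}^{S}\left(-\log p(z_x^i \mid z_D) + \underline{V}_{n+1}^\epsilon(z_D \cup \{z_x^i\})\right)
\end{aligned} \tag{14}$$

for stage $n = 1, \ldots, N$ such that there is no $\overline{V}_{N+1}^\epsilon$ ($\underline{V}_{N+1}^\epsilon$) term on the RHS expression of $\overline{Q}_N^\epsilon$ ($\underline{Q}_N^\epsilon$) at stage $N$. When the planning time runs out, we provide the greedy policy induced by the lower bound: $\pi_1^{\langle\alpha,\epsilon\rangle}(z_{D_0}) \triangleq \arg\max_{x \in \mathcal{X} \setminus D_0} \underline{Q}_1^\epsilon(z_{D_0}, x)$ (line 8 in ⟨α, ϵ⟩-BAL).

Algorithm 1 ⟨α, ϵ⟩-BAL($z_{D_0}$)

⟨α, ϵ⟩-BAL($z_{D_0}$)
1: $U \leftarrow \{(1, z_{D_0})\}$
2: while $|\overline{V}_1^\epsilon(z_{D_0}) - \underline{V}_1^\epsilon(z_{D_0})| > \alpha$ do
3:   $V \leftarrow \arg\max_{(n, z_D) \in U} \underline{V}_n^\epsilon(z_D)$
4:   $(n', z_{D'}) \leftarrow \arg\max_{(n, z_D) \in V} \overline{V}_n^\epsilon(z_D)$
5:   $U \leftarrow U \setminus \{(n', z_{D'})\}$
6:   EXPLORE($n'$, $z_{D'}$, $U$) /* U is passed by reference */
7:   UPDATE($n'$, $z_{D'}$)
8: return $\pi_1^{\langle\alpha,\epsilon\rangle}(z_{D_0}) \leftarrow \arg\max_{x \in \mathcal{X} \setminus D_0} \underline{Q}_1^\epsilon(z_{D_0}, x)$

EXPLORE($n$, $z_D$, $U$)
1: $T \leftarrow \emptyset$
2: for all $x \in \mathcal{X} \setminus D$ do
3:   $\{z_x^i\}_{i=1}^{S} \leftarrow$ sample from $p(z_x \mid z_D)$ (4)
4:   for $i = 1, \ldots, S$ do
5:     $z_x^i \leftarrow z_x^i \min(|z_x^i|, \hat{\tau}) / |z_x^i|$
6:     if $|\overline{V}_{n+1}^\epsilon(z_D \cup \{z_x^i\}) - \underline{V}_{n+1}^\epsilon(z_D \cup \{z_x^i\})| > \alpha$ then
7:       $T \leftarrow T \cup \{(n + 1, z_D \cup \{z_x^i\})\}$
8:       parent$(n + 1, z_D \cup \{z_x^i\}) \leftarrow (n, z_D)$
9: if $|T| > 0$ then
10:   $V \leftarrow \arg\max_{(n+1, z_D \cup \{z_x^i\}) \in T} \underline{V}_{n+1}^\epsilon(z_D \cup \{z_x^i\})$
11:   $(n + 1, z_{D'}) \leftarrow \arg\max_{(n+1, z_D \cup \{z_x^i\}) \in V} \overline{V}_{n+1}^\epsilon(z_D \cup \{z_x^i\})$
12:   $U \leftarrow U \cup (T \setminus \{(n + 1, z_{D'})\})$
13:   EXPLORE($n + 1$, $z_{D'}$, $U$)
14: Update $\overline{V}_n^\epsilon(z_D)$ and $\underline{V}_n^\epsilon(z_D)$ using (14)

UPDATE($n$, $z_D$)
1: Update $\overline{V}_n^\epsilon(z_D)$ and $\underline{V}_n^\epsilon(z_D)$ using (14)
2: if $n > 1$ then
3:   $(n - 1, z_{D'}) \leftarrow$ parent$(n, z_D)$
4:   UPDATE($n - 1$, $z_{D'}$)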

Central to the anytime performance of our ⟨α, ϵ⟩-BAL algorithm is the computational efficiency of deriving informed initial heuristic bounds $\underline{V}_n^\epsilon(z_D)$ and $\overline{V}_n^\epsilon(z_D)$ where $\underline{V}_n^\epsilon(z_D) \le V_n^\epsilon(z_D) \le \overline{V}_n^\epsilon(z_D)$. Due to lack of space, we have shown in Appendix A.8 how they can be derived efficiently. We have also derived a theoretical guarantee similar to that of Theorem 2 on the expected active sensing performance of our ⟨α, ϵ⟩-BAL policy $\pi^{\langle\alpha,\epsilon\rangle}$, as shown in Appendix A.9. We have analyzed the time complexity of simulating $k$ exploration paths in our ⟨α, ϵ⟩-BAL algorithm to be $O(kNS|\mathcal{X}|(|\Lambda|(N^3 + |\mathcal{X}|N^2 + S|\mathcal{X}|) + \Delta + \log(kNS|\mathcal{X}|)))$ (Appendix A.10) where $O(\Delta)$ denotes the cost of initializing the heuristic bounds at each state. In practice, ⟨α, ϵ⟩-BAL's planning horizon can be shortened to reduce its computational cost further by limiting the depth of each simulated path to strictly less than $N$. In that case, although the resulting $\pi^{\langle\alpha,\epsilon\rangle}$'s performance has not been theoretically analyzed, Section 4.1 demonstrates empirically that it outperforms the state-of-the-art algorithms.
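The backup step (14) used when the algorithm backtracks can be illustrated in isolation. The sketch below is not the authors' implementation: the function name and the dictionary layout (one list of per-sample values per candidate location) are assumptions made for this example, and the bounds of the next states are taken as given rather than initialized as in Appendix A.8.

```python
def tighten_bounds(lower, upper, neg_log_p, child_lower, child_upper):
    """One backup of (14) at a state (n, z_D); returns the new (lower, upper) bounds.

    neg_log_p[x][i]   : -log p(z_x^i | z_D) for sample i at candidate location x
    child_lower[x][i] : current lower bound of the next state reached by sample i
    child_upper[x][i] : current upper bound of the same next state
    At the last stage N, pass zeros for both child bounds.
    """
    def q(child):
        # max over locations of the sample average of payoff plus child bound
        return max(
            sum(p + c for p, c in zip(neg_log_p[x], child[x])) / len(neg_log_p[x])
            for x in neg_log_p
        )
    return max(lower, q(child_lower)), min(upper, q(child_upper))
```

Because the lower bound can only rise and the upper bound can only fall, repeated backups shrink the gap that the while loop in ⟨α, ϵ⟩-BAL compares against $\alpha$.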

4. Experiments and Discussion

This section evaluates the active sensing performance and time efficiency of our ⟨α, ϵ⟩-BAL policy $\pi^{\langle\alpha,\epsilon\rangle}$ (Section 3) empirically under limited sampling budget using two datasets featuring a simple, simulated spatial phenomenon (Section 4.1) and a large-scale, real-world traffic phenomenon (i.e., speeds of road segments) over an urban road network (Section 4.2). All experiments are run on a Mac OS X machine with Intel Core i7 at 2.66 GHz.

4.1. Simulated Spatial Phenomenon

The domain of the phenomenon is discretized into a finite set of sampling locations $\mathcal{X} = \{0, 1, \ldots, 99\}$. The phenomenon is a realization of a GP (Section 2) parameterized by $\lambda^* = \{\sigma_n^{\lambda^*} = 0.25, \sigma_s^{\lambda^*} = 10.0, \ell^{\lambda^*} = 1.0\}$. For simplicity, we assume that $\sigma_n^{\lambda^*}$ and $\sigma_s^{\lambda^*}$ are known, but the true length-scale $\ell^{\lambda^*} = 1$ is not. So, a uniform prior belief $b_{D_0 = \emptyset}$ is maintained over a set $L = \{1, 6, 9, 12, 15, 18, 21\}$ of 7 candidate length-scales $\ell^{\lambda}$. Using root mean squared prediction error (RMSPE) as the performance metric, the performance of our ⟨α, ϵ⟩-BAL policies $\pi^{\langle\alpha,\epsilon\rangle}$ with planning horizon length $N' = 2, 3$ and $\alpha = 1.0$ is compared to that of the state-of-the-art GP-based active learning algorithms: (a) The a priori greedy design (APGD) policy (Shewry & Wynn, 1987) iteratively selects and adds $\arg\max_{x \in \mathcal{X} \setminus S_n} \sum_{\lambda \in \Lambda} b_{D_0}(\lambda)\, H[Z_{S_n \cup \{x\}} | z_{D_0}, \lambda]$ to the current set $S_n$ of sampling locations (where $S_0 = \emptyset$) until $S_N$ is obtained, (b) the implicit exploration (IE) policy greedily selects and observes sampling location $x^{\mathrm{IE}} = \arg\max_{x \in \mathcal{X} \setminus D} \sum_{\lambda \in \Lambda} b_D(\lambda)\, H[Z_x | z_D, \lambda]$ and updates the belief from $b_D$ to $b_{D \cup \{x^{\mathrm{IE}}\}}$ over $L$; if the upper bound on the performance advantage of using $\pi^*$ over APGD policy is less than a pre-defined threshold, it will use APGD with the remaining sampling budget, and (c) the explicit exploration via independent tests (ITE) policy performs a PAC-based binary search, which is guaranteed to find $\ell^{\lambda^*}$ with high probability, and then uses APGD to select the remaining locations to be observed.
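The IE selection rule and belief update described in (b) can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not the implementation used in the experiments: it assumes a zero-mean GP with a squared exponential covariance on 1-D locations, and the helper names `se_kernel`, `posterior`, `ie_select`, and `update_belief` are hypothetical.

```python
import numpy as np

def se_kernel(a, b, ell, sigma_s):
    """Squared exponential covariance between 1-D location arrays a and b (assumed kernel)."""
    d = a[:, None] - b[None, :]
    return sigma_s**2 * np.exp(-0.5 * (d / ell) ** 2)

def posterior(x_obs, z_obs, x_new, ell, sigma_s, sigma_n):
    """Predictive mean and variance of noisy outputs at x_new, assuming a zero prior mean."""
    x_new = np.atleast_1d(np.asarray(x_new, dtype=float))
    prior_var = sigma_s**2 + sigma_n**2
    if len(x_obs) == 0:
        return np.zeros(len(x_new)), np.full(len(x_new), prior_var)
    K = se_kernel(x_obs, x_obs, ell, sigma_s) + sigma_n**2 * np.eye(len(x_obs))
    k = se_kernel(x_new, x_obs, ell, sigma_s)
    mean = k @ np.linalg.solve(K, z_obs)
    var = prior_var - np.einsum("ij,ji->i", k, np.linalg.solve(K, k.T))
    return mean, var

def ie_select(candidates, x_obs, z_obs, ells, belief, sigma_s, sigma_n):
    """Pick the candidate maximizing the belief-weighted Gaussian entropy of Z_x."""
    score = np.zeros(len(candidates))
    for ell, b in zip(ells, belief):
        _, var = posterior(x_obs, z_obs, candidates, ell, sigma_s, sigma_n)
        score += b * 0.5 * np.log(2 * np.pi * np.e * var)
    return candidates[int(np.argmax(score))]

def update_belief(belief, x_obs, z_obs, x, z, ells, sigma_s, sigma_n):
    """Bayes update over candidate length-scales: b(l) is multiplied by p(z | z_obs, l)."""
    log_post = np.log(belief)
    for j, ell in enumerate(ells):
        mean, var = posterior(x_obs, z_obs, x, ell, sigma_s, sigma_n)
        log_post[j] += -0.5 * np.log(2 * np.pi * var[0]) - 0.5 * (z - mean[0]) ** 2 / var[0]
    log_post -= log_post.max()  # stabilize before exponentiating
    post = np.exp(log_post)
    return post / post.sum()
```

With the seven candidate length-scales above and a uniform prior, two closely-spaced observations with very different values shift most of the belief mass toward the shortest length-scale, which is the effect that closely-spaced sampling exploits.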

Both nonmyopic IE and ITE policies are proposed by Krause & Guestrin (2007): IE is reported to incur the lowest prediction error empirically while ITE is guaranteed not to achieve worse than the optimal performance by more than a factor of $1/e$. Fig. 1a shows results of the active sensing performance of the tested policies averaged over 20 realizations of the phenomenon drawn independently from the underlying GP model described earlier. It can be observed that the RMSPE of every tested policy decreases with a larger budget of $N$ sampling locations. Notably, our ⟨α, ϵ⟩-BAL policies perform better than the APGD, IE,

Figure 1. Graphs of (a) RMSPE of APGD, IE, ITE, and ⟨α, ϵ⟩-BAL policies with planning horizon length $N' = 2, 3$ vs. budget of $N$ sampling locations, (b) stage-wise online processing cost (time in ms) of ⟨α, ϵ⟩-BAL policy with $N' = 3$ and (c) gap (entropy in nat) between $\overline{V}^{\epsilon}_1(z_{D_0})$ and $\underline{V}^{\epsilon}_1(z_{D_0})$ vs. number of simulated paths.

and ITE policies, especially when $N$ is small. The performance gap between our ⟨α, ϵ⟩-BAL policies and the other policies decreases as $N$ increases, which intuitively means that, with a tighter sampling budget (i.e., smaller $N$), it is more critical to strike a right balance between exploration and exploitation.

Fig. 2 shows the stage-wise sampling designs produced by the tested policies with a budget of $N = 15$ sampling locations. It can be observed that our ⟨α, ϵ⟩-BAL policy achieves a better balance between exploration and exploitation and can therefore discern $\ell^{\lambda^*}$ much faster than the IE and ITE policies while maintaining a fine spatial coverage of the phenomenon. This is expected due to the following issues faced by IE and ITE policies: (a) The myopic exploration of IE tends not to observe closely-spaced locations (Fig. 2a), which are in fact informative towards estimating the true length-scale, and (b) despite ITE's theoretical guarantee in finding $\ell^{\lambda^*}$, its PAC-style exploration is too aggressive, hence completely ignoring how informative the posterior belief $b_D$ over $L$ is during exploration. This leads to a sub-optimal exploration behavior that reserves too little budget for exploitation and consequently entails a poor spatial coverage, as shown in Fig. 2b.

Our ⟨α, ϵ⟩-BAL policy can resolve these issues by jointly and naturally optimizing the trade-off between observing the most informative locations for minimizing the predictive uncertainty of the phenomenon (i.e., exploitation) vs. the uncertainty surrounding its length-scale (i.e., exploration), hence enjoying the best of both worlds (Fig. 2c). In fact, we notice that, after observing 5 locations, our ⟨α, ϵ⟩-BAL policy can focus 88.10% of its posterior belief on $\ell^{\lambda^*}$ while IE only assigns, on average, about 18.65% of its posterior belief on $\ell^{\lambda^*}$, which is hardly more informative than the prior belief $b_{D_0}(\ell^{\lambda^*}) = 1/7 \approx 14.29\%$.

Finally, Fig. 1b shows that the online processing cost of ⟨α, ϵ⟩-BAL per sampling stage grows linearly in the number of simulated paths while Fig. 1c reveals that its approximation quality improves (i.e., gap between $\overline{V}^{\epsilon}_1(z_{D_0})$ and $\underline{V}^{\epsilon}_1(z_{D_0})$ decreases) with increasing number of simulated paths. Interestingly, it can be observed from Fig. 1c that although ⟨α, ϵ⟩-BAL needs about 800 simulated paths (i.e., 400 s) to close the gap between $\overline{V}^{\epsilon}_1(z_{D_0})$ and $\underline{V}^{\epsilon}_1(z_{D_0})$, $\underline{V}^{\epsilon}_1(z_{D_0})$

Figure 2. Stage-wise sampling designs (locations 0 to 100 of the spatial field vs. stages) produced by (a) IE, (b) ITE, and (c) ⟨α, ϵ⟩-BAL policy with a planning horizon length $N' = 3$ using a budget of $N = 15$ sampling locations. The final sampling designs are depicted in the bottommost rows of the figures. Panel annotations: (b) Exploration: find the true length-scale by binary search; Exploitation: too little remaining budget entails poor coverage. (c) Exploration: observing closely-spaced locations to identify true length-scale; Exploitation: sufficient budget remaining to achieve even coverage.

only takes about 100 simulated paths (i.e., 50 s). This implies the actual computation time needed for ⟨α, ϵ⟩-BAL to reach $V^{\epsilon}_1(z_{D_0})$ (via its lower bound $\underline{V}^{\epsilon}_1(z_{D_0})$) is much less than that required to verify the convergence of $\overline{V}^{\epsilon}_1(z_{D_0})$ to $\underline{V}^{\epsilon}_1(z_{D_0})$ (i.e., by checking the gap). This is expected since ⟨α, ϵ⟩-BAL explores states with largest lower bound first (Section 3.4).

4.2. Real-World Traffic Phenomenon

Fig. 3a shows the traffic phenomenon (i.e., speeds (km/h) of road segments) over an urban road network $\mathcal{X}$ comprising 775 road segments (e.g., highways, arterials, slip roads, etc.) in Tampines area, Singapore during lunch hours on June 20, 2011. The mean speed is 52.8 km/h and the standard deviation is 21.0 km/h. Each road segment $x \in \mathcal{X}$ is specified by a 4-dimensional vector of features: length, number of lanes, speed limit, and direction. The phenomenon is modeled as a relational GP (Chen et al., 2012) whose correlation structure can exploit both the road segment features and road network topology information. The true parameters $\lambda^* = \{\sigma_n^{\lambda^*}, \sigma_s^{\lambda^*}, \ell^{\lambda^*}\}$ are set as the maximum likelihood estimates learned using the entire dataset. We assume that $\sigma_n^{\lambda^*}$ and $\sigma_s^{\lambda^*}$ are known, but $\ell^{\lambda^*}$ is not. So, a uniform prior belief $b_{D_0 = \emptyset}$ is maintained over a set $L = \{\ell^{\lambda_i}\}_{i=0}^{6}$ of 7 candidate length-scales $\ell^{\lambda_0} = \ell^{\lambda^*}$ and $\ell^{\lambda_i} = 2(i + 1)\ell^{\lambda^*}$ for $i = 1, \ldots, 6$.

The performance of our ⟨α, ϵ⟩-BAL policies with planning horizon length $N' = 3, 4, 5$ is compared to that of APGD and IE policies (Section 4.1) by running each of them on a mobile probe to direct its active sensing along a path of adjacent road segments according to the road network topology; ITE cannot be used here as it requires observing road segments separated by a pre-computed distance during exploration (Krause & Guestrin, 2007), which violates the topological constraints of the road network since the mobile probe cannot "teleport". Fig. 3 shows results of the tested policies averaged over 5 independent runs: It can be observed from Fig. 3b that our ⟨α, ϵ⟩-BAL policies outperform APGD and IE policies due to their nonmyopic exploration behavior. In terms of the total online processing cost, Fig. 3c shows that ⟨α, ϵ⟩-BAL incurs less than 4.5 hours given a budget of $N = 240$ road segments, which can be afforded by modern computing power. To illustrate the behavior of each policy, Figs. 3d-f show, respectively, the road segments observed (shaded in black) by the mobile

Figure 3. (a) Traffic phenomenon (i.e., speeds (km/h) of road segments) over an urban road network in Tampines area, Singapore, graphs of (b) RMSPE (km/h) of APGD, IE, and ⟨α, ϵ⟩-BAL policies with horizon length $N' = 3, 4, 5$ and (c) total online processing cost (time in ms) of ⟨α, ϵ⟩-BAL policies with $N' = 3, 4, 5$ vs. budget of $N$ segments, and (d-f) road segments observed (shaded in black) by respective APGD, IE, and ⟨α, ϵ⟩-BAL policies ($N' = 5$) with $N = 60$.

probe running APGD, IE, and ⟨α, ϵ⟩-BAL policies with $N' = 5$ given a budget of $N = 60$. It can be observed from Figs. 3d-e that both APGD and IE cause the probe to move away from the slip roads and highways to low-speed segments whose measurements vary much more smoothly; this is expected due to their myopic exploration behavior. In contrast, ⟨α, ϵ⟩-BAL nonmyopically plans the probe's path and can thus direct it to observe the more informative slip roads and highways with highly varying measurements (Fig. 3f) to achieve better performance.

5. Conclusion

This paper describes and theoretically analyzes an ϵ-BAL approach to nonmyopic active learning of GPs that can jointly and naturally optimize the exploration-exploitation trade-off. We then provide an anytime ⟨α, ϵ⟩-BAL algorithm based on ϵ-BAL with real-time performance guarantee and empirically demonstrate using synthetic and real-world datasets that, with limited budget, it outperforms the state-of-the-art GP-based active learning algorithms.

Acknowledgments. This work was supported by Singapore National Research Foundation in part under its International Research Center @ Singapore Funding Initiative and administered by the Interactive Digital Media Programme Office and in part through the Singapore-MIT Alliance for Research and Technology Subaward Agreement No. 52.

References

Cao, N., Low, K. H., and Dolan, J. M. Multi-robot informative path planning for active sensing of environmental phenomena: A tale of two algorithms. In Proc. AAMAS, pp. 7–14, 2013.

Chen, J., Low, K. H., Tan, C. K.-Y., Oran, A., Jaillet, P., Dolan, J. M., and Sukhatme, G. S. Decentralized data fusion and active sensing with mobile sensors for modeling and predicting spatiotemporal traffic phenomena. In Proc. UAI, pp. 163–173, 2012.

Chen, J., Cao, N., Low, K. H., Ouyang, R., Tan, C. K.-Y., and Jaillet, P. Parallel Gaussian process regression with low-rank covariance matrix approximations. In Proc. UAI, pp. 152–161, 2013a.

Chen, J., Low, K. H., and Tan, C. K.-Y. Gaussian process-based decentralized data fusion and active sensing for mobility-on-demand system. In Proc. RSS, 2013b.

Cover, T. and Thomas, J. Elements of Information Theory. John Wiley & Sons, NY, 1991.

Diggle, P. J. Bayesian geostatistical design. Scand. J. Statistics, 33(1):53–64, 2006.

Dolan, J. M., Podnar, G., Stancliff, S., Low, K. H., Elfes, A., Higinbotham, J., Hosler, J. C., Moisan, T. A., and Moisan, J. Cooperative aquatic sensing using the telesupervised adaptive ocean sensor fleet. In Proc. SPIE Conference on Remote Sensing of the Ocean, Sea Ice, and Large Water Regions, volume 7473, 2009.

Hoang, T. N. and Low, K. H. A general framework for interacting Bayes-optimally with self-interested agents using arbitrary parametric model and model prior. In Proc. IJCAI, pp. 1394–1400, 2013.

Houlsby, N., Hernandez-Lobato, J. M., Huszar, F., and Ghahramani, Z. Collaborative Gaussian processes for preference learning. In Proc. NIPS, pp. 2105–2113, 2012.

Krause, A. and Guestrin, C. Nonmyopic active learning of Gaussian processes: An exploration-exploitation approach. In Proc. ICML, pp. 449–456, 2007.

Krause, A., Singh, A., and Guestrin, C. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. JMLR, 9:235–284, 2008.

Low, K. H., Gordon, G. J., Dolan, J. M., and Khosla, P. Adaptive sampling for multi-robot wide-area exploration. In Proc. IEEE ICRA, pp. 755–760, 2007.

Low, K. H., Dolan, J. M., and Khosla, P. Adaptive multi-robot wide-area exploration and mapping. In Proc. AAMAS, pp. 23–30, 2008.

Low, K. H., Dolan, J. M., and Khosla, P. Information-theoretic approach to efficient adaptive path planning for mobile robotic environmental sensing. In Proc. ICAPS, pp. 233–240, 2009.

Low, K. H., Dolan, J. M., and Khosla, P. Active Markov information-theoretic path planning for robotic environmental sensing. In Proc. AAMAS, pp. 753–760, 2011.

Low, K. H., Chen, J., Dolan, J. M., Chien, S., and Thompson, D. R. Decentralized active robotic exploration and mapping for probabilistic field classification in environmental sensing. In Proc. AAMAS, pp. 105–112, 2012.

Martin, R. J. Comparing and contrasting some environmental and experimental design problems. Environmetrics, 12(3):303–317, 2001.

Müller, W. G. Collecting Spatial Data: Optimum Design of Experiments for Random Fields. Springer, 3rd edition, 2007.

Ouyang, R., Low, K. H., Chen, J., and Jaillet, P. Multi-robot active sensing of non-stationary Gaussian process-based environmental phenomena. In Proc. AAMAS, 2014.

Park, M. and Pillow, J. W. Bayesian active learning with localized priors for fast receptive field characterization. In Proc. NIPS, pp. 2357–2365, 2012.

Podnar, G., Dolan, J. M., Low, K. H., and Elfes, A. Telesupervised remote surface water quality sensing. In Proc. IEEE Aerospace Conference, 2010.

Poupart, P., Vlassis, N., Hoey, J., and Regan, K. An analytic solution to discrete Bayesian reinforcement learning. In Proc. ICML, pp. 697–704, 2006.

Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. MIT Press, 2006.

Shewry, M. C. and Wynn, H. P. Maximum entropy sampling. J. Applied Statistics, 14(2):165–170, 1987.

Singh, A., Krause, A., Guestrin, C., and Kaiser, W. J. Efficient informative sensing using multiple robots. J. Artificial Intelligence Research, 34:707–755, 2009.

Solomon, H. and Zacks, S. Optimal design of sampling from finite populations: A critical review and indication of new research areas. J. American Statistical Association, 65(330):653–677, 1970.

Zimmerman, D. L. Optimal network design for spatial prediction, covariance parameter estimation, and empirical prediction. Environmetrics, 17(6):635–652, 2006.

A. Proofs of Main Results

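A.1. Proof of Lemma 1

We will give a proof by induction on $n$ that

$$|Q^*_n(z_D, x) - Q^{\epsilon}_n(z_D, x)| \le (N - n + 1)\gamma \qquad (15)$$

for all tuples $(n, z_D, x)$ generated at stage $n = n', \ldots, N$ by (8) to compute $V^{\epsilon}_{n'}(z_{D'})$. When $n = N$, $W^*_N(z_D, x) = Q^{\epsilon}_N(z_D, x)$ in (10), by definition. So, $|Q^*_N(z_D, x) - Q^{\epsilon}_N(z_D, x)| \le \gamma$ (15) trivially holds for the base case. Supposing (15) holds for $n + 1$ (i.e., induction hypothesis), we will prove that it holds for $n' \le n < N$:

$$\begin{aligned}
|Q^*_n(z_D, x) - Q^{\epsilon}_n(z_D, x)| &\le |Q^*_n(z_D, x) - W^*_n(z_D, x)| + |W^*_n(z_D, x) - Q^{\epsilon}_n(z_D, x)| \\
&\le \gamma + |W^*_n(z_D, x) - Q^{\epsilon}_n(z_D, x)| \\
&\le \gamma + (N - n)\gamma = (N - n + 1)\gamma .
\end{aligned}$$

The first and second inequalities follow from the triangle inequality and (10), respectively. The last inequality is due to

$$\begin{aligned}
|W^*_n(z_D, x) - Q^{\epsilon}_n(z_D, x)| &\le \frac{1}{S}\sum_{i=1}^{S} \left| V^*_{n+1}(z_D \cup \{z^i_x\}) - V^{\epsilon}_{n+1}(z_D \cup \{z^i_x\}) \right| \\
&\le \frac{1}{S}\sum_{i=1}^{S} \max_{x'} \left| Q^*_{n+1}(z_D \cup \{z^i_x\}, x') - Q^{\epsilon}_{n+1}(z_D \cup \{z^i_x\}, x') \right| \\
&\le (N - n)\gamma
\end{aligned}$$

such that the last inequality follows from the induction hypothesis.

From (15), when $n = n'$, $|Q^*_{n'}(z_{D'}, x) - Q^{\epsilon}_{n'}(z_{D'}, x)| \le (N - n' + 1)\gamma$ for all $x \in \mathcal{X} \setminus D'$ since $D = D'$ and $z_D = z_{D'}$.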

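A.2. Proof of Lemma 2

Let

$$W^i_n(z_D, x) \triangleq -\log p(z^i_x | z_D) + V^*_{n+1}(z_D \cup \{z^i_x\}) .$$

Then, $W^*_n(z_D, x) = S^{-1}\sum_{i=1}^{S} W^i_n(z_D, x)$ can be viewed as an empirical mean computed based on the random samples $W^i_n(z_D, x)$ drawn from a distribution whose mean coincides with

$$\begin{aligned}
\widehat{Q}_n(z_D, x) &\triangleq \widehat{H}[\widehat{Z}_x | z_D] + \mathbb{E}\big[V^*_{n+1}(z_D \cup \{\widehat{Z}_x\}) \,\big|\, z_D\big] \\
\widehat{H}[\widehat{Z}_x | z_D] &\triangleq -\int_{-\widehat{\tau}}^{\widehat{\tau}} f(\widehat{Z}_x = z_x | z_D) \log p(Z_x = z_x | z_D)\, \mathrm{d}z_x \\
&\quad - f(\widehat{Z}_x = -\widehat{\tau} | z_D) \log p(Z_x = -\widehat{\tau} | z_D) - f(\widehat{Z}_x = \widehat{\tau} | z_D) \log p(Z_x = \widehat{\tau} | z_D)
\end{aligned} \qquad (16)$$

such that the expectation term is omitted from the RHS expression of $\widehat{Q}_N$ at stage $N$, and recall from Definition 1 that $f$ and $p$ are distributions of $\widehat{Z}_x$ and $Z_x$, respectively.

Using Hoeffding's inequality,

$$\left| \widehat{Q}_n(z_D, x) - \frac{1}{S}\sum_{i=1}^{S} W^i_n(z_D, x) \right| \le \frac{\gamma}{2}$$

with probability at least $1 - 2\exp\!\left(-S\gamma^2 / \big(2(\overline{W} - \underline{W})^2\big)\right)$ where $\overline{W}$ and $\underline{W}$ are upper and lower bounds of $W^i_n(z_D, x)$, respectively. To determine these bounds, note that $|z^i_x| \le \widehat{\tau}$, by Definition 1, and $|\mu_{x|D,\lambda}| \le \widehat{\tau} - \tau$, by (7). Consequently, $0 \le (z^i_x - \mu_{x|D,\lambda})^2 \le (2\widehat{\tau} - \tau)^2 \le (2N\kappa^{N-1}\tau - \tau)^2 = (2N\kappa^{N-1} - 1)^2\tau^2$ such that the last inequality follows from Lemma 7. Together with using Lemma 6, the following result ensues:

$$\frac{1}{\sqrt{2\pi\sigma_n^2}} \ge p(z^i_x | z_D, \lambda) \ge \frac{1}{\sqrt{2\pi\sigma_o^2}} \exp\!\left( -\frac{(2N\kappa^{N-1} - 1)^2\tau^2}{2\sigma_n^2} \right)$$

where $\sigma_n^2$ and $\sigma_o^2$ are previously defined in (13). It follows that

$$\begin{aligned}
p(z^i_x | z_D) &= \sum_{\lambda \in \Lambda} p(z^i_x | z_D, \lambda)\, b_D(\lambda) \\
&\ge \sum_{\lambda \in \Lambda} \left[ \frac{1}{\sqrt{2\pi\sigma_o^2}} \exp\!\left( -\frac{(2N\kappa^{N-1} - 1)^2\tau^2}{2\sigma_n^2} \right) \right] b_D(\lambda) \\
&= \frac{1}{\sqrt{2\pi\sigma_o^2}} \exp\!\left( -\frac{(2N\kappa^{N-1} - 1)^2\tau^2}{2\sigma_n^2} \right) .
\end{aligned}$$

Similarly, $p(z^i_x | z_D) \le 1/\sqrt{2\pi\sigma_n^2}$. Then,

$$\begin{aligned}
-\log p(z^i_x | z_D) &\le \frac{1}{2}\log\!\left(2\pi\sigma_o^2\right) + \frac{(2N\kappa^{N-1} - 1)^2\tau^2}{2\sigma_n^2} , \\
-\log p(z^i_x | z_D) &\ge \frac{1}{2}\log\!\left(2\pi\sigma_n^2\right) .
\end{aligned}$$

By Lemma 10,

$$\frac{N - n}{2}\log\!\left(2\pi e\sigma_n^2\right) \le V^*_{n+1}(z_D \cup \{z^i_x\}) \le \frac{N - n}{2}\log\!\left(2\pi e\sigma_o^2\right) + \log|\Lambda| .$$

Consequently,

$$\overline{W} - \underline{W} \le N\log\frac{\sigma_o}{\sigma_n} + \frac{(2N\kappa^{N-1} - 1)^2\tau^2}{2\sigma_n^2} + \log|\Lambda| = O\!\left( \frac{N^2\kappa^{2N}\tau^2}{\sigma_n^2} + N\log\frac{\sigma_o}{\sigma_n} + \log|\Lambda| \right) .$$

Finally, using Lemma 16, $|Q^*_n(z_D, x) - \widehat{Q}_n(z_D, x)| \le \gamma/2$ by setting

$$\tau = O\!\left( \sigma_o \sqrt{ \log\!\left( \frac{\sigma_o^2}{\gamma} \left( \frac{N^2\kappa^{2N} + \sigma_o^2}{\sigma_n^2} + N\log\frac{\sigma_o}{\sigma_n} + \log|\Lambda| \right) \right) } \right) ,$$

thereby guaranteeing that

$$|Q^*_n(z_D, x) - W^*_n(z_D, x)| \le |Q^*_n(z_D, x) - \widehat{Q}_n(z_D, x)| + |\widehat{Q}_n(z_D, x) - W^*_n(z_D, x)| \le \frac{\gamma}{2} + \frac{\gamma}{2} = \gamma$$

with probability at least $1 - 2\exp\!\left(-2S\gamma^2/T^2\right)$ where

$$T = 2\left(\overline{W} - \underline{W}\right) = O\!\left( \frac{N^2\kappa^{2N}\tau^2}{\sigma_n^2} + N\log\frac{\sigma_o}{\sigma_n} + \log|\Lambda| \right) .$$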

A.3. Proof of Lemma 3

From Lemma 2,

$$P\big( |Q^*_n(z_D, x) - W^*_n(z_D, x)| > \gamma \big) \le 2\exp\!\left( -\frac{2S\gamma^2}{T^2} \right)$$

for each tuple $(n, z_D, x)$ generated at stage $n = n', \ldots, N$ by (8) to compute $V^{\epsilon}_{n'}(z_{D'})$. Since there will be no more than $(S|\mathcal{X}|)^N$ tuples $(n, z_D, x)$ generated at stage $n = n', \ldots, N$ by (8) to compute $V^{\epsilon}_{n'}(z_{D'})$, the probability that $|Q^*_n(z_D, x) - W^*_n(z_D, x)| > \gamma$ for some generated tuple $(n, z_D, x)$ is at most $2(S|\mathcal{X}|)^N \exp(-2S\gamma^2/T^2)$ by applying the union bound. Lemma 3 then directly follows.

A.4. Proof of Theorem 1

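Suppose that a set $z_D$ of observations, a budget of $N - n + 1$ sampling locations, $S \in \mathbb{Z}^+$, and $\gamma > 0$ are given. It follows immediately from Lemmas 1 and 3 that the probability of $|Q^*_n(z_D, x) - Q^{\epsilon}_n(z_D, x)| \le N\gamma$ (11) holding for all $x \in \mathcal{X} \setminus D$ is at least

$$1 - 2(S|\mathcal{X}|)^N \exp\!\left( -\frac{2S\gamma^2}{T^2} \right)$$

where $T$ is previously defined in Lemma 2.

To guarantee that $|Q^*_n(z_D, x) - Q^{\epsilon}_n(z_D, x)| \le N\gamma$ (11) holds for all $x \in \mathcal{X} \setminus D$ with probability at least $1 - \delta$, the value of $S$ to be determined must therefore satisfy the following inequality:

$$1 - 2(S|\mathcal{X}|)^N \exp\!\left( -\frac{2S\gamma^2}{T^2} \right) \ge 1 - \delta ,$$

which is equivalent to

$$S \ge \frac{T^2}{2\gamma^2}\left( N\log S + N\log|\mathcal{X}| + \log\frac{2}{\delta} \right) . \qquad (17)$$

Using the inequality $\log S \le \alpha S - \log\alpha - 1$ with an appropriate choice of $\alpha = \gamma^2/(NT^2)$, the RHS expression of (17) can be bounded from above by

$$\frac{S}{2} + \frac{T^2}{2\gamma^2}\left( N\log\frac{N|\mathcal{X}|T^2}{e\gamma^2} + \log\frac{2}{\delta} \right) .$$

Therefore, to satisfy (17), it suffices to determine the value of $S$ such that the following inequality holds:

$$S \ge \frac{S}{2} + \frac{T^2}{2\gamma^2}\left( N\log\frac{N|\mathcal{X}|T^2}{e\gamma^2} + \log\frac{2}{\delta} \right)$$

by setting

$$S = \frac{T^2}{\gamma^2}\left( N\log\frac{N|\mathcal{X}|T^2}{e\gamma^2} + \log\frac{2}{\delta} \right) \qquad (18)$$

where $T \triangleq O\!\left( \frac{N^2\kappa^{2N}\tau^2}{\sigma_n^2} + N\log\frac{\sigma_o}{\sigma_n} + \log|\Lambda| \right)$ by setting

$$\tau = O\!\left( \sigma_o \sqrt{ \log\!\left( \frac{\sigma_o^2}{\gamma} \left( \frac{N^2\kappa^{2N} + \sigma_o^2}{\sigma_n^2} + N\log\frac{\sigma_o}{\sigma_n} + \log|\Lambda| \right) \right) } \right) ,$$

as defined in Lemma 2 previously. By assuming $\sigma_o$, $\sigma_n$, $|\Lambda|$, $N$, $\kappa$, and $|\mathcal{X}|$ as constants, $\tau = O(\sqrt{\log(1/\gamma)})$, thus resulting in $T = O(\log(1/\gamma))$. Consequently, (18) can be reduced to

$$S = O\!\left( \frac{\left(\log\frac{1}{\gamma}\right)^2}{\gamma^2} \log\!\left( \frac{\log\frac{1}{\gamma}}{\gamma\delta} \right) \right) .$$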
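As a numerical sanity check of the argument above, the following sketch evaluates the choice of $S$ in (18) and confirms that it satisfies (17) and keeps the union-bound failure probability below $\delta$. It assumes $T$ is supplied as a concrete number (the text only specifies it up to $O(\cdot)$), and the function names `sample_size`, `satisfies_17`, and `log_failure_probability` are hypothetical.

```python
import math

def sample_size(T, gamma, delta, N, num_locations):
    """S as set in (18): (T^2 / gamma^2) * (N log(N |X| T^2 / (e gamma^2)) + log(2 / delta))."""
    ratio = T**2 / gamma**2
    return ratio * (N * math.log(N * num_locations * ratio / math.e) + math.log(2 / delta))

def satisfies_17(S, T, gamma, delta, N, num_locations):
    """Check inequality (17): S >= (T^2 / (2 gamma^2)) (N log S + N log|X| + log(2 / delta))."""
    rhs = T**2 / (2 * gamma**2) * (
        N * math.log(S) + N * math.log(num_locations) + math.log(2 / delta)
    )
    return S >= rhs

def log_failure_probability(S, T, gamma, N, num_locations):
    """Log of the union bound 2 (S |X|)^N exp(-2 S gamma^2 / T^2), computed in log space."""
    return math.log(2) + N * math.log(S * num_locations) - 2 * S * gamma**2 / T**2
```

For example, with the illustrative values $T = 2$, $\gamma = 0.5$, $\delta = 0.1$, $N = 3$, and $|\mathcal{X}| = 100$, the resulting $S$ satisfies (17), and the log failure probability falls below $\log\delta$.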

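A.5. Proof of Lemma 4

Theorem 1 implies that (a) $Q^*_n(z_D, \pi^*_n(z_D)) - Q^*_n(z_D, \pi^{\epsilon}_n(z_D)) \le Q^*_n(z_D, \pi^*_n(z_D)) - Q^{\epsilon}_n(z_D, \pi^{\epsilon}_n(z_D)) + N\gamma$ and (b) $Q^*_n(z_D, \pi^*_n(z_D)) - Q^{\epsilon}_n(z_D, \pi^{\epsilon}_n(z_D)) \le \max_{x \in \mathcal{X} \setminus D} |Q^*_n(z_D, x) - Q^{\epsilon}_n(z_D, x)| \le N\gamma$. By combining (a) and (b), $Q^*_n(z_D, \pi^*_n(z_D)) - Q^*_n(z_D, \pi^{\epsilon}_n(z_D)) \le N\gamma + N\gamma = 2N\gamma$ holds with probability at least $1 - \delta$ by setting $S$ and $\tau$ according to that in Theorem 1.

A.6. Proof of Theorem 2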

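By Lemma 4, $Q^*_n(z_D, \pi^*_n(z_D)) - Q^*_n(z_D, \pi^{\epsilon}_n(z_D)) \le 2N\gamma$ holds with probability at least $1 - \delta$. Otherwise, $Q^*_n(z_D, \pi^*_n(z_D)) - Q^*_n(z_D, \pi^{\epsilon}_n(z_D)) > 2N\gamma$ with probability at most $\delta$. In the latter case,

$$Q^*_n(z_D, \pi^*_n(z_D)) - Q^*_n(z_D, \pi^{\epsilon}_n(z_D)) \le (N - n + 1)\log\frac{\sigma_o}{\sigma_n} + \log|\Lambda| \le N\log\frac{\sigma_o}{\sigma_n} + \log|\Lambda| \qquad (19)$$

where the first inequality in (19) follows from (a) $Q^*_n(z_D, \pi^*_n(z_D)) = V^*_n(z_D) \le 0.5(N - n + 1)\log(2\pi e\sigma_o^2) + \log|\Lambda|$, by Lemma 10, and (b)

$$\begin{aligned}
Q^*_n(z_D, \pi^{\epsilon}_n(z_D)) &= H[Z_{\pi^{\epsilon}_n(z_D)} | z_D] + \mathbb{E}\big[ V^*_{n+1}(z_D \cup \{Z_{\pi^{\epsilon}_n(z_D)}\}) \,\big|\, z_D \big] \\
&\ge \frac{1}{2}\log\!\left(2\pi e\sigma_n^2\right) + \frac{1}{2}(N - n)\log\!\left(2\pi e\sigma_n^2\right) \\
&= \frac{1}{2}(N - n + 1)\log\!\left(2\pi e\sigma_n^2\right)
\end{aligned} \qquad (20)$$

such that the inequality in (20) is due to Lemmas 10 and 11, and the last inequality in (19) holds because $\sigma_o \ge \sigma_n$, by definition in (13) (hence, $\log(\sigma_o/\sigma_n) \ge 0$). Recall that $\pi^{\epsilon}$ is a stochastic policy (i.e., not a deterministic policy like $\pi^*$) due to its use of the truncated sampling procedure (Section 3.2), which implies $\pi^{\epsilon}_n(z_D)$ is a random variable. As a result,

$$\begin{aligned}
\mathbb{E}_{\pi^{\epsilon}_n(z_D)}\big[ Q^*_n(z_D, \pi^*_n(z_D)) - Q^*_n(z_D, \pi^{\epsilon}_n(z_D)) \big] &\le (1 - \delta)(2N\gamma) + \delta\left( N\log\frac{\sigma_o}{\sigma_n} + \log|\Lambda| \right) \\
&\le 2N\gamma + \delta\left( N\log\frac{\sigma_o}{\sigma_n} + \log|\Lambda| \right)
\end{aligned} \qquad (21)$$

where the expectation is with respect to random variable $\pi^{\epsilon}_n(z_D)$ and the first inequality follows from Lemma 4 and (19). Using

$$\mathbb{E}_{\pi^{\epsilon}_n(z_D)}\big[ Q^*_n(z_D, \pi^*_n(z_D)) - Q^*_n(z_D, \pi^{\epsilon}_n(z_D)) \big] = V^*_n(z_D) - \mathbb{E}_{\pi^{\epsilon}_n(z_D)}\big[ Q^*_n(z_D, \pi^{\epsilon}_n(z_D)) \big]$$

and

$$\mathbb{E}_{\pi^{\epsilon}_n(z_D)}\big[ Q^*_n(z_D, \pi^{\epsilon}_n(z_D)) \big] = \mathbb{E}_{\pi^{\epsilon}_n(z_D)}\big[ H[Z_{\pi^{\epsilon}_n(z_D)} | z_D] \big] + \mathbb{E}_{\pi^{\epsilon}_n(z_D)}\Big[ \mathbb{E}\big[ V^*_{n+1}(z_D \cup \{Z_{\pi^{\epsilon}_n(z_D)}\}) \,\big|\, z_D \big] \Big] ,$$

(21) therefore becomes

$$V^*_n(z_D) - \mathbb{E}_{\pi^{\epsilon}_n(z_D)}\big[ H[Z_{\pi^{\epsilon}_n(z_D)} | z_D] \big] \le \mathbb{E}_{\pi^{\epsilon}_n(z_D)}\Big[ \mathbb{E}\big[ V^*_{n+1}(z_D \cup \{Z_{\pi^{\epsilon}_n(z_D)}\}) \,\big|\, z_D \big] \Big] + 2N\gamma + \delta\left( N\log\frac{\sigma_o}{\sigma_n} + \log|\Lambda| \right) \qquad (22)$$

such that there is no expectation term on the RHS expression of (22) when $n = N$.

From (5), $V^{\pi}_1(z_{D_0})$ can be expanded into the following recursive formulation using chain rule for entropy:

$$V^{\pi}_n(z_D) = H[Z_{\pi_n(z_D)} | z_D] + \mathbb{E}\big[ V^{\pi}_{n+1}(z_D \cup \{Z_{\pi_n(z_D)}\}) \,\big|\, z_D \big] \qquad (23)$$

for stage $n = 1, \ldots, N$ where the expectation term is omitted from the RHS expression of $V^{\pi}_N$ at stage $N$.

Using (22) and (23) above, we will now give a proof by induction on $n$ that

$$V^*_n(z_D) - \mathbb{E}_{\{\pi^{\epsilon}_i\}_{i=n}^{N}}\big[ V^{\pi^{\epsilon}}_n(z_D) \big] \le (N - n + 1)\left( 2N\gamma + \delta\left( N\log\frac{\sigma_o}{\sigma_n} + \log|\Lambda| \right) \right) . \qquad (24)$$

When $n = N$,

$$\begin{aligned}
V^*_N(z_D) - \mathbb{E}_{\{\pi^{\epsilon}_N\}}\big[ V^{\pi^{\epsilon}}_N(z_D) \big] &= V^*_N(z_D) - \mathbb{E}_{\pi^{\epsilon}_N(z_D)}\big[ H[Z_{\pi^{\epsilon}_N(z_D)} | z_D] \big] \\
&\le 2N\gamma + \delta\left( N\log\frac{\sigma_o}{\sigma_n} + \log|\Lambda| \right)
\end{aligned}$$

such that the equality is due to (23) and the inequality follows from (22). So, (24) holds for the base case. Supposing (24) holds for $n + 1$ (i.e., induction hypothesis), we will prove that it holds for $n < N$:

$$\begin{aligned}
& V^*_n(z_D) - \mathbb{E}_{\{\pi^{\epsilon}_i\}_{i=n}^{N}}\big[ V^{\pi^{\epsilon}}_n(z_D) \big] \\
&= V^*_n(z_D) - \mathbb{E}_{\pi^{\epsilon}_n(z_D)}\big[ H[Z_{\pi^{\epsilon}_n(z_D)} | z_D] \big] - \mathbb{E}_{\pi^{\epsilon}_n(z_D)}\Big[ \mathbb{E}\Big[ \mathbb{E}_{\{\pi^{\epsilon}_i\}_{i=n+1}^{N}}\big[ V^{\pi^{\epsilon}}_{n+1}(z_D \cup \{Z_{\pi^{\epsilon}_n(z_D)}\}) \big] \,\Big|\, z_D \Big] \Big] \\
&\le \mathbb{E}_{\pi^{\epsilon}_n(z_D)}\Big[ \mathbb{E}\big[ V^*_{n+1}(z_D \cup \{Z_{\pi^{\epsilon}_n(z_D)}\}) \,\big|\, z_D \big] \Big] + 2N\gamma + \delta\left( N\log\frac{\sigma_o}{\sigma_n} + \log|\Lambda| \right) \\
&\quad - \mathbb{E}_{\pi^{\epsilon}_n(z_D)}\Big[ \mathbb{E}\Big[ \mathbb{E}_{\{\pi^{\epsilon}_i\}_{i=n+1}^{N}}\big[ V^{\pi^{\epsilon}}_{n+1}(z_D \cup \{Z_{\pi^{\epsilon}_n(z_D)}\}) \big] \,\Big|\, z_D \Big] \Big] \\
&= \mathbb{E}_{\pi^{\epsilon}_n(z_D)}\Big[ \mathbb{E}\Big[ V^*_{n+1}(z_D \cup \{Z_{\pi^{\epsilon}_n(z_D)}\}) - \mathbb{E}_{\{\pi^{\epsilon}_i\}_{i=n+1}^{N}}\big[ V^{\pi^{\epsilon}}_{n+1}(z_D \cup \{Z_{\pi^{\epsilon}_n(z_D)}\}) \big] \,\Big|\, z_D \Big] \Big] + 2N\gamma + \delta\left( N\log\frac{\sigma_o}{\sigma_n} + \log|\Lambda| \right) \\
&\le (N - n)\left( 2N\gamma + \delta\left( N\log\frac{\sigma_o}{\sigma_n} + \log|\Lambda| \right) \right) + 2N\gamma + \delta\left( N\log\frac{\sigma_o}{\sigma_n} + \log|\Lambda| \right) \\
&= (N - n + 1)\left( 2N\gamma + \delta\left( N\log\frac{\sigma_o}{\sigma_n} + \log|\Lambda| \right) \right)
\end{aligned}$$

such that the first equality is due to (23), and the first and second inequalities follow from (22) and induction hypothesis, respectively. From (24), when $n = 1$,

$$V^*_1(z_D) - \mathbb{E}_{\pi^{\epsilon}}\big[ V^{\pi^{\epsilon}}_1(z_D) \big] = V^*_1(z_D) - \mathbb{E}_{\{\pi^{\epsilon}_i\}_{i=1}^{N}}\big[ V^{\pi^{\epsilon}}_1(z_D) \big] \le N\left( 2N\gamma + \delta\left( N\log\frac{\sigma_o}{\sigma_n} + \log|\Lambda| \right) \right) .$$

Let $\epsilon = N(2N\gamma + \delta(N\log(\sigma_o/\sigma_n) + \log|\Lambda|))$ by setting $\gamma = \epsilon/(4N^2)$ and $\delta = \epsilon/(2N(N\log(\sigma_o/\sigma_n) + \log|\Lambda|))$. As a result, from Lemma 4, $\tau = O(\sqrt{\log(1/\epsilon)})$ and

$$S = O\!\left( \frac{\left(\log\frac{1}{\epsilon}\right)^2}{\epsilon^2} \log\!\left( \frac{\log\frac{1}{\epsilon}}{\epsilon^2} \right) \right) .$$

Theorem 2 then follows.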
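The final substitution can be checked numerically. The sketch below, with hypothetical function names and illustrative input values, splits a target loss $\epsilon$ into $\gamma$ and $\delta$ as in the proof and recombines them into the bound from (24) at $n = 1$; it assumes $N\log(\sigma_o/\sigma_n) + \log|\Lambda| > 0$ so that $\delta$ is well defined.

```python
import math

def split_error_budget(eps, N, sigma_o, sigma_n, num_params):
    """Return (gamma, delta, worst_case) with gamma = eps / (4 N^2) and
    delta = eps / (2 N worst_case), where worst_case = N log(sigma_o / sigma_n) + log|Lambda|.
    Assumes worst_case > 0 (otherwise delta is undefined)."""
    worst_case = N * math.log(sigma_o / sigma_n) + math.log(num_params)
    gamma = eps / (4 * N**2)
    delta = eps / (2 * N * worst_case)
    return gamma, delta, worst_case

def total_loss_bound(gamma, delta, N, worst_case):
    """Right-hand side of (24) at n = 1: N (2 N gamma + delta * worst_case)."""
    return N * (2 * N * gamma + delta * worst_case)
```

Each of the two terms contributes $\epsilon/2$, so the recombined bound equals $\epsilon$ exactly for any valid inputs.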

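A.7. Proof of Theorem 3

Theorem 3 Let $\pi$ be any stochastic policy. Then, $\mathbb{E}_{\pi}[V^{\pi}_1(z_{D_0})] \le V^*_1(z_{D_0})$.

Proof. We will give a proof by induction on $n$ that

$$\mathbb{E}_{\{\pi_i\}_{i=n}^{N}}\big[ V^{\pi}_n(z_D) \big] \le V^*_n(z_D) . \qquad (25)$$

When $n = N$,

$$\begin{aligned}
\mathbb{E}_{\{\pi_N\}}\big[ V^{\pi}_N(z_D) \big] &= \mathbb{E}_{\pi_N(z_D)}\big[ H[Z_{\pi_N(z_D)} | z_D] \big] \\
&\le \mathbb{E}_{\pi_N(z_D)}\Big[ \max_{x \in \mathcal{X} \setminus D} H[Z_x | z_D] \Big] \\
&= \mathbb{E}_{\pi_N(z_D)}\big[ V^*_N(z_D) \big] = V^*_N(z_D)
\end{aligned}$$

such that the first and second last equalities are due to (23) and (6), respectively. So, (25) holds for the base case. Supposing (25) holds for $n + 1$ (i.e., induction hypothesis), we will prove that it holds for $n < N$:

$$\begin{aligned}
\mathbb{E}_{\{\pi_i\}_{i=n}^{N}}\big[ V^{\pi}_n(z_D) \big] &= \mathbb{E}_{\pi_n(z_D)}\big[ H[Z_{\pi_n(z_D)} | z_D] \big] + \mathbb{E}_{\pi_n(z_D)}\Big[ \mathbb{E}\Big[ \mathbb{E}_{\{\pi_i\}_{i=n+1}^{N}}\big[ V^{\pi}_{n+1}(z_D \cup \{Z_{\pi_n(z_D)}\}) \big] \,\Big|\, z_D \Big] \Big] \\
&\le \mathbb{E}_{\pi_n(z_D)}\big[ H[Z_{\pi_n(z_D)} | z_D] \big] + \mathbb{E}_{\pi_n(z_D)}\Big[ \mathbb{E}\big[ V^*_{n+1}(z_D \cup \{Z_{\pi_n(z_D)}\}) \,\big|\, z_D \big] \Big] \\
&\le \mathbb{E}_{\pi_n(z_D)}\Big[ \max_{x \in \mathcal{X} \setminus D} H[Z_x | z_D] + \mathbb{E}\big[ V^*_{n+1}(z_D \cup \{Z_x\}) \,\big|\, z_D \big] \Big] \\
&= \mathbb{E}_{\pi_n(z_D)}\big[ V^*_n(z_D) \big] = V^*_n(z_D)
\end{aligned}$$

such that the first and second last equalities are, respectively, due to (23) and (6), and the first inequality follows from the induction hypothesis.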

A.8. Initializing Informed Heuristic Bounds

Due to the use of the truncated sampling procedure (Section 3.2), computing informed initial heuristic bounds for $V^{\epsilon}_n(z_D)$ is infeasible without expanding from its corresponding state to all possible states in the subsequent stages $n + 1, \ldots, N$, which we want to avoid. To resolve this issue, we instead derive informed bounds $\underline{V}^{\epsilon}_n(z_D)$ and $\overline{V}^{\epsilon}_n(z_D)$ that satisfy

$$\underline{V}^{\epsilon}_n(z_D) \le V^{\epsilon}_n(z_D) \le \overline{V}^{\epsilon}_n(z_D) \qquad (26)$$

with high probability: Using Theorem 1, $|V^*_n(z_D) - V^{\epsilon}_n(z_D)| \le \max_{x \in \mathcal{X} \setminus D} |Q^*_n(z_D, x) - Q^{\epsilon}_n(z_D, x)| \le N\gamma$, which implies $V^*_n(z_D) - N\gamma \le V^{\epsilon}_n(z_D) \le V^*_n(z_D) + N\gamma$ with probability at least $1 - \delta$. $V^*_n(z_D)$ can at least be naively bounded using the uninformed, domain-independent lower and upper bounds given in Lemma 10. In practice, domain-dependent bounds $\underline{V}^*_n(z_D)$ and $\overline{V}^*_n(z_D)$ (i.e., $\underline{V}^*_n(z_D) \le V^*_n(z_D) \le \overline{V}^*_n(z_D)$) tend to be more informed and we will show in Theorem 4 below how they can be derived efficiently. So, by setting $\underline{V}^{\epsilon}_n(z_D) = \underline{V}^*_n(z_D) - N\gamma$ and $\overline{V}^{\epsilon}_n(z_D) = \overline{V}^*_n(z_D) + N\gamma$ for $n < N$ and $\underline{V}^{\epsilon}_N(z_D) = \overline{V}^{\epsilon}_N(z_D) = \max_{x \in \mathcal{X} \setminus D} S^{-1}\sum_{i=1}^{S} -\log p(z^i_x | z_D)$, (26) holds with probability at least $1 - \delta$.

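Theorem 4 Given a set $z_D$ of observations and a space $\Lambda$ of parameters $\lambda$, define the a priori greedy design with unknown parameters as the set $S_n$ of $n \ge 1$ sampling locations where

$$\begin{aligned}
S_0 &\triangleq \emptyset \\
S_n &\triangleq S_{n-1} \cup \left\{ \arg\max_{x \in \mathcal{X}} \sum_{\lambda \in \Lambda} b_D(\lambda)\, H[Z_{S_{n-1} \cup \{x\}} | z_D, \lambda] - \sum_{\lambda \in \Lambda} b_D(\lambda)\, H[Z_{S_{n-1}} | z_D, \lambda] \right\} .
\end{aligned} \qquad (27)$$

Similarly, define the a priori greedy design with known parameters $\lambda$ as the set $S^{\lambda}_n$ of $n \ge 1$ sampling locations where

$$\begin{aligned}
S^{\lambda}_0 &\triangleq \emptyset \\
S^{\lambda}_n &\triangleq S^{\lambda}_{n-1} \cup \left\{ \arg\max_{x \in \mathcal{X}} H[Z_{S^{\lambda}_{n-1} \cup \{x\}} | z_D, \lambda] - H[Z_{S^{\lambda}_{n-1}} | z_D, \lambda] \right\} .
\end{aligned} \qquad (28)$$

Then,

$$\begin{aligned}
H[Z_{\{\pi^*_i\}_{i=N-n+1}^{N}} | z_D] &\ge \sum_{\lambda \in \Lambda} b_D(\lambda)\, H[Z_{S_n} | z_D, \lambda] \\
H[Z_{\{\pi^*_i\}_{i=N-n+1}^{N}} | z_D] &\le \sum_{\lambda \in \Lambda} b_D(\lambda) \left[ \frac{e}{e - 1} H[Z_{S^{\lambda}_n} | z_D, \lambda] + \frac{nr}{e - 1} \right] + H[\Lambda]
\end{aligned} \qquad (29)$$

where $\{\pi^*_i\}_{i=N-n+1}^{N} = \arg\max_{\{\pi_i\}_{i=N-n+1}^{N}} H[Z_{\{\pi_i\}_{i=N-n+1}^{N}} | z_D]$, $\Lambda$ denotes the set of random parameters corresponding to the realized parameters $\lambda$, and $r = -\min(0, 0.5\log(2\pi e\sigma_n^2)) \ge 0$.

Remark. $V^*_{N-n+1}(z_D) = H[Z_{\{\pi^*_i\}_{i=N-n+1}^{N}} | z_D]$, by definition. Hence, the lower and upper bounds of $H[Z_{\{\pi^*_i\}_{i=N-n+1}^{N}} | z_D]$ (29) constitute informed domain-dependent bounds for $V^*_{N-n+1}(z_D)$ that can be derived efficiently since both $S_n$ (27) and $\{S^{\lambda}_n\}_{\lambda \in \Lambda}$ (28) can be computed in polynomial time with respect to the interested variables.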

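Proof. To prove the lower bound,

$$\begin{aligned}
H[Z_{\{\pi^*_i\}_{i=N-n+1}^{N}} | z_D] &= \max_{\{\pi_i\}_{i=N-n+1}^{N}} H[Z_{\{\pi_i\}_{i=N-n+1}^{N}} | z_D] \\
&\ge \max_{\{\pi_i\}_{i=N-n+1}^{N}} \sum_{\lambda \in \Lambda} b_D(\lambda)\, H[Z_{\{\pi_i\}_{i=N-n+1}^{N}} | z_D, \lambda] \\
&\ge \max_{S \subseteq \mathcal{X} : |S| = n} \sum_{\lambda \in \Lambda} b_D(\lambda)\, H[Z_S | z_D, \lambda] \\
&\ge \sum_{\lambda \in \Lambda} b_D(\lambda)\, H[Z_{S_n} | z_D, \lambda] .
\end{aligned}$$

The first inequality follows from the monotonicity of conditional entropy (i.e., "information never hurts" bound) (Cover & Thomas, 1991). The second inequality holds because the optimal set $S^* \triangleq \arg\max_{S \subseteq \mathcal{X} : |S| = n} \sum_{\lambda \in \Lambda} b_D(\lambda)\, H[Z_S | z_D, \lambda]$ is an optimal a priori design (i.e., non-sequential) that does not perform better than the optimal sequential policy $\pi^*$ (Krause & Guestrin, 2007). The third inequality is due to definition of $S_n$.

To prove the upper bound,

$$\begin{aligned}
H[Z_{\{\pi^*_i\}_{i=N-n+1}^{N}} | z_D] &\le \sum_{\lambda \in \Lambda} b_D(\lambda) \max_{S \subseteq \mathcal{X} : |S| = n} H[Z_S | z_D, \lambda] + H[\Lambda] \\
&\le \sum_{\lambda \in \Lambda} b_D(\lambda) \left[ \frac{e}{e - 1} H[Z_{S^{\lambda}_n} | z_D, \lambda] + \frac{nr}{e - 1} \right] + H[\Lambda]
\end{aligned}$$

such that the first inequality is due to Theorem 1 of Krause & Guestrin (2007), and the second inequality follows from Lemma 18.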
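The greedy construction in (27) can be sketched as follows for the special case $D = \emptyset$, where the conditional entropies reduce to entropies under the prior covariance. This is an illustrative sketch only: the squared exponential covariance on 1-D locations and the function names `se_cov`, `joint_entropy`, and `greedy_design` are assumptions made here, not part of the original analysis. The subtracted term in (27) does not depend on $x$ and is therefore dropped from the score, and already selected locations are skipped.

```python
import numpy as np

def se_cov(x, ell, sigma_s, sigma_n):
    """Prior covariance of noisy outputs at 1-D locations x (assumed squared exponential kernel)."""
    d = x[:, None] - x[None, :]
    return sigma_s**2 * np.exp(-0.5 * (d / ell) ** 2) + sigma_n**2 * np.eye(len(x))

def joint_entropy(cov):
    """Differential entropy (nats) of a multivariate Gaussian with covariance cov."""
    n = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (n * np.log(2 * np.pi * np.e) + logdet)

def greedy_design(locations, n, ells, belief, sigma_s, sigma_n):
    """Greedily grow S_n by maximizing the belief-weighted joint entropy, as in (27) with D empty."""
    covs = [se_cov(locations, ell, sigma_s, sigma_n) for ell in ells]
    chosen = []
    for _ in range(n):
        best, best_score = None, -np.inf
        for j in range(len(locations)):
            if j in chosen:
                continue
            idx = chosen + [j]
            score = sum(
                b * joint_entropy(c[np.ix_(idx, idx)]) for b, c in zip(belief, covs)
            )
            if score > best_score:
                best, best_score = j, score
        chosen.append(best)
    return [locations[j] for j in chosen]
```

Evaluating the belief-weighted entropy of the returned set gives the lower bound in (29) for this special case; the known-parameter design (28) corresponds to calling the same routine with a single length-scale and belief `[1.0]`.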

A.9. Performance Guarantee of ⟨α, ϵ⟩-BAL Policy $\pi^{\langle\alpha,\epsilon\rangle}$

Lemma 5 Suppose that a set $z_D$ of observations, a budget of $N - n + 1$ sampling locations for $1 \le n \le N$, $\gamma > 0$, $0 < \delta < 1$, and $\alpha > 0$ are given. $Q^*_n(z_D, \pi^*_n(z_D)) - Q^*_n(z_D, \pi^{\langle\alpha,\epsilon\rangle}_n(z_D)) \le 2(N\gamma + \alpha)$ holds with probability at least $1 - \delta$ by setting $S$ and $\tau$ according to that in Theorem 1.

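[...] $|V^*_1(z_{D_0}) - \underline{V}^{\epsilon}_1(z_{D_0})| \le N\gamma + \alpha$ with probability at least $1 - \delta$. In general, given that the length of planning horizon is reduced to $N - n + 1$ for $1 \le n \le N$, the above inequalities are equivalent to

$$\begin{aligned}
|\overline{V}^{\epsilon}_n(z_D) - \underline{V}^{\epsilon}_n(z_D)| &\le \alpha \\
|V^*_n(z_D) - \underline{V}^{\epsilon}_n(z_D)| &= \left| Q^*_n(z_D, \pi^*_n(z_D)) - \underline{Q}^{\epsilon}_n(z_D, \pi^{\langle\alpha,\epsilon\rangle}_n(z_D)) \right| \le N\gamma + \alpha
\end{aligned} \qquad (30)$$

by increasing/shifting the indices of $\overline{V}^{\epsilon}_1$, $\underline{V}^{\epsilon}_1$, and $V^*_1$ above from 1 to $n$ so that these value functions start at stage $n$ instead.

$$\begin{aligned}
& Q^*_n(z_D, \pi^*_n(z_D)) - Q^*_n(z_D, \pi^{\langle\alpha,\epsilon\rangle}_n(z_D)) \\
&= Q^*_n(z_D, \pi^*_n(z_D)) - \underline{Q}^{\epsilon}_n(z_D, \pi^{\langle\alpha,\epsilon\rangle}_n(z_D)) \\
&\quad + \underline{Q}^{\epsilon}_n(z_D, \pi^{\langle\alpha,\epsilon\rangle}_n(z_D)) - Q^{\epsilon}_n(z_D, \pi^{\langle\alpha,\epsilon\rangle}_n(z_D)) \\
&\quad + Q^{\epsilon}_n(z_D, \pi^{\langle\alpha,\epsilon\rangle}_n(z_D)) - Q^*_n(z_D, \pi^{\langle\alpha,\epsilon\rangle}_n(z_D)) \\
&\le N\gamma + \alpha + \frac{1}{S}\sum_{i=1}^{S} \left( \underline{V}^{\epsilon}_{n+1}\big(z_D \cup \{z^i_{\pi^{\langle\alpha,\epsilon\rangle}_n(z_D)}\}\big) - V^{\epsilon}_{n+1}\big(z_D \cup \{z^i_{\pi^{\langle\alpha,\epsilon\rangle}_n(z_D)}\}\big) \right) + N\gamma \\
&\le 2(N\gamma + \alpha)
\end{aligned}$$

where the inequalities follow from (8), (14), (30), and Theorem 1.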

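Theorem 5 Given a set $z_{D_0}$ of prior observations, a budget of $N$ sampling locations, $\alpha > 0$, and $\epsilon > 4N\alpha$, $V^*_1(z_{D_0}) - \mathbb{E}_{\pi^{\langle\alpha,\epsilon\rangle}}[V^{\pi^{\langle\alpha,\epsilon\rangle}}_1(z_{D_0})] \le \epsilon$ by setting and substituting $\gamma = \epsilon/(4N^2)$ and $\delta = (\epsilon/(2N) - 2\alpha)/(N\log(\sigma_o/\sigma_n) + \log|\Lambda|)$ into $S$ and $\tau$ in Theorem 1 to give $\tau = O(\sqrt{\log(1/\epsilon)})$ and

$$S = O\!\left( \frac{(\log(1/\epsilon))^2}{\epsilon^2} \log\!\left( \frac{\log(1/\epsilon)}{\epsilon(\epsilon - \alpha)} \right) \right) .$$

Proof Sketch. The proof follows from Lemma 5 and is similar to that of Theorem 2.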

A.10. Time Complexity of ⟨α, ϵ⟩-BAL Algorithm

Suppose that our ⟨α, ϵ⟩-BAL algorithm runs $k$ simulated exploration paths during its lifetime where $k$ actually depends on the available time for planning. Then, since each exploration path has at most $N$ stages and each stage generates at most $S|\mathcal{X}|$ states, there will be at most $O(kNS|\mathcal{X}|)$ states generated during the whole lifetime of our ⟨α, ϵ⟩-BAL algorithm. So, to analyze the overall time complexity of our ⟨α, ϵ⟩-BAL algorithm, the processing cost at each state is first quantified, which, according to EXPLORE of Algorithm 1, includes the cost of sampling (lines 2-5), initializing (line 6) and updating the corresponding heuristic bounds (line 14). In particular, the cost of sampling at each state involves training the GPs (i.e., $O(N^3)$) and computing the predictive distributions using (1) and (2) (i.e., $O(|\mathcal{X}|N^2)$) for each set of realized parameters $\lambda \in \Lambda$ and the cost of generating $S|\mathcal{X}|$ samples from a mixture of $|\Lambda|$ Gaussian distributions (i.e., $O(|\Lambda|S|\mathcal{X}|)$) by assuming that drawing a sample from a Gaussian distribution consumes a unit processing cost. This results in a total sampling complexity of $O(|\Lambda|(N^3 + |\mathcal{X}|N^2 + S|\mathcal{X}|))$.

Now, let $O(\Delta)$ denote the processing cost of initializing the heuristic bounds at each state, which depends on the actual bounding scheme being used. The total processing cost at each state is therefore $O(|\Lambda|(N^3 + |\mathcal{X}|N^2 + S|\mathcal{X}|) + \Delta + S|\mathcal{X}|)$ where the last term corresponds to the cost of updating bounds by (14). In addition, to search for the most promising state to explore in $O(1)$ at each stage (lines 10-11), the set of unexplored states is maintained in a priority queue (line 12) using the corresponding exploration criterion, thus incurring an extra management cost (i.e., updating the queue) of $O(\log(kNS|\mathcal{X}|))$. That is, the total time complexity of simulating $k$ exploration paths in our ⟨α, ϵ⟩-BAL algorithm is $O(kNS|\mathcal{X}|(|\Lambda|(N^3 + |\mathcal{X}|N^2 + S|\mathcal{X}|) + \Delta + \log(kNS|\mathcal{X}|)))$.
