Nonmyopic -Bayes-Optimal Active Learning of Gaussian Processes
Trong Nghia Hoang† [email protected]
Kian Hsiang Low† [email protected]
Patrick Jaillet§ [email protected]
Mohan Kankanhalli† [email protected]
†Department of Computer Science, National University of Singapore, Republic of Singapore
§Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, USA
Abstract
A fundamental issue in active learning of Gaussian processes is that of the exploration-exploitation trade-off. This paper presents a novel nonmyopic -Bayes-optimal active learn-ing(-BAL) approach that jointly and naturally optimizes the trade-off. In contrast, existing works have primarily developed myopic/greedy algorithms or performed exploration and ex-ploitation separately. To perform active learning in real time, we then propose an anytime algo-rithm based on -BAL with performance guaran-tee and empirically demonstrate using synthetic and real-world datasets that, with limited budget, it outperforms the state-of-the-art algorithms.
1. Introduction
Active learning has become an increasingly important focal theme in many environmental sensing and monitoring ap-plications (e.g., precision agriculture, mineral prospecting
(Low et al.,2007), monitoring of ocean and freshwater
phe-nomena like harmful algal blooms (Dolan et al.,2009;
Pod-nar et al.,2010), forest ecosystems, or pollution) where a
high-resolution in situ sampling of the spatial phenomenon of interest is impractical due to prohibitively costly sam-pling budget requirements (e.g., number of deployed sen-sors, energy consumption, mission time): For such appli-cations, it is thus desirable to select and gather the most informativeobservations/data for modeling and predicting the spatially varying phenomenon subject to some budget constraints, which is the goal of active learning and also known as the active sensing problem.
To elaborate, solving the active sensing problem amounts to deriving an optimal sequential policy that plans/decides the most informative locations to be observed for
minimiz-Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copy-right 2014 by the author(s).
ing the predictive uncertainty of the unobserved areas of a phenomenon given a sampling budget. To achieve this, many existing active sensing algorithms (Cao et al.,2013;
Chen et al.,2012;2013b;Krause et al.,2008;Low et al.,
2008;2009;2011;2012;Singh et al.,2009) have modeled
the phenomenon as a Gaussian process (GP), which allows its spatial correlation structure to be formally character-ized and its predictive uncertainty to be formally quantified (e.g., based on mean-squared error, entropy, or mutual in-formation criterion). However, they have assumed the spa-tial correlation structure (specifically, the parameters defin-ing it) to be known, which is often violated in real-world applications, or estimated crudely using sparse prior data. So, though they aim to select sampling locations that are optimal with respect to the assumed or estimated parame-ters, these locations tend to be sub-optimal with respect to the true parameters, thus degrading the predictive perfor-mance of the learned GP model.
In practice, the spatial correlation structure of a phe-nomenon is usually not known. Then, the predictive per-formance of the GP modeling the phenomenon depends on how informative the gathered observations/data are for both parameter estimation as well as spatial prediction given the true parameters. Interestingly, as revealed in previous geostatistical studies (Martin, 2001; M¨uller, 2007), poli-cies that are efficient for parameter estimation are not nec-essarily efficient for spatial prediction with respect to the true model. Thus, the active sensing problem involves a potential trade-off between sampling the most informative locations for spatial prediction given the current, possibly incomplete knowledge of the model parameters (i.e., ex-ploitation) vs. observing locations that gain more informa-tion about the parameters (i.e., explorainforma-tion):
How then does an active sensing algorithm trade off be-tween these two possibly conflicting sampling objectives? To tackle this question, one principled approach is to frame active sensing as a sequential decision problem that jointly and naturally optimizes the above exploration-exploitation
Nonmyopic -Bayes-Optimal Active Learning of Gaussian Processes
trade-off while maintaining a Bayesian belief over the model parameters. This intuitively means a policy that biases towards observing informative locations for spatial prediction given the current model prior may be penalized if it entails a highly dispersed posterior over the model pa-rameters. So, the resulting induced policy is guaranteed to be optimal in the expected active sensing performance. Un-fortunately, such a nonmyopic Bayes-optimal policy cannot be derived exactly due to an uncountable set of candidate observations and unknown model parameters (Solomon &
Zacks,1970). As a result, most existing works (Diggle,
2006;Houlsby et al.,2012;Park & Pillow,2012;
Zimmer-man, 2006; Ouyang et al., 2014) have circumvented the
trade-off by resorting to the use of myopic/greedy (hence, sub-optimal) policies.
To the best of our knowledge, the only notable nonmy-opic active sensing algorithm for GPs (Krause & Guestrin,
2007) advocates tackling exploration and exploitation sep-arately, instead of jointly and naturally optimizing their trade-off, to sidestep the difficulty of solving the Bayesian sequential decision problem. Specifically, it performs a probably approximately correct (PAC)-style exploration until it can verify that the performance loss of greedy ex-ploitation lies within a user-specified threshold. But, such an algorithm is sub-optimal in the presence of budget con-straints due to the following limitations: (a) It is unclear how an optimal threshold for exploration can be determined given a sampling budget, and (b) even if such a threshold is available, the PAC-style exploration is typically designed to satisfy a worst-case sample complexity rather than to be optimal in the expected active sensing performance, thus resulting in an overly-aggressive exploration (Section4.1). This paper presents an efficient decision-theoretic planning approach to nonmyopic active sensing/learning that can still preserve and exploit the principled Bayesian sequential decision problem framework for jointly and naturally opti-mizing the exploration-exploitation trade-off (Section3.1) and consequently does not incur the limitations of the al-gorithm of Krause & Guestrin (2007). In particular, al-though the exact Bayes-optimal policy to the active sens-ing problem cannot be derived (Solomon & Zacks,1970), we show that it is in fact possible to solve for a nonmy-opic -Bayes-optimal active learning (-BAL) policy (Sec-tions3.2and3.3) given a user-defined bound , which is the main contribution of our work here. In other words, our proposed -BAL policy can approximate the optimal ex-pected active sensing performance arbitrarily closely (i.e., within an arbitrary loss bound ). In contrast, the algorithm
ofKrause & Guestrin(2007) can only yield a sub-optimal
performance bound1. To meet the real-time requirement 1
Its induced policy is guaranteed not to achieve worse than the optimal performance by more than a factor of 1/e.
in time-critical applications, we then propose an asymp-totically -optimal, branch-and-bound anytime algorithm based on -BAL with performance guarantee (Section3.4). We empirically demonstrate using both synthetic and real-world datasets that, with limited budget, our proposed ap-proach outperforms state-of-the-art algorithms (Section4).
2. Modeling Spatial Phenomena with
Gaussian Processes (GPs)
The GP can be used to model a spatial phenomenon of in-terest as follows: The phenomenon is defined to vary as a realization of a GP. Let X denote a set of sampling lo-cations representing the domain of the phenomenon such that each location x ∈ X is associated with a realized (ran-dom) measurement zx(Zx) if x is observed/sampled
(un-observed). Let ZX , {Zx}x∈X denote a GP, that is, every
finite subset of ZX has a multivariate Gaussian distribution
(Chen et al.,2013a;Rasmussen & Williams,2006). The
GP is fully specified by its prior mean µx , E[Zx] and
covariance σxx0|λ , cov[Zx, Zx0|λ] for all x, x0 ∈ X , the
latter of which characterizes the spatial correlation struc-ture of the phenomenon and can be defined using a covari-ance function parameterized by λ. A common choice is the squared exponential covariance function:
σxx0|λ, (σλs)2exp − 1 2 P X i=1 [sx]i− [sx0]i `λ i 2! +(σλn)2δxx0
where [sx]i([sx0]i) is the ith component of the P
-dimensional feature vector sx(sx0), the set of realized
pa-rameters λ,σλ
n, σλs, `λ1, . . . , `λP ∈ Λ are, respectively,
the square root of noise variance, square root of signal vari-ance, and length-scales, and δxx0 is a Kronecker delta that
is 1 if x = x0and 0 otherwise.
Supposing λ is known and a set zD of realized
measure-ments is available for some set D ⊂ X of observed lo-cations, the GP can exploit these observations to predict the measurement for any unobserved location x ∈ X \ D as well as provide its corresponding predictive uncertainty using the Gaussian predictive distribution p(zx|zD, λ) ∼
N (µx|D,λ, σxx|D,λ) with the following posterior mean and
variance, respectively:
µx|D,λ, µx+ ΣxD|λΣ−1DD|λ(zD− µD) (1)
σxx|D,λ, σxx|λ− ΣxD|λΣ−1DD|λΣDx|λ (2)
where, with a slight abuse of notation, zDis to be perceived
as a column vector in (1), µDis a column vector with mean
components µx0for all x0∈ D, ΣxD|λis a row vector with
covariance components σxx0|λfor all x0 ∈ D, ΣDx|λis the
transpose of ΣxD|λ, and ΣDD|λis a covariance matrix with
Nonmyopic -Bayes-Optimal Active Learning of Gaussian Processes
correlation structure (i.e., λ) is not known, a probabilistic belief bD(λ) , p(λ|zD) can be maintained/tracked over all
possible λ and updated using Bayes’ rule to the posterior belief bD∪{x}(λ) given a newly available measurement zx:
bD∪{x}(λ) ∝ p(zx|zD, λ) bD(λ) . (3)
Using belief bD, the predictive distribution p(zx|zD) can be
obtained by marginalizing out λ: p(zx|zD) =
X
λ∈Λ
p(zx|zD, λ) bD(λ) . (4)
3. Nonmyopic -Bayes-Optimal Active
Learning (-BAL)
3.1. Problem Formulation
To cast active sensing as a Bayesian sequential deci-sion problem, let us first define a sequential active sens-ing/learning policy π given a budget of N sampling loca-tions: Specifically, the policy π , {πn}Nn=1 is structured
to sequentially decide the next location πn(zD) ∈ X \ D
to be observed at each stage n based on the current ob-servations zD over a finite planning horizon of N stages.
Recall from Section1that the active sensing problem in-volves planning/deciding the most informative locations to be observed for minimizing the predictive uncertainty of the unobserved areas of a phenomenon. To achieve this, we use the entropy criterion (Cover & Thomas,1991) to mea-sure the informativeness and predictive uncertainty. Then, the value under a policy π is defined to be the joint entropy of its selected observations when starting with some prior observations zD0 and following π thereafter:
V1π(zD0) , H[Zπ|zD0] , −
Z
p(zπ|zD0) log p(zπ|zD0) dzπ
(5) where Zπ (zπ) is the set of random (realized)
measure-ments taken by policy π and p(zπ|zD0) is defined in a
sim-ilar manner to (4).
To solve the active sensing problem, the notion of Bayes-optimality2is exploited for selecting observations of largest possible joint entropy with respect to all possible induced sequences of future beliefs (starting from initial prior be-lief bD0) over candidate sets of model parameters λ, as
detailed next. Formally, this entails choosing a sequen-tial policy π to maximize Vπ
1(zD0) (5), which we call the
Bayes-optimal active learning (BAL) policy π∗. That is, V1∗(zD0) , V π∗ 1 (zD0) = maxπV π 1(zD0). When π ∗ is
plugged into (5), the following N -stage Bellman equations result from the chain rule for entropy:
2
Bayes-optimality is previously studied in reinforcement learning whose developed theories (Poupart et al.,2006;Hoang & Low,2013) cannot be applied here because their assumptions of discrete-valued observations and Markov property do not hold.
Vn∗(zD) = HZπ∗ n(zD)|zD +EV ∗ n+1zD∪ {Zπ∗ n(zD)}|zD = max x∈X \D Q∗n(zD, x) Q∗n(zD, x) , H[Zx|zD] + EVn+1∗ (zD∪ {Zx})|zD H[Zx|zD] , − Z p(zx|zD) log p(zx|zD) dzx (6)
for stage n = 1, . . . , N where p(zx|zD) is defined in (4)
and the expectation terms are omitted from the right-hand side (RHS) expressions of VN∗ and Q∗N at stage N . At each stage, the belief bD(λ) is needed to compute Q∗n(zD, x)
in (6) and can be uniquely determined from initial prior belief bD0 and observations zD\D0 using (3). To
under-stand how the BAL policy π∗ jointly and naturally opti-mizes the exploration-exploitation trade-off, its selected lo-cation π∗n(zD) = arg maxx∈X \DQ∗n(zD, x) at each stage
n affects both the immediate payoff HZπ∗
n(zD)|zD given
current belief bD(i.e., exploitation) as well as the posterior
belief bD∪{π∗
n(zD)}, the latter of which influences expected
future payoff E[Vn+1∗ (zD∪ {Zπ∗
n(zD)})|zD] and builds in
the information gathering option (i.e., exploration). Interestingly, the work of Low et al. (2009) has revealed that the above recursive formulation (6) can be perceived as the sequential variant of the well-known maximum entropy sampling problem (Shewry & Wynn,1987) and established an equivalence result that the maximum-entropy observa-tions selected by π∗achieve a dual objective of minimizing the posterior joint entropy (i.e., predictive uncertainty) re-maining in the unobserved locations of the phenomenon. Unfortunately, the BAL policy π∗ cannot be derived ex-actly because the stage-wise entropy and expectation terms in (6) cannot be evaluated in closed form due to an uncount-able set of candidate observations and unknown model pa-rameters λ (Section 1). To overcome this difficulty, we show in the next subsection how it is possible to solve for an -BAL policy π, that is, the joint entropy of its selected
observations closely approximates that of π∗within an ar-bitrary loss bound > 0.
3.2. -BAL Policy
The key idea underlying the design and construction of our proposed nonmyopic -BAL policy π is to approximate the entropy and expectation terms in (6) at every stage us-ing a form of truncated samplus-ing to be described next: Definition 1 (τ -Truncated Observation) Define random measurement bZxby truncatingZxat−τ andb τ as follows:b
b Zx, −bτ if Zx≤ −bτ , Zx if −bτ < Zx<bτ , b τ if Zx≥τ .b
Then, bZxhas a distribution of mixed type with its
contin-uous component defined as f ( bZx = zx|zD) , p(Zx =
zx|zD) for −bτ < zx < bτ and its discrete component de-fined asf ( bZx =bτ |zD) , P (Zx ≥bτ |zD) =
R∞
b
Nonmyopic -Bayes-Optimal Active Learning of Gaussian Processes zx|zD)dzx andf ( bZx = −bτ |zD) , P (Zx ≤ −bτ |zD) = R−τb −∞p(Zx= zx|zD)dzx. Let µ(D, Λ) , maxx∈X \D,λ∈Λµx|D,λ, µ(D, Λ) , minx∈X \D,λ∈Λµx|D,λ, and b τ , max µ(D, Λ) − τ , |µ(D, Λ) + τ | (7) for some0 ≤ τ ≤τ . Then, a realized measurement of bb Zx
is said to be aτ -truncated observation for location x. Specifically, given that a set zDof realized measurements is
available, a finite set of S τ -truncated observations {zxi}S i=1
can be generated for every candidate location x ∈ X \ D at each stage n by identically and independently sampling from p(zx|zD) (4) and then truncating each of them
ac-cording to zix← zixmin |zix|,bτ/|z
i
x|. These generated τ
-truncated observations can be exploited for approximating Vn∗(6) through the following Bellman equations:
Vn(zD) , max x∈X \D Q n(zD, x) Q n(zD, x) , 1 S S X i=1 − log p zix|zD + Vn+1 zD∪zxi (8)
for stage n = 1, . . . , N such that there is no V N +1
term on the RHS expression of QN at stage N . Like the BAL policy π∗ (Section3.1), the location πn(zD) =
arg maxx∈X \DQn(zD, x) selected by our -BAL policy
πat each stage n also jointly and naturally optimizes the trade-off between exploitation (i.e., by maximizing imme-diate payoff S−1PS
i=1− log p(z i π
n(zD)|zD) given the
cur-rent belief bD) vs. exploration (i.e., by improving
poste-rior belief bD∪{π
n(zD)}to maximize average future payoff
S−1PS
i=1Vn+1 (zD∪ {ziπ
n(zD)})). Unlike the
determinis-tic BAL policy π∗, our -BAL policy πis stochastic due to its use of the above truncated sampling procedure. 3.3. Theoretical Analysis
The main difficulty in analyzing the active sensing per-formance of our stochastic -BAL policy π (i.e., relative to that of BAL policy π∗) lies in determining how its -Bayes optimality can be guaranteed by choosing appropri-ate values of the truncappropri-ated sampling parameters S and τ (Section3.2). To achieve this, we have to formally under-stand how S and τ can be specified and varied in terms of the user-defined loss bound , budget of N sampling lo-cations, domain size |X | of the phenomenon, and proper-ties/parameters characterizing the spatial correlation struc-ture of the phenomenon (Section2), as detailed below. The first step is to show that Qn(8) is in fact a good
approx-imation of Q∗n(6) for some chosen values of S and τ . There
are two sources of error arising in such an approximation: (a) In the truncated sampling procedure (Section3.2), only a finite set of τ -truncated observations is generated for ap-proximating the stage-wise entropy and expectation terms
in (6), and (b) computing Q
ndoes not involve utilizing the
values of Vn+1∗ but that of its approximation V
n+1instead.
To facilitate capturing the error due to finite truncated sam-pling described in (a), the following intermediate function is introduced: Wn∗(zD, x) , 1 S S X i=1 − log p zi x|zD + Vn+1∗ zD∪ {zix} (9) for stage n = 1, . . . , N such that there is no VN +1∗ term on the RHS expression of WN∗ at stage N . The first lemma be-low reveals that if the error |Q∗n(zD, x) − Wn∗(zD, x)| due
to finite truncated sampling can be bounded for all tuples (n, zD, x) generated at stage n = n0, . . . , N by (8) to
com-pute Vn0 for 1 ≤ n0 ≤ N , then Qn0 (8) can approximate
Q∗n0 (6) arbitrarily closely:
Lemma 1 Suppose that a set zD0of observations, a budget
ofN − n0+ 1 sampling locations for 1 ≤ n0 ≤ N , S ∈ Z+,
andγ > 0 are given. If
|Q∗n(zD, x) − Wn∗(zD, x)| ≤ γ (10)
for all tuples(n, zD, x) generated at stage n = n0, . . . , N
by(8) to compute Vn0(zD0), then, for all x0∈ X \ D0,
|Q∗n0(zD0, x0) − Qn0(zD0, x0)| ≤ (N − n0+ 1)γ . (11)
Its proof is given in AppendixA.1. The next two lemmas show that, with high probability, the error |Q∗n(zD, x) −
Wn∗(zD, x)| due to finite truncated sampling can indeed be
bounded from above by γ (10) for all tuples (n, zD, x)
gen-erated at stage n = n0, . . . , N by (8) to compute V n0 for
1 ≤ n0≤ N :
Lemma 2 Suppose that a set zD0of observations, a budget
ofN − n0+ 1 sampling locations for 1 ≤ n0 ≤ N , S ∈ Z+,
andγ > 0 are given. For all tuples (n, zD, x) generated at
stagen = n0, . . . , N by (8) to compute Vn0(zD0), P|Q∗n(zD, x) − Wn∗(zD, x)| ≤ γ ≥ 1−2 exp −2Sγ 2 T2 whereT , ON 2κ2Nτ2 σ2 n +N logσo σn +log |Λ| by setting τ = O σo s logσ 2 o γ N2κ2N+σ2 o σ2 n +N logσo σn +log |Λ|
withκ, σn2, andσ2odefined as follows:
κ , 1 + 2 max x0,u∈X \D:x06=u,λ∈Λ,D σx0u|D,λ /σuu|D,λ, (12) σ2n, min λ∈Λ(σ λ n) 2, and σ2 o , max λ∈Λ (σ λ s) 2+ (σλ n) 2. (13) Refer to AppendixA.2for its proof.
Remark1. Deriving such a probabilistic bound in Lemma2
typically involves the use of concentration inequalities for the sum of independent bounded random variables like the Hoeffding’s, Bennett’s, or Bernstein’s inequalities. How-ever, since the originally Gaussian distributed observations
Nonmyopic -Bayes-Optimal Active Learning of Gaussian Processes
are unbounded, sampling from p(zx|zD) (4) without
trun-cation will generate unbounded versions of {zi
x}Si=1 and
consequently make each summation term − log p(zi x|zD)+
Vn+1∗ (zD ∪ {zix}) on the RHS expression of Wn∗ (9)
un-bounded, hence invalidating the use of these concentration inequalities. To resolve this complication, our trick is to exploit the truncated sampling procedure (Section 3.2) to generate bounded τ -truncated observations (Definition 1) (i.e., |zxi| ≤ τ for i = 1, . . . , S), thus resulting in eachb summation term − log p(zxi|zD) + Vn+1∗ (zD∪ {zix}) being
bounded (AppendixA.2). This enables our use of Hoeffd-ing’s inequality to derive the probabilistic bound.
Remark 2. It can be observed from Lemma 2 that the amount of truncation has to be reduced (i.e., higher cho-sen value of τ ) when (a) a tighter bound γ on the error |Q∗
n(zD, x) − Wn∗(zD, x)| due to finite truncated sampling
is desired, (b) a greater budget of N sampling locations is available, (c) a larger space Λ of candidate model pa-rameters is preferred, (d) the spatial phenomenon varies with more intensity and less noise (i.e., assuming all can-didate signal and noise variance parameters, respectively, (σλs)2 and (σλn)2 are specified close to the true large sig-nal and small noise variances), and (e) its spatial corre-lation structure yields a bigger κ. To elaborate on (e), note that Lemma 2 still holds for any value of κ larger than that set in (12): Since |σx0u|D,λ|2≤ σx0x0|D,λσuu|D,λ
for all x0 6= u ∈ X \ D due to the symmetric positive-definiteness of Σ(X \D)(X \D)|D,λ, κ can be set to 1 +
2 maxx0,u∈X \D,λ∈Λ,Dpσx0x0|D,λ/σuu|D,λ. Then,
sup-posing all candidate length-scale parameters are specified close to the true length-scales, a phenomenon with extreme length-scales tending to 0 (i.e., with white-noise process measurements) or ∞ (i.e., with constant measurements) will produce highly similar σx0x0|D,λfor all x0 ∈ X \ D,
thus resulting in smaller κ and hence smaller τ .
Remark 3. Alternatively, it can be proven that Lemma 2
and the subsequent results hold by setting κ = 1 if a cer-tain structural property of the spatial correlation structure (i.e., for all zD (D ⊆ X ) and λ ∈ Λ, ΣDD|λis diagonally
dominant) is satisfied, as shown in Lemma9(AppendixB). Consequently, the κ term can be removed from T and τ . Lemma 3 Suppose that a set zD0 of observations, a
bud-get ofN − n0 + 1 sampling locations for 1 ≤ n0 ≤ N , S ∈ Z+, and γ > 0 are given. The probability that |Q∗
n(zD, x) − Wn∗(zD, x)| ≤ γ (10) holds for all tuples
(n, zD, x) generated at stage n = n0, . . . , N by (8) to
com-puteV
n0(zD0) is at least 1 − 2(S|X |)Nexp(−2Sγ2/T2)
whereT is previously defined in Lemma2.
Its proof is found in AppendixA.3. The first step is con-cluded with our first main result, which follows from Lem-mas1 and3. Specifically, it chooses the values of S and τ such that the probability of Q
n (8) approximating Q∗n
(6) poorly (i.e., |Q∗n(zD, x) − Qn(zD, x)| > N γ) can be
bounded from above by a given 0 < δ < 1:
Theorem 1 Suppose that a set zDof observations, a
bud-get of N − n + 1 sampling locations for 1 ≤ n ≤ N , γ > 0, and 0 < δ < 1 are given. The probability that |Q∗
n(zD, x) − Qn(zD, x)| ≤ N γ holds for all x ∈ X \ D
is at least1 − δ by setting S = O T 2 γ2 N logN |X |T 2 γ2 +log 1 δ
whereT is previously defined in Lemma2. By assumingN , |Λ|, σo,σn,κ, and |X | as constants, τ = O(plog(1/γ))
and henceS = O (log (1/γ))
2 γ2 log log (1/γ) γδ ! . Its proof is provided in AppendixA.4.
Remark. It can be observed from Theorem1that the num-ber of generated τ -truncated observations has to be in-creased (i.e., higher chosen value of S) when (a) a lower probability δ of Qn(8) approximating Q∗n(6) poorly (i.e.,
|Q∗
n(zD, x)−Qn(zD, x)| > N γ) is desired, and (b) a larger
domain X of the phenomenon is given. The influence of γ, N , |Λ|, σo, σn, and κ on S is similar to that on τ , as
previ-ously reported in Remark 2 after Lemma2.
Thus far, we have shown in the first step that, with high probability, Q
n(8) approximates Q∗n(6) arbitrarily closely
for some chosen values of S and τ (Theorem1). The next step uses this result to probabilistically bound the perfor-mance loss in terms of Q∗n by observing location πn(zD)
selected by our -BAL policy πat stage n and following the BAL policy π∗thereafter:
Lemma 4 Suppose that a set zD of observations, a
bud-get of N − n + 1 sampling locations for 1 ≤ n ≤ N , γ > 0, and 0 < δ < 1 are given. Q∗n(zD, π∗n(zD)) −
Q∗n(zD, πn(zD)) ≤ 2N γ holds with probability at least
1 − δ by setting S and τ according to that in Theorem1. See Appendix A.5 for its proof. The final step extends Lemma4to obtain our second main result. In particular, it bounds the expected active sensing performance loss of our stochastic -BAL policy πrelative to that of BAL
pol-icy π∗, that is, policy πis -Bayes-optimal:
Theorem 2 Given a set zD0 of prior observations, a
bud-get of N sampling locations, and > 0, V1∗(zD0) −
Eπ[Vπ
1 (zD0)] ≤ by setting and substituting γ =
/(4N2) and δ = /(2N (N log(σo/σn) + log |Λ|)) into
S and τ in Theorem 1 to give τ = O(plog(1/)) and S = O (log (1/)) 2 2 log log (1/) 2 ! . Its proof is given in AppendixA.6.
Remark1. The number of generated τ -truncated observa-tions and the amount of truncation have to be, respectively,
Nonmyopic -Bayes-Optimal Active Learning of Gaussian Processes
increased and reduced (i.e., higher chosen values of S and τ ) when a tighter user-defined loss bound is desired. Remark 2. The deterministic BAL policy π∗ is Bayes-optimal among all candidate stochastic policies π since Eπ[V1π(zD0)] ≤ V
∗
1(zD0), as proven in AppendixA.7.
3.4. Anytime -BAL (hα, i-BAL) Algorithm
Unlike the BAL policy π∗, our -BAL policy πcan be de-rived exactly because its time complexity is independent of the size of the set of all possible originally Gaussian dis-tributed observations, which is uncountable. But, the cost of deriving π is exponential in the length N of planning
horizon since it has to compute the values V
n(zD) (8) for
all (S|X |)N possible states (n, z
D). To ease this
computa-tional burden, we propose an anytime algorithm based on -BAL that can produce a good policy fast and improve its approximation quality over time, as discussed next. The key intuition behind our anytime -BAL algorithm (hα, i-BAL of Algo.1) is to focus the simulation of greedy exploration paths through the most uncertain regions of the state space (i.e., in terms of the values Vn(zD)) instead of
evaluating the entire state space like π. To achieve this, our hα, i-BAL algorithm maintains both lower and upper heuristic bounds (respectively, Vn(zD) and V
n(zD)) for
each encountered state (n, zD), which are exploited for
rep-resenting the uncertainty of its corresponding value V n(zD)
to be used in turn for guiding the greedy exploration (or, put differently, pruning unnecessary, bad exploration of the state space while still guaranteeing policy optimality). To elaborate, each simulated exploration path (EXPLORE of Algo.1) repeatedly selects a sampling location x and its corresponding τ -truncated observation zxi at every stage
until the last stage N is reached. Specifically, at each stage n of the simulated path, the next states (n + 1, zD∪ {zxi})
with uncertainty |Vn+1(zD∪{zix})−V
n+1(zD∪{zxi})|
ex-ceeding α (line 6) are identified (lines 7-8), among which the one with largest lower bound Vn+1(zD∪ {zxi}) (line
10) is prioritized/selected for exploration (if more than one exists, ties are broken by choosing the one with most uncer-tainty, that is, largest upper bound Vn+1(zD∪ {zxi}) (line
11)) while the remaining unexplored ones are placed in the set U (line 12) to be considered for future exploration (lines 3-6 in hα, i-BAL). So, the simulated path terminates if the uncertainty of every next state is at most α; the uncertainty of a state at the last stage N is guaranteed to be zero (14). Then, the algorithm backtracks up the path to up-date/tighten the bounds of previously visited states (line 7 in hα, i-BAL and line 14 in EXPLORE) as follows:
Vn(zD) ← min Vn(zD), max x∈X \D Qn(zD, x) Vn(zD) ← max Vn(zD), max x∈X \D Q n(zD, x) (14) Algorithm 1 hα, i-BAL(zD0) hα, i-BAL zD0 1: U ←(1, zD0) 2: while |V1zD0 − V 1zD0 | > α do
3: V ← arg max(n,zD )∈UVn(zD)
4: n0, zD0 ← arg max(n,zD )∈VVn(zD)
5: U ← U \ n0, z D0
6: EXPLORE(n0, zD0, U ) /∗ U is passed by reference ∗/
7: UPDATE(n0, zD0)
8: return π1hα,izD0 ← arg maxx∈X \D0Q
1(zD0, x) EXPLORE(n, zD, U ) 1: T ← ∅ 2: for all x ∈ X \ D do 3: {zi x} S
i=1← sample from p(zx|zD) (4)
4: for i = 1, . . . , S do 5: zi x← zixmin |zix|,bτ/|z i x| 6: if |Vn+1zD∪zix − Vn+1 zD∪zix | > α then 7: T ← T ∪ n + 1, zD∪zix 8: parent n + 1, zD∪zix ← (n, zD) 9: if |T | > 0 then 10: V ← arg max(n+1,zD ∪{zi x })∈TV n+1zD∪zxi 11: (n + 1, zD0) ← arg max(n+1,zD ∪{zi x })∈VV n+1zD∪zxi 12: U ← U ∪ (T \ {(n + 1, zD0)}) 13: EXPLORE(n + 1, zD0, U )
14:Update Vn(zD) and Vn(zD) using (14)
UPDATE(n, zD)
1: Update Vn(zD) and Vn(zD) using (14)
2: if n > 1 then 3: (n − 1, zD0) ← parent(n, zD) 4: UPDATE(n − 1, zD0) Qn(zD, x) , 1 S S X i=1 − log p zi x|zD + V n+1zD∪zxi Qn(zD, x) , 1 S S X i=1 − log p zix|zD + Vn+1zD∪zxi
for stage n = 1, . . . , N such that there is no VN +1
(VN +1) term on the RHS expression of QN (Q
N) at stage
N . When the planning time runs out, we provide the greedy policy induced by the lower bound: πhα,i1 (zD0) ,
arg maxx∈X \D0Q
1(zD0, x) (line 8 in hα, i-BAL).
Central to the anytime performance of our hα, i-BAL algorithm is the computational efficiency of deriving in-formed initial heuristic bounds Vn(zD) and V
n(zD) where
Vn(zD) ≤ Vn(zD) ≤ V
n(zD). Due to lack of space,
we have shown in AppendixA.8how they can be derived efficiently. We have also derived a theoretical guarantee similar to that of Theorem2on the expected active sens-ing performance of our hα, i-BAL policy πhα,i, as shown in Appendix A.9. We have analyzed the time complexity of simulating k exploration paths in our hα, i-BAL algo-rithm to be O(kN S|X |(|Λ|(N3+ |X |N2+ S|X |) + ∆ + log(kN S|X |))) (AppendixA.10) where O(∆) denotes the cost of initializing the heuristic bounds at each state. In practice, hα, i-BAL’s planning horizon can be shortened to reduce its computational cost further by limiting the depth of each simulated path to strictly less than N . In that case,
Nonmyopic -Bayes-Optimal Active Learning of Gaussian Processes
although the resulting πhα,i’s performance has not been theoretically analyzed, Section 4.1 demonstrates empiri-cally that it outperforms the state-of-the-art algorithms.
4. Experiments and Discussion
This section evaluates the active sensing performance and time efficiency of our hα, i-BAL policy πhα,i (Sec-tion 3) empirically under limited sampling budget using two datasets featuring a simple, simulated spatial phe-nomenon (Section4.1) and a large-scale, real-world traffic phenomenon (i.e., speeds of road segments) over an urban road network (Section4.2). All experiments are run on a Mac OS X machine with Intel Core i7 at 2.66 GHz. 4.1. Simulated Spatial Phenomenon
The domain of the phenomenon is discretized into a finite set of sampling locations X = {0, 1, . . . , 99}. The phe-nomenon is a realization of a GP (Section2) parameterized by λ∗ = {σλn∗ = 0.25, σλ
∗
s = 10.0, `λ
∗
= 1.0}. For sim-plicity, we assume that σnλ∗ and σλ
∗
s are known, but the
true length-scale `λ∗= 1 is not. So, a uniform prior belief bD0=∅is maintained over a set L = {1, 6, 9, 12, 15, 18, 21}
of 7 candidate length-scales `λ. Using root mean squared prediction error(RMSPE) as the performance metric, the performance of our hα, i-BAL policies πhα,i with plan-ning horizon length N0 = 2, 3 and α = 1.0 are com-pared to that of the state-of-the-art GP-based active learn-ing algorithms: (a) The a priori greedy design (APGD) policy (Shewry & Wynn, 1987) iteratively selects and adds arg maxx∈X \Sn
P
λ∈ΛbD0(λ)H[ZSn∪{x}|zD0, λ] to
the current set Sn of sampling locations (where S0 = ∅)
until SN is obtained, (b) the implicit exploration (IE)
pol-icy greedily selects and observes sampling location xIE =
arg maxx∈X \DPλ∈ΛbD(λ)H[Zx|zD, λ] and updates the
belief from bD to bD∪{xIE}over L; if the upper bound on
the performance advantage of using π∗ over APGD pol-icy is less than a pre-defined threshold, it will use APGD with the remaining sampling budget, and (c) the explicit exploration via independent tests(ITE) policy performs a PAC-based binary search, which is guaranteed to find `λ∗ with high probability, and then uses APGD to select the remaining locations to be observed.
Both nonmyopic IE and ITE policies are proposed by
Krause & Guestrin(2007): IE is reported to incur the
low-est prediction error empirically while ITE is guaranteed not to achieve worse than the optimal performance by more than a factor of 1/e. Fig. 1a shows results of the active sensing performance of the tested policies averaged over 20 realizations of the phenomenon drawn independently from the underlying GP model described earlier. It can be ob-served that the RMSPE of every tested policy decreases with a larger budget of N sampling locations. Notably, our hα, i-BAL policies perform better than the APGD, IE,
5 10 15 20 25 8.5 9 9.5 10 10.5 11 11.5 12
Budget of N sampling locations
RM S P E APGD IE ITE π⟨α,ϵ⟩(N’ = 2) π⟨α,ϵ⟩(N’ = 3) 0 200 400 600 800 0 1 2 3 4 5x 10 5
No. of simulated paths
Ti m e (m s) 0 200 400 600 800 12 13 14 15 16 17 18 19
No. of simulated paths
En tr o p y (na t) Vϵ 1(zD) Vϵ1(zD) (a) (b) (c)
Figure 1. Graphs of (a) RMSPE of APGD, IE, ITE, and hα, i-BAL policies with planning horizon length N0= 2, 3 vs. budget of N sampling locations, (b) stage-wise online processing cost of hα, i-BAL policy with N0 = 3 and (c) gap between V1(zD0)
and V1(zD0) vs. number of simulated paths.
and ITE policies, especially when N is small. The perfor-mance gap between our hα, i-BAL policies and the other policies decreases as N increases, which intuitively means that, with a tighter sampling budget (i.e., smaller N ), it is more critical to strike a right balance between exploration and exploitation.
Fig.2shows the stage-wise sampling designs produced by the tested policies with a budget of N = 15 sampling locations. It can be observed that our hα, i-BAL policy achieves a better balance between exploration and exploita-tion and can therefore discern `λ∗much faster than the IE and ITE policies while maintaining a fine spatial coverage of the phenomenon. This is expected due to the following issues faced by IE and ITE policies: (a) The myopic explo-ration of IE tends not to observe closely-spaced locations (Fig. 2a), which are in fact informative towards estimat-ing the true length-scale, and (b) despite ITE’s theoretical guarantee in finding `λ∗, its PAC-style exploration is too
aggressive, hence completely ignoring how informative the posterior belief bDover L is during exploration. This leads
to a sub-optimal exploration behavior that reserves too lit-tle budget for exploitation and consequently entails a poor spatial coverage, as shown in Fig.2b.
Our hα, i-BAL policy can resolve these issues by jointly and naturally optimizing the trade-off between observing the most informative locations for minimizing the predic-tive uncertainty of the phenomenon (i.e., exploitation) vs. the uncertainty surrounding its length-scale (i.e., explo-ration), hence enjoying the best of both worlds (Fig.2c). In fact, we notice that, after observing 5 locations, our hα, i-BAL policy can focus 88.10% of its posterior belief on `λ∗while IE only assigns, on average, about 18.65% of its
posterior belief on `λ∗, which is hardly more informative
than the prior belief bD0(`
λ∗
) = 1/7 ≈ 14.28%. Finally, Fig.1b shows that the online processing cost of hα, i-BAL per sampling stage grows linearly in the number of sim-ulated paths while Fig. 1c reveals that its approximation quality improves (i.e., gap between V1(zD0) and V
1(zD0)
decreases) with increasing number of simulated paths. In-terestingly, it can be observed from Fig.1c that although hα, i-BAL needs about 800 simulated paths (i.e., 400 s) to close the gap between V1(zD0) and V
1(zD0), V
1(zD0)
Nonmyopic -Bayes-Optimal Active Learning of Gaussian Processes 0 10 20 30 40 50 60 70 80 90 100 Spatial Field Stages 1 - 5 1 - 9 1 - 13 1 - 15 0 10 20 30 40 50 60 70 80 90 100 Spatial Field
Exploration: Find the true length-scale by binary search
Exploitation: Too little remaining budget entails poor coverage Stages 1 - 5 1 - 9 1 - 13 1 - 15 0 10 20 30 40 50 60 70 80 90 100 Spatial Field
Exploration: Observing closely-spaced locations to identify true length-scale
Exploitation: Sufficient budget remaining to achieve even coverage Stages
1 - 4 1 - 8 1 - 12 1 - 15
(a) IE policy (b) ITE policy (c) hα, i-BAL policy
Figure 2. Stage-wise sampling designs produced by (a) IE, (b) ITE, and (c) hα, i-BAL policy with a planning horizon length N0 = 3 using a budget of N = 15 sampling locations. The final sampling designs are depicted in the bottommost rows of the figures.
only takes about 100 simulated paths (i.e., 50 s). This im-plies the actual computation time needed for hα, i-BAL to reach V
1(zD0) (via its lower bound V
1(zD0)) is much less
than that required to verify the convergence of V1(zD0) to
V
1(zD0) (i.e., by checking the gap). This is expected since
hα, i-BAL explores states with largest lower bound first (Section3.4).
4.2. Real-World Traffic Phenomenon
Fig.3a shows the traffic phenomenon (i.e., speeds (km/h) of road segments) over an urban road network X compris-ing 775 road segments (e.g., highways, arterials, slip roads, etc.) in Tampines area, Singapore during lunch hours on June 20, 2011. The mean speed is 52.8 km/h and the stan-dard deviation is 21.0 km/h. Each road segment x ∈ X is specified by a 4-dimensional vector of features: length, number of lanes, speed limit, and direction. The phe-nomenon is modeled as a relational GP (Chen et al.,2012) whose correlation structure can exploit both the road seg-ment features and road network topology information. The true parameters λ∗ = {σλ∗
n , σλ
∗
s , `λ
∗
} are set as the maxi-mum likelihood estimates learned using the entire dataset. We assume that σλ∗
n and σλ
∗
s are known, but `λ
∗
is not. So, a uniform prior belief bD0=∅ is maintained over a set
L = {`λi}6
i=0 of 7 candidate length-scales `λ0 = `λ
∗
and `λi = 2(i + 1)`λ∗for i = 1, . . . , 6.
The performance of our hα, i-BAL policies with planning horizon length N0 = 3, 4, 5 are compared to that of APGD and IE policies (Section4.1) by running each of them on a mobile probe to direct its active sensing along a path of adjacent road segments according to the road network topology; ITE cannot be used here as it requires observ-ing road segments separated by a pre-computed distance during exploration (Krause & Guestrin,2007), which vio-lates the topological constraints of the road network since the mobile probe cannot “teleport”. Fig. 3 shows results of the tested policies averaged over 5 independent runs: It can be observed from Fig.3b that our hα, i-BAL policies outperform APGD and IE policies due to their nonmyopic exploration behavior. In terms of the total online process-ing cost, Fig.3c shows that hα, i-BAL incurs < 4.5 hours given a budget of N = 240 road segments, which can be afforded by modern computing power. To illustrate the be-havior of each policy, Figs. 3d-f show, respectively, the road segments observed (shaded in black) by the mobile
85008550860086508700875088008850890089509000 3150 3200 3250 3300 3350 3400 3450 3500 3550 0 16 32 48 64 80 96 60 120 180 240 15 16 17 18 19 20 21 22
Budget of N road segments
RM S P E (k m /h ) APGD IE π⟨α ,ϵ ⟩(N’ = 3) π⟨α ,ϵ ⟩(N’ = 4) π⟨α ,ϵ ⟩(N’ = 5) 60 120 180 240 102 104 106 108
Budget of N road segments
Ti m e (m s) π⟨α ,ϵ ⟩(N’ = 3) π⟨α ,ϵ ⟩(N’ = 4) π⟨α ,ϵ ⟩(N’ = 5) (a) (b) (c) 85008550860086508700875088008850890089509000 3150 3200 3250 3300 3350 3400 3450 3500 3550 0 16 32 48 64 80 96 85008550860086508700875088008850890089509000 3150 3200 3250 3300 3350 3400 3450 3500 3550 0 16 32 48 64 80 96 85008550860086508700875088008850890089509000 3150 3200 3250 3300 3350 3400 3450 3500 3550 0 16 32 48 64 80 96 (d) (e) (f)
Figure 3. (a) Traffic phenomenon (i.e., speeds (km/h) of road seg-ments) over an urban road network in Tampines area, Singapore, graphs of (b) RMSPE of APGD, IE, and hα, i-BAL policies with horizon length N0= 3, 4, 5 and (c) total online processing cost of hα, i-BAL policies with N0
= 3, 4, 5 vs. budget of N segments, and (d-f) road segments observed (shaded in black) by respective APGD, IE, and hα, i-BAL policies (N0= 5) with N = 60.
probe running APGD, IE, and hα, i-BAL policies with N0 = 5 given a budget of N = 60. It can be observed from Figs.3d-e that both APGD and IE cause the probe to move away from the slip roads and highways to low-speed segments whose measurements vary much more smoothly; this is expected due to their myopic exploration behavior. In contrast, hα, i-BAL nonmyopically plans the probe’s path and can thus direct it to observe the more informative slip roads and highways with highly varying measurements (Fig.3f) to achieve better performance.
5. Conclusion
This paper describes and theoretically analyzes an -BAL approach to nonmyopic active learning of GPs that can jointly and naturally optimize the exploration-exploitation trade-off. We then provide an anytime hα, i-BAL algo-rithm based on -BAL with real-time performance guaran-tee and empirically demonstrate using synthetic and real-world datasets that, with limited budget, it outperforms the state-of-the-art GP-based active learning algorithms. Acknowledgments. This work was supported by Singa-pore National Research Foundation in part under its In-ternational Research Center @ Singapore Funding Initia-tive and administered by the InteracInitia-tive Digital Media Pro-gramme Office and in part through the Singapore-MIT Al-liance for Research and Technology Subaward Agreement No. 52.
Nonmyopic -Bayes-Optimal Active Learning of Gaussian Processes
References
Cao, N., Low, K. H., and Dolan, J. M. Multi-robot infor-mative path planning for active sensing of environmental phenomena: A tale of two algorithms. In Proc. AAMAS, pp. 7–14, 2013.
Chen, J., Low, K. H., Tan, C. K.-Y., Oran, A., Jaillet, P., Dolan, J. M., and Sukhatme, G. S. Decentralized data fusion and active sensing with mobile sensors for mod-eling and predicting spatiotemporal traffic phenomena. In Proc. UAI, pp. 163–173, 2012.
Chen, J., Cao, N., Low, K. H., Ouyang, R., Tan, C. K.-Y., and Jaillet, P. Parallel Gaussian process regression with low-rank covariance matrix approximations. In Proc. UAI, pp. 152–161, 2013a.
Chen, J., Low, K. H., and Tan, C. K.-Y. Gaussian process-based decentralized data fusion and active sensing for mobility-on-demand system. In Proc. RSS, 2013b. Cover, T. and Thomas, J. Elements of Information Theory.
John Wiley & Sons, NY, 1991.
Diggle, P. J. Bayesian geostatistical design. Scand. J. Statistics, 33(1):53–64, 2006.
Dolan, J. M., Podnar, G., Stancliff, S., Low, K. H., Elfes, A., Higinbotham, J., Hosler, J. C., Moisan, T. A., and Moisan, J. Cooperative aquatic sensing using the tele-supervised adaptive ocean sensor fleet. In Proc. SPIE Conference on Remote Sensing of the Ocean, Sea Ice, and Large Water Regions, volume 7473, 2009.
Hoang, T. N. and Low, K. H. A general framework for in-teracting Bayes-optimally with self-interested agents us-ing arbitrary parametric model and model prior. In Proc. IJCAI, pp. 1394–1400, 2013.
Houlsby, N., Hernandez-Lobato, J. M., Huszar, F., and Ghahramani, Z. Collaborative Gaussian processes for preference learning. In Proc. NIPS, pp. 2105–2113, 2012.
Krause, A. and Guestrin, C. Nonmyopic active learning of Gaussian processes: An exploration-exploitation ap-proach. In Proc. ICML, pp. 449–456, 2007.
Krause, A., Singh, A., and Guestrin, C. Near-optimal sen-sor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. JMLR, 9:235–284, 2008.
Low, K. H., Gordon, G. J., Dolan, J. M., and Khosla, P. Adaptive sampling for multi-robot wide-area explo-ration. In Proc. IEEE ICRA, pp. 755–760, 2007.
Low, K. H., Dolan, J. M., and Khosla, P. Adaptive multi-robot wide-area exploration and mapping. In Proc. AA-MAS, pp. 23–30, 2008.
Low, K. H., Dolan, J. M., and Khosla, P. Information-theoretic approach to efficient adaptive path planning for mobile robotic environmental sensing. In Proc. ICAPS, pp. 233–240, 2009.
Low, K. H., Dolan, J. M., and Khosla, P. Active Markov information-theoretic path planning for robotic environ-mental sensing. In Proc. AAMAS, pp. 753–760, 2011. Low, K. H., Chen, J., Dolan, J. M., Chien, S., and
Thomp-son, D. R. Decentralized active robotic exploration and mapping for probabilistic field classification in environ-mental sensing. In Proc. AAMAS, pp. 105–112, 2012. Martin, R. J. Comparing and contrasting some
environmen-tal and experimenenvironmen-tal design problems. Environmetrics, 12(3):303–317, 2001.
M¨uller, W. G. Collecting Spatial Data: Optimum Design of Experiments for Random Fields. Springer, 3rd edition, 2007.
Ouyang, R., Low, K. H., Chen, J., and Jaillet, P. Multi-robot active sensing of non-stationary Gaussian process-based environmental phenomena. In Proc. AAMAS, 2014.
Park, M. and Pillow, J. W. Bayesian active learning with localized priors for fast receptive field characterization. In Proc. NIPS, pp. 2357–2365, 2012.
Podnar, G., Dolan, J. M., Low, K. H., and Elfes, A. Telesu-pervised remote surface water quality sensing. In Proc. IEEE Aerospace Conference, 2010.
Poupart, P., Vlassis, N., Hoey, J., and Regan, K. An ana-lytic solution to discrete Bayesian reinforcement learn-ing. In Proc. ICML, pp. 697–704, 2006.
Rasmussen, C. E. and Williams, C. K. I. Gaussian Pro-cesses for Machine Learning. MIT Press, 2006. Shewry, M. C. and Wynn, H. P. Maximum entropy
sam-pling. J. Applied Statistics, 14(2):165–170, 1987. Singh, A., Krause, A., Guestrin, C., and Kaiser, W. J.
Effi-cient informative sensing using multiple robots. J. Arti-ficial Intelligence Research, 34:707–755, 2009.
Solomon, H. and Zacks, S. Optimal design of sampling from finite populations: A critical review and indication of new research areas. J. American Statistical Associa-tion, 65(330):653–677, 1970.
Zimmerman, D. L. Optimal network design for spatial pre-diction, covariance parameter estimation, and empirical prediction. Environmetrics, 17(6):635–652, 2006.
Nonmyopic -Bayes-Optimal Active Learning of Gaussian Processes
A. Proofs of Main Results
A.1. Proof of Lemma1
We will give a proof by induction on n that
|Q∗n(zD, x) − Qn(zD, x)| ≤ (N − n + 1)γ (15)
for all tuples (n, zD, x) generated at stage n = n0, . . . , N
by (8) to compute V
n0(zD0). When n = N , WN∗(zD, x) =
QN(zD, x) in (10), by definition. So, |Q∗N(zD, x) −
QN(zD, x)| ≤ γ (15) trivially holds for the base case.
Sup-posing (15) holds for n + 1 (i.e., induction hypothesis), we will prove that it holds for n0≤ n < N :
|Q∗ n(zD, x) − Qn(zD, x)| ≤ |Q∗ n(zD, x) − Wn∗(zD, x)| + |Wn∗(zD, x) − Qn(zD, x)| ≤ γ + |W∗ n(zD, x) − Qn(zD, x)| ≤ γ + (N − n)γ = (N − n + 1)γ .
The first and second inequalities follow from the triangle inequality and (10), respectively. The last inequality is due to |Wn∗(zD, x) − Qn(zD, x)| ≤ 1 S S X i=1 Vn+1∗ (zD∪ {zxi}) − V n+1(zD∪ {z i x}) ≤ 1 S S X i=1 max x0 Q∗n+1(zD∪ {zxi}, x0) − Q n+1(zD∪ {zxi}, x0) ≤ (N − n)γ
such that the last inequality follows from the induction hy-pothesis.
From (15), when n = n0, |Q∗n0(zD0, x) − Qn0(zD0, x)| ≤
(N − n0 + 1)γ for all x ∈ X \ D0 since D = D0 and zD = zD0.
A.2. Proof of Lemma2
Let
Wni(zD, x) , − log p(zxi|zD) + Vn+1∗ (zD∪ {zxi}) .
Then, Wn∗(zD, x) = S−1PSi=1W i
n(zD, x) can be viewed
as an empirical mean computed based on the random sam-ples Wni(zD, x) drawn from a distribution whose mean
co-incides with b Qn(zD, x) , bH[ bZx|zD] + E[Vn+1∗ (zD∪ { bZx})|zD] b H[ bZx|zD] , − Z bτ −τb f ( bZx= zx|zD) log p(Zx= zx|zD)dzx −f ( bZx= −bτ |zD) log p(Zx= −bτ |zD) −f ( bZx=bτ |zD) log p(Zx=bτ |zD) (16)
such that the expectation term is omitted from the RHS ex-pression of bQN at stage N , and recall from Definition 1
that f and p are distributions of bZxand Zx, respectively.
Using Hoeffding’s inequality, b Qn(zD, x) − 1 S S X i=1 Wni(zD, x) ≤γ 2
with probability at least 1 − 2 exp(−Sγ2/ 2(W − W )2)
where W and W are upper and lower bounds of Wi
n(zD, x), respectively. To determine these bounds, note
that |zi
x| ≤ bτ , by Definition1, and |µx|D,λ| ≤ τ − τ , byb (7). Consequently, 0 ≤ (zi
x− µx|D,λ)2 ≤ (2bτ − τ )
2 ≤
(2N κN −1τ − τ )2 = (2N κN −1− 1)2τ2such that the last
inequality follows from Lemma 7. Together with using Lemma6, the following result ensues:
1 p2πσ2 n ≥ p(zi x|zD, λ) ≥ 1 p2πσ2 o exp−(2N κ N −1− 1)2τ2 2σ2 n where σ2
nand σ2oare previously defined in (13). It follows
that p(zxi|zD) = X λ∈Λ p(zxi|zD, λ) bD(λ) ≥X λ∈Λ " 1 p2πσ2 o exp −(2N κ N −1− 1)2τ2 2σ2 n # bD(λ) = 1 p2πσ2 o exp −(2N κ N −1− 1)2τ2 2σ2 n . Similarly, p(zix|zD) ≤ 1/p2πσn2. Then, − log p(zi x|zD) ≤ 1 2log 2πσ 2 o + (2N κN −1− 1)2τ2 2σ2 n , − log p(zi x|zD) ≥ 1 2log 2πσ 2 n . By Lemma10,N − n 2 log 2πeσ 2 n ≤ Vn+1∗ (zD∪ {zix}) ≤ N − n 2 log 2πeσ 2 o + log |Λ|. Consequently, W − W≤ N log σo σn +(2N κ N −1− 1)2τ2 2σ2 n + log |Λ| = O N 2κ2Nτ2 σ2 n + N logσo σn + log |Λ| . Finally, using Lemma16, |Q∗n(zD, x) − bQn(zD, x)| ≤ γ/2
by setting τ = O σo s logσ 2 o γ N2κ2N+σ2 o σ2 n +N logσo σn +log |Λ| ! , thereby guaranteeing that
|Q∗ n(zD, x) − Wn∗(zD, x)| ≤ |Q∗n(zD, x) − bQn(zD, x)| + | bQn(zD, x) − Wn∗(zD, x)| ≤ γ 2 + γ 2 = γ
Nonmyopic -Bayes-Optimal Active Learning of Gaussian Processes
with probability at least 1 − 2 exp −2Sγ2/T2 where
T = 2W −W= O N2κ2Nτ2 σ2 n +N log σo σn +log |Λ| .
A.3. Proof of Lemma3
From Lemma2, P (|Q∗n(zD, x) − Wn∗(zD, x)| > γ) ≤ 2 exp −2Sγ 2 T2
for each tuple (n, zD, x) generated at stage n = n0, . . . , N
by (8) to compute Vn0(zD0). Since there will be no more
than (S|X |)N tuples (n, zD, x) generated at stage n =
n0, . . . , N by (8) to compute V
n0(zD0), the probability that
|Q∗
n(zD, x) − Wn∗(zD, x)| > γ for some generated tuple
(n, zD, x) is at most 2(S|X |)Nexp(−2Sγ2/T2) by
apply-ing the union bound. Lemma3then directly follows. A.4. Proof of Theorem1
Suppose that a set zDof observations, a budget of N −n+1
sampling locations, S ∈ Z+, and γ > 0 are given. It
fol-lows immediately from Lemmas1and3that the probabil-ity of |Q∗n(zD, x) − Qn(zD, x)| ≤ N γ (11) holding for all
x ∈ X \ D is at least 1 − 2 (S |X |)Nexp −2Sγ 2 T2
where T is previously defined in Lemma2.
To guarantee that |Q∗n(zD, x) − Qn(zD, x)| ≤ N γ (11)
holds for all x ∈ X \ D0 with probability at least 1 − δ,
the value of S to be determined must therefore satisfy the following inequality: 1 − 2 (S |X |)Nexp −2Sγ 2 T2 ≥ 1 − δ , which is equivalent to S ≥ T 2 2γ2
N log S + N log |X | + log2 δ
. (17) Using the identity log S ≤ αS − log α − 1 with an appro-priate choice of α = γ2/(N T2), the RHS expression of (17) can be bounded from above by
S 2 + T2 2γ2 N logN |X |T 2 eγ2 + log 2 δ .
Therefore, to satisfy (17), it suffices to determine the value of S such that the following inequality holds:
S ≥ S 2 + T2 2γ2 N logN |X |T 2 eγ2 + log 2 δ by setting S = T 2 γ2 N logN |X |T 2 eγ2 + log 2 δ (18) where T , ON 2κ2Nτ2 σ2 n +N logσo σn +log |Λ| by setting τ = O σo s logσ 2 o γ N2κ2N+σ2 o σ2 n +N logσo σn +log |Λ| ! , as defined in Lemma 2previously. By assuming σo, σn,
|Λ|, N , κ, and |X | as constants, τ = O(plog(1/γ)), thus resulting in T = O(log(1/γ)). Consequently, (18) can be reduced to S = O log1γ 2 γ2 log log1γ γδ .
A.5. Proof of Lemma4
Theorem 1 implies that (a) Q∗n(zD, πn∗(zD)) −
Q∗n(zD, πn(zD)) ≤ Q∗n(zD, πn∗(zD)) − Qn(zD, πn(zD)) +
N γ and (b) Q∗n(zD, π∗n(zD)) − Qn(zD, πn(zD)) ≤
maxx∈X \D|Q∗n(zD, x) − Qn(zD, x)| ≤ N γ. By
combin-ing (a) and (b), Q∗n(zD, π∗n(zD)) − Q∗n(zD, πn(zD)) ≤
N γ + N γ = 2N γ holds with probability at least 1 − δ by setting S and τ according to that in Theorem1. A.6. Proof of Theorem2
By Lemma 4, Q∗n(zD, πn∗(zD)) − Q∗n(zD, πn(zD)) ≤
2N γ holds with probability at least 1 − δ. Otherwise, Q∗n(zD, πn∗(zD)) − Q∗n(zD, πn(zD)) > 2N γ with
prob-ability at most δ. In the latter case,
Q∗n(zD, π∗n(zD)) − Q∗n(zD, πn(zD)) ≤ (N − n + 1) log σo σn + log |Λ| ≤ N log σo σn + log |Λ| (19)
where the first inequality in (19) follows from (a) Q∗n(zD, πn∗(zD)) = Vn∗(zD) ≤ 0.5(N − n +
1) log(2πeσ2
o) + log |Λ|, by Lemma10, and (b)
Q∗n(zD, πn(zD)) = H[Zπ n(zD)|zD] + EV ∗ n+1(zD∪ {Zπ n(zD)})|zD ≥1 2log 2πeσ 2 n + 1 2(N − n) log 2πeσ 2 n =1 2(N − n + 1) log 2πeσ 2 n (20) such that the inequality in (20) is due to Lemmas10and
11, and the last inequality in (19) holds because σo ≥ σn,
by definition in (13) (hence, log (σo/σn) ≥ 0). Recall that
Nonmyopic -Bayes-Optimal Active Learning of Gaussian Processes
like π∗) due to its use of the truncated sampling procedure (Section3.2), which implies π
n(zD) is a random variable. As a result, Eπ n(zD)[Q ∗ n(zD, πn∗(zD)) − Q∗n(zD, πn(zD))] ≤ (1 − δ) (2N γ) + δ N log σo σn + log |Λ| ≤ 2N γ + δ N log σo σn + log |Λ| (21)
where the expectation is with respect to random variable π(z
D) and the first inequality follows from Lemma4and
(19). Using Eπ n(zD)[Q ∗ n(zD, π∗n(zD)) − Q∗n(zD, πn(zD))] = Vn∗(zD) − Eπ n(zD)[Q ∗ n(zD, π(zD))] and Eπ n(zD)[Q ∗ n(zD, πn(zD))] = Eπ n(zD) H[Zπ n(zD)|zD]+EV ∗ n+1zD∪ {Zπ n(zD)} |zD, (21) therefore becomes Vn∗(zD) − Eπ n(zD) H[Zπ n(zD)|zD] ≤ Eπ n(zD) EV∗ n+1 zD∪ {Zπ n(zD)} |zD + 2N γ + δ N log σo σn + log |Λ| (22)
such that there is no expectation term on the RHS expres-sion of (22) when n = N .
From (5), V1π(zD0) can be expanded into the following
re-cursive formulation using chain rule for entropy: Vnπ(zD) = HZπn(zD)|zD+EV
π
n+1zD∪ {Zπn(zD)} |zD
(23) for stage n = 1, . . . , N where the expectation term is omit-ted from the RHS expression of Vπ
N at stage N .
Using (22) and (23) above, we will now give a proof by induction on n that Vn∗(zD) − E{π i}Ni=n h Vnπ(zD) i ≤ (N − n + 1) 2N γ + δ N log σo σn + log |Λ| . (24) When n = N , VN∗(zD) − E{π N} h VNπ(zD) i = VN∗(zD) − Eπ N(zD) h H[Zπ N(zD)|zD] i ≤ 2N γ + δ N log σo σn + log |Λ|
such that the equality is due to (23) and the inequality fol-lows from (22). So, (24) holds for the base case. Suppos-ing (24) holds for n + 1 (i.e., induction hypothesis), we will
prove that it holds for n < N : Vn∗(zD) − E{π i}Ni=n h Vnπ(zD) i = Vn∗(zD) − Eπ n(zD) HZπ n(zD)|zD − Eπ n(zD) h E h E{π i}Ni=n+1 h Vn+1π zD∪ {Zπ n(zD)} i zD ii ≤ Eπ n(zD) EV∗ n+1zD∪ {Zπ n(zD)} |zD + 2N γ + δ N log σo σn + log |Λ| − Eπ n(zD) h E h E{π i}Ni=n+1 h Vn+1π zD∪ {Zπ n(zD)} i zD ii = Eπ n(zD) h E h Vn+1∗ zD∪ {Zπ n(zD)} − E{π i}Ni=n+1 h Vn+1π zD∪ {Zπ n(zD)} i zD ii + 2N γ + δ N log σo σn + log |Λ| ≤ (N − n) 2N γ + δ N log σo σn + log |Λ| + 2N γ + δ N log σo σn + log |Λ| = (N − n + 1) 2N γ + δ N log σo σn + log |Λ| . such that the first equality is due to (23), and the first and second inequalities follow from (22) and induction hypoth-esis, respectively. From (24), when n = 1, V1∗(zD) − Eπ h V1π(zD) i = V1∗(zD) − E{π i}Ni=1 h V1π(zD) i ≤ N 2N γ + δ N log σo σn + log |Λ| . Let = N (2N γ + δ(N log(σo/σn) + log |Λ|)) by setting
γ = /(4N2) and δ = /(2N (N log(σo/σn) + log |Λ|)).
As a result, from Lemma4, τ = O(plog(1/)) and
S = O log 1 2 2 log log 1 2 !! . Theorem2then follows.
A.7. Proof of Theorem3
Theorem 3 Let π be any stochastic policy. Then, Eπ[V1π(zD0)] ≤ V
∗ 1(zD0).
Proof. We will give a proof by induction on n that E{πi}N i=n[V π n(zD)] ≤ Vn∗(zD) . (25) When n = N , E{πN}[V π N(zD)] = EπN(zD)[H[ZπN(zD)|zD]] ≤ EπN(zD)[ max x∈X \DH[Zx|zD ]] = EπN(zD)[V ∗ N(zD)] = VN∗(zD)
Nonmyopic -Bayes-Optimal Active Learning of Gaussian Processes
such that the first and second last equalities are due to (23) and (6), respectively. So, (25) holds for the base case. Sup-posing (25) holds for n + 1 (i.e., induction hypothesis), we will prove that it holds for n < N :
E{πi}N i=n[V π n(zD)] = Eπn(zD) HZπn(zD)|zD + Eπn(zD) h E h E{πi}Ni=n+1V π n+1zD∪ {Zπn(zD)} zD ii ≤ Eπn(zD) HZπn(zD)|zD + Eπn(zD) h E h Vn+1∗ zD∪ {Zπn(zD)} zD ii ≤ Eπn(zD) max x∈X \DH[Zx|zD] + EV ∗ n+1(zD∪ {Zx})|zD = Eπn(zD)[V ∗ n(zD)] = Vn∗(zD)
such that the first and second last equalities are, respec-tively, due to (23) and (6), and the first inequality follows from the induction hypothesis.
A.8. Initializing Informed Heuristic Bounds
Due to the use of the truncated sampling procedure (Sec-tion3.2), computing informed initial heuristic bounds for V
n(zD) is infeasible without expanding from its
corre-sponding state to all possible states in the subsequent stages n+1, . . . , N , which we want to avoid. To resolve this issue, we instead derive informed bounds Vn(zD) and V
n(zD) that satisfy Vn(zD) ≤ Vn(zD) ≤ V n(zD) . (26)
with high probability: Using Theorem 1, |Vn∗(zD) −
V
n(zD)| ≤ maxx∈X \D|Qn∗(zD, x) − Qn(zD, x)| ≤
N γ, which implies Vn∗(zD) − N γ ≤ Vn(zD) ≤
Vn∗(zD) + N γ with probability at least 1 − δ. Vn∗(zD)
can at least be naively bounded using the uninformed, domain-independent lower and upper bounds given in Lemma10. In practice, domain-dependent bounds V∗n(zD)
and V∗n(zD) (i.e., V∗n(zD) ≤ Vn∗(zD) ≤ V ∗ n(zD))
tend to be more informed and we will show in Theo-rem 4 below how they can be derived efficiently. So, by setting Vn(zD) = V∗n(zD) − N γ and V n(zD) = V∗n(zD) + N γ for n < N and VN(zD) = V N(zD) = maxx∈X \DS−1P S i=1− log p(z i x|zD), (26) holds with probability at least 1 − δ.
Theorem 4 Given a set zD of observations and a space
Λ of parameters λ, define the a priori greedy design with unknown parameters as the setSnofn ≥ 1 sampling
loca-tions where S0, ∅ Sn, Sn−1∪ ( arg max x∈X X λ∈Λ bD(λ) H[ZSn−1∪{x}|zD, λ] −X λ∈Λ bD(λ) H[ZSn−1|zD, λ] ) . (27) Similarly, define the a priori greedy design with known pa-rametersλ as the set Snλofn ≥ 1 sampling locations where
Sλ 0 , ∅ Sλ n, S λ n−1∪ n arg max x∈X H[ZS λ n−1∪{x}|zD, λ] − H[ZSλ n−1|zD, λ] o . (28) Then, H[Z{π∗ i}Ni=N −n+1|zD] ≥ X λ∈Λ bD(λ) H[ZSn|zD, λ] H[Z{π∗ i}Ni=N −n+1|zD] ≤ X λ∈Λ bD(λ) e e − 1H[ZSnλ|zD, λ]+ nr e − 1 + H[Λ] (29) where {π∗i}N
i=N −n+1= arg max
{πi}Ni=N −n+1
H[Z{πi}Ni=N −n+1|zD] ,
Λ denotes the set of random parameters cor-responding to the realized parameters λ, and r = − min(0, 0.5 log(2πeσ2
n)) ≥ 0.
Remark. VN −n+1∗ (zD) = H[Z{π∗
i}Ni=N −n+1|zD], by
definition. Hence, the lower and upper bounds of H[Z{π∗
i}Ni=N −n+1|zD] (29) constitute informed
domain-dependent bounds for VN −n+1∗ (zD) that can be derived
ef-ficiently since both Sn(27) and {Snλ}λ∈Λ(28) can be
com-puted in polynomial time with respect to the interested vari-ables.
Proof. To prove the lower bound, H[Z{π∗ i}Ni=N −n+1|zD] = max {πi}Ni=N −n+1 H[Z{πi}Ni=N −n+1|zD] ≥ max {πi}Ni=N −n+1 X λ∈Λ bD(λ) H[Z{πi}Ni=N −n+1|zD, λ] ≥ max S⊆X :|S|=n X λ∈Λ bD(λ) H[ZS|zD, λ] ≥X λ∈Λ bD(λ) H[ZSn|zD, λ] .
The first inequality follows from the monotonicity of conditional entropy (i.e., “information never hurts”
Nonmyopic -Bayes-Optimal Active Learning of Gaussian Processes
bound) (Cover & Thomas, 1991). The second inequality holds because the optimal set S∗
, arg maxS⊆X :|S|=nPλ∈ΛbD(λ) H[ZS|zD, λ] is an
opti-mal a priori design (i.e., non-sequential) that does not per-form better than the optimal sequential policy π∗ (Krause
& Guestrin,2007). The third inequality is due to definition
of Sn.
To prove the upper bound, H[Z{π∗ i}Ni=N −n+1|zD] ≤X λ∈Λ bD(λ) max S⊆X :|S|=nH[ZS|zD, λ] + H[Λ] ≤X λ∈Λ bD(λ) e e − 1H[ZSnλ|zD, λ] + nr e − 1 + H[Λ] such that the first inequality is due to Theorem 1 ofKrause
& Guestrin(2007), and the second inequality follows from
Lemma18.
A.9. Performance Guarantee of hα,i-BAL Policy πhα,i Lemma 5 Suppose that a set zDof observations, a budget
ofN − n + 1 sampling locations for 1 ≤ n ≤ N , γ > 0, 0 < δ < 1, and α > 0 are given. Q∗n(zD, πn∗(zD)) −
Q∗n(zD, πnhα,i(zD)) ≤ 2(N γ + α) holds with probability
at least1 − δ by setting S and τ according to that in Theo-rem1.
Proof. When our hα, i-BAL algorithm terminates, |V1(zD0) − V 1(zD0)| ≤ α, which implies |V 1(zD0) − V1(zD0)| ≤ α. By Theorem 1, since |V ∗ 1(zD0) − V1(zD0)| ≤ maxx∈X \D0|Q ∗ n(zD0, x) − Q n(zD0, x)| ≤ N γ, |V1∗(zD0) − V 1(zD0)| ≤ |V ∗ 1(zD0) − V 1(zD0)| + |V 1(zD0) − V
1(zD0)| ≤ N γ + α with probability at least
1 − δ. In general, given that the length of planning hori-zon is reduced to N − n + 1 for 1 ≤ n ≤ N , the above inequalities are equivalent to
|V n(zD) − Vn(zD)| ≤ α |Vn∗(zD) − Vn(zD)| = Q ∗ n(zD0, π ∗ n(zD)) − Qn(zD, πnhα,i(zD)) ≤ N γ + α (30) by increasing/shifting the indices of V
1, V
1, and V1∗above
from 1 to n so that these value functions start at stage n instead. Q∗n(zD, π∗n(zD)) − Q∗n(zD, πnhα,i(zD)) = Q∗n(zD, π∗n(zD)) − Qn(zD, πnhα,i(zD)) + Qn(zD, πhα,in (zD)) − Qn(zD, πnhα,i(zD)) + Q n(zD, πhα,in (zD)) − Q∗n(zD, πnhα,i(zD)) ≤ N γ + α + 1 S S X i=1 Vn+1(zD∪ {zπihα,i n (zD)}) − Vn+1 (zD∪ {ziπhα,i n (zD)}) + N γ ≤ 2(N γ + α)
where the inequalities follow from (8), (14), (30), and The-orem1.
Theorem 5 Given a set zD0 of prior observations, a
budget of N sampling locations, α > 0, and > 4N α, V1∗(zD0) − Eπhα,i[Vπ
hα,i
1 (zD0)] ≤ by setting
and substituting γ = /(4N2) and δ = (/(2N ) −
2α)/(N log(σo/σn) + log |Λ|) into S and τ in
The-orem 1 to give τ = O(plog(1/)) and S = O (log (1/)) 2 2 log log (1/) ( − α) ! .
Proof Sketch. The proof follows from Lemma5and is sim-ilar to that of Theorem2.
A.10. Time Complexity of hα, i-BAL Algorithm Suppose that our hα, i-BAL algorithm runs k simulated exploration paths during its lifetime where k actually de-pends on the available time for planning. Then, since each exploration path has at most N stages and each stage gener-ates at most S|X | stgener-ates, there will be at most O(kN S|X |) states generated during the whole lifetime of our hα, i-BAL algorithm. So, to analyze the overall time complexity of our hα, i-BAL algorithm, the processing cost at each state is first quantified, which, according to EXPLORE of Algorithm1, includes the cost of sampling (lines 2-5), ini-tializing (line 6) and updating the corresponding heuristic bounds (line 14). In particular, the cost of sampling at each state involves training the GPs (i.e., O(N3)) and com-puting the predictive distributions using (1) and (2) (i.e., O(|X |N2
)) for each set of realized parameters λ ∈ Λ and the cost of generating S|X | samples from a mixture of |Λ| Gaussian distributions (i.e., O(|Λ|S|X |)) by assuming that drawing a sample from a Gaussian distribution consumes a unit processing cost. This results in a total sampling com-plexity of O(|Λ|(N3+ |X |N2+ S|X |)).
Now, let O(∆) denote the processing cost of initializing the heuristic bounds at each state, which depends on the actual bounding scheme being used. The total processing cost at each state is therefore O(|Λ|(N3+ |X |N2+ S|X |) + ∆ +
S|X |) where the last term corresponds to the cost of up-dating bounds by (14). In addition, to search for the most potential state to explore in O(1) at each stage (lines 10-11), the set of unexplored states is maintained in a priority queue (line 12) using the corresponding exploration crite-rion, thus incurring an extra management cost (i.e., updat-ing the queue) of O(log(kN S|X |)). That is, the total time complexity of simulating k exploration paths in our hα, i-BAL algorithm is O(kN S|X |(|Λ|(N3+ |X |N2+ S|X |) +