CiteSeerX — An asymptotically optimal algorithm for the max k-armed bandit problem


An Asymptotically Optimal Algorithm for the Max k-Armed Bandit Problem

Matthew J. Streeter (1) and Stephen F. Smith (2)
(1) Computer Science Department and Center for the Neural Basis of Cognition
(2) The Robotics Institute
Carnegie Mellon University, Pittsburgh, PA 15213
{matts, sfs}@cs.cmu.edu

Copyright © 2006, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

We present an asymptotically optimal algorithm for the max variant of the k-armed bandit problem. Given a set of $k$ slot machines, each yielding payoff from a fixed (but unknown) distribution, we wish to allocate trials to the machines so as to maximize the expected maximum payoff received over a series of $n$ trials. Subject to certain distributional assumptions, we show that $O\left(k \ln(k/\delta) \ln(n)^2 / \epsilon^2\right)$ trials are sufficient to identify, with probability at least $1 - \delta$, a machine whose expected maximum payoff is within $\epsilon$ of optimal. This result leads to a strategy for solving the problem that is asymptotically optimal in the following sense: the gap between the expected maximum payoff obtained by using our strategy for $n$ trials and that obtained by pulling the single best arm for all $n$ trials approaches zero as $n \to \infty$.

1. Introduction

In the k-armed bandit problem one is faced with a set of $k$ slot machines, each having an arm that, when pulled, yields a payoff from a fixed (but unknown) distribution. The goal is to allocate trials to the arms so as to maximize the expected cumulative payoff obtained over a series of $n$ trials. Solving the problem entails striking a balance between exploration (determining which arm yields the highest mean payoff) and exploitation (repeatedly pulling this arm).

In the max k-armed bandit problem, the goal is to maximize the expected maximum (rather than cumulative) payoff. This version of the problem arises in practice when tackling combinatorial optimization problems for which a number of randomized search heuristics exist: given $k$ heuristics, each yielding a stochastic outcome when applied to some particular problem instance, we wish to allocate trials to the heuristics so as to maximize the maximum payoff (e.g., the maximum number of clauses satisfied by any sampled variable assignment, the minimum makespan of any sampled schedule). Cicirello and Smith (2005) show that a max k-armed bandit approach yields good performance on the resource-constrained project scheduling problem with maximum time lags (RCPSP/max).

1.1. Summary of Results

We consider a restricted version of the max k-armed bandit problem in which each arm yields payoff drawn from a generalized extreme value (GEV) distribution (defined in §2). This paper presents the first provably asymptotically optimal algorithm for this problem.

Roughly speaking, the reason for assuming a GEV distribution is the Extremal Types Theorem (stated in §2), which states that the distribution of the sample maximum of $n$ independent identically distributed random variables approaches a GEV distribution as $n \to \infty$. A more formal justification is given in §3.

For reasons that will become clear, the nature of our results depends on the shape parameter ($\xi$) of the GEV distribution. Assuming all arms have $\xi \le 0$, our results can be summarized as follows.

• Let $a$ be an arm that yields payoff drawn from a GEV distribution with unknown parameters; let $M_n$ denote the maximum payoff obtained after pulling $a$ $n$ times; and let $m_n = E[M_n]$. We provide an algorithm that, after pulling the arm $O\left(\ln(1/\delta) \ln(n)^2 / \epsilon^2\right)$ times, produces an estimate $\bar{m}_n$ of $m_n$ with the property that $P[|\bar{m}_n - m_n| < \epsilon] \ge 1 - \delta$.

• Let $a_1, a_2, \ldots, a_k$ be $k$ arms, each yielding payoff from (distinct) GEV distributions with unknown parameters. Let $m_n^i$ denote the expected maximum payoff obtained by pulling the $i$th arm $n$ times, and let $m_n^* = \max_{1 \le i \le k} m_n^i$. We provide an algorithm that, when run for $n$ pulls, obtains expected maximum payoff $m_n^* - o(1)$.

Our results for the case $\xi > 0$ are similar, except that our estimates and expected maximum payoffs come within arbitrarily small factors (rather than absolute distances) of optimality. Specifically, our estimates have the property that
\[ P\left[\frac{1}{1+\epsilon} < \frac{\bar{m}_n - \alpha_1}{m_n - \alpha_1} < 1 + \epsilon\right] \ge 1 - \delta \]
for constant $\alpha_1$ independent of $n$, while the expected maximum payoff obtained by using our algorithm for $n$ pulls is $m_n^*(1 - o(1))$.

1.2. Related Work

The classical k-armed bandit problem was first studied by Robbins (1952) and has since been the subject of numerous papers; see Berry and Fristedt (1986) and Kaelbling (1993) for overviews. In a paper similar in spirit to ours, Fong (1995) showed that an initial exploration phase consisting

of $O\left(\frac{k}{\epsilon^2} \ln\frac{k}{\delta}\right)$ pulls is sufficient to identify, with probability at least $1 - \delta$, an arm whose mean payoff is within $\epsilon$ of optimal. Theorem 2 of this paper proves a bound similar to Fong's on the number of pulls needed to identify an arm whose expected maximum payoff (over a series of $n$ trials) is near-optimal.

The max variant of the k-armed bandit problem was first studied by Cicirello and Smith (2004; 2005), who successfully used a heuristic for the max k-armed bandit problem to select among priority rules for the RCPSP/max. The design of Cicirello and Smith's heuristic is motivated by an analysis of the special case in which each arm's payoff distribution is a GEV distribution with shape parameter $\xi = 0$, but they do not rigorously analyze the heuristic's behavior. Our paper is more theoretical and less empirical: on the one hand we do not perform experiments on any practical combinatorial problem, but on the other hand we provide stronger performance guarantees under weaker distributional assumptions.

1.3. Notation

For an arbitrary cumulative distribution function $G$, let the random variable $M_n^G$ be defined by
\[ M_n^G = \max\{Z_1, Z_2, \ldots, Z_n\} \]
where $Z_1, Z_2, \ldots, Z_n$ are independent random variables, each having distribution $G$. Let
\[ m_n^G = E[M_n^G] . \]

2. Extreme Value Theory

This section provides a self-contained overview of results in extreme value theory that are relevant to this work. Our presentation is based on the text by Coles (2001).

The central result of extreme value theory is an analogue of the central limit theorem that applies to extremely rare events. Recall that the central limit theorem states that (under certain regularity conditions) the distribution of the sum of $n$ independent, identically distributed (i.i.d.) random variables converges to a normal distribution as $n \to \infty$. The extremal types theorem states that (under certain regularity conditions) the distribution of the maximum of $n$ i.i.d. random variables converges to a generalized extreme value (GEV) distribution.

Definition (GEV distribution). A random variable $Z$ has a generalized extreme value distribution if, for constants $\mu$, $\sigma > 0$, and $\xi$, $P[Z \le z] = \mathrm{GEV}_{(\mu,\sigma,\xi)}(z)$, where
\[ \mathrm{GEV}_{(\mu,\sigma,\xi)}(z) = \exp\left(-\left(1 + \frac{\xi(z - \mu)}{\sigma}\right)^{-1/\xi}\right) \]
for $z \in \{z : 1 + \xi(z - \mu)\sigma^{-1} > 0\}$, and $\mathrm{GEV}_{(\mu,\sigma,\xi)}(z) = 1$ otherwise. The case $\xi = 0$ is interpreted as the limit
\[ \lim_{\xi' \to 0} \mathrm{GEV}_{(\mu,\sigma,\xi')}(z) = \exp\left(-\exp\left(\frac{\mu - z}{\sigma}\right)\right) . \]

The following three propositions establish properties of the GEV distribution.

Proposition 1. Let $Z$ be a random variable with $P[Z \le z] = \mathrm{GEV}_{(\mu,\sigma,\xi)}(z)$. Then
\[ E[Z] = \begin{cases} \mu + \frac{\sigma}{\xi}\left(\Gamma(1 - \xi) - 1\right) & \text{if } \xi < 1,\ \xi \ne 0 \\ \mu + \sigma\gamma & \text{if } \xi = 0 \\ \infty & \text{if } \xi \ge 1 \end{cases} \]
where
\[ \Gamma(z) = \int_0^\infty t^{z-1} \exp(-t)\, dt \]
is the complete gamma function and
\[ \gamma = \lim_{n \to \infty} \left(\sum_{k=1}^{n} \frac{1}{k} - \ln(n)\right) \]
is Euler's constant.

Proposition 2. Let $G = \mathrm{GEV}_{(\mu,\sigma,\xi)}$. Then $M_n^G$ has distribution $\mathrm{GEV}_{(\mu',\sigma',\xi')}$, where
\[ \mu' = \begin{cases} \mu + \frac{\sigma}{\xi}\left(n^\xi - 1\right) & \text{if } \xi \ne 0 \\ \mu + \sigma \ln(n) & \text{otherwise,} \end{cases} \]
$\sigma' = \sigma n^\xi$, and $\xi' = \xi$.

Substituting the parameters of $M_n^G$ given by Proposition 2 into Proposition 1 gives an expression for $m_n^G$.

Proposition 3. Let $G = \mathrm{GEV}_{(\mu,\sigma,\xi)}$ where $\xi < 1$. Then
\[ m_n^G = \begin{cases} \mu + \frac{\sigma}{\xi}\left(n^\xi \Gamma(1 - \xi) - 1\right) & \text{if } \xi \ne 0 \\ \mu + \sigma\gamma + \sigma \ln(n) & \text{otherwise.} \end{cases} \]

It follows that
• for $\xi > 0$, $m_n^G$ is $\Theta(n^\xi)$;
• for $\xi = 0$, $m_n^G$ is $\Theta(\ln(n))$; and
• for $\xi < 0$, $m_n^G = \mu - \frac{\sigma}{\xi} - \Theta(n^\xi)$.

It is useful to have a visual picture of what Proposition 3 means. Figure 1 plots $m_n^G$ as a function of $n$ for three GEV distributions with $\mu = 0$, $\sigma = 1$, and $\xi \in \{0.1, 0, -0.1\}$.

The central result of extreme value theory is the following theorem.

[Figure 1 (plot): expected maximum (vertical axis, 0 to 20) versus ln(n) (horizontal axis, 0 to 10), with one curve each for Shape = 0.1, Shape = 0, and Shape = -0.1.]
Figure 1: The effect of the shape parameter ($\xi$) on the expected maximum of $n$ independent draws from a GEV distribution.
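As a numerical illustration of Proposition 3, the following minimal Python sketch evaluates the closed form for $m_n^G$ and tabulates it for the three parameter settings plotted in Figure 1. The function name `expected_max` and the tabulation loop are illustrative choices, not part of the paper.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler's constant (gamma in Proposition 1)

def expected_max(n, mu, sigma, xi):
    """Expected maximum m_n^G of n draws from GEV(mu, sigma, xi), per Proposition 3."""
    if xi >= 1:
        raise ValueError("Proposition 3 requires xi < 1")
    if xi == 0:
        return mu + sigma * EULER_GAMMA + sigma * math.log(n)
    return mu + (sigma / xi) * (n ** xi * math.gamma(1 - xi) - 1)

# Tabulate the three settings of Figure 1 (mu = 0, sigma = 1) against ln(n).
for ln_n in (0, 2, 4, 6, 8, 10):
    n = math.exp(ln_n)
    row = [expected_max(n, 0.0, 1.0, xi) for xi in (0.1, 0.0, -0.1)]
    print(ln_n, [round(v, 2) for v in row])
```

For $\xi < 0$ the values stay below $\mu - \sigma/\xi$, for $\xi = 0$ they grow linearly in $\ln(n)$, and for $\xi > 0$ they grow polynomially in $n$, matching the three growth regimes listed after Proposition 3.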

The Extremal Types Theorem. Let $G$ be an arbitrary cumulative distribution function, and suppose there exist sequences of constants $\{a_n > 0\}$ and $\{b_n\}$ such that
\[ \lim_{n \to \infty} P\left[\frac{M_n^G - b_n}{a_n} \le z\right] = G^*(z) \qquad (1) \]
for any continuity point $z$ of $G^*$, where $G^*$ is not a point mass. Then there exist constants $\mu$, $\sigma > 0$, and $\xi$ such that $G^*(z) = \mathrm{GEV}_{(\mu,\sigma,\xi)}(z)$ $\forall z$. Furthermore,
\[ \lim_{n \to \infty} P\left[M_n^G \le z\right] = \mathrm{GEV}_{(\mu a_n + b_n,\, \sigma a_n,\, \xi)}(z) . \]

Condition (1) holds for a variety of distributions including the normal, lognormal, uniform, and Cauchy distributions.

3. The Max k-Armed Bandit Problem

Definition (max k-armed bandit instance). An instance $I = (n, \mathcal{G})$ of the max k-armed bandit problem is an ordered pair whose first element is a positive integer $n$, and whose second element is a $k$-tuple $\mathcal{G} = (G_1, G_2, \ldots, G_k)$ of cumulative distribution functions, each thought of as an arm on a slot machine. The $i$th arm, when pulled, returns a sample drawn independently at random from $G_i$.

Definition (max k-armed bandit strategy). A max k-armed bandit strategy $S$ is an algorithm that, given an instance $I = (n, \mathcal{G})$ of the max k-armed bandit problem, performs a sequence of $n$ arm-pulls. For any strategy $S$ and integer $\ell \le n$, we denote by $S_\ell(I)$ the expected maximum payoff obtained by running $S$ on $I$ for $\ell$ trials:
\[ S_\ell(I) = E\left[\max_{0 \le j \le \ell} p_j\right] \]
where $p_j$ is the payoff obtained from the $j$th pull, and we define $p_0 = 0$.

Our goal is to come up with a strategy $S$ such that $S_n(I)$ is near-maximal. Note that the problem is ill-posed (i.e., there is no clear criterion for preferring one strategy over another) unless we make some assumptions about the distributions $G_i$. We will assume that each arm $G_i = \mathrm{GEV}_{(\mu_i,\sigma_i,\xi_i)}$ is a GEV distribution whose parameters satisfy
1. $|\mu_i| \le \mu_u$
2. $0 < \sigma_\ell \le \sigma_i \le \sigma_u$
3. $\xi_\ell \le \xi_i \le \xi_u < \frac{1}{2}$
for known constants $\mu_u$, $\sigma_\ell$, $\sigma_u$, $\xi_\ell$, and $\xi_u$.

There are two arguments for assuming that each arm is a GEV distribution. First, in practice the distribution of payoffs returned by a strong heuristic may be approximately GEV, even if the conditions of the Extremal Types Theorem are not formally satisfied (Cicirello & Smith 2004). A second argument runs as follows. Suppose $I = (n, \mathcal{G})$ is an instance of the max k-armed bandit problem in which each distribution $G_i \in \mathcal{G}$ satisfies condition (1) of the Extremal Types Theorem. Consider the instance $\bar{I} = (\frac{n}{m}, \bar{\mathcal{G}})$, where $\bar{\mathcal{G}} = (\bar{G}_1, \bar{G}_2, \ldots, \bar{G}_k)$, and arm $\bar{G}_i$ returns the maximum payoff obtained by pulling the corresponding arm $G_i$ $m$ times. Effectively, $\bar{I}$ is a restricted version of $I$ in which the arms must be pulled in batches of size $m$, rather than in any arbitrary order. For $m$ sufficiently large, the Extremal Types Theorem guarantees that for each $i$, $\bar{G}_i \approx \mathrm{GEV}_{(\mu_i,\sigma_i,\xi_i)}$ for some constants $\mu_i$, $\sigma_i$, and $\xi_i$. Thus, the instance $I' = (\frac{n}{m}, \mathcal{G}')$ with $\mathcal{G}' = (G'_1, G'_2, \ldots, G'_k)$ and $G'_i = \mathrm{GEV}_{(\mu_i,\sigma_i,\xi_i)}$ is approximately equivalent to $\bar{I}$ and satisfies our distributional assumptions.

The purpose of the restrictions on the parameters $\mu_i$, $\sigma_i$, and $\xi_i$ is to ensure that each GEV distribution has finite, bounded mean and variance.

4. An Asymptotically Optimal Algorithm

In this section we will analyze the max k-armed bandit strategy $S^1$ shown below. Our analysis will take a different form depending on whether each GEV distribution has shape parameter $\xi < 0$, $\xi = 0$, or $\xi > 0$. Although we will analyze all three cases, it is worth noting that the case $\xi < 0$ is the only one that can arise in practice. This is true because in any real combinatorial optimization problem the maximum payoff is bounded from above, which (by Proposition 3) can only happen when $\xi < 0$.

Strategy $S^1(\epsilon, \delta)$:
1. (Exploration) For each arm $G_i \in \mathcal{G}$:
   (a) Using $t = O\left(\ln(1/\delta) \ln(n)^2 / \epsilon^2\right)$ samples of $G_i$, obtain an estimate $\bar{m}_n^{G_i}$ of $m_n^{G_i}$. Assuming that arm $G_i$ has shape parameter $\xi_i \le 0$, our estimate will have the property that
   \[ P\left[\left|\bar{m}_n^{G_i} - m_n^{G_i}\right| < \epsilon\right] \ge 1 - \delta . \]
2. (Exploitation) Set $\hat{\imath} = \arg\max_{1 \le i \le k} \bar{m}_n^{G_i}$, and pull arm $G_{\hat{\imath}}$ for the remaining $n - tk$ trials.

If an arm $G_i$ has shape parameter $\xi_i > 0$, the estimate obtained in step 1 (a) will instead have the property that
\[ P\left[\frac{1}{1+\epsilon} < \frac{\bar{m}_n - \alpha_1}{m_n - \alpha_1} < 1 + \epsilon\right] \ge 1 - \delta \]
for constant $\alpha_1$ independent of $n$.

The following theorem shows that with appropriate settings of $\epsilon$ and $\delta$, strategy $S^1$ is asymptotically optimal when each arm has shape parameter $\xi_i \le 0$. In Appendix A, we establish a similar guarantee (using the same parameter settings) when one or more arms have $\xi_i > 0$.

Theorem 1. Let $I = (n, \mathcal{G})$ be an instance of the max k-armed bandit problem, where $\mathcal{G} = (G_1, G_2, \ldots, G_k)$ and $G_i = \mathrm{GEV}_{(\mu_i,\sigma_i,\xi_i)}$. Let
• $m_n^* = \max_{1 \le i \le k} m_n^{G_i}$,
• $\xi_{\max} = \max_{1 \le i \le k} \xi_i$, and
• $S = S^1\left(\sqrt[3]{\frac{k}{n}},\ \frac{1}{kn^2}\right)$.
Then if $\xi_{\max} \le 0$,
\[ S_n(I) = m_n^* - O(\Delta) \]
where $\Delta = \ln(nk) \ln(n)^2 \sqrt[3]{\frac{k}{n}}$.
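The explore-then-exploit structure of strategy $S^1$ can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the paper's estimator: the helper `estimate_mn_gumbel` is a hypothetical simplification that is valid only for arms with $\xi = 0$, where Proposition 3 gives $m_n = m_1 + (m_2 - m_1)\log_2(n)$, and it uses plain sample means with no confidence boosting. The names `gumbel_arm` and `strategy_s1`, the arm parameters, and the budget values are all illustrative.

```python
import math
import random

def gumbel_arm(mu, sigma):
    """Return a sampler for GEV(mu, sigma, 0), i.e. a Gumbel distribution."""
    def pull(rng):
        u = rng.uniform(1e-12, 1.0 - 1e-12)  # keep both logarithms finite
        return mu - sigma * math.log(-math.log(u))  # inverse-CDF sampling
    return pull

def estimate_mn_gumbel(samples, n):
    """Hypothetical simplified estimate of m_n, valid only when xi = 0."""
    m1 = sum(samples) / len(samples)
    # Maxima of disjoint pairs estimate m_2, the expected maximum of two draws.
    pair_maxima = [max(samples[j], samples[j + 1])
                   for j in range(0, len(samples) - 1, 2)]
    m2 = sum(pair_maxima) / len(pair_maxima)
    return m1 + (m2 - m1) * math.log2(n)

def strategy_s1(arms, n, t, estimate_mn, rng):
    """Pull each arm t times, then pull the arm with the best estimate of m_n."""
    k = len(arms)
    if t * k > n:
        raise ValueError("exploration budget t*k exceeds n")
    best = 0.0  # the paper defines p_0 = 0
    estimates = []
    for pull in arms:  # step 1: exploration
        samples = [pull(rng) for _ in range(t)]
        best = max(best, max(samples))
        estimates.append(estimate_mn(samples, n))
    i_hat = max(range(k), key=lambda i: estimates[i])
    for _ in range(n - t * k):  # step 2: exploitation
        best = max(best, arms[i_hat](rng))
    return best, i_hat

rng = random.Random(0)
arms = [gumbel_arm(0.0, 1.0), gumbel_arm(20.0, 1.0)]
best, i_hat = strategy_s1(arms, n=5000, t=500,
                          estimate_mn=estimate_mn_gumbel, rng=rng)
print(i_hat, round(best, 2))
```

Passing the estimator in as a parameter keeps the allocation logic separate from the estimation procedure, which is the part the paper analyzes in detail.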

Proof. Let $\hat{m}_n = m_n^{G_{\hat{\imath}}}$ (where $\hat{\imath}$ is the arm selected for exploitation in step 2). Then $\hat{m}_{n-tk}$ is the expected maximum payoff obtained during the exploitation step, so $S_n(I) \ge \hat{m}_{n-tk}$.

Claim 1. $\hat{m}_n - \hat{m}_{n-tk}$ is $O(\frac{tk}{n})$.

Proof of Claim 1. Let $\mu = \mu_{\hat{\imath}}$, $\sigma = \sigma_{\hat{\imath}}$, and $\xi = \xi_{\hat{\imath}}$ be the parameters of the arm selected for exploitation. Suppose $\xi = 0$. Then by Proposition 3, $\hat{m}_n - \hat{m}_{n-tk} = \sigma(\ln(n) - \ln(n - tk))$. Thus for $n$ sufficiently large,
\begin{align*}
\hat{m}_n - \hat{m}_{n-tk} &= \sigma\left(\ln(n) - \ln(n - tk)\right) \\
&= -\sigma \ln\left(\frac{n - tk}{n}\right) \\
&= -\sigma \ln\left(1 - \frac{tk}{n}\right) \\
&< 2\sigma \frac{tk}{n} \\
&= O\left(\frac{tk}{n}\right)
\end{align*}
where on the fourth line we have used the fact that for $n$ sufficiently large, $\frac{tk}{n} < \frac{1}{2}$, and for $0 < x < \frac{1}{2}$, $-2x < \ln(1 - x) \le -x$.

Now suppose $\xi < 0$. By Proposition 3, $\hat{m}_n - \hat{m}_{n-tk} = \frac{\sigma}{\xi}\Gamma(1 - \xi)\left(n^\xi - (n - tk)^\xi\right) = O\left((n - tk)^\xi - n^\xi\right)$, where we have used the fact that $\frac{\sigma}{\xi}\Gamma(1 - \xi) < 0$. Expanding $(n - tk)^\xi$ in powers of $tk$ about $tk = 0$ gives
\[ (n - tk)^\xi = n^\xi - \xi n^{\xi - 1} tk + O\left((tk)^2 n^{\xi - 2}\right) . \]
Because $\xi \le 0$ and $|\xi|$ is bounded, it follows that $(n - tk)^\xi - n^\xi$ is $O(\frac{tk}{n})$.

With probability at least $1 - k\delta$, all estimates obtained during the exploration phase are within $\epsilon$ of the correct values, so that $m_n^* - \hat{m}_n < 2\epsilon$. Assuming $m_n^* - \hat{m}_n < 2\epsilon$, it follows that
\begin{align*}
m_n^* - \hat{m}_{n-tk} &= (m_n^* - \hat{m}_n) + (\hat{m}_n - \hat{m}_{n-tk}) \\
&< 2\epsilon + O\left(\frac{tk}{n}\right) \\
&= 2\epsilon + O\left(\frac{k}{n} \ln\left(\frac{1}{\delta}\right) \frac{(\ln n)^2}{\epsilon^2}\right) \\
&= O(\Delta)
\end{align*}
where on the second line we have used Claim 1. Thus with probability at least $1 - k\delta$, our expected maximum payoff is at least $m_n^* - O(\Delta)$. Therefore,
\begin{align*}
S_n(I) &\ge (1 - k\delta)\left(m_n^* - O(\Delta)\right) \\
&\ge m_n^* - O(\Delta) - k\delta m_n^* \\
&= m_n^* - o(1)
\end{align*}
where on the last line we have used the fact that $\Delta = o(1)$ and the fact that for $\xi \le 0$, $m_n^*$ is $O(\log n)$, so that $k\delta m_n^* = \frac{m_n^*}{n^2} = o(1)$.

Theorem 1 completes our analysis of the performance of $S^1$. It remains only to describe how the estimates in step 1 (a) are obtained.

4.1. Estimating $m_n$

We adopt the following notation:
• Let $G = \mathrm{GEV}_{(\mu,\sigma,\xi)}$ denote a GEV distribution with (unknown) parameters $\mu$, $\sigma$, and $\xi$ satisfying the conditions stated in §3, and
• let $m_i = m_i^G$.

To estimate $m_n$, we first obtain an accurate estimate of $\xi$. Then
1. if $\xi = 0$ (so that the growth of $m_n$ as a function of $\ln n$ is linear), we estimate $m_n$ by first estimating $m_1$ and $m_2$, then performing linear interpolation;
2. otherwise we estimate $m_n$ by first estimating $m_1$, $m_2$, and $m_4$, then performing a nonlinear interpolation.

4.1.1. Estimating $m_i$ for $i \in \{1, 2, 4\}$

The following two lemmas use well-known ideas to efficiently estimate $m_i$ for small values of $i$.

Lemma 1. For any fixed positive integer $i$, $O\left(\frac{1}{\epsilon^2}\right)$ draws from $G$ suffice to obtain an estimate $\bar{m}_i$ of $m_i$ such that
\[ P[|\bar{m}_i - m_i| < \epsilon] \ge \frac{3}{4} . \]

Proof. First consider the special case $i = 1$. Let $X$ denote the sum of $t$ draws from $G$, for some to-be-specified positive integer $t$. Then $E[X] = m_1 t$ and $Var[X] = \tilde{\sigma}^2 t$, where $\tilde{\sigma}$ is the (unknown) standard deviation of $G$ ($\tilde{\sigma}$ is proportional to, but not the same as, the scale parameter $\sigma$). We take $\bar{m}_1 = \frac{X}{t}$ as our estimate of $m_1$. Then
\begin{align*}
P[|\bar{m}_1 - m_1| \ge \epsilon] &= P[|t\bar{m}_1 - tm_1| \ge t\epsilon] \\
&= P\left[|X - E[X]| \ge \frac{\epsilon\sqrt{t}}{\tilde{\sigma}} \sqrt{Var[X]}\right] \\
&\le \frac{\tilde{\sigma}^2}{\epsilon^2 t}
\end{align*}
where the last inequality is Chebyshev's. Thus to guarantee $P[|\bar{m}_1 - m_1| \ge \epsilon] \le \frac{1}{4}$ we must set $t = \frac{4\tilde{\sigma}^2}{\epsilon^2} = O\left(\frac{1}{\epsilon^2}\right)$ (note that due to the assumptions in §3, $\tilde{\sigma}$ is $O(1)$).

For $i > 1$, we let $X$ be the sum of $t$ block maxima (each the maximum of $i$ independent draws from $G$). Because the standard deviation of $M_i$ and $i$ itself are both $O(1)$, the lemma follows.

To boost the probability that $|\bar{m}_i - m_i| < \epsilon$ from $\frac{3}{4}$ to $1 - \delta$, we use the "median of means" method.

Lemma 2. Let $i$ be a positive integer and let $\epsilon > 0$ and $\delta \in (0, 1)$ be real numbers. Then $O\left(\ln\left(\frac{1}{\delta}\right) \frac{i}{\epsilon^2}\right)$ draws from $G$ suffice to obtain an estimate $\bar{m}_i$ of $m_i$ such that
\[ P[|\bar{m}_i - m_i| < \epsilon] \ge 1 - \delta . \]

Proof. We invoke Lemma 1 $r$ times (for $r$ to be determined), yielding a set $E = \{\bar{m}_i^{(1)}, \bar{m}_i^{(2)}, \ldots, \bar{m}_i^{(r)}\}$ of estimates of $m_i$. Let $\bar{m}_i$ be the median element of $E$. Let $\mathcal{A} = \{\bar{m}_i^{(j)} \in E : |\bar{m}_i^{(j)} - m_i| < \epsilon\}$ be the set of "accurate" estimates of $m_i$; and let $A = |\mathcal{A}|$. Then $|\bar{m}_i - m_i| \ge \epsilon$ implies $A \le \frac{r}{2}$,

while $E[A] \ge \frac{3}{4} r$. Using the standard Chernoff bound, we have
\[ P[|\bar{m}_i - m_i| \ge \epsilon] \le P\left[A \le \frac{r}{2}\right] \le \exp\left(-\frac{r}{C}\right) \]
for constant $C > 0$. Thus $r = O\left(\ln\left(\frac{1}{\delta}\right)\right)$ repetitions suffice to ensure $P[|\bar{m}_i - m_i| > \epsilon] \le \delta$.

4.1.2. Estimating $m_n$ when $\xi = 0$

Lemma 3. Assume $G$ has shape parameter $\xi = 0$. Let $n$ be a positive integer and let $\epsilon > 0$ and $\delta \in (0, 1)$ be real numbers. Then $O\left(\ln\left(\frac{1}{\delta}\right) \frac{\ln(n)^2}{\epsilon^2}\right)$ draws from $G$ suffice to obtain an estimate $\bar{m}_n$ of $m_n$ such that
\[ P[|\bar{m}_n - m_n| < \epsilon] \ge 1 - \delta . \]

Proof. By Proposition 3, $m_i = \mu + \sigma\gamma + \sigma \ln(i)$. Thus
\[ m_n = m_1 + (m_2 - m_1) \log_2(n) . \qquad (2) \]
Let $\bar{m}_1$ and $\bar{m}_2$ be estimates of $m_1$ and $m_2$, respectively, and let $\bar{m}_n$ be the estimate of $m_n$ obtained by plugging $\bar{m}_1$ and $\bar{m}_2$ into (2). Define $\Delta_i = |\bar{m}_i - m_i|$ for $i \in \{1, 2, n\}$. Then
\[ \Delta_n \le (1 + \log_2(n))(\Delta_1 + \Delta_2) . \]
Thus to guarantee $P[\Delta_n < \epsilon] \ge 1 - \delta$, it suffices that $P\left[\Delta_i \le \frac{\epsilon}{2(1 + \log_2(n))}\right] \ge 1 - \frac{\delta}{2}$ for all $i \in \{1, 2\}$. By Lemma 2, this requires $O\left(\ln\left(\frac{1}{\delta}\right) \frac{(\ln n)^2}{\epsilon^2}\right)$ draws from $G$.

4.1.3. Estimating $m_n$ when $\xi \ne 0$

For the purpose of the proofs presented here, we will make a minor assumption concerning an arm's shape parameter $\xi_i$: we assume that for some known constant $\xi^* > 0$,
\[ |\xi_i| < \xi^* \Rightarrow \xi_i = 0 . \]
Removing this assumption does not fundamentally change the results, but it makes the proofs more complicated.

Lemma 5 shows how to efficiently estimate $\xi$. Lemmas 6 and 7 show how to efficiently estimate $m_n$ in the cases $\xi < 0$ and $\xi > 0$, respectively. We will use the following lemma.

Lemma 4. $m_4 - m_2 \ge \frac{1}{4}\sigma_\ell$ and $m_2 - m_1 \ge \frac{1}{8}\sigma_\ell$.

Proof. See Appendix A.

Lemma 5. For real numbers $\epsilon > 0$ and $\delta \in (0, 1)$, $O\left(\ln\left(\frac{1}{\delta}\right) \frac{1}{\epsilon^2}\right)$ draws from $G$ suffice to obtain an estimate $\bar{\xi}$ of $\xi$ such that
\[ P\left[|\bar{\xi} - \xi| < \epsilon\right] \ge 1 - \delta . \]

Proof. Using Proposition 3, it is straightforward to check that for any $\xi < 1$,
\[ \xi = \log_2\left(\frac{m_4 - m_2}{m_2 - m_1}\right) . \qquad (3) \]
Let $\bar{m}_1$, $\bar{m}_2$, and $\bar{m}_4$ be estimates of $m_1$, $m_2$, and $m_4$, respectively, and let $\bar{\xi}$ be the estimate of $\xi$ obtained by plugging $\bar{m}_1$, $\bar{m}_2$, and $\bar{m}_4$ into (3). Define $\Delta_m = \max_{i \in \{1,2,4\}} |\bar{m}_i - m_i|$ and define $\Delta_\xi = |\bar{\xi} - \xi|$. We wish to upper bound $\Delta_\xi$ as a function of $\Delta_m$.

In Claim 1 of Theorem 1 we showed that $|\ln(x + \beta) - \ln(x)| \le 2\frac{\beta}{x}$ for $\beta \le \frac{x}{2}$. Letting $N = m_4 - m_2$ and $D = m_2 - m_1$, and noting that $\xi = \log_2(N) - \log_2(D) = \frac{1}{\ln 2}(\ln(N) - \ln(D))$, it follows that
\[ \Delta_\xi \le \frac{1}{\ln(2)}\left(\frac{2(2\Delta_m)}{N} + \frac{2(2\Delta_m)}{D}\right) \]
for $\Delta_m < \frac{1}{2}\min(N, D)$. Thus by Lemma 4 and the assumption that $\sigma \ge \sigma_\ell$, $\Delta_\xi$ is $O(\Delta_m)$.

Define $\Delta_i = |\bar{m}_i - m_i|$, so that $\Delta_m = \max_{i \in \{1,2,4\}} \Delta_i$. Then to guarantee $P[\Delta_\xi < \epsilon] \ge 1 - \delta$, it suffices that $P[\Delta_i \le \Omega(\epsilon)] \ge 1 - \frac{\delta}{3}$ for all $i \in \{1, 2, 4\}$. By Lemma 2, this requires $O\left(\ln\left(\frac{1}{\delta}\right) \frac{1}{\epsilon^2}\right)$ draws from $G$.

Lemma 6. Assume $G$ has shape parameter $\xi \le -\xi^*$. Let $n$ be a positive integer and let $\epsilon > 0$ and $\delta \in (0, 1)$ be real numbers. Then $O\left(\ln\left(\frac{1}{\delta}\right) \frac{1}{\epsilon^2}\right)$ draws from $G$ suffice to obtain an estimate $\bar{m}_n$ of $m_n$ such that
\[ P[|\bar{m}_n - m_n| < \epsilon] \ge 1 - \delta . \]

Proof. By Proposition 3,
\[ m_i = \mu + \frac{\sigma}{\xi}\left(i^\xi \Gamma(1 - \xi) - 1\right) . \]
Define
\begin{align*}
\alpha_1 &= \mu - \sigma\xi^{-1} \\
\alpha_2 &= \sigma\xi^{-1}\Gamma(1 - \xi) \\
\alpha_3 &= 2^\xi
\end{align*}
so that
\[ m_i = \alpha_1 + \alpha_2 \alpha_3^{\log_2(i)} . \qquad (4) \]
Plugging in the values $i = 1$, $i = 2$, and $i = 4$ into (4) yields a system of three quadratic equations. Solving this system for $\alpha_1$, $\alpha_2$, and $\alpha_3$ yields
\begin{align*}
\alpha_1 &= (m_1 m_4 - m_2^2)(m_1 - 2m_2 + m_4)^{-1} \\
\alpha_2 &= (-2m_1 m_2 + m_1^2 + m_2^2)(m_1 - 2m_2 + m_4)^{-1} \\
\alpha_3 &= (m_4 - m_2)(m_2 - m_1)^{-1} .
\end{align*}
Let $\bar{m}_1$, $\bar{m}_2$, and $\bar{m}_4$ be estimates of $m_1$, $m_2$, and $m_4$, respectively. Plugging $\bar{m}_1$, $\bar{m}_2$, and $\bar{m}_4$ into the above equations yields estimates, say $\bar{\alpha}_1$, $\bar{\alpha}_2$, and $\bar{\alpha}_3$, of $\alpha_1$, $\alpha_2$, and $\alpha_3$, respectively. Define $\Delta_m = \max_{i \in \{1,2,4\}} |\bar{m}_i - m_i|$ and $\Delta_\alpha = \max_{i \in \{1,2,3\}} |\bar{\alpha}_i - \alpha_i|$.

To complete the proof, we show that $|\bar{m}_n - m_n|$ is $O(\Delta_m)$. The argument consists of two parts: in claims 1 through 3 we show that $\Delta_\alpha$ is $O(\Delta_m)$, then in Claim 4 we show that $|\bar{m}_n - m_n|$ is $O(\Delta_\alpha)$.

Claim 1. Each of the numerators in the expressions for $\alpha_1$, $\alpha_2$, and $\alpha_3$ has absolute value bounded from above, while each of the denominators has absolute value bounded from below. (The bounds are independent of the unknown parameters of $G$.)
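The closed-form solution for $\alpha_1$, $\alpha_2$, and $\alpha_3$ in the proof of Lemma 6, together with equations (3) and (4), can be checked numerically. The sketch below is illustrative only: it feeds the exact values of $m_1$, $m_2$, and $m_4$ from Proposition 3 (rather than sampled estimates) into the formulas and confirms that they recover $\alpha_1 = \mu - \sigma/\xi$, $\xi = \log_2 \alpha_3$, and $m_n$. The function names and the parameter values $\mu = 1$, $\sigma = 2$, $\xi = -0.3$ are assumptions chosen for the example.

```python
import math

def gev_expected_max(i, mu, sigma, xi):
    """m_i from Proposition 3 for xi != 0 (used here to generate exact inputs)."""
    return mu + (sigma / xi) * (i ** xi * math.gamma(1 - xi) - 1)

def alphas_from_means(m1, m2, m4):
    """Solve m_i = alpha1 + alpha2 * alpha3**log2(i) for i in {1, 2, 4}."""
    d = m1 - 2 * m2 + m4
    alpha1 = (m1 * m4 - m2 ** 2) / d
    alpha2 = (m1 - m2) ** 2 / d  # same as (-2*m1*m2 + m1**2 + m2**2) / d
    alpha3 = (m4 - m2) / (m2 - m1)
    return alpha1, alpha2, alpha3

def extrapolate_mn(m1, m2, m4, n):
    """Nonlinear interpolation of m_n from m_1, m_2, m_4 via equation (4)."""
    a1, a2, a3 = alphas_from_means(m1, m2, m4)
    return a1 + a2 * a3 ** math.log2(n)

mu, sigma, xi = 1.0, 2.0, -0.3
m1, m2, m4 = (gev_expected_max(i, mu, sigma, xi) for i in (1, 2, 4))
a1, a2, a3 = alphas_from_means(m1, m2, m4)
print(a1, mu - sigma / xi)   # alpha1 against its definition
print(math.log2(a3), xi)     # equation (3): xi = log2(alpha3)
print(extrapolate_mn(m1, m2, m4, 1000), gev_expected_max(1000, mu, sigma, xi))
```

With sampled estimates in place of the exact means, the denominators $m_1 - 2m_2 + m_4$ and $m_2 - m_1$ are where estimation error is amplified.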

Proof of claim 1. The numerators will have bounded absolute value as long as $m_1$, $m_2$, and $m_4$ are bounded. Upper bounds on $m_1$, $m_2$, and $m_4$ follow from the restrictions on the parameters $\mu$, $\sigma$, and $\xi$. As for the denominators, by Lemma 4 we have
\[ |m_1 - 2m_2 + m_4| = |(m_2 - m_1)(\alpha_3 - 1)| \ge \frac{1}{8}\sigma_\ell \left|2^{-\xi^*} - 1\right| . \]

Claim 2. Let $N$ and $D$ be fixed real numbers, and let $\beta_N$ and $\beta_D$ be real numbers with $|\beta_D| < \frac{|D|}{2}$. Then $\left|\frac{N + \beta_N}{D + \beta_D} - \frac{N}{D}\right|$ is $O(|\beta_N| + |\beta_D|)$.

Proof of claim 2. First, using the Taylor series expansion of $\frac{N}{D + \beta_D}$,
\begin{align*}
\left|\frac{N}{D + \beta_D} - \frac{N}{D}\right| &= \left|\frac{N\beta_D}{D^2} \sum_{i=0}^{\infty} (-1)^{i+1} \left(\frac{\beta_D}{D}\right)^i\right| \\
&\le \left|\frac{N\beta_D}{D^2\left(1 - |\beta_D D^{-1}|\right)}\right| \\
&= O(|\beta_D|) .
\end{align*}
Then
\begin{align*}
\left|\frac{N + \beta_N}{D + \beta_D} - \frac{N}{D}\right| &\le \left|\frac{N}{D + \beta_D} - \frac{N}{D}\right| + \left|\frac{\beta_N}{D + \beta_D}\right| \\
&= O(|\beta_N| + |\beta_D|) .
\end{align*}

Claim 3. $\Delta_\alpha$ is $O(\Delta_m)$.

Proof of claim 3. We show that $|\bar{\alpha}_1 - \alpha_1|$ is $O(\Delta_m)$. Similar arguments show that $|\bar{\alpha}_2 - \alpha_2|$ and $|\bar{\alpha}_3 - \alpha_3|$ are $O(\Delta_m)$, which proves the claim.

To see that $|\bar{\alpha}_1 - \alpha_1|$ is $O(\Delta_m)$, let $N = m_1 m_4 - m_2^2$, and let $D = m_1 - 2m_2 + m_4$, so that $\alpha_1 = \frac{N}{D}$. Define $\bar{N}$ and $\bar{D}$ in the natural way so that $\bar{\alpha}_1 = \frac{\bar{N}}{\bar{D}}$. Because $m_1$, $m_2$, and $m_4$ are all $O(1)$ (by Claim 1), it follows that both $|\bar{N} - N|$ and $|\bar{D} - D|$ are $O(\Delta_m)$. That $|\bar{\alpha}_1 - \alpha_1|$ is $O(\Delta_m)$ follows by Claim 2.

Claim 4. $|\bar{m}_n - m_n|$ is $O(\Delta_\alpha)$.

Proof of claim 4. Because $\xi_\ell \le \xi \le -\xi^*$ it must be that $0 < 2^{\xi_\ell} \le \alpha_3 \le 2^{-\xi^*} < 1$. So for $\Delta_\alpha$ sufficiently small, $0 < \bar{\alpha}_3 < 1$. Then
\begin{align*}
|\bar{m}_n - m_n| &= \left|\bar{\alpha}_1 + \bar{\alpha}_2 \bar{\alpha}_3^{\log_2(n)} - \left(\alpha_1 + \alpha_2 \alpha_3^{\log_2(n)}\right)\right| \\
&\le |\bar{\alpha}_1 - \alpha_1| + \left|\bar{\alpha}_2 \bar{\alpha}_3^{\log_2(n)} - \bar{\alpha}_2 \alpha_3^{\log_2(n)}\right| + \left|\bar{\alpha}_2 \alpha_3^{\log_2(n)} - \alpha_2 \alpha_3^{\log_2(n)}\right| \\
&\le |\bar{\alpha}_1 - \alpha_1| + |\bar{\alpha}_2| |\bar{\alpha}_3 - \alpha_3| + |\bar{\alpha}_2 - \alpha_2| \\
&= O(\Delta_\alpha)
\end{align*}
where on the third line we have used the fact that both $\alpha_3$ and $\bar{\alpha}_3$ are between 0 and 1, and in the last line we have used the fact that $|\bar{\alpha}_2|$ is $O(1)$.

Putting claims 3 and 4 together, $|\bar{m}_n - m_n|$ is $O(\Delta_m)$. Define $\Delta_i = |\bar{m}_i - m_i|$, so that $\Delta_m = \max_{i \in \{1,2,4\}} \Delta_i$. Thus to guarantee $P[|\bar{m}_n - m_n| < \epsilon] \ge 1 - \delta$, it suffices that $P[\Delta_i < \Omega(\epsilon)] \ge 1 - \frac{\delta}{3}$ for all $i \in \{1, 2, 4\}$. By Lemma 2, this requires $O\left(\ln\left(\frac{1}{\delta}\right) \frac{1}{\epsilon^2}\right)$ draws from $G$.

Lemma 7. Assume $G$ has shape parameter $\xi \ge \xi^*$. Let $n$ be a positive integer and let $\epsilon > 0$ and $\delta \in (0, 1)$ be real numbers. Then $O\left(\ln\left(\frac{1}{\delta}\right) \frac{\ln(n)^2}{\epsilon^2}\right)$ draws from $G$ suffice to obtain an estimate $\bar{m}_n$ of $m_n$ such that
\[ P\left[\frac{1}{1+\epsilon} < \frac{\bar{m}_n - \alpha_1}{m_n - \alpha_1} < 1 + \epsilon\right] \ge 1 - \delta \]
where $\alpha_1 = \mu - \frac{\sigma}{\xi}$.

Proof. See Appendix A.

Putting the results of lemmas 3, 5, 6, and 7 together, we obtain the following theorem.

Theorem 2. Let $n$ be a positive integer and let $\epsilon > 0$ and $\delta \in (0, 1)$ be real numbers. Then $O\left(\ln\left(\frac{1}{\delta}\right) \frac{\ln(n)^2}{\epsilon^2}\right)$ draws from $G$ suffice to obtain an estimate $\bar{m}_n$ of $m_n$ such that with probability at least $1 - \delta$, one of the following holds:
• $\xi \le 0$ and $|\bar{m}_n - m_n| < \epsilon$, or
• $\xi > 0$ and $\frac{1}{1+\epsilon} < \frac{\bar{m}_n - \alpha_1}{m_n - \alpha_1} < 1 + \epsilon$, where $\alpha_1 = \mu - \frac{\sigma}{\xi}$.

Proof. First, invoke Lemma 5 with parameters $\frac{\xi^*}{3}$ and $\frac{\delta}{2}$. Then invoke one of Lemmas 3, 6, or 7 (depending on the estimate $\bar{\xi}$ obtained from Lemma 5) with parameters $\epsilon$ and $\frac{\delta}{2}$.

Theorem 2 shows that step 1 (a) of strategy $S^1$ can be performed as described. The following PAC bound is a corollary.

Corollary 1. Let $I = (n, \mathcal{G})$ be an instance of the max k-armed bandit problem, where $\mathcal{G} = (G_1, G_2, \ldots, G_k)$ and $G_i = \mathrm{GEV}_{(\mu_i,\sigma_i,\xi_i)}$. If $\xi_i \le 0$ $\forall i$, then $O\left(k \ln\left(\frac{k}{\delta}\right) \frac{\ln(n)^2}{\epsilon^2}\right)$ pulls suffice to identify an arm $\hat{\imath}$ such that with probability at least $1 - \delta$, $m_n^* - m_n^{\hat{\imath}} < \epsilon$.

5. Conclusions

The max k-armed bandit problem is a variant of the classical k-armed bandit problem with practical applications to combinatorial optimization. Motivated by extreme value theory, we studied a restricted version of this problem in which each arm yields payoff drawn from a GEV distribution. We derived PAC bounds on the sample complexity of estimating $m_n$, the expected maximum of $n$ draws from a GEV distribution. Using these bounds, we showed that a simple algorithm for the max k-armed bandit problem is asymptotically optimal. Ours is the first algorithm for this problem with rigorous asymptotic performance guarantees.

Acknowledgment. This work was sponsored in part by the National Science Foundation under contract #9900298 and the CMU Robotics Institute.

References

Berry, D. A., and Fristedt, B. 1986. Bandit Problems: Sequential Allocation of Experiments. London: Chapman and Hall.
Cicirello, V. A., and Smith, S. F. 2004. Heuristic selection for stochastic search optimization: Modeling solution quality by extreme value theory. In Proceedings of the 10th International Conference on Principles and Practice of Constraint Programming, 197–211.
Cicirello, V. A., and Smith, S. F. 2005. The max k-armed bandit: A new model of exploration applied to search heuristic selection. In Proceedings of AAAI 2005, 1355–1361.
Coles, S. 2001. An Introduction to Statistical Modeling of Extreme Values. London: Springer-Verlag.
Fong, P. W. L. 1995. A quantitative study of hypothesis selection. In International Conference on Machine Learning, 226–234.
Kaelbling, L. P. 1993. Learning in Embedded Systems. Cambridge, MA: The MIT Press.
Robbins, H. 1952. Some aspects of sequential design of experiments. Bulletin of the American Mathematical Society 58:527–535.

Appendix A

Theorem 3. Let $I = (n, \mathcal{G})$ be an instance of the max k-armed bandit problem, where $\mathcal{G} = (G_1, G_2, \ldots, G_k)$ and $G_i = \mathrm{GEV}_{(\mu_i,\sigma_i,\xi_i)}$. Let
• $m_n^* = \max_{1 \le i \le k} m_n^{G_i}$,
• $\xi_{\max} = \max_{1 \le i \le k} \xi_i$, and
• $S = S^1\left(\sqrt[3]{\frac{k}{n}},\ \frac{1}{kn^2}\right)$.
Then if $\xi_{\max} > 0$,
\[ \frac{S_n(I) - \alpha_1}{m_n^* - \alpha_1} = 1 - O(\Delta) \]
where $\Delta = \ln(nk) \ln(n)^2 \sqrt[3]{\frac{k}{n}}$ and $\alpha_1 = \max_{1 \le i \le k} \alpha_1^i$, where $\alpha_1^i = \mu_i - \frac{\sigma_i}{\xi_i}$.

Proof. For the moment, let us assume that all arms have shape parameter $\xi_i > 0$. Let $A$ be the event (which occurs with probability at least $1 - k\delta$) that all estimates obtained in step 1 (a) satisfy the inequality in Theorem 2.

Claim 1. To prove the theorem, it suffices to show that $A$ implies $\frac{m_n^* - \alpha_1}{\hat{m}_{n-tk} - \alpha_1} = 1 + O(\Delta)$.

Proof of claim 1. Because $S_n(I) \ge \hat{m}_{n-tk}$ and the event $A$ occurs with probability at least $1 - k\delta$, it suffices to show that $A$ implies
\[ \frac{(1 - \delta k)\hat{m}_{n-tk} - \alpha_1}{m_n^* - \alpha_1} = 1 - O(\Delta) . \]
Because $\delta k \left(\frac{\hat{m}_{n-tk}}{m_n^* - \alpha_1}\right)$ is $O\left(\frac{1}{n^2}\right) = o(\Delta)$, it suffices to show that $A$ implies
\[ \frac{\hat{m}_{n-tk} - \alpha_1}{m_n^* - \alpha_1} = 1 - O(\Delta) . \]
This can be rewritten as $m_n^* - \alpha_1 = (\hat{m}_{n-tk} - \alpha_1)\frac{1}{1 - O(\Delta)} = (\hat{m}_{n-tk} - \alpha_1)(1 + O(\Delta))$ (we can replace $\frac{1}{1 - O(\Delta)}$ with $1 + O(\Delta)$ because for $r < \frac{1}{2}$, $\frac{1}{1-r} = 1 + \frac{r}{1-r} < 1 + 2r$).

Claim 2. $\frac{\hat{m}_n - \hat{\alpha}_1}{\hat{m}_{n-tk} - \hat{\alpha}_1} = 1 + O(\Delta)$.

Proof of claim 2. Using Proposition 3,
\begin{align*}
\ln\left(\frac{\hat{m}_n - \hat{\alpha}_1}{\hat{m}_{n-tk} - \hat{\alpha}_1}\right) &= \ln\left(\frac{n^\xi}{(n - tk)^\xi}\right) \\
&= \xi\left(\ln(n) - \ln(n - tk)\right) \\
&= O\left(\frac{tk}{n}\right) \\
&= O(\Delta) .
\end{align*}
The claim follows from the fact that $\exp(\beta) < 1 + \frac{3}{2}\beta$ for $\beta < \frac{1}{2}$, so that $\exp(O(\Delta)) = 1 + O(\Delta)$.

Claim 3. $A$ implies that for all $i$,
\[ \frac{\bar{m}_n^i - \alpha_1}{m_n^i - \alpha_1} < 1 + \epsilon . \]

Proof of claim 3. By definition, $\alpha_1 = \alpha_1^i - \beta$ for some $\beta \ge 0$. The claim follows from the fact that for positive $N$ and $D$ and $\beta \ge 0$, $\frac{N}{D} < 1 + \epsilon$ implies $\frac{N + \beta}{D + \beta} < 1 + \epsilon$.

Claim 4. $A$ implies $\frac{m_n^* - \alpha_1}{\hat{m}_{n-tk} - \alpha_1} = 1 + O(\Delta)$.

Proof of claim 4.
\begin{align*}
\frac{m_n^* - \alpha_1}{m_{n-tk}^{\hat{\imath}} - \alpha_1} &= \frac{m_n^* - \alpha_1}{\bar{m}_n^* - \alpha_1} \cdot \frac{\bar{m}_n^* - \alpha_1}{\bar{m}_n^{\hat{\imath}} - \alpha_1} \cdot \frac{\bar{m}_n^{\hat{\imath}} - \alpha_1}{m_n^{\hat{\imath}} - \alpha_1} \cdot \frac{m_n^{\hat{\imath}} - \alpha_1}{m_{n-tk}^{\hat{\imath}} - \alpha_1} \\
&\le (1 + \epsilon) \cdot 1 \cdot (1 + \epsilon) \cdot (1 + O(\Delta)) \\
&= 1 + O(\Delta)
\end{align*}
where in the second step we have used claims 2 and 3.

Putting claims 1 and 4 together completes the proof. To remove the assumption that all arms have $\xi_i > 0$, we need to show that $A$ implies that for $n$ sufficiently large, the arms $\hat{\imath}$ and $i^*$ (the only arms that play a role in the proof) will have shape parameters $> 0$. This follows from the fact that if $\xi_i \le 0$, $m_n^i$ is $O(\ln(n))$, while if $\xi_i > \xi^* > 0$, $m_n^i$ is $\Omega(n^{\xi_i})$.

Lemma 4. $m_4 - m_2 \ge \frac{1}{4}\sigma_\ell$ and $m_2 - m_1 \ge \frac{1}{8}\sigma_\ell$.

Proof. If $\xi = 0$, then by Proposition 3, $m_4 - m_2 = m_2 - m_1 = \ln(2)\sigma$ and we are done. Otherwise,
\[ m_4 - m_2 = \sigma(2^\xi - 1)\xi^{-1}\Gamma(1 - \xi) \quad \text{and} \quad m_2 - m_1 = \sigma(4^\xi - 2^\xi)\xi^{-1}\Gamma(1 - \xi) . \]
It thus suffices to prove the following claim.

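Claim 1.
\[ \min_{\xi < \frac{1}{2}} \left(\frac{2^\xi - 1}{\xi}\right) \Gamma(1 - \xi) \ge \frac{1}{4}, \quad \text{and} \quad \min_{\xi < \frac{1}{2}} \left(\frac{4^\xi - 2^\xi}{\xi}\right) \Gamma(1 - \xi) \ge \frac{1}{8} . \]

Proof of claim 1. We state without proof the following properties of the $\Gamma$ function:
\begin{align*}
\Gamma(z) &\ge z\,[\ldots]\,! \quad \forall z \ge 2 \\
\Gamma(z) &\ge \tfrac{1}{2} \quad \forall z > 0 .
\end{align*}
Making the change of variable $y = -\xi$, it suffices to show
\[ \min_{y > -\frac{1}{2}} \left(\frac{1 - 2^{-y}}{y}\right) \Gamma(1 + y) \ge \frac{1}{4}, \quad \text{and} \qquad (5) \]
\[ \min_{y > -\frac{1}{2}} \left(\frac{2^{-y}(1 - 2^{-y})}{y}\right) \Gamma(1 + y) \ge \frac{1}{8} . \qquad (6) \]
(5) holds because for $-\frac{1}{2} < y \le 1$,
\[ \left(\frac{1 - 2^{-y}}{y}\right) \Gamma(1 + y) \ge \frac{1}{2}\Gamma(1 + y) \ge \frac{1}{4}, \]
while for $y > 1$,
\[ \left(\frac{1 - 2^{-y}}{y}\right) \Gamma(1 + y) \ge \frac{(y + 1)\,[\ldots]\,!}{2y} \ge \frac{1}{2} . \]
Similarly, (6) holds because for $-\frac{1}{2} < y \le 1$,
\[ \left(\frac{2^{-y}(1 - 2^{-y})}{y}\right) \Gamma(1 + y) \ge \frac{1}{8}, \]
while for $y > 1$,
\[ \left(\frac{2^{-y}(1 - 2^{-y})}{y}\right) \Gamma(1 + y) \ge \frac{(y + 1)\,[\ldots]\,!}{2y(2^y)} \ge \frac{1}{8} . \]

Lemma 7. Assume $G$ has shape parameter $\xi \ge \xi^*$. Let $n$ be a positive integer and let $\epsilon > 0$ and $\delta \in (0, 1)$ be real numbers. Then $O\left(\ln\left(\frac{1}{\delta}\right) \frac{\ln(n)^2}{\epsilon^2}\right)$ draws from $G$ suffice to obtain an estimate $\bar{m}_n$ of $m_n$ such that
\[ P\left[\frac{1}{1+\epsilon} < \frac{\bar{m}_n - \alpha_1}{m_n - \alpha_1} < 1 + \epsilon\right] \ge 1 - \delta \]
where $\alpha_1 = \mu - \frac{\sigma}{\xi}$.

Proof. We use the same estimation procedure as in the proof of Lemma 6. Let $\alpha_1$, $\alpha_2$, $\alpha_3$, $\Delta_\alpha$, and $\Delta_m$ be defined as they were in that proof.

The inequality $\frac{1}{1+\epsilon} < \frac{\bar{m}_n - \alpha_1}{m_n - \alpha_1} < 1 + \epsilon$ is the same as $\left|\ln\left(\frac{\bar{m}_n - \alpha_1}{m_n - \alpha_1}\right)\right| < \ln(1 + \epsilon)$. For $\epsilon < \frac{1}{2}$, $\ln(1 + \epsilon) \ge \frac{7}{8}\epsilon$, so it suffices to guarantee that
\[ |\ln(\bar{m}_n - \alpha_1) - \ln(m_n - \alpha_1)| < \frac{7}{8}\epsilon . \]

Claim 1. $|\ln(\bar{m}_n - \alpha_1) - \ln(m_n - \alpha_1)|$ is $O(\ln(n)\Delta_\alpha)$.

Proof of claim 1. Because $|\ln(\bar{m}_n - \alpha_1) - \ln(\bar{m}_n - \bar{\alpha}_1)|$ is $O(\Delta_\alpha)$, it suffices to show that $|\ln(\bar{m}_n - \bar{\alpha}_1) - \ln(m_n - \alpha_1)|$ is $O(\ln(n)\Delta_\alpha)$. This is true because
\begin{align*}
\ln(\bar{m}_n - \bar{\alpha}_1) &= \ln\left(\bar{\alpha}_2 \bar{\alpha}_3^{\log_2(n)}\right) \\
&= \log_2(n)\ln(\bar{\alpha}_3) + \ln(\bar{\alpha}_2) \\
&= \log_2(n)\ln(\alpha_3) + \ln(\alpha_2) \pm O(\ln(n)\Delta_\alpha) \\
&= \ln\left(\alpha_2 \alpha_3^{\log_2(n)}\right) \pm O(\ln(n)\Delta_\alpha) \\
&= \ln(m_n - \alpha_1) \pm O(\ln(n)\Delta_\alpha) .
\end{align*}

Setting $\Delta_\alpha < \Omega\left(\epsilon \ln(n)^{-1}\right)$ then guarantees $|\ln(\bar{m}_n - \alpha_1) - \ln(m_n - \alpha_1)| < \frac{7}{8}\epsilon$. By Claim 3 of the proof of Lemma 6 (which did not depend on the assumption $\xi < 0$), $\Delta_\alpha$ is $O(\Delta_m)$, so we require $P\left[\Delta_m < \Omega\left(\epsilon \ln(n)^{-1}\right)\right] \ge 1 - \delta$. Define $\Delta_i = |\bar{m}_i - m_i|$, so that $\Delta_m = \max_{i \in \{1,2,4\}} \Delta_i$. It suffices that $P\left[\Delta_i < \Omega\left(\epsilon \ln(n)^{-1}\right)\right] \ge 1 - \frac{\delta}{3}$ for $i \in \{1, 2, 4\}$. By Lemma 2, ensuring this requires $O\left(\ln\left(\frac{1}{\delta}\right) \frac{\ln(n)^2}{\epsilon^2}\right)$ draws from $G$.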
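Inequalities (5) and (6) from the proof of Lemma 4 can be sanity-checked numerically. The sketch below evaluates both left-hand sides on a finite grid of $y$ values in $(-\frac{1}{2}, 20]$ and compares the smallest value found with the stated bounds; the grid range, step size, and function names are assumptions made for illustration, and a grid check of this kind is evidence rather than a proof.

```python
import math

LN2 = math.log(2)

def one_minus_pow2_over_y(y):
    """(1 - 2**(-y)) / y, with the y -> 0 limit ln(2) filled in."""
    if y == 0:
        return LN2
    # expm1 avoids cancellation when y is close to 0.
    return -math.expm1(-y * LN2) / y

def lhs5(y):
    """Left-hand side of inequality (5)."""
    return one_minus_pow2_over_y(y) * math.gamma(1 + y)

def lhs6(y):
    """Left-hand side of inequality (6)."""
    return 2.0 ** (-y) * lhs5(y)

# Grid over y in (-1/2, 20]; both expressions grow with y beyond this range.
grid = [-0.5 + 0.001 * j for j in range(1, 20501)]
min5 = min(lhs5(y) for y in grid)
min6 = min(lhs6(y) for y in grid)
print(min5 >= 0.25, min6 >= 0.125)
```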
