# Multi Armed Bandits

## Top PDF Multi Armed Bandits:

### Beyond the Hazard Rate: More Perturbation Algorithms for Adversarial Multi-armed Bandits

For the non-stochastic multi-armed bandit problem, Kujala and Elomaa (2005) and Poland (2005) both showed that using the exponential (actually double exponential/Laplace) distribution in an FTPL algorithm coupled with a standard unbiased estimation technique yields near-optimal $O(\sqrt{NT \log N})$ regret. Unbiased estimation needs access to arm probabilities that are not explicitly available when using an FTPL algorithm. Neu and Bartók (2013) introduced the geometric resampling scheme to approximate these probabilities while still guaranteeing low regret. Recently, Abernethy et al. (2015) analyzed FTPL for adversarial multi-armed bandits and provided regret bounds under the condition that the hazard rate of the perturbation distribution is bounded. This condition allowed them to consider a variety of perturbation distributions beyond the exponential, such as Gamma, Gumbel, Fréchet, Pareto, and Weibull.
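As a rough illustration of how FTPL can be combined with geometric resampling, the following Python sketch may help. It is not the implementation from any of the cited papers. It assumes losses in [0, 1], one-sided exponential perturbations, and a cap on the number of resampling draws; the function name `ftpl_geometric_resampling`, the learning rate `eta`, and the cap `max_resamples` are illustrative choices.

```python
import numpy as np

def ftpl_geometric_resampling(loss_matrix, eta=0.1, max_resamples=1000, seed=0):
    """Sketch of bandit FTPL with geometric resampling (assumed setup).

    loss_matrix has shape (T, N) with entries in [0, 1]; only the loss of
    the arm actually played is used in each round.
    """
    rng = np.random.default_rng(seed)
    horizon, n_arms = loss_matrix.shape
    est_cum_loss = np.zeros(n_arms)   # cumulative estimated losses
    pulls = np.zeros(n_arms, dtype=int)
    total_loss = 0.0
    for t in range(horizon):
        # Follow the perturbed leader: smallest perturbed estimated loss.
        z = rng.exponential(size=n_arms)
        arm = int(np.argmin(eta * est_cum_loss - z))
        loss = loss_matrix[t, arm]
        total_loss += loss
        pulls[arm] += 1
        # Geometric resampling: count fresh draws until the same arm wins
        # again. The count has mean about 1 / P(arm), so it stands in for
        # the importance weight without computing the probability.
        k = max_resamples
        for m in range(1, max_resamples + 1):
            z_new = rng.exponential(size=n_arms)
            if int(np.argmin(eta * est_cum_loss - z_new)) == arm:
                k = m
                break
        est_cum_loss[arm] += k * loss
    return total_loss, pulls
```

Swapping `rng.exponential` for another sampler (for example Gumbel or Weibull draws) is one way to experiment with the other perturbation distributions named above, under the same assumptions.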

With the rise in internet usage, there has been a corresponding rise in online retail and online advertising, evidenced by the growth of companies such as Amazon and Google. There are many optimization problems in both of these domains. For example, as a retailer, which items should we recommend to a given customer? Or, as an advertiser, how much should we bid in order to place one of our ads? Or, as an ad server, whose ads should we select to display? In general, different users will also have different preferences, which are not known in advance, and hence may need to be learned over time. Furthermore, these problems often have many players, each with their own objectives and possible actions, which suggest game-theoretic formulations. Finally, for online retailers and ad servers, mechanisms can be designed to optimize for a metric of their choosing (e.g., user satisfaction, advertiser profit, or social welfare). We will attempt to address some of these problems using the theory of multi-armed bandits.

### On the identification and mitigation of weaknesses in the Knowledge Gradient policy for multi armed bandits

The sequential nature of the problems coupled with imperfect system knowledge means that decisions cannot be evaluated alone. Effective decision-making needs to account for possible future actions and associated outcomes. While standard solution methods such as stochastic dynamic programming can in principle be used, in practice they are computationally impractical and heuristic approaches are generally required. One such approach is the knowledge gradient (KG) heuristic. Gupta and Miescke [8] originated KG for application to offline ranking and selection problems. After a period of time in which it appears to have been studied little, Frazier et al. [5] expanded on KG’s theoretical properties. It was adapted for use in online decision-making by Ryzhov et al. [14] who tested it on multi-armed bandits (MABs) with Gaussian rewards. They found that it performed well against an index policy which utilised an analytical approximation to the Gittins index; see Gittins et al. [7]. Ryzhov et al. [12] have investigated the use of KG to solve MABs with exponentially distributed rewards while Powell and Ryzhov [10] give versions for Bernoulli, Poisson and uniform rewards, though without testing performance. They propose the method as an approach to online learning problems quite generally, with particular emphasis on its ability to handle correlated arms. Initial empirical results were promising but only encompassed a limited range of models. This paper utilises an important sub-class of MABs to explore properties of the KG heuristic for online use. Our investigation reveals weaknesses in the KG approach. We inter alia propose modifications to mitigate these weaknesses.

### Efficient Benchmarking of NLP APIs using Multi armed Bandits

Comparing NLP systems to select the best one for a task of interest, such as named entity recognition, is critical for practitioners and researchers. A rigorous approach involves setting up a hypothesis testing scenario using the performance of the systems on query documents. However, often the hypothesis testing approach needs to send a large number of document queries to the systems, which can be problematic. In this paper, we present an effective alternative based on the multi-armed bandit (MAB). We propose a hierarchical generative model to represent the uncertainty in the performance measures of the competing systems, to be used by Thompson Sampling to solve the resulting MAB. Experimental results on both synthetic and real data show that our approach requires significantly fewer queries compared to the standard benchmarking technique to identify the best system according to F-measure.

### Statistical Consequences of using Multi-armed Bandits to Conduct Adaptive Educational Experiments

Multi-armed bandit (MAB) algorithms offer a potential alternative that could benefit learners in the experiment by considering the utility of different versions of content. MABs select a version for each user by optimizing expected reward. Reward is specific to the problem the MAB is applied to; in the context of an experiment, the reward is the outcome that is being used to define the effectiveness of the conditions. For example, in the experiment comparing how text versus video explanations affect performance on later problems, the reward could be defined as the score on the next problem after viewing the explanation. The MAB algorithm would then select condition assignments to maximize the proportion of students who got the next problem right. MAB algorithms are designed to solve online decision problems, where decisions are made sequentially and information about an option is acquired only by choosing that particular option (in contrast to supervised learning). Traditionally, MABs have been used for applications like selecting online ads (Tang et al., 2013), but they have also been used in education to choose what version of a system to give to each learner (Liu et al., 2014; Williams et al., 2016). Since different learners interact with the system at different times, the success (or failure) of a learner in a particular version of the system can be used to inform what version of the system to give to the next learner. If the version of the system is viewed as an experimental condition, then the algorithm will direct more students to more effective conditions over time. Because MAB algorithms make decisions sequentially, they are particularly relevant to decision making about alternative pedagogies in educational technologies, where students may access materials asynchronously. Experiments using MAB assignment have been conducted within course quizzes with the aim of increasing benefits to students (Williams et al., 2018).

### Optimizing deep learning networks using multi armed bandits

A literature review was carried out, revealing existing approaches for pruning, their strengths, and weaknesses. A key issue emerging from this review is that there is a trade-off between removing a weight or neuron and the potential reduction in accuracy. Thus, this study develops new algorithms for pruning that utilize a framework, known as a multi-armed bandit, which has been successfully applied in applications where there is a need to learn which option to select given the outcome of trials. There are several different multi-arm bandit methods, and these have been used to develop new algorithms including those based on the following types of multi-arm bandits: (i) Epsilon-Greedy (ii) Upper Confidence Bounds (UCB) (iii) Thompson Sampling and (iv) Exponential Weight Algorithm for Exploration and Exploitation (EXP3).
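To make one of the listed bandit types concrete, here is a minimal Python sketch of Thompson Sampling for arms with 0/1 rewards. It is a generic textbook-style illustration, not the pruning algorithm developed in the study. The Beta(1, 1) priors, the simulated Bernoulli rewards, and the names `thompson_sampling_bernoulli`, `true_means`, and `horizon` are assumptions made for this example.

```python
import numpy as np

def thompson_sampling_bernoulli(true_means, horizon=2000, seed=0):
    """Sketch of Beta-Bernoulli Thompson Sampling on simulated arms."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    successes = np.zeros(n_arms)
    failures = np.zeros(n_arms)
    pulls = np.zeros(n_arms, dtype=int)
    for _ in range(horizon):
        # Draw one sample per arm from its Beta posterior (prior Beta(1, 1)).
        samples = rng.beta(successes + 1.0, failures + 1.0)
        arm = int(np.argmax(samples))
        # Simulated 0/1 reward; in practice this would be the observed outcome.
        reward = float(rng.random() < true_means[arm])
        successes[arm] += reward
        failures[arm] += 1.0 - reward
        pulls[arm] += 1
    return pulls
```

For example, `thompson_sampling_bernoulli([0.2, 0.5, 0.8])` returns the number of times each simulated arm was played; most plays should go to the arm with the highest mean.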

### Multi armed bandits based on a variant of simulated annealing

Each arm $a_j$ in $A$ is sampled once per iteration $k$ and this contributes to the high sampling budget of the algorithm in (2) (as well as the original in [10]). This high sampling budget comes about because all sub-optimal actions $a_j$ are taken at each iteration $k$ before a high level of confidence to infer the best action $a^*$ is achieved at some iteration $k \ll K$. To mitigate this problem, we employ the structure of SAMW in the SOFTMIX algorithm of [7]. In SOFTMIX, $k$-sample means $\mu_i^k$ are used at the $k$-th step of the iteration (by a suitable construction), even though only a ‘winner’ action (call it $\hat{a}_k$) is actually taken at each iteration $k$. Such behaviour is typical of Stochastic Multi-Armed

### A multi-armed bandit approach for exploring partially observed networks

The linear bandit model (Rusmevichientong and Tsitsiklis 2010; Dani et al. 2008), the simplest among such models, assumes that the reward of choosing an arm is linearly dependent on its features. In linear bandits, the expected reward of an arm is calculated as the inner product of its feature vector and a parameter vector θ. However, real-world data often exhibit more complicated relationships than a linear one. Therefore, we choose k-nearest neighbor (k-NN) regression to estimate the expected reward of arms. To introduce exploration into the solution, we extend the k-armed KNN-UCB algorithm of Guan and Jiang (2018) to the structured setting. As explained in Multi-armed bandits, upper confidence bound (UCB) algorithms (Auer 2002) incorporate an exploration term by calculating a confidence bound for each arm and choosing the action corresponding to the largest confidence bound.
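The selection rule described above, picking the arm with the largest upper confidence bound, can be sketched in a few lines of Python. This is plain UCB1 on simulated 0/1 rewards, not the KNN-UCB variant the excerpt extends. The names `ucb1`, `true_means`, and `horizon` are hypothetical, and the bonus term `sqrt(2 ln t / n_i)` is the standard UCB1 choice.

```python
import math
import random

def ucb1(true_means, horizon=2000, seed=0):
    """Sketch of UCB1: play the arm with the largest mean-plus-bonus index."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms     # number of times each arm was played
    sums = [0.0] * n_arms     # total reward collected per arm

    def index(i, t):
        # Empirical mean plus an exploration bonus that shrinks with plays.
        return sums[i] / counts[i] + math.sqrt(2.0 * math.log(t) / counts[i])

    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1       # play every arm once to initialise
        else:
            arm = max(range(n_arms), key=lambda i: index(i, t))
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts
```

In a k-NN variant, the empirical mean would presumably be replaced by a k-NN regression estimate built from the arm's features; that substitution is not shown here.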

### Algorithms for the multi-armed bandit problem

In addition to including more algorithms and considering different variances and arm numbers, our study could be improved by considering settings where reward variances are not identical. Certain algorithms, such as UCB1-Tuned, are specifically designed to take into account the variance of the arms, and may therefore have an advantage in such settings. In the second half of the paper, we turned our attention to an important application of the bandit problem: clinical trials. Although clinical trials have motivated theoretical research on multi-armed bandits since Robbins’ original paper, bandit algorithms have never been evaluated as treatment allocation strategies in a clinical trial.

### Combinatorial Multi-Armed Bandit and Its Extension to Probabilistically Triggered Arms

Since our work on the CMAB model (Chen et al., 2013), several studies have addressed combinatorial multi-armed bandits or, more generally, combinatorial online learning. Qin et al. (2014) extend CMAB to contextual bandits and apply it to diversified online recommendations. Lin et al. (2014) address combinatorial actions with limited feedback. Gopalan et al. (2014) use the Thompson sampling method to tackle combinatorial online learning problems. Compared with our CMAB framework, they allow more feedback models than our semi-bandit feedback model, but they require a finite number of actions and observations, their regret contains a large constant term, and it is unclear if their framework supports approximation oracles for hard combinatorial optimization problems. Kveton et al. (2014) study linear matroid bandits, which is a subclass of the linear combinatorial bandits we discussed in Section 4.2, and they provide better regret bounds than our general bounds given in Section 4.2, because their analysis utilizes the matroid combinatorial structure. In a recent paper, Kveton et al. (2015) improve the regret bounds of the linear combinatorial bandits via a more sophisticated non-uniform sufficient sampling condition than the one we used in our paper. However, it is unclear if this technique can be applied to non-linear reward functions satisfying the bounded smoothness condition (see discussions in Section 4.2 for more details).

### Balanced Linear Contextual Bandits

The balancing technique is well-known in machine learning, especially in domain adaptation and studies in learning-theoretic frameworks (Huang et al. 2007), (Zadrozny 2004), (Cortes, Mansour, and Mohri 2010). There are a number of recent works which approach contextual bandits through the framework of causality (Bareinboim, Forney, and Pearl 2015), (Bareinboim and Pearl 2015), (Forney, Pearl, and Bareinboim 2017), (Lattimore, Lattimore, and Reid 2016). There is also a significant body of research that leverages balancing for offline evaluation and learning of contextual bandit or reinforcement learning policies from logged data (Strehl et al. 2010), (Dudík, Langford, and Li 2011), (Li et al. 2012), (Dudík et al. 2014), (Li et al. 2014), (Swaminathan and Joachims 2015), (Jiang and Li 2016), (Thomas and Brunskill 2016), (Athey and Wager 2017), (Kallus 2017), (Wang, Agarwal, and Dudík 2017), (Deshpande et al. 2017), (Kallus and Zhou 2018), (Zhou, Athey, and Wager 2018). In the offline setting, the complexity of the historical assignment policy is taken as given, and thus the difficulty of the offline evaluation and learning of optimal policies is taken as given. Therefore, these results lie at the opposite end of the spectrum from our work, which focuses on the online setting. Methods for reducing the bias due to adaptive data collection have also been studied for non-contextual multi-armed bandits (Villar, Bowden, and Wason 2015), (Nie et al. 2018), but the nature of the estimation in contextual bandits is qualitatively different. Importance weighted regression in contextual bandits was first mentioned in (Agarwal et al. 2014), but without a systematic motivation, analysis and evaluation.
To our knowledge, our paper is the first work to integrate balancing in the online contextual bandit setting, to perform a large-scale evaluation of it against direct estimation method baselines with theoretical guarantees and to provide a theoretical characterization of balanced contextual bandits that match the regret bound of their direct method counterparts. The effect of importance weighted regression is also evaluated in (Bietti, Agarwal, and Langford 2018), but this is a successor to the extended version of our paper (Dimakopoulou et al. 2017).

### Bandits with Delayed, Aggregated Anonymous Feedback

of the K possible arms. In the classic stochastic MAB setting, the player immediately observes stochastic feedback from the pulled arm in the form of a ‘reward’ which can be used to improve the decisions in subsequent rounds. One of the main application areas of MABs is in online advertising. Here, the arms correspond to adverts, and the feedback would correspond to conversions, that is users buying a product after seeing an advert. However, in practice, these conversions may not necessarily happen immediately after the advert is shown, and it may not always be possible to assign the credit of a sale to a particular showing of an advert. A similar challenge is encountered in many other applications, e.g., in personalized treatment planning, where the effect of a treatment on a patient’s health may be delayed, and it may be difficult to determine which out of several past treatments caused the change in the patient’s health; or, in content design applications, where the effects of multiple changes in the website design on website traffic and footfall may be delayed and difficult to distinguish. In this paper, we propose a new bandit model to handle online problems with such ‘delayed, aggregated and anonymous’ feedback. In our model, a player interacts with an environment of K actions (or arms) in a sequential fashion. At each time step the player selects an action which leads to a reward generated at random from the underlying reward distribution. At the same time, a nonnegative random integer-valued delay is also generated i.i.d. from an underlying delay distribution. Denoting this delay by τ ≥ 0 and the index of the current round by t, the reward generated in round t will arrive at the end of the (t + τ)th round. At the end of each round, the player observes only the sum of all the rewards that arrive in that round. Crucially, the player does not know which of the past plays have contributed to this aggregated reward.
We call this problem multi-armed bandits with delayed, aggregated anonymous feedback (MABDAAF). As in the standard MAB problem, in MABDAAF, the goal is to maximize the cumulative reward from T plays of the bandit, or equivalently to minimize the regret. The regret is the total difference between the reward of the optimal action and the actions taken. If the delays are all zero, the MABDAAF problem reduces to the standard (stochastic) MAB problem, which has been studied considerably (e.g., Thompson, 1933; Lai & Robbins, 1985; Auer et al., 2002; Bubeck & Cesa-Bianchi,
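The feedback model described above can be simulated with a short Python sketch. It only reproduces what the player would observe and is not the authors' algorithm. Bernoulli rewards and delays drawn uniformly from {0, ..., `max_delay`} are assumptions made here for illustration, as are the names `simulate_aggregated_feedback`, `arm_choices`, and `arm_means`.

```python
import numpy as np

def simulate_aggregated_feedback(arm_choices, arm_means, max_delay=5, seed=0):
    """Return the aggregated, anonymous reward observed at the end of each round.

    arm_choices[t] is the arm played in round t. The reward generated in
    round t arrives at the end of round t + tau, where tau is a random delay.
    """
    rng = np.random.default_rng(seed)
    horizon = len(arm_choices)
    # Extra slots hold rewards that arrive after the last play.
    observed = np.zeros(horizon + max_delay)
    for t, arm in enumerate(arm_choices):
        reward = float(rng.random() < arm_means[arm])   # assumed Bernoulli reward
        delay = int(rng.integers(0, max_delay + 1))      # assumed uniform delay
        # The player later sees only the sum in this slot, not its origin.
        observed[t + delay] += reward
    return observed
```

With `max_delay=0` every reward lands in the round that generated it, which matches the remark that zero delays recover the standard stochastic MAB problem.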

### Fast boosting using adversarial bandits

In this paper we apply multi-armed bandits (MABs) to improve the computational complexity of AdaBoost. AdaBoost constructs a strong classifier in a stepwise fashion by selecting simple base classifiers and using their weighted “vote” to determine the final classification. We model this stepwise base classifier selection as a sequential decision problem, and optimize it with MABs where each arm represents a subset of the base classifier set. The MAB gradually learns the “usefulness” of the subsets, and selects one of the subsets in each iteration. AdaBoost then searches only this subset instead of optimizing the base classifier over the whole space. The main improvement of this paper over a previous approach is that we use an adversarial bandit algorithm instead of stochastic bandits. This choice allows us to prove a weak-to-strong-learning theorem, which means that the proposed technique remains a boosting algorithm in a formal sense. We demonstrate on benchmark datasets that our technique can achieve a generalization performance similar to standard AdaBoost for a computational cost that is an order of magnitude smaller.

### SOFT FUZZY SETS IN MULTI OBSERVER, MULTI CRITERIA DECISION MAKING FOR ARMED FORCE RECRUITMENT

Applications in technology and social sciences involve data which may not be precise and deterministic in nature. Because such data are humanistic and subjective, they require a different mathematical representation. Some of the recent theories developed for handling problems with imprecise data are interval mathematics, fuzzy sets, rough sets, etc. However, it has been observed that these theories have inherent limitations in application: they lack a parameterization tool. Hence the paper introduces “soft set theory”, which has parameterization tools for dealing with various non-deterministic data that involve multiple agents for the purpose of evaluation. The evaluation is done on multiple criteria. The paper considers a hypothetical scenario of a recruitment process in the armed forces. It demonstrates the application of soft fuzzy sets in multi-criteria, multi-observer decision making. The selection decision also varies depending on the various deputations that the candidate can undergo. Each deputation has a graded priority of proficiency for identified parameters of evaluation.

### Fragmented Wars: Multi-Territorial Military Operations against Armed Groups

Objections to the unwilling or unable test can rest on a number of grounds. First, there is a question of the subjective element in determining whether Betazed was indeed unwilling or unable. However, this type of question could equally be asked in any determination of necessity for self-defense, even in the “old-fashioned” self-defense directly between States. Ultimately, it will be for the State acting in self-defense to be able to make a convincing case—whether at an early stage or later on before the Security Council or the ICJ—that there was a necessity for it to act in self-defense. In cases of armed groups operating from other States, the case will invariably include the unwillingness or inability of the territorial State to prevent the attacks. Moreover, rather than requiring a lengthy analysis of whether Betazed was deliberately unwilling or whether it simply did not have the capacity to act against the armed group, the deciding factor will be whether Betazed, given the chance, was taking effective action to stop the attacks by the armed group. If the armed attacks by the Veridian group against Angosia continue despite attempts to resolve the matter through Betazed, then whether this was a result of unwillingness or the inability of Betazed would not alter the necessity of Angosia to take action in self-defense.

### Normal Bandits of Unknown Means and Variances

The structure of this proof will be to bound the expected value of $T_\pi^i(n)$ for all sub-optimal bandits $i$, and use this to bound the regret $R_\pi(n)$. The basic techniques follow those in Katehakis and Robbins (1995) for the known variance case, modified accordingly here for the unknown variance case and assisted by the probability bound of Proposition 3. For any $i$ such that $\mu_i \neq \mu^*$, we define the following quantities: Let $1 > \varepsilon > 0$ and define $\tilde{\varepsilon} = \Delta_i \varepsilon / 2$.
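The reason a bound on the expected number of plays carries over to the regret is the standard regret decomposition, written out here for reference. It assumes the usual definition of the gap as $\Delta_i = \mu^* - \mu_i$, which this excerpt does not spell out:

$$
R_\pi(n) \;=\; \sum_{i \,:\, \mu_i \neq \mu^*} \Delta_i \, \mathbb{E}\!\left[ T_\pi^i(n) \right], \qquad \Delta_i = \mu^* - \mu_i .
$$

Under that assumption, any upper bound on $\mathbb{E}[T_\pi^i(n)]$ for each sub-optimal bandit $i$ gives an upper bound on $R_\pi(n)$ by summing over $i$.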

### Partially Observable Multi-Sensor Sequential Change Detection: A Combinatorial Multi-Armed Bandit Approach

lifetime. Furthermore, when the number of variables is large and their measurement streams are generated in high velocity, real-time analysis may be significantly hindered due to the constraints of system memory, storage space, transmission bandwidth, computational power and processing speed. Consequently, in many applications, even if all variables can be measured, only partial observations of them can be transmitted back to the data fusion center for real-time analytics. All the above constraints trigger the demand of new learning techniques to address the emerging challenge of Partially Observable Multi-sensor Sequential Change Detection (POMSCD), where only a subset of sensors can be observed at each epoch for change detection. Specifically, consider a system characterized by a set of $p$ variables (for example, in a manufacturing system, each variable corresponds to a fabrication characteristic). Signals of these variables at each sensing epoch $t$ are denoted as $X(t) = [X_1(t), \ldots, X_p(t)]$,

### Leveraging Observations in Bandits: Between Risks and Benefits

In this work, we study the observational learning problem in the context of bandits, the simplest setting for studying the explore-exploit trade-off faced by an agent in an unknown environment. We consider a learner (agent) that observes actions performed by a target policy in the same environment, but not their associated rewards. Note that the target actions can in fact be performed by several other agents. When collected from a good target, this data can potentially improve the behaviour of the learner, specifically, by speeding up the learning process. Consequently, we would like an agent equipped with the ability to leverage it whenever available. This should not be confused with cooperative bandits (Landgren, Srivastava, and Leonard 2016), where several agents share knowledge with each other regarding the actions and obtained rewards.

### Enhancing Evolutionary Conversion Rate Optimization via Multi-Armed Bandit Algorithms

Thompson Sampling. Besides UCB, Thompson Sampling (TS) (Thompson 1933) is another good alternative MAB algorithm for the classical stochastic MAB problem. The idea is to assume a simple prior distribution on the parameters of the reward distribution of every arm, and at each round, play an arm according to its posterior probability of being the best arm (Agrawal and Goyal 2012). The effectiveness of TS has been empirically demonstrated by several studies (Granmo 2010; Scott 2010; Chapelle and Li 2011), and the asymptotic optimality of TS has been theoretically proved for Bernoulli bandits (Kaufmann, Korda, and Munos 2012; Agrawal and Goyal 2012). TS for Bernoulli bandits uses beta distributions as priors, i.e., a family of continuous probability distributions on the interval [0, 1] parameterized by two positive shape parameters, denoted by α and β. The mean of Beta(α, β) is α/(α+β), and higher α, β lead to tighter concentration of Beta(α, β) around the mean. TS initially assumes each arm i to have prior reward distribution Beta(1, 1), which is equivalent to the uniform distribution on [0, 1]. At round t, after having observed S_i successes (re-