For the non-stochastic multi-armed bandit problem, Kujala and Elomaa (2005) and Poland (2005) both showed that using the exponential (actually double exponential/Laplace) distribution in an FTPL algorithm coupled with a standard unbiased estimation technique yields near-optimal O(√(NT log N)) regret. Unbiased estimation needs access to arm probabilities that are not explicitly available when using an FTPL algorithm. Neu and Bartók (2013) introduced the geometric resampling scheme to approximate these probabilities while still guaranteeing low regret. Recently, Abernethy et al. (2015) analyzed FTPL for adversarial multi-armed bandits and provided regret bounds under the condition that the hazard rate of the perturbation distribution is bounded. This condition allowed them to consider a variety of perturbation distributions beyond the exponential, such as Gamma, Gumbel, Fréchet, Pareto, and Weibull.
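As a concrete illustration, the FTPL selection rule and the geometric resampling idea can be sketched as follows (a minimal sketch in Python; the function names, the rate parameter `eta`, and the resampling cap are illustrative choices, not the papers' exact formulation):

```python
import random

def ftpl_choose(cum_est, eta, rng):
    # Perturb the cumulative estimated rewards with i.i.d. exponential
    # noise and play the arm with the largest perturbed score.
    scores = [g + rng.expovariate(eta) for g in cum_est]
    return max(range(len(scores)), key=lambda a: scores[a])

def geometric_resampling(cum_est, eta, chosen, rng, cap=1000):
    # Estimate 1 / p_t(chosen) by redrawing perturbations until the same
    # arm would be chosen again; the count M of redraws is a (truncated)
    # unbiased estimate of the inverse selection probability, which can
    # then be used in an importance-weighted reward estimate.
    for m in range(1, cap + 1):
        if ftpl_choose(cum_est, eta, rng) == chosen:
            return m
    return cap
```

A reward estimate for the chosen arm would then be `reward * M`, sidestepping the need to compute the arm probabilities in closed form.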


With the rise in internet usage, there has been a corresponding rise in online retail and online advertising, evidenced by the growth of companies such as Amazon and Google. There are many optimization problems in both of these domains. For example, as a retailer, which items should we recommend to a given customer? Or, as an advertiser, how much should we bid in order to place one of our ads? Or, as an ad server, whose ads should we select to display? In general, different users will also have different preferences, which are not known in advance and hence may need to be learned over time. Furthermore, these problems often involve many players, each with their own objectives and possible actions, which suggests game-theoretic formulations. Finally, for online retailers and ad servers, mechanisms can be designed to optimize for a metric of their choosing (e.g., user satisfaction, advertiser profit, or social welfare). We will attempt to address some of these problems using the theory of multi-armed bandits.


The sequential nature of the problems, coupled with imperfect system knowledge, means that decisions cannot be evaluated in isolation. Effective decision-making needs to account for possible future actions and associated outcomes. While standard solution methods such as stochastic dynamic programming can in principle be used, in practice they are computationally impractical and heuristic approaches are generally required. One such approach is the knowledge gradient (KG) heuristic. Gupta and Miescke [8] originated KG for application to offline ranking and selection problems. After a period in which it appears to have been studied little, Frazier et al. [5] expanded on KG's theoretical properties. It was adapted for use in online decision-making by Ryzhov et al. [14], who tested it on multi-armed bandits (MABs) with Gaussian rewards. They found that it performed well against an index policy which utilised an analytical approximation to the Gittins index; see Gittins et al. [7]. Ryzhov et al. [12] have investigated the use of KG to solve MABs with exponentially distributed rewards, while Powell and Ryzhov [10] give versions for Bernoulli, Poisson and uniform rewards, though without testing performance. They propose the method as an approach to online learning problems quite generally, with particular emphasis on its ability to handle correlated arms. Initial empirical results were promising but only encompassed a limited range of models. This paper utilises an important sub-class of MABs to explore properties of the KG heuristic for online use. Our investigation reveals weaknesses in the KG approach, and we propose modifications to mitigate these weaknesses.
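For Gaussian rewards with independent normal beliefs, the one-step KG value has a well-known closed form, ν_x = σ̃_x f(ζ_x) with f(z) = zΦ(z) + φ(z). The sketch below (plain Python; function and argument names are illustrative, and this follows the standard textbook formulation rather than any specific paper's implementation) computes it for each arm:

```python
import math

def _phi(z):
    # Standard normal pdf.
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def _Phi(z):
    # Standard normal cdf via the error function.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def kg_values(mu, sigma2, noise2):
    # One-step knowledge-gradient value per arm under independent Gaussian
    # beliefs N(mu[i], sigma2[i]) and known measurement variance noise2.
    vals = []
    for i in range(len(mu)):
        # Std of the predictive change in the posterior mean from one sample.
        s_tilde = math.sqrt(sigma2[i] - 1 / (1 / sigma2[i] + 1 / noise2))
        best_other = max(mu[j] for j in range(len(mu)) if j != i)
        zeta = -abs(mu[i] - best_other) / s_tilde
        vals.append(s_tilde * (zeta * _Phi(zeta) + _phi(zeta)))
    return vals
```

For online use, one common rule plays the arm maximizing `mu[i] + horizon * kg_values(...)[i]`, trading off immediate reward against the value of information.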


Comparing NLP systems to select the best one for a task of interest, such as named entity recognition, is critical for practitioners and researchers. A rigorous approach involves setting up a hypothesis testing scenario using the performance of the systems on query documents. However, the hypothesis testing approach often needs to send a large number of document queries to the systems, which can be problematic. In this paper, we present an effective alternative based on the multi-armed bandit (MAB). We propose a hierarchical generative model to represent the uncertainty in the performance measures of the competing systems, to be used by Thompson Sampling to solve the resulting MAB. Experimental results on both synthetic and real data show that our approach requires significantly fewer queries compared to the standard benchmarking technique to identify the best system according to F-measure.


A literature review was carried out, revealing existing approaches for pruning, their strengths, and weaknesses. A key issue emerging from this review is that there is a trade-off between removing a weight or neuron and the potential reduction in accuracy. Thus, this study develops new algorithms for pruning that utilize a framework, known as a multi-armed bandit, which has been successfully applied in applications where there is a need to learn which option to select given the outcome of trials. There are several different multi-armed bandit methods, and these have been used to develop new algorithms, including those based on the following types of multi-armed bandits: (i) Epsilon-Greedy, (ii) Upper Confidence Bounds (UCB), (iii) Thompson Sampling, and (iv) the Exponential Weight Algorithm for Exploration and Exploitation (EXP3).
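The first of the four listed methods, Epsilon-Greedy, is the simplest to state; a minimal sketch (Python, illustrative function names):

```python
import random

def epsilon_greedy(means, eps, rng):
    # With probability eps explore a uniformly random arm; otherwise
    # exploit the arm with the highest empirical mean reward.
    if rng.random() < eps:
        return rng.randrange(len(means))
    return max(range(len(means)), key=lambda a: means[a])

def update(counts, means, arm, reward):
    # Incremental update of the pulled arm's running mean reward.
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]
```

UCB, Thompson Sampling, and EXP3 replace the random exploration branch with, respectively, a confidence bonus, posterior sampling, and an adversarially robust weighting scheme.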


Each arm a_j ∈ A is sampled once per iteration k, and this contributes to the high sampling budget of the algorithm in (2) (as well as the original in [10]). This high sampling budget comes about because all sub-optimal actions a_j are taken at each iteration k before a high level of confidence to infer the best action a* is achieved at some iteration k ≪ K. To mitigate this problem, we employ the structure of SAMW in the SOFTMIX algorithm of [7]. In SOFTMIX, k-sample means μ_k^i are used at the k-th step of the iteration (by a suitable construction), even though only a ‘winner’ action (call it â_k) is actually taken at each iteration k. Such behaviour is typical of Stochastic Multi-Armed Bandit algorithms.


The linear bandits model (Rusmevichientong and Tsitsiklis 2010; Dani et al. 2008), the simplest among such models, assumes that the reward of choosing an arm is linearly dependent on its features. In linear bandits, the expected reward of an arm is calculated as the inner product of its feature vector and a parameter vector θ. However, real-world data often exhibit more complicated relationships than a linear one. Therefore, we choose k-nearest neighbor (k-NN) regression to estimate the expected reward of arms. To introduce exploration into the solution, we extend the k-armed KNN-UCB algorithm of Guan and Jiang (2018) to the structured setting. As explained in Multi-armed bandits, upper confidence bound (UCB) algorithms (Auer 2002) incorporate an exploration term by calculating a confidence bound for each arm and choosing the action corresponding to the largest confidence bound.
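The combination of a k-NN reward estimate with a UCB-style bonus can be sketched roughly as follows (a simplified illustration in Python, not Guan and Jiang's exact KNN-UCB: the Euclidean distance, the form of the bonus, and the cold-start rule are all assumptions made for the sketch):

```python
import math

def knn_ucb_choose(history, x, k, c, t):
    # history: per-arm list of (context, reward) pairs; x: current context.
    # Score each arm by a k-NN estimate of its reward at x plus a
    # UCB-style exploration bonus that shrinks with the sample count.
    best_arm, best_score = None, -math.inf
    for arm, data in enumerate(history):
        if len(data) < k:
            return arm  # play each arm until it has at least k samples
        # k nearest observed contexts for this arm (squared Euclidean distance).
        neigh = sorted(
            data,
            key=lambda cr: sum((a - b) ** 2 for a, b in zip(cr[0], x)),
        )[:k]
        mean = sum(r for _, r in neigh) / k
        bonus = c * math.sqrt(math.log(t + 1) / len(data))
        if mean + bonus > best_score:
            best_arm, best_score = arm, mean + bonus
    return best_arm
```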


In addition to including more algorithms and considering different variances and arm numbers, our study could be improved by considering settings where reward variances are not identical across arms. Certain algorithms, such as UCB1-Tuned, are specifically designed to take into account the variance of the arms, and may therefore have an advantage in such settings. In the second half of the paper, we turned our attention to an important application of the bandit problem: clinical trials. Although clinical trials have motivated theoretical research on multi-armed bandits since Robbins' original paper, bandit algorithms have never been evaluated as treatment allocation strategies in a clinical trial.
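For reference, UCB1-Tuned replaces UCB1's fixed exploration constant with an empirical variance estimate (Auer et al. 2002); a sketch of the index computation (Python, illustrative argument names):

```python
import math

def ucb1_tuned_index(mean, sq_mean, n, t):
    # mean / sq_mean: running averages of rewards and squared rewards
    # for the arm; n: pulls of this arm; t: total pulls so far.
    var = sq_mean - mean ** 2
    # V caps the exploration term by the variance estimate plus a
    # confidence correction; 1/4 is the max variance of a [0,1] reward.
    V = var + math.sqrt(2 * math.log(t) / n)
    return mean + math.sqrt((math.log(t) / n) * min(0.25, V))
```

The arm with the largest index is pulled; low-variance arms thus receive a smaller bonus than under plain UCB1.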


Since our work on the CMAB model (Chen et al., 2013), several studies have addressed combinatorial multi-armed bandits or, more generally, combinatorial online learning. Qin et al. (2014) extend CMAB to contextual bandits and apply it to diversified online recommendations. Lin et al. (2014) address combinatorial actions with limited feedback. Gopalan et al. (2014) use the Thompson sampling method to tackle combinatorial online learning problems. Compared with our CMAB framework, they allow more feedback models than our semi-bandit feedback model, but they require a finite number of actions and observations, their regret contains a large constant term, and it is unclear if their framework supports approximation oracles for hard combinatorial optimization problems. Kveton et al. (2014) study linear matroid bandits, which is a subclass of the linear combinatorial bandits we discussed in Section 4.2, and they provide better regret bounds than our general bounds given in Section 4.2, because their analysis utilizes the matroid combinatorial structure. In a more recent paper, Kveton et al. (2015) improve the regret bounds of the linear combinatorial bandits via a more sophisticated non-uniform sufficient sampling condition than the one we used in our paper. However, it is unclear if this technique can be applied to non-linear reward functions satisfying the bounded smoothness condition (see discussions in Section 4.2 for more details).


The balancing technique is well-known in machine learning, especially in domain adaptation and studies in learning-theoretic frameworks (Huang et al. 2007), (Zadrozny 2004), (Cortes, Mansour, and Mohri 2010). A number of recent works approach contextual bandits through the framework of causality (Bareinboim, Forney, and Pearl 2015), (Bareinboim and Pearl 2015), (Forney, Pearl, and Bareinboim 2017), (Lattimore, Lattimore, and Reid 2016). There is also a significant body of research that leverages balancing for offline evaluation and learning of contextual bandit or reinforcement learning policies from logged data (Strehl et al. 2010), (Dudík, Langford, and Li 2011), (Li et al. 2012), (Dudík et al. 2014), (Li et al. 2014), (Swaminathan and Joachims 2015), (Jiang and Li 2016), (Thomas and Brunskill 2016), (Athey and Wager 2017), (Kallus 2017), (Wang, Agarwal, and Dudík 2017), (Deshpande et al. 2017), (Kallus and Zhou 2018), (Zhou, Athey, and Wager 2018). In the offline setting, the complexity of the historical assignment policy is taken as given, and thus so is the difficulty of the offline evaluation and learning of optimal policies. Therefore, these results lie at the opposite end of the spectrum from our work, which focuses on the online setting. Methods for reducing the bias due to adaptive data collection have also been studied for non-contextual multi-armed bandits (Villar, Bowden, and Wason 2015), (Nie et al. 2018), but the nature of the estimation in contextual bandits is qualitatively different. Importance weighted regression in contextual bandits was first mentioned in (Agarwal et al. 2014), but without a systematic motivation, analysis and evaluation.
To our knowledge, our paper is the first work to integrate balancing in the online contextual bandit setting, to perform a large-scale evaluation of it against direct estimation method baselines with theoretical guarantees, and to provide a theoretical characterization of balanced contextual bandits that matches the regret bound of their direct method counterparts. The effect of importance weighted regression is also evaluated in (Bietti, Agarwal, and Langford 2018), but this is a successor to the extended version of our paper (Dimakopoulou et al. 2017).

of the K possible arms. In the classic stochastic MAB setting, the player immediately observes stochastic feedback from the pulled arm in the form of a ‘reward’ which can be used to improve the decisions in subsequent rounds. One of the main application areas of MABs is in online advertising. Here, the arms correspond to adverts, and the feedback would correspond to conversions, that is, users buying a product after seeing an advert. However, in practice, these conversions may not necessarily happen immediately after the advert is shown, and it may not always be possible to assign the credit of a sale to a particular showing of an advert. A similar challenge is encountered in many other applications, e.g., in personalized treatment planning, where the effect of a treatment on a patient's health may be delayed, and it may be difficult to determine which of several past treatments caused the change in the patient's health; or in content design applications, where the effects of multiple changes in the website design on website traffic and footfall may be delayed and difficult to distinguish. In this paper, we propose a new bandit model to handle online problems with such ‘delayed, aggregated and anonymous’ feedback. In our model, a player interacts with an environment of K actions (or arms) in a sequential fashion. At each time step the player selects an action, which leads to a reward generated at random from the underlying reward distribution. At the same time, a nonnegative random integer-valued delay is also generated i.i.d. from an underlying delay distribution. Denoting this delay by τ ≥ 0 and the index of the current round by t, the reward generated in round t will arrive at the end of the (t + τ)th round. At the end of each round, the player observes only the sum of all the rewards that arrive in that round. Crucially, the player does not know which of the past plays have contributed to this aggregated reward.
We call this problem multi-armed bandits with delayed, aggregated anonymous feedback (MABDAAF). As in the standard MAB problem, in MABDAAF the goal is to maximize the cumulative reward from T plays of the bandit, or equivalently to minimize the regret. The regret is the total difference between the reward of the optimal action and the actions taken. If the delays are all zero, the MABDAAF problem reduces to the standard (stochastic) MAB problem, which has been studied considerably (e.g., Thompson, 1933; Lai & Robbins, 1985; Auer et al., 2002; Bubeck & Cesa-Bianchi,
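The feedback model described above can be simulated directly; a minimal sketch (Python, with an illustrative class name, and with Bernoulli rewards and uniformly bounded delays as simplifying assumptions):

```python
import random

class DelayedAggregatedEnv:
    # Each pull generates a Bernoulli reward and an i.i.d. integer delay
    # tau; the reward from round t arrives at the end of round t + tau.
    # The player observes only the SUM of all rewards arriving in a
    # round (aggregated, anonymous feedback).
    def __init__(self, means, max_delay, seed=0):
        self.means, self.max_delay = means, max_delay
        self.rng = random.Random(seed)
        self.pending = {}  # arrival round -> accumulated reward

    def pull(self, arm, t):
        reward = 1.0 if self.rng.random() < self.means[arm] else 0.0
        tau = self.rng.randrange(self.max_delay + 1)
        self.pending[t + tau] = self.pending.get(t + tau, 0.0) + reward
        # Observed aggregate at the end of round t (no arm attribution).
        return self.pending.pop(t, 0.0)
```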


In this paper we apply multi-armed bandits (MABs) to improve the computational complexity of AdaBoost. AdaBoost constructs a strong classifier in a stepwise fashion by selecting simple base classifiers and using their weighted “vote” to determine the final classification. We model this stepwise base classifier selection as a sequential decision problem, and optimize it with MABs where each arm represents a subset of the base classifier set. The MAB gradually learns the “usefulness” of the subsets, and selects one of the subsets in each iteration. AdaBoost then searches only this subset instead of optimizing the base classifier over the whole space. The main improvement of this paper over a previous approach is that we use an adversarial bandit algorithm instead of stochastic bandits. This choice allows us to prove a weak-to-strong-learning theorem, which means that the proposed technique remains a boosting algorithm in a formal sense. We demonstrate on benchmark datasets that our technique can achieve a generalization performance similar to standard AdaBoost for a computational cost that is an order of magnitude smaller.
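The canonical adversarial bandit algorithm of this kind is EXP3 (Auer et al.); a minimal sketch of one round (Python; the paper's actual construction of arms as subsets of base classifiers is not reproduced here):

```python
import math
import random

def exp3_step(weights, gamma, rng):
    # Mix the weight-proportional distribution with uniform exploration,
    # sample an arm, and return (arm, its selection probability).
    K = len(weights)
    total = sum(weights)
    probs = [(1 - gamma) * w / total + gamma / K for w in weights]
    r, acc = rng.random(), 0.0
    for a, p in enumerate(probs):
        acc += p
        if r < acc:
            return a, p
    return K - 1, probs[-1]

def exp3_update(weights, arm, prob, reward, gamma):
    # Importance-weighted reward estimate keeps the update unbiased even
    # though only the pulled arm's reward is observed.
    K = len(weights)
    weights[arm] *= math.exp(gamma * (reward / prob) / K)
```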

Applications in technology and the social sciences involve data which may not be precise and deterministic in nature. Because such data are humanistic and subjective, they require a different form of mathematical representation. Some of the recent theories developed for handling problems with imprecise data are interval mathematics, fuzzy sets, rough sets, etc. However, it has been observed that there exist some inherent limitations to their applications: they lack a parameterization tool. Hence this paper introduces a “soft set theory” having parameterization tools for dealing with various non-deterministic data that involve multiple agents for the purpose of evaluation. The evaluation is done on multiple criteria. The paper considers a hypothetical scenario of a recruitment process in the armed forces. It demonstrates the application of soft fuzzy sets in multi-criteria, multi-observer decision making. The selection decision also varies depending on the various deputations that the candidate can undergo. Each deputation has a graded priority of proficiency for identified parameters of evaluation.

Objections to the unwilling or unable test can rest on a number of grounds. First, there is a question of the subjective element in determining whether Betazed was indeed unwilling or unable. However, this type of question could equally be asked in any determination of necessity for self-defense, even in the “old-fashioned” self-defense directly between States. Ultimately, it will be for the State acting in self-defense to be able to make a convincing case—whether at an early stage or later on before the Security Council or the ICJ—that there was a necessity for it to act in self-defense. In cases of armed groups operating from other States, the case will invariably include the unwillingness or inability of the territorial State to prevent the attacks. Moreover, rather than requiring a lengthy analysis of whether Betazed was deliberately unwilling or whether it simply did not have the capacity to act against the armed group, the deciding factor will be whether Betazed, given the chance, was taking effective action to stop the attacks by the armed group. If the armed attacks by the Veridian group against Angosia continue despite attempts to resolve the matter through Betazed, then whether this was a result of unwillingness or the inability of Betazed would not alter the necessity of Angosia to take action in self-defense.


The structure of this proof will be to bound the expected value of T_i^π(n) for all sub-optimal bandits i, and use this to bound the regret R^π(n). The basic techniques follow those in Katehakis and Robbins (1995) for the known variance case, modified accordingly here for the unknown variance case and assisted by the probability bound of Proposition 3. For any i such that μ_i ≠ μ*, we define the following quantities: let 0 < ε < 1 and define ε̃ = Δ_i ε / 2.
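The step from bounding the expected pull counts to bounding the regret is the standard decomposition (stated here in the usual notation, with Δ_i = μ* − μ_i denoting the sub-optimality gap):

```latex
R^{\pi}(n)
  \;=\; n\,\mu^{*} \;-\; \mathbb{E}\Big[\textstyle\sum_{t=1}^{n} X_{\pi(t)}\Big]
  \;=\; \sum_{i \,:\, \mu_i \neq \mu^{*}} \Delta_i \,\mathbb{E}\big[T_i^{\pi}(n)\big]
```

so any upper bound on each E[T_i^π(n)] immediately yields an upper bound on R^π(n).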


lifetime. Furthermore, when the number of variables is large and their measurement streams are generated at high velocity, real-time analysis may be significantly hindered by the constraints of system memory, storage space, transmission bandwidth, computational power and processing speed. Consequently, in many applications, even if all variables can be measured, only partial observations of them can be transmitted back to the data fusion center for real-time analytics. All the above constraints trigger the demand for new learning techniques to address the emerging challenge of Partially Observable Multi-sensor Sequential Change Detection (POMSCD), where only a subset of sensors can be observed at each epoch for change detection. Specifically, consider a system characterized by a set of p variables (for example, in a manufacturing system, each variable corresponds to a fabrication characteristic). Signals of these variables at each sensing epoch t are denoted as X(t) = [X₁(t), …, X_p(t)],

In this work, we study the observational learning problem in the context of bandits, the simplest setting for studying the explore-exploit trade-off faced by an agent in an unknown environment. We consider a learner (agent) that observes actions performed by a target policy in the same environment, but not their associated rewards. Note that the target actions can in fact be performed by several other agents. When collected from a good target, this data can potentially improve the behaviour of the learner, specifically by speeding up the learning process. Consequently, we would like an agent equipped with the ability to leverage it whenever available. This should not be confused with cooperative bandits (Landgren, Srivastava, and Leonard 2016), where several agents share knowledge with each other regarding the actions and obtained rewards.

Thompson Sampling. Besides UCB, Thompson Sampling (TS) (Thompson 1933) is another good alternative MAB algorithm for the classical stochastic MAB problem. The idea is to assume a simple prior distribution on the parameters of the reward distribution of every arm, and at each round, play an arm according to its posterior probability of being the best arm (Agrawal and Goyal 2012). The effectiveness of TS has been empirically demonstrated by several studies (Granmo 2010; Scott 2010; Chapelle and Li 2011), and the asymptotic optimality of TS has been theoretically proved for Bernoulli bandits (Kaufmann, Korda, and Munos 2012; Agrawal and Goyal 2012). TS for Bernoulli bandits utilizes the beta distribution as prior, i.e., a family of continuous probability distributions on the interval [0, 1] parameterized by two positive shape parameters, denoted by α and β. The mean of Beta(α, β) is α/(α + β), and higher α, β lead to tighter concentration of Beta(α, β) around the mean. TS initially assumes each arm i to have prior reward distribution Beta(1, 1), which is equivalent to the uniform distribution on [0, 1]. At round t, after having observed S_i successes (re-
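The Beta-Bernoulli TS scheme described above fits in a few lines (Python; function names are illustrative):

```python
import random

def thompson_choose(successes, failures, rng):
    # Draw one sample from each arm's Beta(1 + S_i, 1 + F_i) posterior
    # and play the arm with the largest sampled mean.
    samples = [
        rng.betavariate(1 + s, 1 + f)
        for s, f in zip(successes, failures)
    ]
    return max(range(len(samples)), key=lambda a: samples[a])

def thompson_update(successes, failures, arm, reward):
    # Bernoulli reward: increment the success or failure count,
    # which is exactly the conjugate Beta posterior update.
    if reward:
        successes[arm] += 1
    else:
        failures[arm] += 1
```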

If there is not enough exploration, potentially too few different bandits are chosen, which can result in an expensive attribute being chosen too early. An attribute with a high test cost can be chosen as a root attribute, thus returning a higher cost to classify than anticipated, simply because the dataset space has not been explored enough. To make sure this scenario does not arise, there must be a sufficient number of lever pulls carried out. If the number of lever pulls is high enough, the algorithm has the opportunity to explore many more potentially good attributes. A way must be found to determine what a ‘sufficient number’ is. To determine what value is best for the look-ahead parameter k, a dataset was processed with increasing values of k. On examining the results, it was found that there was no improvement at each increase of depth. Further experimentation would be required, as this result may be unique to this dataset. This parameter will therefore be set to the value 1 for all datasets, and set to 2 for a selection of datasets, to determine whether a lower depth improves results or not and whether the effect is dataset related.
