Recent advances in bandit tools and techniques for sequential learning are steadily enabling new applications and promise to resolve a range of challenging related problems. We study the game tree search problem, where the goal is to quickly identify the optimal move in a given game tree by sequentially sampling its stochastic payoffs. We develop new algorithms for trees of arbitrary depth that operate by summarizing all deeper levels of the tree into confidence intervals at depth one and applying a best-arm identification procedure at the root. We prove new sample complexity guarantees with a refined dependence on the problem instance. We show experimentally that our algorithms outperform existing elimination-based algorithms and match previous special-purpose methods for depth-two trees.
The search tree contains only the root node when an experiment starts, so the root node is necessarily selected as the leaf node. The root node stores the unprocessed state of the problem, i.e., the complete initial input. The selection of a leaf node depends on the selection strategy described in the steps below. To improve the quality of solutions, knowledge-based heuristics are usually applied during simulations. For SameGame, a game whose rules resemble those of the problem studied in this paper, Schadd proposed two static simulation strategies, Tabu Random and Tabu Color Random, which are described as solutions 2 and 3 in the numerical experiments section. All nodes selected by a play-out and the score obtained are recorded for each simulation, so the best score and the corresponding moves (the sequence of block removals) are updated whenever a simulation exceeds the previous best score; these are finally reported as the solution.
an opening book by evaluating the strength of game states after an opening sequence, measuring the win/loss ratio from this state in self-play. Assuming that our opponents also use a variant of MCTS, we expect the measured win rates to be a good representative of our actual chance of winning. When we evaluate states that are deeper in the game tree, we also gain more information about the strength of earlier moves. However, the number of opening sequences increases exponentially in the length of the opening sequence. Furthermore, it is computationally expensive to evaluate the strength of a game state. In our opening analysis, we use 5 seconds per player for the self-play games, for a total of 10 seconds per game. We perform 256 self-play games per state, which means that evaluating a single opening sequence takes 42.7 minutes of single-core computation time. Still, the variation in the measurements is high, as the error decreases only with the square root of the number of self-play games. Because the board is symmetrical, we only have to compute opening moves for one symmetry, so there are only 16 opening moves to consider. It takes 11.4 hours of single-core computation time to evaluate these 16 opening moves. If we want to look deeper, we have to consider all 105 responses the opponent can make. For this level-2 analysis, there are 1680 cases to consider, for a total of 49.8 days of single-core computation time. A level-3 analysis would require analyzing 174,720 states, for a total of 14.2 years of single-core computation time. This shows that computing a deep opening book is infeasible. An optimization is to only investigate the most promising moves. For instance, we could stop analyzing a branch when our win rate in self-play is higher than some threshold. Alternatively, a method such as UCT can be used to perform fewer playouts for game states that are less promising, and instead focus on the more promising ones.
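The cost arithmetic above can be checked in a few lines; the constants (10 seconds of single-core time per self-play game, 256 self-play games per evaluated state) are those stated in the text.

```python
# Back-of-the-envelope cost of the opening-book analysis described above.
SECONDS_PER_GAME = 10      # 5 s per player, two players
GAMES_PER_STATE = 256      # self-play games per evaluated state
seconds_per_state = SECONDS_PER_GAME * GAMES_PER_STATE  # 2560 s per state

def cost_seconds(states):
    """Single-core evaluation time in seconds for a given number of states."""
    return states * seconds_per_state

minutes_per_state = cost_seconds(1) / 60              # ~42.7 minutes
hours_level1 = cost_seconds(16) / 3600                # ~11.4 hours (16 moves)
days_level2 = cost_seconds(1680) / 86400              # ~49.8 days
years_level3 = cost_seconds(174_720) / (86400 * 365)  # ~14.2 years
```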
An additional optimization is to prune states that cannot become better than the best state found so far with more samples, e.g., states that lie outside the 95% error bounds of the best move. In our analysis we did not apply these optimizations systematically; instead, we hand-picked the best moves for further exploration.
The decomposition of a game into several sub-games produces sub-states in which the available moves depend on the combination of sub-states, which makes the computation of legal moves difficult. The decomposition also complicates the identification of terminal sub-states. For example, the game Incredible decomposes into a labyrinth (Maze), a game of cubes (Blocks), a stepper, and a set of useless contemplation actions. The game is over if the stepper reaches 20 or if the Maze sub-game is over, but in Blocks a sub-state is never terminal by itself. We can also imagine a game in which two sub-states are both non-terminal but their conjunction is terminal and must be avoided in a global plan. The decomposition further raises an issue for evaluating sub-states whose scores can result from a timely combination with other sub-states. More specifically, in GGP the score described by the goal predicate is reliable only if the current state is terminal. These two facts make the score function less reliable in sub-trees. Finally, the decomposition raises the problem of its own reliability: if the decomposition is inconsistent, the evaluation of legal moves can be wrong, leading the player program to choose illegal moves and compute inconsistent sub-states. To avoid all these problems, we propose the following approach: run simulations in the global game while building a sub-tree for each sub-game. Legal moves, next states, and the end of the game can be evaluated in the global game, in which the real score is known. Move selection is performed according to the evaluation of the sub-states in the sub-trees. An inconsistency of the decomposition is detected if, during two simulations, the same move from the same sub-state leads to different successor sub-states. A partial but consistent decomposition
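The inconsistency test described above (the same move from the same sub-state leading to two different successor sub-states across simulations) can be sketched as a simple transition table; the class and state names below are illustrative, not from the paper.

```python
# Sketch of the decomposition-consistency check: record every observed
# (sub-state, move) -> successor transition, and flag a contradiction if
# the same pair is later seen with a different successor.

class DecompositionChecker:
    def __init__(self):
        # maps (sub_state, move) -> the successor sub-state seen so far
        self._seen = {}

    def observe(self, sub_state, move, next_sub_state):
        """Record a transition; return False if it contradicts a
        previously observed transition (decomposition inconsistent),
        True otherwise."""
        key = (sub_state, move)
        if key in self._seen and self._seen[key] != next_sub_state:
            return False
        self._seen[key] = next_sub_state
        return True

checker = DecompositionChecker()
assert checker.observe("maze:room1", "up", "maze:room2")
assert checker.observe("maze:room1", "up", "maze:room2")      # consistent
assert not checker.observe("maze:room1", "up", "maze:room3")  # inconsistent
```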
playing strategies against players based on Monte Carlo Tree Search (MCTS) and Information Set Monte Carlo Tree Search (ISMCTS). The first rule-based player implements the basic greedy strategy taught to beginner players; the second implements Chitarella’s rules with the additional rules introduced by Saracino, and represents the most fundamental and advanced strategy for the game; the third rule-based player extends the previous approach with further additional rules. MCTS requires full knowledge of the game state (that is, of the cards of all the players) and thus, by implementing a cheating player, provides an upper bound on the performance achievable with this class of methods. ISMCTS can deal with incomplete information and thus implements a fair player. For both approaches, we evaluated different reward functions and simulation strategies. We performed a set of experiments to select the best rule-based player and the best configuration for MCTS and ISMCTS. Then, we performed a tournament among the three selected players and also an experiment involving humans. Our results show that the cheating MCTS player outperforms all the other strategies, while the fair ISMCTS player outperforms all the rule-based players, which implement the best known and most studied advanced strategies for Scopone. The experiment involving human players suggests that ISMCTS might be more challenging to play against than traditional strategies.
In this section we briefly describe the design of a minimal user interface, built to enable testing the program against a human player, and the observations obtained from testing the agents. Figure 3 shows the representation of the game board after the console prompt for user input. Twenty games have been played against the developed agents, allowing 4000 simulations per MCTS move, with the author winning 17 of 20 games. In most cases, MCTS-based agents obtained more points than the strategy-playing agents. In fact, we have observed that the MCTS-UCT agent mostly performed second best. Note that performance was estimated not only by the number of points obtained, but also by the number of pieces built on the board and Development Cards bought, which give a player an advantage in the following rounds. Our observations of the agents' performance are consistent with the regret and average win rate measurements presented in Section 5.5. As expected, an experienced human player is still superior to the agents. Not only does an experienced human plan moves more strategically, but such a player also maintains an overview of the current game state, including the resources and Development Cards of opponents.
also been made in multi-player poker and Skat, which shows promise towards challenging the best human players. Determinization, where all hidden and random information is assumed known by all players, allows recent advances in MCTS to be applied to games with incomplete information and randomness. The determinization approach is not perfect: as discussed by Frank and Basin, it does not handle situations where different (indistinguishable) game states suggest different optimal moves, nor situations where the opponent's influence makes certain game states more likely to occur than others. In spite of these problems, determinization has been applied successfully to several games. An MCTS-based AI agent using determinization has been developed that plays Klondike Solitaire, arguably one of the most popular computer games in the world. For the variant of the game considered, the performance of MCTS exceeds human performance by a substantial margin. A determinized Monte Carlo approach to Bridge, which uses Monte Carlo simulations with a tree of depth one, has also yielded strong play. The combination of MCTS and determinization is discussed in more detail in Section V.
Evaluation is more difficult in the in-tree phase, because pass moves are always generated here to avoid losing sekis in zugzwang situations. A terminal position after two passes in the in-tree phase often contains dead blocks, but the search has no information about the status of blocks. Therefore, the score is determined using Tromp-Taylor rules: every block is considered to be alive. Together with the additional requirement that both passes are played in the search, this still generates the best move if Chinese rules are used, in which dead blocks may remain on the board, because the Tromp-Taylor score of a territory is a lower bound on its Chinese score. The player to move only generates a pass move if the game is a win in case the opponent terminates the game by also playing a pass, with the resulting "final" position evaluated under Tromp-Taylor rules.

B. The Player
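As a rough sketch, Tromp-Taylor scoring with every block considered alive amounts to counting stones plus empty regions that border only one colour, which a flood fill computes directly. The board representation below is an illustrative assumption, not the engine's actual code.

```python
# Minimal Tromp-Taylor scorer for a terminal position, treating every
# block as alive. The board is a list of strings with 'b' (black),
# 'w' (white), and '.' (empty) cells.

def tromp_taylor(board):
    rows, cols = len(board), len(board[0])
    score = {'b': 0, 'w': 0}
    seen = set()
    for r in range(rows):
        for c in range(cols):
            cell = board[r][c]
            if cell in score:
                score[cell] += 1  # every stone counts: all blocks alive
            elif (r, c) not in seen:
                # flood-fill the empty region; it scores for a colour
                # only if it borders stones of exactly that one colour
                region, borders, stack = [], set(), [(r, c)]
                while stack:
                    y, x = stack.pop()
                    if (y, x) in seen:
                        continue
                    seen.add((y, x))
                    region.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < rows and 0 <= nx < cols:
                            if board[ny][nx] == '.':
                                stack.append((ny, nx))
                            else:
                                borders.add(board[ny][nx])
                if len(borders) == 1:
                    score[borders.pop()] += len(region)
    return score['b'], score['w']

# An empty point touching both colours scores for neither player:
assert tromp_taylor(["b.",
                     "bw"]) == (2, 1)
```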
The changes introduced to the basic MCTS algorithm by Sarsa-UCT(λ) are generic and can, to the best of our knowledge, be combined with any known enhancement. This is because the introduced update method is principled: the new parameters closely follow the reinforcement learning methodology – they can formally describe concepts such as forgetting, first-visit updating, discounting, initial bias, and others (Section 4.5). As such, Sarsa-UCT(λ) is not better or worse than any other enhancement; rather, it is meant to complement other approaches, especially ones that embed some knowledge into the tree search, which can be done in Sarsa-UCT(λ) as easily as in UCT or any other basic MCTS algorithm. Therefore, it can likely benefit from enhancements that influence the tree, expansion, and playout phases, from generalization techniques (such as transpositions or MAST, as shown earlier), and from position-evaluation functions (which are also well known to the RL community) and other integrations of domain-specific heuristics or expert knowledge. It could benefit even from other MCTS backpropagation enhancements, although in that case it might be reasonable to keep separate estimates for each backup method, to retain the convergence guarantees of the RL-based backups. As noted previously, the relationship between MCTS and RL is a two-way street: a number of MCTS enhancements are in practice heuristics that appeal to certain characteristics of certain games, but there is no reason why they could not (at least in principle) be transferred back to RL.
Abstract— We address the course timetabling problem in this work. In a university, students can select their favorite courses each semester. Thus, the general requirement is to allow them to attend lectures without clashes with other lectures. A feasible solution is one in which this and other conditions are satisfied. Constructing reasonable solutions for the course timetabling problem is a hard task, and most existing methods fail to generate reasonable solutions for all cases. This is because the problem is heavily constrained and an effective method is required to explore and exploit the search space. We utilize Monte Carlo Tree Search (MCTS) for the first time to find feasible solutions. In MCTS, we build a tree incrementally in an asymmetric manner by sampling the decision space, and traverse it in a best-first manner. We propose several enhancements to MCTS, such as simulation and tree pruning based on a heuristic. The performance of MCTS is compared with methods based on graph coloring heuristics and Tabu search. We test the solution methodologies on the three most studied publicly available datasets. Overall, MCTS performs better than the method based on graph coloring heuristics; however, it is inferior to the Tabu-based method. Experimental results are discussed.
We start with a brief introduction to the stochastic multi-armed bandit setting. This is a simple mathematical model for sequential decision making in unknown random environments that illustrates the so-called exploration-exploitation trade-off. The initial motivation, in the context of clinical trials, dates back to the works of Thompson [1933, 1935] and Robbins. In this chapter we consider the optimism in the face of uncertainty principle, which recommends following the optimal policy in the most favorable environment among all possible environments that are reasonably compatible with the observations. In a multi-armed bandit, the set of “compatible environments” is the set of possible distributions of the arms that are likely to have generated the observed rewards. More precisely, we investigate a specific strategy, called UCB (where UCB stands for upper confidence bound), introduced by Auer, Cesa-Bianchi, and Fischer in [Auer et al., 2002], that uses simple high-probability confidence intervals (one for each arm) for the set of possible “compatible environments”. The strategy consists of selecting the arm with the highest upper confidence bound (the optimal strategy for the most favorable environment).
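A minimal sketch of the UCB strategy just described, under the standard assumption of rewards bounded in [0, 1]; the Bernoulli arms in the demonstration are illustrative.

```python
# UCB1 (Auer et al., 2002): pull each arm once, then always pull the arm
# maximising the upper confidence bound  mean_i + sqrt(2 ln t / n_i).

import math
import random

def ucb1(arms, horizon):
    """arms: list of zero-argument callables returning a reward in [0, 1]."""
    n = [0] * len(arms)       # pull counts
    mean = [0.0] * len(arms)  # empirical means
    for t in range(1, horizon + 1):
        if t <= len(arms):
            i = t - 1         # initialisation: pull each arm once
        else:
            i = max(range(len(arms)),
                    key=lambda j: mean[j] + math.sqrt(2 * math.log(t) / n[j]))
        reward = arms[i]()
        n[i] += 1
        mean[i] += (reward - mean[i]) / n[i]  # incremental mean update
    return n, mean

# Two Bernoulli arms with success probabilities 0.3 and 0.7: UCB
# concentrates most of its pulls on the better arm.
rng = random.Random(42)
counts, means = ucb1([lambda: float(rng.random() < 0.3),
                      lambda: float(rng.random() < 0.7)], horizon=2000)
```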
In order to address problems with larger search spaces, we must turn to alternative methods. Monte Carlo tree search (MCTS) has had a lot of success in Go and in other applications. MCTS eschews typical brute-force tree searching methods and utilizes statistical sampling instead. This makes MCTS a probabilistic algorithm. As such, it will not always choose the best action, but it still performs reasonably well given sufficient time and memory. MCTS performs lightweight simulations that randomly select actions. These simulations are used to selectively grow a game tree over a large number of iterations. Since these simulations do not take long to perform, MCTS can explore search spaces quickly. This is what gives MCTS the advantage over deterministic methods in large search spaces.
In this chapter we proposed to optimize the search parameters of MCTS by using an evolutionary strategy: the Cross-Entropy Method (CEM). We tested CEM by optimizing 11 parameters of the MCTS program MANGO. Experiments revealed that using a batch size of 500 games gave the best results, although convergence was slow. To be more precise, these results were obtained by using a cluster of 10 quad-core computers running for 3 days. Interestingly, a small (and fast) batch size of 10 still gave reasonable results when compared to the best one. A variable batch size performed slightly worse than a fixed batch size of 50 or 500; however, the variable batch size converged faster. Subsequently, we showed that MANGO with the CEM parameters performed better against GNU GO than the version without them. Moreover, in four self-play experiments with different time settings and board sizes, the CEM version of MANGO convincingly defeated the default version every time. Based on these results, we may conclude that parameter optimization by CEM genuinely improved the playing strength of MANGO for various time settings and board sizes. The nature of our research allows the following generalization: a hand-tuned MCTS-based game engine may improve its playing strength by re-tuning its parameters with CEM.
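The Cross-Entropy Method itself can be sketched in a few lines: sample parameter vectors from a distribution, evaluate a batch, and refit the distribution to the elite fraction. The Gaussian parameterization and the toy fitness function below are illustrative stand-ins for MANGO's game-based evaluation.

```python
# Cross-Entropy Method sketch for continuous parameter tuning: at each
# iteration, sample a batch from an independent Gaussian per dimension,
# keep the elite fraction by fitness, and refit mean and stddev to it.

import random
import statistics

def cem(fitness, dim, batch_size=50, elite_frac=0.2, iters=30, seed=0):
    rng = random.Random(seed)
    mu = [0.0] * dim
    sigma = [1.0] * dim
    n_elite = max(1, int(batch_size * elite_frac))
    for _ in range(iters):
        batch = [[rng.gauss(mu[d], sigma[d]) for d in range(dim)]
                 for _ in range(batch_size)]
        batch.sort(key=fitness, reverse=True)
        elite = batch[:n_elite]
        # refit the sampling distribution to the elite samples
        for d in range(dim):
            vals = [x[d] for x in elite]
            mu[d] = statistics.fmean(vals)
            sigma[d] = statistics.pstdev(vals) + 1e-3  # keep exploring
    return mu

# Toy objective with optimum at (1, -2): CEM should recover it closely.
best = cem(lambda x: -((x[0] - 1) ** 2 + (x[1] + 2) ** 2), dim=2)
```

In the chapter's setting, `fitness` would be the (noisy) win rate of a MANGO configuration over a batch of games, which is why the batch size matters so much for convergence.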
Our experiments show a general improvement in the results compared to the algorithm with which we started our research. Tests were run with either the eligibility-trace or the discount parameter set to one while the other ranged between 0 and 1. In both cases, the best overall results were never obtained with both parameters set to 1. Each game responds differently to our algorithm: in some cases the improvement is drastic, in others it is only slight or absent. The majority of the games we tested on generally show an improvement, but each game reaches its maximum at a different parameter value. One of the factors influencing this is the game itself: we weight the nodes with the scoring, and every game has a different scoring system. Some games are scored throughout, others only at the end.
Since their breakthrough in computer Go, Monte Carlo tree search (MCTS) methods have initiated almost a revolution in game-playing agents: the artificial intelligence (AI) community has since developed an enormous number of MCTS variants and enhancements that advanced the state of the art not only in games, but also in several other domains. Although MCTS methods merge the generality of random sampling with the precision of tree search, their convergence rate can be relatively low in practice, especially when not aided by additional enhancements. This is why practitioners often combine them with expert or prior knowledge, heuristics, and handcrafted strategies. Despite the outstanding results (like the AlphaGo engine, which defeated the best human Go players, prodigiously overcoming this grand challenge of AI), such task-specific enhancements decrease the generality of many applied MCTS algorithms. Improving the performance of core MCTS methods, while retaining their generality and scalability, has proven difficult and is a current research challenge. This thesis presents a new approach for general improvement of MCTS methods and, at the same time, advances the fundamental theory behind MCTS by taking inspiration from the older and well-established field of reinforcement learning (RL). The links between MCTS, which is regarded as a search and planning framework, and RL theory have already been outlined in the past; however, they have neither been thoroughly studied, nor have the existing studies significantly influenced the larger game AI community. Motivated by this, we re-examine in depth the close relation between the two fields and detail not only the similarities, but also identify and emphasize the differences between them.
We present a practical way of extending MCTS methods with RL dynamics: we develop the temporal difference tree search (TDTS) framework, a novel class of MCTS-like algorithms that learn via temporal differences (TD) instead of Monte Carlo sampling. This can be understood both as a generalization of MCTS with TD learning, as well
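The distinction between the two backup styles can be sketched on a single episode: a Monte Carlo backup moves each visited state toward the full return, while a TD(λ) backup bootstraps from current value estimates via eligibility traces. The tabular updates and the toy episode below are illustrative only, not the thesis's actual TDTS implementation.

```python
# Monte Carlo vs. TD(lambda) backups over one episode of states and
# per-step rewards, with tabular value estimates.

def mc_backup(values, counts, episode, final_return):
    """Every-visit Monte Carlo: move each visited state toward the return."""
    for state in episode:
        counts[state] = counts.get(state, 0) + 1
        v = values.get(state, 0.0)
        values[state] = v + (final_return - v) / counts[state]

def td_lambda_backup(values, episode, rewards, alpha=0.1, gamma=1.0, lam=0.8):
    """Offline backward-view TD(lambda) with accumulating eligibility
    traces; missing states (including the post-terminal one) have value 0."""
    trace = {}
    for t, state in enumerate(episode):
        nxt = episode[t + 1] if t + 1 < len(episode) else None
        delta = rewards[t] + gamma * values.get(nxt, 0.0) - values.get(state, 0.0)
        trace[state] = trace.get(state, 0.0) + 1.0
        for s, e in trace.items():
            values[s] = values.get(s, 0.0) + alpha * delta * e
            trace[s] = gamma * lam * e  # decay the trace

values, counts = {}, {}
mc_backup(values, counts, ['a', 'b'], final_return=1.0)

td_values = {}
td_lambda_backup(td_values, ['a', 'b'], rewards=[0.0, 1.0])
```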
remaining computations. Accordingly, for the game of Go there exists a large number of publications about the design of such policies. One objective playout designers pursue is balancing simulations to prevent biased evaluations. Simulation balancing aims to ensure that the policy generates moves of equal quality for both players in any situation. Hence, adding domain knowledge to the playout policy for attacking also necessitates adding domain knowledge for the corresponding defense moves. One of the greatest early improvements in Monte-Carlo Go were sequence-like playout policies that concentrate heavily on local answer moves. They lead to a very selective search. Further concentration on local attack and defense moves improved the handling of some tactical fights and thereby contributed additional strength to MCTS programs. However, by adding more and more specific domain knowledge, with increasingly selective playouts as a result, we open the door to more imbalance. This in turn allows for severe false estimates of position values. Accordingly, the correct evaluation of, e.g., semeai is still considered extremely challenging for MCTS-based Go programs, especially when they require long sequences of correct play by either player. In order to face this issue, we search for a way to make MCTS aware of probably biased evaluations due to the existence of semeai or groups with uncertain status. In this chapter we present our results on the analysis of score histograms to infer information about the presence of groups with uncertain status. We heuristically locate fights on the Go board and estimate their relevance for winning the game. The developed heuristic is not yet used by the MCTS search; accordingly, we cannot yet definitively specify and empirically prove the benefit of the proposed heuristic in terms of playing strength.
We further conducted experiments with our MCTS Computer Go engine Gomorra on a number of 9 × 9 game positions that are known to be difficult to handle by state-of-the-art Go programs. All these positions include two ongoing capturing fights that were successfully recognized and localized by Gomorra using the method presented in the remainder of this chapter.
This paper studies active learning in the context of robust statistics. Specifically, we propose a variant of the best-arm identification problem for contaminated bandits, where each arm pull has probability ε of generating a sample from an arbitrary contamination distribution instead of the true underlying distribution. The goal is to identify the best (or approximately best) true distribution with high probability, with a secondary goal of providing guarantees on the quality of this distribution. The primary challenge of the contaminated bandit setting is that the true distributions are only partially identifiable, even with infinite samples. To address this, we develop tight, non-asymptotic sample complexity bounds for high-probability estimation of the first two robust moments (median and median absolute deviation) from contaminated samples. These concentration inequalities are the main technical contributions of the paper and may be of independent interest. Using these results, we adapt several classical best-arm identification algorithms to the contaminated bandit setting and derive sample complexity upper bounds for our problem. Finally, we provide matching information-theoretic lower bounds on the sample complexity (up to a small logarithmic factor).
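The two robust moments in question can be estimated directly from a contaminated sample. The contamination model below (a point mass far from a standard Gaussian, with ε = 0.05) is an illustrative assumption; it shows why the median and MAD are the right quantities: they stay close to the true distribution's values while the sample mean is dragged far away.

```python
# Median and median absolute deviation (MAD) from an eps-contaminated
# sample: with probability eps a draw comes from an arbitrary
# contamination distribution (here, a point mass at 1000).

import random
import statistics

def robust_moments(sample):
    med = statistics.median(sample)
    mad = statistics.median(abs(x - med) for x in sample)
    return med, mad

rng = random.Random(0)
eps = 0.05
sample = [1000.0 if rng.random() < eps else rng.gauss(0.0, 1.0)
          for _ in range(10_000)]

med, mad = robust_moments(sample)      # both stay near the N(0,1) values
mean = statistics.fmean(sample)        # dragged far from 0 by contamination
```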
The algorithm gradually builds a tree in which the root node represents the current game state. Each node not only holds information about the game state it represents, but also keeps track of the number of times it has been visited and the number of times a simulation through this node led to a victory or loss. The first step is selection, in which a node is chosen that is not yet part of the MCTS tree. This node is then added to the tree in the expansion step. From there, a simulation is performed by playing random moves. The result of the simulated game is then backpropagated to the root. If time remains, another iteration of these steps is executed. The four steps are explained in more detail in the subsequent subsections.
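The four steps above can be sketched compactly; the game interface used here (`legal_moves`, `play`, `is_terminal`, `result`) is an assumed, illustrative API, and UCT is used as the selection rule.

```python
# Minimal MCTS with the four steps: selection, expansion, simulation,
# and backpropagation. result() is assumed to return 1.0 for a root-player
# win and 0.0 for a loss.

import math
import random

class Node:
    def __init__(self, state, parent=None, move=None):
        self.state, self.parent, self.move = state, parent, move
        self.children, self.untried = [], list(state.legal_moves())
        self.visits, self.wins = 0, 0.0

def mcts(root_state, iterations=1000, c=1.4, seed=0):
    rng = random.Random(seed)
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend while fully expanded, maximising UCT
        while not node.untried and node.children:
            node = max(node.children, key=lambda ch: ch.wins / ch.visits
                       + c * math.sqrt(math.log(node.visits) / ch.visits))
        # 2. Expansion: add one new child node, if any move is untried
        if node.untried:
            move = node.untried.pop(rng.randrange(len(node.untried)))
            child = Node(node.state.play(move), parent=node, move=move)
            node.children.append(child)
            node = child
        # 3. Simulation: random playout to a terminal state
        state = node.state
        while not state.is_terminal():
            state = state.play(rng.choice(state.legal_moves()))
        outcome = state.result()
        # 4. Backpropagation: update statistics up to the root
        while node is not None:
            node.visits += 1
            node.wins += outcome
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits).move
```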
The results we have shown here will differ each time the do-file is executed, as a different sequence of pseudo-random numbers will be computed. To generate replicable Monte Carlo results, use Stata’s set seed command to initialize the random-number generator at the top of the do-file (not inside the program!). This will cause the same
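The same principle applies outside Stata; as an illustration (in Python, not Stata), seeding the generator once at the top of the script makes every run reproduce the identical sequence of pseudo-random draws.

```python
# Reproducible Monte Carlo: seed once, up front, and pass the generator
# into the simulation routine, never re-seeding inside it.

import random

def simulate(rng, n=5):
    """A toy Monte Carlo draw; the rng is passed in, not re-seeded here."""
    return [rng.random() for _ in range(n)]

run1 = simulate(random.Random(12345))  # seed set once, at the top
run2 = simulate(random.Random(12345))  # same seed -> identical results
assert run1 == run2
```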
In contrast to MCMC, sequential Monte Carlo (SMC) methods are based on iterative importance sampling, and have traditionally been applied to inference in filtering problems with a sequence of time-varying target distributions. We focus on static SMC methods for Bayesian inference on a fixed target distribution [29, 32, 43, 105]. Static SMC frames inference as a sequential problem by defining an artificial series of incremental targets. This can be done by tempering the target density, by including data points sequentially, or by targeting the full density at every iteration. The latter is a special case known as population Monte Carlo (PMC). SMC offers three striking advantages over MCMC: adaptive proposal mechanisms do not compromise convergence, normalising constants (e.g. model evidence) can be estimated in a straightforward manner, and the particle system can represent multi-modality (where MCMC often gets ‘stuck’ in a single mode).
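A minimal sketch of static SMC in the tempered flavour described above: particles start from the prior and pass through intermediate targets proportional to prior(x) · likelihood(x)^βk with 0 = β0 < … < βK = 1, reweighting and resampling at each step. The Gaussian prior/likelihood pair is an illustrative stand-in chosen so the exact posterior is known; a full implementation would also interleave MCMC move steps after resampling to restore particle diversity, which are omitted here for brevity.

```python
# Static SMC via likelihood tempering with multinomial resampling.

import math
import random

def tempered_smc(log_lik, prior_sample, n=2000,
                 betas=(0.0, 0.25, 0.5, 0.75, 1.0), seed=0):
    rng = random.Random(seed)
    particles = [prior_sample(rng) for _ in range(n)]
    for b_prev, b_next in zip(betas, betas[1:]):
        # incremental importance weights for the tempered bridge
        logw = [(b_next - b_prev) * log_lik(x) for x in particles]
        m = max(logw)
        w = [math.exp(lw - m) for lw in logw]  # stabilised weights
        # multinomial resampling back to equal weights
        particles = rng.choices(particles, weights=w, k=n)
    return particles

# Gaussian prior N(0, 2^2) and one Gaussian observation y = 1 with unit
# variance: the exact posterior is N(0.8, 0.8), so the particle mean
# should land near 0.8.
parts = tempered_smc(lambda x: -0.5 * (x - 1.0) ** 2,
                     lambda rng: rng.gauss(0.0, 2.0))
post_mean = sum(parts) / len(parts)
```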