The Monte Carlo tree search (MCTS) algorithm has over time become one of the preferred choices for solving problems in many domains, not just games. The goal of our research was to enhance one of the versions of MCTS, more precisely the UCT algorithm. We started by replacing the node-selection method with the ε-greedy method. Later on we analyzed the TD-learning paradigm, and ended up incorporating the Sarsa(λ) algorithm into UCT. This resulted in our Sarsa-TS(λ) algorithm. We incorporated the use of eligibility traces, with λ as the trace-decay parameter and γ as the discount factor.
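As a minimal sketch of the first of these changes, ε-greedy node selection can replace the usual UCB rule in the selection step: with probability ε a child is chosen uniformly at random (exploration), and otherwise the child with the highest mean value estimate is taken (exploitation). The `Child` fields and function names below are illustrative assumptions, not taken from the actual implementation.

```python
import random
from dataclasses import dataclass

@dataclass
class Child:
    value_sum: float  # sum of playout returns backed up through this node
    visits: int       # number of times this node was visited

def epsilon_greedy_select(children, epsilon=0.1):
    """epsilon-greedy node selection: with probability epsilon pick a child
    uniformly at random; otherwise pick the child with the highest mean
    value estimate (value_sum / visits)."""
    if random.random() < epsilon:
        return random.choice(children)          # explore
    return max(children, key=lambda c: c.value_sum / max(c.visits, 1))  # exploit
```

With ε = 0 the rule is purely greedy; raising ε trades estimated value for wider coverage of the tree's breadth.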
Since their breakthrough in computer Go, Monte Carlo tree search (MCTS) methods have initiated almost a revolution in game-playing agents: the artificial intelligence (AI) community has since developed an enormous number of MCTS variants and enhancements that advanced the state of the art not only in games, but also in several other domains. Although MCTS methods merge the generality of random sampling with the precision of tree search, their convergence rate can be relatively low in practice, especially when not aided by additional enhancements. This is why practitioners often combine them with expert or prior knowledge, heuristics, and handcrafted strategies. Despite the outstanding results (like the AlphaGo engine, which defeated the best human Go players, prodigiously overcoming this grand challenge of AI), such task-specific enhancements decrease the generality of many applied MCTS algorithms. Improving the performance of core MCTS methods, while retaining their generality and scalability, has proven difficult and is a current research challenge. This thesis presents a new approach for the general improvement of MCTS methods and, at the same time, advances the fundamental theory behind MCTS by taking inspiration from the older and well-established field of reinforcement learning (RL). The links between MCTS, which is regarded as a search and planning framework, and RL theory have already been outlined in the past; however, they have neither been thoroughly studied yet, nor have the existing studies significantly influenced the larger game AI community. Motivated by this, we re-examine in depth the close relation between the two fields and not only detail the similarities, but also identify and emphasize the differences between them.
We present a practical way of extending MCTS methods with RL dynamics: we develop the temporal difference tree search (TDTS) framework, a novel class of MCTS-like algorithms that learn via temporal differences (TD) instead of Monte Carlo sampling. This can be understood both as a generalization of MCTS with TD learning, as well
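The core difference between the two backup styles can be sketched as follows: instead of backing up a single Monte Carlo return, a TD-style backup propagates one-step temporal-difference errors, here in a backward-view TD(λ) update with accumulating eligibility traces over one simulated episode. The data layout and names are illustrative assumptions, not the actual TDTS implementation.

```python
from collections import defaultdict

def td_lambda_backup(trajectory, V, alpha=0.1, gamma=1.0, lam=0.8):
    """One backward-view TD(lambda) backup over a single simulated episode.

    `trajectory` is a list of (state, reward) pairs, where the reward is the
    one received on entering that state, ending in a terminal state; `V` is a
    defaultdict(float) mapping states to value estimates.
    """
    traces = defaultdict(float)
    for t in range(len(trajectory) - 1):
        s, _ = trajectory[t]
        s_next, r = trajectory[t + 1]
        delta = r + gamma * V[s_next] - V[s]   # one-step TD error
        traces[s] += 1.0                       # accumulating eligibility trace
        for state in traces:
            V[state] += alpha * delta * traces[state]
            traces[state] *= gamma * lam       # decay all traces
    return V
```

Setting λ = 1 with γ = 1 recovers a Monte Carlo-style backup in the limit, while λ = 0 gives a pure one-step TD update, which is the sense in which such a rule generalizes the Monte Carlo backup of MCTS.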
Some studies treat RL and MCTS more like two standalone groups of algorithms, but use the value estimations of both to develop stronger algorithms. Gelly and Silver (2007) were the first to combine the benefits of both fields: they used offline TD-learned values of shape features from the RLGO player (Silver, Sutton, & Müller, 2007) as initial estimates for the MCTS-based player MoGo (Gelly et al., 2006). Soon afterwards, Silver, Sutton, and Müller (2008) extended this “one-time” interaction between RL and MCTS to an “interleaving” interaction by defining a two-memory architecture, noted as Dyna-2 – an extension of Dyna (Sutton, 1990). Daswani, Sunehag, and Hutter (2014) suggest using UCT as an oracle to gather training samples that are then used for RL feature learning, which can be seen as an augmentation of Dyna-2 for feature learning. Finnsson and Björnsson (2010) employ gradient-descent TD (Sutton, 1988) for learning a linear function approximator online; they use it to guide the MCTS tree policy and default policy in CadiaPlayer, a twice-champion program in the General Game Playing competition (Genesereth, Love, & Pell, 2005). Ilhan and Etaner-Uyar (2017) also learn a linear function approximator online, but through the true online Sarsa(λ) algorithm (Van Seijen, Mahmood, Pilarski, Machado, & Sutton, 2016), and they use it only in the playout for informing an ε-greedy default policy; they improve the performance of vanilla UCT on a set of GVG-AI games. Robles, Rohlfshagen, and Lucas (2011) employ a similar approach to Finnsson and Björnsson (2010), but they learn the approximator offline and evaluate the performance on the game of Othello; they observe that guiding the default policy is more beneficial than guiding the tree policy. Osaki, Shibahara, Tajima, and Kotani (2008) developed the TDMC(λ) algorithm, which enhances TD learning by using winning probabilities as substitutes for rewards in nonterminal positions.
They gather these probabilities with plain Monte Carlo sampling; however, as future research they propose to use the UCT algorithm.
Abstract— We address the course timetabling problem in this work. In a university, students can select their favorite courses each semester. Thus, the general requirement is to allow them to attend lectures without clashes with other lectures. A feasible solution is a solution where this and other conditions are satisfied. Constructing reasonable solutions for the course timetabling problem is a hard task. Most of the existing methods fail to generate reasonable solutions for all cases. This is because the problem is heavily constrained and an effective method is required to explore and exploit the search space. We utilize Monte Carlo Tree Search (MCTS) for the first time to find feasible solutions. In MCTS, we build a tree incrementally in an asymmetric manner by sampling the decision space, and the tree is traversed in a best-first manner. We propose several enhancements to MCTS, such as simulation and tree pruning based on a heuristic. The performance of MCTS is compared with methods based on graph coloring heuristics and Tabu search. We test the solution methodologies on the three most studied publicly available datasets. Overall, MCTS performs better than the method based on the graph coloring heuristic; however, it is inferior to the Tabu-search-based method. Experimental results are discussed.
playing strategies against players based on Monte Carlo Tree Search (MCTS) and Information Set Monte Carlo Tree Search (ISMCTS). The first rule-based player implements the basic greedy strategy taught to beginner players; the second one implements Chitarella’s rules with the additional rules introduced by Saracino, and represents the most fundamental and advanced strategy for the game; the third rule-based player extends the previous approach with the additional rules introduced in . MCTS requires full knowledge of the game state (that is, of the cards of all the players) and thus, by implementing a cheating player, it provides an upper bound on the performance achievable with this class of methods. ISMCTS can deal with incomplete information and thus implements a fair player. For both approaches, we evaluated different reward functions and simulation strategies. We performed a set of experiments to select the best rule-based player and the best configurations for MCTS and ISMCTS. Then, we performed a tournament among the three selected players and also an experiment involving humans. Our results show that the cheating MCTS player outperforms all the other strategies, while the fair ISMCTS player outperforms all the rule-based players that implement the best known and most studied advanced strategy for Scopone. The experiment involving human players suggests that ISMCTS might be more challenging than traditional strategies.
However, this problem is intrinsically difficult because it is hard to encode what to say into a sentence while ensuring its syntactic correctness. We propose to use Monte Carlo tree search (MCTS) (Kocsis and Szepesvári, 2006; Browne et al., 2012), a stochastic search algorithm for decision processes, to find an optimal solution in the decision space. We build a search tree of possible syntactic trees to generate a sentence, by selecting proper rules through numerous random simulations of possible yields.
remaining computations. Accordingly, for the game of Go there exists a large number of publications about the design of such policies. One of the objectives playout designers pursue is balancing simulations to prevent biased evaluations. Simulation balancing aims at ensuring that the policy generates moves of equal quality for both players in any situation. Hence, adding domain knowledge to the playout policy for attacking also necessitates adding domain knowledge for corresponding defense moves. One of the greatest early improvements in Monte-Carlo Go were sequence-like playout policies that concentrate heavily on local answer moves. They lead to a very selective search. Further concentration on local attack and defense moves improved the handling of some tactical fights and thereby contributed to additional strength gains of MCTS programs. However, by adding more and more specific domain knowledge, with the result of increasingly selective playouts, we open the door to more imbalance. This in turn allows for severe false estimates of position values. Accordingly, the correct evaluation of, e.g., semeai is still considered to be extremely challenging for MCTS-based Go programs. This holds true especially when they require long sequences of correct play by either player. In order to face this issue, we search for a way to make MCTS aware of probably biased evaluations due to the existence of semeai or groups with uncertain status. In this chapter we present our results on the analysis of score histograms to infer information about the presence of groups with uncertain status. We heuristically locate fights on the Go board and estimate their corresponding relevance for winning the game. The developed heuristic is not yet used by the MCTS search. Accordingly, we cannot definitely specify and empirically prove the benefit of the proposed heuristic in terms of playing strength.
We further conducted experiments with our MCTS Computer Go engine Gomorra on a number of 9 × 9 game positions that are known to be difficult for state-of-the-art Go programs to handle. All these positions include two ongoing capturing fights that were successfully recognized and localized by Gomorra using the method presented in the remainder of this chapter.
Monte Carlo Tree Search (MCTS) is a family of directed search algorithms that has gained widespread attention in recent years. It combines a traditional tree-search approach with Monte Carlo simulations, using the outcomes of these simulations (also known as playouts or rollouts) to evaluate states in a look-ahead tree. That MCTS does not require an evaluation function makes it particularly well-suited to the game of Go — seen by many as chess’s successor as a grand challenge of artificial intelligence — with MCTS-based agents recently able to achieve expert-level play on 19×19 boards. Furthermore, its domain-independent nature also makes it a focus in a variety of other fields, such as Bayesian reinforcement learning and general game playing.
Another focus of our research was to develop an effective player for Birds of a Feather. Given a game state, we first applied a simple two-step lookahead to the game tree to see if any goal states were present. If so, we would select the first child node that moved toward an identified goal state. To identify goal states, we developed a solvability checker through which each candidate state is run for evaluation. If the lookahead was unable to find a goal state, we would then determine our move by applying a variation of Monte Carlo Tree Search.
We continue our studies with the full version of MCTS to play Gomoku. We find that while MCTS has shown great success in playing more sophisticated games like Go, it is not effective at addressing the problem of sudden death/win, which ironically does not often appear in Go, but is quite common in simple games like Tic-Tac-Toe and Gomoku. The main reason that MCTS fails to detect sudden death/win lies in the random playout search (RPS) nature of MCTS. Although MCTS in theory converges to the optimal minimax search, under the computational resource constraints found in practice it has to rely on RPS as an important part of its simulation step; it therefore suffers from the same fundamental problem as RPS and is not necessarily always a winning strategy.
In order to be self-contained, we start with a brief introduction to the stochastic multi-armed bandit problem in Chapter 1 and describe the UCB (Upper Confidence Bound) strategy and several extensions. In Chapter 2 we present the Monte-Carlo Tree Search method applied to Computer Go and show the limitations of previous algorithms such as UCT (UCB applied to Trees). This provides motivation for designing theoretically well-founded optimistic optimization algorithms. The main contributions on hierarchical optimistic optimization are described in Chapters 3 and 4, where the general setting of a semi-metric space is introduced and algorithms designed for optimizing a function assumed to be locally smooth (around its maxima) with respect to a semi-metric are presented and analyzed. Chapter 3 considers the case when the semi-metric is known and can be used by the algorithm, whereas Chapter 4 considers the case when it is not known and describes an adaptive technique that does almost as well as when it is known. Finally, in Chapter 5 we describe optimistic strategies for a specific structured problem, namely the planning problem in Markov decision processes with infinite-horizon discounted rewards.
Bubble Breaker. From the results, it appears that the exact number of nodes might depend on the characteristics of the problem. This paper also considers that combination with a pruning algorithm like beam search will be important to obtain more efficient results quickly. This paper considers that SP-MCTS might be a good match for practical scheduling problems, especially a reentrant scheduling problem. We have been focused on the improvement of a printing process as a practical scheduling problem. In the printing process, dial plates used for car tachometers are printed with various colors and character plates. Here, the production lead time can be shortened by grouping the products printed with the same type of color or character plate. When the type of color or character plate is switched to another type, the process requires a “setup operation” with production idle time. So the problem can be formulated by the mini-
The idea of updating a tree by adding leaves dates back to at least Felsenstein (1981), in which he describes, for maximum likelihood estimation, that an effective search strategy in tree space is to add species one by one. More recent work also makes use of the idea of adding sequences one at a time: ARGWeaver (Rasmussen et al. 2014) uses this approach to initialise MCMC on t + 1 sequences (in this case, on a space of graphs) using the output of MCMC on t sequences, and TreeMix (Pickrell and Pritchard 2012) uses a similar idea in a greedy algorithm. In work conducted simultaneously to our own, Dinh et al. (2018) also propose a sequential Monte Carlo approach to inferring phylogenies in which the sequence of distributions is given by introducing sequences one by one. However, their approach uses different proposal distributions for new sequences; does not infer the mutation rate simultaneously with the tree; does not exploit intermediate distributions to reduce the variance; and does not use adaptive MCMC moves. Further investigation of their approach can be found in Fourment et al. (2018), where different guided proposal distributions are explored, but these still present the aforementioned limitations.
This paper presents a novel approach for leveraging automatically extracted textual knowledge to improve the performance of control applications such as games. Our ultimate goal is to enrich a stochastic player with high-level guidance expressed in text. Our model jointly learns to identify text that is relevant to a given game state in addition to learning game strategies guided by the selected text. Our method operates in the Monte-Carlo search framework, and learns both text analysis and game strategies based only on environment feedback. We apply our approach to the complex strategy game Civilization II using the official game manual as the text guide. Our results show that a linguistically-informed game-playing agent significantly outperforms its language-unaware counterpart, yielding a 27% absolute improvement and winning over 78% of games when playing against the built-in AI of Civilization II.
Policy search methods that do not explicitly depend on a model of a system are called model-free approaches; here the required stochastic trajectories are provided by drawing state–action samples from the robot. In the model-based scenario, instead of using real robots, simulation environments are employed and the learned model dynamics are used for observing samples to create robot paths. A good example of this case is given by Tangkaratt et al. (2014), where first a state-space model of the system is learned using the least-squares estimation method, and then the policy is obtained by policy gradients with parameter-based exploration (PGPE), a method proposed earlier by Sehnke et al. (2010). For an extensive study regarding model-based policy search, please refer to Polydoros and Nalpantidis (2017). Although working with simulations is easy in comparison to real robots, learning a forward model of a system is more challenging than learning a policy mapping. On the other hand, working with real robots is challenging due to the iterative interactions that can result in probable damage to the robot.
The algorithm gradually builds a tree, where the root node represents the current game state. Each node not only holds information about the game state it represents, but also keeps track of the number of times it has been visited and the number of times a simulation through this node led to a victory or loss. The first step is selection, in which a node is selected that is not yet part of the MCTS tree. This node is then added in the expansion step. From there, a simulation is performed by playing random moves. The result of the simulated game is then backpropagated to the root. If there is still time left, another iteration of these steps is executed. The four steps are explained in more detail in the subsequent subsections.
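The four steps can be sketched as one compact loop. The toy domain below (single-heap Nim, where the player who takes the last stone wins) and all names are illustrative assumptions for the sketch, not the implementation described here.

```python
import math
import random

# Toy domain for illustration: single-heap Nim, remove 1-3 stones per turn;
# the player who takes the last stone wins.

class Node:
    def __init__(self, stones, player, parent=None, move=None):
        self.stones = stones          # game state: stones left in the heap
        self.player = player          # player to move at this node (0 or 1)
        self.parent = parent
        self.move = move              # move that led from parent to this node
        self.children = []
        self.visits = 0
        self.wins = 0.0               # wins for the player who moved into this node
        self.untried = [m for m in (1, 2, 3) if m <= stones]

def uct_child(node, c=1.4):
    """Tree policy: pick the child maximizing the UCB1 score."""
    return max(node.children, key=lambda ch:
               ch.wins / ch.visits + c * math.sqrt(math.log(node.visits) / ch.visits))

def mcts(root_stones, iters=2000):
    root = Node(root_stones, player=0)
    for _ in range(iters):
        node = root
        # 1. Selection: descend while the node is fully expanded and non-terminal.
        while not node.untried and node.children:
            node = uct_child(node)
        # 2. Expansion: add one not-yet-tried child to the tree.
        if node.untried:
            m = node.untried.pop()
            child = Node(node.stones - m, 1 - node.player, parent=node, move=m)
            node.children.append(child)
            node = child
        # 3. Simulation: random playout from the new node to the end of the game.
        stones, player = node.stones, node.player
        while stones > 0:
            stones -= random.choice([m for m in (1, 2, 3) if m <= stones])
            player = 1 - player
        winner = 1 - player           # the player who took the last stone
        # 4. Backpropagation: update visit and win counts up to the root.
        while node is not None:
            node.visits += 1
            if winner != node.player:  # a win for the player who moved into node
                node.wins += 1.0
            node = node.parent
    # Final move choice: the most-visited child of the root.
    return max(root.children, key=lambda ch: ch.visits).move
```

For instance, `mcts(3)` converges on taking all three stones at once, which wins immediately.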
With the optional early pass feature, the player aborts the search early if the value of the root node is close to a win. In this case, it performs additional searches to check if the position is still a safe win even after passing, and if the status of all points on the board is determined. Determining the status of points uses an optional feature of Fuego, where the search computes ownership statistics for each point on the board, averaged over all terminal positions in the playouts. If passing seems to win with high probability, the player passes immediately. This avoids the continuation of play in clearly won positions and avoids losing points by playing moves in safe territory under Japanese rules. This works relatively well, but is not a full implementation of Japanese rules since playouts are not adapted to that rule set. Point ownership statistics are also used to implement the standard
also been made in multi-player poker and Skat, which show promise towards challenging the best human players. Determinization, where all hidden and random information is assumed known by all players, allows recent advances in MCTS to be applied to games with incomplete information and randomness. The determinization approach is not perfect: as discussed by Frank and Basin, it does not handle situations where different (indistinguishable) game states suggest different optimal moves, nor situations where the opponent’s influence makes certain game states more likely to occur than others. In spite of these problems, determinization has been applied successfully to several games. An MCTS-based AI agent which uses determinization has been developed that plays Klondike Solitaire, arguably one of the most popular computer games in the world. For the variant of the game considered, the performance of MCTS in this case exceeds human performance by a substantial margin. A determinized Monte Carlo approach to Bridge, which uses Monte Carlo simulations with a tree of depth one, has also yielded strong play. The combination of MCTS and determinization is discussed in more detail in Section V.
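In its simplest form, determinization samples the hidden information many times, solves each sample as a perfect-information game, and votes over the recommended moves. The sketch below illustrates this with a deliberately tiny one-trick card game; the `evaluate` stub stands in for a full perfect-information search such as an MCTS run, and all names are assumptions for illustration.

```python
import random
from collections import Counter

def determinize_and_vote(my_hand, unseen, opp_hand_size, evaluate, n_dets=50):
    """Run a perfect-information evaluator over many sampled determinizations
    and vote on the best move.  `evaluate(my_hand, opp_hand)` returns the move
    preferred under full information (a stand-in for a full MCTS search)."""
    votes = Counter()
    for _ in range(n_dets):
        opp_hand = random.sample(unseen, opp_hand_size)  # sample the hidden info
        votes[evaluate(my_hand, opp_hand)] += 1
    return votes.most_common(1)[0][0]

# Toy perfect-information evaluator for a one-trick card game: play the lowest
# card that still beats the opponent's best card, else throw away the lowest.
def evaluate(my_hand, opp_hand):
    winners = [c for c in my_hand if c > max(opp_hand)]
    return min(winners) if winners else min(my_hand)
```

Voting over many determinizations averages out the sampled hidden hands, but, as noted above, it cannot represent the fact that the true optimal move may differ between indistinguishable states.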
In this article we introduce a new MCTS variant, called MCTS-Solver, which has been designed to prove the game-theoretical value of a node in a search tree. This is an important step towards being able to use MCTS-based approaches effectively in sudden-death-like games (including chess). We use the game Lines of Action (LOA) as a testbed. It is an ideal candidate because its intricacies are less complicated than those of chess. So, we can focus on the sudden-death property. Furthermore, because LOA was used as a domain for various other AI techniques [5, 12, 20], the level of state-of-the-art LOA programs is high, allowing us to look at how MCTS approaches perform against increasingly stronger evaluation functions. Moreover, the search engine of a LOA program is quite similar to that of a chess program.
The upper confidence bound applied to trees (UCT) is used by MCTS as the tree policy in the selection step to traverse the tree. UCT balances exploration against exploitation. The exploration approach promotes exploring unexplored areas of the tree. This means that exploration will expand the tree’s breadth more than its depth. While this approach is useful to ensure that MCTS is not overlooking any potentially better paths, it can become very inefficient very quickly in games with a large number of moves. To help avoid that, it is balanced out with the exploitation approach. Exploitation tends to stick to the one path that has the greatest estimated value. This approach is greedy and will extend the tree’s depth more than its breadth. UCT balances exploration and exploitation by giving relatively unexplored nodes an exploration bonus.
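The balance described above comes from the UCB1 formula applied at each node: the child's mean value (exploitation) plus a bonus that grows with the parent's visit count and shrinks with the child's (exploration). A minimal sketch, assuming win/visit counters as described:

```python
import math

def uct_value(child_wins, child_visits, parent_visits, c=math.sqrt(2)):
    """UCB1 applied to trees: mean value (exploitation) plus an exploration
    bonus that shrinks as the child accumulates visits."""
    if child_visits == 0:
        return float("inf")  # unvisited children are always tried first
    exploit = child_wins / child_visits
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore
```

During selection, the tree policy descends to the child with the highest `uct_value`; the constant c (often √2 in the literature) tunes how strongly relatively unexplored nodes are favored.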