
Air Force Institute of Technology

AFIT Scholar

Theses and Dissertations Student Graduate Works

3-14-2014

Outperforming Game Theoretic Play with

Opponent Modeling in Two Player Dominoes

Michael M. Myers

Follow this and additional works at: https://scholar.afit.edu/etd

This Thesis is brought to you for free and open access by the Student Graduate Works at AFIT Scholar. It has been accepted for inclusion in Theses and Dissertations by an authorized administrator of AFIT Scholar. For more information, please contact richard.mansfield@afit.edu.

Recommended Citation

Myers, Michael M., "Outperforming Game Theoretic Play with Opponent Modeling in Two Player Dominoes" (2014). Theses and Dissertations. 617.


OUTPERFORMING GAME THEORETIC PLAY WITH OPPONENT MODELING IN TWO PLAYER DOMINOES

THESIS

Michael M. Myers, Captain, USAF
AFIT-ENG-14-M-57

DEPARTMENT OF THE AIR FORCE
AIR UNIVERSITY

AIR FORCE INSTITUTE OF TECHNOLOGY

Wright-Patterson Air Force Base, Ohio

DISTRIBUTION STATEMENT A. APPROVED FOR PUBLIC RELEASE; DISTRIBUTION UNLIMITED.


The views expressed in this thesis are those of the author and do not reflect the official policy or position of the United States Air Force, Department of Defense, or the United States Government. This material is declared a work of the United States Government and is not subject to copyright protection in the United States.


AFIT-ENG-14-M-57

OUTPERFORMING GAME THEORETIC PLAY WITH OPPONENT MODELING IN TWO PLAYER DOMINOES

THESIS

Presented to the Faculty

Department of Electrical and Computer Engineering
Graduate School of Engineering and Management

Air Force Institute of Technology
Air University
Air Education and Training Command

In Partial Fulfillment of the Requirements for the
Degree of Master of Science in Electrical Engineering

Michael M. Myers, BS
Captain, USAF

March 2014

DISTRIBUTION STATEMENT A. APPROVED FOR PUBLIC RELEASE; DISTRIBUTION UNLIMITED.


OUTPERFORMING GAME THEORETIC PLAY WITH OPPONENT MODELING IN TWO PLAYER DOMINOES

Michael M. Myers, BS
Captain, USAF

Approved:

Brett J. Borghetti, PhD (Chairman)     //signed//     7 March 2014
Maj Brian G. Woolley (Member)          //signed//     11 March 2014
Maj Kennard R. Laviers (Member)        //signed//     6 March 2014
Gilbert L. Peterson, PhD (Member)      //signed//     7 March 2014


Abstract

Dominoes is a partially observable extensive form game with probability. The rules are simple; however, the complexity and uncertainty of the game make it difficult to solve with standard game theoretic methods. This thesis applies strategy-prediction opponent modeling in combination with game theoretic search algorithms in the game of two player dominoes. This research also applies methods to compute the upper bound on the value that predicting a strategy can provide against specific strategy types. Furthermore, the actual values are computed according to the accuracy of a trained classifier. Empirical results show that there is a potential value gain over a Nash equilibrium player in score for fully and partially observable environments for specific strategy types. The actual value gained is positive for a fully observable environment for score and total wins and ties. Actual value gained over the Nash equilibrium player from the opponent model exists only for score, while the opponent modeler demonstrates a higher potential to win and/or tie in comparison to a pure game theoretic agent.


Acknowledgments

I would like to thank the members of my committee: Doctor Brett Borghetti, Majors Kennard Laviers and Brian Woolley and Dr. Gilbert Peterson. I have gained a countless amount of knowledge from each of your classes and your advisory sessions. My work would not be complete if it weren’t for you. To Lt. Col. Jeffrey D. Clark (honorary committee member), your class and offline teachings helped me to gain great ideas as to how I would build certain aspects of my implementation.

I would also like to give special thanks to Captain Kyle Stewart for spending countless hours helping me with Qt and other C++ related problems that I came across during my research process. I know you had your doctoral research to do but you always took time to help students like me. I would not be finished with this project without your help.

I would also like to thank the members of the AFIT Cyber ANiMaL group for giving me advice on how to present my thesis points as well as where to go in my research.

Lastly, I would like to thank my loving wife for putting up with me for all of these months here at AFIT. I know it wasn’t easy on you taking care of the house and family in my absence but I truly appreciate it.


Table of Contents

Abstract
Acknowledgments
Table of Contents
List of Figures
List of Tables

I. Introduction
   Problem Statement
   Complexity of Dominoes
      Branching Factor
      Starting State
      Uncertainty
   Simplifying the Game
   Scope
   Limitations
   Assumptions
   Contributions
   Methodology
   Results
   Thesis Overview

II. Literature Review
   Fully Observable Games
   MiniMax Search
   Stochastic Partially Observable Games
   Game theory and Opponent Modeling
      M* Search
      Opponent Modeling in Poker
      Generalizing Opponents to Simplify the Opponent Space
      Environment Value
      Other Methods
      Dominoes AI Agents

III. Methodology
   Introduction
   Problem Definition
      Goals and Hypothesis
      Approach
      Testing
   Dominoes Experimental Setup
      Adversarial Search Decision Engine
      Targets
      The Opponent Classifier
   Evaluation
      Difference of means
      Binary test
      Illustrating the data

IV. Analysis and Results
   Chapter Overview
   Classifier Performance Results
      Evaluating each Classifier
      Classifier Selection
   Opponent Model Value Assessment
      Perfect Information
      Imperfect Information
   Actual Value Evaluation
   Summary of Results

V. Conclusions and Future Work
   Future Work
      Opponent Modeler for Board State
      Non-Myopic Targets
      Applying a Probability Distribution Model to the OM
   Summary

Appendix A. Rules of Dominoes


List of Figures

Figure 1: Example instance of double-6 dominoes game tree where the opponent has all possible responses to spinner
Figure 2: Partially observable game with probability
Figure 3: MiniMax game tree
Figure 4: M* Algorithm
Figure 5: Game tree recursive search of M* with M*(a,3,f2(f1,f0))
Figure 6: A neural network predicting an opponent's future action
Figure 7: Mixture-Based Architecture for Opponent Modeling in Guess It
Figure 8: Game tree representation of dominoes
Figure 9: Play options available for the player based on target's defensive strategy
Figure 10: Expectimax algorithm
Figure 11: M* Algorithm for dominoes
Figure 12: Game tree before M* pruning process
Figure 13: Game tree after M* pruning process
Figure 14: Algorithm for myopic scoring target
Figure 15: Example of defensive score calculation to make a decision
Figure 16: Algorithm for defensive scoring target
Figure 17: Domino board representation where the board count is 16
Figure 18: Full set of double-six dominoes (28 pieces)


List of Tables

Table 1: Table of Starting States for Dominoes
Table 2: Re-Weighting of Various Hands after the Flop
Table 3: Neural Network Inputs
Table 4: Guess It Strategy Hierarchy Matrix
Table 5: Confusion matrix of the opponent modeler
Table 6: Quadrant chart of experimental results
Table 7: Opponent Modeling Classifier Requirements Matrix
Table 8: Classifier Inputs
Table 9: Classifier training order per game
Table 10: Sample score differential results table for the value of the M* decision engine after all levels and factors of experimentation
Table 11: Sample win/tie data table for the value of the M* decision engine after all levels and factors of experimentation
Table 12: List of levels and factors for experimentation
Table 13: Confusion matrix validation results on 1,649 samples from the MDA Classifier with all three targets and 8,242 samples
Table 14: Confusion matrix validation results on 10,059 samples from the MDA Classifier with all three targets and 50,291 samples
Table 15: Confusion matrix validation results on 25,945 samples from the MDA Classifier with all three targets and 129,724 samples
Table 16: Confusion matrix validation results on 1,319 samples from the Random Forests Classifier with all three targets and 8,242 samples
Table 17: Confusion matrix validation results on 8,047 samples from the Random Forests Classifier with all three targets and 50,291 samples
Table 18: Confusion matrix validation results on 20,756 samples from the MDA Classifier with all three targets and 129,724 samples
Table 19: Mean per-game score differentials between the oracle and Expectimax for each target for 50,400 games in a fully observable environment with a target oracle
Table 20: Total win and tie mean differences between oracle and Expectimax for each target in a fully observable environment with a target oracle
Table 21: Mean per-game score differentials between oracle and Expectimax for each target in a fully observable environment with an opponent model
Table 22: Total win and tie mean differences between oracle and Expectimax for each target in a fully observable environment with an opponent model
Table 23: Mean per-game score differentials between M*(oracle) and Expectimax for each target under imperfect information in a partially observable environment with a target oracle
Table 24: Win and tie totals between M*(oracle) and Expectimax for each target in a partially observable environment with a target oracle
Table 25: Mean per-game score differentials between M*(OM) and Expectimax for each target in a partially observable environment with an opponent model
Table 26: Win and tie total difference between M*(OM) and Expectimax for each target in a partially observable environment with an opponent model


OUTPERFORMING GAME THEORETIC PLAY WITH OPPONENT MODELING IN TWO PLAYER DOMINOES

I. Introduction

Problem Statement

The game of dominoes is a partially observable extensive form game (or POEFG) with probability. An extensive form game is a game in which each player takes an action and there are payoffs as a result of the actions made [22]. The rules (found in Appendix A) are simple. With its numeric tile structure and straightforward rule set, dominoes is a resizable game that allows researchers to examine simple to very complex versions of POEFGs with probability by adjusting only a few parameters of the game.

Dominoes is also probabilistic in nature. The rules state that there is a boneyard (i.e., the pile of face-down dominoes) that players pull from when they have no available play. This adds the element of chance to the game tree. Moreover, players can only view their own hand and the contents of the board. They cannot see their opponents' hands or the boneyard.

The goal of this research is to produce an opponent modeling algorithm that can predict an opponent’s style of play as well as provide an optimal move in a partially observable environment with probability. This research focuses on choosing optimal moves by exploiting the opponent’s strategy in place of applying the Nash equilibrium solution. (Nash equilibrium is achieved when a player, given strategies of other players in the same game, has nothing to gain by unilaterally switching his strategy [25].) Common algorithms like MiniMax (explained in Chapter 2) apply a Nash Equilibrium


(NE) solution to select optimal moves. This research evaluates the expected and actual gain in value that a specific opponent modeler provides over a NE solution for the game of two player dominoes. The next section outlines the complexity of dominoes as well as how this game is simplified to perform this research.

Complexity of Dominoes

There are several factors that influence computational complexity in dominoes. The first is the branching factor: the larger the branching factor becomes, the longer a decision engine has to search to find the best action (and the growth in search effort is often much worse than linear, and often even worse than polynomial). The number of starting states also has a large impact because, unlike board games such as tic-tac-toe and chess, dominoes has billions of starting states (as illustrated in Table 1 and Equation 2). Lastly, dominoes is a game of partial observability in which many aspects of the game are hidden, and boneyard pulls add probability. All of these issues make solving dominoes a hard problem.

Branching Factor.

The branching factor of dominoes is calculated by examining the number of possible actions or moves per turn. This number depends on the number of dominoes in a player's hand and the number of places that a domino can be positioned on the board. Additionally, a domino can be placed in more than one location depending on the domino and the player's strategy. Figure 1 illustrates how dominoes can be placed on the left or right side of the spinner (the first domino placed on the board).


Figure 1: Example instance of double-6 dominoes game tree where the opponent has all possible responses to spinner.

This factor multiplies the branching factor by up to four depending on legal plays available.
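To make the branching-factor bound concrete, the following C++ sketch counts legal placements for a hand given the open ends of the board chain. The domino and board representations here are illustrative only and are not the thesis implementation.

    #include <utility>
    #include <vector>

    // A domino is a pair of pip values, e.g. {6, 3}.
    using Domino = std::pair<int, int>;

    // Count legal placements: a domino may be played on any open end whose
    // exposed pip value matches one of its two faces. With a spinner in play
    // there can be up to four open ends, so the branching factor at a player
    // node is bounded by 4 * (number of dominoes in hand).
    int countLegalMoves(const std::vector<Domino>& hand,
                        const std::vector<int>& openEnds) {
        int moves = 0;
        for (const Domino& d : hand)
            for (int end : openEnds)
                if (d.first == end || d.second == end)
                    ++moves;
        return moves;
    }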

Starting State.

The starting state of dominoes is important in comparing this game with fully observable board games such as chess because board games tend to only have a single starting position. Dominoes is interesting because the starting state possibilities change depending on the number of players in the game and the size (or number of dominoes) of the set. Games like poker also have multiple start states; however, unlike poker dominoes has multiple payouts throughout the game instead of one main payout at the end of a hand.

To calculate the number of starting states in two player dominoes, the set size must be found first. The set size depends on the domino with the highest pip count in the set. Set size is calculated by applying


S = \sum_{i=1}^{n} i = \frac{n(n+1)}{2}    (1)

where n represents the highest pip count in a set of dominoes + 1 (i.e., for double-6 dominoes, n = 7). One is added to the highest pip count to account for the zero (or blank) dominoes. By applying Equation 1, double-6 dominoes contain 28 dominoes while double-3 dominoes contain 10 dominoes. Now that the set size is calculated, Equation 2 illustrates the number of possible starting states for two player dominos with set size L and a boneyard size B.

SS = \binom{L}{B} \binom{L-B}{D}    (2)

where SS is the number of starting states, L is the size of the set, B is the boneyard size (B ≤ L − 2) and D is the size of a player's hand (both players have equal hand sizes). Table 1 illustrates how the number of starting states grows exponentially as the pip size grows.

Table 1: Table of Starting States for Dominoes.

Pip Size    Dominoes in set    Distribution    Starting States
Double-2    6                  2-2-2           20
Double-3    10                 3-3-4           4,200
Double-4    15                 5-5-5           576,576
Double-5    21                 7-7-7           271,591,320
Double-6    28                 9-9-10          6.38 × 10^11


While strategies in many board games involve computing a set of plays based on a single board state, 638 billion starting states make it very difficult to compute the perfect play for any starting state at the beginning of a game.
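As a quick check on Equation 2, the number of starting states can be computed directly from the set size L, boneyard size B, and hand size D. The sketch below is illustrative C++ (the helper names are hypothetical) and reproduces the double-3 entry of Table 1.

    #include <cstdint>
    #include <iostream>

    // Binomial coefficient C(n, k); exact for the sizes used here.
    static std::uint64_t choose(std::uint64_t n, std::uint64_t k) {
        if (k > n) return 0;
        std::uint64_t result = 1;
        for (std::uint64_t i = 1; i <= k; ++i)
            result = result * (n - k + i) / i;
        return result;
    }

    int main() {
        // Equation 2: SS = C(L, B) * C(L - B, D); the remaining L - B - D
        // dominoes form the second player's hand.
        std::uint64_t L = 10, B = 4, D = 3;           // double-3 dominoes
        std::cout << choose(L, B) * choose(L - B, D)  // prints 4200
                  << std::endl;
        return 0;
    }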

Uncertainty.

Dominoes is also a partially observable game with probability. This adds the element of uncertainty to the complexity of the game by introducing chance nodes and information sets. A chance node is a position on the game tree in which probability is associated with the action taken. For dominoes, chance nodes arise when a player pulls from the boneyard. Information sets are positions on the game tree in which the agent has no knowledge of its position on the tree. Figure 2 illustrates a partially observable game instance (constructed in Gambit software [21]) of double-2 dominoes where there are two player agents, the player (red) and opponent (blue). There is also a chance agent (green) that represents the probabilistic property of drawing dominoes from the boneyard.

The player agent can observe its own two dominoes that are (1|1) (where (1|1) represents the one-one domino) and (1|0); however, it cannot see the target agent’s two dominoes or the other two dominoes in the boneyard. From the player’s perspective, the target and boneyard share a uniform distribution of the dominoes (2|2), (2|0), (2|1) and (0|0) with each entity having 2 unknown dominoes [21].

(19)

6

(20)

7

Information sets (labeled by dashed lines) indicate a set of nodes in which players cannot be sure to which node they belong. For dominoes, information sets represent instances when players cannot differentiate the opponent's current state or future action. Because of this uncertainty, a decision engine must evaluate all possible outcomes and payouts to calculate an accurate expected value of the game from every member of the information set. This means that the engine must account for every possible combination of dominoes between the target and the boneyard. This adds more computational complexity to solving the game. Applying this logic, the game instance in Figure 1 must compute the current complex problem (branching factor of 6) along with every other possible start state (or \binom{19}{9} = 92,378 starting states, including each state's branching factors) of the opponent, assuming the player hand stays the same.

Simplifying the Game

In order to solve a game of dominoes for this research, the game must be simplified. One way to simplify the game of dominoes is to use subsets of the total set of dominoes, where the total set of dominoes is 28 dominoes ranging from (0|0) to (6|6). A subset of dominoes ranges from (0|0) to a smaller value domino such as (2|2) with six dominoes or (3|3) that contains 10 dominoes. This is depicted in Table 1.


Scope

The scope of this research involves producing an algorithm that has the ability to make predictions of the opponent’s strategy as well as producing an optimal move for a player agent.

Limitations

Experimentation for this research employs double-3 dominoes. A set of double-3 dominoes contains enough dominoes to evaluate two player dominoes games with partially observable information and chance nodes. For this research, the starting state of all games includes two players’ hands of three dominoes with a boneyard size of four. This subset of dominoes is chosen because it is the smallest subset of dominoes that still provides enough dominoes to have a boneyard size slightly larger than one player’s hand size (that is similar to play in a full set). For instance the starting state of double-6 dominoes consists of nine dominoes in both players’ hands with a boneyard size of 10 dominoes. In addition, larger boneyard sizes create larger branching factors at chance nodes.

Additionally, all games end after the first round. In other words, after the first domino hand is played, the participant with the highest score wins. Playing only one round is different than establishing a predefined score and playing many rounds until that score is achieved. Since these limitations do not affect the basic rules of the game or how it is played, more computations can be applied to solve this problem.


Assumptions

Research is conducted under the assumption that there is no bluffing involved in dominoes. Bluffing in dominoes entails a player implying he or she lacks an available play by pulling from the boneyard even if he or she has an existing play. All domino rules applied in this research come from Domino Basics and Domino [2, 12]. It is assumed that the domino rules followed from these references apply to real world domino games.

Contributions

This research project contributes in the areas of game theory and opponent modeling. The research presents empirical evidence showing that, for three different strategy opponents, there is potential and actual value gained over a game theoretic player by exploiting an opponent’s actions. Furthermore, this research confirms the actual value computed by employing an MDA classifier predicting three different playing styles in the game of dominoes.

Methodology

To complete this research, a simple dominoes agent is developed to play games according to the rules of dominoes. The agent is required to facilitate every start state of double-3 dominoes between a player agent and three different opponent types. The player agent uses an opponent modeling algorithm to predict the type of the opponent. The player agent then plays by incorporating the opponent's type into a backwards induction algorithm [25] to maximize score.


Results show the expected and actual value gained by the opponent model as well as how well it predicts the opponent type. The expected value gain of the opponent model is a calculation of the highest gain in value (over a Nash equilibrium player) that an opponent model can provide against a specific opponent strategy. This gain is computed by recording the value achieved when an oracle (an opponent model with 100% prediction accuracy) is supplied to the player, and subtracting the value achieved when applying Nash equilibrium decisions with no oracle. The actual value gain is the gain in value over the Nash equilibrium player while applying an opponent classifier in place of the oracle. The prediction accuracy of the opponent classifier is calculated by taking the ratio of the number of correctly predicted targets to the number of attempts to predict a specific target.
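Restated compactly (the symbols below are introduced here only for illustration and are not the thesis notation), with V(t) denoting the mean value achieved against target type t:

    \begin{aligned}
    \text{expected gain}(t) &= V_{\text{oracle}}(t) - V_{\text{NE}}(t) \\
    \text{actual gain}(t)   &= V_{\text{OM}}(t) - V_{\text{NE}}(t) \\
    \text{accuracy}(t)      &= \frac{\text{correct predictions of } t}{\text{prediction attempts for } t}
    \end{aligned}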

Results

Empirical results show that there is a potential value gain over a Nash equilibrium player in score for fully and partially observable environments for specific strategy types. The actual value gained is positive for a fully observable environment for score and total wins and ties. Actual value gained over the Nash equilibrium player from the opponent model exists only for score, while the opponent modeler demonstrates a higher potential to win and/or tie in comparison to a pure game theoretic agent.

Thesis Overview

Chapter II of this thesis discusses previous work in the field of game theory and opponent modeling as well as definitions of many of the concepts discussed in this thesis. Chapter III illustrates how the algorithm is created and implemented, while Chapter IV


discusses and analyzes the results of the implementation and testing. Finally, Chapter V concludes this thesis and discusses future work.


II. Literature Review

In game theory, there are many ways to overcome an opponent. The following sections review concepts from previous work in the areas of game theoretic tree search and opponent modeling. This chapter also provides details of the advantages and limitations of game tree search. Furthermore, this chapter illustrates the benefits of opponent modeling in partially observable strategy games.

Fully Observable Games

Chess is a game of perfect knowledge. Games with perfect knowledge are called fully observable games [26]. Deep Blue applies game tree search in chess. In doing so, this system was able to defeat chess world champion Garry Kasparov [5, 26]. The Deep Blue agent models a chess board and makes predictions based on outcomes (in most cases) 14 game tree layers in advance. Fully observable board games like checkers, chess and go are constantly researched in efforts to find more efficient winning strategies.

With perfect knowledge, programming an agent can be as simple as implementing a game tree and performing a search to find the optimal move for each turn; however, problems (such as increased search times and stack overflows) arise as the game complexity increases. Additional issues arise when choosing the best evaluation function, which is simply the algorithm that returns the best prediction of the value of the game at a specific node. MiniMax is a search algorithm that relies on an evaluation function and “serves as the basis for mathematical analysis of games” [26].


MiniMax Search

When working with a complex search tree to find the optimal move to make at every level of the tree, a wise decision is to limit the counter move that the opponent is capable of making for that turn. This is done by choosing the move that maximizes the player’s chance of winning, while minimizing the opponent’s winning chances.

According to Funge and Millington [14], it can be easy to predict chances of winning towards the end of the game based on the minimal count of plays remaining. Conversely, counting remaining plays can be more difficult in the beginning and middle of the game, because of the uncertainty of where the game ends. Therefore, an evaluation function is employed to predict how the end game states appear [14]. This function can also be called a heuristic function on a game tree search.

For this search to work, the player and the opponent (or adversary) are labeled “MAX” and “MIN,” respectively. Furthermore, as explained above, each level of the game tree represents one of these participants.

Figure 3 illustrates how the evaluation function assigns values to every node in the tree in terms of MAX and decides the most advantageous move to make. The algorithm assesses every move possibility in the game tree with an evaluation function to predict the ending score in the leaf nodes. The algorithm then propagates this final score up through the nodes to the two possible moves that MAX can take. The node with the highest score that MIN will allow is chosen [17].


Figure 3: MiniMax game tree.

As shown in Figure 3 MAX chooses the right node at a value of 2. This assessment is completed at every turn throughout the game until a terminal (win, loss or tie) state is achieved.
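A minimal recursive MiniMax sketch in C++ is shown below. The node representation is hypothetical and the leaf values are assumed to have been filled in by an evaluation function; it illustrates the propagation described above and is not the thesis code.

    #include <algorithm>
    #include <limits>
    #include <vector>

    struct Node {
        std::vector<Node> children;  // empty at leaf positions
        double value;                // evaluation function result at a leaf
    };

    // MAX picks the child with the highest value, MIN the child with the
    // lowest; leaves return the evaluation function's estimate.
    double minimax(const Node& n, bool maxToMove) {
        if (n.children.empty())
            return n.value;
        double best = maxToMove ? -std::numeric_limits<double>::infinity()
                                :  std::numeric_limits<double>::infinity();
        for (const Node& child : n.children) {
            double v = minimax(child, !maxToMove);
            best = maxToMove ? std::max(best, v) : std::min(best, v);
        }
        return best;
    }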

The Expectimax (EM) algorithm [26] is applied to MiniMax to account for probability (chance nodes) within the game tree. Equation 3 illustrates the value calculated at a chance node within a tree with probabilistic nodes [16]. This equation states

Expectimax(s) = \sum_{i=1}^{n} P(child_i) \times U(child_i)    (3)

where P(child_i) is the probability of a specific child node of s, U(child_i) is the utility value of child_i, and n is the number of chance options (for this research, dominoes in the boneyard).
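Equation 3 can be grafted onto the MiniMax sketch above by adding a chance-node case. The representation below is again hypothetical; child probabilities are assumed to be stored on the node.

    #include <cstddef>
    #include <vector>

    enum class NodeType { Max, Min, Chance, Leaf };

    struct EmNode {
        NodeType type;
        std::vector<EmNode> children;   // non-empty for interior nodes
        std::vector<double> probs;      // child probabilities at chance nodes
        double value;                   // evaluation at leaves
    };

    // Expectimax: MAX and MIN behave as in MiniMax; a chance node returns
    // the probability-weighted sum of its children (Equation 3).
    double expectimax(const EmNode& n) {
        if (n.type == NodeType::Leaf) return n.value;
        if (n.type == NodeType::Chance) {
            double expected = 0.0;
            for (std::size_t i = 0; i < n.children.size(); ++i)
                expected += n.probs[i] * expectimax(n.children[i]);
            return expected;
        }
        double best = expectimax(n.children.front());
        for (std::size_t i = 1; i < n.children.size(); ++i) {
            double v = expectimax(n.children[i]);
            if ((n.type == NodeType::Max && v > best) ||
                (n.type == NodeType::Min && v < best))
                best = v;
        }
        return best;
    }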


Stochastic Partially Observable Games

A game is said to be partially observable and stochastic if there is any random and hidden information. Examples of this type of game are card games such as Poker and Bridge where players are dealt random hands hidden to their opponents [26]. Furthermore, as stated in the last chapter, the game of dominoes also shares this trait.

Partially observable stochastic games take on many forms and fashions. Card games like Poker and Bridge are partially observable because participants are unaware of their opponent’s cards. This concept also holds true for games like dominoes. The stochastic element is encompassed in random card distribution [26] as well as pulls from the boneyard. Non-stochastic elements of dominoes such as a fixed opponent strategy can be modeled by applying opponent modeling techniques.

Game theory and Opponent Modeling

Game theory can be augmented by modeling an opponent’s future actions and/or current state. Ross’s [25] definition of game theory outlines how it is applied to various applications:

Game theory is the study of the ways in which strategic interactions among economic agents produce outcomes with respect to the preferences (or utilities) of those agents, where the outcomes in question might have been intended by none of the agents [25].

This means that when two agents interact with one another, even though each agent has a desired outcome, the outcome of that interaction could be in the favor of one, all or none of the agents involved. Modeling an opponent allows the player to better decide which action it needs to take to receive the best possible utility (or quantified outcome) in a


game situation. The first part of this section discusses the M* search algorithm, which aids a game theoretic approach by adding an opponent model to adversarial search (MiniMax). The second part discusses research in applying opponent modeling to poker. Later discussed is how neural networks are employed to learn an approximation of the opponent model that has lower memory requirements than the full opponent model, in order to simplify the opponent space and expand the number of opponents that can be modeled. Lastly is a discussion of how to rank opponent models in order to choose an optimal model.

M* Search.

This section discusses Carmel and Markovitch’s M* search algorithm [6]. The goal of this search algorithm is to take advantage of the opponent’s weaknesses (tendencies to make plays other than game theoretic equilibrium moves) throughout the game. To accomplish this, M* applies an opponent model to the MiniMax algorithm. Results from running this algorithm demonstrate a higher winning percentage in fully observable extensive form games.

The M* search algorithm employs an opponent model in order to exploit the opponent's weaknesses through adversarial search [6]. Assuming an accurate opponent model, the information obtained gives the player the advantage of playing a best response to the opponent's modeled strategy instead of the best response to a presumed Nash Equilibrium opponent. This situational awareness offers the player more opportunities to win games that are otherwise unwinnable in equilibrium play. M* works by identifying instances in the game where an opponent underestimates the potential of a


certain action and chooses a move with less value, or times when the opponent overestimates the potential of a move and takes a detrimental action. These occurrences are called swindle and trap positions respectively.

The M* algorithm works similar to MiniMax by employing an evaluation function and search depth to maximize what a player can achieve in a certain play; however, the difference is that M* incorporates an opponent model to provide the specific action of the opponent. This provides the algorithm information on how the opponent may differ from the Nash Equilibrium opponent. The following definition by Carmel and Markovitch, further explains how this opponent model is applied to the algorithm.

1. Given an evaluation function f, P = (f, NIL) is a player [6] (with a modeling-level 0, or search depth of zero).

2. Given an evaluation function f and a player O (with modeling level n - 1), P = (f, O) is a player (with a modeling level n).

The modeling level of the opponent corresponds to the level in which the player models the opponent. For example, a player with a modeling level of zero does not model its opponent. The modeling level of 1 models an opponent with a modeling level of zero and so on [6].

The player P from this definition is used as the opponent model. Using P in the equation, the input of M* becomes <position, depth, (fpl, P)>, where fpl is the player's evaluation function. The output of the search can be an action,


a state or both depending on the preference of the software designer. Figure 4 illustrates the algorithm [6].

Figure 4: M* Algorithm [6].

Figure 5 shows an example game tree that illustrates the recursive calls by the M* algorithm [6].


Figure 5: Game tree recursive search of M* with M*(a,3,f2(f1,f0)) [6].

The example in Figure 5 illustrates how the player “swindles” the opponent by convincing the opponent that the player will select leaf node-o; however, the player actually chooses node-n. This occurs as a result of the opponent's model of the player presuming the player is using the f0 evaluation function. Since this is the case, the opponent's model propagates a value of -6 to node-b and a 0 to node-c. Nonetheless, the player is really using the f2 evaluation function, which returns a value of 10 from node-n. This value is strong enough to propagate back to the root (through backward induction) and become the max value for the player. This value is also the highest value of all f2 leaf node evaluations. One thing to note here is that a standard MiniMax evaluation of this game using the f2 evaluation function yields 7. Therefore, this evaluation demonstrates the additional value that exploiting the opponent's model can provide over equilibrium play.
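A highly simplified C++ sketch of the M* idea follows. It uses a one-ply opponent model (the opponent's choice is predicted with a static modeled evaluation) rather than the full recursive modeling hierarchy of Figure 4, and the node fields are hypothetical; it only illustrates how the player's own value of the opponent's predicted reply is what propagates upward.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct GameNode {
        std::vector<GameNode> children;
        double fPlayer;    // player's evaluation f_pl of this position
        double fOpponent;  // modeled opponent's evaluation of this position
    };

    // M*-style propagation: at the player's nodes the player maximizes its
    // own value; at opponent nodes the child the opponent model prefers is
    // selected, and the player's value of that child is returned.
    double mStar(const GameNode& n, bool playerToMove) {
        if (n.children.empty())
            return n.fPlayer;
        if (playerToMove) {
            double best = mStar(n.children[0], false);
            for (std::size_t i = 1; i < n.children.size(); ++i)
                best = std::max(best, mStar(n.children[i], false));
            return best;
        }
        std::size_t pick = 0;  // move predicted by the opponent model
        for (std::size_t i = 1; i < n.children.size(); ++i)
            if (n.children[i].fOpponent > n.children[pick].fOpponent)
                pick = i;
        return mStar(n.children[pick], true);
    }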


Carmel and Markovitch's results from this study show that M* is more effective than a Nash Equilibrium agent at winning fully observable extensive form games. Their tests result in a higher number of wins and higher total scores against opponents in the fully observable games of tic-tac-toe and checkers.

The M* algorithm is an interesting culmination of logic in that it not only attempts to acquire the best value for the player, but it also works to exploit the weakness of the opponent. M* shares many concepts with MiniMax; however, it differs from MiniMax by employing an opponent model that predicts moves based on the opponent's strategy. This type of logic can be used to expose higher scoring games and/or wins that Nash Equilibrium does not allow; however, this opponent modeling research example only accounts for fully observable games. The next two sections provide examples of opponent modeling use for partially observable games.

Opponent Modeling in Poker.

When researching opponent modeling for partially observable games, poker is one of the most frequently studied games. This section describes two poker agents that model opponents. The first is an agent called Loki [3, 11]. This agent applies weights to possible opponent states (or hands) based on the opponent's actions in previous rounds. The second agent, called Poki [10], builds on Loki by predicting potential actions the opponent will select. Both methods are shown to be effective against opponents in open forums and closed environments.

Loki is an approach to opponent modeling in poker that is “capable of observing its opponents, constructing opponent models and dynamically adapts its play to best


exploit patterns in the opponent’s play” [3]. This agent applies a probability distribution to the opponent’s potential cards by assigning and updating weights to those possibilities as the opponent bets, raises or calls. Statistics are then reevaluated every time the opponent takes an action updating the opponent model.

Table 2 illustrates a subset of every possible opponent hand in a hand of Texas Hold’em [3].

Table 2: Re-Weighting of Various Hands after the Flop [3].

Hand     Weight   HR      HS      ~PP    EHS    Rwt    Nwt    Comments
JH, 4H   0.01     0.993   0.99    0.04   0.99   1      0.01   very strong, not likely
AC, JC   1        0.956   0.931   0.09   0.94   1      1      strong, very likely
5H, 2H   0.6      0.004   0.001   0.35   0.91   1      0.2    weak, good potential
6S, 5S   0.7      0.026   0.006   0.21   0.76   0.9    0.54   moderate, low potential
5S, 3S   0.4      0.816   0.736   0.04   0.74   0.85   0.6    mediocre, moderate potential

where the numbers in the first column represent the number on a playing card and the capital letters represent the first letter of each suit (i.e., S is for Spades, H is for hearts). The purpose of the table is to show how each hand is re-weighted after the flop. For each possible hand the opponent modeler's algorithm calculates the initial weight (Weight), un-weighted hand rank (HR), hand strength (HS), and other factors leading to a new overall weight (Nwt) [3]. To explain how this table works, Billings gives details on QS-TS. He states that though this is a strong hand at first, when accompanied with a flop of 3H-4H-JH, the hand strength is then calculated low at 0.189 (where 1 is high and 0 is


low). This is because there are other potential hands available that can better maximize on this flop. After further calculations the algorithm yields an effective hand strength (EHS) of 0.22 for this hand. Consequently, per statistical observations of the opponent, this is lower than the hand strength they usually bet on. The potential hand is then assigned a new weight of 0.01. Furthermore, the new overall weight along with the opponent’s next action is the key predictor of whether the opponent has that hand or not.
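The re-weighting mechanism just described can be sketched generically. The rule below is only an illustration of the idea (scale down hands whose effective strength is inconsistent with the action the opponent just took), not Loki's actual formula, and the type names are hypothetical.

    #include <vector>

    struct CandidateHand {
        double weight;  // current belief that the opponent holds this hand
        double ehs;     // effective hand strength of the hand given the board
    };

    // After the opponent bets or raises, hands whose effective strength falls
    // below the strength this opponent has historically bet with become less
    // likely; stronger hands keep their weight.
    void reweightAfterBet(std::vector<CandidateHand>& hands,
                          double observedBettingThreshold) {
        if (observedBettingThreshold <= 0.0) return;  // nothing observed yet
        for (CandidateHand& h : hands)
            if (h.ehs < observedBettingThreshold)
                h.weight *= h.ehs / observedBettingThreshold;
    }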

When played 100,000 games against controlled setup models, Loki was ahead by approximately $5,000 while the other models were down $550. Loki also does well in online poker play against humans; however, not enough information is present to show that it can outperform previous best programs [3].

Poki is an AI poker agent that also plays on an online poker server with human players. It is used to store game data for poker research [11]. This application is a successor to Loki.

One improvement made in this application simplifies the opponent modeling process by replacing the re-weighting table with an artificial neural network (ANN). By eliminating the re-weighting approach, this framework alleviates the burden of accounting for a number of active opponents and betting positions. This reduces the system's labor factor [11]. Furthermore, Poki models specific opponents, which allows it to make more informed decisions.

An ANN is a machine learning algorithm that loosely represents a biological neural structure, or brain [11]. Davidson also states that neural networks “typically


provide reasonable accuracy, especially in noisy domains. However they rarely can produce better results than a more formal system built with specific domain knowledge.”

The network in Figure 6 employs the inputs from Table 3 [10]. The ANN in Poki is trained by applying back-propagation [10] (a supervised learning process for ANNs).

Figure 6: A neural network predicting an opponent's future action [10]. The figure illustrates how the neural network makes use of the 17 inputs in Table 3 and makes a decision on whether an opponent will Fold, Check/Call or Bet/Raise [10].


Table 3: Neural Network Inputs [10].

#    Type   Description
0    real   Immediate pot odds
1    real   bet ratio
2    bool   Committed
3    bool   one bet to call
4    bool   two or more bets to call
5    bool   betting round = turn
6    bool   betting round = river
7    bool   last bets called by player > 0
8    bool   player's last action was a bet or raise
9    real   0.1 × num players
10   bool   active players is 2
11   bool   player is first to act
12   bool   player is last to act
13   real   estimated hand strength for opponent
14   real   estimated potential strength for opponent
15   bool   expert predictor says they would call
16   bool   expert predictor says they would raise
17   bool   Poki is in the hand

Poki was able to routinely predict opponent actions with an 80% accuracy, while sometimes achieving accuracies of 90% [11]. While Poki makes a significant improvement to state only predictions, it requires a significant amount of data to function properly against a diverse set of opponents [20]. The next section describes how opponent models can be generalized creating a more robust agent.

Generalizing Opponents to Simplify the Opponent Space.

This section describes how opponent modeling can be generalized for use in game theoretic solutions of a partially observable extensive form game. The partially observable extensive form game covered in this section is called Guess It [19, 20]. Because of hidden game state information in partially observable games, game theoretic


solutions such as MiniMax become difficult or non-useful to apply. To make this information visible, predictions must be made using techniques such as opponent modeling. “Opponent models are necessary in games where the game state is only partially known to the player” [20]. At the same time, many opponent models are trained against specific opponents, making it time consuming to train an agent against multiple different opponents. Lockett et al. solve this problem by generalizing opponent types in Guess It, allowing an opponent model to play against many different opponents without training opponent-specific models [20].

Guess It is a game where each player is dealt an equal number of cards (usually 6), and one card is placed face down on the table for players to guess. Participants cannot see each other's cards, making this a partially observable game. The goal is for participants to figure out the identity of the face-down card by taking turns obtaining and giving information about each other's hands. In each turn, players have three possible actions. The first is to identify the face-down card. The second possible action is to “ask” the opponent for a card that the player does not have. Players can also “bluff” by asking for cards that they already have. The participant who identifies the hidden card wins, while any false identification of the hidden card is an automatic loss.

Game theoretic solutions such as MiniMax and other variants of MiniMax make it simple to find the best move in many extensive form games; however, hidden information in Guess It complicates play because the game state is not as obvious. Other game theoretic solutions include partially observable Markov decision processes (POMDPs); however, for games with more than just a few states, POMDP solutions


become intractable [24]. This makes opponent modeling a great resource for solving games with hidden information.

Opponent modeling feeds an opponent's previous actions into a learning engine to help predict the opponent's state or next action. In many cases opponent models learn through machine learning tools such as classification algorithms or neural networks. These tools are used to predict opponent characteristics by using game state features as inputs. Opponent characteristics are associated with classes identified by the features entered into the tool. Each model is opponent-specific, meaning that for each opponent, a new model must be developed. This becomes a time management challenge when training against several opponents.

Lockett solves the specific opponent problem by generalizing opponents in the game of “Guess It” and employing mixture models. A mixture model in this research is a probability distribution over a “cardinal set” of opponents [20]. In this research, the cardinal set contains four different opponent-type categories. Each one of these types has high potential to defeat another specific type as long as that type is identified. A mixture model identifier employs neural networks to predict player-types through a supervised learning process. From here, the agent can make a decision based on which opponent it is playing. Experimentation was conducted using a mixture model based player and a controlled setup player.

The four opponent-type categories of the cardinal set consist of Always-Ask, Always-Bluff, Call-then-Ask and Call-then-Bluff. Always-Ask never calls or bluffs, while Always-Bluff never asks or calls. The Call-then-Ask agent attempts to name the


center card if the other player asks for a card the Call-then-Ask agent lacks, otherwise it asks, while Call-then-Bluff calls every potential bluff; otherwise it bluffs [20].

In the cardinal set, each player-type has an effective counter measure for another specific player type. For example, Table 4 illustrates the hierarchy of all strategies. Each strategy is able to defeat one of the other strategies.

Table 4: Guess It Strategy Hierarchy Matrix [20].

Strategy          Defeats
Always-Bluff      Call-Then-Bluff
Call-Then-Bluff   Call-Then-Ask
Call-Then-Ask     Always-Ask
Always-Ask        Always-Bluff

As long as the agent correctly models which opponent-type it is playing, it can determine which strategy to use against it. This serves as the basis of generalizing player types.

Figure 7 illustrates the block diagram of the mixture-based architecture developed by Lockett et al. [20]. The model includes two modules of integration. The mixture identification module accepts the game state as an input and identifies a mixture. The mixture in this diagram is a probability distribution over all possible opponents in the cardinal set. The decision module accepts the mixture as well as a board state and makes a decision on whether to bluff or call. The default move is to ask [20].
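The counter-strategy relationships in Table 4 can be written down directly. The C++ sketch below illustrates the generalization idea (predict the most likely cardinal type, then play the type that defeats it); it is deliberately simpler than Lockett's actual decision module, which is a trained neural network, and all names are hypothetical.

    #include <array>
    #include <cstddef>

    enum class GuessItType { AlwaysAsk, AlwaysBluff, CallThenAsk, CallThenBluff };

    // Counter-strategy lookup from Table 4: the returned type defeats the
    // predicted opponent type.
    GuessItType counterStrategy(GuessItType predicted) {
        switch (predicted) {
            case GuessItType::CallThenBluff: return GuessItType::AlwaysBluff;
            case GuessItType::CallThenAsk:   return GuessItType::CallThenBluff;
            case GuessItType::AlwaysAsk:     return GuessItType::CallThenAsk;
            case GuessItType::AlwaysBluff:   return GuessItType::AlwaysAsk;
        }
        return GuessItType::AlwaysAsk;  // unreachable; keeps the compiler happy
    }

    // Choose a response from a mixture (probability distribution) over the
    // cardinal set; array order matches the enum declaration above.
    GuessItType decide(const std::array<double, 4>& mixture) {
        std::size_t best = 0;
        for (std::size_t i = 1; i < mixture.size(); ++i)
            if (mixture[i] > mixture[best]) best = i;
        return counterStrategy(static_cast<GuessItType>(best));
    }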

The controlled setup player in this experiment employs only the decision module. While there is the absence of the mixture identification module, the controlled setup has an identical training and validation regimen. The absence of a mixture identification


module in the controlled setup allows for testing the performance of the mixture identification module of the mixture based player.

Figure 7: Mixture-Based Architecture for Opponent Modeling in Guess It [20].

By generalizing opponent-types in a mixture model, the agent was able to win an average of 71.3% of 220,000 games played against 11 diverse opponents, while the controlled setup won an average of 57.7% [20]. This diverse set of opponents included the cardinal set opponents as well as unknown opponent types. When playing against all unknown player types, the mixture model based player won an average of 61.5% of its games,


while the controlled setup player won an average of 54.6% of its games. In addition, when playing 20,000 games against the control player, the mixture model won an average of 77.6% of the games. Lockett et al. also state that all results are statistically significant with p < 0.05 [20]. This shows that generalizing opponents makes it possible to apply opponent modeling on a broader scale without training against every different opponent.

Environment Value.

The discussion in this chapter involves many different methods of opponent modeling. The question to ask is which type is the best to use for an agent in a specific environment. This section addresses how to determine which opponent model type is optimal to use compared with the others.

Opponent modeling has the potential to improve a player's expected score or winning potential in a game depending on the environment in which the game takes place [4]. This value of a game is improved by utilizing two classes of agent models. The first model type predicts the opponent's state in the game and the second predicts the opponent's actions. An accurate model of both of these features improves the value of the game in the player's favor. The environment value of an opponent model is the game theoretic value improvement that a particular opponent model will provide an agent given the environment. This environment value indicates an “upper bound to the potential performance improvement of an agent” [4]. This section explains how these features (if accurately modeled) help to improve the game theoretic value of a game.

Borghetti states that an agent model is a function or method that “predicts something” about another agent [4]. As stated above, agent models can be employed to


predict information such as the state and/or the action of another agent or opponent. This information is then used to select optimal moves. Optimal moves are selected using game theoretic algorithms such as MiniMax or the probabilistic variant Expectimax (EM) [26].

The state of a game involves all information about the game at a particular instance in time. In computer science parlance, the state of a game can be compared to a node on a tree. The class of state opponent models provides a probability distribution over all possible states [4]. For example, for any number of possible states that an opponent can belong to, the probabilities of all states must sum to one.

This logic is also true for the class of opponent models that predict the actions of an opponent. Actions are classified as the strategy or move that the opponent (or any participant in the game) will make at a given state in the game. As explained in the previous paragraph, this model class provides a probability distribution over all actions.

The environment value, V_Γ(M), is the maximum improvement of the game theoretic value given a certain environment. This value is the difference between an environment using a perfect model and an environment using no model at all. Equation 4 [4] illustrates this explanation.


V_\Gamma(M) = U(M \mid \Gamma') - U(M \mid \Gamma)    (4)

where:

Γ – The original game with no opponent model
Γ′ – The transformed game with a perfect opponent model
U(M) – The utility gained in a game
M – A particular model that provides a probability distribution over actions or states

Borghetti states that given an environment, an upper bound can be developed for an environment to use as a baseline for any opponent model (in its class) to follow [4]. This upper bound baseline is developed by employing a perfect information agent model or oracle to provide an optimal state or action set. The oracle is employed to display the maximum potential that an opponent model can deliver. Assuming MiniMax (or similar algorithm) is employed and all play is rational, if a model cannot do any better, the resulting value is the Nash Equilibrium.

The precondition to finding the environment value is that the Nash Equilibrium of a game can be found. For this research, this requires solving the game of dominoes, which is not possible with today’s computers. “When this precondition does not hold, we may be better off approximating the environment value using an estimation of the distribution over likely opponents.” [4]. This research focuses on finding the expected potential value gained from the information an action opponent model can provide against specific opponents. Therefore the environment value is not calculated.


Game theory has many applications that benefit from opponent modeling today. The goal of applying opponent modeling is to provide information on the current state or future actions of the opponent. Carmel and Markovitch's [6] M* search provides a lot of insight on opponent actions to show where to capitalize on opponent mistakes in a fully observable environment. If there is a way to employ M* in a partially observable environment, this opponent modeling algorithm could add value to dominoes. Billings, Davidson, et al. [3, 11] have made vast improvements to how poker is played online by applying models to predict opponent states and actions; however, this research requires a lot of data processing to be optimal. Lockett [20] solves this problem by generalizing opponents into a cardinal set (or two-dimensional space of opponents defined by the probability of calling or bluffing [20]), in which each decision is made by applying the action of the generalized model. If opponents in dominoes can be generalized, applying opponent modeling to dominoes should be a quicker and less data intensive practice. Lastly, Borghetti shows how to choose the best opponent model for specific environments. Any opponent modeler applied in current two player dominoes research should apply Borghetti's technique as a baseline.

Other Methods.

Donkers’s [13] probabilistic opponent model search or PrOM search involves computing a probability distribution over several opponent types in order to make a decision. PrOM search applies an approximation over several strategies, whereas this research will apply the exact strategy of the opponent.


Cowling [9] creates an ensemble determinization method for Monte Carlo Tree Search (MCTS). Cowling's research involves applying MCTS and Upper Confidence bounds applied to Trees (UCT) to solve probabilistic games with hidden information. The research applies these methods to the game of Magic: The Gathering in order to study many different heuristics that can solve this complex game. While dominoes has probability and hidden information, the focus of this research is not to develop heuristics to solve dominoes, as the simplified version of the game allows for a search to the leaf nodes of the game. The main focus of opponent modeling for this research is to predict the target's strategy and attempt to find an optimal move in a shallow game.

Dominoes AI Agents.

Although there are many online dominoes games available, academic research has been completed in creating dominoes graphical user interfaces in C++, Java and BASIC. Versions have been found that employ each language, respectively. The C++ version is a single game GUI entitled Domino 1.2.0 [23]. This version is created using the Qt library. The Java version, Badomino [18], is also a single game domino interface. Smith's [28] BASIC version is a game that employs strategy tables and learning techniques to make optimal moves.

Domino 1.2.0 provides a user interface that plays a single hand of double-6 dominoes while keeping score. It also tells the user who wins the game at the end. While the project is completed in a familiar language and library there are some limitations. The first limitation is that most of the code and comments are in Russian. While the code provides a useable GUI, the code is difficult to decipher. There is also no way to scale


the code from double-6 to a smaller (or larger) set of dominoes for experimentation purposes. Lastly, there is no experimental control to run many games in sequence.

The main focus of the Badomino project is to build a well-working domino graphical user interface employing AI logic; however, more emphasis is placed on the quality of the GUI itself. The AI module employs myopic strategies; however, it does apply AI to determine the best myopic strategy to employ. Furthermore, this is also a single game system that is not scalable to smaller or larger sets of dominoes.

Smith's learning algorithm involves strategy learning. This technique applies a model that plays hundreds of games against itself to learn many situations of the game in order to make an optimal move against other opponents. Smith's research applies strategy tables to avoid applying techniques used for fully observable games. This research applies opponent modeling and roll-out techniques to overcome many of the barriers involved with partially observable games.

Conclusion

Game theory is a complex field in which research is conducted implementing search trees and other graph structures to play games at an optimal level. When traversing a tree in adversarial games, MiniMax search implements an evaluation function that estimates the value of the game at that point in order to optimize every action. The drawback of this search is that its purpose is to work in a fully observable environment. Opponent modeling is used in many applications in order to produce a better estimate of imperfect information and make the best prediction. The next


section explains how opponent modeling will be used in two-player dominoes to make the best prediction in a stochastic partially observable environment.


III. Methodology

Introduction

This chapter describes the methodology of how a dominoes artificial intelligence agent employs opponent modeling to achieve improved knowledge of the opponent strategy in order to optimize the decision making process. The chapter starts with the definition of the problem. Next, the experimental setup is explained. Lastly, this chapter explains how each agent of the experimental setup is tested.

Problem Definition

Goals and Hypothesis.

The goal of this research project is to show how well modeling an opponent’s strategy can improve a player’s decision in two player dominoes. When an opponent’s strategy is unknown (assuming all players are rational), the optimal strategy is to achieve a Nash equilibrium through an adversarial search. MiniMax (a type of adversarial search) can be applied to this research by creating a game tree representation for the game of dominoes. Figure 8 illustrates a simple game of dominoes where the player and opponent both pull three random dominoes from a full set of double-6 dominoes. Each board picture represents a node on a game tree. The position on the game tree is called the game (or board) state. This tree represents all possibilities of playing this game instance between the two players.


If the opponent plays a non-rational strategy, another equilibrium can be reached that capitalizes on the opponent's non-rational actions. This is achieved by applying M* search [6], which replaces MIN's strategy with an opponent model. In order to use M* search, an opponent model (or strategy) must be identified. This concept generates the research question: how does M* search perform in the game of dominoes? In addition to finding and evaluating the opponent model, another challenge arises in applying M* search to a partially observable game: the agent is far less certain of its location in the game tree than it would be in a fully observable environment. Without accurate board state data, the search method may traverse a game tree for an erroneous game instance, reducing the possibility of finding an optimal play.
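As a concrete illustration of this idea, the sketch below (Python; the state interface, function names, and opponent-model interface are hypothetical, not the thesis implementation) shows how such a search differs from MiniMax only at MIN nodes, where the opponent model supplies the predicted move instead of a worst-case minimization.

```python
# Minimal sketch of M*-style search, assuming a hypothetical game-state
# interface (is_terminal, legal_actions, apply, to_move) and an opponent
# model exposing predict_action(state).

def m_star(state, depth, opponent_model, evaluate):
    """Estimated value of `state` for the maximizing player."""
    if depth == 0 or state.is_terminal():
        return evaluate(state)

    if state.to_move == "player":
        # MAX node: identical to MiniMax -- consider every legal play.
        return max(m_star(state.apply(a), depth - 1, opponent_model, evaluate)
                   for a in state.legal_actions())

    # MIN node: instead of assuming a minimizing opponent, follow the single
    # action the opponent model predicts the target would take here.
    predicted = opponent_model.predict_action(state)
    return m_star(state.apply(predicted), depth - 1, opponent_model, evaluate)
```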

Approach.

To solve these problems, this project centers on a hierarchical approach that employs an opponent classifier to predict opponent types [27]. A decision engine for the player agent then uses the estimated opponent strategy in M* search. To handle the effects of hidden information, Monte Carlo sampling iterates through multiple possible game states according to the actual probability distributions of unknown dominoes in the boneyard and the opponent's hand.
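A minimal sketch of this sampling step follows (all names are hypothetical; in practice, pass and draw history would further constrain the distribution). Each sample deals the dominoes the player cannot see between the opponent's hand and the boneyard with equal likelihood, producing one fully specified game state for the search to evaluate.

```python
import random

def sample_hidden_state(all_dominoes, my_hand, played, opponent_hand_size, rng=random):
    """Produce one determinized deal of the unseen dominoes.

    `all_dominoes` is the full double-6 set; `my_hand` and `played` are the
    dominoes the player can observe. Everything else is unseen and is split
    uniformly at random between the opponent's hand and the boneyard.
    """
    seen = set(my_hand) | set(played)
    unseen = [d for d in all_dominoes if d not in seen]
    rng.shuffle(unseen)
    return unseen[:opponent_hand_size], unseen[opponent_hand_size:]
```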

The hierarchical approach to this decision system involves first predicting an opponent type and then applying that opponent's strategy in M* search. Opponent types are from here on referred to as "targets," and three targets are identified for this research. Each target applies a different strategy to play the game. The first target is the random-play agent, which makes legal plays of pseudo-randomly selected dominoes on its turns. The second target is the myopic scoring agent, which tries to score the most points at each play without considering the consequences for future plays. The final target is the myopic defense agent, which blocks the player from taking an action whenever there is an opportunity; it also keeps the board count low to deny the player high scores during its turns.

The random target agent plays legal moves in a pseudo-random fashion. This target selects its play from the list of legal moves, with equal likelihood for any legal move. The seed for its random number generator can be generated either from the system clock for individual games or from a game iteration number for experimentally controlled games during repeated testing.
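For illustration, the random target reduces to a uniform choice over legal moves; the optional seed reflects the two seeding modes described above (a sketch with hypothetical names, not the thesis code).

```python
import random

def random_target_play(legal_moves, seed=None):
    """Pick any legal move with equal likelihood.

    `seed=None` draws entropy from the system (clock) for individual games;
    passing a game iteration number gives repeatable, experimentally
    controlled behavior.
    """
    rng = random.Random(seed)
    return rng.choice(legal_moves)
```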

The myopic scoring target agent is a greedy scorer that selects the legal move that scores the highest, with no concern for the future value of the move later in the game. If a tie arises between possible moves, or no score is available on that play, the myopic scoring player plays the domino that produces the highest board count in order to increase high-scoring opportunities for future plays. (The board count is the sum of all of the outside numbers of the domino chain.)
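A sketch of this greedy rule follows; `score_of` and `board_count_after` are assumed helper functions (not defined in this excerpt) that return the points a move scores and the board count it leaves behind.

```python
def myopic_scoring_play(legal_moves, board, score_of, board_count_after):
    """Greedy scorer: highest immediate score; ties (and scoreless turns)
    broken by the move that leaves the highest board count."""
    return max(legal_moves,
               key=lambda move: (score_of(board, move),
                                 board_count_after(board, move)))
```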

The myopic defense target agent's goal is to reduce the number of plays available to its opponent and to prevent the opponent from obtaining a high score. This agent selects actions that limit the player's legal plays on the next turn. Figure 9 shows examples of board states from least favorable to most favorable for the defensive target.


Figure 9: Play options available for the player based on the target's defensive strategy. In this example, a)'s playable options (from left to right) are 2, 3, and 1, giving the player three options to choose from; b)'s playable options (from left to right) are 1 and 3, allowing two playable options; and c)'s only playable option is 3.

Limiting playable options restricts the opponent's available actions, potentially creating a pull or pass instance. Furthermore, if there are no available defensive plays, or there is a tie between defensive options, the defensive target plays the domino that produces the lowest board count to keep all possible scores low.
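Under the same assumptions as the previous sketches, the defensive rule can be expressed as choosing the move that minimizes the opponent's reply options, with the lowest resulting board count as the tiebreaker; `opponent_options_after` and `board_count_after` are hypothetical helpers.

```python
def myopic_defense_play(legal_moves, board, opponent_options_after, board_count_after):
    """Defensive target: fewest legal replies left to the opponent first,
    then the lowest resulting board count."""
    return min(legal_moves,
               key=lambda move: (opponent_options_after(board, move),
                                 board_count_after(board, move)))
```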

The player agent computes the M* search decision before each play. The opponent model used in M* is the same strategy that one of the targets applies. Since M* is a variant of MiniMax search [6], it maximizes the player's decision based on the predicted decisions of the target. Monte Carlo rollouts are applied to the M* search to iterate through many possible board state situations. These rollouts provide the decision engine with an expected game state outcome so that the player can make an optimal decision in an uncertain environment, given that the opponent model matches the true opponent.
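Putting the pieces together, a rollout-style evaluation might look like the following sketch, which reuses the hypothetical `sample_hidden_state` and `m_star` functions above; the average over samples approximates the expected outcome of an action under the hidden-information distribution. The `determinize` call on the visible state is an assumed helper that builds one fully observable state consistent with what the player can see.

```python
def rollout_value(action, visible_state, opponent_model, evaluate,
                  n_samples=200, depth=6):
    """Average M* value of `action` over many sampled completions of the
    hidden information (opponent hand and boneyard)."""
    total = 0.0
    for _ in range(n_samples):
        # One determinized, fully specified state consistent with observations.
        full_state = visible_state.determinize(sample_hidden_state)
        total += m_star(full_state.apply(action), depth, opponent_model, evaluate)
    return total / n_samples
```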

Testing.

Testing this approach involves measuring the prediction accuracy of an opponent classifier as well as measuring how well M* search performs in the game of dominoes against specific player types. The accuracy of the opponent classifier is evaluated by applying a confusion matrix of all target predictions. Table 5 illustrates the data representation of this matrix.

Table 5: Confusion matrix of the opponent modeler.

                                Predicted Target
  Actual Target        Random      Myopic Scoring      Myopic Defense
  Random                  A               B                   C
  Myopic Scoring          D               E                   F
  Myopic Defense          G               H                   I

Capital letters represent the number of predictions computed by the opponent classifier for a specific target. The rows indicate the actual target type (or labels of the sample data) and the columns represent the predicted target type. The letters in the diagonal (A, E and I) represent all accurately predicted data while all other letters represent the inaccurate predictions [15]. The accuracy of the opponent classifier is computed by applying the table values to

$AC = \frac{A + E + I}{S_{total}}$ (5)

where $AC$ is the accuracy; $A$, $E$ and $I$ are the numbers of accurately predicted targets; and $S_{total}$ is the total number of targets evaluated. Furthermore, when applied to the opponent classifier, the testing prediction accuracy of the classifier is calculated by

$AC_{OM} = \frac{p}{S_{turns}}$ (6)

where $AC_{OM}$ is the prediction accuracy of the opponent classifier, $p$ is the number of correctly predicted targets and $S_{turns}$ is the total number of turns predicted by the opponent classifier.
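As a worked example of Equations (5) and (6), the computation reduces to a ratio of correct predictions to total predictions. The sketch below (illustrative only) stores the Table 5 confusion matrix row-major, with rows as actual targets and columns as predicted targets.

```python
def classifier_accuracy(confusion):
    """Equation (5): diagonal (correct) predictions over all evaluated targets.
    confusion[i][j] counts samples of actual target i predicted as target j."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Example 3x3 matrix [A B C; D E F; G H I] with mostly correct predictions.
example = [[40, 3, 2],
           [4, 38, 3],
           [1, 5, 44]]
print(classifier_accuracy(example))  # (40 + 38 + 44) / 140 = 0.871...
```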

To test the potential and actual performance of an opponent classifier with the M* search decision engine, an all-knowing oracle is introduced. The oracle is a method that predicts the target’s state and future actions with perfect accuracy. The oracle is applied to the M* decision engine in place of the opponent classifier to predict the opponent’s type. A set of games is played and all decisions made by the M* decision engine with the oracle are verified to validate M*’s performance.
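Conceptually, the oracle is simply a predictor wired directly to the ground truth; a minimal sketch (hypothetical interface, not the thesis code) makes the contrast with the trained classifier explicit.

```python
class TargetOracle:
    """All-knowing predictor: always returns the true target type and the
    action that target will actually take (used to upper-bound performance)."""

    def __init__(self, true_target):
        self.true_target = true_target

    def predict_type(self, observed_history):
        return self.true_target.target_type

    def predict_action(self, state):
        return self.true_target.choose_action(state)
```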



The expected value gain (EVG) of M* with an opponent type over the Nash equilibrium player is calculated by applying

$EVG(T) = V(M^*_T \mid T) - V(EM \mid T)$ (7)

where $T$ is the target, $M^*_T$ is M* search with the target's strategy, $EM$ is Expectimax, and $V(\cdot \mid T)$ denotes the value obtained against target $T$. The value gained by applying the M* search decision engine is calculated in both a perfect information and an imperfect information environment. The perfect information environment calculates the theoretical upper bound potential of the M* search decision engine with an opponent type. The imperfect information environment calculates the actual value of the engine under real-world conditions.
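In practice, Equation (7) can be estimated from matched sets of games, for example as the difference of mean scores (or win/tie counts) achieved by the two engines against the same target. The sketch below assumes per-game score lists as inputs; it is an illustration of the calculation, not the thesis implementation.

```python
def expected_value_gain(mstar_scores, expectimax_scores):
    """Estimate EVG(T): mean value obtained by M*_T against target T minus
    the mean value obtained by Expectimax against the same target."""
    def mean(values):
        return sum(values) / len(values)
    return mean(mstar_scores) - mean(expectimax_scores)
```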

The actual value of the M* decision engine with an opponent model is calculated by applying the opponent model to the engine. This value is a representation of how well the opponent model predicts the target. Similar to the expected value gain calculations, the actual value gain is calculated in perfect and imperfect information environments to show perfect state information and real world conditions respectively.

Table 6 illustrates a four-quadrant matrix of desired results for testing the M* algorithm. Rows of the matrix represent the state information given for all games played. Columns represent the given target predictor. With this configuration, quadrants one and two use a perfect state information environment, while quadrants three and four apply Monte Carlo rollouts (MCR) to account for hidden information. Quadrants one and three are EVG calculations, while two and four are actual values.


Table 6: Quadrant chart of experimental results.

                                Target Information
  State Information     Target Oracle               OM
  State Oracle          1. Expected Value Gain      2. Actual Value      (Upper Bound)
  MCR                   3. Expected Value Gain      4. Actual Value      (Real World)

The perfect information environment yields the upper bound of M*'s potential for both expected and actual values. The target oracle predicts a target's strategy with 100% accuracy, which allows the M* engine to apply the correct target type at every play of every game; therefore each game reflects an optimal decision at every play with respect to M*'s capabilities. Likewise, when given perfect state information, the Expectimax engine plays optimally with respect to its capabilities. Thus, the maximum gain is calculated by taking the difference in value between the two engines with respect to score differentials or total games won/tied.

Dominoes Experimental Setup

This section describes the experimental setup that answers the question of whether an opponent model can improve a player's decisions in the game of two-player dominoes, and by how much. The setup consists of an adversarial search decision engine as well as three target agents. The adversarial search decision engine consists of the M* and Expectimax search algorithms. The player agent employs
