11.2 Other General Game Players
11.2.1 Evaluation Function Based Approaches
In the first few years, the most successful players were based on minimax search with iterative deepening, which allows them to send answers early, along with functions to evaluate those states that could not be completely solved. Here we present three of these players, not in chronological order but in order of increasing complexity of the generated evaluation function.
UTEXASLARG
One of the first papers on this topic is due to Kuhlmann et al. (2006) of the Learning Agents Research Group of the University of Texas, the authors of the GGP agent UTEXASLARG. Their search is based on αβ pruning (Knuth and Moore, 1975), which is a faster and more memory-efficient extension of the minimax algorithm. For those searches where they do not use a heuristic, the terminal nodes are visited in a depth-first manner. If a heuristic is present, they make use of iterative deepening (Korf, 1985). To further enhance the search, they use transposition tables (Reinefeld and Marsland, 1994) and the history heuristic (Schaeffer, 1983).
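The combination of αβ pruning and iterative deepening can be sketched as follows. This is a minimal illustration under stated assumptions, not the UTEXASLARG implementation: the game interface (`is_terminal`, `reward`, `legal_moves`, `apply`), the toy game `Nim` and the constant heuristic are all hypothetical, and transposition tables and the history heuristic are omitted.

```python
import math


class Nim:
    """Toy two-player game, assumed for illustration only: players alternately
    take 1 or 2 stones and whoever takes the last stone wins.
    A state is (stones_left, player_to_move); rewards are given for player 0."""

    def is_terminal(self, state):
        return state[0] == 0

    def reward(self, state):
        # If player 1 is to move in the final state, player 0 took the last stone.
        return 100 if state[1] == 1 else 0

    def legal_moves(self, state):
        return [n for n in (1, 2) if n <= state[0]]

    def apply(self, state, move):
        return (state[0] - move, 1 - state[1])


def alphabeta(game, state, depth, alpha, beta, maximizing, heuristic):
    """Depth-limited alpha-beta search; unsolved states get a heuristic value."""
    if game.is_terminal(state):
        return game.reward(state)
    if depth == 0:
        return heuristic(state)
    value = -math.inf if maximizing else math.inf
    for move in game.legal_moves(state):
        child = game.apply(state, move)
        score = alphabeta(game, child, depth - 1, alpha, beta,
                          not maximizing, heuristic)
        if maximizing:
            value = max(value, score)
            alpha = max(alpha, value)
        else:
            value = min(value, score)
            beta = min(beta, value)
        if alpha >= beta:  # the remaining siblings cannot change the result
            break
    return value


def iterative_deepening(game, state, max_depth, heuristic):
    """Repeat the search with growing depth limits; an answer is available
    after every completed iteration."""
    best_move = None
    for depth in range(1, max_depth + 1):
        current_best, best_value = None, -math.inf
        for move in game.legal_moves(state):
            value = alphabeta(game, game.apply(state, move), depth - 1,
                              best_value, math.inf, False, heuristic)
            if value > best_value:
                best_value, current_best = value, move
        best_move = current_best
    return best_move
```

In a real player the loop in `iterative_deepening` would be bounded by the move clock rather than by a fixed `max_depth`.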
To be able to handle multiplayer games in the minimax setting they group all players into two teams, the team that cooperates with the player their agent controls and the opponents. Nodes where it is one of the cooperating players’ turn to choose among a number of moves are treated as maximization nodes, the others as minimization nodes.
Another extension to classical minimax search is for simultaneous move games. In this situation they assume the opponents to choose their moves at random and choose the action that (together with their team mates’ moves) maximizes the expected reward.
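The treatment of simultaneous moves can be illustrated with a short sketch. It assumes a hypothetical function `reward(own_move, opponent_joint_move)` that returns the (estimated) reward of a joint move, and it treats the moves of the own team as a single choice; neither detail is taken from the original paper.

```python
from itertools import product


def best_action_vs_random(own_moves, opponent_moves, reward):
    """Pick the own move with the highest expected reward, assuming every
    opponent chooses uniformly at random among its moves.

    own_moves      -- list of moves available to our side
    opponent_moves -- one list of moves per opponent
    reward         -- hypothetical function (own_move, joint_opponent_move) -> number
    """
    joint_moves = list(product(*opponent_moves))

    def expected_reward(move):
        return sum(reward(move, joint) for joint in joint_moves) / len(joint_moves)

    return max(own_moves, key=expected_reward)
```

For example, if move `a` yields 100 or 0 depending on the opponent's choice while move `b` always yields 60, the expected rewards are 50 and 60, so `b` is chosen.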
CHAPTER 11. PLAYING GENERAL GAMES
To come up with a heuristic they mainly identify successor relations and boards. For the former they only allow predicates of arity two (such as (succ 1 2)), for the latter predicates of arity three (such as (cell a 1 wr)). With the help of the successor function they also identify counters, for which they look for a GDL rule of the form
(<= (next (<counter> ?<var1>)) (true (<counter> ?<var2>)) (<successor> ?<var1> ?<var2>))
with <counter> being the identified counter predicate, <successor> the previously found successor relation and ?<var1> and ?<var2> two variables. For the boards they identify the arguments that specify the coordinates of a position on the board and the markers that occupy the positions. For the coordinates they also can tell if they are ordered, which is the case if they are part of a successor relation. Furthermore, they distinguish between markers and pieces. The latter can occupy only one position in one state, while the former can be on any number of positions at the same time.
They identify all these structures by simulating a number of purely random moves, similar to what we do to find the sets of mutually exclusive fluents during our instantiation process (see Section 9.1.4). The coordinates of a board are the input parameters, the markers the output parameters.
Additionally they identify all the players that should be in league with the one their agent represents. For this they compare the rewards at the end of such a simulated game. If they always get the same rewards, they are in the same team, otherwise they are not.
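The team detection described above amounts to a simple comparison over the simulated games. The following sketch assumes that each simulated game is summarized as a dictionary from role names to final rewards; this representation and the function name `teammates` are hypothetical.

```python
def teammates(our_role, simulated_rewards):
    """Return all roles that received the same reward as our_role in every
    simulated game.

    simulated_rewards -- list of dicts {role: reward}, one per random game
    """
    roles = simulated_rewards[0].keys()
    return {
        role
        for role in roles
        if role != our_role
        and all(game[role] == game[our_role] for game in simulated_rewards)
    }
```

Note that with few simulations two opponents may coincidentally receive equal rewards, so the result becomes more reliable as the number of simulated games grows.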
Using these structures they can calculate certain features such as the x- and y-coordinates of pieces, the (Manhattan) distance between each pair of pieces, the sum of these pair-wise distances, or the number of markers of each type.
With these features they generate a set of candidate heuristics, each of which is the maximization or the minimization of one feature. They implement these candidate heuristics as board evaluation functions, which can be used to evaluate a given state.
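As an illustration of such features and the candidate heuristics derived from them, consider the following sketch. It assumes that ordered coordinates have already been mapped to integers via the successor relation and that a state is represented simply by the list of piece positions; all names are hypothetical.

```python
from itertools import combinations


def manhattan(p, q):
    """Manhattan distance between two (x, y) positions."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def pairwise_distance_sum(positions):
    """Sum of the Manhattan distances over all pairs of pieces."""
    return sum(manhattan(p, q) for p, q in combinations(positions, 2))


def candidate_heuristics(features):
    """For every feature create one heuristic that maximizes it and one that
    minimizes it (the latter by negating the feature value)."""
    candidates = {}
    for name, feature in features.items():
        candidates["max_" + name] = feature
        candidates["min_" + name] = lambda state, f=feature: -f(state)
    return candidates
```

For the positions (1, 1), (1, 4) and (3, 1) the pair-wise distances are 3, 2 and 5, so the feature value is 10; the maximizing heuristic returns 10 and the minimizing one returns -10.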
Their agent consists of a main process, which spawns several slave processes. Each is assigned the current state and one of the candidate heuristics to use for the internal search. Each slave informs the main process of the best action it could find so far. Additionally, at least one slave performs exhaustive search, i. e., it does not use a heuristic.
Deciding which of the proposed actions to choose is not trivial. Of course, if the exhaustive search was successful, its action should be chosen, as it is supposed to be optimal. Unfortunately, the authors do not provide any more insight into the choice of the best action in case the game is not solved.
FLUXPLAYER
The winner of the second general game playing competition in 2006, FLUXPLAYER (Schiffel and Thielscher, 2007a,b), uses a similar approach. It also performs a depth-first based minimax search with iterative deepening along with transposition tables and the history heuristic, and uses a heuristic evaluation function for those non-terminal states that it will not expand in the current iteration. There are, however, some more or less subtle differences.
For multiplayer turn-taking games it handles each node of the search tree as a maximization node for the active player. This corresponds to the assumption that each player wants to maximize his own reward, no matter what the opponents will get.
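Treating every node as a maximization node for the active player can be sketched on an explicit game tree. The tree encoding below (dictionaries with either a `rewards` tuple or a `player` index and `children`) is an assumption made for this example and is not taken from FLUXPLAYER.

```python
def max_for_active_player(node):
    """Return the reward tuple reached if every player maximizes only its own
    reward, regardless of what the others get.

    A leaf is {"rewards": (r0, r1, ...)}; an inner node is
    {"player": index_of_active_player, "children": [...]}.
    """
    if "rewards" in node:
        return node["rewards"]
    results = [max_for_active_player(child) for child in node["children"]]
    return max(results, key=lambda rewards: rewards[node["player"]])
```

In a three-player tree where player 1 would answer one root move with the outcome (30, 60, 10) and player 2 would answer the other with (50, 20, 30), player 0 prefers the second move, as it yields 50 instead of 30.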
In case of simultaneous move games it serializes the moves of the players and performs its own move first. The opponents are then assumed to know the chosen move. The advantage of this approach is that the classical pruning techniques can be applied, while the downside is that it may lead to suboptimal play (e. g., in ROSHAMBO, or ROCK-PAPER-SCISSORS, the information of the move of one player easily enables the opponent to win the game).
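The downside of serializing simultaneous moves can be made concrete for ROCK-PAPER-SCISSORS. The sketch below assumes rewards of 100 for a win, 50 for a draw and 0 for a loss; these values are chosen for illustration.

```python
# BEATS[m] is the move that m defeats.
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}


def reward(own, other):
    """Assumed rewards: 100 for a win, 50 for a draw, 0 for a loss."""
    if own == other:
        return 50
    return 100 if BEATS[own] == other else 0


def serialized_value(own_move):
    """Value of own_move if the opponent moves second and knows own_move."""
    return min(reward(own_move, other) for other in BEATS)


values = {move: serialized_value(move) for move in BEATS}
```

Every entry of `values` is 0: once the own move is fixed and known, the opponent can always answer with the winning move, so the serialized search considers all moves equally bad although the game is in fact balanced.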
Furthermore, it does not perform pure iterative deepening search, but uses non-uniform depth-first search, i. e., it uses the depth limit of the current iteration only for the move that was evaluated with the highest value in a previous search, while the limit for the other ones is gradually smaller. This is supposed to help especially in games with a high branching factor, as the maximal search depth will be higher than in classical iterative deepening search.
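One way to picture the non-uniform depth limits is the following sketch. The text above only states that the limit becomes gradually smaller for moves with lower values; the concrete schedule used here (one level less per rank, never below 1) and all names are assumptions.

```python
def depth_limits(previous_values, depth_limit, step=1):
    """Assign a depth limit to each move based on its value in the previous
    search: the best move gets the full limit of the current iteration, the
    others get gradually smaller limits.

    previous_values -- dict {move: value from the previous iteration}
    step            -- assumed reduction per rank (hypothetical parameter)
    """
    ranked = sorted(previous_values, key=previous_values.get, reverse=True)
    return {
        move: max(1, depth_limit - step * rank)
        for rank, move in enumerate(ranked)
    }
```

With previous values of 10, 70 and 40 for the moves `a`, `b` and `c` and a current limit of 5, the moves are searched to depths 3, 5 and 4, respectively.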
The main difference comes with the heuristic evaluation function. FLUXPLAYER uses a similar approach to find the structures of the game as UTEXASLARG, but can also detect game boards of any dimension and with any number of arguments specifying the tokens placed on the positions, i. e., it allows for arbitrary numbers of input and output parameters for each predicate. Furthermore, the heuristic function mainly depends on the similarity to a terminal state and the rewards that might be achieved. It tries to evaluate how close it is to a terminal state that corresponds to a win for it. For this it uses methods from fuzzy logic in order to handle partially satisfied formulas.
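The idea of evaluating partially satisfied formulas with fuzzy logic can be sketched as follows. The concrete operators are not given here, so the sketch uses the product t-norm and its dual as stand-ins; the truth value `p` for satisfied atoms and the tuple encoding of formulas are likewise assumptions and not FLUXPLAYER's actual choices.

```python
def fuzzy_eval(formula, state, p=0.75):
    """Degree in [0, 1] to which formula holds in state.

    formula -- ("atom", fluent), ("not", f), ("and", f1, ...) or ("or", f1, ...)
    state   -- set of fluents that are true
    p       -- assumed truth value of a satisfied atom (1 - p if unsatisfied)
    """
    operator = formula[0]
    if operator == "atom":
        return p if formula[1] in state else 1 - p
    values = [fuzzy_eval(sub, state, p) for sub in formula[1:]]
    if operator == "not":
        return 1 - values[0]
    if operator == "and":  # product t-norm (stand-in)
        result = 1.0
        for value in values:
            result *= value
        return result
    if operator == "or":  # dual of the product t-norm (stand-in)
        result = 1.0
        for value in values:
            result *= 1 - value
        return 1 - result
    raise ValueError("unknown operator: " + str(operator))
```

For a goal that is a conjunction of three fluents, a state satisfying two of them is evaluated higher than a state satisfying only one, so the search is guided towards states that are closer to satisfying the goal.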
CLUNEPLAYER
The third approach in a very similar direction is due to Clune (2007, 2008), whose agent CLUNEPLAYER won the first competition in 2005 and finished second in 2006 and 2007. Again, there are some differences to the other approaches.
In a first step he identifies certain expressions of the game at hand. In particular, for expressions containing variables he determines which constants can occur as which arguments. These constants are substituted for the variables to generate more specific candidate expressions.
From these expressions he creates candidate features, which represent various interpretations of the expressions.
A first interpretation is the solution cardinality interpretation. This corresponds to the number of distinct solutions to a given expression in a given state, such as the number of black pieces on the board in a state of the game of CHECKERS.
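A minimal sketch of the solution cardinality interpretation, assuming that a state is a set of ground facts given as tuples and that an expression is a tuple in which `None` marks a variable (both are hypothetical encodings):

```python
def solution_cardinality(expression, state):
    """Number of distinct facts in state that match expression, where None
    in the expression stands for a variable."""

    def matches(fact):
        return len(fact) == len(expression) and all(
            pattern is None or pattern == value
            for pattern, value in zip(expression, fact)
        )

    return sum(1 for fact in state if matches(fact))
```

Applied to the expression `("cell", None, None, "black")`, the function counts the cells occupied by black pieces in the given state.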
The second interpretation is the symbol distance interpretation. For this he uses the binary axioms to construct a graph over the constants for the two parameters. The vertices of this graph are the constants, and an edge is placed between two of these if they are the two arguments of one such binary axiom. Examples would be the successor relation or the axiom to determine the next rank in the game of CHECKERS. The symbol distance is then the shortest path distance between two symbols (the constants) in the constructed graph. With this, the distance between two cells on a board can be calculated.
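The symbol distance can be computed by a breadth-first search on the constructed graph. The sketch below assumes that the binary axioms are given as a list of constant pairs and that the edges are undirected; the function name is hypothetical.

```python
from collections import deque


def symbol_distance(pairs, source, target):
    """Length of the shortest path between two constants in the graph whose
    edges are the argument pairs of a binary axiom; None if unreachable."""
    graph = {}
    for a, b in pairs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, distance = queue.popleft()
        if node == target:
            return distance
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, distance + 1))
    return None
```

For the successor relation given by the pairs (1, 2), (2, 3) and (3, 4), the distance between the symbols 1 and 4 is 3.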
The final interpretation is the partial solution interpretation. This applies to expressions containing conjunctions and disjunctions. For a conjunction, it gives the number of satisfied conjuncts, so that this is somewhat similar to Schiffel and Thielscher’s (2007b) degree of goal attainment.
For each candidate expression he generates corresponding features by applying the possible interpretations. He also calculates some additional features by observing symmetries in the game’s description.
To find out which of the generated candidate features are more relevant he uses the measure of stability. A feature that does not wildly oscillate between succeeding states is supposed to be more stable than one that does so.
For the calculation of the stability of a feature he performs a random search to generate a number of reachable game states. For each feature he calculates the value for each of the generated states. Additionally, he computes the total variance of the feature over all generated states as well as the adjacent variance, which is the sum of the squares of the differences of the values of the feature in subsequent states divided by the number of pairs of subsequent states. The stability is the total variance divided by the adjacent variance, so that for features that wildly oscillate the stability will be low, while it will be higher if it changes only slightly between subsequent states.
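The stability computation described above can be sketched as follows. The sketch assumes that the generated states are grouped into sequences of subsequent states (one list of feature values per random run) and that the total variance is the population variance; both are assumptions, as is the handling of a zero adjacent variance.

```python
import math
from statistics import pvariance


def stability(sequences):
    """Total variance of a feature divided by its adjacent variance.

    sequences -- list of lists; each inner list holds the feature values of
                 subsequent states of one random run
    """
    values = [value for sequence in sequences for value in sequence]
    total_variance = pvariance(values)

    # Squared differences between the values in subsequent states.
    squared_steps = [
        (after - before) ** 2
        for sequence in sequences
        for before, after in zip(sequence, sequence[1:])
    ]
    adjacent_variance = sum(squared_steps) / len(squared_steps)

    if adjacent_variance == 0:
        # Assumed convention: a feature that never changes between
        # subsequent states is treated as maximally stable.
        return math.inf
    return total_variance / adjacent_variance
```

A feature that grows steadily, such as 0, 1, 2, 3, 4, has a total variance of 2 and an adjacent variance of 1, giving a stability of 2. A feature that oscillates as 0, 4, 0, 4, 0 has a total variance of 3.84 but an adjacent variance of 16, giving a much lower stability of 0.24.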
For the heuristic evaluation function he takes into account the approximation of the reward, the degree of control the player has along with the stability of this feature and the probability that a state is terminal along with the stability of this feature. For the approximation of the reward he uses only those features that are based on expressions that influence the goal rules. The control function is intended to capture the number of available moves for the players—the player with more possible moves has more control over the game. Finally, the termination property is used to determine the importance of reward and control, e. g., in the beginning of a game it might be more important to gain more control over the game, while in the endgame getting a higher reward is of paramount importance.
Combining these properties results in a heuristic evaluation function that captures the game in a better way than the other approaches. Kuhlmann et al. (2006) use a number of heuristics and must choose which of them returned the best move, and Schiffel and Thielscher (2007b) look only at the possible outcomes, hoping that achieving a large similarity to the winning goal state earlier will assure higher rewards. Here, in contrast, a more flexible heuristic is applied. As a result, in CHESS for example it identifies the differences in the number of different pieces of the two players as being important (with the number of queens being most important and the number of pawns the least). In OTHELLO it does not use the number of tokens in the player’s color as an estimate, as this feature is highly unstable, but prefers those states where the corners are occupied by the player, which is a much more stable feature and yields much better guidance to a terminal state achieving a higher reward.