Testing and Evaluation - Knowledge and Strategy-based Computer Player for Texas Hold'em Poker

This chapter tests and evaluates both the system and all the computer players described in Chapter 4, Methodology. First, the system design and architecture are tested using the white box method. This is followed by black box result checking of the system. Finally, all the players, i.e. Random – 1, Random – 2, Anki – V1, and Anki – V2, are evaluated, with most emphasis on the latter two. The results and evaluation are also compared to the previous research on Loki, Poki, etc. discussed in Chapter 2, Literature Review.

5.1 - System Test – White and Black Box Testing

The first batch of experimentation and testing performed on the program concerned its completeness and soundness. Its stability needed to be proven in order to justify any results obtained from it later on. The architecture of the program was put through brute-force, worst-case scenario tests to try to prove its soundness. These tests were conducted on the Human vs. Human specification, so that each step of the program could be monitored and observed.

The following tests were conducted and found to complete successfully:

1. The program started up without any errors and provided completely random opening hands to both the players of the game. Also, absolutely no repetition or pattern in the cards was found over a number of hand requests.

2. The program was found to provide the Human Player with all the necessary game state information, including the cards he/she held, community cards and the financial state of the game. All the information was found to be accurate. An example screen shot of the program is provided in Figure 17.

3. The betting options were found to adhere to their respective constraints, along with allowing the player to re-play the last move in case an erroneous choice was entered, e.g. raising when it is not permitted.

4. The betting rounds were found to progress in the manner required, and ended upon equal commitment of monies from both players, i.e. in the cases of two checks, two bets or a raise followed by a bet.

5. Each of the player's actions such as betting or checking was displayed clearly, with no information of the opponent available to a player.

6. Folding sets the game into a quick end mode, whereby all the community card displays and betting rounds are bypassed to reach the end. The cards of each player are not displayed on the board either.

7. Finally, the situation in which a player runs out of money is addressed. The program was found to display the required community cards and proceed directly to the showdown without any further betting requests.

Figure 17. Screen shot of normal gameplay in the Prolog window
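The first test above (random opening hands with no repeated cards) can be illustrated with a short sketch. This is not the thesis program, which runs in Prolog; it is a minimal Python illustration, and the card encoding and the function name `deal_heads_up` are assumptions made here for the example.

```python
import random

RANKS = "23456789TJQKA"
SUITS = "cdhs"  # clubs, diamonds, hearts, spades (assumed encoding)

def deal_heads_up(rng=random):
    """Deal two hole cards to each of two players plus five community cards."""
    deck = [rank + suit for rank in RANKS for suit in SUITS]
    # sample() draws without replacement, so no card can appear twice in a game
    drawn = rng.sample(deck, 9)
    return {
        "player_1": drawn[0:2],
        "player_2": drawn[2:4],
        "community": drawn[4:9],
    }

deal = deal_heads_up(random.Random(17))
all_cards = deal["player_1"] + deal["player_2"] + deal["community"]
print(deal)
```

Drawing all nine cards in one sampling step is a design choice that makes repetition impossible by construction, which is the property the first test checks by observation.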

Following the successful completion of the above mentioned tests, the winning pattern evaluator also needed to be tested. Strict rules are available for this section of the program and their stability and correct implementation needed to be proven. The winning pattern finder was put through a

was tested by providing the tester with identical hands that differed only on the 6th most important card.

The following tests were completed on the winning pattern evaluator:

1. The correct winner was identified using the priority table explained in Section 3.4. For example, Full House won over a Flush, etc.

2. In the case that the patterns on both players were found to match, the owner of the highest ranking card of the pattern was chosen as the winner. For example, 'Three Jacks' won over 'Three 8s'.

3. The comparison of hands with similar patterns was restricted to the correct number of kickers. For example, there are 2 kickers in 'Three-of-a-Kind', but none in a Sequence.

4. In the case of the best 5-card hand of each player being the same, or of similar importance, the result was announced as a Draw.

5. The correct amount of money was allotted to the players at the end of the pattern evaluation, i.e. the winner got all the money in the pot, or the money was divided between the players in the case of a draw.

6. All the information concerning the cards being played, the winning pattern type, the winning player and the new financial state of the game was displayed. A screen shot of a final result is provided in Figure 18.

Figure 18. Screen shot of a showdown (end-game scenario) with Winning Evaluator
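The comparison rules checked in the tests above (pattern type first, then the rank of the pattern, then the kickers, with a split pot on a draw) can be sketched as a tuple comparison. This is a hedged illustration only: the ordering in `CATEGORIES` is the standard poker ranking and is assumed to match the priority table of Section 3.4, ranks are encoded as integers from 2 to 14, and the function names are hypothetical.

```python
# Assumed ordering of pattern types, weakest to strongest (standard poker ranking).
CATEGORIES = [
    "High Card", "Pair", "Two Pairs", "Three-of-a-Kind", "Sequence",
    "Flush", "Full House", "Four-of-a-Kind", "Straight Flush",
]
STRENGTH = {name: i for i, name in enumerate(CATEGORIES)}

def hand_key(category, pattern_ranks, kickers):
    """Build a sortable key: pattern type, then pattern ranks, then kickers."""
    return (STRENGTH[category], tuple(pattern_ranks), tuple(kickers))

def showdown(hand_a, hand_b, pot):
    """Return (result, payout_a, payout_b) for two already-classified hands."""
    key_a, key_b = hand_key(*hand_a), hand_key(*hand_b)
    if key_a > key_b:
        return "A", pot, 0
    if key_b > key_a:
        return "B", 0, pot
    return "Draw", pot / 2, pot / 2  # identical best hands split the pot

# Full House beats a Flush regardless of card ranks.
print(showdown(("Full House", [9, 4], []), ("Flush", [14, 11, 8, 5, 3], []), 100))
# 'Three Jacks' beats 'Three 8s' on the rank of the pattern.
print(showdown(("Three-of-a-Kind", [11], [14, 9]), ("Three-of-a-Kind", [8], [14, 13]), 100))
# Same pattern and rank: the two kickers decide.
print(showdown(("Three-of-a-Kind", [7], [13, 4]), ("Three-of-a-Kind", [7], [13, 9]), 100))
```

Passing only the correct number of kickers for each pattern (two for Three-of-a-Kind, none for a Sequence) is what keeps the comparison restricted as described in test 3.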

As mentioned previously, the above methods were conducted on all of the patterns individually, and were created with a specific purpose of checking the most computational and in-depth rule scenarios of each of the winning patterns. All the cards were created by the author and then tested individually, as each winning pattern needed to deal with a different formation of cards. Certain tests revealed errors in the coding, in which case, the error was corrected, and the entire testing cycle was repeated.

More than 2000 hands were tested in the manner described above; however, they were created specially to check specific features or components of the program. There was also a need to test the program in its entirety. For this reason, the following four 'Strict' players were created: Always-Checks1 (else Folds), Always-Checks2 (else Bets), Always-Bets and Always-Raises (else Bets).

These players were made to play 1000 games against each other, and the entire decision and game state of each of the games was recorded. The transcript of the resulting 6000 games (1000 for each of the six pairings of the four players) was checked manually by the author to confirm the stability and soundness of the program created. All the decisions were found to be correct according to the Poker rules discussed in previous chapters.
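The figure of 6000 games follows from the number of distinct pairings of the four 'Strict' players. The short check below assumes that each pair of different players met once for 1000 games and that no player was matched against itself, which is the reading consistent with the total quoted above.

```python
from itertools import combinations

strict_players = ["Always-Checks1", "Always-Checks2", "Always-Bets", "Always-Raises"]
GAMES_PER_PAIRING = 1000

# Every unordered pair of distinct players (assumption: no self-play)
pairings = list(combinations(strict_players, 2))
total_games = len(pairings) * GAMES_PER_PAIRING

print(len(pairings), total_games)  # 6 6000
```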

5.2 - Random – 1 Player's Evaluation

As expressed earlier, the main phase of computer player testing began with the introduction of Random – 1 and Random – 2, also called Random Player and Non-Folding Random Player respectively. This sub-section explains the poor performance of Random – 1, and the reason why it was not tested in as much depth as Random – 2. It also has certain implications for the workings and required strategies of future computer players.

Random – 1 was found to play appallingly badly against both Random – 2 and Anki – V1, given the same test conditions. Each experiment consisted of 10,000 tournaments. Under these conditions both Random – 2 and Anki – V1 were found to beat Random – 1 in all of the 10,000 tournaments. This clearly shows the flawed strategy of the player.

This finding leads to our first major conclusion of the thesis, i.e. randomisation is required over the strategies and meta-strategies that influence individual betting actions, i.e. the decision to be aggressive, loose, etc. at any one time. This is especially evident in the Random – 1 vs. Random – 2 experiments, where the high folding rate of Random – 1 forces it to bow out of the competition too often and too early, and thereby lose tournaments quickly. Random – 1 performs slightly better against Anki – V1 because Anki – V1 assumes that Random – 1 is a rational player, and thus if Random – 1 bets and Anki – V1 has really bad cards, it chooses to fold. But once again, the sheer volume of folding by Random – 1 leads to its eventual downfall. Figure 19 provides more insight into the exact statistics of the two experiments.

Figure 19. Player Performance when playing against Random - 1

Random – 1 was found to play a total of 506,026 and 860,978 games against Random – 2 and Anki – V1 respectively, and was beaten by both opponent players in all the tournaments. The game-winning percentages of the latter players are also shown in Figure 19.

In light of the results obtained above, the Random – 1 player was abandoned. Future testing is conducted through self-play or through play against Random – 2, with only occasional comparison to Random – 1.
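As a rough worked check on these totals, the average length of a tournament can be derived from the game counts and the 10,000 tournaments played in each experiment. The variable names below are illustrative, and the interpretation in the closing note is one possible reading rather than a result stated in the thesis.

```python
TOURNAMENTS = 10_000

# Total games Random - 1 played against each opponent (figures quoted above)
games_played = {"Random - 2": 506_026, "Anki - V1": 860_978}

avg_games = {opp: total / TOURNAMENTS for opp, total in games_played.items()}

for opponent, avg in avg_games.items():
    print(f"vs {opponent}: about {avg:.1f} games per tournament")
```

On these numbers Random – 1 lasted roughly 51 games per tournament against Random – 2 and roughly 86 against Anki – V1, which would be consistent with the observation that it holds out somewhat longer against Anki – V1.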

5.3 - Evaluation of Anki – V1

The evaluation of Anki – V1 is done in two broad categories: playing against pre-programmed players for evolutionary and basic results, and playing against humans for more advanced results and final evaluation. Each category of tests is presented in more detail in the subsections below.

5.3.1 - Anki – V1 vs. Computer players

[Figure: bar chart of % Tournaments Won and % Games Won for Random – 2 and Anki – V1. x-axis: Player; y-axis: Percentage of Victory (0 to 100).]

strategy that an expert would recommend for master play. The static aspect of this player is not a disadvantage to it either, as the opponent, Anki – V1, does not have opponent modeling.

In addition to testing the performance of Anki – V1 against the pre-defined computer player, it is also imperative that Anki – V1 demonstrate an increase in performance as it develops. Anki – V1 is created from four different betting strategy/evaluation components: pre-flop evaluation, post-flop evaluation, post-turn evaluation and final evaluation (post-river). It needs to be shown that the introduction of each one of these components adds value to the player as a whole.

For the purpose of these experiments, the Anki – V1 with only pre-flop evaluation was named Start-Eval Anki. The next upgrade, with both pre-flop and post-flop evaluations, is called Flop-Eval Anki. The addition of post-turn evaluation leads to Turn-Eval Anki and finally, all four evaluations come together as Final-Eval Anki. Each experiment between the players consisted of 10,000 tournaments. This was done in light of the fact that previous research has shown that up to a couple of thousand games can be affected by the good or bad luck of a player [10]. Thus, to make the statistical result more accurate, and assuming at least 100 hands per tournament, 10,000 tournaments provide at least a million games. This gives a result that is largely free from the luck factor. All the results were checked to confirm that at least a million games had been played, and this was found to be true.

Figure 21 shows the performance of improving Anki – V1 against Random – 2. As can be seen, each improvement is found to benefit the performance.

Figure 21. Anki – V1's performance against the Random – 2 Player improves as more heuristics are added (an average of 2.8 million games for each evaluation)

The figure above has a notable quality: from Start-Eval Anki to Flop-Eval Anki there is a major improvement, whereas the improvement between the last three players is noticeable but not major. Both these observations can be explained through the concept of game state information. In the first case, the evaluator has only 50% of the information (2 of the 4 cards) available to it. The decision based on this information is thus seriously flawed, which leads to the folding of good potential hands and protection of bad final hands in the Start-Eval Anki Player. In comparison, the information jump is very substantial in the next round, from 50% to 71.4%, as 5 of the 7 cards are now visible to the player. This allows the player to progress more intelligently.
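The information percentages used in this section can be reproduced by counting visible cards against all cards in play in a two-player game. The sketch below assumes that "information" means exactly this ratio (the player's own two cards plus the community cards, over those plus the opponent's two hidden cards); the function name is illustrative.

```python
def visible_fraction(community_cards):
    """Share of the cards in play that one player can see (two-player game)."""
    own_cards = 2
    hidden_opponent_cards = 2
    visible = own_cards + community_cards
    total = visible + hidden_opponent_cards
    return visible / total

# Number of community cards on the table at each evaluation stage
stages = {"Start-Eval": 0, "Flop-Eval": 3, "Turn-Eval": 4, "Final-Eval": 5}
percentages = {name: round(100 * visible_fraction(n), 1) for name, n in stages.items()}

print(percentages)
# {'Start-Eval': 50.0, 'Flop-Eval': 71.4, 'Turn-Eval': 75.0, 'Final-Eval': 77.8}
```

The flop adds more than 21 percentage points, while the turn and river add about 3.6 and 2.8 points respectively, which matches the large first jump and the smaller later gains.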

The small information gain is also responsible for the slow growth in the latter three forms of Anki – V1. The percentages of information available to Flop-Eval, Turn-Eval and Final-Eval Anki are 71.4%, 75% and 77.8% respectively, so each step adds only a little. As three

[Figure 21 plot. x-axis: Anki – V1 Player with accumulated heuristics (Start-Eval Anki, Flop-Eval Anki, Turn-Eval Anki, Final-Eval Anki); y-axis: No. of Tournaments Won (0 to 10,000).]

Apart from the tournament victory of Anki – V1, the various players also need to be measured for their profitability and their efficiency. Figure 21 shows the increase in earnings of the players, whereas Figure 22 shows the relative increase in game winnings.

Figure 21. Profitability of Anki – V1 Players increases, as it adds heuristics to its play.

Figure 21 shows the increase in the profits of the various players. For example, whereas Start-Eval Anki is found to lose 2.52 units of money with every game, Final-Eval Anki wins 9.21 units on average for every game that is played. This shows the increasing intelligence and playing ability of each player.

Figure 22 sums up Anki – V1's performance against the random player by comparing the winnings of tournaments and games. Unlike tournament victories, which describe a player's performance directly, the fewer games a player wins while improving its tournament play, the better the player. This is because the player simultaneously improves both tournament play and overall profitability: it knows better when to play and which hands to play.

[Profitability plot. x-axis: Anki – V1 Player (Start-Eval Anki, Flop-Eval Anki, Turn-Eval Anki, Final-Eval Anki); y-axis: Money won per game played (-3 to 10).]

Figure 22. Tournaments and Games won by Anki – V1 playing against Random – 2 show that the efficiency of the player is improving.

Figure 22 shows how the increase in the number of tournament wins does not seem to affect the percentage of games being played and won by Anki – V1, i.e. the percentage of games won remains constant. This is a positive sign for the later versions of Anki – V1, as it shows that the players are becoming more efficient in winning tournaments. Their improvement is shown by the increase in tournament wins, and their intelligence by the stability of the percentage of game victories.

5.3.2 - Anki – V1's Evaluation against Human Players

Anki – V1 was played against three classes of players: beginner, intermediate and advanced. Beginners are newcomers to the game, i.e. people who have never played poker before. One of the stated aims of this project is also to investigate the formation of a Poker player that teaches beginners, and for that reason it also needs to be able to play well against them.

[Figure 22 plot: % Tournaments Won and % Games Won for each Anki – V1 Player (Start-Eval Anki, Flop-Eval Anki, Turn-Eval Anki, Final-Eval Anki). y-axis: Percentage of Victory (0 to 100).]

constraints of the project, the test base for the project was restricted to a close community, and thus certain members of the community had additional information available to them, which helped them develop a strategy against Anki – V1. For this reason, they have been considered in the category above the absolute class.

Finally, advanced players are either people with frequent exposure to the game in tournament play (with real money, online or in cash form), or intermediate players with knowledge of the player's capabilities. Once again, due to the given constraints, the experiments were held at a lower capacity than ideal. However, at least three individuals were gathered from each of the prescribed categories and were asked to play until they either won or lost a tournament.

The final result data from all the tournaments was gathered and sorted according to the categories into which the human players had been divided. Figure 23 provides a brief outlook of Anki – V1's performance against the human players. Each point on a performance curve is the cumulative average of Anki – V1's money at that point in time, with time measured in number of games. The important points in the game are also marked with their game number.

Figure 23. Anki – V1's performance against human players

It can be seen from the figure above that Anki – V1 succeeds in its primary objective of beating the beginner player, i.e. the tournament ends with Anki – V1 holding all 2000 units of money on the table. The beginners involved in the testing found the player to be quite informative and user friendly; however, they did sometimes require assistance in trying to understand the winning situations with kickers, etc.

Intermediate players finished better off against Anki – V1, but only after a good struggle. It can be seen from the graph that Anki – V1 managed to get the upper hand very early in the game, while the human players tried to control their losses. About halfway through the graph (marked at game 560), it can be seen how Anki – V1 loses a lot of money; this was mostly attributed to two of the players having a couple of very big games that went their way around that time. This lucky break allowed the human players to move close to winning; however, as the graph shows, it still took some commitment to finish off Anki – V1. This can also be seen from the fact that it took an average of 1308 hands for Intermediate Human Players to finally beat the Anki – V1 Player. The general feedback from intermediate players was positive, whereby they felt that the player had a lot to offer if it incorporated a looser or more aggressive form of betting strategy.

The general strategy of the Human Intermediate Players became 'bet-first'. They utilised an exceedingly loose strategy, as it led to Anki – V1 folding on most occasions. Similarly, closer to the end, the players commented on how they were beginning to trust the tightness of Anki – V1, i.e. they folded when they saw Anki – V1 fighting hard for its cards. This was an expected result from the intermediate bench, as Anki – V1 definitely had the shortcoming of being partially predictable.

Figure 23 also makes clear how Anki – V1 succumbed to the aggressive and loose behaviour of the Advanced Players. Yet, it is against these players that Anki – V1 can show its best traits. Anki – V1's relatively quick defeat at the hands of the Advanced players was expected, due to its failure to cause doubts in the opponent's mind. The advanced players began to trust the computer's tightness strategy from the beginning and used this to their advantage. Apart from all these well-understood problems, Anki – V1 still needed to prove its worth in at least one of the departments for which it was created, i.e. quality of evaluation of playing hands.

extensive evaluation knowledge. However, the comparison between Human and Anki – V1's evaluators needs to be done to prove its competence.

In the final result file generated through human play, it was noted that the majority of AI losses were due to folding early on in the game. Each such fold lost only 10 or 20 units of money, but they were so frequent that they led to Anki – V1's downfall. Thus, to properly estimate the power of Anki – V1's evaluator, these smaller values need to be progressively removed. Figure 24 provides two indexes to measure Anki – V1's true capabilities. The indexes are grouped by 'Bet Placed By Anki – V1', which is the minimum amount committed by Anki – V1 in the game. Thus a 'Bet Placed By Anki – V1' of 20+ removes the games in which Anki – V1 or the human player folded right after a bet in the first round of betting, i.e. all the games with a value of just 10.

Figure 24. Anki – V1 evaluation against advanced human players using a couple of indexes explained in text. Once again, the improvement is visible.

The first index that can be seen slowly rising is the Relative Performance Index. It is calculated by the formula given in Figure 25. As expected, this value is above 1 for Anki – V1 from the start. This is because Anki – V1 only plays games it believes it will win, thus the result per won game for Anki – V1 will be higher than that of an opponent who plays aggressively just to win 10 units most
