Learning Trading Negotiations Using Manually and Automatically Labelled Data


Heriberto Cuayáhuitl, Simon Keizer, Oliver Lemon
School of Mathematical and Computer Sciences,

Heriot-Watt University, United Kingdom {h.cuayahuitl|s.keizer|o.lemon}@hw.ac.uk

Abstract—Strategic conversational agents often need to trade resources with their opponent conversants, and trading strategically can lead to better results. While rule-based or supervised agents can be used for this purpose, here we explore a learning approach based on automatically labelled examples from human players for automatic trading in the game of Settlers of Catan. Our experiments are based on data collected from human players trading in text-based natural language. We compare the performance of Bayes Nets, Conditional Random Fields, and Random Forests on the task of ranking trading offers, trained on both manually labelled and automatically labelled data. Our experimental results show that our best agent trained on automatic labels outperformed its counterpart trained on manual labels (with moderate annotator agreement) in terms of (a) predicting human trading negotiations better, and (b) winning more games.

Keywords-strategic interaction; supervised learning; semi-supervised learning; automatic labelling; board games;

I. INTRODUCTION

Strategic conversation does not assume full cooperation between the interacting agents [1]. In this paper, we use a strategic card-trading board game to illustrate our approach. Board games with trading aspects aim not only to entertain people but also to train their trading skills. Popular board games of this kind include Last Will, Settlers of Catan, and Power Grid, among others [2]. While these games can be played between humans, they can also be played between computers and humans. The trading behaviours of computer players are usually based on heuristics or optimisation methods. The former include carefully tuned rules, and the latter include methods such as Monte-Carlo tree search [3] and reinforcement learning [4], [5], [6], or a combination of them [7]. However, applying these methods is not trivial due to the complexity of the problem, e.g. large state-action spaces. On the one hand, unique situations in the game can be described by a number of variables (e.g. resources available), so enumerating them would result in very large state spaces. On the other hand, the action space can also be large due to the wide range of unique negotiations (e.g. givable and receivable resources). While one can aim to optimise the whole game by compressing the search space, one can also aim for a specialised solution. This paper takes the latter route: we focus on learning to trade only, rather than learning to play the whole game. In addition, while previous work has focused on optimising negotiation strategies [4], [3], our proposed approach focuses on learning human-like trading from human examples, despite the fact that in reality the “best” choice may not be the most human-like one, especially with data from non-expert players.

Our scenario for strategic interaction is the game of Settlers of Catan, where players take the role of settlers on the fictitious island of Catan (see Figure 1). The board consists of 19 randomly arranged hexes: 3 hills, 3 mountains, 4 forests, 4 pastures, 4 fields and 1 desert. On this island, hills produce clay, mountains produce ore, pastures produce sheep, fields produce wheat, forests produce wood, and the desert produces nothing. In our setting, four players attempt to settle on the island by building settlements and cities connected by roads. To build, players need specific resource cards: for example, a road requires clay and wood; a settlement requires clay, sheep, wheat and wood; a city requires three ore cards and two wheat cards; and a development card requires ore, sheep and wheat. Players earn points, for example, by building a settlement (1 point) or a city (2 points), or by obtaining victory point cards (1 point each). A game consists of a sequence of turns, and each turn starts with a dice roll that can make players obtain or lose resources (depending on the number rolled and the resources on the board). The player whose turn it is can trade resources with the bank or with other players, and can use available resources to build roads, settlements or cities. This game is highly strategic because players often face decisions about which resources to request and which to give away, and these decisions are influenced by what they need to build. A player can only build on locations connected to their existing pieces (road, settlement or city), and all settlements and cities must be separated by at least two roads. The first player to reach 10 victory points wins and all others lose.

This paper extends our previous approach based on statistical inference for ranking trading negotiations [9], i.e. the exchange of some resources for others, from training on manually labelled data to training on automatically labelled data. We compare three statistical agents (Bayes Nets, Conditional Random Fields, and Random Forests) against rule-based and random agents as baselines, and show that our best agent, trained on automatically labelled data, performed better than its counterpart trained on manually labelled data.

Figure 1. Example board of the game “Settlers of Catan” [8]. The top-middle dialogue box is a chat interface that displays the game history, including trading offers and responses from all players.

II. RELATED WORK

Machine learning techniques for strategic trading games have received little attention to date. Notable exceptions have applied reinforcement learning to board games. [10] proposes reinforcement learning with multilayer neural networks for training an agent to play the game of Backgammon, and finds that agents trained with such an approach are able to match and even beat human performance. [4] proposes hierarchical reinforcement learning for automatic decision making on object-placing and trading actions in the game of Settlers of Catan; it incorporates built-in knowledge to learn the behaviours of the game more quickly, and finds that the combination of learned and built-in knowledge is able to beat human players. [6] uses reinforcement learning in non-cooperative dialogue, focusing on a small 2-player trading problem with 3 resource types, but without using any real human dialogue data. This work showed that explicit manipulation moves (e.g. “I really need sheep”) can be used to win when playing against adversaries who are gullible (i.e. they believe such statements), and also against adversaries who can detect manipulation and punish the player for being manipulative [11]. More recently, [12] compare trading policies trained with hand-crafted traders and with supervised traders created from human players. They found that, rather than training trading policies with hand-crafted rule-based heuristics, a more successful approach is to train them with a supervised classifier trained from human examples.

Related work on supervised learning using manually and automatically labelled data has reported divergent findings. One strategy has been to train classifiers for natural language processing (NLP) tasks using automatically extracted examples. For example, [13] train a classifier for discourse relations and report a classification accuracy of up to 93%. On the other hand, [14] compare classifiers for discourse relations trained on automatically extracted examples against those trained on manually labelled examples. The authors focus on a dataset with only moderate inter-annotator agreement (κ=0.592), and observe that classification accuracy drops substantially in the presence of ambiguous labels. The success of automatic labelling therefore seems to vary with the nature of the target dataset. In this paper, we present further evidence that automatic labelling can lead to good results.

Other supervised learning techniques have been applied to train automated agents that play board games, such as decision trees [15], preference learning [16], and deep neural networks [17]. Since statistical inference has received little attention in previous work, with some exceptions [17], [9], we argue that it can play an important role in training strategic agents with human-like behaviour. In addition, statistical traders have not previously been trained on automatically labelled data, and our results suggest that this approach represents a state-of-the-art method for learning trading negotiations.

Other related work has been carried out in the context of automated non-cooperative dialogue systems, where an agent may act to satisfy its own goals rather than those of other participants [5]. The game-theoretic underpinnings of non-cooperative behaviour have also been investigated [18]. Such automated agents are of interest when trying to persuade, argue, or debate, or in the area of believable characters in video games and educational simulations [5], [19]. Another arena in which non-cooperative dialogue behaviour has been investigated is negotiation [20], where hiding information (and even outright lying) can be advantageous. Beyond the machine learning efforts applied so far to strategic interactive games, other forms of learning remain to be explored. They include direct reinforcement learning to learn from trial and error and inverse reinforcement learning to learn from demonstrations, semi-supervised learning to learn from labelled and unlabelled data, unsupervised learning to learn from unlabelled data, multi-agent learning to acquire behaviours that take the strategies of opponents into account, transfer learning so that agents do not have to be trained from scratch, and active learning to learn to ask what to do in uncertain situations while playing the game, among others (see [15], [21], [22] for an overview). Another direction to explore in strategic games is a combination of planning and learning, which has shown more promising results than either in isolation [17], [7]. A further direction to explore is end-to-end statistical training of language understanding [23], [24], game behaviour, and language generation [25], [26], [27] using a unified learning framework.

III. THE DATA AND TASK

We used a set of 32 logged games played by 56 different players, as described in [28]. Although the games were carefully labelled by multiple annotators, they were difficult to annotate, as indicated by a moderate annotator agreement of κ = 0.62 according to the well-known kappa score [29]. The data correspond to 2512 trading negotiation events (also referred to as ‘training instances’), denoted as $D_m = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, where $x_i$ are feature vectors and $y_i$ are class labels (i.e. givable resources). Our data set has an average of 44.8 turns per player. An example trading negotiation in the game of Settlers of Catan in natural language is “I’ll give anyone sheep for clay”, which can be represented as follows, including the agent’s available resources:

Givable(Sheep, all) ∧ Receivable(Clay, all) ∧
Resources(clay=0, ore=0, sheep=4, wheat=1, wood=0) ∧
Buildups(roads=2, settlements=0, cities=0).

From this illustrative example, $y_i = \text{sheep}$ (feature f14) and $x_i = \{0,0,4,1,0,2,0,0,1,0,0,0,0\}$ (features f1-f13 in Table I). Although this representation may look simple at first sight, it supports $8^8 \times 2^5 \times 5 = 2{,}684{,}354{,}560$ (roughly 2.7 billion) possible and unique negotiation events. Even though the class label is only the givable resource, we use receivables as features when ranking all possible offers, so all offers consisting of one givable and one or more receivables are in fact ranked.² Note that not all of them are valid or legal at every point in the game. Choosing the most human-like trading negotiation (our goal) can be seen as a ranking task, where we compute a score representing the importance of each available trading negotiation (similar to the one above) in order to make the best choice, i.e. the most human-like one. In this way, the quality of our learning agents depends on the quality of the examples provided.
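As a minimal sketch of the encoding above (in Python; the names `encode`, `RESOURCES` and `BUILDUPS` are hypothetical), the example negotiation can be turned into the 13-dimensional context vector $x_i$ using the feature order of Table I, and the size of the negotiation space can be checked by multiplying the domain sizes of features f1-f14:

```python
# Feature order follows Table I: f1-f5 resources, f6-f8 build-ups, f9-f13 receivables.
RESOURCES = ["clay", "ore", "sheep", "wheat", "wood"]
BUILDUPS = ["roads", "settlements", "cities"]


def encode(resources, buildups, receivables):
    """Return the 13-dimensional context vector x (features f1-f13)."""
    x = [resources[r] for r in RESOURCES]  # f1-f5: units available (0-7)
    x += [buildups[b] for b in BUILDUPS]  # f6-f8: pieces built (0-7)
    x += [1 if r in receivables else 0 for r in RESOURCES]  # f9-f13: binary
    return x


# "I'll give anyone sheep for clay", with the agent's resources and build-ups.
x = encode(
    resources={"clay": 0, "ore": 0, "sheep": 4, "wheat": 1, "wood": 0},
    buildups={"roads": 2, "settlements": 0, "cities": 0},
    receivables={"clay"},
)
y = "sheep"  # f14: the givable resource, i.e. the class label

# f1-f8 take 8 values each, f9-f13 are binary, f14 has 5 classes.
num_events = 8 ** 8 * 2 ** 5 * len(RESOURCES)

print(x)           # [0, 0, 4, 1, 0, 2, 0, 0, 1, 0, 0, 0, 0]
print(num_events)  # 2684354560
```

The product counts combinations of feature values only; as the text notes, many of them are not legal at a given point in the game.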

To rank such trading negotiation alternatives, we train a set of statistical classifiers based on the feature set described in Table I. Our set of features includes the resources available (features f1-f5) and the build-ups (features f6-f8), each with a minimum value of 0 and a maximum value of 7, the receivable resources in binary form to reduce data sparsity (features f9-f13), and the givable resource, which is the class to predict (feature f14).
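The ranking step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `score_fn` is a hypothetical stand-in for any of the trained classifiers of Section V (it maps features f1-f13 to a probability per givable), and the legality check (the player must hold the givable) is a simplification rather than the full game rules.

```python
from itertools import combinations

RESOURCES = ["clay", "ore", "sheep", "wheat", "wood"]


def candidate_offers(resources):
    """Yield (givable, receivables) pairs: one resource the player holds,
    offered for a non-empty set of other resources (simplified legality)."""
    for give in RESOURCES:
        if resources[give] == 0:
            continue  # cannot offer a resource the player does not hold
        others = [r for r in RESOURCES if r != give]
        for k in range(1, len(others) + 1):
            for rec in combinations(others, k):
                yield give, frozenset(rec)


def rank_offers(context, resources, score_fn):
    """Score every candidate offer and return them best first.

    context holds features f1-f8; score_fn maps features f1-f13 to a dict
    {givable: probability}, standing in for a trained classifier."""
    scored = []
    for give, rec in candidate_offers(resources):
        rec_bits = [1 if r in rec else 0 for r in RESOURCES]  # f9-f13
        probs = score_fn(context + rec_bits)
        scored.append((probs[give], give, rec))
    scored.sort(key=lambda item: item[0], reverse=True)  # stable for ties
    return scored


def toy_score(features):
    """Hypothetical scorer: favours giving away the most abundant resource."""
    counts = dict(zip(RESOURCES, features[:5]))
    total = sum(counts.values()) or 1
    return {r: counts[r] / total for r in RESOURCES}


resources = {"clay": 0, "ore": 0, "sheep": 4, "wheat": 1, "wood": 0}
context = [0, 0, 4, 1, 0, 2, 0, 0]  # f1-f8
ranking = rank_offers(context, resources, toy_score)
print(len(ranking))                  # 30 candidate offers (15 per held resource)
print(ranking[0][1], ranking[0][0])  # sheep 0.8
```

The top-ranked entry is the offer the agent would make; a real classifier would also let the receivables (f9-f13) change the score, which the toy scorer ignores.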

An example subdialogue between players is shown in Table II. The first column shows the player IDs (each game had four players in total, but the fourth player was silent in this subdialogue). The second column shows the messages typed and displayed in the top-middle dialogue box of Figure 1. The third column shows the semantics of the textual messages. The last column shows the context of the trading negotiations, represented by features f1-f8 described in Table I. Subdialogues of this sort occur throughout the game and result in players accepting or rejecting trading offers from other players.

² The feature set listed in Table I was chosen because it yielded the best performance in previous experiments, from a pool of feature sets obtained by both manual and automatic feature selection. Other feature sets that we explored include smaller domains (only binary features), larger domains (non-binary features), smaller and larger sets of features, and multiple givables rather than a single one, among others.

ID   Feature         Domain    Description
f1   hasClay         {0...7}   Num. clay units available
f2   hasOre          {0...7}   Num. ore units available
f3   hasSheep        {0...7}   Num. sheep units available
f4   hasWheat        {0...7}   Num. wheat units available
f5   hasWood         {0...7}   Num. wood units available
f6   hasRoads        {0...7}   Num. roads built so far
f7   hasSettlements  {0...7}   Num. settlements built so far
f8   hasCities       {0...7}   Num. cities built so far
f9   recClay         Binary    Clay offered by opponent?
f10  recOre          Binary    Ore offered by opponent?
f11  recSheep        Binary    Sheep offered by opponent?
f12  recWheat        Binary    Wheat offered by opponent?
f13  recWood         Binary    Wood offered by opponent?
f14  givable         Resource  Clay/Ore/Sheep/Wheat/Wood

Table I

FEATURE SET FOR LEARNING TRADING NEGOTIATIONS FROM EXAMPLES.

IV. TRAINING APPROACHES

In this paper, we treat trading in strategic conversation as a classification task, where we train statistical classifiers either with manually labelled data (the typical approach) or with automatically labelled data (our proposed approach).

A. Training with Manually Labelled Data

To train statistical agents in a supervised manner, we first use only one data set of manually labelled trading examples $D_m = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, where $x_i$ are feature vectors and $y_i$ are class labels. Each pair represents an instance used for training or testing by the learning methods described in Section V. See Figure 2(a) for an illustration.

B. Training with Automatically Labelled Data

We extend the previous approach by automatically re-labelling data set $D_m$ into $D_a = \{(x_1, y^p_1), \ldots, (x_N, y^p_N)\}$, where the $y^p_i$ are labels predicted by the automatic labeller described below. This approach is motivated by the possibility that the re-labelled data may be more useful for training than the original source. We then use $D_a$ to train the statistical classifiers described in Section V. See Figure 2(b) for an illustration.
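The relabelling step amounts to a simple mapping from $D_m$ to $D_a$. In this sketch, `relabel` and the toy `first_resource` labeller are hypothetical; in the paper the labeller is the Random Forest described below, and the manual label in the toy example is invented purely to show it being replaced.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Example = Tuple[List[int], Optional[str]]
RESOURCES = ("clay", "ore", "sheep", "wheat", "wood")


def relabel(d_m: Sequence[Example], messages: Sequence[str],
            labeller: Callable[[str], Optional[str]]) -> List[Example]:
    """Build D_a from D_m: keep each feature vector x_i and replace its
    manual label y_i with the label y^p_i predicted from the raw message."""
    if len(d_m) != len(messages):
        raise ValueError("expected one message per training instance")
    return [(x, labeller(msg)) for (x, _), msg in zip(d_m, messages)]


def first_resource(message: str) -> Optional[str]:
    """Hypothetical stand-in labeller: returns the first resource mentioned."""
    for token in message.lower().replace("?", " ").split():
        if token in RESOURCES:
            return token
    return None


d_m = [([0, 0, 4, 1, 0, 2, 0, 0, 1, 0, 0, 0, 0], "wheat")]  # invented manual label
messages = ["I'll give anyone sheep for clay"]
d_a = relabel(d_m, messages, first_resource)
print(d_a)  # [([0, 0, 4, 1, 0, 2, 0, 0, 1, 0, 0, 0, 0], 'sheep')]
```

Only the labels change; the feature vectors, and hence the classifiers trained on them in Section V, stay the same.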

Our classifier for automatic labelling used as features the most common words in text-based trading messages from human players,³ and its class labels were ‘Givable’ and ‘Receivable’. This binary classifier used a Random Forest with 100 decision trees (see Section V-C for more details). Specifically, the word-level features included the most common words to the left of a resource in focus and the most common words to the right of the same resource. In this way, the sentence “I give you sheep for clay” would be labelled as Givable(sheep) and Receivable(clay). In this example, the words ‘give’ and ‘you’ to the left of the resource ‘sheep’ would be potentially relevant features for the class label ‘Givable’. Similarly, the word ‘for’ to the left of the resource ‘clay’ would be a potentially relevant feature for the label ‘Receivable’. In other words, our automatic labeller generated the semantics from raw text, as illustrated in columns 2 and 3 of Table II. We note that while manual labels referred to context beyond the sentence in a turn (e.g. one or more trades before the one in focus), our automatic labeller only used the local context of the sentence in focus. We also note that our automatic labels were agnostic about the players in focus, i.e. our automatic labeller did not take into account the sender and recipient players. Furthermore, while the manual labels used 7 dialogue act types, our automatic labels focused on 3 dialogue act types (see Figures 3, 4, and 5). The smaller set of dialogue act types was used to reduce the complexity of the annotations.

³ The common words in text-based trading messages are defined as those words and symbols (e.g. full stops, question marks) whose frequency in the training data is above the average frequency.

Player  Message                               Semantics                          Context (f1...f8)
A       Anyone wants to trade wood for clay   Givable(wood)∧Receivable(clay)     0,0,0,2,3,4,2,0
A       1 4 1                                                                    0,0,0,2,3,4,2,0
B       No-one wants wheat for clay?          Givable(wheat)∧Receivable(clay)    0,0,0,1,1,4,2,0
A       Wheat for clay?                       Givable(wheat)∧Receivable(clay)    0,0,1,2,2,4,2,0
C       Sheep for clay?                       Givable(sheep)∧Receivable(clay)    0,0,4,5,1,2,2,0
A       I got 1 sheep                         Give(sheep)                        0,0,1,2,2,4,2,0

Table II

EXAMPLE TRADING NEGOTIATIONS FROM HUMAN PLAYERS IN THE GAME OF SETTLERS OF CATAN.

Figure 2. Illustration of training approaches using manually labelled data $D_m$ and automatically labelled data $D_a$. The latter is created by an automatic labeller, trained on the original source $D_m$, that re-labels the data (see Section IV-B for further details).

Figure 3. High-level proportion of dialogue act types in manual and automatic labels.

Figure 4. Detailed proportion of dialogue act types in manual labels.

Figure 5. Detailed proportion of dialogue act types in automatic labels.
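A minimal sketch of the left/right context features of the automatic labeller follows. The window of two words on each side and the function name `context_features` are assumptions (the text does not state a window size), and `vocab` stands for the set of common words and symbols of footnote 3. The resulting indicator features could then be vectorised and passed to a classifier such as scikit-learn's `RandomForestClassifier(n_estimators=100)`; that step is omitted to keep the sketch dependency-free.

```python
import re

RESOURCES = {"clay", "ore", "sheep", "wheat", "wood"}


def context_features(message, vocab=None, window=2):
    """Return (resource, features) for each resource mentioned in a message.

    Features are indicator strings for up to `window` words to the left
    ("L:word") and right ("R:word") of the resource; if `vocab` is given,
    only those common words/symbols are kept."""
    tokens = re.findall(r"[a-z']+|[?.!]", message.lower())

    def keep(word):
        return vocab is None or word in vocab

    examples = []
    for i, tok in enumerate(tokens):
        if tok not in RESOURCES:
            continue
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        feats = {"L:" + w for w in left if keep(w)}
        feats |= {"R:" + w for w in right if keep(w)}
        examples.append((tok, feats))
    return examples


for resource, feats in context_features("I give you sheep for clay"):
    print(resource, sorted(feats))
# sheep ['L:give', 'L:you', 'R:clay', 'R:for']
# clay ['L:for', 'L:sheep']
```

Each (resource, features) pair would be one training instance for the binary Givable/Receivable classifier, which matches the per-sentence, speaker-agnostic scope described above.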

V. STATISTICAL TRADING AGENTS

We compare the performance of the following statistical classifiers with the aim of finding the best predictor of human-like trading negotiations: (i) Bayesian Networks, (ii) Conditional Random Fields, and (iii) Random Forests.

A. Learning to Trade with Bayesian Nets

Our Bayesian agent is defined by $P(\mathbf{x}) = \prod_{i=1}^{n} P(x_i \mid pa(x_i))$, where $\mathbf{x} = \{x_1, \ldots, x_n\}$ is a set of random variables describing the context of the game, $pa(\cdot)$ denotes the set of parent random variables, and every variable is associated with a conditional probability distribution $P(x_i \mid pa(x_i))$. Two main tasks are involved in the creation of our Bayes net. First, parameter learning estimates the conditional probability distributions (discrete in our case) from the training data $D$ ($D_m$ or $D_a$) using maximum likelihood estimation with smoothing. Second, structure learning induces the dependencies between random variables using the K2 algorithm; see [30] for details. Once the Bayes net has been trained, we use the junction tree algorithm [31] for probabilistic inference of trades. The most probable human-like trade is selected according to $y^* = \arg\max_{y \in Y} P(y \mid e(y))$, where the contextual information of givable $y$ is defined by $e(y) = \{f_1 = val_1, \ldots, f_n = val_n\}$ with features $f_i$.

B. Learning to Trade with CRFs

This agent treats trading as a sequence labelling task, in which a sequence of game environment inputs is labelled with appropriate givable resources to support trades. The task is therefore to find a mapping between (observed) features, including available resources, build-ups and receivables, and a (hidden) sequence of givables.

We use the linear-chain Conditional Random Field (CRF) model for predicting human-like trades in the game of Settlers of Catan. This model defines the posterior probability distribution of labels (givables in our case) $\mathbf{y} = \{y_1, \ldots, y_{|\mathbf{y}|}\}$ given features $\mathbf{x} = \{x_1, \ldots, x_{|\mathbf{x}|}\}$ as

$$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{t=1}^{T} \exp\left\{ \sum_{k=1}^{K} \theta_k \Phi_k(y_t, y_{t-1}, \mathbf{x}_t) \right\},$$

where $Z(\mathbf{x})$ is a normalisation factor that sums over all possible label sequences for the contextual information $\mathbf{x}$, so that the probabilities of all labellings sum to one. The parameters $\theta_k$ are weights associated with feature functions $\Phi_k(\cdot)$, which are real-valued functions describing the label state $y_t$ at time $t$ based on the previous label state $y_{t-1}$ and features $\mathbf{x}_t$. The parameters $\theta_k$ are set to maximise the conditional likelihood of the sequences of givables in the training data set, and are estimated using gradient descent on the negative conditional log-likelihood. After training, labels can be predicted for new sequences of observations. The most likely trading offer is $\mathbf{y}^* = \arg\max_{\mathbf{y}} P(\mathbf{y} \mid \mathbf{x})$, which is computed using the Viterbi and A* search algorithms; see [32] for further details.
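The decoding step can be illustrated with a small Viterbi sketch (only Viterbi is shown, not the A* variant). Because $\exp$ is monotonic and $Z(\mathbf{x})$ does not depend on $\mathbf{y}$, maximising $P(\mathbf{y} \mid \mathbf{x})$ is equivalent to maximising the summed weighted feature scores. The function `viterbi` and the potentials in `toy_score` are hypothetical; `score(y_t, y_prev, x_t)` stands in for $\sum_k \theta_k \Phi_k(y_t, y_{t-1}, \mathbf{x}_t)$ of a trained model.

```python
def viterbi(labels, xs, score):
    """Most likely label sequence of a linear-chain CRF.

    score(y_t, y_prev, x_t) plays the role of sum_k theta_k * Phi_k(y_t, y_{t-1}, x_t);
    y_prev is None at the first step. Z(x) is not needed for the argmax."""
    # best[y] = (score of the best prefix ending in label y, that prefix)
    best = {y: (score(y, None, xs[0]), [y]) for y in labels}
    for x in xs[1:]:
        new_best = {}
        for y in labels:
            prev = max(labels, key=lambda p: best[p][0] + score(y, p, x))
            new_best[y] = (best[prev][0] + score(y, prev, x), best[prev][1] + [y])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


def toy_score(y, y_prev, x):
    """Hypothetical potentials: prefer giving an abundant resource (x[y]) and
    keeping the same givable as at the previous step (+0.5)."""
    return x[y] + (0.5 if y == y_prev else 0.0)


xs = [{"sheep": 3, "wheat": 1}, {"sheep": 1, "wheat": 1.2}, {"sheep": 0, "wheat": 2}]
print(viterbi(["sheep", "wheat"], xs, toy_score))  # ['sheep', 'wheat', 'wheat']
```

The cost is $O(T \cdot |Y|^2)$ for $T$ steps and $|Y|$ givables, which is small here since there are only five resource types.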

C. Learning to Trade with Random Forests

This agent is trained using an ensemble of trees, which vote for the class prediction at test time [33], [34]. A random forest is an ensemble learning method that constructs a set of random decision trees at training time and uses them to output the most popular class. We compute the probability distribution of a human-like trade as

$$P(\mathit{givable} \mid \mathit{evidence}) = \frac{1}{Z} \prod_{b \in B} P_b(\mathit{givable} \mid \mathit{evidence}),$$

where $\mathit{givable}$ refers to the class prediction, $\mathit{evidence}$ refers to observed features f1-f13, $P_b(\cdot \mid \cdot)$ is the posterior distribution of the $b$-th tree, and $Z$ is a normalisation constant; see [35] for further details. In our experiments below, we fixed the number of decision trees to 100. Assuming that $Y$ is the set of givables at a particular point in the game, the most human-like trading offer (givable $y^*$) given the collected evidence (the context of the game) is $y^* = \arg\max_{y \in Y} P(y \mid \mathit{evidence})$.
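The combination rule for the forest can be sketched as below, following the normalised product over trees given for this agent. Note that many random forest libraries (for example scikit-learn's `predict_proba`) instead average the per-tree probabilities, so this is not a description of any particular library. The function names, the floor `eps`, and the tree outputs are hypothetical.

```python
import math


def forest_posterior(tree_posteriors, eps=1e-12):
    """Normalised product of per-tree posteriors: (1/Z) * prod_b P_b(y | evidence).

    Computed in log space to avoid underflow over many trees; eps is an
    assumed floor so that one tree assigning zero probability does not
    eliminate a class outright."""
    labels = list(tree_posteriors[0])
    logs = {y: sum(math.log(max(p[y], eps)) for p in tree_posteriors) for y in labels}
    top = max(logs.values())
    unnorm = {y: math.exp(s - top) for y, s in logs.items()}
    z = sum(unnorm.values())  # the normalisation constant Z (up to exp(top))
    return {y: v / z for y, v in unnorm.items()}


def best_givable(posterior, legal_givables):
    """y* = argmax over the givables Y available at this point in the game."""
    return max(legal_givables, key=lambda y: posterior[y])


trees = [  # hypothetical outputs of three trees
    {"sheep": 0.6, "wheat": 0.4},
    {"sheep": 0.5, "wheat": 0.5},
    {"sheep": 0.7, "wheat": 0.3},
]
post = forest_posterior(trees)
print(round(post["sheep"], 4))                 # 0.7778 (= 0.21 / 0.27)
print(best_givable(post, ["sheep", "wheat"]))  # sheep
print(best_givable(post, ["wheat"]))           # wheat
```

Restricting the argmax to `legal_givables` mirrors the set $Y$ of givables available at a particular point in the game.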

VI. EXPERIMENTS AND RESULTS

Our evaluation metrics for assessing the predictive power of human-like trading include classification accuracy and precision-recall. These metrics are part of our offline evaluation, which reports performance on held-out data.
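How precision, recall and F-measure are averaged over the five givable classes is not stated. The Majority Baseline row of Table III (accuracy 0.234, precision 0.055 ≈ 0.234², F-measure 0.089) is consistent with support-weighted averaging, so the following sketch assumes it; `weighted_scores` is a hypothetical name.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Accuracy and support-weighted precision, recall and F-measure."""
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = f_measure = 0.0
    for label, support in Counter(y_true).items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        p_l = tp / predicted if predicted else 0.0  # undefined precision -> 0
        r_l = tp / support
        f_l = 2 * p_l * r_l / (p_l + r_l) if p_l + r_l else 0.0
        weight = support / n  # weight each class by its frequency
        precision += weight * p_l
        recall += weight * r_l
        f_measure += weight * f_l
    return accuracy, precision, recall, f_measure


# Majority baseline on a toy set: always predict the most frequent givable.
y_true = ["sheep", "sheep", "wheat", "clay"]
y_pred = ["sheep"] * len(y_true)
scores = weighted_scores(y_true, y_pred)
print(scores)  # accuracy 0.5, precision 0.25, recall 0.5, F-measure about 0.333
```

With this weighting, a majority baseline whose class covers a fraction $a$ of the data gets precision $a^2$ and F-measure $a \cdot 2a/(1+a)$, which for $a = 0.234$ gives the 0.055 and 0.089 of Table III.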


In addition, to assess the performance of our statistical classifiers while playing the game we consider the following game-related metrics (in terms of averages): winning rate, victory points, offers made, successful offers, and pieces built. These metrics are part of our online evaluation, and are used to assess performance while playing the game of Settlers of Catan using a benchmark framework.

Each of the classifiers below (Bayes Net, Conditional Random Field, Random Forest) was trained and evaluated in the same way; the only difference between models was the data source, i.e. manual labels or automatic labels.

A. Offline Evaluation

Table III shows the classification results of our statistical classifiers using the features listed in Table I, trained as described in Sections III, IV and V. Our evaluation used 10-fold cross-validation, i.e. average results over 10 rounds of 9 folds for training and 1 fold for validation. In each round, the automatic labeller was trained on 90% of the manually labelled data, and the remaining 10% was used for validation. The classification accuracy of the automatic labeller was 80.23% according to this cross-validation. For the evaluation in the next section, we chose the automatic labeller with the highest classification accuracy. Our observations from Table III can be described as follows:

• Firstly, all our statistical classifiers substantially outperform a majority baseline.

• A second point to notice is that predicting human trading negotiations is a difficult task because our best classifier, the Random Forest, achieves a classification accuracy of 65.7% when training on manual labels, and 84.8% when training on automatic labels.

• A further point to observe is that all classifiers trained on automatic labels perform better than their counterparts trained on manual labels. In other words, automatic labels help to predict human trading behaviour better than manual labels. This result suggests that automatic labels are useful for data sets that are difficult to annotate, such as ours, which reported a moderate annotator agreement in the manually labelled data [28]. Although this conclusion requires confirmation on other data sets, the next section reports an additional evaluation that supports the good performance of the trained classifiers.
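The 10-fold protocol can be sketched as follows. Whether the folds were shuffled or stratified is not stated, so contiguous folds are an assumption here; `train_fn` is a hypothetical hook where, in the setting described above, both the automatic labeller and the classifier would be fitted on the nine training folds only.

```python
from collections import Counter


def k_fold_indices(n, k=10):
    """Split indices 0..n-1 into k contiguous folds whose sizes differ by at most 1."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(dataset, train_fn, k=10):
    """Mean validation accuracy over k rounds (k-1 folds train, 1 fold validates)."""
    accuracies = []
    for held_out in k_fold_indices(len(dataset), k):
        held = set(held_out)
        train = [ex for i, ex in enumerate(dataset) if i not in held]
        model = train_fn(train)  # e.g. fit the labeller and classifier here
        hits = sum(model(dataset[i][0]) == dataset[i][1] for i in held_out)
        accuracies.append(hits / len(held_out))
    return sum(accuracies) / len(accuracies)


def train_majority(train):
    """Hypothetical learner: always predicts the most frequent training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label


print([len(f) for f in k_fold_indices(2512)])  # [252, 252, 251, 251, 251, 251, 251, 251, 251, 251]
toy = [([i], "wheat" if i % 4 == 0 else "sheep") for i in range(20)]
print(cross_validate(toy, train_majority))  # 0.75
```

With the 2512 instances of $D_m$, each round trains on roughly 2260 instances and validates on the remaining 251 or 252.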

B. Online Evaluation

We also evaluated the statistical classifiers described in Sections III, IV and V by integrating them into the JSettlers benchmark framework [8], illustrated in Figure 1, where we use random and rule-based negotiators⁴ as baselines and rule-based negotiators as opponents. Note that our evaluations here played strategic games at the semantic level, i.e. using dialogue acts such as those shown in column 3 of Table II. In addition, our trained agents were active only during the ranking of trading offers; the rest of the game's functionality was based on the JSettlers framework [8]. We refer to this evaluation as ‘online’ because the agents were used in the actual game to rank realistic trading negotiations. All games were run using four automated agents: one statistical vs. three rule-based. We evaluated each classifier on 10,000 games in order to obtain reliable comparisons despite the randomness of the game; such a number of games has been shown to produce meaningful comparisons [36].

⁴ The baseline trading agent referred to as ‘rule-based’ used the following parameters in all agents (see [36] for further details): TRY_N_BEST_BUILD_PLANS:0, FAVOUR_DEV_CARDS:-5.

Table IV shows results of our online evaluation, which we describe as follows.

• First, note that random behaviour is substantially worse than rule-based behaviour, and that more (successful) offers do not translate into more wins.

• Second, it can be noted that the rule-based agents obtain a winning rate of 25% because four players of the same kind play against each other.

• Third, only some of our agents using the trained classifiers outperform the rule-based agents, resulting in more wins, more victory points and more pieces built, but not necessarily more offers.

• Fourth, taking into account the classification results in the previous section, it can be inferred that higher classification accuracy on data from average human players does not imply a higher winning rate for Conditional Random Fields and Bayes nets; it does so only for the Random Forest. Whether similar effects hold for data from expert human traders remains to be investigated.

• Fifth, the best results are obtained by the Random Forest trained on automatic labels $D_a$. It won about 1 percentage point more games than the Random Forest trained on manual labels $D_m$ (28.57% vs. 27.62%). This difference was significant at p < 0.05 according to a two-tailed Wilcoxon signed-rank test. This result suggests that automatic labels are more useful than manual labels for training trading negotiators, at least when the manual labels have moderate annotator agreement. Manual labels with higher and lower annotator agreement remain to be investigated.
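The text does not state which paired samples entered the Wilcoxon signed-rank test (for instance per-game outcomes or win rates over batches of games), so the sketch below simply takes two paired lists. It uses the large-sample normal approximation without tie correction, which is a simplification; `wilcoxon_signed_rank` is a hypothetical name, and in practice `scipy.stats.wilcoxon` provides exact and corrected variants.

```python
import math


def wilcoxon_signed_rank(a, b):
    """Two-sided Wilcoxon signed-rank test for paired samples a and b.

    Zero differences are dropped, tied |differences| get average ranks, and the
    p-value uses the large-sample normal approximation without tie correction.
    Returns (W, p) where W is the smaller of the positive and negative rank sums."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for m in range(i, j + 1):
            ranks[order[m]] = (i + j) / 2 + 1  # average of 1-based ranks i+1..j+1
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return w, p


# Hypothetical paired measurements (e.g. win rates of two agents per batch of games).
w, p = wilcoxon_signed_rank([5, 6, 7, 8, 9], [1, 2, 3, 4, 5])
print(w, p < 0.05)  # 0 True
```

The normal approximation is only reasonable for larger numbers of non-zero differences; with five pairs, as in the toy call, an exact test would be preferable.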

Classifier                         Accuracy  Precision  Recall  F-Measure
Majority Baseline                  0.234     0.055      0.234   0.089
Conditional Random Field (man)     0.621     0.623      0.623   0.623
Bayesian Network (man)             0.639     0.640      0.639   0.635
Random Forest (man)                0.657     0.657      0.657   0.656
Conditional Random Field (auto)    0.712     0.718      0.718   0.718
Bayesian Network (auto)            0.691     0.700      0.691   0.688
Random Forest (auto)               0.848     0.854      0.848   0.848

Table III

OFFLINE EVALUATION: CLASSIFICATION ACCURACY AND PRECISION-RECALL RESULTS OF HUMAN TRADING NEGOTIATIONS IN SETTLERS OF CATAN. NOTATION: man = TRAINING ON MANUALLY LABELLED DATA $D_m$, AND auto = TRAINING ON AUTOMATICALLY LABELLED DATA $D_a$.

Trained Statistical Trader vs Opponent           Winning Rate (%)  Victory Points  Offers Made  Successful Offers  Pieces Built
Random (from legal offers) vs Rule-based         20.54             5.89            160.28       150.57             7.45
Rule-based vs Rule-based                         25.00             6.40            147.57       137.95             8.33
Conditional Random Field (man) vs Rule-based     23.31             6.20            141.39       131.89             7.96
Bayesian Network (man) vs Rule-based             24.20             6.20            141.59       131.72             7.98
Random Forest (man) vs Rule-based                27.62             6.54            145.61       135.84             8.50
Conditional Random Field (auto) vs Rule-based    23.10             6.19            152.30       142.61             7.92
Bayesian Network (auto) vs Rule-based            21.30             6.03            155.36       145.48             7.63
Random Forest (auto) vs Rule-based               28.57             6.60            143.34       133.60             8.62

Table IV

ONLINE EVALUATION: GAME RESULTS COMPARING A STATISTICAL CLASSIFIER VS. THREE RULE-BASED TRADERS, I.E. FOUR PLAYERS IN TOTAL IN EACH GAME. EACH LINE SHOWS AVERAGE RESULTS OVER 10,000 TEST GAMES. NOTATION: man = TRAINING ON MANUALLY LABELLED DATA $D_m$, AND auto = TRAINING ON AUTOMATICALLY LABELLED DATA $D_a$.

VII. CONCLUSIONS AND FUTURE DIRECTIONS

The contribution of this paper is a learning approach for trading in strategic conversation, including an evaluation of statistical trading agents trained on manually and automatically labelled data. We trained three statistical agents on manually and automatically labelled data, and then applied statistical inference to compute probabilistic scores for each trading negotiation. The obtained scores were used to rank the available trading negotiations, and the top choice (i.e. the most human-like) was used in the game. In an offline evaluation, the best classification result was obtained by a random forest classifier using automatic labels. In an online evaluation, the best agent (random forest) using automatic labels achieved a winning rate about 1 percentage point higher than its counterpart using manual labels with moderate annotator agreement. This result suggests that training statistical classifiers on automatically labelled data is worth considering, especially if the initially labelled data does not have high inter-annotator agreement. This result is encouraging for training statistical agents from human examples that are difficult to annotate, in order to incorporate trainable behaviour into strategic conversational agents. Future research avenues include:

• training trading agents that take into account richer contextual information such as features from other players, and training them to play multiple games;

• training with other forms of machine learning, as discussed in Section II;

• training agents not just from average players but from expert human traders in multiple domains; and

• evaluating trained agents against human players.

ACKNOWLEDGMENTS

Funding from the European Research Council (ERC) project “STAC: Strategic Conversation” no. 269427 is gratefully acknowledged (see http://www.irit.fr/STAC/). We would also like to thank the following members of the STAC project for helpful discussions: Markus Guhe, Eric Kow, Mihai Dobre, Ioannis Efstathiou, Wenshuo Tang, Verena Rieser, Alex Lascarides, and Nicholas Asher.

REFERENCES

[1] N. Asher and A. Lascarides, “Strategic conversation,” Seman-tics and PragmaSeman-tics, vol. 6, no. 2, pp. 1–62, August 2013. [2] M. McFarlin, “10 great board games

for traders,” Futures Magazine, Oct. 2013, http://www.futuresmag.com/2013/10/02/10-great-board-games-for-traders. [Online]. Avail-able: http://www.futuresmag.com/2013/10/02/ 10-great-board-games-for-traders

[3] I. Szita, G. Chaslot, and P. Spronck, “Monte-Carlo Tree Search in Settlers of Catan,” in Proceedings of the 12th International Conference on Advances in Computer Games, ser. ACG’09. Berlin, Heidelberg: Springer-Verlag, 2010, pp. 21–32.

[4] M. Pfeiffer, “Reinforcement learning of strategies for Settlers of Catan,” in International Conference on Computer Games: Artificial Intelligence, Design and Education, 2004.

[5] K. Georgila and D. Traum, “Reinforcement learning of argumentation dialogue policies in negotiation,” in Proc. of INTERSPEECH, 2011.

[6] I. Efstathiou and O. Lemon, “Learning non-cooperative dialogue behaviours,” in SIGDIAL, 2014.

[7] M. S. Dobre and A. Lascarides, “Online learning and mining human play in complex games,” in IEEE Conference on Computational Intelligence and Games, CIG, 2015.

[8] R. Thomas and K. J. Hammond, “Java settlers: a research environment for studying multi-agent negotiation,” in Intelligent User Interfaces (IUI), 2002, p. 240.


[9] H. Cuayáhuitl, S. Keizer, and O. Lemon, “Learning to trade in strategic board games,” in IJCAI Workshop on Computer Games (IJCAI-CGW), 2015.

[10] G. Tesauro, “Temporal difference learning and TD-gammon,” Commun. ACM, vol. 38, no. 3, pp. 58–68, 1995.

[11] I. Efstathiou and O. Lemon, “Learning to manage risk in non-cooperative dialogues,” in Proc. SEMDIAL, 2014.

[12] S. Keizer, H. Cuayáhuitl, and O. Lemon, “Learning Trade Negotiation Policies in Strategic Conversation,” in Workshop on the Semantics and Pragmatics of Dialogue (goDIAL), 2015.

[13] D. Marcu and A. Echihabi, “An unsupervised approach to recognizing discourse relations,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), 2002, pp. 368–375.

[14] C. Sporleder and A. Lascarides, “Using automatically labelled examples to classify rhetorical relations: an assessment,” Natural Language Engineering, vol. 14, no. 3, pp. 369–416, 2008.

[15] J. Fürnkranz, “Machine learning in games: A survey,” in Machines that Learn to Play Games, Chapter 2. Nova Science Publishers, 2000, pp. 11–59.

[16] T. P. Runarsson and S. M. Lucas, “Preference learning for move prediction and evaluation function approximation in Othello,” IEEE Trans. Comput. Intellig. and AI in Games, vol. 6, no. 3, pp. 300–313, 2014.

[17] C. J. Maddison, A. Huang, I. Sutskever, and D. Silver, “Move Evaluation in Go Using Deep Convolutional Neural Networks,” CoRR, vol. abs/1412.6564, 2014.

[18] N. Asher and A. Lascarides, “Commitments, beliefs and intentions in dialogue,” in Proc. of SemDial, 2008, pp. 35–42.

[19] J. Shim and R. Arkin, “A Taxonomy of Robot Deception and its Benefits in HRI,” in Proc. IEEE Systems, Man, and Cybernetics Conference, 2013.

[20] D. Traum, “Extended abstract: Computational models of non-cooperative dialogue,” in Proc. of SIGdial Workshop on Discourse and Dialogue, 2008.

[21] H. Cuayáhuitl, M. van Otterlo, N. Dethlefs, and L. Frommberger, “Machine learning for interactive systems and robots: A brief introduction,” in Proceedings of the 2nd Workshop on Machine Learning for Interactive Systems: Bridging the Gap Between Perception, Action and Communication, ser. MLIS ’13. New York, NY, USA: ACM, 2013, pp. 19–28.

[22] O. Pietquin and M. Lopez, “Machine learning for interactive systems: Challenges and future trends,” in Proceedings of the Workshop Affect, Compagnon Artificiel (WACAI), 2014.

[23] A. Cadilhac, N. Asher, F. Benamara, and A. Lascarides, “Grounding strategic conversation: Using negotiation dialogues to predict trades in a win-lose game,” in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2013, pp. 357–368.

[24] H. Cuayáhuitl, N. Dethlefs, H. W. Hastie, and O. Lemon, “Barge-in effects in Bayesian dialogue act recognition and simulation,” in 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, Olomouc, Czech Republic, December 8–12, 2013, pp. 102–107.

[25] O. Lemon, “Adaptive natural language generation in dialogue using Reinforcement Learning,” in Proc. of the 12th SEMdial Workshop on the Semantics and Pragmatics of Dialogues, London, UK, June 2008.

[26] N. Dethlefs and H. Cuayáhuitl, “Hierarchical reinforcement learning for situated natural language generation,” Natural Language Engineering, vol. 21, 5 2015.

[27] N. Dethlefs, H. W. Hastie, H. Cuayáhuitl, and O. Lemon, “Conditional random fields for responsive surface realisation using global features,” in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL 2013, 4-9 August 2013, Sofia, Bulgaria, Volume 1: Long Papers, 2013, pp. 1254–1263.

[28] S. Afantenos, N. Asher, F. Benamara, A. Cadilhac, C. Dégremont, P. Denis, M. Guhe, S. Keizer, A. Lascarides, O. Lemon, P. Muller, S. Paul, V. Rieser, and L. Vieu, “Developing a corpus of strategic conversation in The Settlers of Catan,” in Workshop on the Semantics and Pragmatics of Dialogue (SeineDial), Paris, France, 2012, https://hal.inria.fr/hal-00750618.

[29] J. Carletta, “Assessing Agreement on Classification Tasks: The Kappa Statistic,” Computational Linguistics, vol. 22, no. 2, pp. 249–254, 1996.

[30] G. Cooper and E. Herskovits, “A Bayesian method for the induction of probabilistic networks from data,” Machine Learning, vol. 9, no. 4, pp. 309–347, 1992.

[31] F. G. Cozman, “Generalizing variable elimination in Bayesian networks,” in Workshop on Probabilistic Reasoning in Artificial Intelligence, 2000, pp. 27–32.

[32] T. Kudo, “CRF++: Yet another CRF toolkit,” Software available at http://crfpp.sourceforge.net, 2005.

[33] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

[34] T. Hastie, R. Tibshirani, and J. Friedman, The elements of statistical learning: data mining, inference and prediction, 2nd ed. Springer, 2009.

[35] A. Criminisi, J. Shotton, and E. Konukoglu, “Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning,” Foundations and Trends in Computer Graphics and Vision, vol. 7, no. 2-3, pp. 81–227, 2012.

[36] M. Guhe and A. Lascarides, “Game strategies for The Settlers of Catan,” in 2014 IEEE Conference on Computational Intelligence and Games, CIG 2014, Dortmund, Germany, August 26-29, 2014, pp. 1–8.