
Analytical Approaches with Simple RL Models

2.2 Experience-based Learning

2.2.3 Analytical Approaches with Simple RL Models

Many authors have analysed the properties of learning rules and tried to establish conditions under which the actions of players converge to the optimal action (in single-player decision problems) or to an equilibrium (in games). Typically, the convergence proofs rely on stochastic approximation theory. Early work mostly established results for limited classes of games or simple one-player decisions. Only more recent articles (e.g. Beggs 2005; Hopkins and Posch 2005; Gotts et al 2007) state more general results for the boundary behaviour of the process and for larger classes of games.

ER models Some authors have analysed the ER learning rule (Rustichini 1999; Laslier and Walliser 2005; Beggs 2005; Hopkins and Posch 2005) in single-player decision problems and in games.

Rustichini (1999) considers optimality properties of selection rules under full and partial information in a single-player context. Under full information the player knows opponents’ strategies; under partial information it knows only its own actions. He finds that with a linear rule (as in equation (2.3)), convergence to the optimal choice is guaranteed. This is not the case with the exponential rule, which weights differences between payoffs more heavily and thus might speed learning up. Moreover, exponential procedures (as in equation (2.4)) are best in the full information case, but not under partial information: linear learning is too slow in full information environments, so the process is more likely to lock into sub-optimal interior points of the strategy space rather than the optimum.
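The difference between the two rule families can be illustrated with a minimal sketch. Equations (2.3) and (2.4) are not reproduced in this section, so the functional forms below are assumptions rather than the exact rules analysed by Rustichini: a proportional ("linear") rule over positive cumulative payoffs and a logit ("exponential") rule with a hypothetical sensitivity parameter `lam`. The function names and the example numbers are illustrative only.

```python
import math

def linear_choice_probs(cum_payoffs):
    """Choice probabilities proportional to cumulative payoffs (assumed linear form)."""
    total = sum(cum_payoffs)
    return [q / total for q in cum_payoffs]

def exponential_choice_probs(cum_payoffs, lam=1.0):
    """Logit choice probabilities (assumed exponential form); lam is hypothetical."""
    top = max(cum_payoffs)  # subtract the maximum for numerical stability
    weights = [math.exp(lam * (q - top)) for q in cum_payoffs]
    total = sum(weights)
    return [w / total for w in weights]

# Illustrative cumulative payoffs; they must be positive for the linear rule.
cum = [3.0, 1.0]
print(linear_choice_probs(cum))
print(exponential_choice_probs(cum))
```

For the same cumulative payoffs, the exponential form puts more weight on the leading action than the proportional form does, which is one way to read the remark that it weights payoff differences more heavily.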

According to Laslier et al (2001), the cumulative RL problem can be seen as an urn model from which balls are drawn with unequal probability over the repetitions of the game. Describing this process with ordinary differential equations (ODE), they first analyse the resulting stochastic process for single-player situations and show that the process converges to choosing only payoff-maximising actions. For 2x2 games they state that the ER rule converges with positive probability to a Nash equilibrium. If the game has two pure equilibria, the process converges with positive probability to either of them, but not to a mixed equilibrium. However, they cannot prove that the process converges with probability 1.
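The urn interpretation can be sketched for a single player as follows. This is a minimal illustration under stated assumptions, not the model of Laslier et al: payoffs are taken to be deterministic and positive, each action starts with the same number of balls, and the function name `simulate_urn` and all parameter values are hypothetical.

```python
import random

def simulate_urn(payoffs, rounds=5000, initial_balls=1.0, seed=0):
    """Draw actions in proportion to ball counts and add the payoff as new balls."""
    rng = random.Random(seed)
    balls = [initial_balls] * len(payoffs)
    for _ in range(rounds):
        # An action is drawn with probability proportional to its ball count.
        action = rng.choices(range(len(payoffs)), weights=balls)[0]
        # Cumulative reinforcement: the received payoff is added to the urn.
        balls[action] += payoffs[action]
    total = sum(balls)
    return [b / total for b in balls]  # current choice probabilities

shares = simulate_urn([1.0, 2.0])
print(shares)
```

Because the number of balls grows with every draw, each new payoff moves the choice probabilities less and less. This shrinking step size is what makes stochastic approximation arguments applicable to processes of this kind.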

Building on stochastic approximation theory, Beggs (2005) considers 2x2 constant-sum games with unique pure or mixed equilibria and generalises Laslier et al (2001). Players using RL cannot be forced permanently below their minimax payoff, regardless of their opponent’s strategy. Similarly, dominated strategies are always eliminated over time. If both players use RL, the probability that both converge to the unique equilibrium tends towards 1.

Hopkins and Posch (2005) provide more general results about the relationship between RL processes and the well-analysed replicator dynamics from evolutionary game theory (Smith 1982). They find that Arthur’s model (Arthur 1993) as well as ER-type models converge only to those boundary points that are Nash equilibria. This is easier to show for the Arthur model because the action strength updates (step sizes) are of the same size, while the reinforcements in ER can change at different rates. They also show that RL will not converge to boundary points that are linearly unstable under the replicator dynamics.
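For reference, the standard single-population form of the replicator dynamics is given below. The notation is an assumption made here, since the section does not state the equation: x_i is the probability (or population share) of action i, u_i(x) its expected payoff at state x, and \bar{u}(x) the average payoff.

```latex
\dot{x}_i = x_i \left( u_i(x) - \bar{u}(x) \right),
\qquad
\bar{u}(x) = \sum_j x_j \, u_j(x)
```

An action with x_i = 0 stays at zero under these dynamics, so every pure-strategy profile is a rest point whether or not it is a Nash equilibrium. This is why stability properties, rather than stationarity alone, are needed to rule out boundary points.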

Averaging models In PA, a decision maker repeatedly faces an identical decision problem. Players assess expected payoffs myopically, estimating the expected payoff of each action from its average return. They choose the action with the highest expected payoff (i.e. choice is deterministic). Sarin and Vahid (1999) show that this model converges to choosing the objective maximin strategy if learning is slow. If players are more likely to experiment, they converge to the strategy yielding the maximum possible payoff.
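A minimal sketch of such an averaging process is given below. It assumes that the assessment of the chosen action is moved towards the realised payoff by a fixed weight `step`; this is one common way of forming a running average and is not necessarily the exact specification of Sarin and Vahid. The function name, the payoff functions and the initial assessments are hypothetical.

```python
import random

def run_averaging(payoff_fns, initial_assessments, rounds=1000, step=0.1, seed=0):
    """Greedy choice on payoff assessments; only the chosen action is updated."""
    rng = random.Random(seed)
    assess = list(initial_assessments)
    counts = [0] * len(assess)
    for _ in range(rounds):
        # Deterministic choice: the action with the highest current assessment.
        action = max(range(len(assess)), key=lambda i: assess[i])
        payoff = payoff_fns[action](rng)
        # Weighted average of the old assessment and the new payoff.
        assess[action] = (1.0 - step) * assess[action] + step * payoff
        counts[action] += 1
    return assess, counts

# Illustrative deterministic payoffs of 1 and 2, with optimistic initial assessments.
fns = [lambda rng: 1.0, lambda rng: 2.0]
assess, counts = run_averaging(fns, [5.0, 5.0])
print(assess, counts)
```

In this example the optimistic initial assessments cause both actions to be tried until the assessment of the first action falls below the payoff of the second. With pessimistic initial assessments the process would stay with whichever action is chosen first, since an unchosen action is never re-evaluated.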

Aspiration level models The reinforcement problem in aspiration level models has also been studied by several authors and has been surveyed in depth by Bendor et al (2001a). Here, some representatives of this approach are described.

Gilboa and Schmeidler (1996) present a case-based reasoning (CBR) approach. The decision maker faces a number of different situations or ‘states’ and must make a choice in each of them. In dynamic environments, aspiration level (AL) updating rules have to be ambitious enough to search for the best result in various situations. In more static environments, they must be realistic, i.e. close to actual payoffs. Both properties must be combined, so that the decision maker first searches ambitiously for the best strategy and then sticks to this choice once the expected values of the strategies can be estimated. They show that under these conditions, a case-based decision maker can learn to become an expected-utility maximiser.

Extending their work on RL with fixed AL, Boergers and Sarin (1997) develop a model with endogenous aspirations and cumulative rewards. In Boergers and Sarin (2000), a single player chooses between two strategies. They show that the process can converge to the optimal choice. Endogenous aspiration levels improve performance by avoiding high dissatisfaction with even the best available strategies, but can lead to probability matching. Under probability matching, each strategy is played with the same probability with which it generates benefits, whereas the optimal strategy should be played with probability close to 1 for behaviour to be considered ‘rational’. This can happen when the initial aspiration levels are too high, so that even dynamic adaptation of the aspiration level cannot lead to a lock-in.
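The cost of probability matching can be made concrete with a small worked example. The numbers are hypothetical: suppose that in each round exactly one of the two strategies generates a benefit, strategy 1 with probability 0.7 and strategy 2 with probability 0.3.

```latex
\Pr(\text{benefit} \mid \text{matching}) = 0.7 \cdot 0.7 + 0.3 \cdot 0.3 = 0.58
\qquad
\Pr(\text{benefit} \mid \text{always strategy 1}) = 0.7
```

Under these assumptions a player who matches probabilities obtains a benefit in 58% of the rounds, compared with 70% for a player who always chooses the better strategy.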

While Boergers and Sarin and Gilboa and Schmeidler establish results for single-player decision problems, other authors extend the results to games. Karandikar et al (1998) first analysed a prisoner’s dilemma. The aspiration levels of both players are updated simultaneously with the received reward and approximate long-run averages. The main result is that cooperation is sustained if there are no trembles (i.e., externally imposed changes or noise) to the ALs and the speed of updating the ALs is low. Introducing perturbations into the AL changes the process and may lead to different equilibria. However, in the long run, the process returns to the cooperation path. The intuition behind these results is that mutual dissatisfaction with non-cooperative payoffs triggers experimentation until some state is reached that yields high enough satisfaction (the point where AL and current payoff converge).
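The intuition can be illustrated with a stylised sketch of aspiration-based switching without trembles. It is not the model of Karandikar et al: the payoff values, the switching probability (here simply the shortfall below the aspiration, capped at 1), the updating speed and all names are assumptions chosen for illustration.

```python
import random

# Hypothetical prisoner's dilemma payoffs for (own action, other's action).
PAYOFF = {("C", "C"): 3.0, ("C", "D"): 0.0, ("D", "C"): 4.0, ("D", "D"): 1.0}

def next_action(action, payoff, aspiration, rng):
    """Keep a satisfactory action; otherwise switch with probability equal to the shortfall."""
    if payoff >= aspiration:
        return action
    p_switch = min(1.0, aspiration - payoff)
    if rng.random() < p_switch:
        return "D" if action == "C" else "C"
    return action

def update_aspiration(aspiration, payoff, speed=0.1):
    """Move the aspiration level a small step towards the received payoff."""
    return aspiration + speed * (payoff - aspiration)

def simulate(actions, aspirations, rounds=200, seed=0):
    rng = random.Random(seed)
    actions, aspirations = list(actions), list(aspirations)
    for _ in range(rounds):
        payoffs = [PAYOFF[(actions[0], actions[1])],
                   PAYOFF[(actions[1], actions[0])]]
        # Both players react to the current payoff, then update their aspirations.
        actions = [next_action(a, p, asp, rng)
                   for a, p, asp in zip(actions, payoffs, aspirations)]
        aspirations = [update_aspiration(asp, p)
                       for asp, p in zip(aspirations, payoffs)]
    return tuple(actions), tuple(aspirations)

print(simulate(("D", "D"), (2.0, 2.0)))
```

Starting from mutual defection with aspirations above the defection payoff, both players are dissatisfied and switch. Mutual cooperation then pays more than either aspiration, so neither player has a reason to move, and the aspirations drift towards the cooperative payoff.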

The model of Karandikar et al (1998) is modified and extended to arbitrary games and a larger class of learning rules in Bendor et al (2001b). Similarly, Napel (2003) applies the model to an ultimatum game and shows that in the long run players almost surely reach an equilibrium state. Which equilibrium is reached depends on the initial conditions and the stability of aspirations, which are allowed to vary randomly. If such trembles are rare and learning is slow, the available surplus will be shared efficiently. If there are perturbations in the aspiration level, any equilibrium is supported.

Gotts et al (2007) look at the behaviour of the BM rule with aspirations in a prisoner’s dilemma, generalising earlier insights of Flache and Macy (2002). They show that the system has two attractors: either a mixed-strategy equilibrium (a so-called self-correcting equilibrium, SCE) or a state in which both players cooperate with probability 1. If learning is slow, the system converges in the long run to cooperation. In the medium run, however, the process moves towards the SCE. RL can thus exhibit very different results depending on the length of the period considered.
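One common way of writing a BM update with a fixed aspiration level is sketched below. The exact specification used by Gotts et al is not given in this section, so the stimulus scaling, the learning rate `rate`, the payoff values and the function names are assumptions made for illustration.

```python
def stimulus(payoff, aspiration, all_payoffs):
    """Scale the gap between payoff and aspiration into the range [-1, 1]."""
    # Assumes at least one possible payoff differs from the aspiration level.
    scale = max(abs(p - aspiration) for p in all_payoffs)
    return (payoff - aspiration) / scale

def bm_update(p_coop, action, s, rate):
    """Return the new cooperation probability after one Bush-Mosteller step."""
    # Work with the probability of the action that was actually taken.
    p_taken = p_coop if action == "C" else 1.0 - p_coop
    if s >= 0:
        p_taken += (1.0 - p_taken) * rate * s  # satisfied: reinforce the action
    else:
        p_taken += p_taken * rate * s          # dissatisfied: inhibit the action
    return p_taken if action == "C" else 1.0 - p_taken

# Hypothetical prisoner's dilemma payoffs and aspiration level.
payoffs = [4.0, 3.0, 1.0, 0.0]
s = stimulus(3.0, 2.0, payoffs)      # mutual cooperation exceeds the aspiration
print(bm_update(0.5, "C", s, 0.5))   # cooperation becomes more likely
```

With these numbers, mutual cooperation yields a positive stimulus and mutual defection a negative one, so both outcomes push the cooperation probability upwards. Unilateral defection, by contrast, rewards the defector and pulls the probability down again.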