Neural Correlates of Gains and Losses

4 Formal Models of Associative Learning

4.3 Adaptive Learning Rate Methods

Similar to the Pearce-Hall learning rule mentioned earlier in this chapter, which updates associability with a dynamic learning rate, early studies in neural network research showed that using a fixed learning rate in neural networks has certain disadvantages when the error surface changes more sharply along one weight than along the others (ravines) (Jordan, 1988). Because of this disadvantage, various adaptive learning rate methods were developed in order to improve the convergence and speed of learning (Jordan, 1988). It has been shown that dynamic learning rate methods not only perform better than fixed learning rate methods in stationary problems but are also better in non-stationary problems, where the optimal solution to a problem changes over time (Sutton; Behrens; O’Doherty et al., 2006). In fact, there are many ways to estimate trial-by-trial changes in learning rate. State-space models (Smith et al., 2004), the moving average technique (Eichenbaum et al., 1986), information theoretic techniques such as Kullback-Leibler divergence (Haruno et al., 2004), filtering algorithms such as the Kalman Filter (Kakade & Dayan, 2002), Bayesian learning methods (Fahrmeir and Tutz, 2001; Behrens et al., 2007), and fixed-number of consecutive correct responses models (Fox et al., 2003; Stefani et al., 2003) are only a few of the dynamic learning rate estimation algorithms. Because of this huge variety in dynamic learning rate techniques, the following sections summarize only two of the most commonly used techniques: Kalman Filtering and the Incremental-Delta-Bar-Delta algorithm.

4.3.1 Kalman Filter

The Kalman Filter is a powerful mathematical method developed to solve Wiener problems (named in honour of Norbert Wiener), that is, to estimate a signal from noisy observations of a continuous stochastic process (Kalman, 1960). In the last ten years a number of studies suggested that the cerebellum and the hippocampus carry out computations similar to a Kalman Filtering algorithm (Paulin, 1986; Bousquet et al., 1998). Moreover, evidence suggested that Kalman Filtering might occur in sensory processing and behavioral conditioning (Kakade & Dayan, 2000; Kakade, Dayan, Montague, 2001; Dayan & Yu, 2003).

The goal of the Kalman Filter is to predict the true value of the state (or signal) when the measurements are noisy. The basic idea behind the theory can be demonstrated with a simple example of dead reckoning. Consider a driver who wants to estimate the precise location of his car using the global positioning system (GPS). In such a case the observations from the GPS will be noisy, showing the car a couple of meters away from the place where it actually is. The GPS might give him noisy measurements for many reasons, but probably most importantly because of the driving speed and the maneuvers he is making. Therefore, to estimate the true position of the car with a Kalman Filter, we need the speed and wheel direction of the car and combine this information with the initial noisy position observed from the GPS signal.

Daw et al. (2006) applied this simple idea to a multi-armed bandit problem in which the participants had to learn to allocate their choices between different bandits in order to earn the maximum amount of money. In their experiment the payoffs for each bandit were drawn from independent Gaussians with pre-determined mean and variance, such that the mean rewards for some bandits were better than for others. Secondly, the rewarding outcome from each bandit diffused from trial to trial according to a Gaussian random walk. Given a prior on the mean reward value and on the variance of the outcome, the Kalman Filter updates the posterior mean payoff by using the following equation:

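$$\hat{\mu}^{post}_{i,t} = \hat{\mu}^{pre}_{i,t} + \kappa_t \, \delta_t$$

[4.19]

In equation 4.19, $\hat{\mu}^{post}_{i,t}$ refers to the updated mean reward of a particular bandit $i$ on trial $t$ and $\hat{\mu}^{pre}_{i,t}$ is the prior mean reward of that bandit. The prediction error signal $\delta_t$ equals the difference between the reward outcome $r_t$ in a trial and the prior mean reward, as follows:

$$\delta_t = r_t - \hat{\mu}^{pre}_{i,t}$$

[4.20]

In the Kalman Filter the learning rate $\kappa_t$, which is also called the Kalman gain, is calculated by the following equation, where $\hat{\sigma}^{2\,pre}_{i,t}$ is the prior variance of the mean payoff estimate and $\sigma_o^{2}$ is the variance of the payoff noise:

$$\kappa_t = \frac{\hat{\sigma}^{2\,pre}_{i,t}}{\hat{\sigma}^{2\,pre}_{i,t} + \sigma_o^{2}}$$

[4.21]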

Note that during parameter estimation the initial mean payoff and the initial standard deviation are the first two free parameters in the model; they play roles similar to the initial GPS signal and the speed, respectively, in the dead reckoning problem above. The variance of the payoff of the chosen bandit is also updated, by separate functions. In addition, the Kalman Filter assumes that the subject believes that the outcomes of the bandits might vary over time and are governed by a Gaussian random walk, which adds further free parameters to the system. Overall this makes six free parameters (Daw et al., 2006). In conclusion, the Kalman Filter is a powerful algorithm and has been used in fMRI research, but because of its large number of free parameters and initial assumptions we do not think it is suitable as a biologically plausible account of all reinforcement learning situations.
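The update for a single chosen bandit can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the model fitted by Daw et al. (2006): the names `BanditBelief`, `kalman_update`, `obs_var` and `diffusion_var` are hypothetical, the posterior variance uses the standard Kalman form (1 - gain) times the prior variance, and the random walk is represented only by adding a diffusion variance to the next prior.

```python
from dataclasses import dataclass


@dataclass
class BanditBelief:
    mean: float  # prior mean payoff of the bandit
    var: float   # prior variance of that mean estimate


def kalman_update(belief, reward, obs_var, diffusion_var=0.0):
    """One trial of Kalman filtering for the chosen bandit (illustrative)."""
    gain = belief.var / (belief.var + obs_var)   # Kalman gain (learning rate)
    delta = reward - belief.mean                 # prediction error
    post_mean = belief.mean + gain * delta       # posterior mean payoff
    post_var = (1.0 - gain) * belief.var         # assumed standard variance update
    # Assumed random walk: uncertainty grows again before the next trial.
    next_belief = BanditBelief(post_mean, post_var + diffusion_var)
    return next_belief, gain, delta


belief = BanditBelief(mean=50.0, var=16.0)
gains = []
for reward in [60.0, 58.0, 62.0, 61.0]:
    belief, gain, delta = kalman_update(belief, reward, obs_var=16.0)
    gains.append(gain)
```

With no diffusion the gain shrinks on every trial, so the filter behaves like a running average; a positive `diffusion_var` keeps the gain from decaying to zero, which is what lets the estimate track a payoff that drifts over time.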

4.3.2 Incremental-Delta-Bar-Delta Algorithm

One such algorithm that uses adaptive learning rates is the Incremental-Delta-Bar-Delta (IDBD) algorithm. The IDBD algorithm was first introduced by Sutton (1992) and is an extension of the previous delta-bar-delta learning algorithm (Jordan, 1988).

IDBD is a meta-learning algorithm in the sense that it learns not only the weights in a network (such as the values of stimuli or actions) but also the learning rate itself.

In the IDBD algorithm the learning rate is given by the following equation:

$$\alpha_i(t+1) = e^{\beta_i(t+1)}$$

[4.22]

In the above equation $\alpha_i$ indicates the learning rate for input $i$ and $\beta_i$ is an additional memory parameter that is itself modified using another function. The term $\beta_i$ is updated as follows:

$$\beta_i(t+1) = \beta_i(t) + \theta \, \delta(t) \, x_i(t) \, h_i(t)$$

[4.23]

In equation 4.23, $\theta$ is a positive constant, the meta-learning rate, $\delta(t)$ is the prediction error, $x_i(t)$ is the $i$-th input, and $h_i$ is a decaying memory trace that keeps a record of previous weight changes. The aim of learning is to minimize the squared error $\delta^2(t)$; $h_i$ is thus a decaying trace of the cumulative sum of recent changes to the weights. The basic learning rule for updating the weights $w_i$ is as follows:

$$w_i(t+1) = w_i(t) + \alpha_i(t+1) \, \delta(t) \, x_i(t)$$

[4.24]
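The three updates can be combined into one incremental step, sketched below in Python. The function name `idbd_step`, the toy problem and all parameter values are illustrative assumptions. The update of the trace `h` is not given in the text above; the form used here follows the usual statement of IDBD (Sutton, 1992) and should be checked against that source.

```python
import math
import random


def idbd_step(w, beta, h, x, target, theta):
    """One incremental IDBD update for a linear unit (illustrative sketch)."""
    prediction = sum(wi * xi for wi, xi in zip(w, x))
    delta = target - prediction                    # prediction error
    for i, xi in enumerate(x):
        beta[i] += theta * delta * xi * h[i]       # meta-learning step on beta
        alpha = math.exp(beta[i])                  # per-input learning rate
        w[i] += alpha * delta * xi                 # weight update
        # Assumed trace update: decay the old trace, add the latest weight change.
        h[i] = h[i] * max(0.0, 1.0 - alpha * xi * xi) + alpha * delta * xi
    return delta


# Toy problem (hypothetical): the target depends only on the first input.
random.seed(0)
w = [0.0, 0.0]
beta = [math.log(0.05), math.log(0.05)]  # initial learning rate of 0.05 per input
h = [0.0, 0.0]
for _ in range(2000):
    x = [random.choice([-1.0, 1.0]), random.choice([-1.0, 1.0])]
    idbd_step(w, beta, h, x, target=2.0 * x[0], theta=0.01)
```

Each input keeps its own learning rate, and the only quantity set by hand besides the initial values is the meta-learning rate `theta`.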

The advantage of the IDBD algorithm over its predecessors such as the delta-bar-delta algorithm (Jordan, 1988) is that it has only one free parameter, the meta-learning rate, and that it works with incremental training of inputs rather than batch training. It has been shown that the IDBD algorithm outperforms the least-mean-square (LMS) algorithm and is as good as the Kalman Filter algorithm in a benchmark problem (Sutton, 1992). For example, Sutton (1982) suggested an alternative adaptive learning rate framework, showing that even though the learning rate in the Rescorla-Wagner model can model the negative acceleration of learning, it cannot capture choice switches in a probabilistic reversal learning task (Glascher et al., 2009).