The Temporal Difference (T.D.) Model and the

3.1 Models of Classical Conditioning

3.1.5 The Temporal Difference (T.D.) Model and the

As the field of conditioning models progressed, it began to be influenced by ideas from the artificial intelligence community. This is apparent in the Temporal Difference model, a real-time model presented by Sutton & Barto (1987; 1990). The model was an application of the T.D. method of reinforcement learning, which Sutton (1984; 1988) developed within artificial intelligence as a way of assigning credit for a reward or punishment to the prior actions taken by an agent. This method is reviewed extensively later in this chapter.

The T.D. method of machine learning was itself influenced by, and developed from, an earlier real-time model of classical conditioning by Sutton & Barto that became known as the S.B. model (Sutton & Barto, 1981), a model based on the ideas of Klopf (1972). A short overview of Klopf’s work forms part of section 3.1.6.

Sutton & Barto (1990) described the S.B. model differently from the original paper, in a way that makes the model’s operation easier to understand. The new description placed the S.B. model in the context of an observation about a large set of models of classical conditioning: many of them have the functional form shown in equation 3.7.

$$\Delta V = \text{Reinforcement} \times \text{Eligibility} \tag{3.7}$$

As usual, ∆V represents the change in association strength. “Reinforcement” is defined loosely as the level of unconditioned stimulus processing. “Eligibility” is defined, equally loosely, as the level of conditioned stimulus processing. Sutton & Barto argued that many models focus primarily on one part or the other, but rarely on both. For example, in the Rescorla-Wagner model the α part of the formula corresponds to eligibility and the β(λ − V_AB) part corresponds to reinforcement, and the model can be argued to concentrate primarily on the reinforcement side of the function. An example of a model that deals primarily with eligibility is Mackintosh’s attention model.
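The decomposition can be illustrated with a minimal Python sketch. It assumes the usual single-trial form of the Rescorla-Wagner update, ∆V_A = α_A β(λ − V_AB); the function name, argument names and numeric values are illustrative and do not come from the source.

```python
def rescorla_wagner_delta(alpha, beta, lam, v_total):
    """Split one Rescorla-Wagner update into the two factors of equation 3.7.

    alpha   -- salience of the conditioned stimulus (the eligibility factor)
    beta    -- learning rate associated with the unconditioned stimulus
    lam     -- maximum association strength the unconditioned stimulus supports
    v_total -- combined association strength of all stimuli present (V_AB)
    """
    eligibility = alpha                      # conditioned stimulus side
    reinforcement = beta * (lam - v_total)   # unconditioned stimulus side
    return reinforcement, eligibility, reinforcement * eligibility


# Illustrative values only.
reinforcement, eligibility, delta_v = rescorla_wagner_delta(
    alpha=0.5, beta=0.2, lam=1.0, v_total=0.0
)
print(reinforcement, eligibility, delta_v)
```

In this reading the eligibility factor is a constant, which is one way of seeing why the Rescorla-Wagner model can be said to concentrate on the reinforcement side.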

In the S.B. model, both parts of the function were used extensively. The reinforcement part used an equation that Sutton & Barto later named the Ẏ theory (pronounced “Y dot”) (Sutton & Barto, 1990). In the Ẏ theory, every stimulus S produces a reinforcement of +V_S on onset, −V_S on offset and zero at all other times. The value V_S is the association strength of the stimulus. The unconditioned stimulus has a fixed, positive association strength with itself; all other stimuli start with an association strength of zero. Time is assumed to pass in small increments. The function Ẏ(t) is defined as the sum of the reinforcement values produced by all stimuli at time t. The resulting value of Ẏ(t) is then used as the reinforcement part of the ∆V equation.
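The onset and offset rule can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: each stimulus is represented by a 0/1 presence list, every stimulus is treated as absent before the first time-step, and the names `y_dot`, `presence` and `strengths` are hypothetical.

```python
def y_dot(presence, strengths, t):
    """Sum the onset/offset reinforcement of every stimulus at time-step t.

    presence  -- dict mapping stimulus name to a list of 0/1 presence values
    strengths -- dict mapping stimulus name to its association strength V_S
    """
    total = 0.0
    for name, x in presence.items():
        previous = x[t - 1] if t > 0 else 0   # assumed absent before t = 0
        # x[t] - previous is +1 on onset, -1 on offset and 0 otherwise.
        total += strengths[name] * (x[t] - previous)
    return total


# Illustrative trial: a conditioned stimulus followed by an unconditioned one.
presence = {"CS": [0, 1, 1, 1, 0, 0], "US": [0, 0, 0, 1, 1, 0]}
strengths = {"CS": 0.0, "US": 1.0}   # only the US starts with non-zero strength
print([y_dot(presence, strengths, t) for t in range(6)])
```

With these values the only non-zero reinforcements occur at the onset and offset of the unconditioned stimulus, because the conditioned stimulus still has an association strength of zero.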

For the eligibility part of the S.B. model, Sutton & Barto used the concept of an eligibility trace, first developed by Klopf (1972). An eligibility trace is a time-dependent function that describes the eligibility of a given stimulus in relation to the timing of its presentation. The eligibility trace used by the S.B. model builds while the stimulus is present and decays once it is removed. To do this, the S.B. model represents the presence or absence of a conditioned stimulus S by a variable X_S(t), which is one at time t when the stimulus is present and zero otherwise. The eligibility trace X̄_S(t) at time t is then defined as a running average of the values of X_S(t), as shown in equation 3.8.

$$\bar{X}_S(t + 1) = \bar{X}_S(t) + \delta \left[ X_S(t) - \bar{X}_S(t) \right] \tag{3.8}$$

where δ is the weighting placed on the present value of X_S(t) relative to its past values. The S.B. model then puts these two components, the Ẏ theory and the eligibility trace, together to form a single update equation applied to each stimulus S at time t, as shown in equation 3.9.
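The behaviour of the eligibility trace can be sketched in Python. The sketch assumes the running-average form in which the trace moves a fraction δ of the way from its current value towards the current presence value at each time-step, starting from zero; the function name and the example values are illustrative.

```python
def eligibility_trace(x, delta, initial=0.0):
    """Return the running-average trace of a 0/1 presence sequence.

    The returned list has len(x) + 1 entries: entry 0 is the initial trace
    and entry t + 1 is the trace after the presence value x[t] is absorbed.
    """
    trace = [initial]
    for x_t in x:
        # Move a fraction delta of the way from the old trace towards x_t.
        trace.append(trace[-1] + delta * (x_t - trace[-1]))
    return trace


# Stimulus present for three time-steps, then absent for two.
print(eligibility_trace([1, 1, 1, 0, 0], delta=0.5))
```

The trace rises towards one while the stimulus is present and decays towards zero afterwards, which matches the build and decay behaviour described above.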

$$\Delta V_S = \beta \dot{Y} \times \alpha_S \bar{X}_S \tag{3.9}$$
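A single trial of the S.B. model with one conditioned stimulus might be sketched as below. This is a simplified reading of the update equation, not the authors' code: the ordering of the trace update relative to the strength update, the fixed unconditioned stimulus strength `v_us`, and all parameter values are assumptions made for the example.

```python
def sb_trial(v_cs, cs, us, v_us=1.0, alpha=1.0, beta=0.1, delta=0.2):
    """Run one trial and return the updated association strength of the CS.

    cs, us -- equal-length lists of 0/1 presence values per time-step
    v_us   -- fixed association strength of the unconditioned stimulus
    """
    trace = 0.0               # eligibility trace of the CS
    prev_cs = prev_us = 0     # both stimuli assumed absent before the trial
    for x_cs, x_us in zip(cs, us):
        # Reinforcement: +V on onset and -V on offset, summed over stimuli.
        y_dot = v_cs * (x_cs - prev_cs) + v_us * (x_us - prev_us)
        # Update: reinforcement multiplied by eligibility.
        v_cs += beta * y_dot * alpha * trace
        # The trace absorbs the current presence value for the next step.
        trace += delta * (x_cs - trace)
        prev_cs, prev_us = x_cs, x_us
    return v_cs


# CS for four time-steps, US starting as the CS stops (illustrative timing).
cs = [0, 1, 1, 1, 1, 0, 0, 0]
us = [0, 0, 0, 0, 0, 1, 1, 0]
print(sb_trial(0.0, cs, us))
```

In this sketch the onset of the unconditioned stimulus raises the association strength of the conditioned stimulus, and its offset lowers it by a smaller amount because the trace has decayed in the meantime.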

The success of the S.B. model was that it predicted all of the phenomena covered by the Rescorla-Wagner model, was also able to deal with inter-stimulus-interval (I.S.I.) effects, and predicted the existence of the temporal primacy effect, which was subsequently confirmed experimentally. However, two major problems arose with the model. Firstly, when the I.S.I. is very short and the stimuli are short (i.e. only a few time-steps long) and overlap, the association gained becomes inhibitory. This prediction had been disconfirmed experimentally before the model was published. The problem went unnoticed for some time because the model had only been tested with stimuli that were active for many more time-steps.

The second problem arises in trials where the conditioned stimulus continues for a variable length of time but the unconditioned stimulus always starts as the conditioned stimulus stops. The observed experimental effect is that the association strength between the two stimuli falls as the duration of the conditioned stimulus increases. The S.B. model, however, predicts that the duration of the conditioned stimulus does not affect the strength of association in this type of trial.

There were a number of attempts to rectify both of these problems, which are described by Sutton & Barto (1990). However, none of the modifications to the theory was completely satisfactory. With this in mind, Sutton & Barto proposed a model that, while sharing some similarities with the S.B. model, has a very different basis. This model became known as the Temporal Difference (T.D.) model of conditioning.

As described above, the T.D. method is a solution to the problem of assigning credit for present rewards to the correct past actions. The T.D. model does this by attempting to predict an imminence-weighted sum of all future unconditioned stimuli.

When attempting to predict future unconditioned stimuli, one would ideally wish to predict all of them so as to apply them to the current actions or stimuli; however, this becomes increasingly difficult as the prediction reaches further into the future. Therefore, at a given time-step, the prediction should be weighted most towards the next time-step, slightly less towards the time-step after that, and so on. This means that as an unconditioned stimulus becomes more imminent, the prediction that it will happen at the next time-step should be greater. It also means that for unconditioned stimuli that last more than one time-step, when the current time-step falls in the middle of an unconditioned stimulus, the strength of prediction should be in rough accordance with how many time-steps of the stimulus remain.

Algebraically, Sutton & Barto (1990) expressed this in the formula shown in equation 3.10.

$$V_t = \lambda_{t+1} + \gamma \lambda_{t+2} + \gamma^2 \lambda_{t+3} + \gamma^3 \lambda_{t+4} + \cdots \tag{3.10}$$

where V_t is the prediction made at time t, λ_t denotes the intensity of the unconditioned stimulus at time t, and γ is the imminence weighting, 0 ≤ γ < 1, with smaller values giving greater weight to more immediate unconditioned stimuli.
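The imminence-weighted sum can be computed directly for a finite sequence of unconditioned stimulus intensities. The Python sketch below assumes that the intensity is zero beyond the end of the list, so the infinite sum is truncated; the function name and the example sequence are illustrative.

```python
def ideal_prediction(us_levels, t, gamma):
    """Return the imminence-weighted sum of US intensities after time-step t.

    us_levels -- list of US intensities, one per time-step
    gamma     -- imminence weighting, with 0 <= gamma < 1
    """
    total = 0.0
    weight = 1.0                      # weight 1 for step t + 1
    for lam in us_levels[t + 1:]:
        total += weight * lam
        weight *= gamma               # one more factor of gamma per step
    return total


# A single unit-intensity US at time-step 3 (illustrative values).
us_levels = [0, 0, 0, 1, 0]
print([ideal_prediction(us_levels, t, gamma=0.5) for t in range(5)])
```

The prediction grows as the unconditioned stimulus approaches and drops to zero once nothing remains to be predicted.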

Through algebraic manipulation, this can be written in the form shown in equation 3.11.

$$V_t = \lambda_{t+1} + \gamma V_{t+1} \tag{3.11}$$
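The manipulation can be spelled out as follows: factor γ out of every term after the first, and note that the bracketed series is the definition of the prediction in equation 3.10 with t replaced by t + 1.

```latex
\begin{aligned}
V_t &= \lambda_{t+1} + \gamma \lambda_{t+2} + \gamma^2 \lambda_{t+3} + \gamma^3 \lambda_{t+4} + \cdots \\
    &= \lambda_{t+1} + \gamma \left( \lambda_{t+2} + \gamma \lambda_{t+3} + \gamma^2 \lambda_{t+4} + \cdots \right) \\
    &= \lambda_{t+1} + \gamma V_{t+1}
\end{aligned}
```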

This formula denotes the ideal level of prediction at any given time-step. Therefore, the discrepancy between the current prediction and what it should ideally be is the level of reinforcement that should be provided on any particular time-step, as shown in equation 3.12.

$$\text{Reinforcement} = \lambda_{t+1} + \gamma V_{t+1} - V_t \tag{3.12}$$

This can then be used in place of the Ẏ theory of the S.B. model to provide the association strength update formula for the T.D. model, as shown in equation 3.13.

$$\Delta V_i = \beta \left( \lambda_{t+1} + \gamma V_{t+1} - V_t \right) \times \alpha_i \bar{X}_i \tag{3.13}$$

Sutton & Barto showed that this model is able to predict all the same phenomena as the S.B. model without the problems that were encountered with the S.B. model. However, the model is not able to predict several classes of phenomena, many of which have been discussed by Sutton & Barto (1990). The phenomena discussed include configural cues, overshadowing and sensory preconditioning.
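The T.D. update of equation 3.13 can be sketched in Python for a single conditioned stimulus. Several details here are assumptions that the text above does not specify: the prediction at each time-step is taken to be the association strength multiplied by the presence value of the stimulus, both predictions in the reinforcement term use the current association strength, and the trace is updated before the strength. The function name and parameter values are illustrative.

```python
def td_trial(w, cs, us, alpha=1.0, beta=0.1, gamma=0.9, delta=0.2):
    """Run one trial of a single-stimulus T.D. update and return the strength.

    w      -- association strength of the CS at the start of the trial
    cs, us -- equal-length lists: CS presence (0/1) and US intensity
    """
    trace = 0.0
    for t in range(len(cs) - 1):
        v_now = w * cs[t]              # assumed form of the prediction V_t
        v_next = w * cs[t + 1]         # prediction V_{t+1}, same strength
        # Discrepancy between the ideal and the current prediction.
        reinforcement = us[t + 1] + gamma * v_next - v_now
        # Running-average eligibility trace, absorbing the presence at t.
        trace += delta * (cs[t] - trace)
        w += beta * reinforcement * alpha * trace
    return w


# CS for three time-steps, US starting as the CS stops (illustrative timing).
cs = [0, 1, 1, 1, 0, 0]
us = [0, 0, 0, 0, 1, 0]
w = 0.0
for trial in range(3):
    w = td_trial(w, cs, us)
    print(trial, w)
```

Under these assumptions the association strength rises over repeated trials, since the reinforcement at the arrival of the unconditioned stimulus outweighs the small negative reinforcement produced while the conditioned stimulus is merely sustained.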

It is also believed that the T.D. model would not be able to predict the pre-exposure effects of latent inhibition and the U.S. pre-exposure effect. The reasoning for this claim is that there is no state in the model that can record the non-pairings of stimuli that would allow for pre-exposure effects to be included.

In document The Application of Classical Conditioning to the Machine Learning of a Commonsense Knowledge of Visual Events (Page 69-72)