B.2.2 The Negative Evidence Function
3.1 TD(λ)
S: The set of states in the world.
V : The tabular state value function.
π: The tabular state value function.
λ: The tabular state value function.
γ: The tabular state value function.
1: Initialise V (s) arbitrarily for all s ∈ S
2: Repeat (for each agent):
3: e (s) ← 0 for all s ∈ S
4: Initialise s
5: Repeat (for each step of the agent’s lifetime):
6: a ← action given by π for s
7: Take action a, observe reward r and next state s′
8: δ ← r + γV (s′) − V (s)
9: e (s) ← e (s) + 1
10: Repeat (for each s′′∈ S):
11: V (s′′) ← V (s′′) − αδe (s′′)
12: e (s′′) ← γλe (s′′)
13: s ← s′
The new symbol (λ ≥ 0) is a parameter of the algorithm that controls the weighting of how far the back-propagation of rewards goes. If λ is set to zero, then the reward is only ever propagated backwards by one step each time a state is visited. If λ is set to one, then the reward is propagated backwards through the entire chain of states visited with no weighting other than the time discounting. Values of λ in between zero and one progressively weight more towards the recent past as the value approaches zero. Note that λ is set to one, the algorithm produces the same output as a variant of the Monte Carlo approach.
Temporal-difference methods originally only focused on the state value function, and the introduction of the TD(λ) algorithm by Sutton (1988) made no mention of the state-action value function. The work that applied temporal
3Some minor changes were made for notational clarity, to comply with the vernacular used in this thesis and to take into account the errata by Sut-ton (2010).
difference methods to the state-action value function was done by Watkins (1989; Watkins & Dayan 1992). Watkins introduced the Q-learning algorithm, which is not quite the state-action version of the TD(λ) algorithm due to the use of off-policy learning. Off-policy learning is where the algo-rithm learns the optimal policy, but does not follow it. Instead an off-policy algorithm follows a separate, related policy that still allows for exploration.
By taking into account the differences between the policy being followed and the optimal policy, the optimal policy can be updated such that it includes no exploration and so can truly be the optimal policy. There does exist an on-policy temporal difference method, which is a direct state-action version of the TD(λ) algorithm. It is known as Sarsa and was proposed by Rummery &
Niranjan (1994).
3.3.7 Extensions to Temporal Difference Learning
Up to this point, this section has been discussing the core work of reinforcement learning. There are a great many extensions to the core, too many to cover in any detail. This subsection will look at two extensions to give an impression of how the reinforcement learning problem can be extended.
The aim of generalisation is to apply the knowledge that has been learned from the environmental states that have been observed to those states that have not been visited. This is done by adding to each state some supplemen-tary information which the agent can use to compare the similarities between states. It is the job of the agent to learn which pieces of the supplementary information are useful when and by how much. The agent would then use the information to predict the value function of a state based on supplementary information alone.
How generalisation is done is the subject of a great deal of research. The general idea though is to create a state-value function that is parameterised by the values of the supplementary information rather than the individual state. However, in allowing for generalisation, there has to be the trade-off that no state or state-action value will ever be able to be perfectly predicted.
This is because by using a general measure of similarity between states, means that the transition between states has to be smooth. Therefore, if two close states are near enough but give very different values for their value function, the smoothness of the general value function will mean both values are pulled away from their true value.
Because of this, the goal of a general value function is not to match each state exactly, but to minimise the mean of the squared error between the val-ues of the true (optimal) value function and the general value function. This is done by a method known as gradient descent. The intuition here is that as a given value is changed, this will change the mean of the squared error in some manner, either up or down, meaning that there is a local gradient for each parameter value. By calculating this gradient for each parameter, the direc-tion that most quickly reduces the mean of the squared error can be followed,
allowing for a minimum value to be found. Alonso et al. (2006) suggested an approach to generalisation for Q-learning based on a number of observa-tions made regarding the differences between the TD learning method and the Rescorla-Wagner model of classical conditioning. Sutton et al. (2009) have produced a variant of TD learning that calculates gradient descent efficiently and in a convergent manner.
A related, but different extension to reinforcement learning is that of in-troducing the notion of continuous states and actions. In the conventional version of reinforcement learning, each state and action is a discrete set of choices. However this need not be the case, for instance if a robot is to choose an angle with which to turn through then this is an action where there is a continuum of possible actions rather than a discrete set of choices. Similarly the state that the robot is in after making its choices is also a continuum.
The state space becomes an even larger continuum if the robot then moves forwards after the turn, with another continuum of choices. Some of the meth-ods of generalisation can be applicable, as one form of representing this is to have the supplementary state information become the state itself. Work by van Hasselt (2012) has looked at this version of the reinforcement learning problem in detail. Some recent research by Fairbank & Alonso (2011; 2012) has looked at this variant in terms of dynamic programming.
3.3.8 Reinforcement Learning in Relation to this Thesis This thesis has many commonalities with reinforcement learning in general and temporal difference learning specifically. It also has a number of very significant differences. This subsection will review these similarities and dif-ferences.
The similarities between reinforcement learning and temporal difference learning and this thesis are due to their common origins: conditioning. This means that both systems learn to associate temporally congruous events, as that is the basis for conditioning. Both temporal difference learning and the system presented in this thesis use a value of association to represent the learning – in temporal difference, this is the association to the reward (i.e. the value function) and in this thesis it is the significance value (see chapter four for details). There are other similarities that are not necessarily conditioning based, but are logical developments of it. These similarities are that both systems learn continuously, are able to adapt the circumstances over time and the converged learning need not be biased by the earliest values learned.
However, there are numerous differences, which again originate in decisions made regarding the interpretation of the phenomena of classical conditioning.
The largest difference is that the system developed in this thesis does not make use of any concept of rewards, nor of action. This was due to subscribing to the stimulus-stimulus interpretation of conditioning. In the temporal difference system, the only thing learned is the propagation of reward. This means that if the environment provides no reward, no learning occurs. This is in-line with
the stimulus-response interpretation of classical conditioning, which in the current psychological consensus is that it is the minor form of association. It is accepted that in the reinforcement learning variant of temporal difference, that conformance to any psychological interpretation was never the goal, however given the models heritage, such analysis is to be expected.
Another point regarding the difference between the system presented by this thesis and reinforcement learning in general is that during the transition between the model of classical conditioning and the machine learning method, the basis of what is being learned subtly changed. No longer was the learning method one of classical conditioning but one of instrumental conditioning4. The nature of classical conditioning is that the subject learns passively, with no choice of actions - the unconditioned and conditioned responses are reflexive, not deliberate actions, and if the stimulus-stimulus interpretation is correct, the response is not needed for learning to occur5.
In instrumental conditioning, the choices are deliberate actions on the part of the subject seeking the expectation of reward or avoidance of punishment.
The nature of what becomes associated in instrumental conditioning is differ-ent from classical conditioning. In instrumdiffer-ental conditioning all three items of the cue stimulus, the action and the reinforcement stimulus are associated, whereas in classical conditioning there is no action, just a cue and a rein-forcement stimulus being associated. Note that in reinrein-forcement learning, it is indeed a triple of the current state, the action and the reward. A possible argument against this assertion that reinforcement learning is more analogous to instrumental conditioning is that one can argue that classical conditioning is to instrumental conditioning what the state value function is to the state-action value function. The counterargument is that just because the state value function does not select actions, the policy which it evaluates does select actions, meaning that action selection is still very much an essential part of reinforcement learning, even in the case of the state value function.
The consequence of this subtle difference between the classical conditioning origins of reinforcement learning and its instrumental conditioning abstrac-tion means that the basis for temporal difference learning has unanswered questions. This raises the question that, if a temporal difference model of instrumental conditioning were formulated, would it be different to the clas-sical conditioning model? If so, would that model be able to be applied to the reinforcement learning problem, potentially giving a better solution to it?
4Instrumental conditioning is a form of associative learning where the sub-ject is given a cue stimulus, and the reward is based on the active actions that the subject makes in response to the cue. Instrumental conditioning was first found by Thorndike (1898) and was greatly advanced by Skinner (1938; 1962).
Unlike classical conditioning, the subject makes non-reflex actions.
5For the potential argument that in fear conditioning, actions can be taken that appear to be deliberate, a counterargument would be that the reflex response – the response that becomes conditioned – is the fear response itself, not any actions which follow the fear response.
As with the discussion over interpretation, it is accepted that it was not the intention of the machine reinforcement learning version of temporal difference to be compared, however again it does at least warrant discussion due to the heritage of the method.
It should be pointed out that the two previous arguments, “temporal dif-ference follows a stimulus-response interpretation” and “temporal difdif-ference is instrumental conditioning”, are not necessarily contradictory positions. What-ever the mechanism or interpretation of classical conditioning and instrumental conditioning, the similarities are greater than the differences between the two.
This means that there is a likely shared mechanism between instrumental and classical conditioning and therefore, it is probable that the interpretation of classical conditioning also applies to instrumental conditioning.
There are other lesser differences between temporal difference learning and the system presented by this thesis. The first of these is that the choice of knowledge representation is different. Temporal difference typically encodes its knowledge in its value function and the system presented by this thesis uses predicate logic. The second difference is that temporal difference learn-ing typically associates linearly, whereas the system presented by this thesis associates hierarchically, though it is acknowledged that there are extensions to temporal difference learning that do deal with hierarchical association.
3.4 Visual Event Sequence Learning
The research into visual event sequences is a part of the distinctive sub-field of artificial intelligence known as computer vision. As the system discussed within this thesis uses visual events as its data source, this section provides a brief review of the techniques used in computer vision to identify event se-quences. This is followed by a brief review of visual object tracking techniques.
3.4.1 Event Recognition
Computer vision research into event detection and classification appears to be application-driven. These applications broadly fit into two groups of applica-tions. These are the analysis of broadcast television and automated surveil-lance. There appears to be a single main difference in approach independent of application. There are those systems that manually and explicitly model the classes of event being searched for (Foresti et al., 2004; Cui et al., 2007;
D’Orazio et al., 2009; Cristani et al., 2007) and those systems that do not, dy-namically creating a model instead (Dee & Hogg, 2009; Piciarelli et al., 2008).
It also should be noted that the method used to detect events is very depen-dent on the type of application. This suggests that the approach of research being application-led has in this instance failed to produce any methods that are general enough to be used across applications.
Systems analysing broadcast television are mainly focused on news broad-casts, where the events being the story segments in the recording (Hoogs et al.,
2003; Xu & Chang, 2007) and sport broadcasts, where the events generally being dependant on the sport – for example goal events in soccer (D’Orazio et al., 2009; Liu et al., 2009). In the analysis of television broadcasts, the most common method of detecting events is through temporally dividing the video into individual shots, classifying each shot based on a cluster analysis and us-ing typical sequences of shots to detect events. These methods are typically augmented with concurrent analysis of the audio and closed caption streams present in the broadcast.
Automated surveillance can be split into those systems attempting to char-acterise the behaviour of all detected subjects (Dee & Hogg, 2009; Cristani et al., 2007) and those that attempt to search for anomalous events (Cui et al., 2007; Foresti et al., 2004; Piciarelli et al., 2008). In the surveillance domain, the most common method to detect events is through analysis of the trajec-tories of the detected objects. These trajectrajec-tories are calculated through the use of an object tracking system. To find anomalous trajectories, the system is trained on sets of trajectories that are considered normal (either through manual labelling or through clustering trajectories and using those clusters with the highest frequency of use). Anomalous or interesting trajectories are then classified as such if they are an outlier to the trained sets.
3.4.2 Object Tracking
While the development of this thesis did not directly implement a full track-ing system, ustrack-ing simulated data instead, the system is designed such that the input data is supposed to be the output of a tracker or simulation thereof.
In addition, the input data simulator modelled some of the kind of noise that a tracker produces in its output. As these two parts of this thesis assume knowledge of a tracking system, and for the sake of completeness, this very large area shall be briefly discussed. Yilmaz et al. (2006) provide a comprehen-sive review of tracking techniques, in which they split the tracking techniques into three different categories: point tracking, kernel tracking and silhouette tracking.
Point tracking is based on finding a correspondence between salient points in one frame and the same points in the next frame. Salient points are points in the scene that are either easy-to-find points on the objects present in the scene (for example, corners) and/or points that are invariant to particular transforms that may be applied to the image.
Kernel tracking is the category of methods where the objects to be tracked are represented in a simple manner, such as a bounding box or an ellipse. A correspondence between frames is then obtained for each simple shape. The search for the area of each frame that defines the shape can then be constrained by searching the neighbourhood of the shape in the previous frame. Kernel tracking is typically used with techniques that attempt to create foreground-background segmentation, such as between changing and static regions of pix-els in the image.
Silhouette tracking is the name given to the set of methods that attempt to model the outline of each class of objects to be tracked. These models are typically a complex polygon or spline and have control points to allow the model to deform within set bounds. Tracking is then the creation of correspondences between models in consecutive frames. The position and shape of the previous frame is used to guide the search for the position and shape in the next.
In dos Santos et al. (2009), it is noted that an object forming part of a dynamic scene can display two types of motion: intrinsic and extrinsic.
Intrinsic motion is that motion that changes the appearance of the object by the movement of constituent parts of the object (e.g. limbs on a body).
Extrinsic motion is motion where the position of the object changes relative to other objects. From the point of view of tracking systems, extrinsic motion can be detected by either a kernel tracker or a silhouette tracker, whereas intrinsic motion can only be detected by a silhouette tracker.
The system discussed in chapter four of this thesis assumes the use of a kernel tracking system. While a silhouette tracker would provide the most in-formation and allow for a more general way to describe events, it has a number of practical concerns. Firstly, a silhouette tracker would require models of the outlines of the objects expected to be tracked. This assumption of what ob-jects would be present in a scene would mean that the system would be less adaptable to novel objects.
3.5 Chapter Conclusion
This chapter has reviewed the ideas that are related to this thesis, either through this thesis making use of those ideas or through the ideas having a common conceptual heritage. Firstly the work done within the psychology community towards modelling classical conditioning was reviewed. It was noted that there are two classes of model, trial-level models, and real-time models. A trial-level model is one that computes the association strength after each presentation of a stimulus completes. A real-time model computes the association strength at regular intervals.
Next, the ideas of commonsense knowledge were reviewed, with the three problems of knowledge representation, knowledge reasoning and knowledge acquisition. The two most pertinent problems, those of representation and acquisition were then discussed. The problem of knowledge representation focused on space and time. The problem of acquisition was reviewed more generally, looking at two dominant approaches of acquisition, namely using
Next, the ideas of commonsense knowledge were reviewed, with the three problems of knowledge representation, knowledge reasoning and knowledge acquisition. The two most pertinent problems, those of representation and acquisition were then discussed. The problem of knowledge representation focused on space and time. The problem of acquisition was reviewed more generally, looking at two dominant approaches of acquisition, namely using