Preliminary Skill — Learning the Count List

5.3 Training of the Model

5.3.1 Preliminary Skill — Learning the Count List

The preliminary skill to be acquired by the model in the first stage of the training is defined as the ability to output a pre-defined sequence of number words and then to remain silent for a period of time, in response to the trigger input. At this stage of the training none of the optional components of the model are present, i.e. the only input available to the model is the trigger input, and only the verbal output is present (see figure 12). Assuming the model is trained to recite N_{VO} number words, the training data set consists of two sequences, each 2N_{VO} time steps long:

the first sequence, with trigger input equal to 0 and all target outputs in the ‘silence’ configuration at every time step, trains the model to remain silent when the trigger input is deactivated;

the second sequence, with trigger input equal to 1, trains the model to recite the N_{VO} number words throughout the first N_{VO} time steps of the sequence, and to remain silent during the remaining N_{VO} time steps.
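The two sequences described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the targets are written as integer labels (0 for ‘silence’, 1 to N_{VO} for the number words) rather than as output vectors, the trigger input is assumed to be held constant throughout a sequence, and the names `SILENCE` and `stage_one_dataset` are hypothetical.

```python
SILENCE = 0  # assumed label standing in for the 'silence' output configuration


def stage_one_dataset(n_vo):
    """Return the two stage-one sequences as (trigger_inputs, target_labels) pairs.

    Each sequence is 2 * n_vo time steps long. Number words are represented
    by the integer labels 1..n_vo (an assumption made for this sketch).
    """
    length = 2 * n_vo
    # Sequence 1: trigger off, 'silence' at every time step.
    silent = ([0] * length, [SILENCE] * length)
    # Sequence 2: trigger on, recite words 1..n_vo, then stay silent for n_vo steps.
    recite = ([1] * length, list(range(1, n_vo + 1)) + [SILENCE] * n_vo)
    return [silent, recite]
```

For N_{VO} = 10 this gives two sequences of 20 time steps each, the second of which holds the ten number words followed by ten steps of silence.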

The prolonged period of silence following the recitation of the count list was introduced to prevent unwanted effects that might otherwise appear for the largest numbers when training the network to count in the second stage. If the simulation sequence were equal in length to the largest number considered, that number would be treated differently from the smaller ones: after counting up to it, the neural network would not have to enter a stable ‘silence’ state. The length of the period of silence that follows counting (equal to N_{VO}) was chosen arbitrarily.

[Figure 12 diagram: trigger input (1 unit), hidden layer (N_H units), verbal output (N_{VO} units).]

Figure 12: Architecture of the neural network in the first stage of the training. Note that this represents a subset of the neural network shown in figure 8, with no optional components.

Before the training of the model commences, the adjustable weights of the neural network are initialised randomly, drawn from a uniform distribution with mean 0 and standard deviation equal to the reciprocal of the square root of the fan-in of the node at the receiving side of the connection (LeCun et al., 1998). The first stage of the training is performed using the RProp− algorithm (Igel & Hüsken, 2003). In the simulations reported in chapter 6 the training typically lasts for 700 epochs with an initial learning rate of 0.1 (the learning rate is allowed to vary between 10⁻⁵ and 0.1). The number of units in the hidden layer, N_H, is usually a parameter of the model.
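The initialisation and the update rule mentioned above can be illustrated as follows. This is a minimal sketch, not the implementation used in the simulations. It relies on the fact that a uniform distribution on [−a, a] has standard deviation a/√3, so that a = √(3/fan-in) gives the required standard deviation. For RProp−, it is assumed that the ‘learning rate’ in the text plays the role of the per-weight step size; the factors 1.2 and 0.5 are commonly used defaults and are not stated in the text. The function names are hypothetical.

```python
import math
import random


def init_weights(fan_in, n_receiving, rng=random):
    """Draw a fan_in x n_receiving weight matrix from a zero-mean uniform distribution.

    Uniform on [-a, a] has standard deviation a / sqrt(3), so choosing
    a = sqrt(3 / fan_in) yields a standard deviation of 1 / sqrt(fan_in).
    """
    a = math.sqrt(3.0 / fan_in)
    return [[rng.uniform(-a, a) for _ in range(n_receiving)] for _ in range(fan_in)]


def rprop_minus_step(weight, grad, prev_grad, step,
                     eta_plus=1.2, eta_minus=0.5,   # assumed default factors
                     step_min=1e-5, step_max=0.1):  # range quoted in the text
    """One RProp- update (no weight backtracking) for a single weight.

    Returns the pair (new_weight, new_step).
    """
    if grad * prev_grad > 0:
        # Same gradient sign as in the previous epoch: enlarge the step.
        step = min(step * eta_plus, step_max)
    elif grad * prev_grad < 0:
        # Sign change (a minimum was overshot): shrink the step.
        step = max(step * eta_minus, step_min)
    # Move against the sign of the gradient by the current step size.
    if grad > 0:
        weight -= step
    elif grad < 0:
        weight += step
    return weight, step
```

Only the sign of the gradient enters the weight change, which is why the step size bounds take the place of a conventional learning rate in this sketch.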

The success of the first stage of the training is determined based on the correctness of the output produced by the model. If the nearest-neighbour classification of the network output is the same as that of the target output at every time step for both sequences in the test data set (which in this stage of the training is identical to the training data set), the preliminary stage is deemed successful, and the training can proceed to the second stage. If the preliminary stage is not successful, there is no point in proceeding with the training.
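The success criterion can be sketched as below. The sketch assumes Euclidean distance for the nearest-neighbour classification and a hypothetical `codebook` mapping each output label (the number words and ‘silence’) to its reference vector; neither detail is specified in the text above.

```python
import math


def nearest_label(vector, codebook):
    """Return the label whose reference vector is closest (Euclidean) to vector."""
    return min(codebook, key=lambda label: math.dist(vector, codebook[label]))


def stage_successful(outputs, targets, codebook):
    """True if output and target classify identically at every time step.

    outputs and targets are lists of sequences; each sequence is a list of
    vectors, aligned time step by time step.
    """
    return all(
        nearest_label(out, codebook) == nearest_label(tgt, codebook)
        for out_seq, tgt_seq in zip(outputs, targets)
        for out, tgt in zip(out_seq, tgt_seq)
    )
```

A single misclassified time step in either sequence makes the check fail, which matches the all-or-nothing criterion described above.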

5.3.2 Counting

After the first stage of the training is completed, the model obtained in the preliminary stage is extended by adding the desired optional components (cf. figure 8). The weights of the connections that need to be added (e.g. from the visual input layer to the hidden layer) are initialised using the same method as in the preliminary stage. Subsequently, the extended neural network is trained to produce a sequence of number words (and, optionally, gestures), the length of which is equal to the number of objects present in the visual input, just as a child would when counting the same set. As explained in section 5.2.1, in the training data set the spatial positions of the counted items must be randomised. The number of possible spatial arrangements of the objects grows exponentially with the number of units in the visual input. If one considers numbers up to 10, which in the model corresponds to a visual input with N_{VI} = 20 units, there are more than 600,000 possible spatial arrangements of the items. Even in this case it is impractical to create a training data set that would contain all possible combinations of locations of objects for all considered numbers. In order to alleviate this problem, the model is trained in an ‘on-line’ fashion, using small data sets that change after every training epoch. This makes it possible to use a different spatial arrangement of objects for a particular number in every epoch.
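The figure of more than 600,000 arrangements can be checked with a short calculation. The sketch assumes that each location holds at most one item, so that an arrangement of k items is a choice of k out of the N_{VI} locations, and that the empty set (the number 0) is included; the function name is hypothetical.

```python
from math import comb


def count_arrangements(n_locations, max_items):
    """Number of ways to occupy 0..max_items of n_locations distinct locations."""
    return sum(comb(n_locations, k) for k in range(max_items + 1))


print(count_arrangements(20, 10))  # 616666
```

Excluding the empty set changes the total by one, so the count exceeds 600,000 under either reading.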

The training data sets for the second stage of the training are constructed as follows. For every number from the considered number range (up to N_{VO}), the representation of the visual input is composed by randomly choosing the locations of the items within the saliency map. The visual input fed to the model for a particular number remains unchanged throughout a simulation sequence. For every number, two sequences are included in the training data set. In the first sequence, the trigger input is deactivated and the target outputs are set to ‘silence’ (and, optionally, no gesture) throughout the whole duration of the simulation sequence. In the second sequence, the trigger input is activated and the target outputs contain the correct count list (and, optionally, gestures) that corresponds to the current visual input, followed by ‘silence’ (no gesture) until the end of the simulation sequence. If numbers from 0 to N_{VO} = 10 are considered, this results in 22 sequences per data set (two for each of the 11 numbers).
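The construction of one such data set can be sketched as follows. Several details are assumptions made for illustration: the visual input is a binary vector with one unit per location, targets are integer labels (0 for ‘silence’), both sequences for a given number share the same visual input, and the trigger input is stored as a single value that is held constant for the whole sequence. The names `make_epoch_dataset` and `SILENCE` are hypothetical, and gesture signals are omitted.

```python
import random

SILENCE = 0  # assumed label standing in for the 'silence' output configuration


def make_epoch_dataset(n_vo=10, n_vi=20, rng=random):
    """Build one stage-two data set: two sequences for every number 0..n_vo."""
    length = 2 * n_vo
    dataset = []
    for k in range(n_vo + 1):
        # Choose k distinct item locations at random within the visual input.
        occupied = set(rng.sample(range(n_vi), k))
        visual = [1 if i in occupied else 0 for i in range(n_vi)]
        # Trigger off: 'silence' for the whole sequence.
        dataset.append({"visual": visual, "trigger": 0,
                        "targets": [SILENCE] * length})
        # Trigger on: count list 1..k, then 'silence' until the end.
        dataset.append({"visual": visual, "trigger": 1,
                        "targets": list(range(1, k + 1)) + [SILENCE] * (length - k)})
    return dataset
```

Calling the function anew at the start of every epoch yields a fresh spatial arrangement for each number, in the spirit of the on-line scheme described above.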

If the model configuration includes the optional elements connected with the gestures, an appropriate proprioceptive signal, corresponding to the given spatial arrangement of the objects being counted, needs to be constructed. In the simulations described in chapter 6, two types of gestures are considered: the ‘standard’ counting gestures, which have a spatio-temporal character, and ‘rhythmic’ gestures, which have only the temporal aspect. How these two types of gestures are constructed from the given arrangement of the items in the visual input is explained below.

5.3.2.1 Spatio-temporal Gestures

When constructing the spatio-temporal counting gestures, the locations of the objects in the visual input are considered in a particular order, e.g. from left to right. As described in section 5.2.2, every spatial location in the visual input layer has a corresponding vector of the activation values of the gesture representation units derived from the values of the joint angles of the robot arm pointing to this position. Assuming there are K ≤ N_{VO} objects in the counted set, for the time steps 0 ≤ t < K of the simulation sequence, the target activation values of the gesture representation units are assumed to be equal to those representing a posture in which the robot points to the spatial location of the (t + 1)-th object, in the considered order. For the remaining time steps K ≤ t < 2N_{VO}, the activations of the proprioceptive units remain the same as they were for the last object (as discussed before, alternative ways of representing ‘no gesture’ are also possible). This process is illustrated in figure 13. Note that since the arm configuration is different for every location being pointed to, different spatial arrangements of items will yield different gesture signals, even if the collections are of the same size. A spatial correspondence therefore exists between the counted items and the gesture performed.

[Figure 13 image; top panel: visual input.]

Figure 13: Spatio-temporal counting gesture construction example. It is assumed that N_{VI} = 20 and N_{VO} = 10. There are five items in the robot’s visual input (top). The activated units in the visual input layer determine which arm postures are used to construct the gesture signal. The vectors corresponding to the occupied spatial locations are designated as A, B, C, D and E (centre). The counting gesture unfolds through the first five time steps of the simulation sequence, representing pointing to the items in the left-to-right order. After the gesture is completed (at time step 5) the robot arm remains in the posture corresponding to the last counted item until the end of the simulation sequence (bottom).
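The construction of the spatio-temporal gesture signal can be sketched as follows. The sketch assumes a hypothetical lookup table `postures` holding one activation vector per spatial location, a binary visual input, and a left-to-right counting order. The text does not say what happens for an empty set (K = 0); using the all-zero ‘neutral’ posture in that case is an assumption of this sketch.

```python
def spatio_temporal_gesture(visual, postures, n_vo):
    """Return the gesture signal (one posture vector per time step, 2 * n_vo steps).

    visual   -- list of 0/1 values, one per spatial location (left to right)
    postures -- hypothetical table: postures[i] is the vector for location i
    """
    length = 2 * n_vo
    occupied = [i for i, active in enumerate(visual) if active]  # left to right
    k = len(occupied)
    if k == 0:
        # Assumption: with nothing to count the arm stays in the neutral posture.
        return [[0.0] * len(postures[0]) for _ in range(length)]
    # Time steps 0..k-1: point to the (t + 1)-th object in the considered order.
    signal = [list(postures[loc]) for loc in occupied]
    # Time steps k..2*n_vo-1: hold the posture of the last counted object.
    signal += [list(postures[occupied[-1]]) for _ in range(length - k)]
    return signal
```

Two arrangements of the same size generally produce different signals here, because the posture vectors are looked up by location.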

5.3.2.2 Rhythmic Gestures

Rhythmic gestures, employed in the simulation described in section 6.4, are constructed in the following way. Let l and r be two distinct spatial locations chosen from the N_{VI} = 20 locations represented by the units of the visual input layer. The choice of l and r affects the amplitude of the resulting rhythmic movement. For example, assuming l = 1 and r = 20, which correspond to the leftmost and rightmost spatial positions respectively, yields the highest achievable movement amplitude. In turn, taking l = 10 and r = 11 is a way to obtain a movement with the smallest possible amplitude. If there are K ≤ N_{VO} objects in the set to be counted, the rhythmic gesture signal is constructed by taking, for the time steps 0 ≤ t < K of the simulation sequence, the activation values of the proprioceptive units corresponding to the spatial positions l and r, alternately. For the remaining time steps of the simulation sequence, K ≤ t < 2N_{VO}, the target activations of the proprioceptive units remain unchanged with respect to those present for t = K − 1. This is illustrated in figure 14. Note that any arrangement of K items in the visual input results in exactly the same gesture signal. As a consequence, in the case of rhythmic gestures there is no spatial correspondence between the gestures and the counted items. The rhythmic gestures will be contrasted with the ‘normal’ counting gestures described above, in an attempt to answer research question 3.
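A corresponding sketch for the rhythmic gestures is given below. It assumes that the alternation starts with the posture for l (as in the example of figure 14) and, as the text does not cover the case K = 0, that the all-zero ‘neutral’ posture is used for an empty set. The function name and argument names are hypothetical.

```python
def rhythmic_gesture(k, posture_l, posture_r, n_vo):
    """Return the rhythmic gesture signal for k items (2 * n_vo time steps).

    posture_l, posture_r -- activation vectors for the locations l and r
    """
    length = 2 * n_vo
    if k == 0:
        # Assumption: with nothing to count the arm stays in the neutral posture.
        return [[0.0] * len(posture_l) for _ in range(length)]
    # Time steps 0..k-1: alternate between the two postures, starting with l.
    signal = [list(posture_l if t % 2 == 0 else posture_r) for t in range(k)]
    # Time steps k..2*n_vo-1: hold the posture present at t = k - 1.
    signal += [list(signal[-1]) for _ in range(length - k)]
    return signal
```

The signal depends only on k, so every arrangement of k items yields the same gesture, in contrast to the spatio-temporal case.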

For the sequences in the training data set where the trigger input is deactivated (for which counting does not occur), the activation values of the gesture inputs are set to 0, which corresponds to keeping the arm in a ‘neutral’ position. Since the training data set changes in every epoch, in the second stage the neural network is trained using the backpropagation through time algorithm with the network weights updated in an on-line fashion (LeCun et al., 1998). In the simulations described in chapter 6, the training usually lasts for 4000 epochs and a constant learning rate of 0.005 is used.

[Figure 14 panels: visual input over locations 1 to 20, left to right (top); activation values of units 1 to 3 for the postures l and r (centre); activation values of units 1 to 3 over time steps 0 to 19 for the gesture sequence l, r, l, r, l (bottom).]

Figure 14: Rhythmic counting gesture construction example. It is assumed that N_{VI} = 20 and N_{VO} = 10. The same arrangement of five items in the visual input is used here as in figure 13 (top). l = 1 and r = 20 are assumed, therefore a gesture with maximum amplitude will be obtained (centre). The gesture is constructed by interchanging the vectors l and r five times (since there are five items). After the gesture is completed (at time step 5) the robot arm remains in the posture corresponding to the last beat (bottom).

In document Modelling Learning to Count in Humanoid Robots (Page 159-166)