5.4 Model Performance Evaluation

In line with design assumption 2, the evaluation protocol in the robotic simulations of counting follows, as closely as practically possible, the one applied in the behavioural study by Alibali and DiRusso (1999), so that the results of both can be meaningfully compared. The ‘subjects’ of the simulated experiment are instantiations of the neural network model architecture described earlier in this chapter. Inter-subject variability results from the random initialisation of the connection weights within the network, as well as from the stochasticity of the applied training algorithms. Typically³, the subjects are first trained to recite the sequence of number words as described in section 5.3.1, and successful acquisition of this skill is required for a subject to be included in the remaining part of the experiment. Subsequently, several experimental conditions are simulated by extending the network with the desired optional components, training it appropriately for the experimental condition, and then evaluating it on a test data set. Note that, for a single subject, the starting point of every experimental condition is a copy of the same neural network obtained in the preliminary training stage. This makes it possible to apply a repeated measures design in the statistical analysis of the results. The experimental set-up described above is illustrated in figure 15.

During the evaluation, the model is presented with the test stimuli, which consist exclusively of arrangements of items that have never been shown to the model during training. Because only a small number of spatial configurations of the objects is possible for the smallest numbers of items, the test data set is determined before training commences, and the arrangements it contains are excluded from use throughout the training. The test data sets are constructed in the same way as described for the training data sets in section 5.3.2.
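The hold-out procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the simulations: arrangements are modelled as sets of occupied cells on a hypothetical one-dimensional grid of `GRID_SIZE` positions, and the names `all_arrangements`, `build_test_set` and `sample_training_arrangement` are invented for the example.

```python
import itertools
import random

GRID_SIZE = 12     # hypothetical number of possible item positions
PER_SET_SIZE = 5   # test arrangements reserved per set size (cf. figure 15)

def all_arrangements(n_items, grid_size=GRID_SIZE):
    """Enumerate every placement of n_items on the grid as a frozenset."""
    return [frozenset(c) for c in itertools.combinations(range(grid_size), n_items)]

def build_test_set(set_sizes, rng, per_size=PER_SET_SIZE):
    """Fix the test arrangements before any training takes place."""
    return {n: set(rng.sample(all_arrangements(n), per_size)) for n in set_sizes}

def sample_training_arrangement(n_items, test_set, rng):
    """Draw a training arrangement, never one reserved for testing."""
    candidates = [a for a in all_arrangements(n_items) if a not in test_set[n_items]]
    return rng.choice(candidates)

rng = random.Random(0)
test_set = build_test_set(range(1, 11), rng)            # set sizes 1 to 10
train_example = sample_training_arrangement(1, test_set, rng)
```

With a single item there are only 12 possible arrangements on this toy grid, so drawing the test set first is what guarantees that five of them remain unseen during training.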

³ Since the four experiments described in chapter 6 have different aims, there are minor differences in the details of the experimental set-ups between these simulations. The description herein refers specifically to simulation experiment 3 presented in section 6.3, which is aimed directly at reproducing the results of Alibali and DiRusso (1999).

Figure 15: Example experimental design of a neuro-robotic simulation of counting. The figure refers to the simulation experiment described in section 6.3. Cf. figure 6.

[Figure content: experiment timeline comprising training followed by testing. Subjects are excluded on inability to recite 10 number words. Task: HM (How Many), cardinal response not important. Verbal output is active in all conditions ('counting aloud'). The neural network is subjected to specific training in every experimental condition. Visual stimuli: sets arranged randomly with variable spacing (colour not modelled), no grouping (order of presentation irrelevant). Set sizes: 1–10, with 5 different set arrangements for every number; 1–5 = ‘small’ size, 6–10 = ‘large’ size. Dependent measure: counting accuracy = number of sets (out of 50) counted without any error; counting duration as auxiliary measure.]

In every experimental condition, the sequences of number words (and, optionally, gestures) produced by the model in response to the test stimuli are recorded, and their correctness is assessed by comparison with the corresponding target values. The counting accuracy, defined as the number of sets from the test data set counted without any errors, is used as the principal index of the performance of the model, and serves as the dependent measure in the statistical analysis of the effects of the model parameters. Based on the test examples which have been counted incorrectly, the counting errors committed by the model are determined and classified using the same criteria as those applied by Alibali and DiRusso (1999) to children counting (see figure 7 in chapter 4). Figures 16 and 17 illustrate the process of the network output evaluation in the correct and incorrect case, respectively.
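The counting accuracy measure can be illustrated with a short sketch. It assumes that the network output has already been converted to one word label per time step, with ‘·’ standing for silence; the function names and the toy trials are hypothetical and only serve the example.

```python
SILENCE = "·"

def counted_correctly(produced, target):
    """A set counts as correct only if every time step matches the target."""
    return len(produced) == len(target) and all(
        p == t for p, t in zip(produced, target)
    )

def counting_accuracy(trials):
    """Number of test sets counted without any error.

    trials: iterable of (produced, target) pairs of word sequences.
    """
    return sum(counted_correctly(p, t) for p, t in trials)

trials = [
    # three items, counted correctly
    (["one", "two", "three", SILENCE], ["one", "two", "three", SILENCE]),
    # three items, but the model keeps counting past the last item
    (["one", "two", "three", "four"], ["one", "two", "three", SILENCE]),
]
accuracy = counting_accuracy(trials)
```

A single mismatching time step is enough to mark the whole set as incorrect, which is why the measure is a count of sets rather than a per-word score.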

An additional measure of the model performance employed in some of the simulations is the counting duration, that is, the length of the output produced by the model. It is defined as the last time step of the simulation sequence at which the model utters a number word, and which is followed by silence until the end of the simulation. Although this measure does not take into account the correctness of the produced sequence of number words, regressing the model counting duration against the size of the counted set (in other words, the actual length of the model output against the correct length) for all sequences in the test data set nevertheless provides useful insights into the behaviour of the model. The slope of the resulting regression line can serve as a marker of the quality of the behaviour of the model, based on the following observations. For a model that counts perfectly correctly, the actual output length is always equal to the number of the counted items, and the slope of the regression line is equal to 1 (and its intercept is 0). In turn, for a neural network that does not count, but simply produces a sequence of a fixed length regardless of the size of the counted set, the slope is equal to 0 (and the intercept indicates the length of the produced sequence). Intermediate values indicate a situation in between, and the value of the slope can be considered a quantitative indicator of ‘how hard the model is trying’ to count the items in its visual input.
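The two limiting cases of the regression can be checked with a small sketch. It assumes, for illustration only, that the model utters one word per time step starting at step 1, so that the duration of a perfect count equals the set size; the ordinary least-squares fit is written out by hand, and all names are hypothetical.

```python
SILENCE = "·"
WORDS = ["one", "two", "three", "four", "five"]

def counting_duration(words):
    """1-based index of the last time step with a number word (0 if none)."""
    last = 0
    for step, word in enumerate(words, start=1):
        if word != SILENCE:
            last = step
    return last

def regression_line(xs, ys):
    """Ordinary least-squares slope and intercept of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

set_sizes = [1, 2, 3, 4, 5]
# A model that counts perfectly: output length equals the set size.
perfect = [counting_duration(WORDS[:n] + [SILENCE] * (6 - n)) for n in set_sizes]
# A model that always recites three words, whatever it is shown.
fixed = [counting_duration(WORDS[:3] + [SILENCE] * 3) for _ in set_sizes]

perfect_fit = regression_line(set_sizes, perfect)
fixed_fit = regression_line(set_sizes, fixed)
```

Here `perfect_fit` has slope 1 and intercept 0, while `fixed_fit` has slope 0 and an intercept equal to the fixed output length of 3, matching the two cases described in the text.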

Figure 16: Example of a correct output of the model. The figure shows raster plots of the inputs to the neural network (a), the desired (i.e. target) network output (b), and the actual output produced by the network (c). In all three charts, the abscissa corresponds to the simulation time steps, the ordinate to the index of the unit in the appropriate layer of the neural network, and the intensity to the activation value of the unit. Below charts (b) and (c), a result of the nearest-neighbour classification of the network output at the given time step is shown, i.e. the number word assumed to be produced by the network (‘·’ denotes ‘silence’). In this example, there are six items in the visual input to be counted (in groups of 2, 1, and 3). Even though the actual network output (c) is not identical to the target output (b), in this trial, the network counting is considered to be correct. This is because, at every time step, the nearest-neighbour classification of the actual output is identical with that of the target output.
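The nearest-neighbour criterion described in the caption can be sketched as follows. The codebook of word vectors and the use of Euclidean distance are assumptions made for this example; the representation and distance measure actually used by the model are not specified in this section.

```python
import math

# Hypothetical codebook: one target vector per number word, plus silence.
CODEBOOK = {
    "·":     [0.0, 0.0, 0.0],
    "one":   [1.0, 0.0, 0.0],
    "two":   [0.0, 1.0, 0.0],
    "three": [0.0, 0.0, 1.0],
}

def nearest_word(activation):
    """Return the codebook word closest to the output activation."""
    return min(CODEBOOK, key=lambda w: math.dist(CODEBOOK[w], activation))

def classify(outputs):
    """Classify the output layer activation at every time step."""
    return [nearest_word(a) for a in outputs]

# Target and actual output for a two-item set followed by silence.
target = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
actual = [[0.8, 0.1, 0.2], [0.3, 0.7, 0.1], [0.1, 0.2, 0.1]]

# The trial is correct if the classifications agree at every time step,
# even though the raw activations differ from the target.
is_correct = classify(actual) == classify(target)
```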

Figure 17: Example of an incorrect output of the model. The figure shows raster plots of the inputs to the neural network (a), the desired (i.e. target) network output (b), and the actual output produced by the network (c). In all three charts, the abscissa corresponds to the simulation time steps, the ordinate to the index of the unit in the appropriate layer of the neural network, and the intensity to the activation value of the unit. Below charts (b) and (c), a result of the nearest-neighbour classification of the network output at the given time step is shown, i.e. the number word assumed to be produced by the network (‘·’ denotes ‘silence’). In this example, there are three items in the visual input to be counted. Note that the network continues to count up to five, beyond the number of the items in the set; therefore, in this trial, the network counting is not considered to be correct, and the network commits the ‘Continue’ error.
