Proxy Ground Truth Comparison - The Application of Classical Conditioning to the Machine Learni

Other than the need to cap the number of levels allowed in the event type hierarchy, the comparison with the proxy ground truth has produced the results least in line with expectations. However, on analysis, there are some signs that these results may be due to flaws in the proxy ground truth rather than flaws in the system’s design. In addition, further analysis indicates that, even with the flawed proxy ground truth, some aspects of the results are in line with expectations. The reasons why the proxy ground truth may be flawed are discussed in section 6.3.

By far the largest determining factor for all the comparison measures presented is the model employed. The video duration has some effect, but the measures usually stabilise after the first few minutes of input. The noise level has negligible impact. As the hypotheses are linked to the comparison between the models more than to any other factor, and the model is the largest determining factor, the results and discussion shall focus on the comparison between the models. The plots presented will show the duration data as separate lines. This allows each plot to show both the variation that a model can have and, to some extent, the influence that the video duration has on the measures. The influence of the noise level will not be shown: each plot is based only on the data for zero noise.

This lack of noise data is also justified by the fact that the noise tolerance is analysed separately later on within this chapter.

There is another general influence on the comparison measures presented: the complexity of the learning scenario. Across the three learning scenarios, the complexity affects the measures in two ways. Firstly, there is a reduction in both the typical and peak performance for each measure, because it is more difficult to score well against a proxy ground truth with a larger set of rules than against one with a smaller set. Secondly, as the complexity increases, the variation in each measure that is due to the video duration increases for each model. This occurs because more complex models take longer for their corresponding observed rule sets to stabilise.

As mentioned previously, some of the results are more in line with expectations. These can be separated from the results less in line with expectations by their level within the event type hierarchy. The vast majority of the results less in line with expectations can be attributed to the second level of compound event types (i.e. those rules that are compounds of compound event types). This in turn can be attributed to the proxy ground truth being much larger for the second level than for the first level in each learning scenario. The second level is derived from the first, which means that any disagreement between the output of the system and the proxy ground truth on the first level will be magnified exponentially on the second level. The exponential divergence arises from the multiple overlap problem discussed previously.

The first measure that shall be reviewed is the precision measure. Figure 6.3 shows the precision score for each model for both levels of the hierarchy and figure 6.4 shows the precision score of each model for just the first level of the hierarchy. For some models and scenarios, the precision gives results that are reasonably in line with expectations, and some that are very much in line with expectations, with the Inhibition and Pre-Exposure models performing the best across each scenario. The results least in line with expectations are those of the Temporal / Reacquiring / Blocking group of models. It was expected that these would perform better than the Absolute and Iterative Acquire-Extinguish models.

Of the three non-conditioning models, the performance rank-order for the precision measure is as expected, with the general trend being that the Fixed Increment model shows the worst performance, the Symmetrical Fixed Increment model shows a modest increase in performance, and the Count Only model shows a larger increase. There are some outliers in this trend, but these appear to be for the one minute and two minute inputs, where the rule sets had not yet stabilised and the first rules to be generated are more likely to be correct because they reached the significance threshold the fastest. There is another outlier in the precision scores of those three models that cannot be explained by the data points being from the short duration inputs: in figure 6.4c, the Count Only model has a much lower score than the other two models. No explanation was found for this result.

Even more anomalous is that in the same subfigure, with the exception of the one minute input, the precision scores for the Count Only model are in the reverse order. Again no explanation could be found for this behaviour.

The level one precision results show a similar trend across the models to that of the all-level precision results, but with each point having a generally higher absolute precision value. Defying this general observation are the two Acquire-Extinguish models, which show some improvement in performance when only the level one results are taken into account.

Of all the results presented in this chapter, the recall results are least in line with expectations. However, as with the precision results, the level one results show significant improvement when viewed alone. The recall results for every level are shown in figure 6.5. The corresponding recall results for only the level one event types are shown in figure 6.6. The all-level results show very poor recall performance for every scenario, with the best recall score over every run being 0.042 – effectively meaning that the very best result only found 4.2% of all the proxy ground truth rules. However, taking only the level one results, the recall values are transformed – with the best result being 0.938.

This implies that almost the entire issue with the all-level recall results is the great deal of disagreement between the output of the system and the proxy ground truth at the second level, and there is little evidence against the proxy ground truth being overly broad. As with the precision results, the relative trend between the models for the two sets of plots is broadly the same.
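The gap between the all-level and level one recall values can be illustrated with a minimal sketch, assuming that rules are compared as members of sets. The rule identifiers, set sizes and the helper `precision_recall` below are hypothetical and are not taken from the system described in this thesis. The sketch only shows how an enlarged second-level ground truth lowers recall while leaving precision unchanged.

```python
def precision_recall(generated, ground_truth):
    """Return (precision, recall) for two collections of rules."""
    generated, ground_truth = set(generated), set(ground_truth)
    true_positives = len(generated & ground_truth)
    precision = true_positives / len(generated) if generated else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Hypothetical level one ground truth: four compound rules.
level_one_truth = {"A+B", "B+C", "C+D", "D+E"}

# Hypothetical all-level ground truth: the same four rules plus a much
# larger set of second-level rules derived from them.
all_level_truth = level_one_truth | {f"L2-{i}" for i in range(96)}

# Hypothetical system output: three correct level one rules, one wrong.
generated = {"A+B", "B+C", "C+D", "X+Y"}

p_one, r_one = precision_recall(generated, level_one_truth)
p_all, r_all = precision_recall(generated, all_level_truth)

print(f"level one: precision={p_one:.3f} recall={r_one:.3f}")
print(f"all-level: precision={p_all:.3f} recall={r_all:.3f}")
```

In this toy case the same output keeps its precision but loses almost all of its recall once the second-level rules are added to the ground truth, which is the qualitative pattern described above.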

The individual recall results for each of the models appear to be somewhat of a mirror of the precision results: those models that did less well with the precision measure performed better with the recall measure, and those models that performed well with the precision measure did less well with the recall measure.

To some extent this is expected, since there is a form of inverse relationship between precision and recall. It is possible to score well with both measures however, so it is not a true inverse relationship. This can be seen through a trivial method of getting a high recall – if every possible rule is generated, the recall would be 100%, as every correct rule would be generated, but this would lead to a very low precision as there would be a very high number of false positives. Conversely, a higher precision can be obtained by producing very few rules, as then each true positive gained would not be divided by a significantly higher number. This appears to be what has happened in the results. The two Fixed Increment models and the Temporal group of models generated the highest absolute number of rules and so created a better recall, but at a high cost of precision. The Count Only, Inhibition and Pre-Exposure models produced considerably fewer rules, giving a higher precision at the expense of recall. The two Acquire-Extinguish models then fell in the middle for the number of rules produced, and so fell in the middle for both precision and recall.
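The two extremes described above can be sketched as follows. This is an illustrative toy example only: the event names, the pairwise rule space and the ground truth are assumptions made for the sketch, not data from the experiments.

```python
from itertools import combinations


def precision_recall(generated, ground_truth):
    """Return (precision, recall) for two sets of rules."""
    true_positives = len(generated & ground_truth)
    precision = true_positives / len(generated) if generated else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Hypothetical rule space: every unordered pair of six event types.
events = ["A", "B", "C", "D", "E", "F"]
all_possible = {frozenset(pair) for pair in combinations(events, 2)}

# Hypothetical ground truth of three rules.
truth = {frozenset("AB"), frozenset("BC"), frozenset("CD")}

# An indiscriminate model generates every possible rule.
indiscriminate = all_possible
# A cautious model generates a single rule, which happens to be correct.
cautious = {frozenset("AB")}

p_ind, r_ind = precision_recall(indiscriminate, truth)
p_cau, r_cau = precision_recall(cautious, truth)

print(f"indiscriminate: precision={p_ind:.3f} recall={r_ind:.3f}")
print(f"cautious:       precision={p_cau:.3f} recall={r_cau:.3f}")
```

The indiscriminate rule set reaches full recall at low precision, and the cautious one reaches full precision at low recall. A rule set equal to the ground truth would score 1.0 on both, which is why the relationship is not a true inverse.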

The recall results demonstrate that there is a difference between the Temporal and Reacquisition models, which deviate from each other in both the all-level and level one results. Across all the results in this section, a difference between the results of the two models is rare. The reasons why the Temporal, Reacquisition and Blocking models have such similar results are discussed in section 6.5.

With the balance between precision and recall being different for each model, the results of the F₁ measure and the MCC become of even greater interest. The results of both measures take a similar broad shape and so both measures will be discussed together. The very low recall figures for the data that includes both levels heavily influence the F₁ measure and the MCC, which means that their respective all-level data is also less in line with expectations. Because of this, only the plots for the level one data are presented. Figure 6.7 presents the level one data for the F₁ measure and figure 6.8 shows the level one data for the MCC.

The most remarkable observation about the F₁ and MCC measures is how similar both sets of results are: with a few exceptions, the MCC takes almost exactly the same shape as the F₁ measure, but at a slightly higher absolute value. The slight increase can be accounted for by the inclusion of the very large true negative values in the MCC calculation. Of the exceptions to the similarity between the two plots, the MCC slightly favours the two Fixed Increment models, the Inhibition model and the Pre-Exposure model, and slightly disfavours the Temporal group of models. As the learning scenarios become more complex, the values of the measures again reduce in general.
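One way to see why the two measures would track each other so closely is the following limiting argument. It is a standard identity rather than a derivation taken from this thesis. It uses the usual definitions of the MCC and the F₁ measure, with P and R denoting precision and recall, and it assumes that the true negative set is far larger than the other three sets.

```latex
\[
\mathrm{MCC}
  = \frac{|TP|\,|TN| - |FP|\,|FN|}
         {\sqrt{(|TP|+|FP|)\,(|TP|+|FN|)\,(|TN|+|FP|)\,(|TN|+|FN|)}}
  \;\xrightarrow{\;|TN| \to \infty\;}\;
  \frac{|TP|}{\sqrt{(|TP|+|FP|)\,(|TP|+|FN|)}}
  = \sqrt{P\,R}
\]
\[
F_1 = \frac{2\,P\,R}{P + R} \;\le\; \sqrt{P\,R}
\]
```

Under this assumption the MCC approaches the geometric mean of precision and recall, while the F₁ measure is their harmonic mean. The geometric mean is never below the harmonic mean, which is consistent with the MCC sitting slightly above the F₁ measure while following the same shape.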

Overall, it appears that the models that demonstrated a higher recall (i.e. the two Fixed Increment models and the Temporal group of models) have produced higher scores in these two measures than the models that demonstrated a higher precision (i.e. the Count Only, Inhibition and Pre-Exposure models), although the discrepancy between the two groups is certainly smaller than in either the individual precision or recall results. As the recall values are worse than the precision values are good, the end result is that the more indiscriminate models are preferred over the more discriminating ones.

Other performance comparison measures, not discussed in section 5.4 or presented in this section, were also collected and calculated. These measures are the specificity and the accuracy. They have not been fully included in this thesis because they do not discriminate between values of any of the four input variables. The specificity is a measure of the proportion of true negatives out of all the negatives found, which is calculated using equation 6.1. The accuracy is a measure of the proportion of correct results that were found out of the set of all possible rules, which is calculated using equation 6.2. For every value calculated, the result rounded to one. This is because the set of all possible results is vast compared to any of the output rule sets or the proxy ground truth rule sets. The true negative set is therefore also much larger than the other sets, causing each result to be very near one.

\[
\text{Specificity} = \frac{|TN|}{|TN| + |FN|} \tag{6.1}
\]

\[
\text{Accuracy} = \frac{|TP| + |TN|}{|TP| + |FP| + |TN| + |FN|} \tag{6.2}
\]
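The saturation of both measures can be checked with a short sketch. All counts below are hypothetical and chosen only so that the true negative set dwarfs the other sets. The first function follows equation 6.1 as printed, with false negatives in the denominator. The conventional definition of specificity uses false positives instead, so it is included for comparison.

```python
def specificity_eq_6_1(tn, fn):
    """Equation 6.1 as printed: true negatives over all negatives found."""
    return tn / (tn + fn)


def specificity_conventional(tn, fp):
    """Conventional specificity: true negatives over all actual negatives."""
    return tn / (tn + fp)


def accuracy(tp, fp, tn, fn):
    """Equation 6.2: correct results over all possible rules."""
    return (tp + tn) / (tp + fp + tn + fn)


# Hypothetical counts: ten million possible rules, of which only a few
# hundred appear in either the output or the proxy ground truth.
total_possible = 10_000_000
tp, fp, fn = 30, 70, 970
tn = total_possible - tp - fp - fn

spec_printed = specificity_eq_6_1(tn, fn)
spec_conventional = specificity_conventional(tn, fp)
acc = accuracy(tp, fp, tn, fn)

print(round(spec_printed, 3), round(spec_conventional, 3), round(acc, 3))
```

With these counts every value rounds to 1.0 at three decimal places, even though the precision is 0.3 and the recall is 0.03, so neither measure separates a good rule set from a poor one.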

Figure 6.3: A plot comparing the model employed against its precision value for every level of the event-type hierarchy. The sub-figures each show one of the learning scenarios; sub-figure (a) shows the Throwing scenario. Each line corresponds to one value of video duration.