Seq-COIL-100 - Continual Unsupervised Learning

5.3 Continual Unsupervised Learning

5.3.2 Seq-COIL-100

To generalize the results obtained on Seq-NORB, we extend the evaluation of SST to COIL-100, using the same deep architectures. Figure 5.17 shows HTM and CNN accuracy for the different SST strategies. We observe that:

• The trend for the supervised strategies is similar to NORB: both HTM and CNN steadily improve on their initial accuracy as new batches are presented, with the CNN slightly outperforming HTM. For HTM, regularization does not seem to provide any advantage, probably because of the shorter sequence length (10 frames here instead of 20 frames in NORB) and the presence of gaps in the sequences (patterns excluded because they are included in the test set).

• Here too, semi-supervised strategies perform better for HTM than for CNN. It is worth noting that in this case the base strategy SST-B outperforms SST-A, indicating that the self-confidence threshold sc (kept fixed at 0.65) is probably too conservative for this dataset.

Figure 5.17: HTM and CNN incremental tuning accuracy on COIL-100.

5.3.3 Conclusions

In this section we studied semi-supervised tuning based on temporal coherence. The proposed tuning approaches were evaluated on two deep architectures (HTM and CNN), with partially discordant results. For HTM, our experiments showed that under some conditions even a trivial approach that enforces slow change of the output (SST-B) can significantly improve classification accuracy. A slightly more complex approach (SST-A), which exploits temporal coherence twice, (i) to enforce slow change of the output and (ii) to compute a self-confidence value that triggers the semi-supervised update, proved to be very effective, sometimes approaching the accuracy of supervised tuning.
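The two uses of temporal coherence described above can be illustrated with a minimal sketch. This section does not give the exact update rules, so the formulas below are assumptions made for illustration only: SST-B is taken to use the previous frame's output as the target for the current frame, and SST-A is taken to average outputs over the sequence and to trigger an update only when the largest fused value exceeds the threshold sc. The function name `sst_targets` and its arguments are hypothetical.

```python
import numpy as np


def sst_targets(outputs, sc=0.65, strategy="A"):
    """Pick self-generated targets for the frames of one temporal sequence.

    outputs: array-like of shape (T, C) with class posteriors for T frames.
    Returns a list of (frame_index, target_vector) pairs.
    NOTE: both rules below are illustrative assumptions, not the
    definitions used in the original experiments.
    """
    outputs = np.asarray(outputs, dtype=float)
    targets = []
    fused = outputs[0].copy()  # running fusion of the outputs seen so far
    for t in range(1, len(outputs)):
        if strategy == "B":
            # Assumed SST-B: enforce slow change by using the previous
            # frame's output as the desired output for the current frame.
            targets.append((t, outputs[t - 1].copy()))
        else:
            # Assumed SST-A: fuse past and current outputs, then update
            # only when the fused prediction is confident enough.
            fused = 0.5 * (fused + outputs[t])
            if fused.max() > sc:
                targets.append((t, fused.copy()))
    return targets


if __name__ == "__main__":
    seq = [[0.9, 0.1], [0.6, 0.4], [0.5, 0.5]]
    print(len(sst_targets(seq, strategy="B")), "updates with SST-B")
    print(len(sst_targets(seq, strategy="A")), "updates with SST-A")
```

Under these assumptions a higher sc makes SST-A skip more frames, which is one way a threshold can turn out too conservative for short sequences.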

Chapter 8. Experimental Evaluation 100

Our CNN implementation worked well with supervised tuning strategies, but (unexpectedly) showed a lower capacity to deal with incremental semi-supervised tuning. Of course, the limitations we encountered could be due to the specific CNN architecture and training, and the outcomes of other recent studies [Goodfellow et al., 2013] can be very useful for checking alternative setups (e.g., investigating the effect of dropout more thoroughly). We recognize that the empirical evaluations carried out in this study are still limited and that, to validate and generalize our semi-supervised tuning results, we need to test the proposed approaches on other, larger datasets, including natural videos of real objects moving smoothly in front of the camera, like the ones contained in CORe50.

However, based on the results obtained so far a question emerges: what made HTM more effective than CNN for incremental learning and semi-supervised tuning from temporal coherence? At this stage we do not have an answer to this question, and we can only formulate some hypotheses, by pointing out architectural/training differences that could have a direct impact on forgetting and capability to work with unlabeled data:

• Pre-training: McRae and Hetherington [1993] argued that network pre-training can mitigate the effects of catastrophic forgetting. During initial training, HTM self-develops internal memories from patterns of the domain instead of randomly initializing weights. This could make it more stable and more resistant to pattern forgetting and to the lack of labels. Of course, a CNN can be pre-trained as well (see [Wagner et al., 2013] for a comparative evaluation of different pre-training approaches), and this is one of the directions we intend to follow in our future studies.

• Type of parameters tuned: CNN training is mostly directed at the feature extraction layers (i.e., filter parameters), while the main targets of HTM + HSR are the parameters of the feature pooling layers. Maltoni and Rehn [2012] argued that the most important contribution of HSR is tuning the probabilities that denote how much each coincidence (i.e., a feature extractor) belongs to each group (i.e., a set of feature extractors). Our HTM incremental tuning by HSR does not alter the feature extractors, but attempts to arrange the existing feature extractors optimally in groups to maximize invariance. Referring to the stability-plasticity dilemma, we speculate that keeping the feature extractors stable (especially at low levels) promotes stability, while moving the pooling parameters is enough to obtain the required plasticity.
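The second hypothesis, stable feature extractors with plastic pooling parameters, can be made concrete with a toy sketch. This is not HSR and not the HTM implementation used in the experiments: it is a generic gradient step, written under our own assumptions, in which a fixed matrix `W` plays the role of the feature extractors (coincidences) and only a membership matrix `P` (how much each feature belongs to each group) is updated. All names are hypothetical.

```python
import numpy as np


def tune_pooling_only(x, target, W, P, lr=0.1):
    """One toy update that changes pooling memberships but not features.

    x:      input vector, shape (d,)
    target: desired group activations, shape (g,)
    W:      frozen feature extractors, shape (f, d); never modified here
    P:      membership of each feature in each group, shape (f, g),
            with every column summing to 1
    Returns the updated membership matrix.
    """
    y = np.maximum(W @ x, 0.0)      # feature activations (frozen layer)
    g = P.T @ y                     # group activations (pooling layer)
    err = g - target
    grad = np.outer(y, err)         # gradient of 0.5 * ||g - target||^2 w.r.t. P
    P_new = np.clip(P - lr * grad, 1e-12, None)   # keep memberships positive
    P_new /= P_new.sum(axis=0, keepdims=True)     # renormalize each group
    return P_new


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(6, 4))
    P = np.full((6, 3), 1.0 / 6.0)
    x = rng.normal(size=4)
    P = tune_pooling_only(x, np.array([1.0, 0.0, 0.0]), W, P)
    print(P.sum(axis=0))
```

Because `W` is only read, whatever the features encode is preserved across updates; all plasticity is confined to how features are grouped.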

In conclusion, we believe that semi-supervised and unsupervised tuning, still scarcely studied with deep learning architectures, is a powerful approach to mimicking biological learning, where continual learning is a key factor. The lack of supervision, here compensated for by temporal coherence alone, can be complemented by other contextual information coming from different modalities (Multiview learning) or from different processing paths (e.g., Co-training). Of course, when supervisor signals are available, supervised and unsupervised learning can be fused into a hybrid scheme (as demonstrated here for SupTR). Moreover, SST can be used in conjunction with the other continual learning strategies presented in this chapter to explicitly address the issue of forgetting while learning from newly arriving data.

The availability of powerful computing platforms makes the development of continual learning systems feasible for a number of practical applications. For example, in our non-optimized HTM implementation, 4 HSR iterations on 1,000 patterns take about 35 seconds (on a 4-core Xeon W3550 CPU); we are confident that, with proper optimization, SST can run on-line once a pre-trained system is switched to working mode.
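To put the reported timing in per-pattern terms, the short calculation below divides the 35 seconds by the number of patterns and iterations. It assumes that cost scales linearly with both, which the text does not state; the function name is hypothetical.

```python
def per_pattern_cost(total_seconds, n_patterns, n_iterations):
    """Split a total tuning time into per-pattern figures.

    Assumes cost is linear in patterns and iterations (our assumption).
    Returns (seconds per pattern, seconds per pattern per iteration).
    """
    per_pattern = total_seconds / n_patterns
    per_pattern_iteration = per_pattern / n_iterations
    return per_pattern, per_pattern_iteration


if __name__ == "__main__":
    per_pattern, per_iter = per_pattern_cost(35.0, 1000, 4)
    print(f"{per_pattern * 1000:.2f} ms per pattern (all 4 iterations)")
    print(f"{per_iter * 1000:.2f} ms per pattern per iteration")
    print(f"{1.0 / per_pattern:.1f} patterns per second")
```

Under the linearity assumption this gives about 35 ms per pattern for the full 4 iterations, or roughly 28.6 patterns per second before any optimization.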

6 Conclusions and Future Challenges

“Does anyone ever finish learning to read music? Do we finish learning how to write or do research? Do we ever learn anything completely? Or do we just keep getting better than we were before?”

– Mark Ring, 1994

6.1 Discussion and Conclusions

The intent of this dissertation was to provide a number of original contributions to the early development of continual learning research in the context of deep architectures for AI. The objective was to propose such contributions within a general approach to continual learning taking into account several practical factors as well as long-term goals.

The comprehensive framework proposed in Chapter 2 is an important step in this direction. The framework, while not too abstract, is general enough to cover all the interpretations of continual learning proposed so far and to avoid the misunderstandings that may arise when different points of view cannot find a more formal common ground. One of the most important steps in disambiguating state-of-the-art research is the disentanglement of the notion of task from the training batch. In fact, although this is not the principal focus of current continual learning research, many training batches may be related to the same task, or the notion of task may not be available to the model at all during training. This is modeled in the framework through the availability of the t label, which makes explicit, for each experiment or strategy, the use of this additional supervised signal. The definition of this framework has allowed us to define three different scenarios with an intuitive interpretation (multi-task, single-incremental-task and multiple-incremental-task), based on the nature and availability of the t label.
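As an illustration of how the t label separates the three scenarios, the sketch below classifies a sequence of training batches from their task labels alone. The mapping used here is an assumption made for the example, since this section does not spell out the formal definitions: one shared (or entirely unavailable) label is read as single-incremental-task, a distinct label for every batch as multi-task, and anything in between as multiple-incremental-task. The function name is hypothetical.

```python
from typing import Optional, Sequence


def classify_scenario(task_labels: Sequence[Optional[int]]) -> str:
    """Name the continual learning scenario implied by per-batch t labels.

    task_labels holds one entry per training batch; None means the task
    label is not available for that batch.
    NOTE: the decision rule is an illustrative assumption.
    """
    if not task_labels:
        raise ValueError("at least one training batch is required")
    missing = [t is None for t in task_labels]
    if all(missing):
        # No task signal at all: treated here as one incremental task.
        return "single-incremental-task"
    if any(missing):
        raise ValueError("mixed availability of t is not covered by this sketch")
    distinct = set(task_labels)
    if len(distinct) == 1:
        return "single-incremental-task"      # all batches share one task
    if len(distinct) == len(task_labels):
        return "multi-task"                   # every batch is its own task
    return "multiple-incremental-task"        # some tasks span several batches


if __name__ == "__main__":
    print(classify_scenario([0, 0, 0]))
    print(classify_scenario([0, 1, 2]))
    print(classify_scenario([0, 0, 1]))
```

Keeping the batch index and the task label as separate pieces of information is what allows several batches to map to the same task.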

Chapter 9. Conclusion and Future Challenges 103

Machine learning research is often driven by practical results on complex benchmarks acknowledged by the community. However, in continual learning research, and especially for deep learning architectures, there were no specific datasets or benchmarks available to assess new strategies and advance our understanding of the problem. This is why in this dissertation we proposed several benchmarks based on the redesign of classic datasets, such as Seq-NORB, Seq-COIL100, and Seq-iCubWord28, as well as completely new benchmarks, such as CORe50 and 3D-VizDOOM Maze, specifically designed for continual learning research.

Having defined a rich set of benchmarks on which to assess novel continual learning approaches, we proposed several CL strategies targeting in particular Single-Incremental-Task scenarios, which, partly because of their additional complexity, had not been much explored until now. To this end we designed computationally lighter and memory-efficient techniques such as SST, CWR, CWR+ and AR1.

The evaluation conducted in Chapter 5 has shown that the proposed continual learning strategies may improve the capabilities of AI systems at many different levels. They may not only make our prediction models more adaptive and autonomous over time, but also solve many practical issues related to sustainability in terms of hardware resources, with the ultimate goal of making AI more ubiquitous and scalable. The experimental evaluation, carried out in different machine learning paradigms such as supervised, reinforcement and semi-supervised learning, has further shown the impact CL may have, not only per se but especially when used in conjunction with many other techniques developed so far in the context of deep learning. While surely not exhaustive and open to improvement in many respects, we think this evaluation has sufficient expressive power to show how the pursuit of the continual learning paradigm may be beneficial for general AI research.
