Update Content Types - Continual Learning with Deep Architectures

2.4 Scenarios

2.4.4 Update Content Types

Orthogonal to the type of task supervision signal we can exploit, it is worth considering three different Update Content Types (UCT), which may greatly affect the complexity of the continual learning scenario. They refer to the kind of data contained in each training batch Bi:

• New Instances (NI): in this case the content of the batch is characterized by new instances (i.e. examples) of the same classes encountered in the previous batches.

• New Classes (NC): the content of each batch Bi is characterized by examples belonging to new classes, never encountered in the previous batches B1, . . . , Bi−1.

• New Instances and Classes (NIC): this update content type constitutes the most realistic setting, where both new examples of previously encountered classes and examples of new classes arrive over time.

For regression, the same organization can be maintained considering each class as a different regressor.
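The distinction between the three update content types can be made concrete with a small sketch. The function names `update_content_type` and `label_stream` are hypothetical, and each batch is assumed to be represented only by its list of class labels; the first batch is treated as the starting point, since every class in it is trivially new.

```python
def update_content_type(seen_classes, batch_labels):
    """Classify one batch as 'NI', 'NC' or 'NIC' given the classes seen so far."""
    batch_classes = set(batch_labels)
    if not batch_classes:
        raise ValueError("empty batch")
    has_new = bool(batch_classes - seen_classes)  # classes never encountered before
    has_old = bool(batch_classes & seen_classes)  # classes already encountered
    if has_new and has_old:
        return "NIC"
    return "NC" if has_new else "NI"


def label_stream(batches):
    """Return the update content type of every batch after the first one."""
    seen = set(batches[0])
    types = []
    for labels in batches[1:]:
        types.append(update_content_type(seen, labels))
        seen |= set(labels)  # classes in this batch are now known
    return types


if __name__ == "__main__":
    print(label_stream([[0, 1], [0, 1, 1], [1, 0]]))  # only known classes
    print(label_stream([[0, 1], [2, 3], [4]]))        # only unseen classes
    print(label_stream([[0, 1], [1, 2], [0, 3]]))     # a mix of both
```

A stream in which every batch is labeled the same way corresponds to one of the three scenarios above; a stream with mixed labels does not fit any single category under this simplified reading.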

3 Continual Learning Strategies

“The transfer of knowledge within the lifetime of an individual has been found to be one of the dominating factors of natural learning and intelligence. If computers ever are to exhibit rapid learning capabilities similar to that of humans, they will most likely have to follow the same principles.”

– Sebastian Thrun, 1996

The sudden interest in CL and its applications, especially in the context of deep architectures, has recently led to significant progress and original research directions, but has left the research community without a common terminology or clear objectives. Here we propose, in line with Kemker et al. [2018] and Zenke et al. [2017], a three-way fuzzy categorization of the most common CL strategies:

• Architectural strategies: specific architectures, layers, activation functions, and/or weight-freezing strategies are used to mitigate forgetting. This category includes dual-memory models that attempt to imitate the hippocampus-cortex duality.

• Regularization strategies: the loss function is extended with loss terms promoting selective consolidation of the weights that are important for retaining past memories. This category also includes basic regularization techniques such as weight sparsification, dropout, and early stopping.

• Rehearsal strategies: past information is periodically replayed to the model to strengthen connections for memories it has already learned. A simple approach is storing part of the previous training data and interleaving them with new patterns for future training. A more challenging approach is pseudo-rehearsal with generative models.
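The simple rehearsal approach mentioned above (storing part of the previous training data and interleaving it with new patterns) can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any specific method discussed in this chapter: the class name `ReplayMemory`, its methods, and the use of reservoir sampling to decide which old patterns are kept are all choices made here for the example.

```python
import random


class ReplayMemory:
    """Fixed-size store of past patterns (hypothetical helper for illustration)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.n_seen = 0  # total number of patterns offered to the memory
        self.rng = random.Random(seed)

    def add(self, patterns):
        """Keep a uniform random subset of everything seen (reservoir sampling)."""
        for p in patterns:
            self.n_seen += 1
            if len(self.items) < self.capacity:
                self.items.append(p)
            else:
                j = self.rng.randrange(self.n_seen)
                if j < self.capacity:
                    self.items[j] = p

    def interleave(self, new_patterns, n_replay):
        """Return the new patterns shuffled together with replayed old ones."""
        k = min(n_replay, len(self.items))
        mixed = list(new_patterns) + self.rng.sample(self.items, k)
        self.rng.shuffle(mixed)
        return mixed


if __name__ == "__main__":
    memory = ReplayMemory(capacity=4)
    memory.add(range(10))                         # patterns from earlier batches
    print(memory.interleave([100, 101], n_replay=3))
    memory.add([100, 101])                        # new patterns become candidates too
```

The mixed list returned by `interleave` would then be used as the training set for the current batch, so that old and new patterns are presented together.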

In the Venn diagram of Figure 3.1, we show a non-comprehensive set of the most popular CL strategies. While each category is being populated with an increasing number of novel strategies, there is ample room for yet-to-be-explored techniques, especially at the intersection of the three categories.

Chapter 4. CL Strategies 31

Figure 3.1: Venn diagram of some of the most popular CL strategies: CWR [Lomonaco and Maltoni, 2017], PNN [Rusu et al., 2016b], EWC [Kirkpatrick et al., 2017], SI [Zenke et al., 2017], LWF [Li and Hoiem, 2016], ICARL [Rebuffi et al., 2017], GEM [Lopez-paz and Ranzato, 2017], FN [Kemker and Kanan, 2018], GDM [Parisi et al., 2018b], EXSTREAM [Hayes et al., 2018a] and AR1, hereby proposed. Better viewed in color.

Progressive Neural Networks (PNN) [Rusu et al., 2016b] is one of the first architectural strategies proposed and is based on a clever combination of parameter freezing and network expansion. While PNN was shown to be effective on short series of simple tasks, the number of model parameters keeps increasing at least linearly with the number of tasks, making it difficult to use for long sequences. The proposed CopyWeights with Re-init (CWR) and its evolution CWR+ constitute a simpler and lighter counterpart to PNN (at the cost of lower flexibility), with a fixed number of shared parameters, and have already proven useful on longer sequences of tasks. Learning Without Forgetting (LWF) [Li and Hoiem, 2016] is a regularization strategy attempting to preserve the model accuracy on old tasks by imposing output stability through knowledge distillation [Hinton et al., 2015]. Other well-known regularization strategies are Elastic Weight Consolidation (EWC) and Synaptic Intelligence (SI), both articulated around a weighted quadratic regularization loss which penalizes changes to the weights that are important for old tasks. In the rehearsal category, Gradient Episodic Memory (GEM) [Lopez-paz and Ranzato, 2017] is an interesting approach using a fixed memory to store a subset of old patterns: it is aimed not only at controlling forgetting but also at improving accuracy on previous tasks while learning the subsequent ones (a phenomenon known as “positive backward transfer”; see Chapter 4.3.1). Incremental Classifier and Representation Learning (ICARL) [Rebuffi et al., 2017] includes an external fixed memory to store a subset of old task data based on an elaborate sample selection procedure, but also employs a distillation step, which makes it overlap with the regularization category. A recent study on memory-efficient implementations of pure rehearsal strategies is provided in [Hayes et al., 2018a], where a new partitioning-based method for stream clustering named EXSTREAM is shown to be very competitive with a Full Rehearsal approach (storing all the past data) and with other memory management techniques.
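The weighted quadratic regularization loss shared by EWC and SI can be illustrated with a short sketch. It assumes that the old weights and a per-weight importance value are already available; how the importance is estimated is what distinguishes the two methods and is not covered here. The function names, the `strength` parameter, and the 1/2 factor are conventions chosen for this example only.

```python
def quadratic_penalty(weights, old_weights, importance, strength=1.0):
    """Weighted quadratic penalty: 0.5 * strength * sum_k F_k * (w_k - w_old_k)^2."""
    return 0.5 * strength * sum(
        f * (w - w_old) ** 2
        for w, w_old, f in zip(weights, old_weights, importance)
    )


def quadratic_penalty_grad(weights, old_weights, importance, strength=1.0):
    """Gradient of the penalty with respect to each weight."""
    return [
        strength * f * (w - w_old)
        for w, w_old, f in zip(weights, old_weights, importance)
    ]


if __name__ == "__main__":
    w = [1.0, 2.0]        # current weights while training on the new task
    w_old = [0.0, 2.0]    # weights consolidated after the old task
    f = [2.0, 5.0]        # assumed importance of each weight for the old task
    print(quadratic_penalty(w, w_old, f))
    print(quadratic_penalty_grad(w, w_old, f))
```

During training on a new task, this penalty would be added to the task loss, so that weights with high importance are pulled back toward their old values while unimportant weights remain free to change.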

Very recently, a growing number of CL techniques have been proposed, based either on variations of the previously introduced strategies or on completely novel approaches, with different degrees of success (see [Parisi et al., 2018a] for a review). In particular, FearNet (FN) [Kemker and Kanan, 2018] and Growing Dual-Memory (GDM) [Parisi et al., 2018b] are interesting approaches leveraging ideas from the architectural and (pseudo-)rehearsal categories: a double-memory system is exploited to learn new concepts in a short-term memory and progressively consolidate them in a long-term one.

In the following sections we describe in more detail some of the most representative strategies for each group and at their intersections. Then the four newly proposed strategies are presented in depth.

3.1 Baseline Strategies

Before moving to more elaborate continual learning strategies, let us consider two basic approaches, Naive and Cumulative, which we will later use as standard baselines in the experimental evaluation conducted in Chapter 5.
