Lifelong Learning - Knowledge Transfer through Shared Representation Spaces

CHAPTER 2 : Background and Related Work

2.1 Knowledge Transfer through Shared Representation Spaces

2.3.2 Lifelong Learning

In a lifelong machine learning (LML) setting [1], tasks are learned sequentially. Upon receiving data for the current task, the agent learns the task, adds the newly obtained knowledge to a knowledge repository, and advances to the next task. The goal is to learn the current task by transferring knowledge gained from learning past tasks. Since the previous tasks may be encountered again at any time, performance across all tasks seen so far must be optimized. Ideally, the lifelong learning agent should scale effectively to large numbers of tasks over its lifetime.

Building upon the Go-MTL formulation, ELLA solves Eq. (2.5) in a lifelong learning setting [1]. For this purpose, a second-order Taylor expansion of each individual loss function around the single-task optimal parameter $\tilde{\theta}^{(t)}$ is used to approximate the risk terms. This simplifies Eq. (2.5) to:

$$\min_{L,\, s^{(1)},\dots,s^{(T)}} \ \sum_{t=1}^{T} \frac{1}{T} \left\| L s^{(t)} - \tilde{\theta}^{(t)} \right\|_{\Gamma^{(t)}}^{2} + \alpha \left\| s^{(t)} \right\|_{1} + \beta \left\| L \right\|_{F}^{2}, \qquad (2.6)$$

where $\Gamma^{(t)}$ is the Hessian matrix of the individual loss term and $\|v\|_{A}^{2} = v^{\top} A v$. To solve Eq. (2.6) in an online scheme, the sparse coefficient vector $s^{(t)}$ is updated only when the corresponding task is the one being learned at the current time step. This reduces the MTL objective to a sparse coding problem for $s^{(t)}$ in the shared dictionary $L$. The shared dictionary is then updated using the task parameters learned so far to accumulate the learned knowledge. This procedure makes LML feasible and improves learning speed by two to three orders of magnitude. The ELLA algorithm can also address reinforcement learning (RL) tasks in an LML setting [39]. The idea is to approximate the expected return function of an RL task using a second-order Taylor expansion around the task-specific optimal policy and to enforce the policies to be sparse in a shared dictionary domain. The resulting problem is an instance of Eq. (2.6), which can be addressed using ELLA.
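The per-task sparse coding step can be illustrated with a small sketch. For a fixed dictionary, it minimizes one summand of Eq. (2.6), (L s - theta)^T Gamma (L s - theta) + alpha * ||s||_1, with constant factors such as 1/T assumed to be absorbed into alpha. The solver is iterative soft-thresholding (ISTA), chosen here only for brevity and not necessarily the solver used by ELLA. The function names, the step-size bound, and the toy values are assumptions made for illustration, and the dictionary update is omitted.

```python
import math

def matvec(A, x):
    """Multiply matrix A (a list of rows) by vector x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def soft_threshold(v, tau):
    """Proximal operator of tau * |v|: shrink v toward zero by tau."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0

def frobenius(A):
    """Frobenius norm of matrix A."""
    return math.sqrt(sum(a * a for row in A for a in row))

def sparse_code(L, theta, Gamma, alpha, iters=500):
    """ISTA sketch for min_s (L s - theta)^T Gamma (L s - theta) + alpha * ||s||_1.

    L: d x k dictionary, theta: length-d single-task optimum,
    Gamma: d x d symmetric positive semidefinite Hessian (hypothetical inputs).
    """
    Lt = [list(col) for col in zip(*L)]  # transpose of L
    # Conservative step size: 2 * ||L||_F^2 * ||Gamma||_F upper-bounds the
    # Lipschitz constant of the gradient of the quadratic term.
    step = 1.0 / (2.0 * frobenius(L) ** 2 * frobenius(Gamma))
    s = [0.0] * len(L[0])
    for _ in range(iters):
        residual = [p - q for p, q in zip(matvec(L, s), theta)]
        grad = [2.0 * g for g in matvec(Lt, matvec(Gamma, residual))]
        s = [soft_threshold(si - step * gi, step * alpha) for si, gi in zip(s, grad)]
    return s

# Toy example: with an identity dictionary and Hessian, each coordinate of the
# solution is theta_i shrunk toward zero by alpha / 2.
L = [[1.0, 0.0], [0.0, 1.0]]
Gamma = [[1.0, 0.0], [0.0, 1.0]]
s = sparse_code(L, theta=[1.0, 0.1], Gamma=Gamma, alpha=0.5)
```

In the toy example the small second coordinate is driven exactly to zero, which is the sparsity effect of the l1 penalty on the coefficient vector.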

Lifelong learning methods have also been developed using deep models. Deep nets have been shown to be very effective for MTL, but an important problem for lifelong learning with deep neural networks is catastrophic forgetting. Catastrophic forgetting occurs when knowledge obtained about the current task interferes with what has been learned before. As a result, the network forgets previously obtained knowledge when new tasks are learned in an LML setting. Rannen et al. [40] address this challenge for classification tasks by training a shared encoder that maps the data for all tasks into a shared embedding space. Task-specific classifiers are trained to map the encoded data from the shared embedding space into the label spaces of the tasks. Additionally, a set of task-specific autoencoders is trained with the encoded data as input. When a new task is learned, the trained autoencoders for past tasks are used to reconstruct the features learned for the new task and then to prevent them from changing, which avoids forgetting. As a result, the memory requirement grows linearly with the number of tasks, in terms of the learnable parameters of the autoencoders. The number of these learnable parameters is considerably smaller than the memory needed to store all the past task data.

Another approach to address this challenge is to replay data points from past tasks while training a network on new tasks. This process is called experience replay, and it regularizes the network to retain the distributions of past tasks. In other words, experience replay recasts the lifelong learning setting into a multi-task learning setting, for which catastrophic forgetting does not occur. Experience replay can be implemented by storing a subset of data points for past tasks, but this requires a memory buffer to store the data. As a result, implementing experience replay is challenging when memory constraints exist. Building upon the success of generative models, experience replay can be implemented without any memory buffer by augmenting the main deep network with a structure that can generate pseudo-data points for the previously learned tasks. To this end, we can enforce the tasks to share a common distribution in a shared embedding space. Since the model is generative, we can use samples from this distribution to generate pseudo-data points for all past tasks while the current task is being learned. Shin et al. [41] use adversarial learning to mix the distributions of all tasks in the embedding space. As a result, the generator network is able to generate pseudo-data points for past tasks.
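The buffer-based variant of experience replay can be sketched as follows. The sketch keeps a bounded subset of past examples and mixes a few of them into every batch for the current task. Reservoir sampling is used to decide which examples to keep, which is an assumption made here for illustration and not a choice stated in the cited works. The class name `ReplayBuffer`, the helper `mixed_batch`, and all parameters are hypothetical.

```python
import random

class ReplayBuffer:
    """Fixed-capacity memory of past examples, filled by reservoir sampling."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0  # total number of examples offered so far
        self.rng = random.Random(seed)

    def add(self, example):
        """Offer one example; every example seen so far is kept with equal probability."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = example

    def sample(self, k):
        """Draw up to k stored examples without replacement."""
        return self.rng.sample(self.items, min(k, len(self.items)))

def mixed_batch(new_batch, buffer, replay_size):
    """Combine current-task examples with replayed past-task examples."""
    return list(new_batch) + buffer.sample(replay_size)

# Toy usage: store examples from a past task, then build a batch for a new task.
buffer = ReplayBuffer(capacity=5)
for i in range(100):
    buffer.add(("task_a", i))
batch = mixed_batch([("task_b", 0), ("task_b", 1)], buffer, replay_size=3)
```

The memory cost is fixed by `capacity` regardless of how many examples have been seen, which makes the memory constraint discussed above explicit. A generative replay method would replace `buffer.sample` with samples drawn from a learned generator.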

We address cross-task knowledge transfer in Part II in chapters 5 through 6. As mentioned, chapter 5 addresses ZSL in a sequential task learning setting. In chapter 6, we address catastrophic forgetting for this setting, where deep nets are base models. In chapter 7, we address domain adaptation in a lifelong learning setting, i.e., adapting a model to generalize well on new tasks using few labeled data points without forgetting the past.

2.4 Cross-Agent Knowledge Transfer

Most ML algorithms consider a single learning agent that has centralized access to the problem data. However, in many real-world applications, multiple (virtual) agents must collectively solve a set of problems because the data is distributed among them. For example, data may be only partially accessible to each learning agent, local data processing may be unavoidable, or communicating data to a central server may be costly or time-consuming due to limited bandwidth. Cross-agent knowledge transfer is an important tool for addressing the challenges of these learning schemes. Graphs are suitable models for multi-agent learning settings: each node in the graph represents a portion of the data or an agent, and the communication modes between the agents are modeled via the (potentially dynamic) edge set of the graph. The challenge is to design a mechanism that optimizes the objective functions of the individual agents and shares knowledge across them over the communication graph without sharing data.

Cross-agent knowledge transfer is a natural setting for RL agents because, in many applications, there are many similar RL agents, e.g., personal assistance robots that operate for different people. Since the agents perform similar tasks, they can learn collectively and collaboratively to accelerate RL for each agent. Gupta et al. [42] address cross-agent transfer for two agents with deep models that learn multiple skills to handle RL tasks. The agents learn similar tasks, but their state spaces, action spaces, and transition functions can be different, for example, two different robots that are trained to do the same task. The idea is to use the skills that are acquired by both agents to train two deep neural networks that map the optimal policies of each agent into a shared invariant feature space such that the distributions of the optimal policies become similar. Upon learning the shared space, the agents map any newly acquired skill into the shared space. Each agent can then benefit from skills that are acquired only by the other agent by tracking the corresponding features for that skill in the shared space and, subsequently, its own actions. By doing so, each agent can accelerate its learning substantially using skills that are learned by the other agent.

Cross-agent knowledge transfer is more challenging when the agents process time-dependent data. A simple approach to model this case is to assume that in Eq. (2.1) there are $K$ agents and $\mathcal{L}^{(u)}(f^{(u)}(X^{(u)})) = \sum_{k} \mathcal{L}^{(u)}_{k}(f^{(u)}_{k}(X^{(u)}_{k}))$. In consensus learning scenarios, it is assumed that all agents try to reach consensus on a parameter that is shared across the agents. We have addressed this cross-agent knowledge-transfer scenario within an LML setting [43] in chapter 9.
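The consensus idea can be illustrated with a minimal sketch in which agents exchange only parameters, never data, over a fixed undirected communication graph. Each agent is assumed to hold a scalar parameter obtained from its own local loss. If the local losses are the quadratics 0.5 * (w - c_k)^2, the minimizer of their sum is the average of the c_k, which repeated neighbor averaging recovers. The use of Metropolis mixing weights, the function names, and the toy graph are assumptions made for illustration and are not the method of [43].

```python
def metropolis_weights(neighbors):
    """Symmetric, doubly stochastic mixing weights for an undirected graph.

    neighbors[i] lists the agents that agent i can communicate with.
    """
    n = len(neighbors)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in neighbors[i]:
            W[i][j] = 1.0 / (1 + max(len(neighbors[i]), len(neighbors[j])))
        # Self-weight makes each row sum to one (W[i][i] is still 0.0 here).
        W[i][i] = 1.0 - sum(W[i])
    return W

def consensus(local_params, neighbors, rounds=200):
    """Each round, every agent replaces its value by a weighted average of
    its own value and the values received from its neighbors."""
    W = metropolis_weights(neighbors)
    x = list(local_params)
    for _ in range(rounds):
        x = [sum(W[i][j] * x[j] for j in [i] + neighbors[i]) for i in range(len(x))]
    return x

# Toy usage: four agents on a path graph 0 - 1 - 2 - 3, each starting from
# the minimizer c_k of its own local quadratic loss.
neighbors = [[1], [0, 2], [1, 3], [2]]
shared = consensus([1.0, 2.0, 3.0, 6.0], neighbors)
```

Because the mixing weights are doubly stochastic and the graph is connected, all entries of `shared` approach the average of the initial values, so every agent ends up with the same parameter even though no agent communicates with all of the others.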

In document Learning Transferable Knowledge Through Embedding Spaces (Page 47-51)