Universidad de los Andes - Colombia Faculty of Engineering
Electrical and Electronic Engineering
Transfer learning in reinforcement learning for image-based environments. An image to
image translation approach.
A thesis submitted in partial fulfillment of the requirements for a degree of Bachelor of Science in Electrical Engineering
By: Juan Sebastian Sosa
Advisors: Fernando Enrique Lozano Martinez, PhD; Carolina Higuera Arias, PhD
DECEMBER 2019
Copyright by Juan Sebastian Sosa, 2019. © Universidad de los Andes. All rights reserved.
Abstract
Reinforcement learning algorithms allow us to create agents that interact intelligently with a variety of environments. However, due to the lack of generalization of these algorithms, trained agents fail in related tasks. This problem is even more evident in image-based environments such as video games, where an agent trained on one level fails to play the next level of the game. To tackle this problem, we propose an image-to-image translation based method that retrieves key knowledge from an agent trained in a source task and uses it to speed up and improve training in the target task. We validate our method in the Super Mario Bros. environment running on the OpenAI Gym toolkit, using the first and second levels of the game as source and target tasks, respectively.
Contents
List of Figures
List of Tables
1 Introduction
1.1 Problem Definition
1.2 Approach
1.3 Outline
2 Background
2.1 Reinforcement Learning
2.1.1 Deep Q Networks (DQN)
2.1.2 Double Deep Q Networks (DDQN)
2.1.3 Prioritized Experience Replay (PER)
2.2 Transfer Learning
2.3 Imitation Learning
2.3.1 Deep Q Learning from Demonstration (DQfD)
2.4 Image to Image translation
2.4.1 Generative Adversarial Networks (GAN)
2.4.2 Cycle Generative Adversarial Networks (CycleGAN)
2.5 Related Work
3 Method Definition
3.1 Source Agent Demonstration
3.2 Imitation of trajectories
3.3 Training
4 Experiments and Results
4.1 Image to Image translation
4.2 DQN agents
4.3 Transfer Learning
5 Conclusions and Future Work
5.1 Conclusion
5.2 Future Work
List of Figures
2.1 Reinforcement learning model.
2.2 Horse to Zebra image translation [24]
3.1 Image2Image transfer learning.
4.1 Loss function of the generators.
4.2 Loss function of the discriminators.
4.3 Loss function of a single GAN.
4.4 Loss function of the identity.
4.5 Loss function of the cycle consistency.
4.6 Results of the image to image translation.
4.7 DQN architecture.
4.8 Mean reward evolution of the DQN agent in the first level.
4.9 Loss evolution of the DQN agent in the first level.
4.10 Mean reward evolution of the DQN agent in the second level.
4.11 Loss evolution of the DQN agent in the second level.
4.12 Mean reward evolution of the DQN agent after using the TL technique in the second level.
4.13 Loss evolution of the DQN agent after using the TL technique in the second level.
4.14 Loss of the imitation learning in the second level.
List of Tables
4.1 Hyper parameters used in DQN.
4.2 Architecture for DQN.
4.3 Mean rewards of multiple agents on level 1-2.
Chapter 1 Introduction
In recent years, the rise of deep learning has led to numerous advances in a wide variety of areas, including image recognition, natural language processing, generative models, recommendation systems and reinforcement learning. In reinforcement learning, the adoption of deep learning has been useful in the development of new techniques, algorithms and optimizations. These advances have opened a new research area: deep reinforcement learning.
Research in deep reinforcement learning has produced algorithms that train agents with superhuman performance in a set of specific tasks. For example, the results of AlphaGo [20] and AlphaZero [19] showed the power of the new techniques in challenging tasks such as playing the game of Go. Furthermore, similar techniques have been used to create agents that outperform humans in video games; a prominent example is the development of a general algorithm to play Atari games [13, 14].
Despite these developments, creating general algorithms capable of completing related tasks without fully retraining the agents has not seen comparable breakthroughs and remains an open problem in the area. Moreover, modern algorithms such as DQN are known to fail in tasks closely related to the one they were trained on.
In order to address the generalization problem, techniques from deep learning, such as transfer learning, have been tried in recent research. This research has shown promising results and new paths of research to follow.
1.1 Problem Definition
In reinforcement learning, modern algorithms such as Deep Q Networks (DQN) enable researchers to train agents to solve a large variety of tasks, without thinking about a suitable state representation and simply using an image of the tasks as the state. However, when an agent trained in a task using such algorithms is tested in a closely related task, it performs poorly. In order to overcome this problem, popular deep learning techniques such as fine tuning have been tried to transfer the knowledge from an agent to a new one. Nevertheless, this approach
fails because the neural networks used in these algorithms tend to overfit their first layers to the state representation of the source task, so even a minimal change in the input image leads to unexpected behavior in the agent.
1.2 Approach
In order to solve the problem of knowledge transfer between agents in image-based environments stated in Section 1.1, we propose a pre-training method. In this method, we use an adaptation of Deep Q-learning from Demonstrations (DQfD) in which the source agent acts as the demonstrator: it takes actions in the target task, but its input state is first translated to the source task domain using image-to-image translation.
1.3 Outline
Chapter 2 presents key concepts needed to understand the proposed method.
Chapter 3 discusses the proposed method and its applicability. Chapter 4 explains the experiments and the implementation and also presents the respective results and the discussion. Chapter 5 presents the conclusion and explores avenues of future work.
Chapter 2 Background
In this chapter we introduce key concepts needed to fully understand the method proposed in this work. First, we explore reinforcement learning and some baseline algorithms and optimizations used in the state of the art. Then we explain transfer learning as the problem this work tries to solve. Later, we introduce imitation learning and image-to-image translation, as both are key techniques used in our approach. Finally, we conclude this chapter with a brief review of the existing related work in the area.
2.1 Reinforcement Learning
Reinforcement learning is the process of learning to map situations to actions in order to maximize a reward signal [21]. The problem of reinforcement learning can be formalized as a Markov decision process (MDP), an abstraction that frames the agent-environment interaction on which reinforcement learning is based.
In the reinforcement learning problem, an agent performs a sequence of actions in the environment; these actions change the state of the environment and return a reward signal to the agent as feedback, as shown in Figure 2.1 [21]. The objective of the agent is to maximize this reward signal by performing the optimal action in every given state.
Figure 2.1: Reinforcement learning model. The agent sends an action a to the environment, which returns a state s and a reward r.
To solve the reinforcement learning problem, a large variety of algorithms have been developed. In the next section, we introduce the Deep Q Networks algorithm which is an adaptation of the Q learning algorithm introduced in [23].
2.1.1 Deep Q Networks (DQN)
In the DQN algorithm, the reward function to maximize at each time step t is a discounted reward and is given by Equation (2.1).
\[ R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'} \tag{2.1} \]
where $T$ is the time step at which a terminal state is reached, $r_{t'}$ is the immediate reward at time step $t'$ and $\gamma$ is the discount factor [13, 14].
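As a small illustration of Equation (2.1), the discounted return of a finished episode can be accumulated backwards with the recursion $R_t = r_t + \gamma R_{t+1}$. The following is a minimal sketch under that reading of the equation; the function names are hypothetical and do not come from the implementation used in this work.

```python
def discounted_return(rewards, gamma):
    """Return R_0 = sum over t' of gamma**t' * r_{t'} for one episode."""
    total = 0.0
    # Walk backwards so that each step applies R_t = r_t + gamma * R_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def returns_per_step(rewards, gamma):
    """Return the list [R_0, R_1, ..., R_T] for one finished episode."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns
```

For example, three rewards of 1 with $\gamma = 0.5$ give $R_0 = 1 + 0.5 + 0.25 = 1.75$.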
Formally, the algorithm tries to approximate the optimal action-value function $Q^*(s, a)$, which is the maximum expected return achievable by following the optimal strategy after seeing the state $s$ and taking the action $a$ [13, 14]. The optimal action-value function is defined in Equation (2.2)
\[ Q^*(s, a) = \max_\pi \mathbb{E}[R_t \mid s_t = s, a_t = a, \pi] \tag{2.2} \]
where $\pi$ is a policy that maps states to actions. Moreover, as stated in [13, 14], the optimal action-value function obeys the Bellman equation, so it can also be written as Equation (2.3)
\[ Q^*(s, a) = \mathbb{E}[r + \gamma \max_{a'} Q^*(s', a') \mid s, a] \tag{2.3} \]
Given the previous property, the DQN algorithm trains a deep neural network, typically a deep convolutional neural network, to minimize the loss function in Equation (2.4) at every iteration $i$ [13, 14]
\[ L_i(\theta_i) = \mathbb{E}_{(s,a,r,s')}[(y_i - Q(s, a : \theta_i))^2] \tag{2.4} \]
where $y_i = r + \gamma \max_{a'} Q(s', a' : \theta_i^-)$ is computed with the parameters $\theta_i^-$ of a target network, and the tuples $(s, a, r, s')$ are uniformly sampled from an experience replay memory in order to stabilize the algorithm. Although this algorithm is robust, as shown in [13, 14], making it useful in our test environment requires two key optimizations: Double Deep Q Networks (DDQN) and Prioritized Experience Replay (PER), explained in Section 2.1.2 and Section 2.1.3 respectively.
2.1.2 Double Deep Q Networks (DDQN)
The vanilla DQN algorithm tends to overestimate some action values, leading to suboptimal policies, as shown in [9]. To reduce this overestimation and improve the algorithm performance, [9] introduces a small change to the original algorithm. This change consists in selecting the greedy action with the current network and evaluating it with the target network, that is, using $y_i^{DDQN} = r + \gamma Q(s', \arg\max_a Q(s', a : \theta_i) : \theta_i^-)$ as the target in Equation (2.4).
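The difference between the DQN target and the DDQN target can be made concrete with plain lists of action values for the next state. This is a minimal sketch, not the implementation used in this work: the function names are hypothetical, and the `done` flag that drops the bootstrap term at terminal states is a common convention that we assume here.

```python
def dqn_target(reward, next_q_target, gamma, done):
    """y = r + gamma * max_a' Q(s', a' : theta^-); only r at a terminal state."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def ddqn_target(reward, next_q_online, next_q_target, gamma, done):
    """y = r + gamma * Q(s', argmax_a Q(s', a : theta) : theta^-)."""
    if done:
        return reward
    # The online network selects the action ...
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    # ... and the target network evaluates it.
    return reward + gamma * next_q_target[best_action]


def squared_td_error(target, q_value):
    """Single-sample version of the loss in Equation (2.4)."""
    return (target - q_value) ** 2
```

When the two networks disagree on the best action, the DDQN target is never larger than the DQN target, which is the mechanism that reduces the overestimation.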
2.1.3 Prioritized Experience Replay (PER)
The DQN algorithm with vanilla experience replay needs a large number of time steps to converge to an optimal policy and tends to forget infrequent experiences that may be useful for training. To address this problem, the PER technique is introduced in [18]. In this technique, instead of saving an array of tuples of the form $(s_t, a_t, r_t, s_{t+1})$, we save tuples of the form $(s_t, a_t, r_t, s_{t+1}, p)$, where $p$ is the priority of the experience and is calculated from the TD-error. Sampling efficiently from this kind of memory requires special data structures, as shown in [18].
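One data structure commonly used for this purpose is a sum tree, in which each leaf stores a priority and each internal node stores the sum of its children, so that sampling proportionally to priority takes logarithmic time. The sketch below is illustrative only: the class and function names are hypothetical, the priority formula follows the proportional variant of [18], and the exponent and small constant are placeholder values rather than the ones used in this work.

```python
import random


def priority_from_td_error(td_error, alpha=0.6, eps=1e-6):
    """Proportional priority (|delta| + eps) ** alpha; values are illustrative."""
    return (abs(td_error) + eps) ** alpha


class SumTree:
    """Binary tree whose leaves hold priorities and whose root holds their sum."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.nodes = [0.0] * (2 * capacity - 1)  # internal nodes, then leaves
        self.data = [None] * capacity
        self.next_slot = 0

    def total(self):
        return self.nodes[0]

    def add(self, priority, item):
        """Store an item, overwriting the oldest one when the memory is full."""
        slot = self.next_slot
        self.data[slot] = item
        self.update(slot, priority)
        self.next_slot = (slot + 1) % self.capacity

    def update(self, slot, priority):
        """Set the priority of a slot and refresh the sums above it."""
        index = slot + self.capacity - 1
        change = priority - self.nodes[index]
        self.nodes[index] = priority
        while index > 0:
            index = (index - 1) // 2
            self.nodes[index] += change

    def find(self, value):
        """Return (slot, priority, item) for a value in [0, total]."""
        index = 0
        while index < self.capacity - 1:  # still at an internal node
            left = 2 * index + 1
            # Go left if the value falls there or the right subtree is empty.
            if value < self.nodes[left] or self.nodes[left + 1] <= 0.0:
                index = left
            else:
                value -= self.nodes[left]
                index = left + 1
        slot = index - (self.capacity - 1)
        return slot, self.nodes[index], self.data[slot]

    def sample(self, rng=random):
        """Draw one stored item with probability proportional to its priority."""
        return self.find(rng.uniform(0.0, self.total()))
```

After training on a sampled tuple, its priority would be refreshed with `update` using the new TD-error.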
2.2 Transfer Learning
Transfer learning can be defined as the process of extracting knowledge from one or multiple source tasks and applying that knowledge to a target task [15].
Formally, transfer learning takes a source domain $D_s$ with a source task $T_s$ (possibly several of each) and uses the knowledge of $T_s$ to improve a predictive function in the target domain $D_t$ and task $T_t$ [15]. Typically, transfer learning is used in tasks such as image recognition and text classification to avoid creating models from scratch [15]. In reinforcement learning, this technique has been used to improve the sample efficiency of the algorithms or, in other words, to reduce the training time of the agents [12].
2.3 Imitation Learning
In imitation learning, the objective is to train an agent to mimic the behavior of an expert agent, normally a human, in a given environment. This process may lead to faster training and helps avoid undesirable situations during training [10]. In Section 2.3.1 we introduce the imitation learning algorithm that we adapted for use in our proposed method.
2.3.1 Deep Q Learning from Demonstration (DQfD)
Deep Q Learning from Demonstration (DQfD) is a method for imitation and reinforcement learning based on DQN in which a supervised learning algorithm is used to speed up the training of an agent [10].
DQfD is a two-stage algorithm. In the first stage, supervised learning is used to closely learn the behavior shown by a demonstrator. In this stage, a gradient descent algorithm is run over a dataset of demonstrations in order to minimize the loss function given in Equation (2.5)
\[ J(Q) = J_{DQ}(Q) + \lambda_1 J_E(Q) + \lambda_2 J_{L2}(Q) \tag{2.5} \]
where $J_{DQ}$ is the loss function of the DQN algorithm in Equation (2.4), $J_{L2}$ is the L2 regularization and $J_E(Q)$ is given by Equation (2.6)
\[ J_E(Q) = \max_{a \in A}[Q(s, a) + l(a_E, a)] - Q(s, a_E) \tag{2.6} \]
where $a_E$ is the action taken by the demonstrator and $l(a_E, a)$ is positive if $a_E \neq a$. This loss function forces the value of any other action to be lower than the value of the demonstrator's action [10]. In the second stage, the DQN algorithm is run using the network resulting from the previous stage as the starting network.
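For a single state, Equations (2.5) and (2.6) can be evaluated directly from a list of action values. This is a minimal sketch with hypothetical function names; we assume that $l(a_E, a)$ is zero when $a = a_E$ and equal to a constant margin otherwise, and the default margin is an illustrative value, not one taken from this work.

```python
def margin_loss(q_values, expert_action, margin=0.8):
    """J_E(Q) = max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E) for one state."""
    augmented = [
        q + (0.0 if action == expert_action else margin)
        for action, q in enumerate(q_values)
    ]
    return max(augmented) - q_values[expert_action]


def dqfd_loss(j_dq, j_e, j_l2, lambda1, lambda2):
    """Weighted sum of the three terms in Equation (2.5)."""
    return j_dq + lambda1 * j_e + lambda2 * j_l2
```

The margin loss is zero only when the demonstrator's action exceeds every other action by at least the margin, so minimizing it pushes the greedy policy toward the demonstrated behavior.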
2.4 Image to Image translation
Image-to-image translation is a class of problems where the goal is to learn a mapping between images of two domains [24]. The problem is aligned if the correspondence between the images of the domains is known, and unaligned otherwise. In the present work, only the unaligned case is relevant. In Section 2.4.1, we introduce Generative Adversarial Networks, a powerful technique used in a variety of machine learning problems; in Section 2.4.2, we present a variation of this technique and how it is used to solve the image-to-image translation problem. Figure 2.2 shows an example of translation between images, in this case a horse-to-zebra translation.
Figure 2.2: Horse to Zebra image translation [24]
2.4.1 Generative Adversarial Networks (GAN)
The objective of Generative Adversarial Networks (GAN) is to generate data that resembles the data of a given dataset. To accomplish this, a GAN consists of two deep neural networks. The first one, known as the generator (G), generates the data. The second one is the discriminator (D), whose function is to determine whether a given sample is generated or comes from the dataset. By training these two networks against each other, the GAN is able to generate convincing data. Formally, D and G play the two-player minimax game in Equation (2.7) [7]
\[ \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{2.7} \]
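To give a feeling for Equation (2.7), the value $V(D, G)$ can be estimated from the discriminator outputs on a batch of real samples and a batch of generated samples. The sketch below is only illustrative; the function name is hypothetical and the discriminator outputs are assumed to be probabilities strictly between 0 and 1.

```python
import math


def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) from discriminator outputs in (0, 1).

    d_real holds D(x) for real samples, d_fake holds D(G(z)) for generated ones.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term
```

A discriminator that cannot tell the two sources apart outputs 0.5 everywhere, which gives $V = -\log 4$; a discriminator that separates them well drives the value toward 0.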
2.4.2 Cycle Generative Adversarial Networks (CycleGAN)
The Cycle Generative Adversarial Networks (CycleGAN) technique is used to learn two mappings between images in two different domains $X$ and $Y$: the first mapping is $G : X \to Y$ and the second one is $F : Y \to X$. To learn these mappings, the technique uses a pair of GANs and combines their losses to ensure cycle consistency. To achieve this, the CycleGAN minimizes the loss function given in Equation (2.8) [24]
\[ L(G, F, D_x, D_y) = L_{GAN}(G, D_y) + L_{GAN}(F, D_x) + \lambda L_{cyc}(G, F) \tag{2.8} \]
where $G$ and $F$ are the generators, $D_y$ and $D_x$ are their respective discriminators and the last term is an additional loss that forces cycle consistency between the mappings, given in Equation (2.9)
\[ L_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}[\| F(G(x)) - x \|_1] + \mathbb{E}_{y \sim p_{data}(y)}[\| G(F(y)) - y \|_1] \tag{2.9} \]
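The cycle-consistency term of Equation (2.9) can be illustrated with toy images represented as flat lists of pixel values and with the two generators passed in as ordinary functions. This is a minimal sketch with hypothetical names; the default weight in `cyclegan_objective` is an assumed illustrative value.

```python
def l1_norm(a, b):
    """Sum of absolute pixel differences between two flat images."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cycle_consistency_loss(G, F, xs, ys):
    """Batch estimate of E_x ||F(G(x)) - x||_1 + E_y ||G(F(y)) - y||_1."""
    forward = sum(l1_norm(F(G(x)), x) for x in xs) / len(xs)
    backward = sum(l1_norm(G(F(y)), y) for y in ys) / len(ys)
    return forward + backward


def cyclegan_objective(gan_loss_g, gan_loss_f, cycle_loss, lam=10.0):
    """Combine the three terms of Equation (2.8)."""
    return gan_loss_g + gan_loss_f + lam * cycle_loss
```

If `F` exactly inverts `G`, the cycle loss is zero; any mismatch after a round trip is penalized pixel by pixel.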
2.5 Related Work
Transfer learning for reinforcement learning arose as soon as researchers observed the need to transfer knowledge between related tasks. Early approaches trained agents on incrementally harder tasks [1]; these approaches failed in larger tasks because a human had to design the intermediate tasks. Further research focused on learning specific features from the state space of the tasks [2]; however, these techniques failed in tasks with a large state space.
Later research uses multiple deep neural networks to perform the transfer in multitask environments, as in [12]; this approach differs from ours in the computational capacity it requires. In [4], the authors propose a mapping technique in which an artificial neural network (ANN) is used to approximate the quality of the possible actions in the target task. The main difference with our approach is that we use an auxiliary ANN to directly map between the state spaces of the tasks.
Finally, the authors of [6] propose an approach similar to ours. However, their technique is based on the A3C algorithm and uses a custom imitation learning algorithm. In contrast, we use the DQN algorithm as our base learner and a more thoroughly tested imitation algorithm for the pre-training step.
Chapter 3
Method Definition
Similarly to the DQfD algorithm presented in Section 2.3.1, our proposed method can be divided into phases, in our case three. The first one is the source agent demonstration, explained in Section 3.1; the second one is the imitation of trajectories, described in Section 3.2; and the final one is training, described in Section 3.3.
3.1 Source Agent Demonstration
In this phase, we use an agent fully trained in the source task to perform a demonstration on the target task. However, as shown in Section 4.3, a naïve demonstration, in which we use the agent in the environment of the target task without any change, leads to poor performance.
The agent behaves this way because the neural network used to approximate the action-value function $Q(s, a)$ in the source task has only seen states of the source task, so it performs poorly on the new data.
To overcome this problem, we propose an intermediate step based on image-to-image translation techniques, specifically a CycleGAN.
We map the current state in the target task to a state similar to the ones seen in the source task. Then, given the translated state, the agent selects its greedy action and performs that action in the environment of the target task. As shown in Section 4.3, the actions taken using this approach outperform those of a purely random agent.
Using the solution described above together with an epsilon-greedy policy, we generate a set of sequences of tuples of the form $(s, a, r, s')$ that we call trajectories. To ensure good performance in the following phases of the method, we keep only the trajectories whose discounted reward is above a specific threshold and save their tuples in a replay memory; all other trajectories are discarded.
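The demonstration phase can be sketched with the source agent and the translator passed in as plain callables. Everything named here is hypothetical: `q_source` stands for the trained source network returning one value per action, `translate` stands for the CycleGAN generator that maps target states to the source domain, and the threshold is a free parameter.

```python
import random


def select_action(q_source, translate, state, n_actions, epsilon, rng=random):
    """Epsilon-greedy action of the source agent on the translated target state."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    q_values = q_source(translate(state))
    return max(range(n_actions), key=lambda a: q_values[a])


def discounted_return(rewards, gamma):
    """Discounted reward of one trajectory, accumulated backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def keep_good_trajectories(trajectories, gamma, threshold):
    """Collect the (s, a, r, s') tuples of trajectories above the threshold."""
    memory = []
    for trajectory in trajectories:
        rewards = [r for (_, _, r, _) in trajectory]
        if discounted_return(rewards, gamma) > threshold:
            memory.extend(trajectory)
    return memory
```

The list returned by `keep_good_trajectories` plays the role of the initial content of the replay memory used in the following phases.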
3.2 Imitation of trajectories
The goal of this phase is to train a new agent that is able to imitate the selected actions from the demonstration, in other words, an agent that can follow the trajectories selected in the previous phase.
To achieve this objective we adapt the pre-training process of the DQfD algorithm, but instead of using the loss function presented in Equation (2.5) we use the reduced form shown in Equation (3.1).
\[ J(Q) = J_{DQ}(Q) + J_E(Q) \tag{3.1} \]
Using the Adam optimization algorithm, we perform a fixed number of optimization steps to avoid over-fitting the network, and we clamp all parameters of the network between −1 and 1 in order to match the characteristics of the following phase.
3.3 Training
In this final phase, we run the DDQN algorithm with the PER optimization to train the agent in the target task. Unlike training from scratch, we use the network trained in the previous phase as the initial network and we keep the trajectories generated in the first phase in the experience replay memory to improve the convergence of the algorithm.
A summary of the whole method is presented in Figure 3.1.
Figure 3.1: Image2Image transfer learning.
function Image2ImageTL(T, EPOCHS)
    t ← 0
    trajs ← {}    ▷ the empty set ∅
    while t ≤ T do
        Append(trajs, newTrajectory)
        t ← t + 1
    end while
    PER ← trajs    ▷ initialize PER
    initialize Q randomly
    for x in 1, ..., EPOCHS do
        optimize Q over a batch of the PER
    end for
    run DQN with Q as initial network
end function
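The three phases summarized in Figure 3.1 can also be written as a short driver function. This is only a structural sketch: every argument is a hypothetical callable supplied by the user, and none of the names correspond to the actual implementation.

```python
def image2image_tl(collect_trajectory, make_q, imitation_step, run_ddqn,
                   n_trajectories, n_pretrain_steps):
    """Run the three phases of the method with user-supplied components."""
    # Phase 1: demonstrations of the source agent on translated states.
    replay = []
    for _ in range(n_trajectories):
        replay.extend(collect_trajectory())

    # Phase 2: supervised pre-training on the demonstrations.
    q_network = make_q()
    for _ in range(n_pretrain_steps):
        imitation_step(q_network, replay)

    # Phase 3: regular training that starts from the pre-trained network
    # and keeps the demonstrations in the replay memory.
    return run_ddqn(q_network, replay)
```

Passing the components as arguments keeps the driver independent of the translation model and of the learner, so each phase can be tested separately.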
Chapter 4
Experiments and Results
We applied our proposed method to consecutive levels of the game Super Mario Bros. To simplify our experiment, we limited it to the first and second levels of the game. To implement and test the proposed method, we used the programming language Python [22], the deep learning framework PyTorch [17], the reinforcement learning toolkit OpenAI Gym [3] and the Super Mario Bros. environment [11]. The full implementation is publicly available (see footnote 1).
In the following sections, we describe the experiments used for every part of the proposed approach. First, we show the experiments in image-to-image translation, including the translated images and the behavior of the loss functions in the CycleGAN. Later, we explain the details of the DQN experiment and show the reward and loss curves. Finally, we compare the behavior of different agents in the second level of the game and show the results of the Transfer learning technique.
4.1 Image to Image translation
In this experiment, we set up a CycleGAN to create the mapping between images of the two levels. To accomplish this, we first use random agents in both levels to collect 5k images per level. The collected images are 84x84 grayscale, so we adapt the standard CycleGAN implementation proposed in [24] to receive single-channel images. Then, using the same parameters proposed in the original paper, we train the model for 50 epochs. Finally, we test the model with another randomly generated dataset. During the training of the CycleGAN, we observe the behaviors shown in Figures 4.1 to 4.5. These behaviors are similar to the ones presented in the original CycleGAN paper [24] for the same loss functions on a different dataset, which suggests that our implementation is correct.
Footnote 1: The source code is available at https://github.com/jssosa10/super-mario-RL-and-TL
Figure 4.1: Loss function of the generators. (Plot of LossG against training steps.)
Figure 4.2: Loss function of the discriminators. (Plot of LossD against training steps.)
Figure 4.3: Loss function of a single GAN. (Plot of LossGGAN against training steps.)
Figure 4.4: Loss function of the identity. (Plot of LossGIdentity against training steps.)
Figure 4.5: Loss function of the cycle consistency. (Plot of LossGcycle against training steps.)
After training our CycleGAN implementation, we test it: using the prepared test dataset, we translate some images of the first level to images of the second level. An example of the results of this translation is shown in Figure 4.6.
Figure 4.6: Results of the image to image translation. (Two panels: Original and Translated.)
4.2 DQN agents
In order to have a baseline against which to compare the results of our proposed method, we train two DQN agents using both the DDQN and PER optimizations. For both agents, we use almost all the parameters proposed in the original paper, but we use a batch size of 512 instead of 32 in order to make fuller use of the available computational resources. We also change the learning rate from $1 \times 10^{-4}$ to $7.5 \times 10^{-4}$. The hyperparameters used to train our DQN agents are listed in Table 4.1. We also use the architecture described in Table 4.2, with ReLU as the activation function for all layers except the last one. A graphical representation of the architecture is shown in Figure 4.7.
Parameter                 Value
Batch size                512
γ                         0.99
Replay memory size        100000
Target update frequency   10000
Learning rate             0.00075
α                         0.95
[symbol missing]          0.05
Table 4.1: Hyper parameters used in DQN.
Layer #   Type
1         Conv2D(kernel_size=8, stride=4)
2         Conv2D(kernel_size=4, stride=2)
3         Conv2D(kernel_size=3, stride=1)
4         FC(3136, 512)
5         FC(512, 12)
Table 4.2: Architecture for DQN.
Figure 4.7: DQN architecture.
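The input size of the first fully connected layer in Table 4.2 can be checked with the usual output-size formula for a convolution. The sketch below assumes unpadded convolutions on the 84x84 input and 64 output channels in the third convolutional layer; neither assumption is stated in the table, but together they reproduce the 3136 inputs of layer 4.

```python
def conv_output_size(size, kernel_size, stride, padding=0):
    """Spatial size after a convolution (floor division, no dilation)."""
    return (size + 2 * padding - kernel_size) // stride + 1


def flattened_features(input_size=84, last_channels=64):
    """Number of inputs to the first fully connected layer of Table 4.2."""
    size = input_size
    # (kernel_size, stride) of the three convolutional layers in Table 4.2.
    for kernel_size, stride in [(8, 4), (4, 2), (3, 1)]:
        size = conv_output_size(size, kernel_size, stride)
    return last_channels * size * size
```

The spatial size shrinks from 84 to 20, then to 9 and finally to 7, so the flattened feature vector has 64 x 7 x 7 = 3136 entries.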
First, we train our algorithm to play the first level of Super Mario Bros. To do so, we use a computer with a GPU with at least 6GB of memory and 12GB of RAM. During the training of the algorithm, we observe how the mean reward achieved by the agent increases as the loss function of the Q network decreases, as shown in Figure 4.8 and in Figure 4.9.
Figure 4.8: Mean reward evolution of the DQN agent in the first level. (Plot of MeanReward against training steps.)
Figure 4.9: Loss evolution of the DQN agent in the first level. (Plot of DQNLoss against training steps.)
Later, we also train the same algorithm to play the second level of the game.
The evolution of the reward and the loss is similar to that of the first level, as shown in Figure 4.10 and Figure 4.11. However, for this level we use a two-phase training to test how the algorithm behaves with a cold start and with a warm start.
Figure 4.10: Mean reward evolution of the DQN agent in the second level. (Plot of MeanReward against training steps; series: warm start, cold start.)
Figure 4.11: Loss evolution of the DQN agent in the second level. (Plot of DQNLoss against training steps; series: warm start, cold start.)
In the results of the training of the DQN algorithm, we can observe an interesting behavior: after a peak in the loss function, the mean reward tends to increase. A plausible explanation is that the peaks correspond to states never seen before by the algorithm: on their first appearance, the network chooses a wrong action with a large TD-error, so the algorithm learns from that new state and, in later episodes, reaches a greater reward. Furthermore, the results show that training on the second level takes about ten times as many steps as on the first one, which shows how much more difficult the second level is. Videos of the trained agents are available (see footnotes 2 and 3).
4.3 Transfer Learning
We test the complete method using the results of the previous experiments. For the supervised learning phase of the method, we use a learning rate of $1 \times 10^{-4}$. For the DQN phase, we use the same parameters as in the DQN experiment.
To find a suitable number of optimization steps for the supervised phase, we test two possible values: 20k steps and 50k steps.
In order to show that the translated environment produces suitable demonstrations for the algorithm, we test multiple agents on the second level of the game and compare their mean rewards in Table 4.3. We observe that the agent trained in the source task (level 1-1) outperforms both its untranslated version and the random agent when we use the image-to-image translation strategy.
Agent                                            Mean Reward
Random agent                                     50
Agent fully trained in level 1-2                 1850
Agent fully trained in level 1-1                 -30
Agent fully trained in level 1-1 + translation   420
Table 4.3: Mean rewards of multiple agents on level 1-2.
As the previous results show, the environment is suitable for applying our proposed method. So, we proceed to test the proposed method using as target and source tasks the levels 1-2 and 1-1 respectively. We get the results shown in Figure 4.12, Figure 4.13 and Figure 4.14.
Footnote 2: The level 1-1 agent is available at https://youtu.be/UzRy9MQVJMA
Footnote 3: The level 1-2 agent is available at https://youtu.be/N7YTFOSYlYA
Figure 4.12: Mean reward evolution of the DQN agent after using the TL technique in the second level. (Plot of MeanReward against training steps; series: with TL, without TL.)
Figure 4.13: Loss evolution of the DQN agent after using the TL technique in the second level. (Plot of DQNLoss against training steps.)
Figure 4.14: Loss of the imitation learning in the second level. (Plot of SupervisedLoss against training steps.)
These results show that the proposed technique improves the initial reward during the training of the agent. However, with the current parameters the technique does not completely outperform training from scratch.
This behavior can be attributed to the use of a small set of demonstrations, which leads to over-fitting during the imitation step.
Chapter 5
Conclusions and Future Work
5.1 Conclusion
This thesis proposes a technique to perform transfer learning in reinforcement learning using an image-to-image translation approach. To do this, we extend the DQN and the DQfD algorithms with image-to-image translation capabilities. Currently, our proposed method trains agents that outperform other algorithms in the short term, but that fail to keep improving as steadily as the vanilla DDQN-PER algorithm.
5.2 Future Work
There are several opportunities for improving the proposed technique. First, we can combine the current algorithm used to collect data (to train the CycleGAN), with a better exploration algorithm such as Go-Explore [5] or Q-map [16], in order to have a better representation of both state spaces.
Second, our solution depends on an adaptation of DQfD to perform the imitation step. Perhaps the algorithm can be improved by using another approach to solve the imitation learning problem.
Finally, our solution works only for a pair of tasks, but studying a multi-source single-target variant of the algorithm may lead to more interesting results.
Bibliography
[1] Minoru Asada, Shoichi Noda, Sukoya Tawaratsumida, and Koh Hosoda. Vision-based behavior acquisition for a shooting robot by using a reinforcement learning. In Proc. of IAPR/IEEE Workshop on Visual Behaviors, pages 112–118. Citeseer, 1994.
[2] Bikramjit Banerjee and Peter Stone. General game learning using knowledge transfer. In IJCAI, pages 672–677, 2007.
[3] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.
[4] Qiao Cheng, Xiangke Wang, Yifeng Niu, and Lincheng Shen. Reusing source task knowledge via transfer approximator in reinforcement transfer learning. Symmetry, 11(1):25, 2019.
[5] Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O Stanley, and Jeff Clune. Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995, 2019.
[6] Shani Gamrian and Yoav Goldberg. Transfer learning for related reinforcement learning tasks via image-to-image translation. CoRR, abs/1806.07377, 2018.
[7] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
[9] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pages 2094–2100. AAAI Press, 2016.
[10] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, et al. Deep q-learning from demonstrations. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[11] Christian Kauten. Super Mario Bros for OpenAI Gym. GitHub, 2018.
[12] Stephanie Laflamme. Transfer in Reinforcement Learning: An Empirical Comparison of Methods in Mario AI. PhD thesis, McGill University Libraries, 2017.
[13] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
[14] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[15] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2009.
[16] Fabio Pardo, Vitaly Levdik, and Petar Kormushev. Q-map: a convolutional approach for goal-oriented reinforcement learning. arXiv preprint arXiv:1810.02927, 2018.
[17] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NeurIPS Autodiff Workshop, 2017.
[18] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. In International Conference on Learning Representations, Puerto Rico, 2016.
[19] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.
[20] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
[21] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
[22] Guido Van Rossum and Fred L Drake Jr. Python tutorial. Centrum voor Wiskunde en Informatica Amsterdam, The Netherlands, 1995.
[23] Christopher John Cornish Hellaby Watkins. Learning from delayed rewards. 1989.
[24] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.