ATARI 2600 Games - Generic Reinforcement Learning Beyond Small MDPs

3.3 Domains

3.3.5 ATARI 2600 Games

Here we look at the use of ATARI 2600 games for evaluating general RL agents. Since the literature on ATARI games is vast, we focus on two domains in particular.

3.3.5.1 Arcade Learning Environment

The reinforcement learning community has lacked a set of general environments that can be used for testing new algorithms in a robust manner. Veness et al. (2011) provided a set of small but challenging problems, but several algorithms (Daswani et al., 2013; Nguyen et al., 2012) now perform too similarly on them to be meaningfully differentiated. The Arcade Learning Environment (ALE), recently introduced by Bellemare et al. (2013), attempts to address this gap by using games made for the ATARI 2600[8] as a test bed for reinforcement learning algorithms. The games in this setting were made for humans and can be relatively complex, but due to the memory and processing limits of the ATARI console they remain computationally feasible for current RL techniques. The ALE consists of an interface to Stella, an open-source ATARI 2600 emulator.

[8] Interestingly, the word “atari” is also a concept from Go, used to describe the situation where a group of stones is in immediate danger of being captured by one’s opponent.

This gives access to hundreds of games in this format, ranging from side-scrollers to arcade games, shooters and puzzles. The interface exposes both the screen pixel matrix and the internal state of the games themselves. This allows both reinforcement learning and planning algorithms to be tested, since some planning algorithms, such as UCT, rely on being able to reset the game to a particular state.

“ALE provides an interface to hundreds of Atari 2600 game environments, each one different, interesting, and designed to be a challenge for human players. ALE presents significant research challenges for reinforcement learning, model learning, model-based planning, imitation learning, transfer learning, and intrinsic motivation” Bellemare et al. (2013).

Existing work. Although the domain has not been around for very long (an initial Master’s thesis by Naddaf (2010), followed by the full framework with benchmarks by Bellemare et al. (2013)), there is already a large body of work using it. The initial works used SARSA(λ) agents based on simple features of the screen. Even these feature spaces were quite large, with the BASS feature set alone exceeding 2 million features. Since then there has been work on sketch-based linear value function approximation by Bellemare et al. (2012), factored models for observation spaces in Bellemare et al. (2013), deep reinforcement learning using convolutional neural networks to approximate the Q-value function (DQN) by Mnih et al. (2013, 2015) and, most recently, an algorithm by Veness et al. (2014) for policy evaluation that constructs a consistent estimator of the Q-value function from existing (probability) density models. The last two algorithms, both from the Google-owned AI research lab DeepMind, achieve results that convincingly surpass human performance on several games.
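As a rough illustration of the early linear agents mentioned above, the sketch below implements linear SARSA(λ) over sparse binary features with replacing eligibility traces. It is not the code of Bellemare et al. (2013): the feature function, the toy five-state chain used in place of an ATARI game (with a reward of -1 per step), and all hyperparameter values are hypothetical choices for illustration. In the ALE setting, `features` would instead return the indices of the active screen features (for example, those of BASS).

```python
import random
from collections import defaultdict


def sarsa_lambda(env_step, features, n_actions, episodes=500, alpha=0.1,
                 gamma=1.0, lam=0.9, epsilon=0.1, start_state=0, seed=0):
    """Linear SARSA(lambda) with binary features and replacing traces.

    features(state) returns the indices of the active (value 1) features,
    so Q(s, a) is the sum of action a's weights over those indices.
    """
    rng = random.Random(seed)
    w = [defaultdict(float) for _ in range(n_actions)]  # sparse weights per action

    def q(state, action):
        return sum(w[action][i] for i in features(state))

    def choose(state):
        # Epsilon-greedy action selection with random tie-breaking.
        if rng.random() < epsilon:
            return rng.randrange(n_actions)
        values = [q(state, a) for a in range(n_actions)]
        best = max(values)
        return rng.choice([a for a, v in enumerate(values) if v == best])

    for _ in range(episodes):
        traces = [defaultdict(float) for _ in range(n_actions)]
        state, action, done = start_state, choose(start_state), False
        while not done:
            next_state, reward, done = env_step(state, action)
            delta = reward - q(state, action)
            for i in features(state):
                traces[action][i] = 1.0  # replacing trace for active features
            if not done:
                next_action = choose(next_state)
                delta += gamma * q(next_state, next_action)
            # Update every traced weight, then decay its trace.
            for a in range(n_actions):
                for i, e in traces[a].items():
                    w[a][i] += alpha * delta * e
                    traces[a][i] = gamma * lam * e
            if not done:
                state, action = next_state, next_action
    return q


def chain_step(state, action, length=5):
    """Hypothetical toy chain: action 1 moves right, action 0 moves left."""
    next_state = min(state + 1, length - 1) if action == 1 else max(state - 1, 0)
    return next_state, -1.0, next_state == length - 1


# One-hot "features" make this tabular; real screen features would be sparse sets.
q = sarsa_lambda(chain_step, features=lambda s: [s], n_actions=2)
greedy = [max(range(2), key=lambda a: q(s, a)) for s in range(4)]
```

Storing weights in dictionaries keyed by feature index keeps updates proportional to the number of active features, which matters when the feature space runs into the millions but only a small fraction is active in any frame.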

Implementation. The ALE consists of an interface to Stella, an open-source Atari 2600 emulator. Game states can easily be saved and restored, which provides a simple way to obtain generative models of the games. The ALE therefore supports the testing of planning as well as reinforcement learning algorithms. Bellemare et al. (2013) also propose several performance measures for comparing algorithms across the different games in the ALE.
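Because the emulator state can be saved and restored, the emulator itself can serve as a generative model for planning. The sketch below shows the idea with flat Monte Carlo rollouts, a simpler technique than UCT (which additionally grows a search tree): before every simulated rollout the planner restores the saved root state. `ToyEmulator` and its `clone_state`, `restore_state`, `act` and `game_over` methods are hypothetical stand-ins for the save/restore facility described above, not the actual ALE interface.

```python
import random


class ToyEmulator:
    """Hypothetical stand-in for an emulator whose state can be saved and restored."""

    def __init__(self, length=6):
        self.length = length
        self.pos = 0

    def clone_state(self):
        return self.pos

    def restore_state(self, state):
        self.pos = state

    def act(self, action):
        # Action 1 moves right, action 0 moves left; reaching the end pays 10.
        step = 1 if action == 1 else -1
        self.pos = max(0, min(self.length - 1, self.pos + step))
        return 10.0 if self.pos == self.length - 1 else 0.0

    def game_over(self):
        return self.pos == self.length - 1


def rollout_value(emu, first_action, depth, rng):
    """Return of one random rollout that starts with first_action."""
    total = emu.act(first_action)
    for _ in range(depth - 1):
        if emu.game_over():
            break
        total += emu.act(rng.randrange(2))
    return total


def plan(emu, n_rollouts=200, depth=10, seed=0):
    """Flat Monte Carlo planning: every rollout restarts from the saved state."""
    rng = random.Random(seed)
    root = emu.clone_state()
    best_action, best_value = None, float("-inf")
    for action in range(2):
        total = 0.0
        for _ in range(n_rollouts):
            emu.restore_state(root)  # reset the emulator to the saved state
            total += rollout_value(emu, action, depth, rng)
        mean = total / n_rollouts
        if mean > best_value:
            best_action, best_value = action, mean
    emu.restore_state(root)  # leave the emulator where the planner found it
    return best_action


emu = ToyEmulator()
emu.pos = 3
chosen = plan(emu)
```

Without the ability to restore the root state, each rollout would continue from wherever the previous one ended, and the action values would no longer be estimates for the current state.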

The Java/C++ source code provided makes it easy to integrate an agent into the framework. The Java code used to generate the features (BASS, DISCO, etc.) has also been made available online. The environment also provides the RAM state of the game, which is a Markovian state. However, depending on the game, the screen frames alone may not be Markovian: as pointed out by Naddaf (2010), the agent cannot tell whether a laser beam is moving towards it or away from it (having been fired by it) without looking at the previous frame.
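The laser example can be made concrete with a toy one-dimensional “screen” in which a single lit pixel marks the laser. One frame only gives the laser's position; two consecutive frames also give its direction. Keeping a short history of frames as the observation, as DQN does by stacking the most recent screens (Mnih et al., 2015), is one common way to recover this information. The frame format and the helper names below are hypothetical.

```python
from collections import deque
from typing import Optional, Sequence, Tuple


def laser_position(frame: Sequence[int]) -> Optional[int]:
    """Column of the single lit (laser) pixel in a toy 1-D frame, or None."""
    for col, pixel in enumerate(frame):
        if pixel:
            return col
    return None


def laser_direction(prev_frame: Sequence[int], frame: Sequence[int]) -> Optional[int]:
    """+1 if the laser moved right, -1 if it moved left, 0 if it did not move,
    None if it is missing from either frame."""
    p0, p1 = laser_position(prev_frame), laser_position(frame)
    if p0 is None or p1 is None:
        return None
    return (p1 > p0) - (p1 < p0)


class FrameStack:
    """Keep the k most recent frames and expose them as one observation."""

    def __init__(self, k: int = 2):
        self.frames = deque(maxlen=k)

    def push(self, frame: Sequence[int]) -> Tuple[Tuple[int, ...], ...]:
        self.frames.append(tuple(frame))
        return tuple(self.frames)


# Agent at column 0. Both histories end in the same frame, so a single frame
# cannot tell them apart, but the two-frame observations differ.
current = [0, 0, 0, 1, 0, 0]
approaching = [0, 0, 0, 0, 1, 0]  # laser was at column 4, now at 3: moving left
receding = [0, 0, 1, 0, 0, 0]     # laser was at column 2, now at 3: moving right

stack_a, stack_b = FrameStack(), FrameStack()
stack_a.push(approaching)
stack_b.push(receding)
obs_a, obs_b = stack_a.push(current), stack_b.push(current)
```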

3.3.5.2 Partially Observable PACMAN (POCMAN)

Pocman is a modified ATARI game domain first proposed by Veness et al. (2011) in order to evaluate MC-AIXI-CTW. It consists of an abstraction of the PACMAN domain to an ASCII format. The agent starts in the center of a standard Pacman map (17 × 17), see Figure 5.7a. At every time step it receives a 16-bit observation made up of: 4 bits coding whether there is a wall in each adjacent square, 4 bits coding whether there is food in each adjacent square, 4 bits coding whether there is a ghost in each of the four directions, 3 bits to “smell” food within 2, 3 and 4 squares, and 1 bit that is active when the agent has swallowed a power pill. It receives a reward of -1 every time it makes a valid move and -10 if it attempts to move into a wall. Eating a food pellet gains 10 and eating all the food on the map gains 100. Eating a ghost resets that ghost to the center of the map. The domain can be treated either episodically or non-episodically.
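To make the observation and reward description concrete, the sketch below packs the five groups of bits into a single 16-bit integer and computes the per-step reward. The bit order, the ordering of directions, and the assumption that the food and clear-map bonuses are added to the -1 move penalty are illustrative choices and may differ from the original implementation of Veness et al. (2011).

```python
from typing import Sequence


def encode_observation(walls: Sequence[int], food: Sequence[int],
                       ghosts: Sequence[int], smell: Sequence[int],
                       power_pill: int) -> int:
    """Pack a POCMAN-style observation into a 16-bit integer.

    Assumed layout: bits 0-3 walls, 4-7 adjacent food, 8-11 ghost in each
    direction, 12-14 food smell within 2, 3 and 4 squares, 15 power pill.
    """
    groups = [(walls, 4), (food, 4), (ghosts, 4), (smell, 3), ((power_pill,), 1)]
    obs, shift = 0, 0
    for bits, width in groups:
        if len(bits) != width:
            raise ValueError(f"expected {width} bits, got {len(bits)}")
        for i, bit in enumerate(bits):
            obs |= int(bool(bit)) << (shift + i)
        shift += width
    return obs  # shift is 16 at this point


def step_reward(hit_wall: bool, ate_food: bool, cleared_map: bool) -> int:
    """Per-step reward; summing the bonuses with the move penalty is assumed."""
    if hit_wall:
        return -10
    reward = -1  # valid move
    if ate_food:
        reward += 10
    if cleared_map:
        reward += 100
    return reward
```

A single integer of this kind is convenient for agents such as MC-AIXI-CTW that treat the observation as a bit sequence to be predicted.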

Existing work. Pocman has been used in the general RL community to evaluate various algorithms, including CTMRL (Nguyen et al., 2012) in the FRL line of work and CPSR (Hamilton et al., 2014) in the PSR line of work (Hamilton et al. also consider a modified version, S-Pocman, which adds long-term dependencies to the environment). Silver and Veness (2010) develop PO-UCT, a POMDP version of the simulation-based planning algorithm UCT, which performs extremely well on the domain but requires access to a POCMAN emulator.

Popularity. Outside of the general RL community POCMAN is relatively unknown. However, the game it is based on (MS-PACMAN) is very famous and has been implemented on many platforms, including, most recently, a playable Google doodle.

In document Generic Reinforcement Learning Beyond Small MDPs (Page 69-71)