Wilmer Bandres
September 2017
Master thesis on Interactive Intelligent Systems
Universitat Pompeu Fabra
Contents
1 Introduction 1
2 Planning in ALE 4
2.1 Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
4 Conclusions 24
4.2.1 Caching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
List of Tables 30
Bibliography 31
Abstract
The Arcade Learning Environment (ALE) supports the Atari 2600 games and gives
access to the game states, the RAM, the screen, and the actions that can be
executed in each state. These games cannot be modeled in PDDL, since the action
effects and rewards are not known until the actions are applied. Classical blind
search algorithms, like Breadth-First Search or Dijkstra's algorithm, can still
be applied to this kind of problem, but the search space is too large and they
produce poor results.
Planning algorithms that perform a structured exploration of the state space can
be used to solve problems where the search space is huge. One of them is the
Iterated Width (IW) algorithm, which performs a blind breadth-first search
(meaning that it does not require prior knowledge of the state transition costs
or goals), pruning states that do not yield a new combination of features. The
IW(1) algorithm has recently been shown to yield state-of-the-art results in the
Atari games when used with the memory (RAM) states as the lookahead states. This
is different from humans and reinforcement learning algorithms, which do not
have access to the RAM state of the ALE emulator and instead use the screen
pixels as states. In this work, we narrow the gap between planning and learning
approaches to the Atari games by using IW(1) and suitable variations for
planning in the Atari games with features obtained from the screen pixels,
following the work of Yitao Liang et al. [1]. We show that the scores obtained
by planning with the screen pixels rival those reported by humans and
reinforcement learning algorithms, and also those reported for planning with
(more informed) RAM states.
Introduction
One of the main objectives of Artificial Intelligence is to reach and surpass
human performance on challenging tasks, like games. The Arcade Learning
Environment [2] (ALE) provides tasks where the target is to get the best rewards
on Atari 2600 games. It offers an interface to a simulator that, given a state
and an action, returns the reward of applying that action and the resulting
state, which makes it well suited for planning purposes. Planning tries to solve
tasks in a general way, modeling each one as a graph-like problem given a
transition function like the one provided by ALE. Nonetheless, classical
planners cannot address Atari games, since there is no compact PDDL-like
encoding of the domain and the goal is not given [3]; as a consequence,
approaches like heuristic search cannot be used either. In this case, the
planner has to select the next action on-line, using the simulator to explore
the search space, and this action selection problem can be addressed as a
reinforcement learning problem over an MDP or as a net-benefit planning problem.
Blind-search-like algorithms' performance, however, tends to be affected when
the search space is huge. Recent search algorithms like IW [6] have been shown
to scale up better, while being applicable in all contexts where blind-search
algorithms can be applied, provided that states are defined over a finite number
of state variables or features.
Lately, the RAM states were used to extract features for the Iterated Width
algorithm to play Atari games, outperforming on-line planners like the ones
using BrFS or UCS [4]. On the other hand, reinforcement learning approaches,
like DQN and Shallow RL [1], have been applied to learn controllers by
processing the screen pixels, finding spatial relations between them and relying
on trial and error. Since ALE provides access to the screen pixels, these ideas
can be combined, and it is possible to plan with pixels, extracting features
from the screen and using IW(1) to compute lookaheads.
The objectives of this work are the following:

- Implement and study the Iterated Width algorithm using different sets of
  features extracted from the screen.
- Study the relation between the features and the score they help to achieve.
- Compare the new approaches against the state of the art on the Atari games and
  against human performance.
Chapter 3 covers the definitions of the features extracted from the screen, a
variation of the Iterated Width algorithm, and another kind of features (the
tiling features). Then, a section with the results obtained so far is given,
comparing the IW algorithm with pixels against human performance and the state
of the art in Reinforcement Learning and planning.
Planning in ALE
The following sections present some concepts on planning, the Arcade Learning
Environment, Iterated Width, and previous work, in order to give insight into
the aim of this work.
2.1 Planning
Planning is the model-based approach to autonomous behavior, where the agent
selects the action to do next using a model of how actions and sensors work, the
current situation, and the goal to be achieved. The problem of intelligent
behavior is the problem of selecting the action to do next, and there are three
different approaches to it [7]: in the programmer-based approach, a solution to
a single problem is given by a controller written by a programmer; in the
learning-based approach, the controller is learned from experience, that is, the
agent performs actions and observes its environment in order to learn; lastly,
in the model-based approach, the controller is derived from a model of the
actions, goals and sensors.
languages, a way to express the models in a compact way; and the algorithm which
generates the behavior.
Problems in planning can be modeled as a slight variation of the basic state
model, defined in terms of [7]:

- a deterministic state transition function f such that s' = f(a, s), where
  f(a, s) is the state resulting from applying the action a in s, and
Under this definition it is assumed that the costs of the actions do not depend
on the state where they are applied, and hence c(a, s) = c(a). A solution in
this model is a sequence of applicable actions that maps the initial state into
a goal state, and it is known as a plan. Formally, a plan π = a0, ..., a_{n-1}
generates a state sequence s0, ..., sn such that ai ∈ A(si) and
si+1 = f(ai, si). The plan is called optimal if it has the minimum cost over all
plans. There is also a similar model for partially observable planning, a
variation of the classical model with uncertainty and feedback, where the state
transition function returns a set of possible next states and where it is
possible to observe partial information about the state.
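This model can be made concrete with a small sketch of plan validation under the
definitions above (all function names are illustrative, not from the thesis): a
plan a0, ..., a_{n-1} is valid if each action ai is applicable in si and
si+1 = f(ai, si), and its cost is the sum of the state-independent costs c(a).

```python
# Minimal sketch of the classical state model (hypothetical interface).
def validate_plan(s0, plan, applicable, f, cost, is_goal):
    """Return the total cost if `plan` maps s0 into a goal state, else None."""
    s, total = s0, 0
    for a in plan:
        if a not in applicable(s):  # ai must be in A(si)
            return None
        total += cost(a)            # c(a, s) = c(a): state-independent
        s = f(a, s)                 # si+1 = f(ai, si)
    return total if is_goal(s) else None

# Toy domain: states are integers, "inc"/"dec" add or subtract 1, goal is 3.
plan_cost = validate_plan(
    0, ["inc", "inc", "inc"],
    applicable=lambda s: {"inc", "dec"},
    f=lambda a, s: s + 1 if a == "inc" else s - 1,
    cost=lambda a: 1,
    is_goal=lambda s: s == 3,
)
```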
A planner is a program that accepts descriptions of problems and outputs the
control. There are two kinds of planners:

- offline planners, which compute the action to execute in every state in
  advance, and
- online planners, which give the action to be executed in the current state or
  situation. These planners are also called closed-loop controllers, since the
  actions depend on the observations in each state.
In terms of automated planning, the novelty of a state s is the minimum k such
that s is the first state in the space searched so far that makes some k-tuple
of atoms true. The procedure IW(i) is a breadth-first search where states are
pruned when their novelty is greater than i.
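The pruning rule for the simplest case, IW(1), can be sketched as follows
(`features` and `successors` are hypothetical callbacks standing in for the
feature extraction and the simulator): a state is kept only if it makes some
atom true for the first time in the search.

```python
from collections import deque

# Minimal IW(1) sketch over boolean features (illustrative interface):
# `features(s)` returns the set of atoms true in s, `successors(s)` the
# child states. A state is pruned unless it has novelty 1, i.e. it makes
# some atom true for the first time in the search.
def iw1(root, features, successors):
    seen_atoms = set(features(root))   # the novelty table for 1-tuples
    queue = deque([root])
    visited = [root]
    while queue:
        s = queue.popleft()
        for t in successors(s):
            new_atoms = features(t) - seen_atoms
            if not new_atoms:          # novelty > 1: prune
                continue
            seen_atoms |= new_atoms
            queue.append(t)
            visited.append(t)
    return visited
```

On a chain of states whose feature saturates after a few steps, the search stops
as soon as no state brings a new atom, which is what bounds IW(1) by the number
of atoms rather than by the size of the state space.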
byte is xj [4]. The results presented by the authors for IW outperform
algorithms like 2BFS, BrFS and UCT on 26 of 54 Atari games, computing the
average score over 10 runs.
A similar approach exploits a feature coming from the state structure: the
score. The authors proposed improvements over the IW(i) algorithm [8]:

- While preserving the breadth-first dynamics of IW(i), ties in the queue are
  broken to favor states with higher accumulated reward.

This approach has an advantage: when a new highest score is achieved, it is as
if the novelty table were reset for every node with a higher score than the last
highest score. Of course, this idea penalizes nodes with a low score when a new
score is found, but it helps to explore interesting branches of the search space
more deeply. However, the main problem with this new approach is that the IW
algorithm ceases to be polynomial in the number of atoms, because as long as new
states with higher scores are found, the algorithm will continue the search.
restarted and the agent could execute as many episodes as it wants to learn a good
controller.
Several approaches based on RL have been presented and compared [1]. DQN employs
a deep convolutional neural network to represent the state-action value function
Q; a deep convolutional network uses the pixel patterns along with multiple
layers to allow the network to detect patterns and relationships at
progressively higher levels of abstraction. On the other hand, the Shallow RL
approach introduces boolean features in order to get information from the
screen, based on the DQN architecture, which captures spatial and temporal
relations between the screen pixels. The authors proposed a set of features
based on the pixels present in certain places on the screen (Basic features),
the spatial relations between them (B-PROS features) and their offsets in time
(B-PROT features). As a consequence, the Shallow approach competes with DQN and
reaches a good performance compared to professional players.
Unlike the planning approach, the RL approach maintains a single state in memory
at a time, since only a single branch of the search space is explored; several
episodes can be run; and finally, the controller can be used in stochastic
situations. For the purposes of this work the environments are deterministic,
but some ideas coming from RL can be used in combination with the planning
approach.
Chapter 3
Planning with Pixels in ALE
This chapter first introduces the features in the IW algorithm and an extension
of them; then, the feature sets presented in previous work are introduced,
followed by the definition of a new variation of the Iterated Width algorithm
using an auxiliary function; next, a new way to extend the basic features is
explained; finally, the experimental results are discussed.
The Iterated Width algorithm is defined in terms of features, which are
variables associated with values. In the case of the Atari games, RAM variables
have been associated with values between 0 and 255. The state can also be
represented in terms of the screen pixels if each pixel has a variable whose
value is in the set of colors. However, this definition can be extended, and a
feature could be any boolean interpretation of the state variables; e.g., it is
possible to define a boolean variable whose value is true if half of the pixels
on the screen are red. The idea behind these boolean features is to represent
the state in other ways, or maybe in a more compact form, in order to find
better ways to explore the search space.
3.2 From Pixels to Features
Given that it is possible to interpret the screen pixels in several ways, the
authors of the Shallow RL approach proposed a way to split the screen pixels in
order to obtain three sets of features [1]:
- Basic features: to obtain these features, the authors divided the Atari 2600
  game screen, whose dimensions are 210x160 pixels, into a 16x14 grid of tiles
  of 10x15 pixels each. Given a tile at row r and column c, the authors defined
  the binary feature (r, c, k), meaning that a pixel with color k is present in
  the tile (r, c). Since there are 128 different colors on the Atari games
  screen, there is a total of 16 x 14 x 128 = 28672 different basic features.
Moreover, these are boolean features that can be used to represent the states
and direct the search in planning algorithms, with a search space bounded by the
number of different features.
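To make the construction concrete, here is a minimal sketch of how the Basic
features could be extracted, assuming the screen is given as a 210x160 NumPy
array of palette indices (the function name and interface are illustrative, not
part of the Shallow RL implementation):

```python
import numpy as np

# Sketch of Basic feature extraction. Tiles are 15 pixel rows by 10 pixel
# columns, giving a 14x16 grid of tiles on a 210x160 screen; the feature
# (r, c, k) is true when color k occurs somewhere in tile (r, c).
def basic_features(screen, tile_h=15, tile_w=10):
    rows, cols = screen.shape[0] // tile_h, screen.shape[1] // tile_w
    feats = set()
    for r in range(rows):
        for c in range(cols):
            tile = screen[r*tile_h:(r+1)*tile_h, c*tile_w:(c+1)*tile_w]
            for k in np.unique(tile):   # colors present in this tile
                feats.add((r, c, int(k)))
    return feats
```

On a blank screen every tile contributes a single feature (its background
color), so the number of active features stays far below the 28672 bound.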
As was mentioned, the basic features are few, but a way to expand them, and even
the B-PROS or B-PROST features, is to have several novelty tables depending on
some function f. This new approach, called IW(1, f), is the same Iterated Width
algorithm presented in the previous section, but it depends on a function f
which, given a state s, returns the number of the novelty table the algorithm
has to check in order to determine the novelty of s.
In terms of classical planning, f could be the function that returns the number
of top goals achieved up to s, but in the case of the Atari games there are no
goals: the aim is to maximize the score. Therefore, given that the score on the
Atari games varies between 0 and 500000, a possible function f, for the purposes
of this work, is the log2 function applied to the score of a state. This
function just takes the current score of the state s, and the novelty table that
IW should check is table number ceil(log2(score(s))). Hence, there are 19
possible novelty tables, giving a total of 19 x 28672 = 544768 basic + score
features. Of course, there are other ways to take the score into consideration
as part of f; e.g., dividing the range (0, 500000) into x equal parts makes
sense in some games where the score does not vary much.
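The table selection and novelty check described above can be sketched as follows
(function names are hypothetical; the novelty tables are plain sets of atoms,
one per value of f):

```python
import math

# Sketch of the f function used by IW(1, f) in this chapter: the novelty
# table to consult for state s is number ceil(log2(score(s))), with table 0
# for scores of 0 or 1, giving up to 19 tables for scores up to 500000.
def table_index(score):
    return 0 if score <= 1 else math.ceil(math.log2(score))

def is_novel(state_atoms, score, tables):
    """Check and update novelty of a state against the table chosen by f."""
    table = tables.setdefault(table_index(score), set())
    new_atoms = state_atoms - table
    table |= new_atoms
    return bool(new_atoms)
```

Because each score band has its own table, a state whose atoms were all seen at
a lower score can still be novel once a higher score is reached, which is
exactly the "reset" effect discussed earlier.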
One of the previous sections explained a way to split the screen: the screen is
divided into tiles of 15 x 10 pixels, and the basic features are represented by
tuples of the form (row, column, color), where the row is a number between 1 and
14, the column is a value between 1 and 16, and the color is an even value
between 0 and 254. These tuples create a total of 28672 features, which seems a
small number, so one possibility is to split the screen into more tiles, say
tiles of size 5 x 5, giving a total of 172032 features. Furthermore, the more we
split the screen the more features we have, but the meaning of these features
depends on the frameskip used in the Atari environment. In other words, Figure 1
shows two screens of the Space Invaders game after 20 frames where the invader
does not go
out of the patch; as a consequence, the basic features for the patch next to it
do not change, and the second state can be pruned if no other feature changes
either. Given this example, there are two ways to keep the second state with
novelty 1:
- by changing the patch size to a smaller value, so that the invader passes
  through another patch, causing the state to have novelty 1, or
- by augmenting the frameskip, so that the invader gets out of the patch.
Of course, the choice of frameskip depends on the way we want to compare the
algorithms. For example, if we want to compare the IW(1) algorithm with human
performance, and we know that humans can do an action every 200 ms, then the
frameskip should be set to 12, since a second is 60 frames in the Atari
environment. Also, the patch size is important because it affects the complexity
of computing the novelty of each state.

In addition to the patch size, it is possible to shift the tiling either from
left to right or from top to bottom. Creating more novelty tables, which differ
in the way the patches are placed, can solve the problem just mentioned, and as
a consequence moving objects can be detected.
The methodology used in this dissertation is experimental, since the results are
obtained from runs of the online-planning agents, and the setup of each
experiment is the same as the one proposed by Lipovetzky, Ramirez and Geffner in
[4]. In other words, each experiment meets the following constraints:

- The maximum depth per lookahead is 1500, meaning that the deepest node that
  can be reached from the root node by simulating is at most 1500 frames away.
- If a node has a subtree explored before with more simulated frames than the
  budget size, then the budget is not going to be used in that subtree.
- When a node is generated, it is not pruned if, after applying the same action
  that generated the node over a maximum of 30 frames, the resulting state has
  features different from the original node.
There are many possible configurations, varying the frameskip, the patch size,
the tiling shift and the features used. For the purposes of these results, the
basic features are referred to as features of type 1, the Basic + B-PROS
features as type 2, Basic + B-PROS + B-PROT as type 3, and finally configuration
4 refers to the features of IW(1, f) where f is the log_score function. The
results for the RAM configuration are not the ones presented in [4], but are
similar, since we ran a slight variation.
The first parameter to be tuned is the frameskip: when an action is executed, it
is repeated n times, where n is the frameskip. Table 1 presents the scores for
configurations of the form rgb_<configuration>_<frameskip> and for the RAM
configuration (with a frameskip of 5).
On the other hand, it is possible to use the tiling features mentioned in the
previous section. These features address the problem of moving objects by
shifting the patches from top to bottom or from left to right. Table 2 shows the
scores of several configurations
<frameskip>_<patch_rows_size>_<patch_columns_size>_<differential>, where
patch_rows_size refers to the number of pixel rows each patch has, and likewise
patch_columns_size. Lastly, the differential refers to how many pixel rows are
skipped between shifted tilings; in other words, having a patch_rows_size of 15
and a differential of 3 means that there are five tilings (15/3): the first
tiling is a normal tiling with patches of the specified size, the second one is
also a tiling but with the patches moved three rows of pixels downwards, and so
on. The main problem with these features is that the time to compute them is the
time to compute the basic features multiplied by a factor that depends on the
differential. However, the results show that these features do not improve the
scores much, except maybe for games like Montezuma's Revenge, where almost every
configuration gets a score different from zero. Furthermore, they do not seem
worthwhile, since there is still no tendency of improvement and they are costly
to compute.
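The shifted tilings just described can be sketched as a small extension of the
basic feature extraction (the function name is illustrative; the screen is
assumed to be a 210x160 NumPy array of palette indices, as before):

```python
import numpy as np

# Sketch of the shifted-tiling features: with a patch height of 15 pixel
# rows and a differential of 3 there are 15 // 3 = 5 tilings, each shifted
# 3 more pixel rows downwards; a feature is (tiling, row, col, color).
# Computing them costs roughly (tile_h // differential) times the cost of
# the basic features, matching the overhead discussed in the text.
def tiling_features(screen, tile_h=15, tile_w=10, differential=3):
    feats = set()
    for t in range(tile_h // differential):
        shifted = screen[t * differential:]   # drop the first t*diff rows
        rows, cols = shifted.shape[0] // tile_h, shifted.shape[1] // tile_w
        for r in range(rows):
            for c in range(cols):
                tile = shifted[r*tile_h:(r+1)*tile_h, c*tile_w:(c+1)*tile_w]
                for k in np.unique(tile):
                    feats.add((t, r, c, int(k)))
    return feats
```

An object that moves only a few pixels between states can then change a feature
in one of the shifted tilings even when the unshifted tiling sees no change.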
In order to compare the screen features with human performance, a frameskip
equivalent to human reflexes has to be used. Professional gamers perform actions
in a range between 200 and 250 ms (every 1/5 of a second), and given that 60
frames is one second in Atari games, the frameskip should be set to a number
between 12 and 15. In order to fix this parameter for the next experiments, a
frameskip of 15 is proposed. Since the tiling features do not present a tendency
of improvement, they are omitted. Table 3 shows the score results of four
configurations with basic (1_15), B-PROS (2_15), B-PROT (3_15) and log_score
(4_15) features, and also the human performance (results from [1]).
Finally, Table 4 sums up the scores and shows how many times each configuration
is worse than the ones presented in the columns. It can be noticed that the best
configuration is 4_15, outperforming even the human level of performance, maybe
because it is the only configuration that takes the score into account the same
way a human does. Even though its number of features is smaller than in
configurations 2_15 and 3_15, it seems that the score plays an important role in
exploring the search space of many games. In the case of 4_15, every time a
better score is found the algorithm explores the search space more deeply, since
it follows the ideas presented by Shleyfman et al. [8], with the advantage that
it keeps the running time and space polynomial in the number of basic features
and the number of different values for f in IW(1, f), instead of exponential.
is different from the number generated, since an expanded node means that it was
not pruned.
While it is difficult to draw firm conclusions, it seems that basic features
with a log_score function outperform the other configurations, and even the
state of the art in reinforcement learning.
One advantage of the screen features is that, unlike RAM features, they can be
displayed and give hints about what is happening in the algorithm. In
particular, the configuration 4_15 performs poorly in games like Asteroids,
Chopper Command, Crazy Climber, Fishing Derby, Freeway and Kung Fu Master.
Asteroids is an interesting domain, since scores can be reached faster than in
other games. Moreover, running the solutions found by 4_15, two doubts arise:

- why does the spaceship sometimes prefer to hit asteroids that are farther away
  than others? and,
A possible cause of the first observation is the discount factor: a lower
discount factor prioritizes shallower states or scores, and as a consequence it
would cause the agent to hit the nearest asteroid. However, running the same
features (4_15) with a discount factor of 0.8 gives a score similar to the one
given by a discount factor of 0.995 with the same set of features, which
discards the discount factor as a possible cause of the low scores in Asteroids.
On the other hand, the second question gives another intuition about what is
really happening: the frameskip seems to be too high when the agent has to
rotate the spaceship. Figure 2b shows the spaceship after just one action with a
frameskip of 15, which rotates the spaceship almost 180 degrees; it means that
the spaceship cannot rotate just 90 degrees to hit a closer asteroid. As a
consequence, even if it does not happen in many of the Atari games, the
frameskip can affect the way an agent finds a higher score.
Table 1: Scores of the agents on the Atari games using IW in configurations with
a frameskip of 5 or 30 and features of kind Basic (number 1), B-PROS (number 2),
B-PROT (number 3), and using the RAM states. Each configuration is represented
as rgb_<feature type>_<frameskip>, or RAM; the agent uses dynamic frameskipping
up to 30 frames and a budget of 150000 frames per lookahead.
Game RAM rgb_1_5 rgb_1_30 rgb_2_5 rgb_2_30 rgb_3_5 rgb_3_30
Alien 7560 940 20360 1180 6830 1260 4480
Amidar 1879 17 2378 11 2445 6 1921
Assault 1143 787 1501 954 6247 525 3314
Asterix 47300 14900 48100 3700 6700 2150 550
Asteroids [p1] 13320 1730 1300 4370 2950 2300 1490
Atlantis 104300 70700 31100 40200 32200 7400 23500
Bank Heist 440 20 3327 40 4546 40 1599
Battlezone 1000 23000 110000 13000 85000 8000 51000
Beamrider 11460 2160 2160 660 6090 264 2216
Berzerk 1860 350 9810 350 2520 150 2490
Bowling 53 21 26 26 24 14 26
Breakout - Breakaway IV 415 2 14 0 56 0 28
Carnival 5500 740 3140 2440 4460 4640 7480
Centipede 46881 101169 26947 17392 6374 5033 10173
Chopper Command 6900 19200 1300 400 6500 400 1200
Crazy Climber 34100 9100 10600 15300 15700 12000 14300
Demon Attack 19385 21015 21640 1320 4410 140 8330
Double Dunk 0 -20 -18 4 -13 0 -12
Elevator Action 8000 0 0 1100 11200 0 5900
Enduro 200 0 399 0 104 28 85
Fishing Derby 30 -83 -13 -77 -41 -46 -25
freeway.bin 34 0 7 2 12 2 14
Frostbite 3950 120 4510 250 3750 200 3140
Gopher 16860 2580 9260 8100 11820 2640 11380
Gravitar 5300 250 3000 2850 4750 2100 4300
H.E.R.O. 13535 6850 7440 13170 7640 7450 13210
Ice Hockey 2 -7 39 -4 2 -1 2
James Bond 007 1000 100 800 100 200 0 100
Kangaroo 0 0 2100 200 5300 200 4400
Krull 5150 6200 8710 960 3720 150 2730
Kung Fu Master 28900 13400 2900 11300 2000 2600 5900
Montezumas Revenge - Starring Panama Joe 0 0 0 0 2500 0 0
Ms. Pac-Man 21281 300 26651 4360 20911 3570 5350
Name This game 8980 9950 9270 4570 7510 1200 5910
Pooyan 10235 1665 4695 5680 10425 1580 5840
Private Eye 100 100 100 100 120 100 600
Q-bert 1425 1325 20350 12975 5975 4575 28575
River Raid 6990 7070 16990 3390 6060 1580 4770
Road Runner 38800 23900 22200 12300 41200 600 16800
Robot Tank 13 1 39 7 57 4 29
Seaquest 4000 260 10120 520 1300 260 320
Skiing -524 -15366 -20998 -17403 -17462 -16343 -22840
Star Gunner 1700 8000 1800 1500 1500 1500 1400
Tennis 1 -23 20 -19 20 -15 20
Time Pilot 30700 28000 4100 23200 15800 8600 7800
Tutankham 251 135 74 221 80 62 79
Up n Down 23220 4920 44170 17690 11020 4560 2370
Venture 0 0 0 0 300 0 500
Video Olympics 20 -21 -16 -17 -11 -21 -12
Video Pinball 62407 24977 349122 10728 303510 87766 346715
Wizard of Wor 40300 2400 1500 800 1800 1400 19300
Zaxxon 14500 0 5500 4700 13000 2100 4100
Table 4: Comparison between the best configurations found so far. It shows how
many times the configuration in the row performed worse than the configuration
in the column, over a total of 53 Atari games. The configurations correspond to
human performance, the state of the art in planning (RAM), Reinforcement
Learning (DQN and RL-Shallow, i.e. Blob-PROST), and the best configurations
found in this work: 1_15 (IW with basic features and a frameskip of 15), 2_15
(IW with B-PROS and a frameskip of 15), 3_15 (IW with B-PROT and a frameskip of
15) and 4_15 (IW(1, f) with the log_score function and a frameskip of 15).
Worse than RAM 1_15 2_15 3_15 4_15 Human DQN RL-Shallow
RAM 21 20 18 31 20 14 16
1_15 31 30 28 45 22 17 14
2_15 32 22 25 39 26 20 21
3_15 33 25 18 43 26 23 24
4_15 21 6 13 9 14 10 8
Human 33 31 25 24 38 18 13
DQN 39 36 31 28 42 32 16
RL-Shallow 37 39 30 26 44 34 33
Table 5: Data gathered from the runs of the IW(1, f) configuration on the Atari
games. The first column is the score obtained by the agent. The second column is
the number of steps the agent was able to execute (bounded by 1200 steps, or
18000 frames). The third column is the average time of the lookaheads executed
by the agent. The next columns show the average number of nodes expanded per
lookahead (states not pruned), the average maximum depth reached per lookahead,
the number of random actions executed (when the agent does not find any score in
the lookahead), the average number of states generated per lookahead, and the
average depth of the nodes per lookahead.
Game Score Steps Av time Av Expanded Av max depth Random Acts Av Generated Av Depth
Alien 39920 1200 46 1441 18 4 7232 12.3
Amidar 3183 1200 23 804 48 16 3851 22.8
Assault 8294 1200 21 1213 27 0 3579 12.6
Asterix 176000 1200 59 2902 16 14 7450 10.3
Asteroids [p1] 1930 180 111 2629 10 1 5659 8.25
Atlantis 148900 1200 6 1607 40 58 991 20.9
Bank Heist 1572 437 8 261 19 62 1415 11.3
Battlezone 562000 1200 45 674 18 44 4577 10.1
Beamrider 4260 626 55 1378 20 182 5476 11.9
Berzerk 2910 268 76 1405 12 39 4908 7.49
Breakout - Breakaway IV 226 233 5 1408 39 2 790 20
Bowling 24 570 2 114 21 421 335 13
Centipede 165290 1200 57 1905 11 0 9778 8.49
Carnival 8140 232 19 1233 47 28 3183 19.2
Chopper Command 7200 365 51 762 10 217 6865 7.33
Crazy Climber 39200 1200 9 440 25 194 1421 13.3
Demon Attack 28180 1200 62 4569 15 0 5498 9.38
Double Dunk -14 643 59 470 11 590 6982 7.88
Elevator Action 0 1200 33 212 15 1200 4547 9.18
Enduro 500 1200 44 1434 16 622 5846 10.7
Fishing Derby 29 468 19 285 25 202 2201 11.8
freeway.bin 31 547 7 1719 62 8 711 32.8
Frostbite 320 185 94 1219 13 67 5321 8.15
Gopher 16340 1200 9 919 27 40 1589 14.6
Gravitar 7200 236 116 3122 14 5 6969 9.74
H.E.R.O. 13665 282 78 1046 13 65 4732 9.15
Ice Hockey 55 1200 49 634 9 441 4376 6.71
James Bond 007 7350 633 50 1369 18 31 6509 10.1
Journey - Escape -6100 197 139 1373 9 188 9579 7.52
Kangaroo 11900 796 29 482 27 16 3988 14.6
Krull 4370 270 205 4936 10 8 6853 7.87
Kung Fu Master 41000 1200 24 590 22 42 3495 11.6
Montezumas Revenge - Starring Panama Joe 500 520 11 98 13 471 1642 7.58
Ms. Pac-Man 41001 1200 62 2619 16 50 5414 10.8
Name this game 13470 1200 30 1621 45 8 4182 16.7
Pooyan 12895 1200 38 1647 34 44 4866 17.2
Private Eye 100 720 24 145 9 710 3036 6.8
Q-bert 137800 1200 17 1219 35 0 2729 13.8
River Raid 12280 406 61 1303 12 28 8228 9.20
Robot Tank 62 1200 16 170 14 303 1819 8.51
Road Runner 43500 196 251 7237 8 29 6396 6.34
Seaquest 38450 1200 58 1311 11 13 9109 8.91
Skiing -15331 314 3 132 33 314 415 18
Space Invaders 3385 642 13 1037 36 33 1863 16.4
Star Gunner 1500 85 29 1270 7 12 5438 5.55
Tennis 19 1200 12 217 13 654 1749 7.96
Time Pilot 43300 822 39 1106 16 28 5004 9.79
Tutankham 235 703 16 503 23 232 2104 12.7
Up n Down 92600 989 85 4387 14 3 8678 11.9
Venture 500 644 14 172 23 592 1457 12
Video Pinball 579824 1200 93 2341 16 17 6738 10.6
Wizard of Wor 122900 1140 40 1149 18 263 4334 11.1
Zaxxon 23400 880 36 713 22 337 5568 11.6
Chapter 4
Conclusions
In order to study the relation between the features extracted from the screen
and the score, experiments were executed on several configurations, varying the
frameskip and the extracted features. At first glance it was hard to find a
relation between these factors, but then a new variation of IW(1), namely
IW(1, f), was proposed; taking the score into consideration showed a great
improvement in the way it explores the search space and, more importantly, it is
able to compete with the state of the art on the Atari games. Of course, this
shows how important the score
is in these games, but a few domains still presented poor results, due to the
fact that the frameskip can be another factor that affects the score, as in the
game Asteroids, where good rotations of the spaceship depend directly on the
frameskip, as shown in one of the runs of the IW(1, f) algorithm.
The next sections present two possible follow-ups to this work.
4.2.1 Caching
The term caching refers to saving the states when each lookahead is executed
with the Iterated Width algorithm, in order to recover each state when
necessary. Planning with pixels should refer to the use of only the screen
pixels to find a plan; however, in this work every lookahead is cached, and as a
consequence the RAM states are saved and used by all the configurations to
restore the screen. An actual scheme of planning with pixels would use only the
observation from the screen pixels to apply the transition function, without any
dependency on the RAM, but in terms of Atari games and planning it is not
possible to use just the screen, for these reasons: a single screen could
correspond to several RAM states (making the environment stochastic), and the
ALE environment does not provide a transition function that uses only the
screen.
Another approach without caching can be implemented: instead of saving the RAM
states, it is possible to save tuples of the form (o, a, o'), where o and o' are
two observations (the screens) and a is the action executed. Once these tuples
are saved, if the agent executes the action a in a state s with observation o,
the resulting state has the observation o', and the tuple is already saved, then
the transition is not discounted from the budget in the IW algorithm. Of course,
this is a workaround for the caching in IW and a way to make the search more
varied.
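The (o, a, o') bookkeeping could be sketched as follows, assuming observations
are given in a hashable form such as the bytes of the screen (the class and
method names are hypothetical, not from the thesis):

```python
# Sketch of the cache-free alternative described above: remember observed
# transitions (o, a, o') and do not charge the simulation budget for a
# transition that has been seen before.
class TransitionCache:
    def __init__(self):
        self.seen = set()

    def charge_budget(self, obs, action, next_obs):
        """Return True if this transition should consume simulation budget."""
        key = (obs, action, next_obs)
        if key in self.seen:
            return False          # already simulated once: free
        self.seen.add(key)
        return True
```

Since several RAM states can share one screen, such a cache trades exactness for
memory savings, which is why the text describes it as a workaround rather than a
replacement for full state caching.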
The main problem with Iterated Width, which it shares with blind search
algorithms, is that if they use a simulator (like ALE), they have to save the
simulator state when a node is expanded, and as a consequence most of the
execution time is spent saving or loading states. In order to solve this
problem, Iterated Width with Rollouts (IW-R) is proposed: at each step of the
IW(i) algorithm, instead of saving the simulator state at each child of the
current state, the paths to the states are saved, and a node is marked as solved
when the subtrees of all its children have been explored and marked as solved. A
node without children, or meeting some condition (like maximum depth), is marked
as solved. When a node is marked as solved, the exploration has to continue on
other branches of the search space, and this is when a rollout has to be done.
Given that the simulator state is not saved, in order to explore a node on
another branch of the search space, the simulator must be reset and the path to
the state must be simulated. In other words, it is like the IW algorithm
presented before, with the difference that it is stateless, since exploring a
new node means that we have to simulate the path to that node instead of loading
the simulator state that we would have had to save in the previous approach.
The atoms of a planning problem P are extended by the depth at which they can be
found by IW-R; in other words, if the planning problem has an atom A = x, it is
extended as <A = x, d>, where d is the depth at which the atom A = x can be
found in the original problem P. In terms of novelty, an atom is new if the
depth at which it is found is smaller than the currently recorded depth of that
atom.
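The depth-extended novelty test can be sketched in a few lines (the function
name is illustrative; `best_depth` plays the role of the extended novelty table,
mapping each atom to the shallowest depth at which it has been found):

```python
import math

# Sketch of the IW-R novelty test over depth-extended atoms <A = x, d>:
# a state at depth `depth` is novel if it improves the recorded depth of
# at least one of its atoms.
def novel_at_depth(state_atoms, depth, best_depth):
    improved = False
    for x in state_atoms:
        if depth < best_depth.get(x, math.inf):
            best_depth[x] = depth   # found at a shallower node
            improved = True
    return improved
```

Each successful call strictly decreases some atom's recorded depth, which is
what bounds the number of useful rollouts in the properties listed below.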
Since each rollout either improves an atom's depth d(x) (the atom has been found
at a shallower node) or labels a child of an existing unsolved node in the tree
as solved, the following properties hold when executing the rollouts:

- The length of the paths is bounded by the total number of atoms |F|.
- The total number of unsolved nodes in the path tree is bounded by the total
  number of extended atoms, so given the first property there are at most |F|^2
  nodes of this kind.
Given a state s the rollout algorithm can be executed several times until the empty
Proof. Suppose there is a state s which is the shallowest node reached by IW(1)
but not by IW(1) with rollouts. Then s has some novel atom x ∈ F, that is, an
atom not present in any other node s' previously explored by IW with
d(s') ≤ d(s), where d(s) denotes the depth of node s.

Moreover, this property holds for every node on the path from the root to s,
and since s is not reached by IW(1) with rollouts, the parent s'' must have been
pruned by the rollouts. But if s'' was pruned by the rollouts, there is a set of
nodes with a smaller depth reaching every atom of s'', contradicting the fact
that s'' was reached by IW(1).
Moreover, to solve each of the |F|^2 nodes, every sibling has to be solved, so a
total of |F|^2 * b nodes are solved, where b is the branching factor.
Furthermore, to solve a node the algorithm has to rebuild the path to it, which
has at most |F| nodes. So the total complexity of the algorithm is |F|^3 * b.
Bibliography
[1] Liang, Y., Machado, M. C., Talvitie, E. & Bowling, M. State of the art
control of Atari games using shallow reinforcement learning. Proceedings of the
15th International Conference on Autonomous Agents and Multiagent Systems
(AAMAS 2016) (2016).
[2] Bellemare, M. G., Naddaf, Y., Veness, J. & Bowling, M. The Arcade Learning
Environment: An evaluation platform for general agents. Journal of Artificial
Intelligence Research 47 (2013).
[3] Ghallab, M., Nau, D. & Traverso, P. Automated Planning: Theory and Practice
(Morgan Kaufmann Publishers, 2004).
[4] Lipovetzky, N., Ramirez, M. & Geffner, H. Classical planning with
simulators: Results on the Atari video games. IJCAI-15 (2015).
[6] Lipovetzky, N. & Geffner, H. Width and serialization of classical planning
problems. 20th European Conference on Artificial Intelligence (ECAI 2012)
(2012).
[7] Geffner, H. & Bonet, B. A Concise Introduction to Models and Methods for
Automated Planning (Morgan & Claypool Publishers, 2013).
[8] Shleyfman, A., Tuisov, A. & Domshlak, C. Blind search for Atari-like online
planning revisited. IJCAI-16 (2016).