Wilmer Bandres
September 2017
Master thesis on Interactive Intelligent Systems
Universitat Pompeu Fabra
Contents
1 Introduction 1
2 Planning in ALE 4
2.1 Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
4 Conclusions 24
4.2.1 Caching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
List of Tables 30
Bibliography 31
Abstract
The Arcade Learning Environment (ALE) supports the Atari 2600 games and gives
access to the game states, the RAM, the screen, and the actions that can be
executed in each state. These games cannot be modeled in PDDL, since the action
effects and rewards are not known until the actions are applied. Classical blind
search algorithms, like Breadth-First Search or Dijkstra's algorithm, can still
be applied to this kind of problem, but the search space is too large and they
produce poor results.
Planning algorithms that perform a structured exploration of the state space can
be used to solve problems where the search space is huge. One of them is the
Iterated Width (IW) algorithm, which performs a blind breadth-first search
(meaning that it does not require prior knowledge of the state transition costs
or goals), pruning states that do not yield a new combination of features. The
IW(1) algorithm has recently been shown to yield state-of-the-art results in the
Atari games when used with the memory (RAM) states as the lookahead states. This
is different from humans and reinforcement learning algorithms, which do not
have access to the RAM state of the ALE emulator and instead use the screen
pixels as states. In this work, we narrow the gap between planning and learning
approaches to the Atari games by using IW(1) and suitable variations for
planning in the Atari games with features obtained from the screen pixels,
following the work of Yitao Liang et al. [1]. We show that the scores obtained
by planning with the screen pixels rival those reported by humans and
reinforcement learning algorithms, and also those reported for planning with
(more informed) RAM states.
Introduction
One of the main objectives of Artificial Intelligence is to reach and surpass
human performance on challenging tasks, like games. The Arcade Learning
Environment [2] (ALE) provides tasks where the target is to get the best rewards
on Atari 2600 games. It offers an interface to a simulator that, given a state
and an action, returns the reward of applying that action and the resulting
state, which makes it well suited for planning purposes. Planning tries to solve
tasks in a general way, modeling each one as a graph-like problem given a
transition function like the one provided by ALE. Nonetheless, classical
planners cannot address Atari games, since there is no compact PDDL-like
encoding of the domain and the goal is not given [3]; as a consequence,
approaches like heuristic search cannot be used either. In this case, the
planner has to select the next action on-line, using the simulator to explore
the search space, and this action selection problem can be addressed as a
reinforcement learning problem over an MDP or as a net-benefit planning problem.
Blind-search-like algorithms' performance, however, tends to be affected when
the search space is huge. Recent search algorithms like IW [6] have been shown
to scale up better, while being applicable in all contexts where blind-search
algorithms can be applied, provided that states are defined over a finite number
of state variables or features.
Lately, the RAM states were used to extract features for the Iterated Width
algorithm to play Atari games, outperforming on-line planners like the ones
using BrFS or UCS [4]. On the other hand, reinforcement learning approaches,
like DQN and Shallow RL [1], have been applied to learn controllers by
processing the screen pixels, finding spatial relations between them and relying
on trial and error. Since ALE provides access to the screen pixels, these ideas
can be combined, and it is possible to plan with pixels, extracting features
from the screen and using IW(1) to compute lookaheads.
The objectives of this work are the following:

- Implement and study the Iterated Width algorithm using different sets of
  features extracted from the screen.
- Study the relation between the features and the score they help to achieve.
- Compare the new approaches against the state of the art on the Atari games and
  against human performance.
Chapter 3 covers the definitions of the features extracted from the screen, a
variation of the Iterated Width algorithm, and another kind of features (the
tiling features). Then, a section with the results obtained so far is given,
comparing the IW algorithm with pixels against human performance and the state
of the art in Reinforcement Learning and planning.
Planning in ALE
The following sections present some concepts on planning, the Arcade Learning
Environment, Iterated Width, and previous work, in order to give insight into
the aim of this work.
2.1 Planning
Planning is the model-based approach to autonomous behavior, where the agent
selects the action to do next using a model of how actions and sensors work, the
current situation, and the goal to be achieved. The problem of intelligent
behavior is the problem of selecting the action to do next, and there are three
different approaches to it [7]: in the programmer-based approach, a solution to
a single problem is given by a controller written by a programmer; in the
learning-based approach, the controller is learned from experience, that is, the
agent performs actions and observes its environment in order to learn; lastly,
in the model-based approach, the controller is derived from a model of the
actions, goals and sensors.
languages, a way to express the models in a compact way; and the algorithm which
generates the behavior.
Problems in planning can be modeled as a slight variation of the basic state
model, defined in terms of [7]:

- a deterministic state transition function f such that s' = f(a, s), where
  f(a, s) is the state resulting from applying the action a in s, and
Under this definition it is assumed that the costs of the actions do not depend
on the state where they are applied, and hence c(a, s) = c(a). A solution in
this model is a sequence of applicable actions that maps the initial state into
a goal state, and it is known as a plan. Formally, a plan π = a0, ..., a_{n-1}
generates a state sequence s0, ..., sn such that ai ∈ A(si) and
si+1 = f(ai, si). The plan is called optimal if it has the minimum cost over all
plans. There is also a similar model for partially observable planning, a
variation of the classical model with uncertainty and feedback, where the state
transition function returns a set of possible next states and where it is
possible to observe partial information about the state.
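This model can be made concrete with a small sketch of plan validation under the
definitions above (all function names are illustrative, not from the thesis): a
plan a0, ..., a_{n-1} is valid if each action ai is applicable in si and
si+1 = f(ai, si), and its cost is the sum of the state-independent costs c(a).

```python
# Minimal sketch of the classical state model (hypothetical interface).
def validate_plan(s0, plan, applicable, f, cost, is_goal):
    """Return the total cost if `plan` maps s0 into a goal state, else None."""
    s, total = s0, 0
    for a in plan:
        if a not in applicable(s):  # ai must be in A(si)
            return None
        total += cost(a)            # c(a, s) = c(a): state-independent
        s = f(a, s)                 # si+1 = f(ai, si)
    return total if is_goal(s) else None

# Toy domain: states are integers, "inc"/"dec" add or subtract 1, goal is 3.
plan_cost = validate_plan(
    0, ["inc", "inc", "inc"],
    applicable=lambda s: {"inc", "dec"},
    f=lambda a, s: s + 1 if a == "inc" else s - 1,
    cost=lambda a: 1,
    is_goal=lambda s: s == 3,
)
```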
A planner is a program that accepts descriptions of problems and outputs the
control. There are two kinds of planners:

- offline planners, which compute the action to execute in every state in
  advance, and
- online planners, which give the action to be executed in the current state or
  situation. These planners are also called closed-loop controllers, since the
  actions depend on the observations in each state.
In terms of automated planning, the novelty of a state s is the minimum k such
that s is the first state in the space searched so far that makes some k-tuple
of atoms true. The procedure IW(i) is a breadth-first search where states are
pruned when their novelty is greater than i.
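The pruning rule for the simplest case, IW(1), can be sketched as follows
(`features` and `successors` are hypothetical callbacks standing in for the
feature extraction and the simulator): a state is kept only if it makes some
atom true for the first time in the search.

```python
from collections import deque

# Minimal IW(1) sketch over boolean features (illustrative interface):
# `features(s)` returns the set of atoms true in s, `successors(s)` the
# child states. A state is pruned unless it has novelty 1, i.e. it makes
# some atom true for the first time in the search.
def iw1(root, features, successors):
    seen_atoms = set(features(root))   # the novelty table for 1-tuples
    queue = deque([root])
    visited = [root]
    while queue:
        s = queue.popleft()
        for t in successors(s):
            new_atoms = features(t) - seen_atoms
            if not new_atoms:          # novelty > 1: prune
                continue
            seen_atoms |= new_atoms
            queue.append(t)
            visited.append(t)
    return visited
```

On a chain of states whose feature saturates after a few steps, the search stops
as soon as no state brings a new atom, which is what bounds IW(1) by the number
of atoms rather than by the size of the state space.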
byte is xj [4]. The results presented by the authors for IW outperform
algorithms like 2BFS, BrFS and UCT on 26 of 54 Atari games, computing the
average score over 10 runs.
A similar approach exploits a feature coming from the state structure: the
score. The authors proposed improvements over the IW(i) algorithm [8]:

- While preserving the breadth-first dynamics of IW(i), ties in the queue are
  broken to favor states with higher accumulated reward.

This approach has an advantage: when a new highest score is achieved, it is as
if the novelty table were reset for every node with a higher score than the last
highest score. Of course, this idea penalizes nodes with a low score when a new
score is found, but it helps to explore interesting branches of the search space
more deeply. However, the main problem with this new approach is that the IW
algorithm ceases to be polynomial in the number of atoms, because as long as new
states with higher scores are found, the algorithm will continue the search.
restarted and the agent could execute as many episodes as it wants to learn a good
controller.
Several approaches based on RL have been presented and compared [1]. DQN employs
a deep convolutional neural network to represent the state-action value function
Q; a deep convolutional network uses the pixel patterns along with multiple
layers to allow the network to detect patterns and relationships at
progressively higher levels of abstraction. On the other hand, the Shallow RL
approach introduces boolean features in order to get information from the
screen, based on the DQN architecture, which captures spatial and temporal
relations between the screen pixels. The authors proposed a set of features
based on the pixels present in certain places on the screen (Basic features),
the spatial relations between them (B-PROS features) and their offsets in time
(B-PROT features). As a consequence, the Shallow approach competes with DQN and
reaches a good performance compared to professional players.
Unlike the planning approach, the RL approach maintains a single state in memory
at a time, since only a single branch of the search space is explored; several
episodes can be run; and finally, the controller can be used in stochastic
situations. For the purposes of this work the environments are deterministic,
but some ideas coming from RL can be used in combination with the planning
approach.
Chapter 3
Planning with Pixels in ALE
This chapter first introduces the features in the IW algorithm and an extension
of them; then, the feature sets presented in previous work are introduced,
followed by the definition of a new variation of the Iterated Width algorithm
using an auxiliary function; next, a new way to extend the basic features is
explained; finally, the experimental results are discussed.
The Iterated Width algorithm is defined in terms of features, which are
variables associated with values. In the case of the Atari games, RAM variables
have been associated with values between 0 and 255. The state can also be
represented in terms of the screen pixels if each pixel has a variable whose
value is in the set of colors. However, this definition can be extended, and a
feature could be any boolean interpretation of the state variables; e.g., it is
possible to define a boolean variable whose value is true if half of the pixels
on the screen are red. The idea behind these boolean features is to represent
the state in other ways, or maybe in a more compact form, in order to find
better ways to explore the search space.
3.2 From Pixels to Features
Given that it is possible to interpret the screen pixels in several ways, the
authors of the Shallow RL approach proposed a way to split the screen pixels in
order to obtain three sets of features [1]:
- Basic features: to obtain these features, the authors divided the Atari 2600
  game screen, whose dimensions are 210x160 pixels, into a 16x14 grid of tiles
  of 10x15 pixels each. Given a tile at row r and column c, the authors defined
  the binary feature (r, c, k), meaning that a pixel with color k is present in
  the tile (r, c). Since there are 128 different colors on the Atari games
  screen, there is a total of 16 x 14 x 128 = 28672 different basic features.
Moreover, these are boolean features that can be used to represent the states
and direct the search in planning algorithms, with a search space bounded by the
number of different features.
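To make the construction concrete, here is a minimal sketch of how the Basic
features could be extracted, assuming the screen is given as a 210x160 NumPy
array of palette indices (the function name and interface are illustrative, not
part of the Shallow RL implementation):

```python
import numpy as np

# Sketch of Basic feature extraction. Tiles are 15 pixel rows by 10 pixel
# columns, giving a 14x16 grid of tiles on a 210x160 screen; the feature
# (r, c, k) is true when color k occurs somewhere in tile (r, c).
def basic_features(screen, tile_h=15, tile_w=10):
    rows, cols = screen.shape[0] // tile_h, screen.shape[1] // tile_w
    feats = set()
    for r in range(rows):
        for c in range(cols):
            tile = screen[r*tile_h:(r+1)*tile_h, c*tile_w:(c+1)*tile_w]
            for k in np.unique(tile):   # colors present in this tile
                feats.add((r, c, int(k)))
    return feats
```

On a blank screen every tile contributes a single feature (its background
color), so the number of active features stays far below the 28672 bound.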
As was mentioned, the basic features are few, but a way to expand them, and even
the B-PROS or B-PROST features, is to have several novelty tables depending on
some function f. This new approach, called IW(1, f), is the same Iterated Width
algorithm presented in the previous section, but it depends on a function f
which, given a state s, returns the number of the novelty table the algorithm
has to check in order to determine the novelty of s.
In terms of classical planning, f could be the function that returns the number
of top goals achieved up to s, but in the case of the Atari games there are no
goals: the aim is to maximize the score. Therefore, given that the score on the
Atari games varies between 0 and 500000, a possible function f, for the purposes
of this work, is the log2 function applied to the score of a state. This
function just takes the current score of the state s, and the novelty table that
IW should check is table number ceil(log2(score(s))). Hence, there are 19
possible novelty tables, giving a total of 19 x 28672 = 544768 basic + score
features. Of course, there are other ways to take the score into consideration
as part of f; e.g., dividing the range (0, 500000) into x equal parts makes
sense in some games where the score does not vary much.
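The table selection and novelty check described above can be sketched as follows
(function names are hypothetical; the novelty tables are plain sets of atoms,
one per value of f):

```python
import math

# Sketch of the f function used by IW(1, f) in this chapter: the novelty
# table to consult for state s is number ceil(log2(score(s))), with table 0
# for scores of 0 or 1, giving up to 19 tables for scores up to 500000.
def table_index(score):
    return 0 if score <= 1 else math.ceil(math.log2(score))

def is_novel(state_atoms, score, tables):
    """Check and update novelty of a state against the table chosen by f."""
    table = tables.setdefault(table_index(score), set())
    new_atoms = state_atoms - table
    table |= new_atoms
    return bool(new_atoms)
```

Because each score band has its own table, a state whose atoms were all seen at
a lower score can still be novel once a higher score is reached, which is
exactly the "reset" effect discussed earlier.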
One of the previous sections explained a way to split the screen: the screen is
divided into tiles of 15 x 10 pixels, and the basic features are represented by
tuples of the form (row, column, color), where the row is a number between 1 and
14, the column is a value between 1 and 16, and the color is an even value
between 0 and 254. These tuples create a total of 28672 features, which seems a
small number, so one possibility is to split the screen into more tiles, say
tiles of size 5 x 5, giving a total of 172032 features. Furthermore, the more we
split the screen the more features we have, but the meaning of these features
depends on the frameskip used in the Atari environment. In other words, Figure 1
shows two screens of the Space Invaders game after 20 frames where the invader
does not go
out of the patch; as a consequence, the basic features for the patch next to it
do not change, and the second state can be pruned if no other feature changes
either. Given this example, there are two ways to keep the second state with
novelty 1:
- by changing the patch size to a smaller value, so that the invader passes
  through another patch, causing the state to have novelty 1, or
- by augmenting the frameskip, so that the invader gets out of the patch.
Of course, the choice of frameskip depends on the way we want to compare the
algorithms. For example, if we want to compare the IW(1) algorithm with human
performance, and we know that humans can do an action every 200 ms, then the
frameskip should be set to 12, since a second is 60 frames in the Atari
environment. Also, the patch size is important because it affects the complexity
of computing the novelty of each state.

In addition to the patch size, it is possible to shift the tiling either from
left to right or from top to bottom. Creating more novelty tables, which differ
in the way the patches are placed, can solve the problem just mentioned, and as
a consequence moving objects can be detected.
The methodology used in this dissertation is experimental, since the results are
obtained from runs of the online-planning agents, and the setup of each
experiment is the same as the one proposed by Lipovetzky, Ramirez and Geffner in
[4]. In other words, each experiment meets the following constraints:

- The maximum depth per lookahead is 1500, meaning that the deepest node that
  can be reached from the root node by simulating is at most 1500 frames away.
- If a node has a subtree explored before with more simulated frames than the
  budget size, then the budget is not going to be used in that subtree.
- When a node is generated, it is not pruned if, after applying the same action
  that generated the node over a maximum of 30 frames, the resulting state has
  features different from the original node.
There are many possible configurations, varying the frameskip, the patch size,
the tiling shift and the features used. For the purposes of these results, the
basic features are referred to as features of type 1, the Basic + B-PROS
features as type 2, Basic + B-PROS + B-PROT as type 3, and finally configuration
4 refers to the features of IW(1, f) where f is the log_score function. The
results for the RAM configuration are not the ones presented in [4], but are
similar, since we ran a slight variation.
The first parameter to be tuned is the frameskip: when an action is executed, it
is repeated n times, where n is the frameskip. Table 1 presents the scores for
configurations of the form rgb_<configuration>_<frameskip> and for the RAM
configuration (with a frameskip of 5).
On the other hand, it is possible to use the tiling features mentioned in the
previous section. These features address the problem of moving objects by
shifting the patches from top to bottom or from left to right. Table 2 shows the
scores of several configurations
<frameskip>_<patch_rows_size>_<patch_columns_size>_<differential>, where
patch_rows_size refers to the number of pixel rows each patch has, and likewise
patch_columns_size. Lastly, the differential refers to how many pixel rows are
skipped between shifted tilings; in other words, having a patch_rows_size of 15
and a differential of 3 means that there are five tilings (15/3): the first
tiling is a normal tiling with patches of the specified size, the second one is
also a tiling but with the patches moved three rows of pixels downwards, and so
on. The main problem with these features is that the time to compute them is the
time to compute the basic features multiplied by a factor that depends on the
differential. However, the results show that these features do not improve the
scores much, except maybe for games like Montezuma's Revenge, where almost every
configuration gets a score different from zero. Furthermore, they do not seem
worthwhile, since there is still no tendency of improvement and they are costly
to compute.
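The shifted tilings just described can be sketched as a small extension of the
basic feature extraction (the function name is illustrative; the screen is
assumed to be a 210x160 NumPy array of palette indices, as before):

```python
import numpy as np

# Sketch of the shifted-tiling features: with a patch height of 15 pixel
# rows and a differential of 3 there are 15 // 3 = 5 tilings, each shifted
# 3 more pixel rows downwards; a feature is (tiling, row, col, color).
# Computing them costs roughly (tile_h // differential) times the cost of
# the basic features, matching the overhead discussed in the text.
def tiling_features(screen, tile_h=15, tile_w=10, differential=3):
    feats = set()
    for t in range(tile_h // differential):
        shifted = screen[t * differential:]   # drop the first t*diff rows
        rows, cols = shifted.shape[0] // tile_h, shifted.shape[1] // tile_w
        for r in range(rows):
            for c in range(cols):
                tile = shifted[r*tile_h:(r+1)*tile_h, c*tile_w:(c+1)*tile_w]
                for k in np.unique(tile):
                    feats.add((t, r, c, int(k)))
    return feats
```

An object that moves only a few pixels between states can then change a feature
in one of the shifted tilings even when the unshifted tiling sees no change.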
In order to compare the screen features with human performance, a frameskip
equivalent to human reflexes has to be used. Professional gamers perform actions
in a range between 200 and 250 ms (every 1/5 of a second), and given that 60
frames is one second in Atari games, the frameskip should be set to a number
between 12 and 15. In order to fix this parameter for the next experiments, a
frameskip of 15 is proposed. Since the tiling features do not present a tendency
of improvement, they are omitted. Table 3 shows the score results of four
configurations with basic (1_15), B-PROS (2_15), B-PROT (3_15) and log_score
(4_15) features, and also the human performance (results from [1]).
Finally, Table 4 sums up the scores and shows how many times each configuration
is worse than the ones presented in the columns. It can be noticed that the best
configuration is 4_15, outperforming even the human level of performance, maybe
because it is the only configuration that takes the score into account the same
way a human does. Even though its number of features is smaller than in
configurations 2_15 and 3_15, it seems that the score plays an important role in
exploring the search space of many games. In the case of 4_15, every time a
better score is found the algorithm explores the search space more deeply, since
it follows the ideas presented by Shleyfman et al. [8], with the advantage that
it keeps the running time and space polynomial in the number of basic features
and the number of different values for f in IW(1, f), instead of exponential.
is different from the number generated, since an expanded node means that it was
not pruned.
While it is difficult to draw firm conclusions, it seems that basic features
with a log_score function outperform the other configurations, and even the
state of the art in reinforcement learning.
One advantage of the screen features is that, unlike RAM features, they can be
displayed and give hints about what is happening in the algorithm. In
particular, the configuration 4_15 performs poorly in games like Asteroids,
Chopper Command, Crazy Climber, Fishing Derby, Freeway and Kung Fu Master.
Asteroids is an interesting domain, since scores can be reached faster than in
other games. Moreover, running the solutions found by 4_15, two doubts arise:

- why does the spaceship sometimes prefer to hit asteroids that are farther away
  than others? and,
A possible cause of the first observation is the discount factor: a lower
discount factor prioritizes shallower states or scores, and as a consequence it
would cause the agent to hit the nearest asteroid. However, running the same
features (4_15) with a discount factor of 0.8 gives a score similar to the one
given by a discount factor of 0.995 with the same set of features, which
discards the discount factor as a possible cause of the low scores in Asteroids.
On the other hand, the second question gives another intuition about what is
really happening: the frameskip seems to be too high when the agent has to
rotate the spaceship. Figure 2b shows the spaceship after just one action with a
frameskip of 15, which rotates the spaceship almost 180 degrees; it means that
the spaceship cannot rotate just 90 degrees to hit a closer asteroid. As a
consequence, even if it does not happen in many of the Atari games, the
frameskip can affect the way an agent finds a higher score.
Table 1: Scores of the agents on the Atari games using IW in configurations with
a frameskip of 5 or 30 and features of kind Basic (number 1), B-PROS (number 2),
B-PROT (number 3), and using the RAM states. Each configuration is represented
as rgb_<feature type>_<frameskip>, or RAM; the agent uses dynamic frameskipping
up to 30 frames and a budget of 150000 frames per lookahead.
Game RAM rgb_1_5 rgb_1_30 rgb_2_5 rgb_2_30 rgb_3_5 rgb_3_30
Alien 7560 940 20360 1180 6830 1260 4480
Amidar 1879 17 2378 11 2445 6 1921
Assault 1143 787 1501 954 6247 525 3314
Asterix 47300 14900 48100 3700 6700 2150 550
Asteroids [p1] 13320 1730 1300 4370 2950 2300 1490
Atlantis 104300 70700 31100 40200 32200 7400 23500
Bank Heist 440 20 3327 40 4546 40 1599
Battlezone 1000 23000 110000 13000 85000 8000 51000
Beamrider 11460 2160 2160 660 6090 264 2216
Berzerk 1860 350 9810 350 2520 150 2490
Bowling 53 21 26 26 24 14 26
Breakout - Breakaway IV 415 2 14 0 56 0 28
Carnival 5500 740 3140 2440 4460 4640 7480
Centipede 46881 101169 26947 17392 6374 5033 10173
Chopper Command 6900 19200 1300 400 6500 400 1200
Crazy Climber 34100 9100 10600 15300 15700 12000 14300
Demon Attack 19385 21015 21640 1320 4410 140 8330
Double Dunk 0 -20 -18 4 -13 0 -12
Elevator Action 8000 0 0 1100 11200 0 5900
Enduro 200 0 399 0 104 28 85
Fishing Derby 30 -83 -13 -77 -41 -46 -25
freeway.bin 34 0 7 2 12 2 14
Frostbite 3950 120 4510 250 3750 200 3140
Gopher 16860 2580 9260 8100 11820 2640 11380
Gravitar 5300 250 3000 2850 4750 2100 4300
H.E.R.O. 13535 6850 7440 13170 7640 7450 13210
Ice Hockey 2 -7 39 -4 2 -1 2
James Bond 007 1000 100 800 100 200 0 100
Kangaroo 0 0 2100 200 5300 200 4400
Krull 5150 6200 8710 960 3720 150 2730
Kung Fu Master 28900 13400 2900 11300 2000 2600 5900
Montezumas Revenge - Starring Panama Joe 0 0 0 0 2500 0 0
Ms. Pac-Man 21281 300 26651 4360 20911 3570 5350
Name This game 8980 9950 9270 4570 7510 1200 5910
Pooyan 10235 1665 4695 5680 10425 1580 5840
Private Eye 100 100 100 100 120 100 600
Q-bert 1425 1325 20350 12975 5975 4575 28575
River Raid 6990 7070 16990 3390 6060 1580 4770
Road Runner 38800 23900 22200 12300 41200 600 16800
Robot Tank 13 1 39 7 57 4 29
Seaquest 4000 260 10120 520 1300 260 320
Skiing -524 -15366 -20998 -17403 -17462 -16343 -22840
Star Gunner 1700 8000 1800 1500 1500 1500 1400
Tennis 1 -23 20 -19 20 -15 20
Time Pilot 30700 28000 4100 23200 15800 8600 7800
Tutankham 251 135 74 221 80 62 79
Up n Down 23220 4920 44170 17690 11020 4560 2370
Venture 0 0 0 0 300 0 500
Video Olympics 20 -21 -16 -17 -11 -21 -12
Video Pinball 62407 24977 349122 10728 303510 87766 346715
Wizard of Wor 40300 2400 1500 800 1800 1400 19300
Zaxxon 14500 0 5500 4700 13000 2100 4100
Table 4: Comparison between the best configurations found so far. It shows how
many times the configuration in the row performed worse than the configuration
in the column, over a total of 53 Atari games. The configurations correspond to
human performance, the state of the art in planning (RAM), Reinforcement
Learning (DQN and RL-Shallow, i.e. Blob-PROST), and the best configurations
found in this work: 1_15 (IW with basic features and a frameskip of 15), 2_15
(IW with B-PROS and a frameskip of 15), 3_15 (IW with B-PROT and a frameskip of
15) and 4_15 (IW(1, f) with the log_score function and a frameskip of 15).
Worse than RAM 1_15 2_15 3_15 4_15 Human DQN RL-Shallow
RAM 21 20 18 31 20 14 16
1_15 31 30 28 45 22 17 14
2_15 32 22 25 39 26 20 21
3_15 33 25 18 43 26 23 24
4_15 21 6 13 9 14 10 8
Human 33 31 25 24 38 18 13
DQN 39 36 31 28 42 32 16
RL-Shallow 37 39 30 26 44 34 33
Table 5: Data gathered from the runs of the IW(1, f) configuration on the Atari
games. The first column is the score obtained by the agent. The second column is
the number of steps the agent was able to execute (bounded by 1200 steps, or
18000 frames). The third column is the average time of the lookaheads executed
by the agent. The next columns show the average number of nodes expanded per
lookahead (states not pruned), the average maximum depth reached per lookahead,
the number of random actions executed (when the agent does not find any score in
the lookahead), the average number of states generated per lookahead, and the
average depth of the nodes per lookahead.
Game Score Steps Av time Av Expanded Av max depth Random Acts Av Generated Av Depth
Alien 39920 1200 46 1441 18 4 7232 12.3
Amidar 3183 1200 23 804 48 16 3851 22.8
Assault 8294 1200 21 1213 27 0 3579 12.6
Asterix 176000 1200 59 2902 16 14 7450 10.3
Asteroids [p1] 1930 180 111 2629 10 1 5659 8.25
Atlantis 148900 1200 6 1607 40 58 991 20.9
Bank Heist 1572 437 8 261 19 62 1415 11.3
Battlezone 562000 1200 45 674 18 44 4577 10.1
Beamrider 4260 626 55 1378 20 182 5476 11.9
Berzerk 2910 268 76 1405 12 39 4908 7.49
Breakout - Breakaway IV 226 233 5 1408 39 2 790 20
Bowling 24 570 2 114 21 421 335 13
Centipede 165290 1200 57 1905 11 0 9778 8.49
Carnival 8140 232 19 1233 47 28 3183 19.2
Chopper Command 7200 365 51 762 10 217 6865 7.33
Crazy Climber 39200 1200 9 440 25 194 1421 13.3
Demon Attack 28180 1200 62 4569 15 0 5498 9.38
Double Dunk -14 643 59 470 11 590 6982 7.88
Elevator Action 0 1200 33 212 15 1200 4547 9.18
Enduro 500 1200 44 1434 16 622 5846 10.7
Fishing Derby 29 468 19 285 25 202 2201 11.8
freeway.bin 31 547 7 1719 62 8 711 32.8
Frostbite 320 185 94 1219 13 67 5321 8.15
Gopher 16340 1200 9 919 27 40 1589 14.6
Gravitar 7200 236 116 3122 14 5 6969 9.74
H.E.R.O. 13665 282 78 1046 13 65 4732 9.15
Ice Hockey 55 1200 49 634 9 441 4376 6.71
James Bond 007 7350 633 50 1369 18 31 6509 10.1
Journey - Escape -6100 197 139 1373 9 188 9579 7.52
Kangaroo 11900 796 29 482 27 16 3988 14.6
Krull 4370 270 205 4936 10 8 6853 7.87
Kung Fu Master 41000 1200 24 590 22 42 3495 11.6
Montezumas Revenge - Starring Panama Joe 500 520 11 98 13 471 1642 7.58
Ms. Pac-Man 41001 1200 62 2619 16 50 5414 10.8
Name this game 13470 1200 30 1621 45 8 4182 16.7
Pooyan 12895 1200 38 1647 34 44 4866 17.2
Private Eye 100 720 24 145 9 710 3036 6.8
Q-bert 137800 1200 17 1219 35 0 2729 13.8
River Raid 12280 406 61 1303 12 28 8228 9.20
Robot Tank 62 1200 16 170 14 303 1819 8.51
Road Runner 43500 196 251 7237 8 29 6396 6.34
Seaquest 38450 1200 58 1311 11 13 9109 8.91
Skiing -15331 314 3 132 33 314 415 18
Space Invaders 3385 642 13 1037 36 33 1863 16.4
Star Gunner 1500 85 29 1270 7 12 5438 5.55
Tennis 19 1200 12 217 13 654 1749 7.96
Time Pilot 43300 822 39 1106 16 28 5004 9.79
Tutankham 235 703 16 503 23 232 2104 12.7
Up n Down 92600 989 85 4387 14 3 8678 11.9
Venture 500 644 14 172 23 592 1457 12
Video Pinball 579824 1200 93 2341 16 17 6738 10.6
Wizard of Wor 122900 1140 40 1149 18 263 4334 11.1
Zaxxon 23400 880 36 713 22 337 5568 11.6
Chapter 4
Conclusions
In order to study the relation between the features extracted from the screen
and the score, experiments were executed on several configurations, varying the
frameskip and the extracted features. At first glance it was hard to find a
relation between these factors, but then a new variation of IW(1), namely
IW(1, f), was proposed; taking the score into consideration showed a great
improvement in the way it explores the search space and, more importantly, it is
able to compete with the state of the art on the Atari games. Of course, this
shows how important the score
is in these games, but a few domains still presented poor results, due to the
fact that the frameskip can be another factor that affects the score, as in the
game Asteroids, where good rotations of the spaceship depend directly on the
frameskip, as shown in one of the runs of the IW(1, f) algorithm.
The next sections present two possible follow-ups to this work.
4.2.1 Caching
The term caching refers to saving the states when each lookahead is executed
with the Iterated Width algorithm, in order to recover each state when
necessary. Planning with pixels should refer to the use of only the screen
pixels to find a plan; however, in this work every lookahead is cached, and as a
consequence the RAM states are saved and used by all the configurations to
restore the screen. An actual scheme of planning with pixels would use only the
observation from the screen pixels to apply the transition function, without any
dependency on the RAM, but in terms of Atari games and planning it is not
possible to use just the screen, for these reasons: a single screen could
correspond to several RAM states (making the environment stochastic), and the
ALE environment does not provide a transition function that uses only the
screen.
Another approach without caching can be implemented: instead of saving the RAM
states, it is possible to save tuples of the form (o, a, o'), where o and o' are
two observations (the screens) and a is the action executed. Once these tuples
are saved, if the agent executes the action a in a state s with observation o,
the resulting state has the observation o', and the tuple is already saved, then
the transition is not discounted from the budget in the IW algorithm. Of course,
this is a workaround for the caching in IW and a way to make the search more
varied.
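The (o, a, o') bookkeeping could be sketched as follows, assuming observations
are given in a hashable form such as the bytes of the screen (the class and
method names are hypothetical, not from the thesis):

```python
# Sketch of the cache-free alternative described above: remember observed
# transitions (o, a, o') and do not charge the simulation budget for a
# transition that has been seen before.
class TransitionCache:
    def __init__(self):
        self.seen = set()

    def charge_budget(self, obs, action, next_obs):
        """Return True if this transition should consume simulation budget."""
        key = (obs, action, next_obs)
        if key in self.seen:
            return False          # already simulated once: free
        self.seen.add(key)
        return True
```

Since several RAM states can share one screen, such a cache trades exactness for
memory savings, which is why the text describes it as a workaround rather than a
replacement for full state caching.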
The main problem with Iterated Width, which it shares with blind search
algorithms, is that if they use a simulator (like ALE), they have to save the
simulator state when a node is expanded, and as a consequence most of the
execution time is spent saving or loading states. In order to solve this
problem, Iterated Width with Rollouts (IW-R) is proposed: at each step of the
IW(i) algorithm, instead of saving the simulator state at each child of the
current state, the paths to the states are saved, and a node is marked as solved
when the subtrees of all its children have been explored and marked as solved. A
node without children, or meeting some condition (like maximum depth), is marked
as solved. When a node is marked as solved, the exploration has to continue on
other branches of the search space, and this is when a rollout has to be done.
Given that the simulator state is not saved, in order to explore a node on
another branch of the search space, the simulator must be reset and the path to
the state must be simulated. In other words, it is like the IW algorithm
presented before, with the difference that it is stateless, since exploring a
new node means that we have to simulate the path to that node instead of loading
the simulator state that we would have had to save in the previous approach.
The atoms of a planning problem P are extended by the depth at which they can be
found by IW-R; in other words, if the planning problem has an atom A = x, it is
extended as <A = x, d>, where d is the depth at which the atom A = x can be
found in the original problem P. In terms of novelty, an atom is new if the
depth at which it is found is smaller than the currently recorded depth of that
atom.
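The depth-extended novelty test can be sketched in a few lines (the function
name is illustrative; `best_depth` plays the role of the extended novelty table,
mapping each atom to the shallowest depth at which it has been found):

```python
import math

# Sketch of the IW-R novelty test over depth-extended atoms <A = x, d>:
# a state at depth `depth` is novel if it improves the recorded depth of
# at least one of its atoms.
def novel_at_depth(state_atoms, depth, best_depth):
    improved = False
    for x in state_atoms:
        if depth < best_depth.get(x, math.inf):
            best_depth[x] = depth   # found at a shallower node
            improved = True
    return improved
```

Each successful call strictly decreases some atom's recorded depth, which is
what bounds the number of useful rollouts in the properties listed below.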
Since each rollout either improves an atom's depth d(x) (the atom has been found
at a shallower node) or labels a child of an existing unsolved node in the tree
as solved, the following properties hold when executing the rollouts:

- The length of the paths is bounded by the total number of atoms |F|.
- The total number of unsolved nodes in the path tree is bounded by the total
  number of extended atoms, so given the first property there are at most |F|^2
  nodes of this kind.
Given a state s the rollout algorithm can be executed several times until the empty
Proof. Suppose there is a state s which is the shallowest node reached by IW(1)
but not by IW(1) with rollouts. Then s has some novel atom x ∈ F, that is, an
atom not present in any other node s' previously explored by IW with
d(s') ≤ d(s), where d(s) denotes the depth of node s.

Moreover, this property holds for every node on the path from the root to s,
and since s is not reached by IW(1) with rollouts, the parent s'' must have been
pruned by the rollouts. But if s'' was pruned by the rollouts, there is a set of
nodes with a smaller depth reaching every atom of s'', contradicting the fact
that s'' was reached by IW(1).
Moreover, to solve each of the |F|^2 nodes, every sibling has to be solved, so a
total of |F|^2 * b nodes are solved, where b is the branching factor.
Furthermore, to solve a node the algorithm has to rebuild the path to it, which
has at most |F| nodes. So the total complexity of the algorithm is |F|^3 * b.
Bibliography
[1] Liang, Y., Machado, M. C., Talvitie, E. & Bowling, M. State of the art
control of Atari games using shallow reinforcement learning. Proceedings of the
15th International Conference on Autonomous Agents and Multiagent Systems
(AAMAS 2016) (2016).
[2] Bellemare, M. G., Naddaf, Y., Veness, J. & Bowling, M. The Arcade Learning
Environment: An evaluation platform for general agents. Journal of Artificial
Intelligence Research 47 (2013).
[3] Ghallab, M., Nau, D. & Traverso, P. Automated Planning: Theory and Practice
(Morgan Kaufmann Publishers, 2004).
[4] Lipovetzky, N., Ramirez, M. & Geffner, H. Classical planning with
simulators: Results on the Atari video games. IJCAI-15 (2015).
[6] Lipovetzky, N. & Geffner, H. Width and serialization of classical planning
problems. 20th European Conference on Artificial Intelligence (ECAI 2012)
(2012).
[7] Geffner, H. & Bonet, B. A Concise Introduction to Models and Methods for
Automated Planning (Morgan & Claypool Publishers, 2013).
[8] Shleyfman, A., Tuisov, A. & Domshlak, C. Blind search for Atari-like online
planning revisited. IJCAI-16 (2016).