
DEMONSTRATING THE OPENAI GYM AND DEEP REINFORCEMENT LEARNING WHEN APPLIED TO ATARI 2600 GAMES

DAVID MEYER
CSCI 4830: MACHINE LEARNING
Introduction
In 2015, Elon Musk and Sam Altman founded a new artificial intelligence and machine learning initiative known as OpenAI. The aim of the initiative is to promote free collaboration between institutions by making its patents and research open to the public.[1] Over $1 billion has been pledged to the company.[2] The key founders, Elon Musk and Sam Altman, have stated that they formed OpenAI due to concerns about existential risk from artificial general intelligence.[3]

In April of 2016, OpenAI released its first product free to the public. This product, known as OpenAI Gym, functions as a platform for reinforcement learning research.[4] It is easy to set up and provides a large selection of test environments to which different reinforcement learning methods can be applied.

In this paper, OpenAI Gym and its application to Atari games will be demonstrated.[5] Originally, a demo of reinforcement learning applied to the popular PC game Doom was planned; however, toolkits for plugging the PyDoom library into deep reinforcement learning systems such as TensorFlow were sparsely available.
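To make the Gym protocol mentioned above concrete, the sketch below shows the basic interaction loop using the classic Gym API and the Breakout-v0 environment that appears later in this paper. The random policy is only a placeholder for illustration; it is not the agent trained in the experiments.

import gym

# Create one of Gym's Atari test environments (Breakout-v0 is also used in
# the experiments later in this paper).
env = gym.make("Breakout-v0")

observation = env.reset()      # initial RGB frame of the game screen
total_reward = 0.0

for step in range(1000):
    env.render()                            # draw the game window
    action = env.action_space.sample()      # placeholder: random policy
    observation, reward, done, info = env.step(action)
    total_reward += reward
    if done:                                # episode over (all lives lost)
        observation = env.reset()

print("Total reward collected:", total_reward)
env.close()

Every Gym environment exposes the same reset/step/render interface, which is what makes it straightforward to swap one game for another.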

To demonstrate the application of deep reinforcement learning to Atari games, the following toolkits are used:
TensorFlow
OpenAI Gym
Gym Atari Learning Environment
Skimage Python image processing library
Keras Python neural networking library

Keras is primarily used to define the deep Q-network, while TensorFlow is used for optimization and execution of the environments.[5]
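As a rough illustration of how Keras defines such a deep Q-network, the sketch below builds a small convolutional network of the kind commonly used for Atari agents. The layer sizes are common published defaults and are assumptions for illustration; they are not taken from the async_dqn.py script used in the experiments.

from keras.models import Sequential
from keras.layers import Conv2D, Flatten, Dense

def build_q_network(num_actions, history_length=4, frame_size=84):
    """Sketch of a DeepMind-style deep Q-network defined with Keras.

    Input: a stack of the last `history_length` preprocessed game frames.
    Output: one estimated Q value per available action.
    """
    model = Sequential([
        Conv2D(16, (8, 8), strides=4, activation="relu",
               input_shape=(frame_size, frame_size, history_length)),
        Conv2D(32, (4, 4), strides=2, activation="relu"),
        Flatten(),
        Dense(256, activation="relu"),
        Dense(num_actions, activation="linear"),  # linear Q-value outputs
    ])
    return model

The Keras model only supplies the Q-function graph; the loss, gradient updates, and session execution are handled by TensorFlow, as described above.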


Related Work
The Montreal Institute for Learning Algorithms (MILA) proposed a paper defining further asynchronous methods for deep reinforcement learning.[6] Using GPUs rather than CPUs makes a substantial difference in recorded results, which MILA explains with regard to Atari games.

Experiments
Breakout
The first experiment involves asynchronously training a network to play the game Breakout. Eight active-learner threads are used to aid with training.

Figure 1 (Window of Breakout)

$: python async_dqn.py --experiment breakout --game "Breakout-v0" --num_concurrent 8
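The raw frames emitted by these Atari environments are 210x160 RGB images, which are normally shrunk and converted to grayscale before being fed to the Q-network; this is presumably the role of the Skimage library listed above. The sketch below shows one common way to do it, with the 84x84 target size assumed as the conventional choice rather than read from the experiment code.

import numpy as np
from skimage.color import rgb2gray
from skimage.transform import resize

def preprocess_frame(rgb_frame, size=84):
    """Turn one 210x160x3 Atari RGB frame into a small grayscale image."""
    gray = rgb2gray(rgb_frame)            # (210, 160) array of floats in [0, 1]
    small = resize(gray, (size, size))    # shrink to size x size
    return small.astype(np.float32)

Several consecutive preprocessed frames are usually stacked so that the network can observe motion between frames.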
Pacman
The second experiment uses four active-learner threads to aid with training on the classic game Pacman.

Figure 2 (Window of Pacman)

$: python async_dqn.py --experiment pacman --game "Pacman-v0" --num_concurrent 4
Asteroids
The third experiment uses four active-learner threads to aid with training on the game Asteroids.

Figure 3 (Window of Asteroids)

$: python async_dqn.py --experiment asteroids --game "Asteroids-v0" --num_concurrent 4

All of the experiments above have also been repeated with different numbers of concurrent threads, ranging from two up to thirty-two. See Results and Analysis.

Results and Analysis
Three primary values are recorded and graphed: Epsilon, Episode Reward, and Q Max. All three values are displayed with different amounts of smoothing. TensorBoard is used to render the graphs from the training, evaluation, and checkpoint files.

Figure 4 (Sidebar of TensorBoard)

The results for each game (Breakout, Pacman, and Asteroids) vary greatly. Each game is trained for roughly three minutes; three simulation minutes equates to roughly two hours of real playtime. A smoothing factor of 0.6 is applied to the results. See the results below.
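The smoothing referred to here is the kind of exponential moving average TensorBoard applies with its smoothing slider. A minimal sketch of that calculation, using the 0.6 factor applied to these results (the example reward series is made up purely for illustration):

def smooth(values, weight=0.6):
    """Exponential moving average of the kind used for TensorBoard smoothing.

    weight=0.0 returns the raw series; weights near 1.0 smooth heavily."""
    smoothed = []
    last = values[0]
    for value in values:
        last = weight * last + (1.0 - weight) * value
        smoothed.append(last)
    return smoothed

# Example: smoothing a short, noisy episode-reward series before plotting it.
episode_rewards = [0, 1, 0, 3, 1, 4, 2, 4]
print(smooth(episode_rewards, weight=0.6))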

Breakout
The game Breakout poses some interesting results. Unexpectedly, it has the weakest results of the three games tested.

Figure 5 (Episode Reward for Breakout)

A maximum Episode Reward of 4.0 is achieved roughly 4,500 steps into the simulation. Even with smoothing applied, there is no clear improvement over time.
Figure 6 (Epsilon of Breakout)

A maximum Epsilon of 0.9993 is achieved roughly 15 seconds into the simulation. The value stays above 0.99 thereafter.

Figure 7 (Max Q Value of Breakout)

A maximum Q Value of 0.04537 is achieved roughly 50,000 steps into the simulation. The value appears to increase linearly and may continue to rise if the simulation were to run much longer.
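For context on the Epsilon curves in Figures 6, 9, and 12: with epsilon-greedy exploration the agent acts randomly with probability epsilon, and epsilon is typically annealed from 1.0 toward a small final value over millions of steps. The sketch below uses common default values that are assumptions for illustration, not numbers taken from async_dqn.py; under such a slow schedule epsilon stays close to 1.0 for the length of these short runs.

import random

INITIAL_EPSILON = 1.0          # act fully at random at the start of training
FINAL_EPSILON = 0.1            # assumed common default, not from the paper
ANNEAL_OVER_STEPS = 4_000_000  # assumed common default, not from the paper

def epsilon_at(step):
    """Linearly anneal epsilon from its initial toward its final value."""
    fraction = min(step / ANNEAL_OVER_STEPS, 1.0)
    return INITIAL_EPSILON + fraction * (FINAL_EPSILON - INITIAL_EPSILON)

def choose_action(q_values, step):
    """Epsilon-greedy: explore with probability epsilon, otherwise act greedily."""
    if random.random() < epsilon_at(step):
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

print(round(epsilon_at(10_000), 4))  # 0.9978 -- still above 0.99 early in a run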


Pacman
The game Pacman seems to benefit the most from deep reinforcement learning, and it has the strongest results of the three games tested.

Figure 8 (Episode Reward of Pacman)

A maximum Episode Reward of 410 is achieved roughly 50,000 steps into the simulation. About 8,000 steps before this high point there is a local minimum, after which the system appears to have an epiphany and the Episode Reward begins to increase exponentially. If the simulation were to run longer than 50,000 steps, a substantially higher score might be achieved.

Figure 9 (Epsilon of Pacman)

A maximum Epsilon of 0.9997 is achieved roughly 10 seconds into the simulation. The value stays above 0.99 thereafter.

Figure 10 (Max Q Value of Pacman)

A maximum Q Value of 0.3066 is achieved roughly 50,000 steps into the simulation. The value appears to increase exponentially with an embedded s-curve, and it would likely continue to rise if the simulation continued.

Asteroids
The game Asteroids behaves similarly to Breakout: even when the simulation runs for a long time, there is no noticeable benefit from deep reinforcement learning.

Figure 11 (Episode Reward of Asteroids)

A maximum Episode Reward of 2200 is achieved roughly 10,000 steps into the simulation. The Episode Reward hovers around 1000 thereafter.

Figure 12 (Epsilon of Asteroids)

A maximum Epsilon of 0.9995 is achieved roughly 5,000 steps into the simulation. The value stays above 0.99 thereafter.

Figure 13 (Max Q Value of Asteroids)

A maximum Q Value of 0.1702 is achieved roughly 50,000 steps into the simulation. This is the most linear curve recorded of the three games tested, and the value would likely continue to rise if the simulation continued.
Future Work
Since the OpenAI Gym is expandable, a plethora of other Atari games could be tested with deep reinforcement learning. Longer tests on more powerful systems could also be performed. With more time, the recorded results for the weakest performers (Breakout and Asteroids) might look completely different; it is plausible that longer simulations (100,000 steps or more) would demonstrate better results.

Conclusion
After using OpenAI Gym paired with Atari games, one can conclude that the system is generally useful for machine learning researchers, because the toolkit provides a relatively simple and universal protocol for simulating deep reinforcement learning environments. Although the Doom environment is not quite ready, the Atari environment proves quite adaptable when TensorFlow and Keras are applied.

As for the tested games (Breakout, Pacman, and Asteroids), it appears some games perform better than others. The best performer, Pacman, benefits from deep reinforcement learning because it has a recognizable pattern that the system can easily pick up on. The remaining games, Breakout and Asteroids, have randomly generated elements that make learning quite difficult.

In conclusion, OpenAI Gym is a very useful toolset that continues to gain traction within the machine learning research community. As of December 2016, OpenAI has introduced a new product known as OpenAI Universe.[7] This new system allows for broader testing of machine learning algorithms and will aid OpenAI's goal of promoting free collaboration.

References
[1] Gershgorn, Dave (December 11, 2015). "New 'OpenAI' Artificial Intelligence Group Formed By Elon Musk, Peter Thiel, And More". Popular Science.
[2] "Tech giants pledge $1bn for 'altruistic AI' venture, OpenAI". BBC News.
[3] Cade Metz (27 April 2016). "Inside OpenAI, Elon Musk's Wild Plan to Set Artificial Intelligence Free". Wired.
[4] Greg Brockman; John Schulman (27 April 2016). "OpenAI Gym Beta". OpenAI Blog. OpenAI.
[5] Corey Lynch (25 June 2016). Asynchronous RL in Tensorflow + Keras + OpenAI's Gym. GitHub. github.com/coreylynch/async-rl
[6] Montreal Institute for Learning Algorithms (4 February 2016). Asynchronous Methods for Deep Reinforcement Learning. University of Montreal.
[7] OpenAI (5 December 2016). Universe. OpenAI Blog. OpenAI.
