by Jacob Campbell


Machine learning has been an extremely popular topic in recent years. OpenAI recently published work on an agent that uses machine learning to beat Dota 2 professionals in 1v1 matches, and they're working toward a solution that can beat the world's top teams in 5v5. DeepMind developed AlphaGo, an agent capable of beating the world's best Go players – something long thought to be nearly impossible because of the staggering number of states and actions in Go.

These are huge achievements in the realm of machine learning, and an important common denominator between them is the use of reinforcement learning algorithms.


In this post I hope to introduce some fundamental concepts of reinforcement learning while also showing tools used by the community to build and test reinforcement learning models. By the end, you should have a basic understanding of some reinforcement learning terms and the ability to write a simple random agent using Python and OpenAI's Gym. I'll also provide some additional resources I've found to be helpful when learning about reinforcement learning.

The motivation for this post is to serve as an introduction to further writing about a project I've been working on: reproducing the findings of the DeepMind DQN paper. That paper was one of the first (if not the first) to demonstrate a successful application of deep neural networks to estimating value functions in a modified Q-learning algorithm.

About Reinforcement Learning

For those interested in learning about the scope, history, and popular algorithms in reinforcement learning, I highly recommend "Reinforcement Learning: An Introduction," a wonderful book by Richard S. Sutton and Andrew G. Barto (a link is in the further reading section). Here I will cover just the bare minimum needed to understand what's going on in the following examples.

Generally speaking, the term “reinforcement learning” can refer to three concepts: the space of reinforcement learning problems, solutions to those problems, and the study of these problems and their solutions (Sutton & Barto p. 4).

What do we consider to be a reinforcement learning problem? Reinforcement learning is typically concerned with the set of problems in which an agent interacts with its environment, ideally toward some goal, and adapts (learns) over time. The scope of this field is ridiculously broad, which has contributed to some of its recent popularity.

An environment is the world in which an agent acts. The agent receives some kind of signal (input) from the environment that it understands as state – analogous to our five senses. When the agent wants to effect change, it can take an action and observe the resulting reward and state change (or lack thereof).

Agent–environment interaction loop (representation stolen from Wikipedia)
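The loop described above can be sketched in a few lines of plain Python. The CoinFlipEnv below is a made-up toy environment (not part of any library) where the agent tries to guess a coin flip and earns a reward of 1 for each correct guess:

```python
import random

class CoinFlipEnv:
    """A hypothetical toy environment: guess a coin flip for 5 steps."""
    def reset(self):
        self.steps_left = 5
        return 0  # the initial (trivial) state

    def step(self, action):
        self.steps_left -= 1
        reward = 1 if action == random.randint(0, 1) else 0
        done = self.steps_left == 0
        return 0, reward, done  # next state, reward, episode-over flag

env = CoinFlipEnv()
state = env.reset()
total_reward = 0
done = False
while not done:
    action = random.choice([0, 1])          # the agent picks an action
    state, reward, done = env.step(action)  # the environment responds
    total_reward += reward

print(total_reward)  # somewhere between 0 and 5
```

The shape of this loop – observe, act, collect reward, repeat until the episode ends – is exactly what Gym formalizes, as we'll see below.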

Building a Python RL Environment

In this section we're going to build an environment that we can use to test against a classic reinforcement learning problem called the CartPole problem. That is, we have a pole balancing on a cart, and we want to move the cart so as to maximize the time the pole stays standing.

We'll do this using Python 3.7 (really, any version >= 3.5 should work) and Gym, an open-source tool built by OpenAI for building and testing reinforcement learning algorithms. Gym doesn't make many assumptions about the algorithms that you're testing. Simply put, Gym provides an interface between your algorithm and the environment it's interacting with (the two arrows in our above image).

The following incantation should be familiar to those initiated into the Python ecosystem:

mkdir is_this_rl && cd is_this_rl
virtualenv venv --python=python3.7
source venv/bin/activate
pip install gym

Once we’ve created a virtual environment (to avoid convoluted dependency errors, avoid stepping on toes, and make things mostly repeatable) and installed Gym, we can begin to play around with the API in our REPL.

(venv)$ python
>>> import gym
>>> gym_env = gym.make('CartPole-v0')

In the above snippet we create an environment to interact with. In this instance, we're designating an environment for the classic CartPole problem. Gym provides many built-in environments to test against, and if those environments are not sufficient, it also supports the addition of new ones.
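If you're curious what else ships with Gym, the environment registry can be enumerated. Note that the registry API has shifted between Gym releases (older versions expose a registry.all() method, newer ones a plain dict), so the sketch below tries both forms:

```python
import gym
from gym import envs

# Collect the IDs of all registered environments; the exact API
# differs slightly between Gym versions.
try:
    all_ids = [spec.id for spec in envs.registry.all()]  # older Gym
except AttributeError:
    all_ids = list(envs.registry.keys())                 # newer Gym

print(len(all_ids))
print([env_id for env_id in all_ids if "CartPole" in env_id])
```

Any ID printed here can be passed straight to gym.make(), just like 'CartPole-v0' above.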

Once the environment is created, we can use the attributes action_space and observation_space to glean some insight into what it means to operate in this environment:

>>> gym_env.action_space
Discrete(2)
>>> gym_env.observation_space
Box(4,)

Hmm… Discrete(2) and Box(4,) appear to tell us something, but I prefer having some concrete examples of what these might mean. Let's begin by digging into action_space.

>>> gym_env.action_space.sample()
1
>>> gym_env.action_space.sample()
0
>>> gym_env.action_space.sample()
0

Using the sample() method, we can randomly select an action from the action space. It looks like Discrete(2) means that the environment expects an action signal of either 0 or 1. If we want to make sure that our intuition is correct, a quick glance at the Discrete source code confirms our suspicions.
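We can also poke at a Discrete space directly, without an environment attached. The attributes below (n, sample(), contains()) are part of Gym's spaces API:

```python
import gym

# A standalone Discrete space, identical in shape to CartPole's action space
space = gym.spaces.Discrete(2)

# n is the number of valid actions; samples are drawn from {0, ..., n-1}
print(space.n)            # 2
print(space.contains(0))  # True:  0 is a valid action
print(space.contains(2))  # False: only 0 and 1 are valid

# No matter how often we sample, we only ever see 0 or 1
samples = [space.sample() for _ in range(100)]
print(set(samples) <= {0, 1})  # True
```

contains() is handy as a sanity check when you start generating actions from your own policy rather than from sample().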

In a very similar manner, let’s investigate the observation space.

>>> observation = gym_env.reset()
>>> observation
array([-0.00092824,  0.03297694,  0.03997609, -0.01551737])

The above snippet introduces the reset() function. This function can be used to, unsurprisingly, reset the environment to a new initial state. In a video game, this would be equivalent to restarting a level (although the initial state isn't necessarily the same every time). The observation space in this environment appears to be a 1D array of length 4.
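Where Discrete describes a finite set of choices, a Box describes a continuous array with per-element bounds, and we can inspect those bounds directly (the exact printed values and repr vary a little between Gym versions):

```python
import gym

env = gym.make('CartPole-v0')
box = env.observation_space

# A Box is a continuous n-dimensional space with lower/upper bounds
# for each element of the observation array.
print(box.shape)  # (4,)
print(box.low)    # per-element lower bounds
print(box.high)   # per-element upper bounds
```

Per Gym's CartPole source, the four values correspond to cart position, cart velocity, pole angle, and pole velocity, which explains the mix of tight and effectively unbounded limits you'll see printed.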

Putting all our new information together, we can construct a simple random agent like the following:

import gym

env = gym.make('CartPole-v0')

for episode in range(10):
    print("Episode: {}".format(episode + 1))
    observation = env.reset()
    print("Initial Observation: {}".format(observation))
    done = False
    step = 0
    while not done:
        # Select a random action from the action space
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        # Only print the observations and rewards from the first episode
        if episode == 0:
            print("Step: {}".format(step + 1))
            print("\tTaking action: {}".format(action))
            print("\tObservation: {}".format(observation))
            print("\tReward received: {}".format(reward))
        step += 1

env.close()
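Since CartPole pays a reward of 1 for every step the pole stays up, the return of an episode is just its length, which makes it a handy progress measure once you move beyond a random agent. Here is a small variation of the loop above that accumulates it; because step()'s return value changed from a 4-tuple to a 5-tuple in Gym 0.26, the sketch handles both forms:

```python
import gym

env = gym.make('CartPole-v0')

episode_returns = []
for episode in range(10):
    env.reset()
    done = False
    total = 0.0
    while not done:
        result = env.step(env.action_space.sample())
        if len(result) == 4:
            # Older Gym: (observation, reward, done, info)
            _, reward, done, _ = result
        else:
            # Gym >= 0.26: (observation, reward, terminated, truncated, info)
            _, reward, terminated, truncated, _ = result
            done = terminated or truncated
        total += reward
    episode_returns.append(total)

env.close()
print(episode_returns)
```

For a random agent, these returns will hover in the low tens; a learned policy is doing well once they consistently hit the environment's episode cap.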

Further Reading

“Reinforcement Learning: An Introduction” by Richard S. Sutton and Andrew G. Barto – this book is a solid and current introduction to reinforcement learning. For someone completely new getting into the subject, I cannot recommend this book highly enough. I’d recommend getting a hold of the 2nd edition if you can as it has many additions and discusses recent work in reinforcement learning (published in 2018).

Deep Learning Specialization by Andrew Ng (Coursera) – deep learning is often used for value and policy estimation in recent papers. The goal with this course is to become familiar enough with the terminology and methodology to be comfortable digging into the deep learning literature yourself.

Gym Documentation.

OpenAI website – they have some really cool articles and papers about machine learning with a focus on working towards artificial general intelligence.