The Reinforcement Learning Workshop

OpenAI Gym

In this section, we will study the OpenAI Gym tool. We will go through the motivations behind its creation and its main elements, learning how to interact with them to properly train a reinforcement learning algorithm to tackle state-of-the-art benchmark problems. Finally, we will build a custom environment with the same set of standardized interfaces.

Shared standard benchmarks are essential for measuring the performance of machine learning algorithms and for tracking state-of-the-art improvements. Supervised learning has had many such benchmarks since the early days of the discipline, but the same is not true for the reinforcement learning field.

With the aim of fulfilling this need, in 2016, OpenAI released OpenAI Gym (https://gym.openai.com/). It was conceived to be to reinforcement learning what standardized datasets such as ImageNet and COCO are to supervised learning: a standard, shared context in which the performance of RL methods can be directly measured and compared, both to identify the highest-achieving ones and to monitor current progress.

OpenAI Gym acts as an interface between the typical Markov decision process formulation of the reinforcement learning problem and a variety of environments, covering different types of problems the agent has to solve (from classic control to Atari video games), as well as different observation and action spaces. Gym is completely independent of the structure of the agent that interacts with it, as well as of the machine learning framework used to build and run that agent.

Here is the list of environment categories Gym offers, ranging from easy to difficult and involving many different kinds of data:

  • Classic control and toy text: Small-scale easy tasks, frequently found in reinforcement learning literature. These environments are the best place to start in order to gain confidence with Gym and to familiarize yourself with agent training.

    The following figure shows the CartPole classic control problem:

Figure 4.1: Classic control problem – CartPole

The following figure shows the MountainCar classic control problem:

Figure 4.2: Classic control problem – MountainCar

  • Algorithmic: In these environments, the system has to learn, autonomously and purely from examples, to perform computations ranging from multi-digit additions to alphanumeric-character sequence reversal.

    The following figure shows an instance of the algorithmic problem set, in which multiple copies of the input sequence must be produced:

Figure 4.3: Algorithmic problem – copying multiple instances of the input sequence

The following figure shows another instance of the algorithmic problem set, in which the input sequence must be copied:

Figure 4.4: Algorithmic problem – copying an instance of the input sequence

  • Atari: Gym integrates the Arcade Learning Environment (ALE), a software library that provides an interface we can use to train an agent to play classic Atari video games. It played a major role in helping reinforcement learning research achieve outstanding results.

    The following figure shows the Atari video game Breakout, provided by ALE:

Figure 4.5: Atari video game of Breakout

The following figure shows the Atari video game Pong, provided by ALE:

Figure 4.6: Atari video game of Pong

Note

The preceding figures have been sourced from the official documentation for OpenAI Gym. Please refer to the following link for more visual examples of Atari games: https://gym.openai.com/envs/#atari.

  • MuJoCo and Robotics: These environments expose typical challenges that are encountered in the field of robot control. Some of them take advantage of the MuJoCo physics engine, which was designed for fast and accurate robot simulation and offers free licenses for trial.

    The following figure shows three MuJoCo environments, all of which provide a meaningful overview of robotic locomotion tasks:

Figure 4.7: Three MuJoCo-powered environments – Ant (left), Walker (center), and Humanoid (right)

Note

The preceding images have been sourced from the official documentation for OpenAI Gym. Please refer to the following link for more visual examples of MuJoCo environments: https://gym.openai.com/envs/#mujoco.

The following figure shows two environments contained in the "Robotics" category, where RL agents are trained to perform robotic manipulation tasks:

Figure 4.8: Two robotics environments – FetchPickAndPlace (left) and HandManipulateEgg (right)

Note

The preceding images have been sourced from the official documentation for OpenAI Gym. Please refer to the following link for more visual examples of Robotics environments: https://gym.openai.com/envs/#robotics.

How to Interact with a Gym Environment

To interact with a Gym environment, we first have to create and initialize it. The gym module provides the make function, which takes the ID of the environment as an argument and returns a new instance of it. To list all the environments available in a given Gym installation, it is sufficient to run the following code:

from gym import envs

print(envs.registry.all())

This prints out the following:

[EnvSpec(DoubleDunk-v0), EnvSpec(InvertedDoublePendulum-v0),

EnvSpec(BeamRider-v0), EnvSpec(Phoenix-ram-v0), EnvSpec(Asterix-v0),

EnvSpec(TimePilot-v0), EnvSpec(Alien-v0), EnvSpec(Robotank-ram-v0),

EnvSpec(CartPole-v0), EnvSpec(Berzerk-v0), EnvSpec(Berzerk-ram-v0),

EnvSpec(Gopher-ram-v0), ...

This is a list of so-called EnvSpec objects. They define specific environment-related parameters, such as the goal to be achieved, the reward threshold defining when the task is considered solved, and the maximum number of steps allowed for a single episode.
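To make the role of these parameters concrete, here is a minimal sketch that does not use Gym at all. EnvSpecSketch and is_solved are hypothetical names introduced only for illustration, and the CartPole-v0 values (a reward threshold of 195.0, a 200-step limit, and a 100-episode averaging window) are assumptions based on commonly documented settings rather than values read from an installation:

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class EnvSpecSketch:
    """Illustrative stand-in for the kind of data an EnvSpec carries."""
    id: str
    reward_threshold: Optional[float]  # mean return at which the task counts as solved
    max_episode_steps: Optional[int]   # an episode is cut off after this many steps


def is_solved(spec: EnvSpecSketch, episode_returns: Sequence[float],
              window: int = 100) -> bool:
    """True if the mean return over the last `window` episodes reaches the threshold."""
    if spec.reward_threshold is None or len(episode_returns) < window:
        return False
    recent = list(episode_returns)[-window:]
    return sum(recent) / len(recent) >= spec.reward_threshold


# Assumed values for CartPole-v0; check them against your own installation.
cartpole_spec = EnvSpecSketch(id="CartPole-v0",
                              reward_threshold=195.0,
                              max_episode_steps=200)
```

A helper of this kind is one way to decide when to stop training: collect the total reward of each episode in a list and call is_solved(cartpole_spec, returns) after every episode.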

One interesting thing to note is that it is possible to easily add custom environments, as we will see later on. Thanks to this, a user can implement a custom problem using standard interfaces, making it straightforward for it to be tackled by standardized, off-the-shelf, reinforcement learning algorithms.

The fundamental elements of an environment are as follows:

  • Observation (object): An environment-specific object representing what can be observed of the environment; for example, the kinematic variables (that is, velocities and positions) of a mechanical system, pawn positions in a chess game, or the pixel frames of a video game.
  • Actions (object): An environment-specific object representing actions the agent can perform in the environment; for example, joint rotations and/or joint torques for a robot, a legal move in a board game, or buttons being pressed in combination for a video game.
  • Reward (float): The amount of reward achieved by executing the last step with the prescribed action. The reward range differs between different tasks, but in order to solve the environment, the aim is always to increase it, since this is what the RL agent tries to maximize.
  • Done (bool): This indicates whether the episode has finished. If true, the environment needs to be reset. Most, but not all, tasks are divided into well-defined episodes, where a terminated episode may represent that the robot has fallen on the ground, the board game has reached a final state, or the agent lost its last life in a video game.
  • Info (dict): This contains diagnostic information on environment internals and is useful both for debugging purposes and for RL agent training, even if its use is not allowed for standard benchmark comparisons.

The fundamental methods of an environment are as follows:

  • reset(): Input: none, output: observation. Resets the environment, bringing it to the starting point, and returns the corresponding initial observation. It has to be called right after environment creation and every time a final state is reached (done flag equal to True).
  • step(action): Input: action, output: observation – reward – done – info. Advances the environment by one step, applying the selected input action. Returns the observation of the newly reached state, the reward associated with the transition from the previous to the new state under the selected action, the done flag indicating whether the new state is a terminal one (True) or not (False), and the info dict with environment internals.
  • render(): Input: none, output: environment rendering. Renders the environment and is used for visualization/presentation purposes only. It is not used during agent training, which only needs observations to know the environment's state. For example, it presents robot movements via animation graphics or outputs a video game video stream.
  • close(): Input: none, output: none. Shuts down the environment gracefully.
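The contract described by these two lists can be sketched without Gym installed. CountdownEnv and run_episode below are hypothetical names used only for illustration: the toy environment mimics the reset/step/close methods and the four values returned by step, but it is not a real Gym environment.

```python
class CountdownEnv:
    """Hypothetical toy environment: the episode ends after `horizon` steps."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.steps_left = horizon

    def reset(self):
        # Bring the environment back to its starting point and
        # return the first observation.
        self.steps_left = self.horizon
        return self.steps_left

    def step(self, action):
        # Advance one step; the reward is 1.0 for action 1, otherwise 0.0.
        self.steps_left -= 1
        reward = 1.0 if action == 1 else 0.0
        done = self.steps_left == 0
        info = {"steps_left": self.steps_left}
        return self.steps_left, reward, done, info

    def close(self):
        # Nothing to release in this toy example.
        pass


def run_episode(env, policy, max_steps=1000):
    """Play one episode and return (total_reward, steps_taken)."""
    observation = env.reset()
    total_reward = 0.0
    for t in range(max_steps):
        action = policy(observation)
        observation, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            return total_reward, t + 1
    return total_reward, max_steps


env = CountdownEnv(horizon=5)
total, steps = run_episode(env, policy=lambda obs: 1)
env.close()
print(total, steps)
```

Because run_episode relies only on reset and step, the same function should work with any object that follows this interface and returns the same four values from step.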

These elements allow us to interact fully with the environment, whether we are executing it with random inputs, training an agent, or running a trained agent. They are, in fact, an implementation of the standard reinforcement learning formulation, which is described by the agent-environment interaction. At each timestep, the agent executes an action. This interaction with the environment causes a transition from the current state to a new state, resulting in an observation of the new state and a reward, which are returned as results. As a preliminary step, the following exercise shows how to create a CartPole environment, reset it, run it for 1,000 steps while randomly sampling one action for each step, and finally close it.

Exercise 4.01: Interacting with the Gym Environment

In this exercise, we will familiarize ourselves with the Gym environment by looking at a classic control example, CartPole. Follow these steps to complete this exercise:

  1. Import the OpenAI Gym module:

    import gym

  2. Instantiate the environment and reset it:

    env = gym.make('CartPole-v0')

    env.reset()

    The output will be as follows:

    array([ 0.03972635, 0.00449595, 0.04198141, -0.01267544])

  3. Run the environment for 1000 steps, rendering it and resetting it if a terminal state is encountered. After all steps are completed, close the environment:

    for _ in range(1000):

        env.render()

        # take a random action

        _, _, done, _ = env.step(env.action_space.sample())

        if done:

            env.reset()

    env.close()

    It renders the environment and plays it for 1,000 steps. The following figure shows one frame that was extracted from step number 12 of the entire sequence:

Figure 4.9: One frame of the 1,000 rendered steps for the CartPole environment

Note

To access the source code for this specific section, please refer to https://packt.live/30yFmOi.

This section does not currently have an online interactive example, and will need to be run locally.

This shows that the black cart can move along its rail (the horizontal line), with its pole fixed on the cart with a hinge that allows it to rotate freely. The goal is to control the cart while pushing it left and right in order to maintain the pole's vertical equilibrium, as seen in the preceding figure.

Action and Observation Spaces

In order to appropriately interact with an environment and train an agent on it, a fundamental initial step is to familiarize yourself with its action and observation spaces. For example, in the preceding exercise, the action was randomly sampled from the environment's action space.

Every environment is characterized by action_space and observation_space, which are instances of the Space class that describe the actions and observations required by Gym. The following snippet prints them out for the CartPole environment:

import gym

env = gym.make('CartPole-v0')

print("Action space =", env.action_space)

print("Observation space =", env.observation_space)

This outputs the following two rows:

Action space = Discrete(2)

Observation space = Box(4,)

The Discrete space represents a finite set of non-negative integers: a Discrete space of dimension n contains the values 0, 1, ..., n-1, and these are the valid actions. For example, in the CartPole case, it is of dimension 2 because the agent can only push the cart left or right, so the admissible values are 0 and 1. The Box space can be thought of as an n-dimensional array. In the CartPole case, the system state is defined by four variables: cart position and velocity, and pole angle with respect to the vertical and angular velocity. So, the Box observation space dimension is equal to 4, and a valid observation is an array of four real numbers. For a Box space, it is useful to check the upper and lower bounds of its components. This can be done as follows:

print("Observations superior limit =", env.observation_space.high)

print("Observations inferior limit =", env.observation_space.low)

This prints out the following:

Observations superior limit = array([ 2.4, inf, 0.20943951, inf])

Observations inferior limit = array([-2.4, -inf,-0.20943951, -inf])
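To illustrate what the two kinds of spaces describe, the following sketch re-creates them in plain Python. DiscreteSketch and BoxSketch are hypothetical stand-ins written for this example, not Gym's classes; they only mirror the idea of sampling a valid value and checking whether a value belongs to the space. The bounds are copied from the output printed above.

```python
import random


class DiscreteSketch:
    """Stand-in for a discrete space with valid values 0, 1, ..., n - 1."""

    def __init__(self, n):
        self.n = n

    def sample(self):
        # Draw one valid value uniformly at random.
        return random.randrange(self.n)

    def contains(self, x):
        return isinstance(x, int) and 0 <= x < self.n


class BoxSketch:
    """Stand-in for a box space: a fixed-length vector with per-component bounds."""

    def __init__(self, low, high):
        assert len(low) == len(high)
        self.low, self.high = list(low), list(high)

    def contains(self, x):
        # A value is valid if it has the right length and every
        # component lies within its own lower and upper bound.
        return len(x) == len(self.low) and all(
            lo <= v <= hi for lo, v, hi in zip(self.low, x, self.high)
        )


inf = float("inf")
action_space = DiscreteSketch(2)
observation_space = BoxSketch(
    low=[-2.4, -inf, -0.20943951, -inf],
    high=[2.4, inf, 0.20943951, inf],
)

print(action_space.sample() in (0, 1))
print(observation_space.contains([0.0, 5.0, 0.1, -3.0]))
```

Note that the two velocity components are unbounded, so only the cart position and the pole angle can fall outside the box.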

With these new elements, it is possible to write a more complete snippet to interact with the environment, using all the previously presented interfaces. The following code shows a complete loop that executes 20 episodes of at most 100 steps each, rendering the environment, retrieving observations, and printing them out while taking random actions. An episode ends early if a terminal state is reached, and the environment is reset at the start of the next one:

import gym

env = gym.make('CartPole-v0')

for i_episode in range(20):

    observation = env.reset()

    for t in range(100):

        env.render()

        print(observation)

        action = env.action_space.sample()

        observation, reward, done, info = env.step(action)

        if done:

            print("Episode finished after {} timesteps".format(t+1))

            break

env.close()

The preceding code runs the environment for 20 episodes of up to 100 steps each, also rendering the environment, as we saw in Exercise 4.01, Interacting with the Gym Environment.

Note

In the preceding case, we run each episode for 100 steps instead of 1,000, as we did previously. There is no particular reason for doing so, but we are running 20 different episodes, not a single one, so we opted for 100 steps to keep the code execution time short enough.

In addition to that, this code also prints out the sequence of observations, as returned by the environment, for each step performed. The following are a few lines that are received as output:

[-0.061586 -0.75893141 0.05793238 1.15547541]

[-0.07676463 -0.95475889 0.08104189 1.46574644]

[-0.0958598 -1.15077434 0.11035682 1.78260485]

[-0.11887529 -0.95705275 0.14600892 1.5261692 ]

[-0.13801635 -0.7639636 0.1765323 1.28239155]

[-0.15329562 -0.57147373 0.20218013 1.04977545]

Episode finished after 14 timesteps

[-0.02786724 0.00361763 -0.03938967 -0.01611184]

[-0.02779488 -0.19091794 -0.03971191 0.26388759]

[-0.03161324 0.00474768 -0.03443415 -0.04105167]

Looking at the previous code example, we can see how, for now, the action choice is completely random. It is right here that a trained agent would make a difference: it should choose actions based on environment observations, thus appropriately responding to the state it finds itself in. So, revising the previous code by substituting a trained agent in place of a random action choice looks as follows:

  1. Import the OpenAI Gym module and create the CartPole environment:

    import gym

    env = gym.make('CartPole-v0')

  2. Run 20 episodes of up to 100 steps each:

    for i_episode in range(20):

        observation = env.reset()

        for t in range(100):

  3. Render the environment and print the observation:

            env.render()

            print(observation)

  4. Use the agent's knowledge to choose the action, given the current environment state (RL_agent is a placeholder for a trained agent object, which is not defined here):

            action = RL_agent.select_action(observation)

  5. Step the environment:

            observation, reward, done, info = env.step(action)

  6. If the episode has finished, break the inner loop and start a new episode. After all the episodes are completed, close the environment:

            if done:

                print("Episode finished after {} timesteps".format(t+1))

                break

    env.close()

With a trained agent, actions are no longer random: the agent chooses each action as a function of the state it is in, with the aim of maximizing the expected reward. This code would result in an output similar in form to the previous one.
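The steps above leave RL_agent undefined. As a minimal sketch of an object exposing the select_action method used there, here is a hand-coded (not trained) stand-in. LeanFollowingAgent is a hypothetical name, and the rule relies on two assumptions: the third observation component is the pole angle, positive when the pole leans to the right, and action 1 pushes the cart to the right while action 0 pushes it to the left.

```python
class LeanFollowingAgent:
    """Hypothetical hand-coded agent: push the cart toward the side the pole leans to."""

    def select_action(self, observation):
        # Assumed observation layout: cart position, cart velocity,
        # pole angle, pole angular velocity.
        _, _, pole_angle, _ = observation
        # Assumed action encoding: 0 = push left, 1 = push right.
        return 1 if pole_angle > 0 else 0


RL_agent = LeanFollowingAgent()

# Two observations taken from the output printed earlier in this section.
print(RL_agent.select_action([-0.061586, -0.75893141, 0.05793238, 1.15547541]))
print(RL_agent.select_action([-0.02786724, 0.00361763, -0.03938967, -0.01611184]))
```

A trained agent would replace this fixed rule with a policy learned from experience, but it would plug into the loop through the same select_action call.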

But how do we proceed and train an agent from scratch? As you will learn throughout this book, there are many different approaches and algorithms we can use to achieve this quite complex task. In general, they all need the following tuple of elements: current state, chosen action, reward obtained by performing the chosen action, and new state reached by performing the chosen action.

So, elaborating again on the previous code snippet to introduce the agent training step, it would look like this:

import gym

env = gym.make('CartPole-v0')

for i_episode in range(20):

    observation = env.reset()

    for t in range(100):

        env.render()

        print(observation)

        action = RL_agent.select_action(observation)

        new_observation, reward, done, info = env.step(action)

        RL_agent.train(observation, action, reward, new_observation)

        observation = new_observation

        if done:

            print("Episode finished after {} timesteps".format(t+1))

            break

env.close()

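The main difference in this code with respect to the previous block is the following line (the result of env.step is also stored in new_observation, so that both the previous and the new state are available to it):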

        RL_agent.train(observation, action, reward, new_observation)

This refers to the agent training step. The purpose of this code is to give us a high-level idea of all the steps involved in training an RL agent in a given environment.
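To give the train call some substance, the following sketch shows one possible agent that consumes the tuple of current state, chosen action, reward, and new state. It uses tabular Q-learning over coarsely rounded observations, which is a choice made here for illustration only and is not necessarily the method intended by the snippet above. TabularQAgent and all of its parameters are hypothetical.

```python
import random
from collections import defaultdict


class TabularQAgent:
    """Hypothetical agent: epsilon-greedy tabular Q-learning on rounded observations."""

    def __init__(self, n_actions, alpha=0.1, gamma=0.99, epsilon=0.1, precision=1):
        self.n_actions = n_actions
        self.alpha = alpha          # learning rate
        self.gamma = gamma          # discount factor
        self.epsilon = epsilon      # probability of a random exploratory action
        self.precision = precision  # decimals kept when discretizing observations
        self.q = defaultdict(lambda: [0.0] * n_actions)

    def _key(self, observation):
        # Round each component so that nearby observations share a table entry.
        return tuple(round(float(v), self.precision) for v in observation)

    def select_action(self, observation):
        if random.random() < self.epsilon:
            return random.randrange(self.n_actions)
        values = self.q[self._key(observation)]
        return values.index(max(values))

    def train(self, observation, action, reward, new_observation):
        # One-step Q-learning update toward reward + gamma * max_a' Q(s', a').
        state = self._key(observation)
        next_state = self._key(new_observation)
        target = reward + self.gamma * max(self.q[next_state])
        self.q[state][action] += self.alpha * (target - self.q[state][action])


RL_agent = TabularQAgent(n_actions=2)
```

Since the train signature in the loop does not receive the done flag, this sketch bootstraps from the new state even when it is terminal; a fuller implementation would probably pass done as well and skip the bootstrap term in that case.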

This is the high-level idea behind training a reinforcement learning agent with a Gym environment. Gym provides access to all the required details through a very clean standard interface, thus giving us access to an extremely large set of different problems against which algorithms and techniques can be measured.

How to Implement a Custom Gym Environment

All the environments that are available through Gym are perfect for learning purposes, but eventually, you will need to train an agent to solve a custom problem. One good way to achieve this is to create a custom environment, specific to the problem domain.

In order to do so, a class derived from gym.Env must be created. It will implement all the objects and methods described in the previous section so that it supports the agent-world interaction cycle that's typical of any reinforcement learning setting.

The following snippet is a template for developing a custom environment; N_DISCRETE_ACTIONS, HEIGHT, WIDTH, and N_CHANNELS are placeholders to be defined for the specific problem:

import gym

import numpy as np

from gym import spaces

class CustomEnv(gym.Env):

    """Custom Environment that follows gym interface"""

    metadata = {'render.modes': ['human']}

    def __init__(self, arg1, arg2):  # extend with further problem-specific arguments as needed

      super(CustomEnv, self).__init__()

      # Define action and observation space

      # They must be gym.spaces objects

      # Example when using discrete actions:

      self.action_space = spaces.Discrete(N_DISCRETE_ACTIONS)

      # Example for using image as input:

      self.observation_space = spaces.Box(

          low=0, high=255,

          shape=(HEIGHT, WIDTH, N_CHANNELS),

          dtype=np.uint8)

    def step(self, action):

      # Execute one time step within the environment

      ...

      # Compute reward

      ...

      # Check if in final state

      ...

      return observation, reward, done, info

    def reset(self):

      # Reset the state of the environment to an initial state

      ...

      return observation

    def render(self, mode='human', close=False):

      # Render the environment to the screen

      ...

      return

In the constructor, action_space and observation_space are defined. As mentioned previously, they describe all the possible actions the agent can take in the environment and all the environment data observable by the agent. They must be tailored to the specific problem: in particular, action_space will reflect the elements the agent can control to interact with the environment, while observation_space will contain all the variables we want the agent to consider when choosing the action.

The reset method is called to bring the environment back to an initial state, typically right after initialization and every time an episode ends. It returns the initial observation.

The step method receives an action as input and executes it. This will result in an environment transitioning from the current state to a new state. The observation related to the new state is returned. This is also the method where the reward is calculated as a result of the state transition generated by the action. The new state is checked to determine whether it is a terminal one, in which case, the done flag that's returned is set to true. As the last step, all useful internals are returned in the info dictionary.

Finally, the render method is the one in charge of rendering the environment. Its complexity may range from being as simple as a print statement to being as complicated as rendering a 3D environment using OpenGL.
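As a concrete, deliberately small illustration of these four methods, the following sketch implements a one-dimensional corridor in which the agent has to reach the rightmost cell. CorridorEnv is a hypothetical example written in plain Python so that it runs without Gym installed; in a real project the class would derive from gym.Env and declare action_space and observation_space as in the template above.

```python
class CorridorEnv:
    """Hypothetical custom environment: walk along a corridor to the rightmost cell."""

    def __init__(self, length=5):
        self.length = length
        self.n_actions = 2  # 0 = move left, 1 = move right
        self.position = 0

    def reset(self):
        # Reset the state of the environment to the initial state.
        self.position = 0
        return self.position

    def step(self, action):
        # Execute one time step: move and stay inside the corridor.
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        # Check if in final state.
        done = self.position == self.length - 1
        # Compute reward: 1.0 only when the goal cell is reached.
        reward = 1.0 if done else 0.0
        info = {"position": self.position}
        return self.position, reward, done, info

    def render(self, mode="human"):
        # Rendering here is as simple as a print statement.
        cells = ["A" if i == self.position else "." for i in range(self.length)]
        print("".join(cells))


env = CorridorEnv(length=4)
observation = env.reset()
done = False
while not done:
    env.render()
    observation, reward, done, info = env.step(1)  # always move right
env.render()
```

Always moving right is enough to reach the goal here, which makes this environment convenient for checking that an agent and a training loop are wired together correctly before moving on to a harder problem.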

In this section, we studied the OpenAI Gym tool. We had an overview that explained the context and motivations behind its conception, provided details about its main elements, and saw how to interact with the elements to properly train a reinforcement learning algorithm to tackle state-of-the-art benchmark problems. Finally, we saw how to build a custom environment with the same set of standardized interfaces.