This is part II of the tutorial series on building a Balancing Bot environment in OpenAI Gym, discussing implementation details of the
Env class. You can reach the first part here.
The OpenAI Gym defines an environment specification, which is implemented in a Python class called
Env. Agents send actions to, and receive observations and rewards from, this class. The class includes a number of attributes and methods that aid in bookkeeping and management. We will be subclassing the
Env class and implementing the working logic of our environment as part of our subclass methods.
Before going into implementation details, we will briefly go through a few essentials including a short discussion of the balancing task and the robot model. We will also have a look at the
Space class, an essential component for creating observations.
The Balancing Bot
A balancing bot is a robot in an inverted pendulum configuration with two wheels on the same axis. An inverted pendulum is a naturally unstable configuration, so the aim is to keep the body of the robot upright and, optionally, to allow movement at a desired velocity. The robot controller accepts orientation and angular velocity information from an IMU and is also aware of the angular velocity of the wheels (either indirectly or using an encoder). The controller’s output consists of commands to either increase or decrease the wheel angular velocity in either direction, so as to maintain balance.
The Robot Model
The balancing bot comprises a rectangular body with two cylindrical wheels, one on each side. Total height is around 50 cm (19.7 inch). You can paste this snippet into a file named
balancebot_simple.xml and save it in the same folder as the
The Space Class
The Space class provides a standardized way of defining action and observation spaces. There are many subclasses of Space included in the Gym, but in this tutorial we will deal with just two:
spaces.Box and spaces.Discrete. The first is a generic class for n-dimensional continuous domains. Think of it as an n-dimensional numpy array. The
Discrete space represents a finite set of values. In fact, the
Discrete space itself does not hold any values; rather, it only represents a single non-negative integer index, corresponding to the selected value.
In this tutorial, we will be describing the observation space using a 3-dimensional
Box space corresponding to three continuous values: robot inclination (pitch), angular velocity and wheel angular velocity. The action space will be a
Discrete space of nine values, each corresponding to progressively greater (negative and positive) changes in commanded wheel angular speed.
We’ll discuss usage of the Space class later on when implementing the reward, observation and reset routines.
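To make the two space types concrete, here is a dependency-free sketch of what membership in each space means. The helper names (in_observation_box, in_action_set) and the bounds are assumptions chosen for illustration. Gym's own Box and Discrete objects perform equivalent checks through their contains method.

```python
import math

# Assumed bounds for (pitch, angular velocity, wheel angular velocity).
OBS_LOW = (-math.pi, -math.pi, -5.0)
OBS_HIGH = (math.pi, math.pi, 5.0)

# Nine discrete actions, indexed 0..8.
N_ACTIONS = 9


def in_observation_box(obs):
    """Return True if obs has three values, each within its bounds."""
    return len(obs) == len(OBS_LOW) and all(
        low <= value <= high
        for low, value, high in zip(OBS_LOW, obs, OBS_HIGH)
    )


def in_action_set(action):
    """Return True if action is a valid index into the nine-value action set."""
    return isinstance(action, int) and 0 <= action < N_ACTIONS
```

An observation such as [0.1, -0.2, 3.0] passes the first check, while an action index of 9 fails the second, because valid indices run from 0 to 8.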
Attributes of the Env class
Environments in OpenAI Gym are subclasses of the
gym.Env Python class. According to the
Env class documentation, there are three attributes that need to be set in every environment:
- action_space: The Space object corresponding to valid actions.
- observation_space: The Space object corresponding to valid observations.
- reward_range: A tuple corresponding to the min and max possible rewards.
We will be setting these during the initialization of our
Env subclass. Our implementation of the
__init__ method will be as follows:
Methods of the Env class
Taking a look at the Env class code suggests that there are five methods that we need to override, whose signatures are as follows:
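A skeleton of the overridable hooks, assuming the underscore-prefixed names used by older Gym releases (_step, _reset, _render, _close, _seed). Treat the exact set and the signatures as assumptions to check against your installed Gym version. The class below is a plain stand-in, so it runs without Gym.

```python
class EnvHooksSketch:
    """Stand-in listing the hooks a subclass would override (no Gym import needed)."""

    def _step(self, action):
        """Advance one time step; return (observation, reward, done, info)."""
        raise NotImplementedError

    def _reset(self):
        """Reset the environment and return the initial observation."""
        raise NotImplementedError

    def _render(self, mode="human", close=False):
        """Draw the environment; may be left empty."""

    def _close(self):
        """Release any resources held by the environment."""

    def _seed(self, seed=None):
        """Seed the random number generator; return the list of seeds used."""
        raise NotImplementedError
```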
The _step method is where the magic happens.
_step is called once for each time step of the simulation, accepts an action from the agent, and has to return a tuple with the following four items:
- An observation
- A reward
- A boolean flag indicating whether an episode has ended
- An info dictionary.
Of those, only the first three are essential; the last one is optional and a blank dict may be returned in its place. The observation is a numpy array object that needs to conform to the dimensions specified when assigning the
observation_space attribute during instance initialization. So for instance if your observation space is a box as follows:
then your returned observation should be a numpy array containing three values, one for each dimension:
The reward is a scalar value that represents the reward an agent gets for performing this particular action at the current state of the environment.
The third item is a boolean flag that indicates the end of an episode. Returning
True here signals the caller, such as the training loop, to call
_reset, which, depending on the implementation, will have the environment reset itself.
The last item is really up to you. I haven’t yet tried it, but returning a blank dictionary is ok.
Here is a made up example of what a typical
_step implementation return statement would look like:
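A minimal sketch with fixed placeholder values. To keep it dependency-free, a plain list stands in for the numpy array that the real environment would return. The function name and the numbers are made up for illustration.

```python
def step_return_example():
    """Return a made-up (observation, reward, done, info) tuple."""
    observation = [0.05, -0.10, 1.50]  # pitch, angular velocity, wheel angular velocity
    reward = 0.09                      # scalar reward for this step
    done = False                       # the episode has not ended
    info = {}                          # blank info dictionary
    return observation, reward, done, info
```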
Obviously, instead of fixed values, your observations and rewards should be computed from the current state of the environment and the action being passed as an argument to _step.
Our implementation of the _step method is as follows:
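A minimal sketch of the call order inside _step, with every helper stubbed out so that it runs without Gym or pyBullet. The class name, the calls list and the _step_simulation wrapper are hypothetical. In the real environment the stubs would talk to the simulator instead of returning fixed values.

```python
class StepPipelineSketch:
    """Dependency-free stand-in that mirrors the order of calls in _step."""

    def __init__(self):
        self.vt = 0.0           # current commanded wheel velocity
        self.step_counter = 0   # hypothetical name for the environment step counter
        self.calls = []         # records the call order, for illustration only

    def _assign_throttle(self, action):
        self.calls.append("assign_throttle")

    def _step_simulation(self):
        # The real environment would advance the pyBullet simulation here.
        self.calls.append("step_simulation")

    def _compute_observation(self):
        self.calls.append("observation")
        return [0.0, 0.0, self.vt]

    def _compute_reward(self):
        self.calls.append("reward")
        return 0.0

    def _compute_done(self):
        self.calls.append("done")
        return False

    def _step(self, action):
        self._assign_throttle(action)
        self._step_simulation()
        observation = self._compute_observation()
        reward = self._compute_reward()
        done = self._compute_done()
        self.step_counter += 1
        return observation, reward, done, {}
```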
The _step method itself has a few high-level calls to other methods that perform specific actions, and which we’ll be getting into right away.
The _assign_throttle method is as follows:
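A minimal sketch of the velocity update at the core of this method. The delta table, the MAX_V limit and the clamping step are assumptions made for illustration. The pyBullet motor command is left as a comment so that the sketch runs on its own.

```python
class ThrottleSketch:
    """Keeps track of the commanded wheel velocity, self.vt."""

    MAX_V = 5.0  # assumed wheel speed limit
    # One entry per action index 0..8; index 4 means "no change" (assumed values).
    DELTAS = [-1.0, -0.5, -0.2, -0.01, 0.0, 0.01, 0.2, 0.5, 1.0]

    def __init__(self):
        self.vt = 0.0

    def _assign_throttle(self, action):
        # Select the speed change that corresponds to the chosen action index.
        deltav = self.DELTAS[action]
        # Apply the change and keep the result within the assumed limits.
        self.vt = max(-self.MAX_V, min(self.MAX_V, self.vt + deltav))
        # The real environment would now pass self.vt to pyBullet's
        # setJointMotorControl2() as the target velocity of each wheel joint.
        return self.vt
```

Placing the zero entry in the middle of the table gives the agent a "keep the current speed" action, with progressively larger corrections on either side.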
In this method, the following happens. First,
deltav, an indicator of the decided wheel speed change, is determined by selecting the element at the index specified by the action. The array contents are arbitrary and could be different. Then,
self.vt is adjusted according to
deltav. Finally a pyBullet method,
setJointMotorControl2(), is called that updates the angular velocity of the robot wheels with the new value.
The next call in the _step method is to
stepSimulation(), which advances the simulation by one step. Next, three methods are called in sequence that compute and return the observation, reward and done flag, respectively. The content of each one follows:
In _compute_observation, pyBullet-specific methods such as getBasePositionAndOrientation, getEulerFromQuaternion and
getBaseVelocity are used to get the robot pitch and angular velocity.
_compute_reward uses the same orientation methods, including
getEulerFromQuaternion, in order to determine the pitch. It then returns a sum of the pitch value (0 being upright) and the absolute difference between the current and desired wheel speeds
(self.vt - self.vd).
self.vd is initially set to zero, and at this point it is not modifiable.
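One way to turn these two quantities into a reward that is highest when the robot is upright and tracking the desired speed is sketched below. The weights and the exact combination are assumptions, and are not necessarily those used in the environment. The pitch is passed in directly, so no pyBullet calls are needed.

```python
def compute_reward(pitch, vt, vd=0.0, pitch_weight=0.1, speed_weight=0.01):
    """Reward staying upright and tracking the desired wheel speed.

    pitch: inclination in radians (0 is upright)
    vt: current commanded wheel speed
    vd: desired wheel speed
    """
    upright_term = (1.0 - abs(pitch)) * pitch_weight
    speed_penalty = abs(vt - vd) * speed_weight
    return upright_term - speed_penalty
```

With these weights, leaning by half a radian costs more reward than a wheel speed error of 2, which biases the agent toward staying upright first.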
_compute_done checks two things. The first is whether the center of mass of the robot is below 15 cm (5.9 inch). If so, the robot has fallen over, and the episode has ended. In addition, it checks the environment step counter so as to limit the duration of each episode.
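The two checks can be sketched as a small pure function. The 0.15 m threshold comes from the text, while the function signature and the max_steps value are assumptions.

```python
def compute_done(com_height, step_counter, min_height=0.15, max_steps=1500):
    """Return True when the robot has fallen or the episode has run long enough.

    com_height: height of the center of mass, in meters
    step_counter: number of steps taken in the current episode
    """
    return com_height < min_height or step_counter >= max_steps
```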
The final steps in the
_step method implementation are to increase the environment step counter by one, and to return the observation, reward and done flag.
The reset method is responsible both for resetting and for initializing the environment, as it is also called once after the class is initialized. Our implementation of
_reset is as follows:
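A minimal sketch of the bookkeeping side of a reset. The class name, the step_counter attribute and the initial observation values are assumptions. The physics setup is only described in comments, so the sketch runs without pyBullet.

```python
class ResetSketch:
    """Tracks only the bookkeeping state; the physics setup is left as comments."""

    def __init__(self):
        self.vt = 0.0           # current commanded wheel velocity
        self.vd = 0.0           # desired wheel velocity
        self.step_counter = 0   # hypothetical name for the environment step counter

    def _reset(self):
        # The real environment would also rebuild the pyBullet world here
        # (simulation reset, gravity, ground plane, robot model).
        self.vt = 0.0
        self.vd = 0.0
        self.step_counter = 0
        # Initial observation: upright, at rest, wheels stopped (assumed values).
        return [0.0, 0.0, 0.0]
```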
The render method is where visualization-related actions take place. In our case the pyBullet environment takes care of all visualization, so there isn’t anything to do in this method. We will leave it blank.
Finally, the seed method sets the seed for this environment’s random number generator:
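A minimal sketch of the idea, using Python's random module in place of Gym's seeding utilities. That substitution is made only to keep the example dependency-free. The class name and the rng attribute are hypothetical.

```python
import random


class SeedSketch:
    """Owns a private random number generator that can be re-seeded."""

    def __init__(self):
        self.rng = random.Random()

    def _seed(self, seed=None):
        """Re-seed the environment's generator and return the seed in a list."""
        self.rng = random.Random(seed)
        return [seed]
```

Two environments seeded with the same value then draw identical random sequences, which is what makes experiments repeatable.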
Putting it all together
The contents of the balancebot_env.py file are as shown below:
You can paste the snippet above into your empty
balancebot_env.py, and you’ll be good to go.
Running the balancing bot simulation
It’s now time to put the script we wrote at the very beginning of this tutorial to good use. Here it is presented again for convenience:
Paste the contents to a file named
balancebot_task.py just outside your
balancebot-env folder, and run it:
You should see the pyBullet viewer popping up and the robot should already be hitting the floor by the time you read this!
But don’t worry! It will improve quickly. After around 2 minutes (depending on the speed of your machine) the bot should be able to stand upright for most of the time. You can use
Ctrl+C in the terminal to quit the simulation, or let it complete the predefined number of cycles, after which it will save the best model and quit.
This post just scratches the surface of what is possible with Gym, Baselines and pyBullet. Regarding the balancing bot task in particular, there are a few ideas for advancing it:
- Add sensor noise: The values that we get currently are squeaky clean, as they come from software. In real life, reading sensors would involve some level of noise, so one idea is to mix some of the noise in the values of the observations. Don’t worry, this will not reduce the performance of the balancing bot, on the contrary it will make it more robust.
- Add bumps and obstacles: A flat plain is good for starters, but how will the balancing bot behave on uneven ground filled with obstacles?
- Build more advanced robot models: Boston Dynamics’ Handle is a balancing bot at heart but it is so much more agile due to all the articulations and suspension it includes! Control of a complex robot like the Handle would be an ideal scenario for a reinforcement learning algorithm!
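The sensor-noise idea from the list above can be sketched in a few lines. The function name, the single shared sigma and the use of a plain list for the observation are assumptions. A real sensor model would likely use a different noise level per channel.

```python
import random


def add_sensor_noise(observation, sigma, rng=None):
    """Return a copy of observation with zero-mean Gaussian noise added to each value."""
    if rng is None:
        rng = random
    return [value + rng.gauss(0.0, sigma) for value in observation]
```

For example, add_sensor_noise([0.1, 0.0, 1.0], 0.01, random.Random(0)) returns three slightly perturbed values, and passing the same seeded generator state again reproduces them.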
In this post we completed implementation of the balancing bot task and saw in detail what each part of the
Env subclass does, getting a working Gym+Baselines+pyBullet experiment at the end. It’s good fun to watch the little bot learning to balance, but there is so much more to explore!
I will be shortly releasing the source of this experiment on Github, so stay tuned!
Do you have any questions or comments regarding this tutorial? If so, share your experience in the comments below!