Build a Balancing Bot with OpenAI Gym, Pt II: The Robot and Environment

This is part II of the tutorial series on building a Balancing Bot environment in OpenAI Gym, discussing implementation details of the Env class. You can find the first part here.

The OpenAI Gym defines an environment specification, which is implemented in a Python class called Env. Agents send actions to, and receive observations and rewards from, this class. The class includes a number of attributes and methods that aid in bookkeeping and management. We will be subclassing the Env class and implementing the working logic of our environment as part of our subclass's methods.

Before going into implementation details, we will briefly go through a few essentials including a short discussion of the balancing task and the robot model. We will also have a look at the Space class, an essential component for creating observations.

The Balancing Bot

A balancing bot is a robot in an inverted pendulum configuration with two wheels on the same axis. The inverted pendulum is a naturally unstable configuration, so the aim is to keep the body of the robot upright and optionally allow movement at a desired velocity. The robot controller accepts orientation and angular velocity information from an IMU and is also aware of the angular velocity of the wheels (either indirectly or using an encoder). The controller's output consists of commands to either increase or decrease the wheel angular velocity in either direction, so as to maintain balance.

The Robot Model

pyBullet accepts robot models defined in the URDF format, which is an XML format used by ROS. I have prepared a basic balancing robot model that I am sharing below:
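Since the exact model can't be reproduced here, the following is a hypothetical minimal sketch of what such a URDF looks like: a box body with two continuous wheel joints. All dimensions, masses, and inertias below are illustrative placeholders, not the values of the original model; note the right wheel joint is rotated 180 degrees around the z axis relative to the left one.

```xml
<?xml version="1.0"?>
<robot name="balancebot">
  <link name="base_link">
    <visual>
      <geometry><box size="0.08 0.16 0.4"/></geometry>
    </visual>
    <collision>
      <geometry><box size="0.08 0.16 0.4"/></geometry>
    </collision>
    <inertial>
      <mass value="1.0"/>
      <inertia ixx="0.02" ixy="0" ixz="0" iyy="0.02" iyz="0" izz="0.01"/>
    </inertial>
  </link>

  <link name="left_wheel">
    <visual>
      <geometry><cylinder radius="0.05" length="0.02"/></geometry>
    </visual>
    <collision>
      <geometry><cylinder radius="0.05" length="0.02"/></geometry>
    </collision>
    <inertial>
      <mass value="0.1"/>
      <inertia ixx="0.0001" ixy="0" ixz="0" iyy="0.0001" iyz="0" izz="0.0001"/>
    </inertial>
  </link>

  <joint name="left_wheel_joint" type="continuous">
    <parent link="base_link"/>
    <child link="left_wheel"/>
    <origin xyz="0 0.1 -0.2" rpy="-1.5708 0 0"/>
    <axis xyz="0 0 1"/>
  </joint>

  <link name="right_wheel">
    <visual>
      <geometry><cylinder radius="0.05" length="0.02"/></geometry>
    </visual>
    <collision>
      <geometry><cylinder radius="0.05" length="0.02"/></geometry>
    </collision>
    <inertial>
      <mass value="0.1"/>
      <inertia ixx="0.0001" ixy="0" ixz="0" iyy="0.0001" iyz="0" izz="0.0001"/>
    </inertial>
  </link>

  <joint name="right_wheel_joint" type="continuous">
    <parent link="base_link"/>
    <child link="right_wheel"/>
    <origin xyz="0 -0.1 -0.2" rpy="-1.5708 0 3.1416"/>
    <axis xyz="0 0 1"/>
  </joint>
</robot>
```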

The balancing bot comprises a rectangular body with two cylindrical wheels, one on each side. Total height is around 50 cm (19.7 in). You can paste this snippet into a file named balancebot_simple.xml and save it in the same folder as the file.

The Space Class

The Space class provides a standardized way of defining action and observation spaces. There are many subclasses of Space included in the Gym, but in this tutorial we will deal with just two: spaces.Box and spaces.Discrete. The first is a generic class for n-dimensional continuous domains. Think of it as an n-dimensional numpy array. The Discrete space represents a finite set of values. In fact, the Discrete space itself does not hold any values; rather, it only represents a single non-negative index, corresponding to the selected value.

In this tutorial, we will be describing the observation space using a 3-dimensional Box space corresponding to three continuous values: robot inclination (pitch), angular velocity and wheel angular velocity. The action space will be a Discrete space of nine values, each corresponding to progressively greater (negative and positive) changes in commanded wheel angular speed.
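As a sketch, the two spaces described above could be constructed like this (the exact bounds are the ones used in this post: ±pi for pitch and angular velocity, ±5 for wheel angular velocity):

```python
import math

import numpy as np
from gym import spaces

# Observation: pitch, pitch angular velocity, wheel angular velocity.
observation_space = spaces.Box(
    np.array([-math.pi, -math.pi, -5.0]),
    np.array([math.pi, math.pi, 5.0]))

# Action: one of nine discrete wheel-speed adjustments.
action_space = spaces.Discrete(9)

print(observation_space.shape)  # (3,)
print(action_space.n)           # 9
```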

We’ll discuss usage of the Space class later on when implementing the reward, observation and reset routines.

Attributes of the Env class

Environments in OpenAI Gym are subclasses of the gym.Env Python class. According to the Env class documentation, there are three attributes that need to be set in every Env subclass:

action_space: The Space object corresponding to valid actions.

observation_space: The Space object corresponding to valid observations.

reward_range: A tuple corresponding to the min and max possible rewards.

We will be setting these during the initialization of our Env subclass. Our implementation of the __init__ method will be as follows:

Methods of the Env class

A look at the Env class code shows that there are five methods we need to override, whose signatures are as follows:
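As a sketch, the stubs look roughly like this; the exact signatures may differ slightly between Gym versions, and _close is included here as an assumption to make up the five (we will only discuss the other four below):

```python
import gym


class BalancebotEnv(gym.Env):
    """Method stubs only; the implementations follow below."""

    def _seed(self, seed=None):
        pass

    def _step(self, action):
        pass

    def _reset(self):
        pass

    def _render(self, mode='human', close=False):
        pass

    def _close(self):
        pass
```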

The _step method

The _step method is where the magic happens. _step is called once for each time step of the simulation, accepts an action from the agent, and has to return a tuple with the following four items:

  • An observation
  • A reward
  • A boolean flag indicating whether an episode has ended
  • An info dictionary.

Of those, only the first three are essential; the last one is optional, and a blank dict may be returned in its place. The observation is a numpy array object that needs to conform to the dimensions specified when assigning the observation_space attribute during instance initialization. For instance, if your observation space is a box as follows:
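A sketch of such a 3-dimensional Box, using the bounds from this post:

```python
import math

import numpy as np
from gym import spaces

# Pitch, pitch angular velocity, wheel angular velocity.
observation_space = spaces.Box(
    np.array([-math.pi, -math.pi, -5.0]),
    np.array([math.pi, math.pi, 5.0]))
```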

then your returned observation should be a numpy array containing three values, one for each dimension:
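Something like this, with made-up values:

```python
import numpy as np

# One value per dimension: pitch, pitch angular velocity, wheel velocity.
observation = np.array([0.05, -0.12, 1.3])
```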

The reward is a scalar value that represents the reward an agent gets for performing this particular action at the current state of the environment.

The third item is a boolean flag that indicates the end of an episode. Returning True here will trigger a call to _reset, which, depending on the implementation, will have the environment reset itself.

The last item is really up to you. I haven't yet tried using it, but returning a blank dictionary is fine.

Here is a made up example of what a typical _step implementation return statement would look like:

Obviously, instead of fixed values, your observations and rewards should be computed from the current state of the environment and the action passed as an argument to _step.

Our implementation of the _step method is as follows:

The _step method itself has a few high-level calls to other methods that perform specific actions, and which we’ll be getting into right away.

The _assign_throttle method is as follows:

In this method, the following steps take place: First, deltav, the change in wheel speed, is determined by selecting the element at the index specified by the action (the array contents are arbitrary and could be different). Then, self.vt is adjusted by deltav. Finally, the pyBullet method setJointMotorControl2() is called to update the angular velocity of the robot wheels with the new value.

The next call in the _step method is to stepSimulation(), which advances the simulation by one step. Next, three methods are called in sequence that compute and return the observation, reward and done flag, respectively. The content of each one follows:

In _compute_observation, the pyBullet-specific methods getBasePositionAndOrientation, getEulerFromQuaternion and getBaseVelocity are used to get the robot pitch and angular velocity.

_compute_reward uses the same methods, getBasePositionAndOrientation and getEulerFromQuaternion, in order to determine the pitch. It then returns a reward combining the pitch value (0 being upright) and the absolute difference between the current and desired wheel speeds (self.vt - self.vd). self.vd is initially set to zero, and at this point it is not modifiable.

_compute_done checks two things: The first is whether the center of mass of the robot is below 15 cm (5.9 in). If so, the robot has fallen over, and the episode has ended. In addition, it checks the environment step counter, so as to limit the duration of each episode.

The final steps in the _step method implementation are to increment the environment step counter by one, and return the observation, reward and done flag.

The _reset method

The _reset method is responsible for both resetting and initializing the environment, as it is also called once after the class is initialized. Our implementation of _reset is as follows:

The _render method

The _render method is where visualization-related actions take place. In our case the pyBullet environment takes care of all visualization, so there isn't anything to do in this method. We will leave it blank.

The _seed method

Finally, the _seed method sets the seed for this environment's random number generator:

Putting it all together

The complete file contents are as shown below:

You can paste the snippet above into the empty file you created earlier, and you'll be good to go.

Running the balancing bot simulation

It’s now time to put the script we wrote at the very beginning of this tutorial to good use. Here it is presented again for convenience:

Paste the contents into a file just outside your balancebot-env folder, and run it:

You should see the pyBullet viewer popping up, and the robot should already be hitting the floor by the time you read this! Something like this:

But don’t worry! It will improve quickly. After around 2 minutes (depending on the speed of your machine) the bot should be able to stand upright for most of the time. You can use Ctrl+C in the terminal to quit the simulation, or let it complete the predefined number of cycles, after which it will save the best model and quit.

What’s next?

This post just scratches the surface of what is possible with Gym, Baselines and pyBullet. Regarding the balancing bot task in particular, there are a few ideas for advancing it:

  • Add sensor noise: The values we currently get are squeaky clean, as they come straight from software. In real life, reading sensors would involve some level of noise, so one idea is to mix some noise into the observation values. Don’t worry, this will not reduce the performance of the balancing bot; on the contrary, it will make it more robust.
  • Add bumps and obstacles: A flat plane is good for starters, but how will the balancing bot behave on uneven ground filled with obstacles?
  • Build more advanced robot models: Boston Dynamics’ Handle is a balancing bot at heart but it is so much more agile due to all the articulations and suspension it includes! Control of a complex robot like the Handle would be an ideal scenario for a reinforcement learning algorithm!


In this post we completed the implementation of the balancing bot task and saw in detail what each part of the Env subclass does, getting a working Gym+Baselines+pyBullet experiment at the end. It’s good fun to watch the little bot learning to balance, but there is so much more to explore!

The code for this tutorial series is now available on GitHub.

Do you have any questions or comments regarding this tutorial? Ask and share in the comments below!

Happy hacking!

28 replies on “Build a Balancing Bot with OpenAI Gym, Pt II: The Robot and Environment”
  1. You said to keep the file outside the balancebot-env folder, but no such folder is mentioned in the directory tree in the first part of the tutorial.
    Also, I am getting an import error: No module named ‘balance-bot’ while running the script.

    1. Hi and thanks for your comment. You’re right, I’ve amended the post with instructions on making a project folder as a first step. Regarding your error, have you run pip install -e . inside your balancebot-env folder? Also, are you running all of these commands inside a Conda environment?

      1. I ran pip install -e . inside the balance-bot folder (where is present) and it successfully installed balance-bot. But now I am getting “ImportError: cannot import name ‘balancebotEnv’” while running.
        I am running all the commands inside the conda environment.

        1. The only thing I can think of is that your balancebotEnv should be with a capital B, i.e. BalancebotEnv. I’ll be publishing the code in this tutorial to GitHub once holidays are over and I get back to my computer 🙂

          1. Thanks, it’s working fine. But after the bot has learnt to balance itself quite properly, its performance goes down again in the end, i.e. the mean 100-episode reward increases at the beginning, reaches a max value and then decreases.
            How to avoid that?

          2. It seems that this is caused by the buffer_size parameter being much smaller than the max_timesteps parameter. I’ve since found that setting them to the same value eliminates this problem.

  2. Just to point out, a lot of ‘0’s seem to be missing from the code listed on this web page (for example line 44), so it would not compile if copy-pasted from here, and there are differences from the code on GitHub.

    1. There are indeed some differences with github code (which was updated even after this post was published).

      Can you elaborate on the issue you are facing running the code? It seems line 44 of does not contain any 0s.


  3. Code is working great, but pybullet for some reason is not opening. I am just getting results in the command prompt window. What could be the issue?


      1. Thank you, I added the following line to the and this did the job!

        import pybullet as p; p.connect(p.GUI)

        1. It would also be helpful to assign the client returned by connect() to a variable like shown in the post, i.e. self.physicsClient = p.connect(p.GUI)

    1. For me, I looked and saw ./envs/ had by default “render=False”.

      Therefore for file ./ I added to that render should be True:

    1. One joint is rotated 180 degrees around the z axis, so it needs the opposite throttle setting to rotate in the same direction.

      1. Thank you for the reply. My confusion is more like: why isn’t the policy outputting 2 velocities, one for the left and one for the right wheel? Based on that, these are then assigned to each wheel. Thank you for your time.

        1. Your suggestion could also work, of course. In this post, however, I was looking to introduce the simplest version of the problem. Having a policy with two outputs could introduce undesired outcomes, e.g. think of a robot that learns to “trick” the environment by continuously spinning and never falling. To address this issue one would need to find a more involved objective function, and then other issues may arise, etc. Simplifying to a “pseudo-2D” setup does away with these issues in a way that is realistic for the most part.

  4. Fantastic tutorial! This really helped me out a whole bunch. I have a question though. Is it possible to run the best model independently somehow? Like just watch it balance.

  5. You have taken the range of angular velocity as -pi to pi, but for the angular velocity of the wheel, -5 to 5. I am a bit confused: is it the angular velocity of the wheel, or the velocity of the wheel, which is angular velocity times wheel radius?

  6. Fantastic tutorial! It helped me out so much! One question though, is there a way to view/run the best round? I see that in the ‘’ it saves ‘balance.pkl’ which is the trained model. How would I be able to watch it? Similar to the ‘’ from your neuroevolution tutorial. Thanks again!

    1. Not really, but you could easily make one yourself using the enjoy script of the repo you linked and initializing the environment with the render=True parameter. Then you can call the _step() method to advance the simulation.

  7. Thanks for this simple and very understandable tutorial! (I don’t find many OpenAI tutorials even at

    My question is: since you want to keep the robot balanced as long as possible, why don’t you include envStepCounter in the reward?

  8. While running ‘python’, I am getting an AttributeError: module ‘baselines.deepq.models’ has no attribute ‘mlp’.
    Please help me out, as I am not able to find any proper solution anywhere.
