Balancing bot using Neuroevolution

In a series of previous posts I’ve discussed setting up and training a controller for a balancing bot using reinforcement learning. In this post I’m going to discuss another technique, neuroevolution, for the same task. Neuroevolution optimizes the parameters of an artificial neural network using an evolutionary algorithm. Parameters may refer to the weights of the network, its topology, or both. Neuroevolution is an old method, dating back to the 1990s; recently, however, it has been revisited by OpenAI, Uber and others.

This experiment began out of necessity rather than curiosity. I was trying to transfer the model obtained through reinforcement learning to a physical balancing bot running a Raspberry Pi 3 (more on that in the next post). That meant also installing the OpenAI baselines package, as it contains all the routines for saving and loading models. Baselines requires Python 3.5, but there is no precompiled TensorFlow binary for that version on the Pi. I found a tutorial on Github that covers compiling TensorFlow for the Raspberry Pi, so I tried that. Unfortunately, during the process I ran into several problems and errors, the most recent of which is still open. At the end of a weekend-long sprint, my productivity was approaching the chart below:

[Figure: productivity pie chart]

This was when I considered getting rid of baselines completely and trying an alternative approach. I have experience with genetic algorithms from grad school and my teaching position, so I quickly whipped together a problem definition and algorithm using the DEAP framework. The result was very satisfying: a very simple implementation of neuroevolution exceeded the performance I had achieved with baselines.

Background

Neuroevolution applies the principles of evolutionary algorithms (such as genetic algorithms) to the optimization of neural networks. A population of candidate policies, known as individuals, is tested in the environment, and a score for each is obtained using the reward function. The individuals then undergo a performance-biased selection process whose details depend on the particular evolutionary algorithm; the selected individuals are a diverse blend of well-performing policies. These are then tweaked and recombined with each other using algorithm-dependent strategies. Recombination and variation of individuals are the mechanisms by which the search space is explored and new, better solutions are introduced. The process is repeated until a termination criterion is met.

Environment

I partially reused the environment I built for an earlier post using OpenAI Gym, extending its functionality and removing the Gym dependencies. The result is a self-contained module that depends only on numpy and DEAP. I also extended the task by adding uneven ground, sensor noise and randomized changes in the desired speed of the bot. I added checkpointing and saving functionality, and a little “player” that lets you watch your controller drive around after it’s trained. I’ve also made some minor tweaks to the physics parameters and increased the resolution of the geometry imported from the URDF.

All of this is now available on Github.
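
For reference in the snippets that follow, here is a minimal sketch of the interface I’ll assume the environment exposes. The class and method names (BalancebotEnv, reset, step returning an observation, a reward and a done flag) are illustrative assumptions, not the exact API of the module on Github:

```python
import numpy as np

class BalancebotEnv:
    """Assumed interface of the self-contained balancing-bot environment.
    The real module on Github wraps the physics simulation and adds uneven
    ground, sensor noise and a randomized desired speed."""

    def __init__(self, render=False):
        self.render = render  # whether to visualize the simulation

    def reset(self):
        """Start a new episode and return the initial observation vector."""
        return np.zeros(5)  # placeholder observation

    def step(self, action):
        """Advance the simulation by one time step.
        Returns (observation, reward, done)."""
        return np.zeros(5), 0.0, False  # placeholder transition
```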

Algorithm

The algorithm in use is a very simple variant of a genetic algorithm. In fact, it’s modeled after a basic DEAP example, with the only difference being the addition of elitism. Elitism is a widely applied technique in evolutionary computation that ensures that the best-performing individuals from current and past generations are passed on unchanged to the next. It can speed up convergence to good solutions. The algorithm, with utility functions stripped, is outlined below:
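
A rough sketch of such a loop with DEAP is shown below. The toolbox setup, parameter count and operator settings are illustrative assumptions rather than the project’s exact configuration, and the evaluate function it registers is sketched in the next section:

```python
import random
from deap import base, creator, tools

N_PARAMS = 5 * 16 + 16 * 1   # hypothetical parameter count; must match the policy network (next section)

creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", list, fitness=creator.FitnessMax)

toolbox = base.Toolbox()
toolbox.register("attr_float", random.uniform, -1.0, 1.0)
toolbox.register("individual", tools.initRepeat, creator.Individual,
                 toolbox.attr_float, n=N_PARAMS)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("evaluate", evaluate)              # evaluate() is sketched in the next section
toolbox.register("mate", tools.cxBlend, alpha=0.5)
toolbox.register("mutate", tools.mutGaussian, mu=0.0, sigma=0.3, indpb=0.1)
toolbox.register("select", tools.selTournament, tournsize=3)

def evolve(pop_size=50, n_gen=100, cxpb=0.5, mutpb=0.3, n_elite=2):
    # Create and evaluate the initial population.
    pop = toolbox.population(n=pop_size)
    for ind, fit in zip(pop, map(toolbox.evaluate, pop)):
        ind.fitness.values = fit

    for gen in range(n_gen):
        # Elitism: clone the best individuals straight into the next generation.
        elite = [toolbox.clone(ind) for ind in tools.selBest(pop, n_elite)]

        # Fill the rest of the next generation by tournament selection.
        offspring = [toolbox.clone(ind) for ind in toolbox.select(pop, pop_size - n_elite)]

        # Crossover on consecutive pairs.
        for c1, c2 in zip(offspring[::2], offspring[1::2]):
            if random.random() < cxpb:
                toolbox.mate(c1, c2)
                del c1.fitness.values, c2.fitness.values

        # Gaussian mutation.
        for mutant in offspring:
            if random.random() < mutpb:
                toolbox.mutate(mutant)
                del mutant.fitness.values

        # Re-evaluate only the individuals whose fitness was invalidated.
        invalid = [ind for ind in offspring if not ind.fitness.valid]
        for ind, fit in zip(invalid, map(toolbox.evaluate, invalid)):
            ind.fitness.values = fit

        pop = elite + offspring
        print("gen %3d  best %.2f" % (gen, tools.selBest(pop, 1)[0].fitness.values[0]))

    return tools.selBest(pop, 1)[0]
```

The only departure from the stock example is that the n_elite best individuals are copied into the next generation unchanged; the remaining slots are filled by tournament selection, blend crossover and Gaussian mutation.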

The evaluate function performs the evaluation of each policy. It loads the individual’s parameters into the neural network and runs an episode of the environment:
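
A sketch, assuming a small feedforward policy implemented directly in numpy; the layer sizes and the set_weights/policy helpers are hypothetical names chosen just to make the snippet self-contained, and DEAP expects the fitness to be returned as a tuple:

```python
import numpy as np

# Hypothetical network dimensions; the real values depend on the
# environment's observation and action vectors.
N_INPUTS, N_HIDDEN, N_OUTPUTS = 5, 16, 1

def set_weights(individual):
    """Unpack an individual's flat parameter vector into layer matrices."""
    params = np.asarray(individual)
    split = N_INPUTS * N_HIDDEN
    w1 = params[:split].reshape(N_INPUTS, N_HIDDEN)
    w2 = params[split:].reshape(N_HIDDEN, N_OUTPUTS)
    return w1, w2

def policy(obs, w1, w2):
    """Feedforward pass: one tanh hidden layer, tanh output in [-1, 1]."""
    return np.tanh(np.tanh(obs @ w1) @ w2)

def evaluate(individual):
    """Fitness function: load the individual's parameters into the network
    and return the episode reward as a one-element tuple."""
    w1, w2 = set_weights(individual)
    total_reward = run_episode(lambda obs: policy(obs, w1, w2))
    return (total_reward,)
```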

Finally, run_episode is responsible for actually running a full episode until done and returning the total reward:
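
A sketch, using the environment interface assumed earlier and a 1000-step cap to match the numbers in the Results section:

```python
def run_episode(policy_fn, max_steps=1000, render=False):
    """Run one episode to completion and return the accumulated reward."""
    env = BalancebotEnv(render=render)
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy_fn(obs)               # query the policy for the next action
        obs, reward, done = env.step(action)
        total_reward += reward
        if done:                              # e.g. the bot fell over
            break
    return total_reward
```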

EDIT: Later in development I amended the run_episode function to perform the evaluation more than once. This ensures that an agent doesn’t get “lucky” and obtain an abnormally high score. The amended code is below:
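
A sketch of the amended version, which averages the reward over a few episodes; the number of repeats, and the choice of averaging rather than, say, taking the minimum, are assumptions:

```python
def run_episode(policy_fn, n_repeats=3, max_steps=1000, render=False):
    """Evaluate a policy over several episodes and return the mean reward,
    so that a single lucky episode cannot inflate an individual's fitness."""
    env = BalancebotEnv(render=render)
    totals = []
    for _ in range(n_repeats):
        obs = env.reset()
        total_reward = 0.0
        for _ in range(max_steps):
            action = policy_fn(obs)
            obs, reward, done = env.step(action)
            total_reward += reward
            if done:
                break
        totals.append(total_reward)
    return sum(totals) / n_repeats
```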

Results

As seen in an earlier post, the reward for each time step is as follows:

V_w=0.1 - |V_t - V_d| \cdot 0.005

The maximum number of time steps is set to 1000, so with a per-step reward capped at 0.1 the maximum score a policy can achieve is 100. Previous models trained with reinforcement learning (deep Q-learning) attained a score of around 60-80 on the simplified balancing task. In my experiments the neuroevolved policy routinely attains a score greater than 90 on the more complex task, which is close to actually solving it completely.

Initially this seemed suspicious, so I threw together a simple script to check the performance of the best policy found. The script, sketched below, loads the policy and runs the environment repeatedly, visualizing the output. As you can see in the video below, the bot performs surprisingly well:
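
The script is essentially a load-and-replay loop; a rough sketch, with the checkpoint file name and the numpy save format as assumptions, reusing the helpers from the earlier snippets:

```python
import numpy as np

# Load the best individual's parameters (assumed to have been saved with np.save).
best = np.load("best_individual.npy")
w1, w2 = set_weights(best)

# Replay the policy indefinitely with visualization enabled.
while True:
    score = run_episode(lambda obs: policy(obs, w1, w2), render=True)
    print("score: %.1f" % score)
```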

Conclusion

In this post I investigated neuroevolution as an alternative to reinforcement learning in the context of building a controller for a balancing bot. In my experiments the former performs much better, attaining policies that are closer to the maximum score for the task. The next step is to plug the controller into a physical balancing bot platform and check its performance!

What do you think of neuroevolution as an alternative to reinforcement learning? Have you experimented with it yourself? Share your experience in the comments below!
