[D]My ML Journal #11: Macro view of reinforcement learning and more OpenAI gym games

Thanks for the support in my last post.

Here’s the vlog version of this journal as usual:

Most of the resources I used and talked about in this post can be found in this Google Doc. I am not linking many directly because I think my posts were getting auto-classified as spam for having too many links.

I spent a few hours researching just what the heck is going on with RL. I debriefed the most-starred GitHub projects, the SOTA (state-of-the-art) algorithms, and the open-source platforms. Generally speaking, RL should be used whenever a problem can be modeled with an agent, environment, and reward setup. Technically speaking, we can train a lot of models with an RL mindset. For example, a traditional GAN has a generator and a discriminator, and the generator is trained by how much it fooled the discriminator. With an RL mindset, we can define the generator as the agent, its state being the random strokes it painted, and the reward being the degree to which it fooled the discriminator. It's still the same workflow, but now we introduce more possibilities: we can tune the reward algorithms, the RL training mechanisms, etc.
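To make that agent/environment/reward framing concrete, here is a tiny toy loop in plain Python. All the names here are my own, not from gym or any RL library: the agent observes a state, picks an action, and the environment hands back the next state and a reward.

```python
import random

class Environment:
    """A toy environment: the agent is rewarded for guessing the state's parity."""
    def __init__(self):
        self.state = random.randint(0, 9)

    def step(self, action):
        # Reward 1 if the action matches the current state's parity, else 0.
        reward = 1 if action == self.state % 2 else 0
        self.state = random.randint(0, 9)  # move to the next state
        return self.state, reward

class Agent:
    def act(self, state):
        # A trivial hand-coded policy: always answer the state's parity.
        return state % 2

env = Environment()
agent = Agent()
state = env.state
total_reward = 0
for _ in range(10):
    action = agent.act(state)
    state, reward = env.step(action)
    total_reward += reward

print(total_reward)  # 10: this toy policy is always right
```

Real RL replaces the hand-coded policy with one that is learned from the reward signal, but the loop itself looks just like this.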

My general plan for learning RL is to implement gym games first, then play around with complex environments like Project Malmo and ViZDoom, and at last move on to the Unity RL environment, where I'll make my own game and train my own RL agent to beat it!

So right now, let me implement a few gym games first. This time, I am trying to beat CartPole and Acrobot.

[gif] CartPole: the objective is to balance the pole so it doesn't tilt over 15 degrees.

Since I had copied and understood the Atari Breakout code, I thought this was going to be a piece of cake! Interestingly, the observation from the environment turned out to be an array of 4 floats, while I thought it was going to be a picture (like in Atari). That changes my game plan: I can't just reuse the Atari code here. In Atari, we predict an action by feeding in a 210 × 160 × 3 image, which is the state; that image goes through a few Conv layers and connects to a Dense layer with 4 outputs representing the actions we can choose (we pick the output with the highest value, because that represents the action expected to yield the highest reward). But a CartPole state is a 4 × 1 array, so I decided to feed it through a simple fully connected network and connect it to a Dense layer with 2 outputs at the end.
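As a sketch of that shape difference, here is what the CartPole side looks like in plain NumPy: 4 input floats go through one hidden layer into 2 Q-values, and we act on the argmax. The weights are untrained and random, and the layer size is my own pick; the real thing would be a trained Keras/TensorFlow model.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small fully connected Q-network:
# 4 state floats -> 24 hidden units -> 2 Q-values (push left / push right).
W1 = rng.normal(size=(4, 24)); b1 = np.zeros(24)
W2 = rng.normal(size=(24, 2)); b2 = np.zeros(2)

def q_values(state):
    h = np.maximum(0.0, state @ W1 + b1)  # ReLU hidden layer
    return h @ W2 + b2                    # one Q-value per action

# [cart position, cart velocity, pole angle, pole angular velocity]
state = np.array([0.01, -0.02, 0.03, 0.04])
q = q_values(state)
action = int(np.argmax(q))  # choose the action with the highest predicted return
print(q.shape, action)
```

The Atari version would swap the first matrix multiply for a stack of Conv layers over the image, but the "argmax over a Dense output" part at the end is the same.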

But I still ended up using a great portion of gsurma's code, because the original Atari code wasn't the best. It is intuitive to do this:

    class AtariSolver:
        def __init__(self, ...):
            # define the model structure
        def saveMemory(self, ...):
            # self.memory.append the current state, reward, etc.,
            # so we can train the model from the memory variable later
        def getAction(self, ...):
            ...

But the Atari code scattered all of these across different places; they weren't contained inside a single class, which would have made a lot of sense. Here's my GitHub repository that beat CartPole. Again, huge credits to gsurma.
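Here is a slightly fleshed-out, runnable sketch of that single-class layout, using only the standard library. The names and sizes are hypothetical; the real solver would also hold the Keras model and a replay-training method.

```python
import random
from collections import deque

class CartPoleSolver:
    """Sketch of the single-class DQN layout described above."""
    def __init__(self, n_actions=2, memory_size=2000):
        self.n_actions = n_actions
        self.memory = deque(maxlen=memory_size)  # replay buffer
        self.epsilon = 1.0                       # exploration rate

    def save_memory(self, state, action, reward, next_state, done):
        # Store one transition so we can train from replayed batches later.
        self.memory.append((state, action, reward, next_state, done))

    def get_action(self, state, q_values):
        # Epsilon-greedy: explore with probability epsilon, else exploit.
        if random.random() < self.epsilon:
            return random.randrange(self.n_actions)
        return max(range(self.n_actions), key=lambda a: q_values[a])

solver = CartPoleSolver()
solver.save_memory([0.0, 0.0, 0.0, 0.0], 1, 1.0, [0.0, 0.0, 0.1, 0.0], False)
print(len(solver.memory))  # 1
```

Keeping the replay buffer, the action choice, and (in the real version) the model all inside one class is exactly what makes gsurma's code easier to read than the scattered Atari version.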

Now onto my nemesis: the Acrobot.

[gif] Acrobot.

The observation of an Acrobot is an array of 6 floats, so I thought I could beat it with the same approach as CartPole and use the Deep Q-Learning algorithm to train the RL agent.

I was wrong! I am still trying to motivate this skinny blue dude to cross the black line (the objective of the game)! The agent receives a negative reward for every frame it fails to reach the black line. The problem is, a lot of negative points won't help it make the best choice if the agent never experiences a positive reward. And since the default Deep Q-Learning algorithm decreases the exploration rate over time, the agent will try random things less and less. At least, that's my read on it. I ran this model for 100 iterations, and every run terminated because it reached the maximum of 500 timesteps.
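To put a number on how fast exploration dries up: with a typical decay schedule (start at 1.0, multiply by 0.995 per episode, floor at 0.01; these are values I'm assuming, since schedules vary), the agent is almost fully greedy in under 1000 episodes, whether or not it ever stumbled onto a success.

```python
# A typical epsilon-greedy decay schedule (assumed values; schedules vary).
epsilon, eps_min, eps_decay = 1.0, 0.01, 0.995

episodes = 0
while epsilon > eps_min:
    epsilon *= eps_decay
    episodes += 1

print(episodes)  # 919: exploration hits the floor after 919 episodes
```

If Acrobot's first success typically needs more random episodes than that, the agent stops exploring before it ever sees a positive signal, which would explain runs that just time out.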

It's okay. By the next time I report back to you, I will have defeated this skinny blue dude. I looked into something that might help: on the OpenAI Gym leaderboard, someone beat Acrobot with an algorithm called PPO (Proximal Policy Optimization). It seems really hard to understand mathematically, but I will understand it, beat Acrobot, and share it with you next time!
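For anyone who wants a head start, the core of PPO is its clipped surrogate objective, and a single-sample toy version fits in a few lines. This is my own simplification for intuition, not a full implementation (a real PPO also needs an actor-critic network, advantage estimation, and batched updates):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for one sample.

    ratio: new_policy_prob / old_policy_prob for the action taken.
    advantage: how much better the action was than expected.
    """
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)

# A positive-advantage action whose probability grew 1.5x:
# the clip caps the incentive at 1.2 * 2.0 = 2.4.
print(ppo_clip_objective(1.5, 2.0))  # 2.4
```

The clipping is what keeps each policy update small, which is exactly the stability property that makes PPO a popular choice on sparse-reward games like Acrobot.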

submitted by /u/RedditAcy