# Blog

## 5000+ Members

### MEETUPS

LEARN, CONNECT, SHARE

### JOB POSTINGS

INDEED POSTINGS

Browse through the latest deep learning, AI, and machine learning postings from Indeed for the GTA.

### CONTACT

CONNECT WITH US

Are you looking to sponsor space, be a speaker, or volunteer? Feel free to give us a shout.

# [D] Understanding proof of MaxEnt theorem

I’m reading Brian Ziebart’s work on maximum causal entropy optimization for inverse reinforcement learning. I’m working through a few of his thesis chapters to get a deeper understanding, but I’ve gotten stuck on one particular proof: the first line of the proof of Theorem 6.10. The theorem follows easily after the first line, but I can’t make sense of the logic behind that first line.

In a nutshell, the theorem shows that under a maximum causal entropy distribution, the likelihood of any policy π increases in proportion to the expected reward (linear in [state, action] features) under that policy. However, to prove this, he starts off by writing

P(π) = ∏ over all trajectories (A, S) of P_MaxEnt(A, S)^π(A, S).

I don’t understand where this equation comes from. It seems strange to me that it raises maximum entropy distribution probabilities to the power of the policy probabilities. I would greatly appreciate it if anyone could help me understand this.

The theorem is from his thesis (pg. 210), available here: http://www.cs.cmu.edu/~bziebart/publications/thesis-bziebart.pdf

Full theorem and proof included below:

https://i.redd.it/61807bbw17o31.png

https://i.redd.it/b6ps5amx17o31.png

submitted by /u/celestialquestrial
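One way to read that first line (a standard identity, offered as an interpretation of the notation rather than anything stated in the thesis excerpt): if the exponent π(A, S) is the probability the policy assigns to trajectory (A, S), then the product of powers is just the exponential of an expected log-likelihood, so taking logs turns it into a cross-entropy-style expectation:

```latex
% Interpreting P(\pi) as a geometric mixture of MaxEnt trajectory
% probabilities, weighted by the policy's trajectory probabilities:
\log P(\pi)
  = \log \prod_{(A,S)} P_{\mathrm{MaxEnt}}(A,S)^{\pi(A,S)}
  = \sum_{(A,S)} \pi(A,S)\,\log P_{\mathrm{MaxEnt}}(A,S)
  = \mathbb{E}_{(A,S)\sim\pi}\!\bigl[\log P_{\mathrm{MaxEnt}}(A,S)\bigr].
```

Under this reading, since log P_MaxEnt is (up to normalization) linear in the reward features, the log-likelihood of π grows linearly with the expected feature counts, i.e., the expected reward under π, which would match what the theorem asserts. Whether this is exactly Ziebart's intended construction is an assumption here.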


# Plug yourself into AI and don't miss a beat

Toronto AI is a social and collaborative hub to unite AI innovators of Toronto and surrounding areas. We explore AI technologies in digital art and music, healthcare, marketing, fintech, VR, robotics, and more. Toronto AI was founded by Dave MacDonald and Patrick O'Mara.