[D] Understanding proof of MaxEnt theorem
I’m reading Brian Ziebart’s work on maximum causal entropy optimization for inverse reinforcement learning. I’ve been working through a few of his thesis chapters to get a deeper understanding, but I’ve gotten stuck on one particular proof: the first line of the proof of Theorem 6.10. The theorem follows easily from the first line, but I can’t make sense of the logic behind that first line.

In a nutshell, the theorem shows that under a maximum causal entropy distribution, the likelihood of any policy pi increases in proportion to the expected reward (linear in [state, action] features) under that policy. To prove this, however, he starts by writing P(pi) as the product, over all trajectories (A, S), of P_MaxEnt(A, S)^pi(A, S). I don’t understand where this equation comes from. It seems strange to me that it raises maximum entropy distribution probabilities to the power of the policy probabilities. I would greatly appreciate it if anyone could help me understand this.

The theorem is from his thesis (pg. 210), available here: http://www.cs.cmu.edu/~bziebart/publications/thesis-bziebart.pdf

Full theorem and proof included below:

submitted by /u/celestialquestrial
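Edit: for readability, here is the first line of the proof as I've transcribed it, rendered in LaTeX, along with its log form (the log step is just standard algebra on the product, my own rewriting, not taken from the thesis):

```latex
% First line of the proof of Theorem 6.10, as transcribed from the thesis:
% P(pi) is a product over all trajectories (A, S), with the MaxEnt trajectory
% probability raised to the power of the policy's trajectory probability.
\[
  P(\pi) = \prod_{(A, S)} P_{\text{MaxEnt}}(A, S)^{\pi(A, S)}
\]
% Taking logs turns the product into a sum, so the exponent pi(A, S) acts
% as a weight on each log-probability:
\[
  \log P(\pi) = \sum_{(A, S)} \pi(A, S) \, \log P_{\text{MaxEnt}}(A, S)
\]
```

So in log space the expression is the expectation of log P_MaxEnt(A, S) over trajectories drawn from pi, which may be relevant to why the exponents appear, though that reading is my own and not confirmed by the thesis.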