[D] Inverse reinforcement learning without assuming the agent’s behaviour is optimal?
As I understand it (and please correct me if I’m wrong), running inverse reinforcement learning (IRL) and then reinforcement learning on the recovered reward will eventually produce the same policy as supervised learning/behavioural cloning. Inverse RL assumes the demonstrating agent’s behaviour is optimal, so the learned policy ends up just imitating that agent.
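A toy sketch of that reasoning, assuming a hypothetical single-state problem with three discrete actions and a brute-force "IRL" step that searches a small grid of candidate rewards (all names here, such as `irl_assuming_optimal`, are illustrative and not from any library). If IRL keeps the reward under which the demonstrations look optimal, planning on that reward recovers the demonstrator's most common action, which is what behavioural cloning predicts too.

```python
from collections import Counter
from itertools import product

# Hypothetical toy setting: a single state with three discrete actions.
ACTIONS = ["left", "stay", "right"]


def behavioural_cloning(demos):
    # With one state, supervised imitation reduces to predicting the most frequent action.
    return Counter(demos).most_common(1)[0][0]


def unique_best(reward):
    # Return the action with the strictly highest reward, or None if the top is tied.
    top = max(reward.values())
    winners = [a for a in ACTIONS if reward[a] == top]
    return winners[0] if len(winners) == 1 else None


def irl_assuming_optimal(demos, levels=(0.0, 1.0, 2.0)):
    # Assumed "IRL": search a small grid of rewards (one value per action) and keep
    # the reward under which the most demonstrated actions are the unique optimum.
    best_reward, best_score = None, -1
    for values in product(levels, repeat=len(ACTIONS)):
        reward = dict(zip(ACTIONS, values))
        optimal = unique_best(reward)
        score = sum(1 for a in demos if a == optimal)
        if score > best_score:
            best_reward, best_score = reward, score
    return best_reward


def rl(reward):
    # With one state and a known reward, the optimal policy is simply the argmax.
    return max(ACTIONS, key=lambda a: reward[a])


demos = ["right", "right", "stay", "right"]
print(behavioural_cloning(demos))       # right
print(rl(irl_assuming_optimal(demos)))  # right
```

Because the toy has only one state, it illustrates the optimality assumption alone, not how the two approaches behave in states the demonstrator never visited.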
Let’s say you want to perform a task better than the agent does. Has there been any research on deriving a reward function from an agent’s behaviour without assuming that behaviour is optimal?