Trust Region Policy Optimization (TRPO)

TRPO aims to improve the policy steadily, limiting how far each update may move from the current policy.

Policy Gradient Theorem

This section covers the policy gradient theorem in the discounted-reward setting.
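
For reference, one common way to write the theorem in the discounted setting is sketched here. The notation is an assumption, not taken from these notes: \pi_\theta is the policy with parameters \theta, \gamma in [0, 1) is the discount factor, r is the reward function, and Q^{\pi_\theta} is the discounted action-value function.

```latex
% Discounted objective (assumed notation)
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}
  \left[ \sum_{t=0}^{\infty} \gamma^{t} \, r(s_t, a_t) \right]

% Discounted action-value function
Q^{\pi_\theta}(s, a) = \mathbb{E}_{\tau \sim \pi_\theta}
  \left[ \sum_{k=0}^{\infty} \gamma^{k} \, r(s_{t+k}, a_{t+k})
  \;\middle|\; s_t = s,\; a_t = a \right]

% Policy gradient theorem, discounted setting
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}
  \left[ \sum_{t=0}^{\infty} \gamma^{t} \,
  \nabla_\theta \log \pi_\theta(a_t \mid s_t) \,
  Q^{\pi_\theta}(s_t, a_t) \right]
```

Under the same assumptions, the result can also be written as an expectation over the discounted state visitation distribution d(s) = (1 - \gamma) \sum_t \gamma^t P(s_t = s), which contributes a factor of 1 / (1 - \gamma) in front of the expectation.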