Paper Summary: Proximal Policy Optimization Algorithms
Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.

WHAT
Proximal Policy Optimization (PPO) is an enhancement of Trust Region Policy Optimization (TRPO), a policy-gradient algorithm for learning RL policies.
WHY
Because TRPO is relatively complicated, and it is not compatible with architectures that include noise (such as dropout) or parameter sharing between the policy and value functions.
HOW
PPO keeps TRPO's core idea of a "proximity" constraint: at each iteration the new policy is not allowed to move too far from the previous one, which avoids "destructively large" policy updates and helps the algorithm learn reliably.
TRPO enforces this with a hard KL-divergence constraint (solved as a constrained optimization problem), whereas PPO builds the proximity notion into the objective function itself, either by clipping the probability ratio between the new and old policies (the \(clip\) variant) or by adding a KL-divergence penalty term (see the objective below).
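For reference, the clipped surrogate objective from the paper, where \(r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\) is the probability ratio and \(\hat{A}_t\) is the advantage estimate:
\[
L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]
\]
The \(\min\) makes this a pessimistic (lower) bound on the unclipped objective, so the policy gains nothing by pushing the ratio outside \([1-\epsilon,\, 1+\epsilon]\).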
CLAIMS
- Clipping vs KL-divergence penalty in the objective function: using the \(clip\)-based surrogate objective yields better results in the paper's experiments than using a \(KL\)-divergence penalty (fixed or adaptive) in the objective function (see the sketch below).
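To make the two variants concrete, here is a minimal NumPy sketch of both surrogate objectives (my own illustration, not code from the paper; the toy numbers and the default \(\epsilon\) and \(\beta\) values are arbitrary):

```python
import numpy as np

def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Clip variant: mean of min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return np.mean(np.minimum(unclipped, clipped))

def kl_penalized_surrogate(ratio, advantage, kl, beta=1.0):
    """KL-penalty variant: mean of r*A minus beta times a per-sample KL term."""
    return np.mean(ratio * advantage - beta * kl)

# Toy batch: probability ratios pi_new/pi_old, advantage estimates, KL terms.
ratio = np.array([0.8, 1.0, 1.5])
advantage = np.array([1.0, -0.5, 2.0])
kl = np.array([0.02, 0.0, 0.15])

print(clipped_surrogate(ratio, advantage))           # objective to maximize (clip)
print(kl_penalized_surrogate(ratio, advantage, kl))  # objective to maximize (KL penalty)
```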
EXTENDS/USES
- Trust Region Policy Optimization (TRPO) (Schulman et al., 2015)
NOTES
Proximal: The word "proximal" in the title refers to the proximity constraint that keeps each new version of the policy "close" to the previous one, to avoid destructive optimization steps.
Policy-gradient methods (such as PPO) attempt to learn the policy function, which is a probability distribution over actions given a state. Value-based methods are the other main type of RL algorithm; they attempt to learn the value function, which estimates the expected (discounted) return from a given state. Two different things.
Where do neural nets come in? They are used to approximate both the value function and the policy function.
Value function and policy function trained together: although IMO this is not strongly highlighted in the text, the value function and the policy function are optimized together, via a combined objective \(L^{CLIP+VF+S}\) that adds a value-function loss term and an entropy bonus to the clipped policy objective (see the sketch below).
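To make the parameter-sharing point concrete, here is a minimal PyTorch sketch of an actor-critic network with a shared torso and separate policy and value heads (my own illustration; the ActorCritic name, layer sizes, and activations are assumptions, not taken from the paper):

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class ActorCritic(nn.Module):
    """Shared-parameter network: one torso, two heads.
    The policy head outputs action logits, the value head outputs V(s)."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, hidden), nn.Tanh())
        self.policy_head = nn.Linear(hidden, n_actions)  # actor
        self.value_head = nn.Linear(hidden, 1)           # critic

    def forward(self, obs):
        h = self.torso(obs)
        return Categorical(logits=self.policy_head(h)), self.value_head(h).squeeze(-1)

# Both heads are trained jointly: gradients from the policy loss and the
# value loss both flow through the shared torso.
net = ActorCritic(obs_dim=4, n_actions=2)
dist, value = net(torch.randn(1, 4))
action = dist.sample()
print(action.item(), dist.log_prob(action).item(), value.item())
```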
MY 2¢
Value function: IMO not enough detail is given about the value-function part of the algorithm, nor about the fact that the value function is trained alongside the policy function.
Now, is PPO really a policy-gradient method? Although most sources classify it under "policy-gradient", some others put it under "actor-critic" models, which seems more accurate given that it learns both the value function and the policy function concurrently.