Paper Summary: Proximal Policy Optimization Algorithms
Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.

WHAT
Proximal Policy Optimization (PPO) is an enhancement of Trust Region Policy Optimization (TRPO), a policy-gradient algorithm for learning RL policies.
WHY
Because TRPO is relatively complicated, and it is not compatible with architectures that include noise (such as dropout) or parameter sharing between the policy and value functions.
HOW
PPO keeps TRPO's core idea of a "proximity" constraint: at each iteration the new policy is not allowed to move too far from the previous one, which avoids "destructively large" policy updates and helps the algorithm learn reliably.
TRPO enforces this with a hard KL-divergence constraint (solved as a constrained optimization problem), whereas PPO builds the proximity notion into the objective function itself, either by clipping the probability ratio between the new and old policies (the \(clip\) variant) or by adding a KL-divergence penalty term (see the objective below).
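For reference, the clipped surrogate objective from the paper, where \(r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\) is the probability ratio and \(\hat{A}_t\) is the advantage estimate:
\[
L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]
\]
The \(\min\) makes this a pessimistic (lower) bound on the unclipped objective, so the policy gains nothing by pushing the ratio outside \([1-\epsilon,\, 1+\epsilon]\).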
CLAIMS
- Clipping vs KL-divergence penalty in the objective function: using the \(clip\)-based surrogate objective yields better results in the paper's experiments than using a \(KL\)-divergence penalty (fixed or adaptive) in the objective function (see the sketch below).
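To make the two variants concrete, here is a minimal NumPy sketch of both surrogate objectives (my own illustration, not code from the paper; the toy numbers and the default \(\epsilon\) and \(\beta\) values are arbitrary):

```python
import numpy as np

def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Clip variant: mean of min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return np.mean(np.minimum(unclipped, clipped))

def kl_penalized_surrogate(ratio, advantage, kl, beta=1.0):
    """KL-penalty variant: mean of r*A minus beta times a per-sample KL term."""
    return np.mean(ratio * advantage - beta * kl)

# Toy batch: probability ratios pi_new/pi_old, advantage estimates, KL terms.
ratio = np.array([0.8, 1.0, 1.5])
advantage = np.array([1.0, -0.5, 2.0])
kl = np.array([0.02, 0.0, 0.15])

print(clipped_surrogate(ratio, advantage))           # objective to maximize (clip)
print(kl_penalized_surrogate(ratio, advantage, kl))  # objective to maximize (KL penalty)
```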
EXTENDS/USES
- Trust Region Policy Optimization (TRPO) (Schulman et al., 2015)
NOTES
Proximal: The word "proximal" in the title refers to the proximity constraint that keeps each new version of the policy "close" to the previous one, to avoid destructive optimization steps.
Policy-gradient methods (such as PPO) attempt to learn the policy function, which is a probability distribution over actions given a state. Value-based methods are the other main type of RL algorithm; they attempt to learn the value function, which estimates the expected (discounted) return from a given state. Two different things.
Where do neural nets come in? They are used to approximate both the value function and the policy function.
Value function and policy function trained together: although IMO this is not strongly highlighted in the text, the value function and the policy function are optimized together, via a combined objective \(L^{CLIP+VF+S}\) that adds a value-function loss term and an entropy bonus to the clipped policy objective (see the sketch below).
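To make the parameter-sharing point concrete, here is a minimal PyTorch sketch of an actor-critic network with a shared torso and separate policy and value heads (my own illustration; the ActorCritic name, layer sizes, and activations are assumptions, not taken from the paper):

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class ActorCritic(nn.Module):
    """Shared-parameter network: one torso, two heads.
    The policy head outputs action logits, the value head outputs V(s)."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, hidden), nn.Tanh())
        self.policy_head = nn.Linear(hidden, n_actions)  # actor
        self.value_head = nn.Linear(hidden, 1)           # critic

    def forward(self, obs):
        h = self.torso(obs)
        return Categorical(logits=self.policy_head(h)), self.value_head(h).squeeze(-1)

# Both heads are trained jointly: gradients from the policy loss and the
# value loss both flow through the shared torso.
net = ActorCritic(obs_dim=4, n_actions=2)
dist, value = net(torch.randn(1, 4))
action = dist.sample()
print(action.item(), dist.log_prob(action).item(), value.item())
```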
MY 2¢
Value function: IMO not enough detail is given about the value-function part of the algorithm, nor about the fact that the value function is trained alongside the policy function.
Now, is PPO really a policy-gradient method? Although most sources classify it under "policy-gradient", some others put it under "actor-critic" models, which seems more accurate given that it learns both the value function and the policy function concurrently.