I took some courses related to reinforcement learning during my Ph.D. studies, but I have not touched the topic for a long time. Recently, I have been working on end-to-end autonomous driving, and the success of reinforcement learning in large language models has brought me back to this topic. In this post, we introduce the basic concepts and commonly used algorithms in reinforcement learning. More detailed information can be found in the reference book [1] and the OpenAI RL page.

The interaction between environment and agent
The interaction between environment and agent. [1]

The Environment and Agent

The environment is the world in which the agent operates. The agent is the entity that makes decisions and takes actions. We denote the core components of reinforcement learning as follows (following the notation in [1], with some abuse of notation between scalars and vectors):

  • $S$: the environment state
  • $A$: the agent action
  • $R$: the reward from the environment after taking an action in a given state; it should reflect the agent’s goal
  • $p(S_{t+1}, R_{t+1}|S_{0:t}, A_{0:t})$: the transition probability of the next state $S_{t+1}$ and reward $R_{t+1}$ given the history of states and actions
  • $\pi(A|S)$: the policy that the agent follows given the state $S$; it can be stochastic or deterministic
  • $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$: the return, where $\gamma$ is the discount factor. Discounting is not strictly necessary, for example, in the episodic case. The return can also be defined as the average of the rewards.
  • $v(s) = \mathbb{E}[G_t|S_t = s]$: the value function of state $s$, i.e., the expected return starting from state $s$.
  • $q(s, a) = \mathbb{E}[G_t|S_t = s, A_t = a]$: the value function of the state-action pair $(s, a)$, i.e., the expected return starting from $(s, a)$.
  • $A(s, a) = q(s, a) - v(s)$: the advantage function of the state-action pair $(s, a)$, i.e., the difference between the state-action value and the state value.
  • Prediction: we care more about the state value function $v(s)$ (in prediction, we need to estimate how good a state is).
  • Control: we care more about the state-action value function $q(s, a)$ (in control, we need to choose the best action in a state).

When we write an expectation, there are usually only two sources of randomness: the environment transition and the policy; we sometimes omit the corresponding subscripts for simplicity.

Markov Decision Process (MDP)

In reinforcement learning, we usually assume the environment is a Markov Decision Process, i.e., mathematically,

$$ p(S_{t+1}, R_{t+1}| S_{0:t}, A_{0:t}) = p(S_{t+1}, R_{t+1}| S_t, A_t). $$

Bellman Equation

Recap the definition of the value function,

$$ \begin{aligned} v(s) &= \mathbb{E}[G_t|S_t = s] \\ &= \mathbb{E}[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} |S_t = s]\\ &= \mathbb{E}[R_{t+1} + \gamma G_{t+1} |S_t = s] \\ &= \mathbb{E}_{S_{t+1}|S_t}\mathbb{E}_{\cdot|S_{t+1}, S_t}[R_{t+1} + \gamma G_{t+1}] \\ &=\mathbb{E}_{S_{t+1}|S_t}\mathbb{E}_{\cdot|S_{t+1}}[R_{t+1} + \gamma G_{t+1}]\\ &= \sum_{a, s', r}\pi(a|s) p(s', r|s, a) [r + \gamma\mathbb{E}_{\cdot|S_{t+1}=s'} (G_{t+1}|S_{t+1}=s')]\\ & = \sum_{a, s', r}\pi(a|s) p(s', r|s, a) [r + \gamma v(s')]\\ \end{aligned} \tag{1} $$

Similarly, the Bellman equation for the action-value function is

$$ \begin{aligned} q(s, a) &= \mathbb{E}[G_t|S_t = s, A_t = a]\\ &= \mathbb{E}[R_{t+1} + \gamma G_{t+1}|S_t = s, A_t = a]\\ &= \mathbb{E}_{S_{t+1}|S_t, A_t}\mathbb{E}_{\cdot|S_{t+1}, S_t, A_t}[R_{t+1} + \gamma G_{t+1}]\\ &=\mathbb{E}_{S_{t+1}|S_t, A_t}\mathbb{E}_{\cdot|S_{t+1}}[R_{t+1} + \gamma G_{t+1}]\\ &=\sum_{r, s'}p(s', r|s, a)[r + \gamma\mathbb{E}[G_{t+1}|S_{t+1}=s']]\\ & = \sum_{r, s'}p(s', r|s, a)[r + \gamma\mathbb{E}_{A_{t+1}|S_{t+1}=s'}\mathbb{E}[G_{t+1}|S_{t+1}=s', A_{t+1}]]\\ & = \sum_{r, s'}p(s', r|s, a)[r + \gamma\mathbb{E}_{A_{t+1}|S_{t+1}=s'}[q(s', A_{t+1})]]\\ &= \sum_{r, s'}p(s', r|s, a)[r + \gamma\sum_{a'}\pi(a'|s')q(s', a')] \end{aligned}\tag{2} $$

From the above derivation, we can also see the relationship between the value function and the action-value function,

$$ v(s) = \sum_{a}\pi(a|s)q(s, a) $$

$$ q(s, a) = \sum_{s', r}p(s', r|s, a)[r + \gamma v(s')] $$

Solving the Bellman Equation

Let’s write the Bellman operator for the state value function as

$$ \begin{aligned} \mathcal{H} \circ v(s) &= \sum_{a}\pi(a|s)\sum_{s', r} p(s', r|s, a)[r + \gamma v(s')]\\ &= \sum_{r} p(r|s) r + \gamma \sum_{s'} p(s'|s) v(s') \\ &= \bar{R}_s + \gamma P_s \vec{v} \end{aligned} $$

where $\bar{R}_s$ is the expected reward from state $s$ and $P_s$ is the row vector of transition probabilities from state $s$ to all other states. Writing the above in matrix form, we have

$$ \mathcal{H} \circ \vec{v} = \bar{\mathbf{R}} + \gamma \mathbf{P} \vec{v} $$

We can then solve the Bellman equation directly as a linear system,

$$ \vec{v} = \bar{\mathbf{R}} + \gamma \mathbf{P} \vec{v} \quad\Longrightarrow\quad \vec{v} = (\mathbf{I} - \gamma \mathbf{P})^{-1}\bar{\mathbf{R}}. $$

We can also resort to an iterative method by observing that $\mathcal{H}$ is a contraction mapping: for any two vectors $\mathbf{x}$ and $\mathbf{y}$, using the infinity norm (for which a row-stochastic matrix satisfies $\|\mathbf{P}\|_\infty = 1$),

$$ \|\mathcal{H} \circ \mathbf{x} - \mathcal{H} \circ \mathbf{y}\|_\infty = \|\gamma \mathbf{P} (\mathbf{x} - \mathbf{y})\|_\infty \le \gamma \|\mathbf{P}\|_\infty\|\mathbf{x} - \mathbf{y}\|_\infty = \gamma \|\mathbf{x} - \mathbf{y}\|_\infty < \|\mathbf{x} - \mathbf{y}\|_\infty $$

This allows us to use the fixed point method to solve the Bellman equation.

Similarly, we can also prove that the operation in the Bellman equation for the action-value function is a contraction mapping.
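To make this concrete, here is a minimal numpy sketch (the three-state transition matrix and rewards are made up for illustration) that solves $\vec{v} = \bar{\mathbf{R}} + \gamma \mathbf{P} \vec{v}$ both by the direct linear solve and by fixed-point iteration; the two results should agree up to numerical precision.

```python
import numpy as np

# A made-up 3-state Markov reward process (the policy is already folded in):
# P[s, s'] is the transition probability, R_bar[s] the expected reward from s.
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])
R_bar = np.array([1.0, 0.0, 2.0])
gamma = 0.9

# Direct solve of the linear system v = R_bar + gamma * P v.
v_direct = np.linalg.solve(np.eye(3) - gamma * P, R_bar)

# Fixed-point iteration v <- H(v); converges because H is a contraction.
v = np.zeros(3)
for _ in range(1000):
    v = R_bar + gamma * P @ v

print(v_direct)  # closed-form solution
print(v)         # should match v_direct up to numerical precision
```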

Optimal Policy

The optimal policy, $\pi_*$, satisfies

$$\forall s, \quad v_{\pi_*}(s) \ge v_\pi(s)$$

That is to say, the optimal policy will yield a greater or equal value function than any other policy for any state. The optimal policy may not be unique, but all optimal policies share the same optimal value function, $v_*$.

Optimal Policy Bellman Equation

$$ \begin{aligned} v_*(s) &= \max_{a} q_*(s, a) \\ &= \max_{a} \mathbb{E}_{\pi_*}[R_{t+1} + \gamma G_{t+1}|S_t = s, A_t = a] \\ &= \max_{a} \mathbb{E}_{\pi_*}[R_{t+1} + \gamma v_*(S_{t+1})|S_t = s, A_t = a] \\ &= \max_{a} \sum_{s', r} p(s', r|s, a)[r + \gamma v_*(s')] \end{aligned} $$

and the state-action value function of the optimal policy satisfies,

$$ \begin{aligned} q_*(s, a) &= \mathbb{E}_{\pi_*}[R_{t+1} + \gamma G_{t+1}|S_t = s, A_t = a] \\ &= \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1})|S_t = s, A_t = a] \\ &= \mathbb{E}[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a')|S_t = s, A_t = a] \\ &= \sum_{s', r} p(s', r|s, a)[r + \gamma \max_{a'} q_*(s', a')] \end{aligned} $$

Similar to the ordinary Bellman equation, we can also prove it is a contraction mapping, and then use the fixed point method to solve it.

Suppose there are two value functions $v_1$ and $v_2$, and denote the optimal Bellman operator by $\mathcal{H}$. Then,

$$ \begin{aligned} \|\mathcal{H}v_1 - \mathcal{H}v_2 \|_\infty &= \max_{s} | \mathcal{H}v_1(s) - \mathcal{H}v_2(s) | \\ &= \max_{s} | \max_{a} \sum_{s', r} p(s', r|s, a)[r + \gamma v_1(s')] - \max_{a} \sum_{s', r} p(s', r|s, a)[r + \gamma v_2(s')] | \\ &\overset{(a)}{\le} \max_{s, a} | \sum_{s', r} p(s', r|s, a)[r + \gamma v_1(s')] - \sum_{s', r} p(s', r|s, a)[r + \gamma v_2(s')] | \\ &= \max_{s, a} \gamma |\sum_{s'} p(s'|s, a)[v_1(s') - v_2(s')] | \\ &\le \gamma \max_{s, a} \sum_{s'} p(s'|s, a) \cdot \max_{s'} \{|v_1(s') - v_2(s')|\} \\ &= \gamma \|\vec{v_1} - \vec{v_2}\|_\infty\\ &<\|\vec{v}_1 - \vec{v}_2\|_\infty \end{aligned} $$

where (a) holds because, for any functions $f$ and $g$, $|\max_x f(x) - \max_x g(x)| \le \max_x |f(x) - g(x)|$. Similarly, for the optimal action-value Bellman equation, we have

$$ \begin{aligned} \|\mathcal{H}q_1 - \mathcal{H}q_2 \|_\infty &= \max_{s, a} | \mathcal{H}q_1(s, a) - \mathcal{H}q_2(s, a) | \\ &= \max_{s, a} | \sum_{s', r} p(s', r|s, a)[r + \gamma \max_{a'} q_1(s', a')] - \sum_{s', r} p(s', r|s, a)[r + \gamma \max_{a'} q_2(s', a')] | \\ &\le \gamma\max_{s, a} \sum_{s'} p(s'| s, a)|\max_{a'}q_1(s', a') - \max_{a'}q_2(s', a')| \\ &\le \gamma \max_{s, a} \sum_{s'} p(s'| s, a) \cdot \max_{a'} \{|q_1(s', a') - q_2(s', a')|\} \\ &\le \gamma \max_{s', a'} \{|q_1(s', a') - q_2(s', a')|\} \\ &= \gamma \|\vec{q_1} - \vec{q_2}\|_\infty\\ &< \|\vec{q}_1 - \vec{q}_2\|_\infty \end{aligned} $$

Thus we have shown that the optimal Bellman operator is a contraction mapping for both the state and state-action value functions.

Dynamic Programming

Policy Evaluation

Using the fixed point method, we can iteratively update

$$ v_{k+1}(s) = \sum_{a}\pi(a|s)\sum_{s', r} p(s', r|s, a)[r + \gamma v_k(s')] $$

By running this update repeatedly until convergence, we obtain the state (or state-action) value function under policy $\pi$. However, note that this method requires full knowledge of the environment, i.e., the transition probability $p(s', r|s, a)$.

Policy Improvement

Suppose we have two deterministic policies, $\pi$ and $\pi'$, such that

$$ q_\pi(s, \pi'(s)) \ge v_\pi(s) \quad \forall s \in \mathcal{S} $$

Then, the policy $\pi'$ is as good as or better than $\pi$, i.e.

$$ v_{\pi'}(s) \ge v_\pi(s) \quad \forall s \in \mathcal{S} $$

This can be proved as follows:

$$ \begin{aligned} v_{\pi}(s) &\leq q_{\pi}(s, \pi'(s)) \\ &= \mathbb{E} [ R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s, A_t = \pi'(s)] \\ &= \mathbb{E}_{\pi'} [ R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s] \\ &\leq \mathbb{E}_{\pi'} [ R_{t+1} + \gamma q_{\pi}(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s] \\ &= \mathbb{E}_{\pi'} [ R_{t+1} + \gamma \mathbb{E} [ R_{t+2} + \gamma v_{\pi}(S_{t+2}) \mid S_{t+1}, A_{t+1} = \pi'(S_{t+1})] \mid S_t = s] \\ &= \mathbb{E}_{\pi'} [ R_{t+1} + \gamma R_{t+2} + \gamma^2 v_{\pi}(S_{t+2}) \mid S_t = s] \\ &\leq \mathbb{E}_{\pi'} [ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 v_{\pi}(S_{t+3}) \mid S_t = s] \\ &\vdots \\ &\leq \mathbb{E}_{\pi'} [ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots \mid S_t = s] \\ &= v_{\pi'}(s). \end{aligned} $$

The third equation holds because $\pi'$ is a deterministic policy.

Policy improvement takes a greedy step on the original policy with respect to the value function of the original policy,

$$ \begin{aligned} \pi'(s) &= \underset{a}{\text{argmax}} q_\pi(s, a) \\ &= \underset{a}{\text{argmax}} \sum_{s', r} p(s', r| s, a) [r + \gamma v_\pi(s')] \end{aligned} $$

If the new policy $\pi'$ is as good as, but not better than, the old policy $\pi$, then $v_{\pi'} = v_\pi$ satisfies the optimal Bellman equation, and hence both policies are optimal.

Policy and Value Iteration

$$ \pi_0 \xrightarrow{\mathbf{E}} v_{\pi_0} \xrightarrow{\mathbf{I}} \pi_1 \xrightarrow{\mathbf{E}} v_{\pi_1} \xrightarrow{\mathbf{I}} \pi_2 \xrightarrow{\mathbf{E}} \cdots \xrightarrow{\mathbf{I}} \pi_* \xrightarrow{\mathbf{E}} v_* $$

where $\xrightarrow{\mathbf{E}}$ denotes policy evaluation and $\xrightarrow{\mathbf{I}}$ denotes policy improvement. Policy iteration alternates the two steps until the policy is stable; value iteration truncates the evaluation to a single sweep by applying the optimal Bellman operator directly.
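To make both steps concrete, below is a minimal tabular policy iteration sketch on a made-up random MDP (the arrays `p` and `r` are arbitrary illustrative numbers, not from any real problem): exact policy evaluation via the linear solve, followed by greedy policy improvement, repeated until the policy is stable.

```python
import numpy as np

# A made-up MDP with 3 states and 2 actions.
# p[s, a, s'] is the transition probability, r[s, a] the expected reward.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
p = rng.random((n_states, n_actions, n_states))
p /= p.sum(axis=2, keepdims=True)          # normalize to valid probabilities
r = rng.random((n_states, n_actions))

pi = np.zeros(n_states, dtype=int)          # deterministic policy: state -> action

for _ in range(100):
    # Policy evaluation: solve v = r_pi + gamma * P_pi v exactly.
    P_pi = p[np.arange(n_states), pi]       # (n_states, n_states)
    r_pi = r[np.arange(n_states), pi]
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

    # Policy improvement: act greedily with respect to q(s, a).
    q = r + gamma * p @ v                   # (n_states, n_actions)
    new_pi = q.argmax(axis=1)
    if np.array_equal(new_pi, pi):          # stable policy => optimal
        break
    pi = new_pi

print("optimal policy:", pi)
print("optimal values:", v)
```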

Monte Carlo Methods

According to the definition of the value function, if we sample whole trajectories, we can estimate the value function by averaging the sampled returns.

the first visit Monte Carlo method
The first visit Monte Carlo method. [1]
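Below is a small sketch of first-visit Monte Carlo prediction in the spirit of the algorithm above; `generate_episode` (and the toy chain used to exercise it) is a placeholder for whatever environment and policy you actually have.

```python
import numpy as np
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, gamma=0.9, n_episodes=1000):
    """Estimate v_pi(s) by averaging first-visit returns.
    `generate_episode` must return a list of (state, reward) pairs,
    where the reward is the one received after leaving that state."""
    returns = defaultdict(list)
    for _ in range(n_episodes):
        episode = generate_episode()
        g = 0.0
        first_return = {}
        # Walk backwards to accumulate returns G_t; earlier visits overwrite
        # later ones, so only the first-visit return survives.
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            g = r + gamma * g
            first_return[s] = g
        for s, g_first in first_return.items():
            returns[s].append(g_first)
    return {s: np.mean(gs) for s, gs in returns.items()}

# Illustrative usage with a trivial two-state toy chain (-1 marks termination).
def toy_episode(rng=np.random.default_rng()):
    episode, s = [], 0
    while s != -1:
        r = 1.0 if s == 1 else 0.0
        episode.append((s, r))
        s = int(rng.choice([0, 1, -1]))
    return episode

print(first_visit_mc_prediction(toy_episode))
```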

We can also evaluate the state-action value function and improve the policy as:

Monte Carlo Exploring Starts
Monte Carlo Exploring Starts. [1]

Off-policy Prediction via Importance Sampling

Suppose there are two policies: $b$ is the behavior policy, according to which we simulate the process and generate episodes, and $\pi$ is the target policy, whose state or state-action value function we want to learn. To use importance sampling, we need the coverage assumption: every action taken under $\pi$ is also taken, at least occasionally, under $b$, i.e., $\pi(a|s) > 0$ implies $b(a|s) > 0$.

Importance sampling is a general technique used to estimate expected values under one distribution while sampling from another. In reinforcement learning, we use it to estimate values under the target policy $\pi$ while sampling trajectories from the behavior policy $b$.

Let’s consider a trajectory $\tau = (S_0, A_0, R_1, S_1, A_1, R_2, ..., S_T)$ of length $T$. The probability of this trajectory under policy $\pi$ is:

$$ p_{\pi}(\tau) = p(S_0) \prod_{t=0}^{T-1} \pi(A_t|S_t)p(S_{t+1}, R_{t+1}|S_t, A_t) $$

Similarly, the probability under the behavior policy $b$ is:

$$ p_{b}(\tau) = p(S_0) \prod_{t=0}^{T-1} b(A_t|S_t)p(S_{t+1}, R_{t+1}|S_t, A_t) $$

The importance sampling ratio is:

$$ \rho_{0:T-1} = \frac{p_{\pi}(\tau)}{p_{b}(\tau)} = \frac{p(S_0) \prod_{t=0}^{T-1} \pi(A_t|S_t)p(S_{t+1}, R_{t+1}|S_t, A_t)}{p(S_0) \prod_{t=0}^{T-1} b(A_t|S_t)p(S_{t+1}, R_{t+1}|S_t, A_t)} = \prod_{t=0}^{T-1}\frac{\pi(A_t|S_t)}{b(A_t|S_t)} $$

The environment dynamics terms cancel out, leaving only the ratio of policy probabilities. We can use this importance sampling ratio to correct the returns when estimating value functions:

$$ v_{\pi}(s) = \mathbb{E}_{\pi}[G_t|S_t=s] = \mathbb{E}_{b}[\rho_{t:T-1} G_t|S_t=s] $$
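A small numerical sketch of the ordinary importance-sampling estimate; the two tabular policies and the episode below are made up purely to show how the ratio is formed and applied to a return.

```python
import numpy as np

def is_ratio(states, actions, pi, b):
    """rho_{t:T-1} = prod_t pi(A_t|S_t) / b(A_t|S_t) for one episode."""
    return np.prod([pi[s, a] / b[s, a] for s, a in zip(states, actions)])

# Made-up tabular policies over 2 states x 2 actions (rows sum to 1).
pi = np.array([[0.9, 0.1], [0.2, 0.8]])   # target policy
b  = np.array([[0.5, 0.5], [0.5, 0.5]])   # behavior policy (covers pi)

# Suppose an episode generated under b visited these states/actions and returned G.
states, actions, G = [0, 1, 0], [0, 1, 1], 3.0
rho = is_ratio(states, actions, pi, b)
weighted_return = rho * G   # one sample of rho * G_t; average over episodes to estimate v_pi
print(rho, weighted_return)
```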

Temporal-Difference Learning

$$ V(S_t) \gets V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right] $$

which uses the sampled reward $R_{t+1}$ and the current estimate of the next state's value, $V(S_{t+1})$, to update the current state value $V(S_t)$. This is a moving average between the current estimate $V(S_t)$ and the bootstrapped target $R_{t+1} + \gamma V(S_{t+1})$.

This quantity,

$$ \delta_t \doteq R_{t+1} + \gamma V(S_{t+1}) - V(S_t), $$

called the TD error, arises in various forms throughout reinforcement learning.

tabular TD(0) for estimating the state value function
Tabular TD(0) for estimating the state value function. (Chapter 6 of [1])
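Here is a minimal tabular TD(0) prediction sketch corresponding to the algorithm above; `env_reset`, `env_step`, and `policy` are assumed interfaces you would supply for your own environment.

```python
import numpy as np

def td0_prediction(env_reset, env_step, policy, n_states,
                   alpha=0.1, gamma=0.9, n_episodes=500):
    """Tabular TD(0): V(S_t) <- V(S_t) + alpha [R_{t+1} + gamma V(S_{t+1}) - V(S_t)].
    `env_reset()` returns an initial state; `policy(s)` returns an action;
    `env_step(s, a)` returns (next_state, reward, done)."""
    V = np.zeros(n_states)
    for _ in range(n_episodes):
        s = env_reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env_step(s, a)
            target = r if done else r + gamma * V[s_next]   # no bootstrap from terminal
            V[s] += alpha * (target - V[s])                  # TD-error update
            s = s_next
    return V
```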

Convergent Conditions

  1. Ergodicity of the environment’s Markov chain. For TD(0) to converge, the environment’s Markov chain, induced by the policy, must be ergodic. This means:

    • The environment is stationary;
    • Any state can eventually be reached from any other state (imagine what happens if the chain falls into a cycle of states it never leaves);
    • All states are visited infinitely often, which is crucial for the algorithm to learn about every state.
  2. Step size and the Robbins-Monro framework. The step size $\alpha_t$ plays a pivotal role in ensuring convergence, and the analysis aligns with stochastic approximation theory, in particular the Robbins-Monro conditions:

    • $\sum_{t=1}^{\infty} \alpha_t = \infty$, ensuring that the updates are significant enough initially to overcome random fluctuations and explore the value space;

    • $\sum_{t=1}^{\infty} \alpha_t^2 < \infty$, ensuring that the updates eventually become small enough to allow convergence without excessive oscillation.

    A typical choice is $\alpha_t = \frac{1}{t}$. Although such a schedule satisfies the Robbins-Monro conditions, other choices (e.g., a small constant step size) can be preferable for non-stationary environments.

Please find the full proof in [2].

Advantages of TD Methods

“TD methods have an advantage over DP methods in that they do not require a model of the environment, of its reward and next-state probability distributions. The next most obvious advantage of TD methods over Monte Carlo methods is that they are naturally implemented in an online, fully incremental fashion.” [1]

Sarsa: On-policy TD Control

$$ Q(S_t, A_t) \gets Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]. $$
Sarsa algorithm
Sarsa algorithm. (Chapter 6 of [1])

Q-learning: Off-policy TD Control

$$ Q(S_t, A_t) \gets Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t) \right]. $$
Q-learning algorithm
Q-learning algorithm. (Chapter 6 of [1])

The expected Sarsa algorithm is a variant of Q-learning that uses the expected value of the next state-action value under the policy instead of the maximum value.

$$ Q(S_t, A_t) \gets Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \sum_{a}\pi(a|S_{t+1})Q(S_{t+1},a) - Q(S_t, A_t) \right]. $$
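To see the three control targets side by side, here is a small sketch of the Sarsa, Q-learning, and Expected Sarsa targets as pure functions of a Q table; an $\epsilon$-greedy policy is assumed when forming the expectation.

```python
import numpy as np

def epsilon_greedy_probs(q_row, epsilon=0.1):
    """Action probabilities of an epsilon-greedy policy for one state."""
    n = len(q_row)
    probs = np.full(n, epsilon / n)
    probs[np.argmax(q_row)] += 1.0 - epsilon
    return probs

def sarsa_target(Q, r, s_next, a_next, gamma=0.9):
    # On-policy: bootstrap from the action actually taken next.
    return r + gamma * Q[s_next, a_next]

def q_learning_target(Q, r, s_next, gamma=0.9):
    # Off-policy: bootstrap from the greedy action.
    return r + gamma * Q[s_next].max()

def expected_sarsa_target(Q, r, s_next, gamma=0.9, epsilon=0.1):
    # Bootstrap from the expectation under the (epsilon-greedy) policy.
    probs = epsilon_greedy_probs(Q[s_next], epsilon)
    return r + gamma * probs @ Q[s_next]

def td_control_update(Q, s, a, target, alpha=0.1):
    Q[s, a] += alpha * (target - Q[s, a])
```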

Bootstrapping

Unified View of the TD, Monte Carlo, and DP Methods

Let’s define the n-step return as

$$ G_{t:t+n} \triangleq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n}), $$

Then, we can pick any $n$-step return as the target for estimating the value function (a small helper for computing it is sketched after the diagrams below). The following shows the backup diagrams of the TD, Monte Carlo, and DP methods.

TD methods Monte Carlo methods DP methods
Backup diagrams of TD, Monte Carlo, and DP methods. [from David Silver's RL course, Lecture 4, "Model-Free Prediction"]
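A small helper for computing $G_{t:t+n}$ from a recorded list of rewards and the current value table; the indexing convention `rewards[k]` $= R_{k+1}$ is an assumption of this sketch.

```python
import numpy as np

def n_step_return(rewards, V, s_tpn, t, n, gamma=0.9):
    """G_{t:t+n} = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n}).
    `rewards[k]` holds R_{k+1}; `s_tpn` is the state reached at time t+n."""
    g = sum(gamma**k * rewards[t + k] for k in range(n))
    return g + gamma**n * V[s_tpn]

# n = 1 recovers the TD(0) target; n = remaining episode length (with the
# terminal state's value set to 0) recovers the Monte Carlo return.
```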

Policy Gradient Methods

In this section, we introduce policy gradient methods. These methods learn a parameterized policy and select actions directly from it; we may still need to learn a value function. Let us denote $\pi(a|s;\boldsymbol{\theta})= P(A_t=a | S_t =s; \boldsymbol{\theta})$ as the policy, where $\boldsymbol{\theta}$ is the parameter, and let $v(s; \mathbf{w})$ be the value function, where $\mathbf{w}$ is the parameter.

Policy Gradient Theorem

In this section, we consider the episodic case, where the process starts with some initial state $s_0$ (not random). Our goal is to maximize the expected return starting from $s_0$, i.e. $J(\boldsymbol{\theta}) = \mathbb{E}_{\pi}[G_0|S_0=s_0]$.

The state value function $v(s)$ is a function of the policy, so we can compute its gradient with respect to the policy parameter $\boldsymbol{\theta}$:

$$ \begin{aligned} \nabla v(s) &= \nabla \sum_{a} \pi(a|s) q(s, a) \\ &= \sum_{a} \nabla \pi(a|s)q(s, a) + \sum_a \pi(a|s) \nabla q(s, a) \\ &=\sum_a \nabla \pi(a|s)q(s, a) + \sum_a \pi(a|s) \nabla \sum_{s', r} p(s', r|s, a)[r + v(s')] \\ &= \sum_a \nabla \pi(a|s)q(s, a) + \sum_a \pi(a|s) \sum_{s'} p(s'|s, a) \nabla v(s') \\ &\overset{(a)}{=}\sum_a \nabla \pi(a|s)q(s, a) + \sum_{s'} p(s'|s) \nabla v(s')\\ &= \sum_a \nabla \pi(a|s)q(s, a) + \sum_{s'} p(s'|s)\sum_a \nabla \pi(a|s')q(s', a) +\sum_{s'}\sum_{s''}p(s'|s)p(s''|s')\nabla v(s'') \\ &= \cdots \\ &= \sum_{x \in \mathcal{S}} \sum_{K=0}^\infty P(s\to x, K, \pi) \sum_a \nabla \pi(a|x)q(x, a) \end{aligned} $$

where $P(s\to x, K, \pi)$ is the probability of reaching state $x$ from state $s$ in $K$ steps under policy $\pi$, and step (a) uses $p(s'|s) = \sum_a \pi(a|s) p(s'|s, a)$.

$$ \begin{aligned} \nabla J(\boldsymbol{\theta}) &= \nabla v(s_0) = \sum_{s\in \mathcal{S}} \sum_{K=0}^\infty P(s_0\to s, K, \pi) \sum_a \nabla \pi(a|s)q(s, a) \\ &= \sum_{s \in \mathcal{S}} \left(\sum_{K=0}^\infty P(s_0\to s, K, \pi)\right) \sum_a \nabla \pi(a|s)q(s, a) \\ &=\sum_{s \in \mathcal{S}} \eta(s) \sum_a\nabla \pi(a|s)q(s, a) \\ &=\sum_{s'}\eta(s') \sum_s \frac{\eta(s)}{\sum_{s'}\eta(s')} \sum_a\nabla \pi(a|s)q(s, a) \\ &= \sum_{s'}\eta(s') \sum_s \mu(s) \sum_a\nabla \pi(a|s)q(s, a) \\ &\propto \sum_s \mu(s) \sum_a\nabla \pi(a|s)q(s, a) \end{aligned} $$

where $\eta(s) = \sum_{K=0}^\infty P(s_0 \to s, K, \pi)$ depends only on the policy and the environment and can be interpreted as the expected number of visits to state $s$, and $\mu(s) = \eta(s)/\sum_{s'}\eta(s')$ is the on-policy state distribution. The above derivation sets the discount factor $\gamma = 1$. In the continuing case the proportionality becomes an equality; in the episodic case the constant of proportionality is the average episode length. Note that if we consider discounted rewards, the general conclusion does not change; in that case $P(s\to x, K, \pi)$ denotes the discounted probability of reaching state $x$ in $K$ steps under policy $\pi$.

Then, if we follow the policy $\pi$, the gradient can be computed by

$$ \nabla J(\boldsymbol{\theta}) = \mathbb{E}_{\pi}[\sum_a q(S, a; \mathbf{w}) \nabla \pi(a|S;\boldsymbol{\theta})] $$

Then, we can instantiate the stochastic gradient ascent algorithm to update the policy parameter:

$$ \boldsymbol{\theta}_{t+1} \gets \boldsymbol{\theta}_t + \alpha \sum_a q(S_t, a; \mathbf{w}_t) \nabla \pi(a|S_t;\boldsymbol{\theta}_t) $$

REINFORCE Algorithm

$$ \begin{aligned} \nabla J(\boldsymbol{\theta}) &= \mathbb{E}_{\pi}[\sum_a q(S, a; \mathbf{w}) \nabla \pi(a|S;\boldsymbol{\theta})]\\ &= \mathbb{E}_{\pi}[\sum_a \pi(a|S; \boldsymbol{\theta}) q(S, a; \mathbf{w})\frac{\nabla \pi(a|S;\boldsymbol{\theta})}{\pi(a|S; \boldsymbol{\theta})}]\\ &= \mathbb{E}_{\pi}[q(S, A; \mathbf{w})\frac{\nabla \pi(A|S;\boldsymbol{\theta})}{\pi(A|S; \boldsymbol{\theta})}] \\ &= \mathbb{E}_{\pi}[q(S, A; \mathbf{w})\nabla \ln \pi(A|S; \boldsymbol{\theta})] \\ &= \mathbb{E}_{\pi}[G_t \nabla \ln \pi(A_t|S_t; \boldsymbol{\theta})] \quad (\text{because } \mathbb{E}[G_t|S_t, A_t] = q(S_t, A_t)) \end{aligned} $$

The corresponding stochastic update is

$$ \boldsymbol{\theta}_{t+1} \gets \boldsymbol{\theta}_t + \alpha G_t \nabla \ln \pi(A_t|S_t; \boldsymbol{\theta}_t) $$
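Below is a minimal numpy REINFORCE sketch with a tabular softmax policy; the three-state chain environment is made up purely to exercise the update $\boldsymbol{\theta} \gets \boldsymbol{\theta} + \alpha G_t \nabla \ln \pi(A_t|S_t;\boldsymbol{\theta})$ and is not meant as a benchmark.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha = 3, 2, 0.9, 0.05
theta = np.zeros((n_states, n_actions))       # tabular softmax policy parameters

def policy_probs(s):
    z = theta[s] - theta[s].max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(s, a):
    """Gradient of ln pi(a|s) for a tabular softmax: one-hot(a) - pi(.|s) in row s."""
    g = np.zeros_like(theta)
    g[s] = -policy_probs(s)
    g[s, a] += 1.0
    return g

def toy_episode():
    """Made-up 3-state chain: action 1 moves right; reaching the end gives +1."""
    traj, s = [], 0
    while s < n_states:
        a = rng.choice(n_actions, p=policy_probs(s))
        s_next = s + 1 if a == 1 else s
        r = 1.0 if s_next == n_states else -0.1
        traj.append((s, a, r))
        s = s_next
        if len(traj) > 50:                    # safety cap for the illustration
            break
    return traj

for _ in range(2000):
    traj = toy_episode()
    g = 0.0
    for s, a, r in reversed(traj):            # accumulate returns backwards
        g = r + gamma * g
        theta += alpha * g * grad_log_pi(s, a)  # REINFORCE update

print(policy_probs(0), policy_probs(1), policy_probs(2))  # should favor action 1
```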

REINFORCE with Baseline

We can also add a baseline to the REINFORCE algorithm to reduce the variance. The baseline can be any random variable as long as it is independent of the action.

$$ \nabla J(\boldsymbol{\theta}) \propto \sum_s \mu(s) \sum_a \nabla \pi(a|s) (q(s, a; \mathbf{w}) - b(s)) $$

This is verified by

$$ \begin{aligned} \nabla J(\boldsymbol{\theta}) &\propto \sum_s \mu(s) \sum_a \nabla \pi(a|s) (q(s, a; \mathbf{w}) - b(s)) \\ &= \sum_s \mu(s) [\sum_a \nabla \pi(a|s)q(s, a; \mathbf{w}) - \sum_a \nabla \pi(a|s)b(s)] \\ &= \sum_s \mu(s) [\sum_a \nabla \pi(a|s)q(s, a; \mathbf{w}) - b(s)\nabla\sum_a \pi(a|s)] \\ &= \sum_s \mu(s) [\sum_a \nabla \pi(a|s)q(s, a; \mathbf{w}) - b(s)\nabla 1] \\ &= \sum_s \mu(s) \sum_a q(s, a; \mathbf{w})\nabla \pi(a|s) \end{aligned} $$

A natural choice of the baseline is the state value function. So, the REINFORCE with baseline algorithm is:

REINFORCE with baseline algorithm
REINFORCE with baseline. (Chapter 13 of [1])

Actor-Critic Methods

Actor-critic methods use the one-step TD target in place of the full Monte Carlo return used by the REINFORCE algorithm. In an actor-critic method, we estimate two functions: the state value function (the critic, from which the TD error is computed as an estimate of the advantage) and the policy (the actor).

Actor-Critic algorithm
Actor-Critic algorithm. (Chapter 13 of [1])

Trust Region Policy Optimization

This section is based on [3]. For a new policy $\tilde{\pi}$, TRPO works with the local approximation

$$ \begin{aligned} L_{\pi}(\tilde{\pi}) &= v_\pi + \sum_{t=0}^\infty \sum_s \gamma^t P(s_t =s | \pi) \sum_a \tilde{\pi}(a|s) A_\pi(s, a) \\ &= v_\pi + \sum_{s} \rho_{\pi}(s) \sum_{a} \tilde{\pi}(a \mid s) A_{\pi}(s, a). \end{aligned} $$

where $\rho_{\pi}(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \pi)$ is the sum of the discounted visitation probability of state $s$ under policy $\pi$.

Let $D_{KL}^{\max}(\pi, \tilde{\pi}) = \max_s D_{KL}(\pi(\cdot|s), \tilde{\pi}(\cdot|s))$ be the maximum KL divergence between the two policies.

Theorem: Let $\pi$ and $\tilde{\pi}$ be two policies. Then the following bound holds:

$$ v_{\tilde{\pi}} \geq L_{\pi}(\tilde{\pi}) - \frac{4\epsilon\gamma}{(1-\gamma)^2} D_{KL}^{\max}(\pi, \tilde{\pi}) $$

where $\epsilon = \max_{s,a}|A_{\pi}(s,a)|$.

Let the right hand side of the above inequality be $M_{\pi}(\tilde{\pi})$. We have $v_{\tilde{\pi}} \geq M_{\pi}(\tilde{\pi})$ and $v_{\pi} = M_{\pi}(\pi)$. So, we have

$$ v_{\tilde{\pi}} - v_\pi \ge M_{\pi}(\tilde{\pi}) - M_{\pi}(\pi). $$

If we take an ascent step on $M_\pi$, i.e., find a $\tilde{\pi}$ with $M_\pi(\tilde{\pi}) > M_\pi(\pi)$, we obtain a new policy $\tilde{\pi}$ with a higher value function. This guarantees a monotonically improving sequence of policies.

If we optimized the theoretical objective $M_\pi(\tilde{\pi})$ directly, the penalty coefficient $\frac{4\epsilon \gamma}{(1-\gamma)^2}$ would be very large, heavily penalizing the KL divergence; the new policy would then stay extremely close to the old one and the steps would be tiny. Therefore, the authors proposed solving the constrained problem

$$ \begin{aligned} &\underset{\theta}{\text{maximize}} \quad L_{\theta_{\text{old}}}(\theta) \\ &\text{subject to} \quad \overline{D}_{\text{KL}}^{\rho_{\theta_{\text{old}}}}(\theta_{\text{old}}, \theta) \leq \delta. \end{aligned} $$

where

$$ \begin{aligned} \overline{D}_{\text{KL}}^{\rho}(\theta_1, \theta_2) &= \mathbb{E}_{s \sim \rho} \left[ D_{\text{KL}} \big( \pi_{\theta_1}(\cdot \mid s) \,\big\|\, \pi_{\theta_2}(\cdot \mid s) \big) \right]. \end{aligned} $$

Sample-based Estimation

Let us write the full optimization problem as

$$ \begin{aligned} &\max_\theta \sum_s \rho_{\theta_{old}}(s)\sum_a\pi_\theta(a|s)A_{\theta_{old}}(s, a) \\ &\text{subject to} \quad \overline{D}_{\text{KL}}^{\rho_{\theta_{old}}}(\theta_{old}, \theta) \leq \delta. \end{aligned} $$

If we use importance sampling to estimate the objective, we have

$$ \sum_a\pi_\theta(a|s)A_{\theta_{old}}(s, a) = \mathbb{E}_{a\sim q}[\frac{\pi_\theta(a|s)}{q(a|s)} A_{\theta_{old}}(s, a)]. $$

Then our optimization problem becomes

$$ \begin{aligned} &\max_\theta \mathbb{E}_{s\sim\rho_{\theta_{old}}, a\sim q}[\frac{\pi_\theta(a|s)}{q(a|s)} A_{\theta_{old}}(s, a)] \\ &\text{subject to} \quad \mathbb{E}_{s \sim \rho_{\theta_{old}}} \left[ D_{\text{KL}} \big( \pi_{\theta_{old}}(\cdot \mid s) \,\big\|\, \pi_{\theta}(\cdot \mid s) \big) \right] \leq \delta. \end{aligned} $$

Sampling Methods

The authors proposed sampling actions uniformly for discrete action spaces and sampling from $\pi_{\theta_{old}}(a|s)$ for continuous action spaces. They also suggested using weighted importance sampling for large or continuous action spaces.

Proximal Policy Optimization

This section is based on [4].

The PPO objective is

$$ L^{CLIP}(\theta) = \hat{\mathbb{E}}_{t} \left[ \min \left( \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} A_t, \text{clip}(\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}, 1-\epsilon, 1+\epsilon) A_t \right) \right]. $$

where $\epsilon$ is a hyperparameter, say $\epsilon = 0.2$, and $\hat{\mathbb{E}}_{t}$ denotes an empirical average over timesteps of trajectories collected with the old policy. The following figure shows the objective value as a function of the ratio between the new policy and the old one. The clipping mechanism prevents excessively large policy updates.

Function value with respect to the ratio
Function value with respect to the ratio [4]
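A small numpy sketch of the clipped surrogate for a batch of samples; the log-probabilities and advantages below are made-up numbers, and in practice they would come from the rollout buffer and an advantage estimator.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, epsilon=0.2):
    """L^CLIP = mean_t min( r_t A_t, clip(r_t, 1-eps, 1+eps) A_t ),
    with r_t = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)."""
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))

# Illustrative numbers: the clip only bites when the ratio moves too far
# in the direction that would otherwise increase the objective.
logp_old = np.log(np.array([0.2, 0.5, 0.1]))
logp_new = np.log(np.array([0.4, 0.5, 0.05]))
adv = np.array([1.0, -0.5, 2.0])
print(ppo_clip_objective(logp_new, logp_old, adv))
```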

Group Relative Policy Optimization

The Group Relative Policy Optimization (GRPO) method was proposed by DeepSeek in [5]. While the original implementation in [5] relied on a reward model, DeepSeek-R1 later refined this approach to use only accuracy and format rewards, acknowledging that the reward model can be easily hacked [6].

The objective of the group relative policy optimization in [6] is

$$ \begin{aligned} &\mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{q \sim P(Q), \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}} (O | q) } \Bigg[ \\ &\frac{1}{G} \sum_{i=1}^{G} \min \left[ \frac{\pi_{\theta} (o_{i} | q)}{\pi_{\theta_{\text{old}}} (o_{i} | q)} \hat{A}_{i}, \operatorname{clip} \left( \frac{\pi_{\theta} (o_{i} | q)}{\pi_{\theta_{\text{old}}} (o_{i} | q)}, 1 - \epsilon, 1 + \epsilon \right) \hat{A}_{i} \right] \\ &- \beta \mathbb{D}_{KL} \big[\pi_{\theta} \| \pi_{\text{ref}} \big] \Bigg], \end{aligned} $$

where $q$ is the question, and the advantage $\hat{A}_{i}$ is calculated by

$$ \hat{A}_{i} = \tilde{r}_i = \frac{r_i - \text{mean}(r)}{\text{std}(r)}, $$

where $r_i$ is the reward of the $i$-th response in the group and $\pi_{\text{ref}}$ is the reference model, typically the pretrained model. (I am not sure whether it is kept fixed during the reinforcement learning process.)

The KL divergence is estimated by

$$ \mathbb{D}_{KL} \big[\pi_{\theta} \| \pi_{\text{ref}} \big]=\frac{\pi_{\text{ref}}}{\pi_\theta} - \ln \frac{\pi_{\text{ref}}}{\pi_\theta} - 1. $$

This estimator is unbiased because

$$ \begin{aligned} \mathbb{E}_{a \sim \pi_\theta}\left[\frac{\pi_{\text{ref}}(a)}{\pi_\theta(a)} - \ln \frac{\pi_{\text{ref}}(a)}{\pi_\theta(a)} - 1\right] &= \sum_{a} \pi_\theta(a)\frac{\pi_{\text{ref}}(a)}{\pi_\theta(a)} - \mathbb{E}_{a \sim \pi_\theta}\left[\ln \frac{\pi_{\text{ref}}(a)}{\pi_\theta(a)}\right] - 1 \\ &= 1 + \mathbb{E}_{a \sim \pi_\theta}\left[\ln \frac{\pi_\theta(a)}{\pi_{\text{ref}}(a)}\right] - 1 \\ &= \mathbb{D}_{KL} \big[\pi_{\theta} \| \pi_{\text{ref}} \big]. \end{aligned} $$

It is also non-negative, since the function $f(x) = x - \ln x -1$ attains its minimum, $f(1)=0$, at $x=1$ on the domain $x>0$.
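Below is a small sketch of the two ingredients discussed above: the group-relative advantage and the per-token estimator $\frac{\pi_{\text{ref}}}{\pi_\theta} - \ln\frac{\pi_{\text{ref}}}{\pi_\theta} - 1$. The log-probabilities are placeholders, and the small epsilon guarding a zero standard deviation is my own addition, not part of the formula in [6].

```python
import numpy as np

def group_relative_advantages(rewards):
    """A_i = (r_i - mean(r)) / std(r) within one group of G sampled responses."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)   # epsilon guards an all-equal group

def kl_estimate(logp_theta, logp_ref):
    """Per-token estimator pi_ref/pi_theta - ln(pi_ref/pi_theta) - 1 (always >= 0)."""
    log_ratio = logp_ref - logp_theta
    return np.exp(log_ratio) - log_ratio - 1.0

# One group of G = 4 responses to the same question, e.g. accuracy rewards.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# Token-level KL penalty for made-up log-probabilities under pi_theta and pi_ref.
print(kl_estimate(np.log([0.3, 0.6]), np.log([0.25, 0.7])).mean())
```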

By computing the relative advantage within a group, we remove the need for a learned state value function. Since a state value function for a large language model might be as large as the model itself, this dramatically reduces the computational cost. However, since the reasoning process is modeled as an episodic process and the method is essentially Monte Carlo sampling, I am not sure whether the above claim is valid. Although recent replications of [6] with smaller models on a specific task show that PPO also works well and that even the cross-entropy loss is not necessary, I do not think the cross-entropy loss can be ignored for general-purpose LLMs. For a small task (e.g., Countdown), we can drop the cross-entropy loss because the task-specific policy may indeed deviate from the pre-trained model; however, that cannot be true if you want to solve general tasks.

Dr. GRPO

[Updated on Mar 21, 2025] [8] suggests removing the division by the length of the response and the standard-deviation normalization term.

objectives of GRPO and Dr. GRPO
Objectives of GRPO and Dr. GRPO [8]
The intuition behind these revisions is:
  • Response-level length bias: This arises from dividing by $|o_i|$. For positive advantages ($\hat{A}_{i,t} > 0$, indicating a correct response), this bias results in greater gradient updates for shorter responses, leading the policy to favor brevity in correct answers. Conversely, for negative advantages ($\hat{A}_{i,t} < 0$, indicating an incorrect response), longer responses are penalized less due to their larger $|o_i|$, causing the policy to prefer lengthier responses when incorrect.

  • Question-level difficulty bias: This is caused by dividing the centered reward by $\mathrm{std}(\{R(q, o_1), \ldots, R(q, o_G)\})$. Questions with lower standard deviations are given higher weights during policy updates. While advantage normalization is a common trick in RL (Andrychowicz et al., 2021), it is typically computed across an entire batch. In contrast, question-level normalization results in varying weights in the objective for different questions.

DAPO

In [9], they increased the ceiling parameter $\varepsilon_{high}$; in the original PPO, $\varepsilon_{low}$ and $\varepsilon_{high}$ have the same value. When training the model, they observed that the entropy of the model output decreases quickly. They identified that the upper clip restricts the increase of low-probability outputs, thus potentially constraining output diversity. Furthermore, when sampling responses for a question $q$, they prune groups whose responses are all correct or all incorrect, i.e., they keep sampling until they obtain a group with at least one correct and one incorrect response (a small sketch of this rule follows the objective below).

$$ \begin{aligned} \mathcal{J}_{\text{DAPO}}(\theta) &= \mathbb{E}_{(q,a) \sim \mathcal{D}, \{o_i\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(\cdot|q)} \Bigg[ \\ &\frac{1}{\sum_{i=1}^G |o_i|} \sum_{i=1}^G \sum_{t=1}^{|o_i|} \min \left( r_{i,t}(\theta) \hat{A}_{i,t}, \operatorname{clip}\left(r_{i,t}(\theta), 1 - \textcolor{red}{\varepsilon_{\text{low}}}, 1 + \textcolor{red}{\varepsilon_{\text{high}}}\right) \hat{A}_{i,t} \right) \Bigg]\\ &\textcolor{red}{\text{s.t.}} \quad 0 < \left|\left\{ o_i \mid \text{is\_equivalent}(a, o_i) \right\}\right| < G, \end{aligned} $$
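A tiny sketch of that dynamic-sampling rule; `sample_group` and `is_correct` are assumed callables standing in for the actual rollout and reward checking.

```python
def sample_mixed_group(sample_group, is_correct, max_tries=100):
    """Resample until the group contains both correct and incorrect responses,
    i.e. 0 < |{o_i : is_correct(o_i)}| < G (the DAPO dynamic-sampling constraint)."""
    for _ in range(max_tries):
        group = sample_group()                      # G responses to one question
        n_correct = sum(is_correct(o) for o in group)
        if 0 < n_correct < len(group):
            return group
    raise RuntimeError("no informative group found; question may be too easy or too hard")
```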

Appendix

Why Baseline Can Reduce the Variance

This proof was generated by ChatGPT and subsequently reviewed and revised by myself. More rigorous analysis can be found in [7].

1. Policy Gradient (REINFORCE): Let $\pi_\theta(a|s)$ be a parameterized policy giving the probability (or density) of action $a$ in state $s$. We seek to maximize the expected return $J(\theta)$ (the cumulative reward). The policy gradient theorem (REINFORCE) gives the gradient of the objective as an expectation over the policy’s randomness. Formally, one version of the REINFORCE gradient is:

$$ \nabla_{\theta} J(\theta) \;=\; \mathbb{E}_{\pi_\theta}\!\Big[\, G_t \,\nabla_{\theta} \ln \pi_\theta(A_t \mid S_t)\Big]\,, $$

where $G_t$ is the total return (cumulative reward) following time $t$, and the expectation $\mathbb{E}_{\pi_\theta}[\,\cdot\,]$ is taken over the trajectory distribution induced by $\pi_\theta$. In words, an unbiased stochastic estimator for the policy gradient is

$$ \hat{g} \;=\; G_t \,\nabla_{\theta} \ln \pi_\theta(A_t \mid S_t)\,, $$

since $\mathbb{E}[\hat{g}] = \nabla_\theta J(\theta)$ by the policy gradient theorem. This is the basis of the REINFORCE algorithm: adjust $\theta$ in the direction of $G_t \nabla_\theta \ln\pi_\theta(A_t|S_t)$ on each sampled trajectory.

2. Adding a Baseline to the Gradient Estimator: While $\hat{g}$ is an unbiased gradient estimator, it can have high variance, slowing convergence. A common technique to reduce variance is to subtract a baseline $b(s)$ from the return. The modified update uses

$$ \hat{g}_b \;=\; \big(G_t - b(S_t)\big)\,\nabla_{\theta} \ln \pi_\theta(A_t \mid S_t)\,, $$

where $b(s)$ is any function of state (it can even be a constant). Importantly, $b(s)$ must not depend on the action $A_t$ (or any random outcome correlated with the action) – it should be independent of the sampling of $A_t$. This ensures the estimator remains unbiased. Intuitively, $b(s)$ is a baseline level of expected return against which the actual return $G_t$ is compared; $G_t - b(S_t)$ is often called the advantage. Choosing a good baseline (e.g. an estimate of the state’s value $V^\pi(s)$) can greatly reduce variance without changing the expected value.

3. Unbiasedness with a Baseline: We now show that $\hat{g}_b$ is an unbiased estimator of the true gradient $\nabla_\theta J(\theta)$. Taking expectation (over all randomness in $S_t, A_t,$ and rewards):

$$ \begin{aligned} \mathbb{E}\!\big[(G_t - b(S_t))\,\nabla_\theta \ln \pi_\theta(A_t|S_t)\big] &= \mathbb{E}\!\big[G_t\,\nabla_\theta \ln \pi_\theta(A_t|S_t)\big] - \mathbb{E}\!\big[b(S_t)\,\nabla_\theta \ln \pi_\theta(A_t|S_t)\big] \\ &= \mathbb{E}\!\big[G_t\,\nabla_\theta \ln \pi_\theta(A_t|S_t)\big] - \mathbb{E}_{S_t}\!\Big[\, b(S_t)\, \mathbb{E}_{A_t\sim\pi}[\,\nabla_\theta \ln \pi_\theta(A_t|S_t) \mid S_t\,] \Big]\,, \end{aligned} $$

because $b(S_t)$ is constant with respect to the inner expectation (conditional on $S_t$); note that $G_t$ depends on $A_t$, so it cannot be pulled out of the inner expectation. Now, note the key identity for any fixed state $s$:

$$ \mathbb{E}_{A\sim\pi_\theta(\cdot|s)}\!\big[\nabla_\theta \ln \pi_\theta(A|s)\big] \;=\; \int_a \pi_\theta(a|s)\,\nabla_\theta \ln \pi_\theta(a|s)\, da \;=\; \int_a \nabla_\theta \pi_\theta(a|s)\, da \;=\; \nabla_\theta \int_a \pi_\theta(a|s)\, da \;=\; \nabla_\theta\,1 \;=\; \mathbf{0}\,. $$

In other words, $\mathbb{E}[\nabla_\theta \ln \pi_\theta(A_t|S_t) \mid S_t=s] = 0$ for all $s$. This follows from the fact that $\int \pi_\theta(a|s)da=1$ and differentiating yields zero. Using this result above:

$$ \mathbb{E}\!\big[(G_t - b(S_t))\,\nabla_\theta \ln \pi_\theta(A_t|S_t)\big] = \mathbb{E}[G_t\,\nabla_\theta \ln \pi_\theta(A_t|S_t)] - \mathbb{E}_{S_t}\!\big[\, b(S_t) \cdot \mathbf{0} \big] = \mathbb{E}[G_t\,\nabla_\theta \ln \pi_\theta(A_t|S_t)]\,, $$

since the $b(S_t)$ term drops out. But $\mathbb{E}[G_t\,\nabla_\theta \ln \pi_\theta(A_t|S_t)] = \nabla_\theta J(\theta)$ by the REINFORCE formula. Therefore,

$$ \mathbb{E}[\hat{g}_b] \;=\; \mathbb{E}[(G_t - b(S_t))\,\nabla_\theta \ln \pi_\theta(A_t|S_t)] \;=\; \nabla_\theta J(\theta)\,. $$

Conclusion: Adding the baseline does not change the expected gradient. The baseline term is an expectation-zero control variate – we’ve added a quantity whose mean is zero, so it doesn’t bias the estimator. Thus, $\hat{g}_b$ is an unbiased gradient estimator, just like the original $\hat{g}$.

4. Variance Reduction via the Baseline: Although $\hat{g}_b$ has the same mean as $\hat{g}$, its variance can be much lower. We will compare $\mathrm{Var}(\hat{g}_b)$ to $\mathrm{Var}(\hat{g})$ and show the former is smaller (for an appropriate choice of $b$). For simplicity, consider one component of the gradient (one parameter dimension), write $X = \nabla_\theta \ln \pi_\theta(A_t|S_t)$ and $Y = G_t$ for shorthand, and treat the baseline as a constant $b$ (the state-dependent case is analogous, working conditionally on $S_t$). Then our original estimator is $U = X Y$ and the baseline estimator is $U_b = X (Y - b)$. We analyze the variance of these scalar random variables (the argument applies to each component of the gradient vector separately):

  • Variance without baseline: $\displaystyle\mathrm{Var}(U) \;=\; \mathbb{E}[X^2 Y^2] \;-\; (\mathbb{E}[X\,Y])^2.$

  • Variance with baseline: $\displaystyle\mathrm{Var}(U_b) \;=\; \mathbb{E}[X^2 (Y - b)^2] \;-\; (\mathbb{E}[X\,(Y - b)])^2.$

We already know $\mathbb{E}[X\,(Y - b)] = \mathbb{E}[X Y]$ (unbiasedness, proved above). So:

$$ \begin{aligned} \mathrm{Var}(U_b) &= \mathbb{E}[X^2 (Y - b)^2] - (\mathbb{E}[X Y])^2, \\ \text{where } \mathbb{E}[X^2 (Y - b)^2] &= \mathbb{E}[X^2 (Y^2 - 2\,bY + b^2)] \\ &= \mathbb{E}[X^2 Y^2] \;-\; 2\,b\,\mathbb{E}[X^2 Y] \;+\; b^2\,\mathbb{E}[X^2]\,. \end{aligned} $$

Now, the difference in variances is:

$$ \mathrm{Var}(U) - \mathrm{Var}(U_b) \;=\; \mathbb{E}[X^2 Y^2] - (\mathbb{E}[XY])^2 \;-\; \Big(\mathbb{E}[X^2 (Y - b)^2] - (\mathbb{E}[XY])^2\Big)\,. $$

Cancelling the $(\mathbb{E}[XY])^2$ terms and substituting the expanded form, we get:

$$ \begin{aligned} \mathrm{Var}(U) - \mathrm{Var}(U_b) &= \mathbb{E}[X^2 Y^2] - \Big(\mathbb{E}[X^2 Y^2] - 2\,b\,\mathbb{E}[X^2 Y] + b^2\,\mathbb{E}[X^2]\Big) \\ &= 2\,b\,\mathbb{E}[X^2 Y] \;-\; b^2\,\mathbb{E}[X^2]\,. \end{aligned} $$

This expression is a quadratic function of $b$. To maximize the variance reduction, we can choose an optimal $b$. Setting the derivative with respect to $b$ to zero:

$$ \frac{d}{db}\big(2\,b\,\mathbb{E}[X^2 Y] - b^2\,\mathbb{E}[X^2]\big) = 2\,\mathbb{E}[X^2 Y] - 2\,b\,\mathbb{E}[X^2] = 0 \\ \implies b^* \;=\; \frac{\mathbb{E}[X^2 Y]}{\mathbb{E}[X^2]}\,. $$

Plugging $b^*$ back in, the minimum achievable variance with an optimal baseline is:

$$ \mathrm{Var}(U_{b^*}) \;=\; \mathbb{E}[X^2 Y^2] - (\mathbb{E}[X^2 Y])^2/\mathbb{E}[X^2] \;-\; (\mathbb{E}[XY])^2\,. $$

Compare this to $\mathrm{Var}(U) = \mathbb{E}[X^2 Y^2] - (\mathbb{E}[XY])^2$. The difference is

$$ \mathrm{Var}(U) - \mathrm{Var}(U_{b^*}) \;=\; \frac{\big(\mathbb{E}[X^2 Y]\big)^2}{\mathbb{E}[X^2]}\,, $$

which is nonnegative (a squared quantity divided by $\mathbb{E}[X^2]>0$). Thus $\mathrm{Var}(U_{b^*}) \le \mathrm{Var}(U)$. In other words, by choosing an appropriate baseline $b$, we can only decrease (or at worst, leave unchanged) the variance of the gradient estimator. Generally, a well-chosen baseline will significantly reduce variance. In practice, the baseline $b(s)$ is usually taken to be the expected return from state $s$, i.e. the state value $V^\pi(s)$, which is a close approximation to the optimal baseline derived above. In that case $Y - b(S_t) = G_t - V^\pi(S_t)$ is the advantage $A^\pi(S_t, A_t)$, which has expectation zero given $S_t$. Using the advantage $A(s,a) = Q^\pi(s,a) - V^\pi(s)$ in the update yields a much lower variance than the unbaselined estimator while remaining unbiased.
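As a quick numerical sanity check of these formulas, the following sketch estimates both variances on a made-up two-action bandit with a fixed softmax policy, using the optimal constant baseline $b^* = \mathbb{E}[X^2 Y]/\mathbb{E}[X^2]$ derived above.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.5                                   # scalar parameter of a 2-action softmax
p1 = 1.0 / (1.0 + np.exp(-theta))             # pi(a=1); pi(a=0) = 1 - p1
rewards = {0: 0.0, 1: 1.0}                    # deterministic rewards of a toy bandit

# Sample (X, Y) = (d/dtheta ln pi(A), reward(A)).
A = (rng.random(100_000) < p1).astype(int)
X = np.where(A == 1, 1.0 - p1, -p1)           # gradient of the log-softmax w.r.t. theta
Y = np.where(A == 1, rewards[1], rewards[0])

b_star = np.mean(X**2 * Y) / np.mean(X**2)    # optimal constant baseline
print("Var without baseline:", np.var(X * Y))
print("Var with b*:         ", np.var(X * (Y - b_star)))  # should be no larger
```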

5. Summary: We have proven that adding a baseline $b(S_t)$ to the REINFORCE policy gradient estimator leaves its expectation unchanged (unbiased) while reducing its variance. The baseline effectively acts as a control variate – a zero-mean adjustment that removes unnecessary fluctuation. By appropriate choice of $b(s)$ (often an estimate of $V^\pi(s)$), the variance of the gradient estimator is provably lowered without introducing bias. This leads to more stable and sample-efficient learning in practice. The key results can be summarized as:

  • Unbiasedness: $\mathbb{E}[(G_t - b(S_t))\,\nabla_\theta \ln\pi_\theta(A_t|S_t)] = \mathbb{E}[G_t\,\nabla_\theta \ln\pi_\theta(A_t|S_t)] = \nabla_\theta J(\theta)$.

  • Variance reduction: with the optimal baseline $b^*$, $\mathrm{Var}[(G_t - b^*)\,\nabla_\theta \ln\pi] = \mathrm{Var}[G_t\,\nabla_\theta \ln\pi] - \frac{(\mathbb{E}[X^2 Y])^2}{\mathbb{E}[X^2]}\,$, which is no greater than the original variance; in practice $b(s) \approx V^\pi(s)$ (making $Y - b(s)$ the advantage) captures most of this reduction.

Thus, adding a baseline in REINFORCE keeps the gradient estimator unbiased but lowers its variance, fulfilling the proof as required.

References

[1] Richard S. Sutton, and Andrew G. Barto. “Reinforcement learning: An introduction”, second edition, 2017.

[2] T. Jaakkola, M. I. Jordan, and S. P. Singh, “On the Convergence of Stochastic Iterative Dynamic Programming Algorithms,” Neural Computation, vol. 6, no. 6, pp. 1185–1201, Nov. 1994, doi: 10.1162/neco.1994.6.6.1185.

[3] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel, “Trust Region Policy Optimization,” arXiv:1502.05477 [cs], Feb. 2015, Accessed: Dec. 26, 2017. [Online]. Available: http://arxiv.org/abs/1502.05477

[4] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal Policy Optimization Algorithms,” arXiv:1707.06347 [cs], Jul. 2017, Accessed: Apr. 10, 2018. [Online]. Available: http://arxiv.org/abs/1707.06347

[5] Z. Shao et al., “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models,” Apr. 27, 2024, arXiv:2402.03300. doi: 10.48550/arXiv.2402.03300.

[6] D. Guo, D. Yang, H. Zhang, J. Song, and R. Zhang, “DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning,” arXiv preprint arXiv:2501.12948, 2025. [Online]. Available: https://arxiv.org/abs/2501.12948.

[7] E. Greensmith, P. L. Bartlett, and J. Baxter, “Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning,” Journal of Machine Learning Research, 2004.

[8] Z. Liu et al., “Understanding R1-Zero-Like Training: A Critical Perspective”.

[9] Q. Yu et al., “DAPO: An Open-Source LLM Reinforcement Learning System at Scale,” Mar. 18, 2025, arXiv: arXiv:2503.14476. doi: 10.48550/arXiv.2503.14476.