An Introduction to PPO in RLHF

Reinforcement Learning in NLP

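First, recall the basic loop of reinforcement learning (RL). RL has two core roles: the agent and the environment. The agent lives in the environment and interacts with it. At time $t$, the interaction proceeds as follows:

The agent observes the environment's state $s_{t}$ and then decides on an action $a_{t}$
The environment changes in response to the agent's action (it may also change on its own) and gives the agent an immediate reward $r_{t}$

The goal of RL is to learn a policy $\pi_{\theta}(a \mid s)$ that tells the agent which action $a$ to take in state $s$ so as to maximize the cumulative reward $G_{t}$, also called the return.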

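In the LLM setting, the RL concepts map as follows:

Agent: the LLM itself
Environment: in RLHF the environment is a fairly abstract notion. It is deterministic and non-interactive, and can be thought of as the LLM's context.
State $S_{t}$ : the LLM's context at time $t$
Policy $\pi_{\theta}(\cdot \mid S_{t})$ : the probability distribution over the vocabulary that the LLM produces at time $t$
Action $A_{t}$ : the token the LLM emits at time $t$
Episode $\tau$ : one complete generation by the LLM forms a single trajectory

The agent interacts with the environment as follows:

A user gives the model a prompt and expects a response (a.k.a. completion) that matches their preferences
At decoding step $t$, the model emits a token $A_{t}$ given the current context $S_{t}$ and receives an immediate reward $R_{t}$; its expected return is estimated as $V^{\pi}(S_{t})$

Here the immediate reward is the gain attributed to the current token $A_{t}$, while the expected return (also called the state value) is the total gain from step $t$ until the response is complete.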

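A further comparison of RLHF with traditional RL:

Dimension	Traditional RL	RLHF
State transition	The environment is usually dynamic and stochastic; the current state $S_{t}$ and action $A_{t}$ jointly determine the distribution of the next state $S_{t+1}$, i.e. $S_{t+1} \sim P(\cdot \mid S_{t},A_{t})$	The transition is deterministic and not driven by the environment; it is a simple concatenation $S_{t+1}=S_{t} \oplus A_{t}$
Reward source	The reward function $R(s_{t},a_{t},s_{t+1})$ is an inherent part of the environment, objective and fixed	There is no fixed reward function; a parameterized reward model $r_{\phi}(x,y)$ takes its place, where $x$ is the prompt and $y$ is the completion. In addition, PPO folds a KL divergence term into the final immediate reward.
Reward granularity	Usually fine-grained: after taking action $A_{t}$ at each step $t$, the agent may receive an immediate reward $R_{t}$ from the environment	The reward model usually computes a response-level reward only once the episode ends, whereas the KL term is token-level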
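The deterministic transition $S_{t+1}=S_{t} \oplus A_{t}$ can be illustrated with a toy rollout. This is only a sketch: `transition`, `rollout`, and `toy_policy` are hypothetical names, token ids are plain integers, and the "policy" is a stand-in rule rather than an LLM.

```python
def transition(state: list[int], action: int) -> list[int]:
    """Deterministic RLHF 'environment' step: S_{t+1} = S_t concatenated with A_t."""
    return state + [action]


def rollout(prompt_ids, pick_token, eos_id, max_new_tokens):
    """Generate one episode: the state is the context, each action is one token."""
    state = list(prompt_ids)
    for _ in range(max_new_tokens):
        action = pick_token(state)         # A_t chosen from pi(. | S_t)
        state = transition(state, action)  # no randomness in the transition
        if action == eos_id:
            break
    return state


def toy_policy(state):
    """Stand-in policy (assumption): always emits the previous token id plus one."""
    return state[-1] + 1


episode = rollout([5, 6], toy_policy, eos_id=9, max_new_tokens=10)
print(episode)  # [5, 6, 7, 8, 9]
```

All randomness in an RLHF episode comes from sampling the action; given the same actions, the same states follow.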

The next section looks at which models produce the immediate reward $R_{t}$ and the expected return $V^{\pi}(S_{t})$, and which datasets those models are trained on.

Overview of Models and Datasets

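This section addresses several questions about the RLHF stage:

Which datasets and models are involved?
In how many stages are these models trained?
Where do the models come from? Specifically, how are they initialized?

First, the four main models:

Model	Description	Input	Output
Actor Model	The target language model: already SFT-trained, to be preference-aligned	The current state $S_{t}$ , i.e. the context $(x,y_{<t})$ at time $t$	Next-token probability distribution $\pi_{\theta}(\cdot \mid S_{t})$ over $A_{t}$
Reference Model	A reference language model: already SFT-trained, not preference-aligned	Same as Actor Model	Same as Actor Model
Critic Model	Regression model that estimates the expected return $G_{t}$	Same as Actor Model	Scalar: an estimate of the expected return $V^{\pi}(S_{t})$
Reward Model	Regression model that estimates the immediate reward $R_{t}$	Usually the full concatenation `{prompt, completion}`	Scalar: an estimate of the immediate reward $R_{t}$

Next, the two datasets:

Preference Dataset: used to train the Reward Model. Records can take the form {prompt, chosen, rejected}, where:
- prompt: the user input
- chosen: the response the user prefers
- rejected: the response the user does not prefer
Prompt Dataset: used for PPO. Records can take the form {prompt}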
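As a rough illustration of the two record shapes, the sketch below builds one record of each kind and turns the preference record into a positive and a negative text. The example strings and the helper `to_rm_pair` are made up for illustration; real datasets may use chat templates or other fields.

```python
# Hypothetical records; only the field names come from the formats listed above.
preference_record = {
    "prompt": "Explain what a KL divergence is. ",
    "chosen": "It measures how one probability distribution differs from another.",
    "rejected": "No idea.",
}
prompt_record = {"prompt": "Explain what a KL divergence is. "}


def to_rm_pair(record):
    """Concatenate the prompt with each response (plain string concatenation is an assumption)."""
    positive = record["prompt"] + record["chosen"]
    negative = record["prompt"] + record["rejected"]
    return positive, negative


positive, negative = to_rm_pair(preference_record)
```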

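For the exact formats, see: Hugging Face RLHF Dataset Formats and Types

Then, the core steps of RLHF:

Step 0: SFT. Starting from the pretrained model, supervised fine-tuning on a human-annotated dataset of prompt-completion pairs yields the SFT Model. The Reference Model and Actor Model are both initialized from it. This is counted as step 0 because it is usually a prerequisite of RLHF rather than part of it. DeepSeek R1 Zero, however, skips this stage and goes straight to reinforcement learning.
Step 1: Train a Reward Model. The Reward Model can be initialized from the pretrained model or from the SFT Model. A Preference Dataset must be built before training and is then used to train the Reward Model.
Step 2: RLHF-PPO. This stage trains the Actor Model and Critic Model with the PPO algorithm. The Reference Model and Reward Model obtained earlier are frozen and not updated.

Finally, model initialization and adaptation. Loading four SFT-sized models directly may cost too much GPU memory, so one can share parameters and use LoRA for parameter-efficient fine-tuning:

Reference Model: loaded directly from the SFT Model, with no modification
Reward Model: built on the Reference Model by adding one LoRA adapter, plus one regression head that outputs the immediate reward for the whole sequence
Actor & Critic Model: built on the Reference Model by adding one LoRA adapter, shared by the Actor and the Critic. The Actor inherits the original LM head, and the Critic gets a new regression head (i.e., value head) that outputs the state value $V^{\pi}(s_{t})$ at each token of the context

This amounts to loading a single full LLM (i.e., the Reference Model) plus two LoRA adapters and two regression heads, which saves a great deal of GPU memory.

Next, we walk through the core RLHF steps in order.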

Step 1: Reward Modeling

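This section briefly covers training and inference of the Reward Model. Its input is a sequence of token indices of shape (B, T). Specifically:

Training: the inputs are the positive sample {prompt, chosen} and the negative sample {prompt, rejected}.
Inference: the input is the Actor Model's output {prompt, completion}

Because the LM head is removed and replaced with a regression head, the output is usually a scalar score of shape (B,). Concretely, only the output at the last token position is kept, so each sequence maps to a single score.

The Reward Model's loss function is: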

$\mathcal{L}_{\text{RM}}(\phi) = -\mathbb{E}_{(x,y_{w},y_{l}) \sim \mathcal{D}_{\text{PREF}}} [\log(\sigma(r_{\phi}(x,y_{w})-r_{\phi}(x,y_{l})))]$

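where:

$(x,y_{w},y_{l})$ : a sample drawn from the preference dataset $\mathcal{D}_{\text{PREF}}$ , in the form {prompt, chosen, rejected}
$r_{\phi}(x,y_{w}),r_{\phi}(x,y_{l})$ : the Reward Model's outputs, i.e. the rewards of the chosen and rejected samples, where $\phi$ denotes the Reward Model's parameters
$\sigma(z) = 1/(1+e^{-z})$ : the sigmoid function

Substituting the sigmoid function $\sigma(\cdot)$ gives: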

$\mathcal{L}_{\text{RM}}(\phi) = \mathbb{E}_{(x,y_{w},y_{l}) \sim \mathcal{D}_{\text{PREF}}} \Big[\log\big(1 + e^{-(r_{\phi}(x,y_{w})-r_{\phi}(x,y_{l}))}\big)\Big]$
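For a single preference pair, the loss depends only on the margin $r_{\phi}(x,y_{w})-r_{\phi}(x,y_{l})$. The sketch below evaluates it on plain floats using a numerically stable softplus; `pairwise_rm_loss` is a hypothetical helper name, and a real implementation would operate on batched tensors.

```python
import math


def pairwise_rm_loss(r_chosen: float, r_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)) = log(1 + exp(-(r_chosen - r_rejected)))."""
    z = r_chosen - r_rejected
    # softplus(-z) written so that exp() never receives a large positive argument
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Equal scores give log(2); a larger margin in favor of the chosen response lowers the loss.
loss_tie = pairwise_rm_loss(1.0, 1.0)
loss_good = pairwise_rm_loss(3.0, 1.0)
loss_bad = pairwise_rm_loss(1.0, 3.0)
```

Minimizing this loss pushes the chosen score above the rejected one; only the difference matters, so the absolute scale of the scores is not pinned down.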

After training, the Reward Model's parameters are frozen.

Step 2: RLHF-PPO

Pseudocode for the overall RLHF-PPO flow:

for prompt in prompt_dataset:
    # Assuming output = [prompt, completion]
    output = policy_model.generate(prompt)

    scores = reward_model(output)
    old_logprobs, old_values = policy_model(output)
    ref_logprobs = ref_model(output)

    rewards = compute_rewards(old_logprobs, ref_logprobs, scores)
    advantages, returns = compute_gae_advantage_return(rewards, old_values)

    for _ in range(num_ppo_epochs):
        logprobs, values = policy_model(output)

        pg_loss = compute_actor_loss(logprobs, old_logprobs, advantages)
        vf_loss = compute_critic_loss(values, old_values, returns)
        loss = pg_loss + vf_coef * vf_loss

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

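Next, we interpret each variable in terms of the models' inputs and outputs:

policy_model: the Actor and Critic models. Because the two share one LoRA and differ only in their final prediction heads, we refer to them jointly as policy_model. Both take the concatenated {prompt, completion} as input, held in the variable output. The Actor outputs log probabilities and the Critic outputs state values. In the outer PPO loop, the outputs carry the old_ prefix to mark values computed before the PPO update; they are mainly used for importance sampling. The variables below are described without a prefix; adding old_ or ref_ does not change their core meaning.
- logprobs: the log probability the Actor assigns to the token actually sampled at each position, shape (B, T). Note that the pseudocode above may have no batch dimension B, in other words B=1
- values: the Critic's value estimate $V(s)$ for each position (more precisely, each state) in the sequence, shape (B, T).
ref_model: the Reference Model. Like the Actor, it outputs log probabilities, which are used to compute the KL divergence. It has no Critic regression head, so it produces no ref_values
reward_model: scores the whole output sequence. Its raw output has shape (B, T), but usually only the last position is kept, so the final output has shape (B,)

The remaining logic in the pseudocode is covered next:

Computing the final immediate reward $R_{t}$ : compute_rewards
Computing advantages and returns: compute_gae_advantage_return
Computing the PPO losses: compute_actor_loss and compute_critic_loss

Note

The functions defined and implemented in the following sections do not necessarily match the ones called in the pseudocode. The pseudocode is only meant to convey the overall flow, so it omits some secondary logic.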

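Computing the Immediate Reward

So far we have the Reward Model's score, of shape (B, T). In practice we keep only the score at the last position and set the scores at the first $T-1$ positions to $0$ . Intuitively, the Reward Model's score is an outcome-level score for the whole sequence, and we still lack a process-level score. During generation, we want the Actor's distribution not to drift too far from the Reference Model's. We therefore introduce the KL divergence as a process reward:

If the two distributions are close, the KL divergence is near $0$ and the Actor receives a higher reward
If the two distributions differ a lot, the KL divergence is large and the Actor receives a lower reward

First, for two discrete probability distributions $p(x)$ and $q(x)$ , the KL divergence from $p$ to $q$ is defined as: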

$\begin{aligned} \mathbb{D}_{\text{KL}}(p(\cdot) \| q(\cdot)) &= \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} \\ &= \mathbb{E}_{x \sim p(\cdot)} \bigg[ \log \frac{p(x)}{q(x)} \bigg] \end{aligned}$

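By the second equality, $\log p(x) - \log q(x)$ with $x \sim p$ is an unbiased single-sample estimate of $\mathbb{D}_{\text{KL}}(p(\cdot) \| q(\cdot))$ . In the RLHF setting, the KL divergence between the next-token distributions of the Actor Model and the Reference Model is: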

$\begin{aligned} \mathbb{D}_{\text{KL}}\Big(\pi_{\theta_{old}}(\cdot \mid s) \| \pi_{ref}(\cdot \mid s)\Big) &= \sum_{a \in \mathcal{V}} \pi_{\theta_{old}}(a \mid s) \log \frac{\pi_{\theta_{old}}(a \mid s)}{\pi_{ref}(a \mid s)} \\ &= \mathbb{E}_{a \sim \pi_{\theta_{old}}(\cdot \mid s)} \bigg[ \log \frac{\pi_{\theta_{old}}(a \mid s)}{\pi_{ref}(a \mid s)} \bigg] \end{aligned}$
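The unbiasedness claim can be checked numerically on a small made-up distribution: averaging $\log p(x) - \log q(x)$ over samples drawn from $p$ should approach the exact sum. The three-token distributions below are illustrative assumptions, not values from any model.

```python
import math
import random


def exact_kl(p, q):
    """KL(p || q) as the full sum over the support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def sampled_kl(p, q, num_samples, seed=0):
    """Monte Carlo estimate: average of log p(x) - log q(x) with x drawn from p."""
    rng = random.Random(seed)
    samples = rng.choices(range(len(p)), weights=p, k=num_samples)
    return sum(math.log(p[i]) - math.log(q[i]) for i in samples) / num_samples


p = [0.7, 0.2, 0.1]  # stands in for the Actor's next-token distribution
q = [0.5, 0.3, 0.2]  # stands in for the Reference Model's distribution
estimate = sampled_kl(p, q, num_samples=100_000)
exact = exact_kl(p, q)
```

A single-sample value can be negative (here whenever the second or third token is drawn), even though the KL divergence itself is never negative. In PPO each position contributes exactly one such sample, so the per-token penalty is noisy but correct on average.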

We now know how to compute the KL divergence and have the Reward Model's score. A simple formula for the final immediate reward is:

$R_{t} = \begin{cases} \log \pi_{ref}(a_{t} \mid s_{t}) - \log \pi_{\theta_{old}}(a_{t} \mid s_{t}), & t < T-1 \\ \log \pi_{ref}(a_{t} \mid s_{t}) - \log \pi_{\theta_{old}}(a_{t} \mid s_{t}) + r_{\phi}(s_{T-1}), & t = T-1 \end{cases}$

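where $r_{\phi}(s_{T-1})$ is the Reward Model's score. Here $s_{T-1}$ is in effect the concatenation of the prompt and the response.

For $t<T-1$ , i.e. tokens that are not at the end of the sequence, the immediate reward is the negated KL estimate, because we want a lower KL divergence to yield a higher reward
For $t=T-1$ , i.e. the last token of the sequence, the immediate reward is the negated KL estimate plus the Reward Model's score for the whole sequence

Finally, here is a concrete implementation of compute_rewards: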

import torch


def compute_rewards(
    prompts: torch.Tensor,
    mask: torch.Tensor,
    old_logprobs: torch.Tensor,
    ref_logprobs: torch.Tensor,
    reward_scores: torch.Tensor,
    kl_ctl: float = 0.1,
    clip_reward_value: float = 5.0,
) -> torch.Tensor:
    """Compute rewards based on the reward scores and KL divergence penalty.

    - B: Batch size
    - T_prompt: Length of the prompt
    - T: Length of the response

    Args:
        prompts (torch.Tensor): Input prompt tokens of shape (B, T_prompt).
        mask (torch.Tensor): Binary mask indicating response positions of shape (B, T).
        old_logprobs (torch.Tensor): Log probabilities from the old actor model of shape (B, T).
        ref_logprobs (torch.Tensor): Log probabilities from the reference model of shape (B, T).
        reward_scores (torch.Tensor): Reward scores from the reward model of shape (B,).
        kl_ctl (float, optional): KL divergence control coefficient. Defaults to 0.1.
        clip_reward_value (float, optional): Maximum absolute value for reward clipping. Defaults to 5.0.

    Returns:
        torch.Tensor: Computed rewards of shape (B, T).
    """
    # Per-token KL estimate: log pi_old(a_t | s_t) - log pi_ref(a_t | s_t)
    kl_div = old_logprobs - ref_logprobs  # (B, T)
    # Initialize the rewards with the KL penalty, i.e. the negated, scaled KL estimate
    rewards = -kl_ctl * kl_div  # (B, T)
    # The final reward only covers the response, not the prompt
    # Start of the response (end of the prompt)
    # Prompts are padded, so every prompt in the batch has the same length
    start: int = prompts.shape[1] - 1
    # End of the response
    # Response lengths differ across the batch
    ends = start + mask[:, start:].sum(1) + 1  # (B,)
    # Clip the reward model scores to avoid overly large rewards
    reward_clip = torch.clamp(
        reward_scores, -clip_reward_value, clip_reward_value
    )  # (B,)
    batch_size = old_logprobs.shape[0]
    for i in range(batch_size):
        # rewards[i, start : ends[i]]: response part of sample i
        # reward_clip[i]: clipped reward score of sample i
        # Add the reward score to the last token of the response
        rewards[i, start : ends[i]][-1] += reward_clip[i]
    return rewards

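The implementation adds the following on top of the formula:

A penalty coefficient kl_ctl on the KL divergence
Clipping of the reward model score to prevent overly large rewards

Note

The verl implementation keeps only the core logic: it has no mask handling (possibly because the arguments are already restricted) and does not clip the reward score.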
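To see the effect of the two additions on concrete numbers, here is a single-sequence, pure-Python version of the reward formula with the coefficient and clipping applied. It is a simplified sketch: `token_rewards` is a hypothetical helper, the inputs cover only response tokens, and the log probabilities and score are invented.

```python
def token_rewards(old_logprobs, ref_logprobs, score, kl_ctl=0.1, clip_value=5.0):
    """Per-token reward: -kl_ctl * (log pi_old - log pi_ref), plus the clipped score on the last token."""
    rewards = [-kl_ctl * (old - ref) for old, ref in zip(old_logprobs, ref_logprobs)]
    clipped_score = max(-clip_value, min(clip_value, score))
    rewards[-1] += clipped_score
    return rewards


# Three response tokens; the raw score 7.0 exceeds the clip value and becomes 5.0.
rewards = token_rewards(
    old_logprobs=[-1.0, -2.0, -0.5],
    ref_logprobs=[-1.5, -2.0, -1.0],
    score=7.0,
)
```

The first and last tokens, where the Actor is more confident than the Reference Model, each pay a penalty of 0.05; the middle token, where the two agree, pays nothing; the last token additionally receives the clipped score.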

Computing Advantages and Returns

The standard PPO implementation uses the GAE advantage estimator. Recall the definition of GAE:

$\hat{A}_{t}^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^{l} \delta_{t+l}$

The infinite sum cannot be evaluated directly in practice, but it can be rewritten as a backward recursion:

$\hat{A}_{t}^{\mathrm{GAE}(\gamma,\lambda)} = \delta_{t} + \gamma \lambda \hat{A}_{t+1}^{\mathrm{GAE}(\gamma,\lambda)}$

Once a trajectory has been sampled, the TD error $\delta_{t} = r_{t} + \gamma V_{w}(s_{t+1}) - V_{w}(s_{t})$ can be computed at every time step. Consider the final step $T-1$ : generation has ended, so both the future expected return $V_{w}(s_{T})$ and the future advantage $\hat{A}_{T}^{\mathrm{GAE}(\gamma,\lambda)}$ are $0$ . Hence $\delta_{T-1} = r_{T-1} + \gamma V_{w}(s_{T}) - V_{w}(s_{T-1}) = r_{T-1} - V_{w}(s_{T-1})$ , and $\hat{A}_{T-1}^{\mathrm{GAE}(\gamma,\lambda)} = \delta_{T-1} + \gamma \lambda \cdot 0 = \delta_{T-1} = r_{T-1} - V_{w}(s_{T-1})$ . $r_{T-1}$ comes from compute_rewards and $V_{w}(s_{T-1})$ from the Critic network, so $\hat{A}_{T-1}^{\mathrm{GAE}(\gamma,\lambda)}$ can be computed. One backward sweep then yields the GAE at every time step.
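The equivalence between the backward recursion and the (truncated) sum can be verified on a short trajectory. The sketch below uses plain lists and invented rewards and values; the function names are hypothetical and the episode is assumed to end after the last reward, so $V(s_{T}) = 0$.

```python
def td_errors(rewards, values, gamma):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), with V(s_T) = 0."""
    next_values = values[1:] + [0.0]
    return [r + gamma * nv - v for r, nv, v in zip(rewards, next_values, values)]


def gae_backward(rewards, values, gamma, lam):
    """A_t = delta_t + gamma * lam * A_{t+1}, swept from t = T-1 down to 0."""
    deltas = td_errors(rewards, values, gamma)
    advantages = [0.0] * len(deltas)
    running = 0.0  # A_T = 0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages


def gae_direct(rewards, values, gamma, lam):
    """A_t = sum over l of (gamma * lam)^l * delta_{t+l}, truncated at the episode end."""
    deltas = td_errors(rewards, values, gamma)
    num_steps = len(deltas)
    return [
        sum((gamma * lam) ** l * deltas[t + l] for l in range(num_steps - t))
        for t in range(num_steps)
    ]


# Illustrative numbers only
rewards = [-0.05, 0.0, 4.95]
values = [1.0, 2.0, 3.0]
backward = gae_backward(rewards, values, gamma=0.99, lam=0.95)
direct = gae_direct(rewards, values, gamma=0.99, lam=0.95)
```

Both functions agree up to floating-point error, and the last entry equals $r_{T-1} - V_{w}(s_{T-1})$. The recursion needs one pass over the sequence, while the direct sum is quadratic in its length.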

For implementations of the advantage and return computation, see trl and verl. The core logic extracted from verl is:

import torch


def compute_gae_advantage_return(
    token_level_rewards: torch.Tensor,
    values: torch.Tensor,
    gamma: float,
    lambda_: float,
) -> tuple[torch.Tensor, torch.Tensor]:
    """Compute GAE advantages and returns for PPO training.

    Assumes all tokens in the input are valid response tokens (no masking needed).
    All padding tokens after EOS and prompt tokens should be excluded.

    - B: Batch size
    - T: Length of the response

    Args:
        token_level_rewards (torch.Tensor): Token-level rewards of shape (B, T).
        values (torch.Tensor): Value estimates of shape (B, T).
        gamma (float): Discount factor for future rewards.
        lambda_ (float): Lambda parameter for GAE bias-variance tradeoff.

    Returns:
        tuple[torch.Tensor, torch.Tensor]:
            - advantages: Normalized advantages of shape (B, T)
            - returns: Returns (advantages + values) of shape (B, T)
    """
    with torch.no_grad():
        # length of the generated tokens, denoted as T
        gen_len = token_level_rewards.shape[-1]
        nextvalues = 0  # V(s_T) = 0
        lastgaelam = 0  # A_T = 0
        advantages_reversed: list[torch.Tensor] = []  # [A_{T-1}, ..., A_0]
        for t in reversed(range(gen_len)):
            # δ_t = r_t + γ * V(s_{t+1}) - V(s_t)
            delta = (
                token_level_rewards[:, t] + gamma * nextvalues - values[:, t]
            )  # (B,)
            # A_t = δ_t + γλ * A_{t+1}
            lastgaelam = delta + gamma * lambda_ * lastgaelam  # (B,)
            nextvalues = values[:, t]  # (B,)
            advantages_reversed.append(lastgaelam)

        # advantages_reversed[::-1]: [A_{T-1}, ..., A_0] -> [A_0, ..., A_{T-1}]
        # dim=1: stack along the time dimension
        advantages = torch.stack(advantages_reversed[::-1], dim=1)

        # G_t = A_t + V(s_t)
        returns = advantages + values

        # Normalize advantages to have zero mean and unit variance
        advantages = whiten(advantages)

    return advantages, returns


def whiten(values: torch.Tensor) -> torch.Tensor:
    """Normalize values to have zero mean and unit variance.

    Args:
        values (torch.Tensor): Values to normalize.

    Returns:
        torch.Tensor: Normalized values.
    """
    mean = values.mean()
    var = values.var()
    whitened = (values - mean) * torch.rsqrt(var + 1e-8)
    return whitened

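This implementation omits token masking. Note the with torch.no_grad(): advantages and returns need no gradients, because they serve as the targets/labels for the subsequent PPO training.

Computing the PPO Losses

Before going further, recall PPO in traditional RL. PPO has two variants, PPO-Clip and PPO-Penalty. We discuss only PPO-Clip, whose Actor loss is: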

$\mathcal{L}^{\text{CLIP}}(\theta) = -\mathbb{E}_{t} \Big[\min\big( r_{t}(\theta) \hat{A}_{t}, \mathrm{clip}\big(r_{t}(\theta),1-\epsilon,1+\epsilon\big)\hat{A}_{t} \big)\Big]$

where $r_{t}(\theta)$ is the probability ratio between the new and old Actor Model for the sampled next token:

$r_{t}(\theta) = \frac{\pi_{\theta}(a_{t} \mid s_{t})}{\pi_{\theta_{old}}(a_{t} \mid s_{t})}$

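$\mathrm{clip}\big(r_{t}(\theta),1-\epsilon,1+\epsilon\big)$ is a clipping function that restricts the probability ratio $r_{t}(\theta)$ to the interval $[1-\epsilon,1+\epsilon]$ . Here $\epsilon$ is a hyperparameter that sets the size of the trust region.

In the RLHF setting, the PPO Actor loss is usually written as: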

$\mathcal{L}^{\text{CLIP}}(\theta) = -\mathbb{E}_{x \sim p} \Bigg[ \mathbb{E}_{y \sim \pi_{\theta_{old}}(\cdot \mid x)} \bigg[ \sum_{t=1}^{|y|} \min\Big( r_{t}(\theta) \hat{A}_{t}, \mathrm{clip}\big(r_{t}(\theta),1-\epsilon,1+\epsilon\big)\hat{A}_{t} \Big)\bigg] \Bigg]$
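The behavior of the clipped term is easiest to see on single scalars. The sketch below evaluates the per-token loss for the four sign combinations of "ratio above or below 1" and "advantage positive or negative"; `clipped_surrogate_loss` is a hypothetical name and $\epsilon = 0.2$ is just an example value.

```python
def clipped_surrogate_loss(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-token PPO-Clip loss: -min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return -min(ratio * advantage, clipped_ratio * advantage)


# A > 0, ratio too high: clipped at 1 + eps, so there is no incentive to raise the ratio further.
loss_up_good = clipped_surrogate_loss(1.5, 1.0)    # about -1.2
# A > 0, ratio too low: the unclipped term is smaller, so the gradient still flows.
loss_down_good = clipped_surrogate_loss(0.5, 1.0)  # -0.5
# A < 0, ratio too high: the unclipped term is smaller, so the penalty is not capped.
loss_up_bad = clipped_surrogate_loss(1.5, -1.0)    # 1.5
# A < 0, ratio too low: clipped at 1 - eps, so there is no incentive to lower the ratio further.
loss_down_bad = clipped_surrogate_loss(0.5, -1.0)  # about 0.8
```

The min makes the objective pessimistic: clipping only removes the incentive to move further in the direction the advantage already favors, and never hides a move that makes things worse.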

verl's implementation is based on Dual-Clip PPO; see the official documentation. Here we give only a standard PPO implementation:

import torch


def compute_actor_loss(
    logprobs: torch.Tensor,
    old_logprobs: torch.Tensor,
    advantages: torch.Tensor,
    mask: torch.Tensor,
    cliprange: float,
) -> torch.Tensor:
    """Compute the actor loss for PPO.

    - B: Batch size
    - T: Length of the response

    Args:
        logprobs (torch.Tensor): Log probabilities from the new actor model of shape (B, T).
        old_logprobs (torch.Tensor): Log probabilities from the old actor model of shape (B, T).
        advantages (torch.Tensor): Advantage estimates of shape (B, T).
        mask (torch.Tensor): Boolean or numeric mask tensor of shape (B, T).
        cliprange (float): Clipping range for the policy ratio.

    Returns:
        torch.Tensor: Computed actor loss, a scalar (masked_mean reduces over all dimensions).
    """
    # Compute the ratio = pi / pi_old
    ratio = torch.exp(logprobs - old_logprobs)
    pg_losses1 = -advantages * ratio  # - ratio * A
    pg_losses2 = -advantages * torch.clamp(
        ratio, 1.0 - cliprange, 1.0 + cliprange
    )  # -clip(ratio, 1-cliprange, 1+cliprange) * A
    pg_loss_max = torch.maximum(
        pg_losses1, pg_losses2
    )  # -min(ratio * A, clip(ratio, 1-cliprange, 1+cliprange) * A)
    pg_loss = masked_mean(pg_loss_max, mask)
    return pg_loss

def masked_mean(
    values: torch.Tensor,
    mask: torch.Tensor,
    dim: int | list[int] | tuple[int, ...] | None = None,
) -> torch.Tensor:
    """Compute the mean of values while applying a mask.

    Args:
        values (torch.Tensor): Input tensor.
        mask (torch.Tensor): Boolean or numeric mask tensor of the same shape as `values`.
        dim (int | list[int] | tuple[int, ...] | None, optional): Dimensions along which to compute the mean. Defaults to None.

    Returns:
        torch.Tensor: Mean of the masked values, with the same shape as `values` reduced over `dim`.
    """
    return (values * mask).sum(dim) / mask.sum(dim).clamp(min=1e-8)

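The variables and methods map to the formula as follows:

Variable/Method	Formulation	Description
`logprobs`	$\log \pi_{\theta}(a_{t} \mid s_{t})$	Log probability of the sampled token under the current Actor network
`old_logprobs`	$\log \pi_{\theta_{old}}(a_{t} \mid s_{t})$	Log probability of the sampled token under the old Actor network
`advantages`	$\hat{A}_{t}$	GAE, as an estimate of the true advantage function $A^{\pi_{old}}(s_{t},a_{t})$
`cliprange`	$\epsilon$	Clipping range applied to the probability ratio of the current Actor network
`masked_mean`	$\frac{1}{|y|}\sum_{t=1}^{|y|}$	Mean over all valid time steps of a response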

Note

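masked_mean here averages over all valid time steps, whereas the original formula only sums and does not normalize by the response length $|y|$ . For a detailed discussion see [2503.20783] Understanding R1-Zero-Like Training: A Critical Perspective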
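How much the aggregation choice matters can be seen on a two-response batch with different lengths. The sketch below compares three reductions on invented per-token losses; the function names are hypothetical, and `token_mean` mirrors what a mask-weighted mean over the whole batch computes.

```python
def token_mean(losses, masks):
    """Mean over every valid token in the batch (one weight per token)."""
    total = sum(l * m for row_l, row_m in zip(losses, masks) for l, m in zip(row_l, row_m))
    count = sum(m for row in masks for m in row)
    return total / max(count, 1)


def sequence_mean(losses, masks):
    """Mean within each response first, then mean over responses (one weight per response)."""
    per_seq = [
        sum(l * m for l, m in zip(row_l, row_m)) / max(sum(row_m), 1)
        for row_l, row_m in zip(losses, masks)
    ]
    return sum(per_seq) / len(per_seq)


def sequence_sum(losses, masks):
    """Sum within each response (no length normalization), then mean over responses."""
    per_seq = [sum(l * m for l, m in zip(row_l, row_m)) for row_l, row_m in zip(losses, masks)]
    return sum(per_seq) / len(per_seq)


# Response 0 has one valid token, response 1 has three (padded to length 3).
losses = [[1.0, 0.0, 0.0], [0.2, 0.2, 0.2]]
masks = [[1, 0, 0], [1, 1, 1]]
results = (token_mean(losses, masks), sequence_mean(losses, masks), sequence_sum(losses, masks))
```

The three reductions give roughly 0.4, 0.6, and 0.8 on this batch: dividing by length shrinks the per-token contribution of long responses, which is the length bias discussed in the paper cited in the note.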

For the Critic, the loss function is:

$\begin{aligned} \mathcal{L}(w) &= \frac{1}{2} \mathbb{E}_{t} \Big[ \big( V^{\pi_{\theta_{old}}}(s_{t}) - V_{w}(s_{t})\big)^{2} \Big] \\ &= \frac{1}{2} \mathbb{E}_{t} \Big[ \big( \hat{G}_{t}^{\lambda} - V_{w}(s_{t})\big)^{2} \Big] \\ &= \frac{1}{2} \mathbb{E}_{t} \Big[ \big( \hat{A}_{t}^{\mathrm{GAE}(\gamma,\lambda)} + V_{w_{old}}(s_{t}) - V_{w}(s_{t})\big)^{2} \Big] \end{aligned}$

In the RLHF setting, the PPO Critic loss is usually written as:

$\mathcal{L}(w) = \frac{1}{2} \mathbb{E}_{x \sim p} \Bigg[ \mathbb{E}_{y \sim \pi_{\theta_{old}}(\cdot \mid x)} \bigg[ \sum_{t=1}^{|y|} \big( \hat{A}_{t}^{\mathrm{GAE}(\gamma,\lambda)} + V_{w_{old}}(x,y_{<t}) - V_{w}(x,y_{<t})\big)^{2} \bigg] \Bigg]$

where the theoretical target $V^{\pi_{\theta_{old}}}(s_{t})$ is approximated by the $\lambda$ -return $\hat{G}_{t}^{\lambda}$ . In practice, the new prediction $V_{w}(s_{t})$ is usually clipped, much as for the Actor: it is kept within $V_{w_{old}}(s_{t}) \pm \epsilon$ . The final objective is therefore:

$\begin{aligned} \mathcal{L}(w) &= \frac{1}{2} \mathbb{E}_{\tau \sim \pi_{\theta_{old}}} \bigg[ \max\Big( \big(V_{w}(s_{t}) - \hat{G}_{t}^{\lambda}\big)^{2}, \big(\mathrm{clip}(V_{w}(s_{t}),V_{w_{old}}(s_{t}) - \epsilon,V_{w_{old}}(s_{t}) + \epsilon) - \hat{G}_{t}^{\lambda}\big)^{2} \Big) \bigg] \\ &= \frac{1}{2} \mathbb{E}_{x \sim p} \Bigg[ \mathbb{E}_{y \sim \pi_{\theta_{old}}(\cdot \mid x)} \bigg[ \max\Big( \big(V_{w}(s_{t}) - \hat{G}_{t}^{\lambda}\big)^{2}, \big(\mathrm{clip}(V_{w}(x,y_{<t}),V_{w_{old}}(x,y_{<t}) - \epsilon,V_{w_{old}}(x,y_{<t}) + \epsilon) - \hat{G}_{t}^{\lambda}\big)^{2} \Big) \bigg] \Bigg] \\ &\approx \frac{1}{2} \mathbb{E}_{x \sim p} \Bigg[ \frac{1}{|y|} \sum_{t=1}^{|y|} \max\Big( \big(V_{w}(s_{t}) - \hat{G}_{t}^{\lambda}\big)^{2}, \big(\mathrm{clip}(V_{w}(x,y_{<t}),V_{w_{old}}(x,y_{<t}) - \epsilon,V_{w_{old}}(x,y_{<t}) + \epsilon) - \hat{G}_{t}^{\lambda}\big)^{2} \Big) \Bigg] \end{aligned}$
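As with the Actor, the clipped value objective is easiest to read on scalars. The sketch below evaluates the per-token term for three predictions around an old value of 1.0 and a return of 2.0; `clipped_value_loss` is a hypothetical name and the numbers are invented.

```python
def clipped_value_loss(value: float, old_value: float, ret: float, eps: float = 0.2) -> float:
    """Per-token term: 0.5 * max((V - G)^2, (clip(V, V_old - eps, V_old + eps) - G)^2)."""
    clipped_value = max(old_value - eps, min(old_value + eps, value))
    return 0.5 * max((value - ret) ** 2, (clipped_value - ret) ** 2)


old_value, ret = 1.0, 2.0
# Inside the clip range: both branches coincide.
loss_inside = clipped_value_loss(1.1, old_value, ret)    # 0.5 * 0.9^2
# Moved far toward the return: the clipped branch wins, so the loss no longer depends on V.
loss_overshoot = clipped_value_loss(1.8, old_value, ret)  # 0.5 * 0.8^2
# Moved away from the return: the unclipped branch wins, so the error is fully penalized.
loss_away = clipped_value_loss(0.5, old_value, ret)       # 0.5 * 1.5^2
```

The max keeps the larger of the two errors, so clipping limits how far a single update can pull the prediction toward the target but never reduces the penalty for moving away from it.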

Following verl's implementation, the code is:

import torch


# masked_mean is the helper defined alongside compute_actor_loss above
def compute_critic_loss(
    values: torch.Tensor,
    old_values: torch.Tensor,
    returns: torch.Tensor,
    mask: torch.Tensor,
    cliprange: float,
) -> torch.Tensor:
    """Compute the critic loss for PPO.

    - B: Batch size
    - T: Length of the response

    Args:
        values (torch.Tensor): Value estimates of shape (B, T).
        old_values (torch.Tensor): Old value estimates of shape (B, T).
        returns (torch.Tensor): Returns of shape (B, T).
        mask (torch.Tensor): Boolean or numeric mask tensor of shape (B, T).
        cliprange (float): Clipping range for the value function.

    Returns:
        torch.Tensor: Computed critic loss, a scalar (masked_mean reduces over all dimensions).
    """
    clipped_values = torch.clamp(
        values, old_values - cliprange, old_values + cliprange
    )  # clip(V, V_old - cliprange, V_old + cliprange)
    vf_losses1 = torch.square(values - returns)  # (V - G)^2
    vf_losses2 = torch.square(clipped_values - returns)  # (clip(V) - G)^2
    vf_loss_max = torch.max(
        vf_losses1, vf_losses2
    )  # max((V - G)^2, (clip(V) - G)^2)
    vf_loss = 0.5 * masked_mean(vf_loss_max, mask)
    return vf_loss

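The variables and methods map to the formula as follows:

Variable/Method	Formulation	Description
`values`	$V_{w}(s_{t})$	State value predicted by the current Critic network
`old_values`	$V_{w_{old}}(s_{t})$	State value predicted by the old Critic network
`returns`	$\hat{G}_{t}^{\lambda}$	$\lambda$ -return, as an estimate of the true state value $V^{\pi_{\theta_{old}}}(s_{t})$
`cliprange`	$\epsilon$	Clipping range applied to the prediction of the current Critic network
`masked_mean`	$\frac{1}{|y|}\sum_{t=1}^{|y|}$	Mean over all valid time steps of a response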

Reference

Original papers:

PPO in traditional RL: [1707.06347] Proximal Policy Optimization Algorithms
OpenAI's introduction of PPO to RLHF: [2009.01325] Learning to summarize from human feedback

Articles that explain RLHF-PPO include: