本文主要摘录自 OpenAI Spinning Up Documentation Part 1: Key Concepts in RL,忽略了对具体实现的讲解,同时对回报、价值函数等概念进行了更规范的形式化表述。此外,对 Bellman Equations 展开了详尽的推导。A (Long) Peek into Reinforcement Learning | Lil’Log 一文对这些概念有更深入的理论讲解。
Overview of Reinforcement Learning
强化学习(Reinforcement Learning, RL)中的主要角色是智能体(Agent) 和 环境(Environment) 。agent 位于环境中,且与环境进行交互。在每一步交互中,
agent 会观测当前环境状态(State)的(可能不完整的)部分信息,然后决定采取一个动作(Action)
agent 的动作会改变环境,但环境也可能自行发生变化。同时,环境会给 agent 一个奖励(Reward),用来告诉它当前状态的好坏。
强化学习的目标在于,学习如何采取行动,以最大化其累计奖励,也称为回报(Return)。
States and Observations
状态(State)$s$:对环境状态的完整描述
观测(Observation)$o$:对环境状态的部分描述,可能会遗漏一些信息
Action Spaces
不同的环境允许不同种类的动作。给定环境中所有有效动作的集合通常称为动作空间(Action Space),表示为 $\mathcal{A}$。动作空间可能是离散的(如围棋),也可能是连续的(如机器人控制)。
Policies
策略(Policy)是 agent 用于决定动作的规则。它可以是确定性的(Deterministic Policy),即给定状态总是选择相同动作,表示为 $a_{t} = \mu_{\theta}(s_{t})$;也可以是随机性的(Stochastic Policy),即根据概率分布采样动作,表示为 $A_{t} \sim \pi_{\theta}(\cdot \mid s_{t})$。具体地,随机性策略可表示为:
$$\pi(a \mid s) = \mathbb{P}(A = a \mid S = s)$$
其中:
$A = a$ 表示动作随机变量 $A$ 当前的观测值为 $a$,$S = s$ 同理
$\mathbb{P}(\cdot)$ 表示概率分布
在深度强化学习中,策略通常通过神经网络参数化(见下方示例):
确定性策略:$a_{t} = \mu_{\theta}(s_{t})$
随机性策略:$A_{t} \sim \pi_{\theta}(\cdot \mid s_{t})$
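下面给出一个极简的示意(假设性的示例,用一个随机初始化的线性层代替真实的神经网络参数 $\theta$,并非任何库的标准 API):

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, num_actions = 4, 3
theta = rng.normal(size=(state_dim, num_actions))  # 用线性层近似“神经网络参数 θ”

def deterministic_policy(s):
    """确定性策略 mu_theta(s):对同一状态总是返回同一个动作"""
    return int(np.argmax(s @ theta))

def stochastic_policy(s):
    """随机性策略 pi_theta(.|s):先算出 softmax 分布,再从中采样动作"""
    logits = s @ theta
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(num_actions, p=probs))

s = rng.normal(size=state_dim)      # 一个假设的状态观测
print(deterministic_policy(s))      # 同一状态重复调用,结果不变
print(stochastic_policy(s))         # 同一状态重复调用,动作可能不同
```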
Trajectories
轨迹(Trajectory, a.k.a. Episode or Rollout)$\tau$ 是由环境中的状态和动作构成的序列:
$$\tau=(S_{0}, A_{0}, S_{1}, A_{1}, \ldots)$$
其中,初始状态 $S_{0}$ 从初始状态概率分布 $\rho_{0}(\cdot)$ 中采样得到。
在具体的交互过程中,轨迹通常是有限的,$T$-step 轨迹可以表示如下:
$$\tau=(S_{0}, A_{0}, S_{1}, A_{1}, \ldots, S_{T-1}, A_{T-1}, S_{T})$$
State Transitions
状态转移(State Transitions)描述环境是如何从状态 $s_{t}$ 转移到下一状态 $s_{t+1}$ 的。状态转移可以是确定性的:
$$s_{t+1} = f(s_{t}, a_{t})$$
也可以是随机性的:
$$S_{t+1} \sim P(\cdot \mid s_{t}, a_{t})$$
状态转移函数由环境决定,仅依赖于最近的动作 $a_{t}$ 与状态 $s_{t}$。
Summary: Two Sources of Randomness
总结当前系统中出现的两个随机性来源(见下方的交互循环示例):
动作的随机性:$A \sim \pi(\cdot \mid s)$
状态的随机性:$S_{t+1} \sim P(\cdot \mid s_{t}, a_{t})$
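结合上面两点,可以用一个最小的交互循环把这两处随机性直观地展示出来(示意性质:环境用随机生成的转移概率表代替,策略为均匀随机策略,均为假设对象):

```python
import numpy as np

rng = np.random.default_rng(0)
num_states, num_actions = 5, 2

# 假设的环境:P[s, a] 是采取动作 a 后下一状态的分布(随机性来源之二)
P = rng.dirichlet(np.ones(num_states), size=(num_states, num_actions))

def policy(s):
    """假设的随机策略 pi(.|s):这里简单地均匀采样动作(随机性来源之一)"""
    return int(rng.integers(num_actions))

s = 0                                            # 初始状态,对应 S_0 ~ rho_0(.)
trajectory = []
for t in range(10):
    a = policy(s)                                # A_t ~ pi(.|s_t)
    s_next = int(rng.choice(num_states, p=P[s, a]))  # S_{t+1} ~ P(.|s_t, a_t)
    trajectory.append((s, a))
    s = s_next
print(trajectory)
```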
Reward and Return
Reward
即时奖励(Reward)$R_{t}$ 为 agent 提供反馈,表明当前状态 $s_{t}$ 或动作 $a_{t}$ 的好坏。奖励函数(Reward Function)$R$ 依赖于当前状态 $s_{t}$、动作 $a_{t}$ 与下一状态 $s_{t+1}$,可表示为:
$$R_{t} = R(s_{t}, a_{t}, s_{t+1})$$
但也可以简化为:
仅依赖当前状态:$R_{t} = R(s_{t})$
依赖状态 - 动作对:$R_{t} = R(s_{t}, a_{t})$
此时,我们可以更新 $T$-step 轨迹 $\tau$ 的表示,将即时奖励加入到轨迹中:
$$\tau = (S_{0},A_{0},R_{0},S_{1},A_{1},\ldots,S_{T-1},A_{T-1},R_{T-1},S_{T})$$
Return
强化学习的目标是最大化某种累计奖励,即:回报(Return)。回报可以表示成多种形式,如:
1. finite-horizon undiscounted return. 在固定步长 $T$ 内获得奖励的总和:
$$G_{t:t+T} = R_{t} + R_{t+1} + \cdots + R_{t+T-1} = \sum_{k=0}^{T-1} R_{t+k}$$
2. infinite-horizon discounted return. 从时间 $t$ 开始的所有折扣奖励之和:
$$G_{t}^{(\infty)} = R_{t} + \gamma R_{t+1} + \gamma^{2} R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k}$$
其中,$\gamma \in [0,1]$ 是折扣因子。折扣因子既具有直观意义("现在的钱比未来的钱更值钱"),又在数学上使得无限步长奖励和在合理条件下收敛,便于处理。
上文的有限步/无限步、有无折扣这两个条件可以任意组合。在理论推导中,通常使用 $G_{t}$,默认表示无限步折扣奖励和;而在实际应用中,一般为有限步折扣奖励和。
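作为示意,下面用一小段代码分别计算这两种回报(假设已经收集到一条有限长的奖励序列;折扣回报按截断后的序列计算):

```python
import numpy as np

def finite_horizon_return(rewards):
    """有限步无折扣回报:G_{t:t+T} = sum_{k=0}^{T-1} R_{t+k}"""
    return float(np.sum(rewards))

def discounted_return(rewards, gamma=0.99):
    """折扣回报:G_t = sum_k gamma^k R_{t+k},从后往前累积 G_t = R_t + gamma * G_{t+1}"""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 0.5, 1.0]
print(finite_horizon_return(rewards))   # 2.5
print(discounted_return(rewards))       # 约 2.46
```

其中,从后往前累积的写法正是后文会用到的递归形式 $G_{t} = R_{t} + \gamma G_{t+1}$。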
Value Functions
Action-Value Function
On-Policy Action-Value Function 给出:在状态 $s_{t}$ 下采取动作 $a_{t}$,之后始终按照策略 $\pi$ 行动,所能获得的期望回报:
$$Q^{\pi}(s_{t},a_{t}) = \mathbb{E}_{\pi} \big[G_{t} \mid S_{t} = s_{t}, A_{t} = a_{t}\big]$$
该函数
依赖于当前状态 $s_{t}$、动作 $a_{t}$、策略 $\pi$ 以及状态转移函数 $P$
与后续状态 $S_{t+1},\ldots$ 和动作 $A_{t+1},\ldots$ 的具体取值无关(它们已在期望中被平均掉)
Optimal Action-Value Function 给出:在状态 $s_{t}$ 下采取动作 $a_{t}$,之后始终按照环境中的最优策略行动,所能获得的期望回报:
$$Q^{*}(s_{t},a_{t}) = \max_{\pi} Q^{\pi}(s_{t},a_{t})$$
State-Value Function
On-Policy State-Value Function 给出:在状态 $s_{t}$ 下,始终按照策略 $\pi$ 行动,所能获得的期望回报:
$$V^{\pi}(s_{t}) = \mathbb{E}_{\pi}[G_{t} \mid S_{t} = s_{t}]$$
Optimal State-Value Function 给出:在状态 $s_{t}$ 下,始终按照环境中的最优策略行动,所能获得的最大期望回报:
$$V^{*}(s_{t}) = \max_{\pi} V^{\pi}(s_{t})$$
Connections Between Action-Value Function and State-Value Function
首先,$V^{\pi}(s_{t})$ 可以视为当前时刻 $t$ 所有可能动作的价值 $Q^{\pi}(s_{t},a_{t})$ 的加权平均,权重由策略 $\pi(a_{t} \mid s_{t})$ 决定。对于离散动作空间:
$$V^{\pi}(s_{t}) = \mathbb{E}_{A_{t} \sim \pi(\cdot \mid s_{t})}\big[Q^{\pi}(S_{t},A_{t}) \mid S_{t}=s_{t}\big] = \sum_{a_{t} \in \mathcal{A}} \pi(a_{t} \mid s_{t}) \cdot Q^{\pi}(s_{t},a_{t})$$
其中,$\mathcal{A}$ 为所有有效动作的集合。而对于连续动作空间,则为:
$$V^{\pi}(s_{t}) = \mathbb{E}_{A_{t} \sim \pi(\cdot \mid s_{t})}\big[Q^{\pi}(S_{t},A_{t}) \mid S_{t}=s_{t}\big] = \int \pi(a_{t} \mid s_{t}) \cdot Q^{\pi}(s_{t},a_{t}) \, da_{t}$$
在 Bellman Equations 一节,会对该结论给出详尽的推导。
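以离散动作空间为例,这一加权平均用代码写出来非常直接(示意性质,$\pi$ 与 $Q^{\pi}$ 都用随机生成的表格代替,均为假设对象):

```python
import numpy as np

rng = np.random.default_rng(0)
num_states, num_actions = 4, 3

pi = rng.dirichlet(np.ones(num_actions), size=num_states)  # pi[s, a] = pi(a|s)
Q = rng.normal(size=(num_states, num_actions))             # 假设已知的 Q^pi(s, a)

# V^pi(s) = sum_a pi(a|s) * Q^pi(s, a):按策略概率对动作价值加权平均
V = (pi * Q).sum(axis=1)
print(V)
```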
其次,$V^{*}(s_{t})$ 可看作是从状态 $s_{t}$ 开始,通过选择最佳动作 $a_{t}$ 所能实现的最大期望回报:
$$V^{*}(s_{t}) = \max_{a_{t}} Q^{*}(s_{t},a_{t})$$
换句话说,如果已知 $Q^{*}$,那么当前时刻最优动作可直接得到:
$$a^{*}_{t}(s_{t}) = \arg\max_{a_{t}} Q^{*}(s_{t},a_{t})$$
Advantage Functions
在强化学习中,有时我们不需要绝对地描述某个动作有多好,而只关心它相对于平均水平的优势,即该动作比按照策略随机选择的动作更好多少。为此,我们定义优势函数(Advantage Function):
$$A^{\pi}(s_{t},a_{t}) = Q^{\pi}(s_{t},a_{t}) - V^{\pi}(s_{t})$$
在 Bellman Equations 一节,会进一步深入探讨该优势函数。
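沿用上文的表格化记号,优势函数可以直接由 $Q^{\pi}$ 与 $V^{\pi}$ 相减得到(示意性质,数据为随机生成的假设对象):

```python
import numpy as np

rng = np.random.default_rng(0)
num_states, num_actions = 4, 3

pi = rng.dirichlet(np.ones(num_actions), size=num_states)  # pi(a|s)
Q = rng.normal(size=(num_states, num_actions))             # Q^pi(s, a)
V = (pi * Q).sum(axis=1)                                   # V^pi(s) = E_a[Q^pi(s, a)]

A = Q - V[:, None]                  # A^pi(s, a) = Q^pi(s, a) - V^pi(s)
print((pi * A).sum(axis=1))         # 按策略加权的平均优势恒为 0(数值上接近 0)
```

可以顺带验证一个有用的性质:$\mathbb{E}_{a \sim \pi(\cdot \mid s)}[A^{\pi}(s,a)] = 0$,即优势函数衡量的正是"相对于策略平均水平"的好坏。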
Markov Decision Processes
Agent 与环境的交互过程可以形式化为一个马尔可夫决策过程。一个马尔可夫决策过程(Markov Decision Process, MDP)是一个五元组 $\langle \mathcal{S},\mathcal{A},P,R,\gamma \rangle$,其中:
$\mathcal{S}$:所有有效状态的集合
$\mathcal{A}$:所有有效动作的集合
$P:\mathcal{S} \times \mathcal{A} \to \mathbb{P}(\mathcal{S})$:状态转移概率函数,其中 $P(s^{\prime} \mid s,a)$ 表示从状态 $s$ 出发、采取动作 $a$ 后转移到状态 $s^{\prime}$ 的概率
$R:\mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$:奖励函数,对应 $R_{t} = R(s_{t},a_{t},s_{t+1})$
$\gamma \in [0,1]$:折扣因子
MDP 这一名称表明系统满足 Markov Property:即状态转移只依赖于最近的状态和动作,而与之前的历史无关。这一性质会在 Bellman Equations 的推导中起到重要作用。
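在有限状态、有限动作的情形下,一个 MDP 可以直接用几个数组表示。下面是一个极简的示意(随机生成的假设 MDP,仅用于说明五元组各分量的形状):

```python
import numpy as np

rng = np.random.default_rng(0)

num_states, num_actions = 3, 2     # |S| = 3, |A| = 2
gamma = 0.9                        # 折扣因子

# P[s, a, s'] = P(s'|s, a):每个 (s, a) 对应一个关于下一状态的概率分布
P = rng.dirichlet(np.ones(num_states), size=(num_states, num_actions))
# R[s, a, s'] = R(s, a, s'):奖励函数
R = rng.normal(size=(num_states, num_actions, num_states))

# 合法性检查:每个 (s, a) 的转移概率之和应为 1
assert np.allclose(P.sum(axis=-1), 1.0)
```

后文关于 Bellman Equations 的代码示意会继续沿用这种表格化表示。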
Bellman Equations
Overview of Bellman Equations
Bellman Equations 概括起来为:一个状态的价值,等于从该状态获得的期望即时奖励,加上下一步所到达的后继状态价值(折扣后)的期望。形式化表述如下:
$$
\begin{aligned}
V^{\pi}(s) &= \mathbb{E}_{a \sim \pi(\cdot \mid s)} \Big[\mathbb{E}_{s^{\prime} \sim P(\cdot \mid s,a)}\big[R_{t} + \gamma V^{\pi}(s^{\prime})\big]\Big] \\
Q^{\pi}(s,a) &= \mathbb{E}_{s^{\prime} \sim P(\cdot \mid s,a)}\big[ R_{t} + \gamma V^{\pi}(s^{\prime}) \big]
\end{aligned}
$$
之后为了简洁,我们通常直接使用 $a$、$s$、$s^{\prime}$ 来代替之前的 $a_{t}$、$s_{t}$、$s_{t+1}$。
为了推导 Bellman Equations,我们从 infinite-horizon discounted return $G_{t}$ 的定义出发,将其写成递归形式:
$$
\begin{aligned}
G_{t} &= R_{t} + \gamma R_{t+1} + \gamma^{2} R_{t+2} + \ldots \\
&= R_{t} + \gamma (R_{t+1} + \gamma R_{t+2} + \ldots) \\
&= R_{t} + \gamma G_{t+1}
\end{aligned}
$$
直观地理解,当前时刻 $t$ 的回报 $G_{t}$ 等于即时奖励 $R_{t}$ 加上下一时刻 $t+1$ 的折扣回报 $\gamma G_{t+1}$。
Bellman Equation for State-Value Function
从状态价值函数的定义出发:$V^{\pi}(s)$ 表示在状态 $s$ 下,始终按照策略 $\pi$ 行动,所能获得的期望回报:
$$V^{\pi}(s) = \mathbb{E}_{\pi}[G_{t} \mid S_{t} = s]$$
将 $G_{t} = R_{t} + \gamma G_{t+1}$ 代入上式得:
$$
\begin{aligned}
V^{\pi}(s) &= \mathbb{E}_{\pi}[G_{t} \mid S_{t} = s] \\
&= \mathbb{E}_{\pi}[R_{t} + \gamma G_{t+1} \mid S_{t} = s] \\
&= \mathbb{E}_{\pi}[R_{t} \mid S_{t} = s] + \gamma \mathbb{E}_{\pi}[G_{t+1} \mid S_{t} = s]
\end{aligned}
$$
该公式将 $V^{\pi}(s)$ 分解成两部分:即时奖励的期望 与 未来回报的期望 。
第一项:即时奖励的期望
要计算从状态 $s$ 出发能获得的期望即时奖励,需要考虑两重不确定性:
首先,Agent 会根据策略 $\pi(\cdot \mid s)$ 随机选择一个动作 $a$
然后,环境根据状态转移概率 $P(\cdot \mid s,a)$ 转移到一个新的状态 $s^{\prime}$,并给出即时奖励 $R_{t} = R(s,a,s^{\prime})$
因此,需要对所有可能的动作 以及所有可能的下一状态 进行加权平均,权重分别是策略概率与转移概率。以离散空间为例,推导如下:
$$
\begin{aligned}
\mathbb{E}_{\pi}[R_{t} \mid S_{t} = s] &= \sum_{a \in \mathcal{A}} \pi(a \mid s) \, \mathbb{E}_{\pi}[R_{t} \mid S_{t} = s, A_{t} = a] \\
&= \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s^{\prime} \in \mathcal{S}} P(s^{\prime} \mid s, a) R(s, a, s^{\prime}) \\
&= \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s^{\prime} \in \mathcal{S}} P(s^{\prime} \mid s, a) R_{t}
\end{aligned}
$$
第二项:未来回报的期望
要计算从状态 $s$ 出发,下一时刻的期望回报,同样需要考虑动作 $a$ 与下一状态 $s^{\prime}$ 的不确定性。此外,根据马尔可夫性质,一旦我们知道了下一状态是 $s^{\prime}$,那么从 $s^{\prime}$ 出发能获得的未来期望回报就完全由 $V^{\pi}(s^{\prime})$ 定义,而与之前的状态 $s$ 和动作 $a$ 无关,即 $\mathbb{E}_{\pi}[G_{t+1} \mid S_{t+1} = s^{\prime}] = V^{\pi}(s^{\prime})$。具体推导如下:
$$
\begin{aligned}
\mathbb{E}_{\pi}[G_{t+1} \mid S_{t} = s] &= \sum_{s^{\prime} \in \mathcal{S}} \mathbb{E}_{\pi}[G_{t+1} \mid S_{t} = s, S_{t+1} = s^{\prime}] \, P(s^{\prime} \mid s) \\
&= \sum_{s^{\prime} \in \mathcal{S}} \mathbb{E}_{\pi}[G_{t+1} \mid S_{t+1} = s^{\prime}] \, P(s^{\prime} \mid s) \\
&= \sum_{s^{\prime} \in \mathcal{S}} V^{\pi}(s^{\prime}) P(s^{\prime} \mid s) \\
&= \sum_{s^{\prime} \in \mathcal{S}} V^{\pi}(s^{\prime}) \sum_{a \in \mathcal{A}} P(s^{\prime} \mid s,a) \, \pi(a \mid s) \\
&= \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s^{\prime} \in \mathcal{S}} P(s^{\prime} \mid s,a) V^{\pi}(s^{\prime})
\end{aligned}
$$
最后两行调换求和顺序,是为了方便后续合并。此外,如果不对 $P(s^{\prime} \mid s)$ 继续展开,还可以得到:
$$
\begin{aligned}
\mathbb{E}_{\pi}[G_{t+1} \mid S_{t} = s] &= \sum_{s^{\prime} \in \mathcal{S}} V^{\pi}(s^{\prime}) P(s^{\prime} \mid s) \\
&= \mathbb{E}_{\pi}[V^{\pi}(s^{\prime}) \mid S_{t} = s]
\end{aligned}
$$
该结论会在 Bellman Equation for Action-Value Function 中,推导未来回报期望时用到。
合并两项
$$
\begin{aligned}
V^{\pi}(s) &= \mathbb{E}_{\pi}[R_{t} \mid S_{t} = s] + \gamma \mathbb{E}_{\pi}[G_{t+1} \mid S_{t} = s] \\
&= \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s^{\prime} \in \mathcal{S}} P(s^{\prime} \mid s, a) \big(R_{t} + \gamma V^{\pi}(s^{\prime})\big) \\
&= \sum_{a \in \mathcal{A}} \pi(a \mid s) \, \mathbb{E}_{s^{\prime} \sim P(\cdot \mid s,a)}\big[R_{t} + \gamma V^{\pi}(s^{\prime})\big] \\
&= \mathbb{E}_{a \sim \pi(\cdot \mid s)} \Big[\mathbb{E}_{s^{\prime} \sim P(\cdot \mid s,a)}\big[R_{t} + \gamma V^{\pi}(s^{\prime})\big]\Big]
\end{aligned}
$$
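这条等式本身就可以当作一个不动点迭代来用:反复把右边代入左边,就得到迭代式策略评估(iterative policy evaluation)。下面是一个极简示意(沿用前文随机生成的表格化 MDP 与策略,均为假设对象):

```python
import numpy as np

rng = np.random.default_rng(0)
num_states, num_actions, gamma = 3, 2, 0.9

P = rng.dirichlet(np.ones(num_states), size=(num_states, num_actions))  # P(s'|s,a)
R = rng.normal(size=(num_states, num_actions, num_states))              # R(s,a,s')
pi = rng.dirichlet(np.ones(num_actions), size=num_states)               # pi(a|s)

V = np.zeros(num_states)
for _ in range(1000):
    # V(s) <- sum_a pi(a|s) sum_s' P(s'|s,a) [R(s,a,s') + gamma * V(s')]
    V_new = np.einsum("sa,sap,sap->s", pi, P, R + gamma * V[None, None, :])
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
print(V)   # V^pi 的数值近似
```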
Bellman Equation for Action-Value Function
$Q^{\pi}(s,a)$ 的 Bellman Equation 推导过程非常类似,但更简单一些,因为假定了第一个动作 $a$ 已知。仍然从 $Q^{\pi}(s,a)$ 的定义出发,代入 $G_{t} = R_{t} + \gamma G_{t+1}$:
$$
\begin{aligned}
Q^{\pi}(s,a) &= \mathbb{E}_{\pi} \big[G_{t} \mid S_{t} = s, A_{t} = a\big] \\
&= \mathbb{E}_{\pi} \big[R_{t} + \gamma G_{t+1} \mid S_{t} = s, A_{t} = a\big] \\
&= \mathbb{E}_{\pi} \big[R_{t} \mid S_{t} = s, A_{t} = a\big] + \gamma \mathbb{E}_{\pi} \big[G_{t+1} \mid S_{t} = s, A_{t} = a\big]
\end{aligned}
$$
由于动作 $a$ 已知,所以不确定性只剩下环境的状态转移 $P(\cdot \mid s,a)$。因此,我们只需要对所有可能的下一状态 $s^{\prime}$ 求期望。上式仍然将 $Q^{\pi}(s,a)$ 分解成两部分:即时奖励的期望 与 未来回报的期望 。
第一项:即时奖励的期望
$$
\begin{aligned}
\mathbb{E}_{\pi} \big[R_{t} \mid S_{t} = s, A_{t} = a\big] &= \sum_{s^{\prime} \in \mathcal{S}} P(s^{\prime} \mid s,a) R(s,a,s^{\prime}) \\
&= \sum_{s^{\prime} \in \mathcal{S}} P(s^{\prime} \mid s,a) R_{t}
\end{aligned}
$$
第二项:未来回报的期望
$$
\begin{aligned}
\mathbb{E}_{\pi} \big[G_{t+1} \mid S_{t} = s, A_{t} = a\big] &= \sum_{s^{\prime} \in \mathcal{S}} \mathbb{E}_{\pi} \big[G_{t+1} \mid S_{t} = s, A_{t} = a, S_{t+1} = s^{\prime}\big] P(s^{\prime} \mid s,a) \\
&= \sum_{s^{\prime} \in \mathcal{S}} \mathbb{E}_{\pi} \big[G_{t+1} \mid S_{t+1} = s^{\prime}\big] P(s^{\prime} \mid s,a) \\
&= \sum_{s^{\prime} \in \mathcal{S}} V^{\pi}(s^{\prime}) P(s^{\prime} \mid s,a)
\end{aligned}
$$
这里第二个等式利用马尔可夫性质:在给定 $S_{t+1} = s^{\prime}$ 的条件下,$G_{t+1}$ 与更早的状态 $s$ 和动作 $a$ 无关。第三个等式与 Bellman Equation for State-Value Function 中推导未来回报期望时相同,利用了 $\mathbb{E}_{\pi}[G_{t+1} \mid S_{t+1} = s^{\prime}] = V^{\pi}(s^{\prime})$。
合并两项
$$
\begin{aligned}
Q^{\pi}(s,a) &= \mathbb{E}_{\pi} \big[R_{t} \mid S_{t} = s, A_{t} = a\big] + \gamma \mathbb{E}_{\pi} \big[G_{t+1} \mid S_{t} = s, A_{t} = a\big] \\
&= \sum_{s^{\prime} \in \mathcal{S}} P(s^{\prime} \mid s,a) R_{t} + \gamma \sum_{s^{\prime} \in \mathcal{S}} P(s^{\prime} \mid s,a) V^{\pi}(s^{\prime}) \\
&= \sum_{s^{\prime} \in \mathcal{S}} P(s^{\prime} \mid s,a) \big(R_{t} + \gamma V^{\pi}(s^{\prime})\big) \\
&= \mathbb{E}_{s^{\prime} \sim P(\cdot \mid s,a)}\big[ R_{t} + \gamma V^{\pi}(s^{\prime}) \big]
\end{aligned}
$$
至此,我们已经推导出了 $V^{\pi}(s)$ 与 $Q^{\pi}(s,a)$ 的 Bellman Equations。观察一下,可以将 $Q^{\pi}(s,a)$ 继续代入到 $V^{\pi}(s)$ 中进行推导:
$$
\begin{aligned}
V^{\pi}(s) &= \mathbb{E}_{a \sim \pi(\cdot \mid s)} \Big[\mathbb{E}_{s^{\prime} \sim P(\cdot \mid s,a)}\big[R_{t} + \gamma V^{\pi}(s^{\prime})\big]\Big] \\
&= \mathbb{E}_{a \sim \pi(\cdot \mid s)} \big[ Q^{\pi}(s,a) \big] \\
&= \sum_{a \in \mathcal{A}} \pi(a \mid s) \, Q^{\pi}(s,a)
\end{aligned}
$$
这与我们在 Connections Between Action-Value Function and State-Value Function 一节直观上得到的结论一致:一个状态的价值,是该状态下所有可能动作的动作价值,按照策略 $\pi$ 的概率进行的加权平均。
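把这两条 Bellman Equations 放到同一个表格化 MDP 上,可以做一个数值上的自洽性检查(示意性质,MDP、策略均为随机生成的假设对象;$V^{\pi}$ 用上文的不动点迭代近似):

```python
import numpy as np

rng = np.random.default_rng(0)
num_states, num_actions, gamma = 3, 2, 0.9

P = rng.dirichlet(np.ones(num_states), size=(num_states, num_actions))  # P(s'|s,a)
R = rng.normal(size=(num_states, num_actions, num_states))              # R(s,a,s')
pi = rng.dirichlet(np.ones(num_actions), size=num_states)               # pi(a|s)

# 不动点迭代得到 V^pi
V = np.zeros(num_states)
for _ in range(2000):
    V = np.einsum("sa,sap,sap->s", pi, P, R + gamma * V[None, None, :])

# Q^pi(s,a) = E_{s'~P(.|s,a)}[R(s,a,s') + gamma * V^pi(s')]
Q = np.einsum("sap,sap->sa", P, R + gamma * V[None, None, :])

# 验证 V^pi(s) = sum_a pi(a|s) Q^pi(s,a)
print(np.allclose(V, (pi * Q).sum(axis=1)))   # True(数值误差范围内)
```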
Temporal-Difference (TD) Error
推导出 $V^{\pi}(s)$ 与 $Q^{\pi}(s,a)$ 的 Bellman Equations 之后,我们可以进一步推导优势函数:
$$
\begin{aligned}
A^{\pi}(s,a) &= Q^{\pi}(s,a) - V^{\pi}(s) \\
&= \mathbb{E}_{s^{\prime} \sim P(\cdot \mid s,a)}\big[ \underbrace{R_{t} + \gamma V^{\pi}(s^{\prime})}_{\text{TD Target}} \big] - \mathbb{E}_{s^{\prime} \sim P(\cdot \mid s,a)}\big[ V^{\pi}(s) \big] \\
&= \mathbb{E}_{s^{\prime} \sim P(\cdot \mid s,a)}\big[ \underbrace{R_{t} + \gamma V^{\pi}(s^{\prime}) - V^{\pi}(s)}_{\text{TD Error}} \big]
\end{aligned}
$$
上式中,$V^{\pi}(s)$ 可以直接写成 $\mathbb{E}_{s^{\prime} \sim P(\cdot \mid s,a)}\big[V^{\pi}(s)\big]$,是因为 $V^{\pi}(s)$ 只依赖于确定的 $s$,而与 $s^{\prime}$ 无关。在实际算法中,通常使用神经网络来近似状态价值函数,表示为 $V_{w}(s)$,其中 $w$ 为神经网络参数。因此,优势函数的近似 $\hat{A}_{t}$ 可以表示如下:
$$\hat{A}_{t} = \underbrace{\overbrace{r_{t} + \gamma V_{w}(s_{t+1})}^{\text{TD Target}} - V_{w}(s_{t})}_{\text{TD Error}}$$
这里我们暂时不讨论偏差与方差的问题,放到具体算法的文章中讨论。
其中,$V_{w}(s_{t})$ 是我们对当前状态价值的原始估计 ,而 $r_{t} + \gamma V_{w}(s_{t+1})$ 是 Agent 执行一步动作 $a_{t}$、观察到奖励 $r_{t}$ 与下一状态 $s_{t+1}$ 之后,对当前状态价值的更优估计 (可作为 TD Target)。而两者之差,称为 TD Error,通常表示为:
$$\delta_{t} = r_{t} + \gamma V_{w}(s_{t+1}) - V_{w}(s_{t})$$
直观地理解,TD Error 就是 当前的预测 与 一个更靠谱的、基于下一步真实情况的新预测 之间的差距。
这个差距可以当做学习的信号。如果差距很大,说明之前的预测很不准,需要大幅调整;如果差距很小,说明预测得还不错,稍作微调即可。这也引出了 TD Learning,核心思想在于用下一步的预测来纠正当前的预测 (见下方的 TD(0) 更新示意)。
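下面给出一个最小的 tabular TD(0) 更新示意(用表格代替神经网络 $V_{w}$;转移数据 (s, r, s') 假设已从环境中采样得到,学习率 alpha 为假设的超参数):

```python
import numpy as np

num_states, gamma, alpha = 5, 0.99, 0.1
V = np.zeros(num_states)            # 用表格近似 V_w

def td0_update(s, r, s_next, done):
    """执行一步 TD(0) 更新:V(s) <- V(s) + alpha * delta_t"""
    target = r + (0.0 if done else gamma * V[s_next])   # TD Target
    delta = target - V[s]                               # TD Error
    V[s] += alpha * delta
    return delta

# 假设观测到一条转移 (s=0, r=1.0, s'=1, 未终止)
print(td0_update(0, 1.0, 1, False))  # 首次更新时 delta = 1.0
print(V[0])                          # 0.1
```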
更直观地,想象一下你每天开车上班,想预测全程需要多长时间。
第一天(刚出门) :你凭感觉猜测 :" 嗯,今天路况不错,估计全程要 30 分钟 。"
开了 10 分钟后(到达了中途的 A 点) :你发现路上有个小事故,有点堵车。你看了看导航,它根据当前路况重新预测,从 A 点到公司还需要 25 分钟 。
这时,你有了一个新的、更靠谱的预测 :已经花费的 10 分钟 (真实成本) + 导航预测的未来 25 分钟 (新的未来预测) = 总共 35 分钟 。
这 35 分钟 就是我们所说的 TD Target 。它更靠谱,因为它包含了一部分真实发生的情况(已经开的 10 分钟)。
计算误差 :现在,你拿新的预测和旧的预测一比较:TD Error = (已花费的 10 分钟 + 未来还需 25 分钟) - (最初的预测 30 分钟) = +5 分钟
学习与更新 :这个 +5 分钟 的误差告诉你:你最初的预测太乐观了!你需要把对 " 全程时间 " 的预测调高一些。于是你心想:" 看来以后出门时,应该预测要 30 多分钟才对。" 你根据这个误差,更新了你大脑里的模型。
这就是 TD Learning 的核心思想:不需要等到全程跑完(不需要等最终结果),而是在每一步都利用新信息,不断地更新和校准自己之前的预测。
$n$-step Return
前文讨论的 TD Target 的核心思想在于:执行一步,然后用观测到的即时奖励 $r_{t}$ 与对下一状态的价值估计 $V_{w}(s_{t+1})$ 来构建一个更优的目标 TD Target,并更新当前状态的价值估计 $V_{w}(s_{t})$。这种方法被称为 TD Learning,其理论基础来自 Bellman Equation:
$$V^{\pi}(s_{t}) = \mathbb{E}_{a_{t} \sim \pi(\cdot \mid s_{t})} \Big[\mathbb{E}_{s_{t+1} \sim P(\cdot \mid s_{t},a_{t})}\big[R_{t} + \gamma V^{\pi}(s_{t+1})\big]\Big]$$
根据状态价值函数的定义:
$$V^{\pi}(s_{t}) = \mathbb{E}_{\pi}[G_{t} \mid S_{t} = s_{t}]$$
我们发现,上面两式的差异在于期望的下标:Bellman Equation 是单步的期望,而状态价值函数定义是整个轨迹上的期望。为了统一描述单步与多步情形下 $V^{\pi}(s_{t})$ 与回报的联系,我们定义 $n$-step 回报如下:
$$
\begin{aligned}
G_{t}^{(n)} &= R_{t} + \gamma R_{t+1} + \gamma^{2} R_{t+2} + \cdots + \gamma^{n-1} R_{t+n-1} + \gamma^{n} V^{\pi}(s_{t+n}) \\
&= \sum_{l=0}^{n-1} \gamma^{l} R_{t+l} + \gamma^{n} V^{\pi}(s_{t+n})
\end{aligned}
$$
用自然语言来描述,$n$-step 回报就是执行 $n$ 步后,累计实际观测到的奖励和,加上第 $n$ 步所到达状态的折扣状态价值。实际上,更常见的形式是其估计量 $\hat{G}_{t}^{(n)}$,表示如下:
$$\hat{G}_{t}^{(n)} = \sum_{l=0}^{n-1} \gamma^{l} r_{t+l} + \gamma^{n} V_{w}(s_{t+n})$$
其中,$V_{w}(s_{t+n})$ 是对理论真值 $V^{\pi}(s_{t+n})$ 的近似。观察发现(见下方代码示意):
当 $n=1$ 时,$\hat{G}_{t}^{(1)} = r_{t} + \gamma V_{w}(s_{t+1})$ 恰好是 $1$-step TD Target。
当 $n \to \infty$ 时,$\hat{G}_{t}^{(\infty)} = G_{t}^{(\infty)} = \sum_{l=0}^{\infty} \gamma^{l} r_{t+l}$ 恰好是回报的定义,也称为蒙特卡洛回报。
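下面是 $\hat{G}_{t}^{(n)}$ 的一个直接实现示意(奖励序列、状态序列与表格型价值估计均为假设数据):

```python
import numpy as np

def n_step_return(rewards, states, V, t, n, gamma=0.99):
    """n-step 回报估计:sum_{l=0}^{n-1} gamma^l r_{t+l} + gamma^n V_w(s_{t+n})。
    rewards[k]、states[k] 分别是第 k 步的即时奖励与状态,V 是表格型价值估计。"""
    g = sum(gamma ** l * rewards[t + l] for l in range(n))
    return g + gamma ** n * V[states[t + n]]

V = np.array([0.0, 0.5, 1.0, 0.2])      # 假设的价值估计 V_w
states = [0, 1, 2, 3]                   # s_t, s_{t+1}, ...
rewards = [1.0, 0.0, 0.5]               # r_t, r_{t+1}, r_{t+2}
print(n_step_return(rewards, states, V, t=0, n=1))   # n=1:即 1-step TD Target
print(n_step_return(rewards, states, V, t=0, n=3))   # n=3
```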
接下来,证明对于给定的策略 $\pi$ 和起始状态 $S_{t} = s_{t}$,$n$-step 回报 $G_{t}^{(n)}$ 是状态价值函数 $V^{\pi}(s_{t})$ 的无偏估计:
$$\mathbb{E}_{\pi}[G_{t}^{(n)} \mid S_{t} = s_{t}] = V^{\pi}(s_{t})$$
其中,$\mathbb{E}_{\pi}[\cdot \mid S_{t} = s_{t}]$ 表示从 $s_{t}$ 出发,遵循策略 $\pi$ 生成的整个未来轨迹的条件期望。
我们将从 $V^{\pi}(s_{t})$ 的定义出发,通过递归地代入 Bellman Equation 来证明。
步骤 1: 基础步骤
根据 Bellman Equation 的定义:
$$V^{\pi}(s_{t}) = \mathbb{E}_{a_{t} \sim \pi(\cdot \mid s_{t}),\, s_{t+1} \sim P(\cdot \mid s_{t},a_{t})} \big[R_{t} + \gamma V^{\pi}(s_{t+1}) \mid S_{t} = s_{t} \big]$$
这里把嵌套的两层期望写成了一个联合期望,其效果就是对从 $s_{t}$ 开始的一个时间步内的所有可能性(即 $(a_{t}, s_{t+1}, r_{t})$ 的组合)进行加权平均。这正是全轨迹期望 $\mathbb{E}_{\pi}$ 在一个时间步内的定义。所以,我们可以写成:
$$V^{\pi}(s_{t}) = \mathbb{E}_{\pi}[R_{t} + \gamma V^{\pi}(s_{t+1}) \mid S_{t} = s_{t}]$$
注意到 $G_{t}^{(1)} = R_{t} + \gamma V^{\pi}(s_{t+1})$。因此,我们已经严格证明了 $n=1$ 的情况:
$$V^{\pi}(s_{t}) = \mathbb{E}_{\pi}\Big[G_{t}^{(1)} \mid S_{t} = s_{t}\Big]$$
步骤 2: 递归展开
现在,我们对上式中的 $V^{\pi}(s_{t+1})$ 项应用 Bellman Equation。对于任意给定的 $s_{t+1}$,我们有:
$$V^{\pi}(s_{t+1}) = \mathbb{E}_{a_{t+1} \sim \pi(\cdot \mid s_{t+1}),\, s_{t+2} \sim P(\cdot \mid s_{t+1},a_{t+1})} \big[ R_{t+1} + \gamma V^{\pi}(s_{t+2}) \mid S_{t+1} = s_{t+1} \big]$$
我们将这个表达式代入到步骤 1 的结果中:
$$V^{\pi}(s_{t}) = \mathbb{E}_{a_{t} \sim \pi(\cdot \mid s_{t}),\, s_{t+1} \sim P(\cdot \mid s_{t},a_{t})} \Big[ R_{t} + \gamma \, \mathbb{E}_{a_{t+1} \sim \pi(\cdot \mid s_{t+1}),\, s_{t+2} \sim P(\cdot \mid s_{t+1},a_{t+1})} \big[ R_{t+1} + \gamma V^{\pi}(s_{t+2}) \mid S_{t+1} \big] \mid S_{t} = s_{t} \Big]$$
观察期望的结构:
内层期望的结果是一个 $S_{t+1}$ 的函数
外层期望则对这个函数以及 $R_{t}$ 求期望。
根据期望的线性性质,拆分上式可得:
$$
\begin{aligned}
V^{\pi}(s_{t}) =\ & \mathbb{E}_{a_{t} \sim \pi(\cdot \mid s_{t}),\, s_{t+1} \sim P(\cdot \mid s_{t},a_{t})} \Big[ R_{t} + \gamma \, \mathbb{E}_{a_{t+1} \sim \pi(\cdot \mid s_{t+1}),\, s_{t+2} \sim P(\cdot \mid s_{t+1},a_{t+1})} \big[ R_{t+1} + \gamma V^{\pi}(s_{t+2}) \mid S_{t+1} \big] \mid S_{t} = s_{t} \Big] \\
=\ & \mathbb{E}_{a_{t} \sim \pi(\cdot \mid s_{t}),\, s_{t+1} \sim P(\cdot \mid s_{t},a_{t})} \big[R_{t} \mid S_{t} = s_{t}\big] \\
& + \gamma \cdot \mathbb{E}_{a_{t} \sim \pi(\cdot \mid s_{t}),\, s_{t+1} \sim P(\cdot \mid s_{t},a_{t})} \Big[ \mathbb{E}_{a_{t+1} \sim \pi(\cdot \mid s_{t+1}),\, s_{t+2} \sim P(\cdot \mid s_{t+1},a_{t+1})} \big[ R_{t+1} + \gamma V^{\pi}(s_{t+2}) \mid S_{t+1} \big] \mid S_{t} = s_{t} \Big]
\end{aligned}
$$
观察第二项:
外层期望:给定 $S_{t} = s_{t}$ 时,关于 $A_{t}$ 和 $S_{t+1}$ 求期望
内层期望:给定 $S_{t+1}$ 时,关于 $A_{t+1}$ 和 $S_{t+2}$ 求期望
由期望的迭代定律,第二项可以重写为:
$$\mathbb{E}_{a_{t} \sim \pi(\cdot \mid s_{t}),\, s_{t+1} \sim P(\cdot \mid s_{t},a_{t}),\, a_{t+1} \sim \pi(\cdot \mid s_{t+1}),\, s_{t+2} \sim P(\cdot \mid s_{t+1},a_{t+1})} \big[ R_{t+1} + \gamma V^{\pi}(s_{t+2}) \mid S_{t} = s_{t} \big]$$
合并后的期望表示:对从 $S_{t} = s_{t}$ 开始、遵循策略 $\pi$ 的两个时间步的轨迹求期望。由于第一项中的 $R_{t}$ 仅依赖于 $A_{t}$ 与 $S_{t+1}$,不依赖于 $A_{t+1}$ 与 $S_{t+2}$,因此可以把第一项也写进同一个期望中。再由期望的线性性质,得到最终的合并形式:
$$
\begin{aligned}
V^{\pi}(s_{t}) &= \mathbb{E}_{a_{t} \sim \pi(\cdot \mid s_{t}),\, s_{t+1} \sim P(\cdot \mid s_{t},a_{t}),\, a_{t+1} \sim \pi(\cdot \mid s_{t+1}),\, s_{t+2} \sim P(\cdot \mid s_{t+1},a_{t+1})} \big[ R_{t} + \gamma R_{t+1} + \gamma^{2} V^{\pi}(s_{t+2}) \mid S_{t} = s_{t} \big] \\
&= \mathbb{E}_{\pi}\big[ R_{t} + \gamma R_{t+1} + \gamma^{2} V^{\pi}(s_{t+2}) \mid S_{t} = s_{t} \big] \\
&= \mathbb{E}_{\pi}\big[ G_{t}^{(2)} \mid S_{t} = s_{t} \big]
\end{aligned}
$$
至此,我们证明了 $2$-step 回报 $G_{t}^{(2)}$ 是状态价值函数 $V^{\pi}(s_{t})$ 的无偏估计。实际上,用同样的方法可以归纳地推广到任意 $n$-step,这里不给出详细的推导过程。
综上,$n$-step 回报 $G_{t}^{(n)}$ 是状态价值函数 $V^{\pi}(s_{t})$ 的无偏估计。
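作为补充,可以在一个随机生成的表格化 MDP 上对这一结论做数值验证(示意性质:先用策略评估得到 $V^{\pi}$,再用蒙特卡洛采样估计 $\mathbb{E}_{\pi}[G_{t}^{(n)}]$,两者应在采样误差范围内一致):

```python
import numpy as np

rng = np.random.default_rng(0)
num_states, num_actions, gamma, n = 3, 2, 0.9, 3

P = rng.dirichlet(np.ones(num_states), size=(num_states, num_actions))  # P(s'|s,a)
R = rng.normal(size=(num_states, num_actions, num_states))              # R(s,a,s')
pi = rng.dirichlet(np.ones(num_actions), size=num_states)               # pi(a|s)

# 策略评估,得到 V^pi
V = np.zeros(num_states)
for _ in range(2000):
    V = np.einsum("sa,sap,sap->s", pi, P, R + gamma * V[None, None, :])

def sample_n_step_return(s, n):
    """从状态 s 出发按 pi 采样 n 步,返回 G_t^{(n)} = sum_l gamma^l R + gamma^n V^pi(s_{t+n})"""
    g = 0.0
    for l in range(n):
        a = rng.choice(num_actions, p=pi[s])
        s_next = rng.choice(num_states, p=P[s, a])
        g += gamma ** l * R[s, a, s_next]
        s = s_next
    return g + gamma ** n * V[s]

s0 = 0
estimate = np.mean([sample_n_step_return(s0, n) for _ in range(50_000)])
print(estimate, V[s0])   # 两者应在蒙特卡洛误差范围内接近
```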