4  Policy Gradient Methods

4.1 Introduction

Discrete action spaces

\[ \pi(a|s, \mathbf{\theta}) = \frac{e^{h(s, a, \mathbf{\theta})}}{\sum_{b} e^{h(s, b, \mathbf{\theta})}} \]

Continuous action spaces

The probability density function of the normal distribution is given by:

\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) \]

The policy can be parameterized as a normal distribution:

\[ \pi(a|s, \mathbf{\theta}) = \frac{1}{\sigma(s, \mathbf{\theta})} \exp\left(-\frac{(a - \mu(s, \mathbf{\theta}))^2}{2\sigma(s, \mathbf{\theta})^2}\right) \]

\[ J(\mathbf{\theta}) = v_{\pi}(s_0) \]

The idea is to change the policy parameters \(\mathbf{\theta}\) to improve \(J(\mathbf{\theta})\). However, \(\theta\) have an effect on both the policy and the distribution of states.

Luckily, the Policy Gradient Theorem shows that \(\nabla J(\mathbf{\theta})\) can only involve the derivative of the policy, and not the derivative of the state distribution.

4.2 The Policy Gradient Theorem

\[ \nabla J(\mathbf{\theta}) \propto \sum_{s} \mu(s) \sum_{a} q_{\pi}(s, a) \nabla \pi(a|s, \mathbf{\theta}) \]

where \(\mu(s)\) is the stationary distribution of states under the policy \(\pi_{\mathbf{\theta}}\).

\[ \mu(s) = \lim_{t \to \infty} p(s_t = s | s_0, \pi_{\mathbf{\theta}}) \]

Proof. Remeber that \(J(\mathbf{\theta}) = v_{\pi}(s_0)\). The objective is to prove that \[ \nabla J(\mathbf{\theta}) = \sum_{s} \mu(s) \sum_{a} q_{\pi}(s, a) \nabla \pi(a|s, \mathbf{\theta}) \]

First, we can express \(v_{\pi}(s)\) in terms of \(q_{\pi}(s, a)\): \[ v_{\pi}(s) = \sum_{a} \pi(a|s, \mathbf{\theta}) q_{\pi}(s, a) \]

Next, we can compute the gradient of \(v_{\pi}(s)\) with respect to \(\mathbf{\theta}\):

\[ \begin{align*} \nabla v_{\pi}(s) &= \nabla \left( \sum_{a} \pi(a|s, \mathbf{\theta}) q_{\pi}(s, a) \right) \\ &= \sum_{a} \left( \nabla \pi(a|s, \mathbf{\theta}) q_{\pi}(s, a) + \pi(a|s, \mathbf{\theta}) \nabla q_{\pi}(s, a) \right) \\ &= \sum_{a} \left( \nabla \pi(a|s, \mathbf{\theta}) q_{\pi}(s, a) + \pi(a|s, \mathbf{\theta}) \sum_{s'} p(s'|s, a) \nabla v_{\pi}(s') \right)\\ &= \sum_{a} \nabla \pi(a|s, \mathbf{\theta}) q_{\pi}(s, a) + \sum_{a} \pi(a|s, \mathbf{\theta}) \sum_{s'} p(s'|s, a) \nabla v_{\pi}(s') \\ &= \sum_{a} \nabla \pi(a|s, \mathbf{\theta}) q_{\pi}(s, a) + \sum_{s'} \sum_{a} \pi(a|s, \mathbf{\theta}) p(s'|s, a) \nabla v_{\pi}(s') \end{align*} \]

To simplify the expression, we define \(\phi(s) = \sum_{a} \nabla \pi(a|s, \mathbf{\theta}) q_{\pi}(s, a)\). Then we can rewrite the equation as:

\[ \nabla v_{\pi}(s) = \phi(s) + \sum_{s'} \sum_{a} \pi(a|s, \mathbf{\theta}) p(s'|s, a) \nabla v_{\pi}(s') \]

Let \(\rho(s \rightarrow x, k, \pi)\) be the probability of reaching state \(x\) from state \(s\) in \(k\) steps under policy \(\pi\). We know that

\[ \begin{align*} \rho(s \rightarrow s, 0, \pi) &= 1 \\ \rho(s \rightarrow s', 1, \pi) &= \sum_{a} \pi(a|s, \mathbf{\theta}) p(s'|s, a) \\ \rho(s \rightarrow x, k+1, \pi) &= \sum_{s'} \rho(s \rightarrow s', k, \pi) \rho(s' \rightarrow x, 1, \pi) \end{align*} \]

Using this definition, we can express \(\nabla v_{\pi}(s)\) as:

\[ \begin{align*} \nabla v_{\pi}(s) &= \phi(s) + \sum_{s'} \sum_{a} \pi(a|s, \mathbf{\theta}) p(s'|s, a) \nabla v_{\pi}(s') \\ &= \phi(s) + \sum_{s'} \rho_\pi(s \rightarrow s', 1) \nabla v_{\pi}(s') \\ &= \phi(s) + \sum_{s'} \rho_\pi(s \rightarrow s', 1) \left( \phi(s') + \sum_{s''} \rho_\pi(s' \rightarrow s'', 1) \nabla v_{\pi}(s'') \right) \\ &= \phi(s) + \sum_{s'} \rho_\pi(s \rightarrow s', 1) \phi(s') + \sum_{s''} \rho_\pi(s \rightarrow s'', 2) \nabla v_{\pi}(s'') \\ &= \phi(s) + \sum_{s'} \rho_\pi(s \rightarrow s', 1) \phi(s') + \sum_{s''} \rho_\pi(s \rightarrow s'', 2) \phi(s'') + \sum_{s'''} \rho_\pi(s \rightarrow s''', 3) \nabla v_{\pi}(s''') \\ &= \sum_{x \in \mathcal{S}} \sum_{k=0}^{\infty} \rho_\pi(s \rightarrow x, k) \phi(x) \end{align*} \]

Note that \(\rho_\pi(s \rightarrow s, 0) = 1\) and \(\rho_\pi(s \rightarrow s', 0) = 0\) for \(s' \neq s\).

Let \(\eta(s) = \sum_{k=0}^{\infty} \rho_\pi(s_0 \rightarrow s, k)\). We have \(\mu(s) = \frac{\eta(s)}{\sum_{s} \eta(s)}\). Then we can express \(\nabla J(\mathbf{\theta})\) as:

\[ \begin{align*} \nabla J(\mathbf{\theta}) &= \nabla v_{\pi}(s_0) \\ &= \sum_{x \in \mathcal{S}} \sum_{k=0}^{\infty} \rho_\pi(s_0 \rightarrow x, k) \phi(x) \\ &= \sum_{x \in \mathcal{S}} \eta(x) \phi(x) \\ &= \sum_{s} \eta(s) \frac{\sum_{s} \eta(s)}{\sum_{s} \eta(s)} \sum_{a} q_{\pi}(s, a) \nabla \pi(a|s, \mathbf{\theta}) \\ &= \sum_{s} \eta(s) \sum_{s} \frac{\eta(s)}{\sum_{s} \eta(s)} \sum_{a} q_{\pi}(s, a) \nabla \pi(a|s, \mathbf{\theta}) \\ &= \sum_{s} \eta(s) \sum_{s} \mu(s) \sum_{a} q_{\pi}(s, a) \nabla \pi(a|s, \mathbf{\theta}) \\ \end{align*} \]

In the episodic case, \(\sum_{s} \eta(s)\) is the averange episode length, which is a constant.

\[ \sum_{s} \mu(s) = \mathbb{E}[\tau] \]

In the continuing case, \(\sum_{s} \eta(s) = 1\).

\[ \sum_{s} \mu(s) = \sum_{s} \lim_{t \to \infty} p(s_t = s) = 1 \]

Then we have

\[ \nabla J(\mathbf{\theta}) \propto \sum_{s} \mu(s) \sum_{a} q_{\pi}(s, a) \nabla \pi(a|s, \mathbf{\theta}) \]

The gradient \(\nabla J(\mathbf{\theta})\) can also be expressed as

\[ \begin{align*} \nabla J(\mathbf{\theta}) & \propto \sum_{s} \mu(s) \sum_{a} q_{\pi}(s, a) \nabla \pi(a|s, \mathbf{\theta}) \\ &= \mathbb{E}_{s \sim \mu} \left[\sum_{a} q_{\pi}(s, a) \nabla \pi(a|s, \mathbf{\theta})\right] \\ &= \mathbb{E}_{s \sim \mu} \left[\sum_{a} \pi(a|s, \mathbf{\theta}) q_{\pi}(s, a) \frac{\nabla \pi(a|s, \mathbf{\theta})}{\pi(a|s, \mathbf{\theta})}\right] \\ &= \mathbb{E}_{s \sim \mu, a \sim \pi} \left[q_{\pi}(s, a) \frac{\nabla \pi(a|s, \mathbf{\theta})}{\pi(a|s, \mathbf{\theta})}\right] \\ &= \mathbb{E}_{s \sim \mu, a \sim \pi} \left[q_{\pi}(s, a) \nabla \ln \pi(a|s, \mathbf{\theta})\right] \end{align*} \]

Here, \(\nabla \ln x = \nabla x / x\) is used to derive the last line.

4.3 REINFORCE Algorithm

Since \(q_{\pi}(S_t, A_t) = \mathbb{E}[G_t | S_t, A_t]\), we have

\[ \begin{align*} \nabla J(\mathbf{\theta}) &\propto [q_{\pi}(s, a) \nabla \ln \pi(a|s, \mathbf{\theta})] \\ &= \mathbb{E} \left[\mathbb{E}[G_t | S_t, A_t] \nabla \ln \pi(a|s, \mathbf{\theta})\right] \\ &= \mathbb{E}\left[G_t \nabla \ln \pi(a|s, \mathbf{\theta})\right] \end{align*} \]

\begin{algorithm} \caption{REINFORCE} \begin{algorithmic} \State Initialize policy parameters $\mathbf{\theta}$ \While{True} \State Generate an episode $S_0, A_0, R_1 \dots, S_{T-1}, A_{T-1}, R_T$ using policy $\pi_{\mathbf{\theta}}$ \For{$t = 0, 1, \dots, T-1$} \State $G_t \gets \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$ \State $\mathbf{\theta} \gets \mathbf{\theta} + \alpha \gamma^t G_t \nabla \ln \pi(A_t|S_t, \mathbf{\theta})$ \EndFor \EndWhile \end{algorithmic} \end{algorithm}
\begin{algorithm} \caption{REINFORCE with Baseline} \begin{algorithmic} \State Initialize policy parameters $\mathbf{\theta}$ \State Initialize baseline parameters $\mathbf{w}$ \While{True} \State Generate an episode $S_0, A_0, R_1 \dots, S_{T-1}, A_{T-1}, R_T$ using policy $\pi_{\mathbf{\theta}}$ \For{$t = 0, 1, \dots, T-1$} \State $G_t \gets \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$ \State $\delta \gets G_t - \hat{v}(S_t, \mathbf{w})$ \State $\mathbf{\theta} \gets \mathbf{\theta} + \alpha \gamma^t \delta \nabla \ln \pi(A_t|S_t, \mathbf{\theta})$ \State $\mathbf{w} \gets \mathbf{w} + \beta \delta \nabla \hat{v}(S_t, \mathbf{w})$ \EndFor \EndWhile \end{algorithmic} \end{algorithm}

4.4 Actor-Critic Methods