4 Policy Gradient Methods

4.1 Introduction

Discrete action spaces

\[ \pi(a|s, \mathbf{\theta}) = \frac{e^{h(s, a, \mathbf{\theta})}}{\sum_{b} e^{h(s, b, \mathbf{\theta})}} \]

Continuous action spaces

The probability density function of the normal distribution is given by:

\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) \]

The policy can be parameterized as a normal distribution:

\[ \pi(a|s, \mathbf{\theta}) = \frac{1}{\sigma(s, \mathbf{\theta})} \exp\left(-\frac{(a - \mu(s, \mathbf{\theta}))^2}{2\sigma(s, \mathbf{\theta})^2}\right) \]

\[ J(\mathbf{\theta}) = v_{\pi}(s_0) \]

The idea is to change the policy parameters $\mathbf{\theta}$ to improve $J(\mathbf{\theta})$. However, $\theta$ have an effect on both the policy and the distribution of states.

Luckily, the Policy Gradient Theorem shows that $\nabla J(\mathbf{\theta})$ can only involve the derivative of the policy, and not the derivative of the state distribution.

4.2 The Policy Gradient Theorem

\[ \nabla J(\mathbf{\theta}) \propto \sum_{s} \mu(s) \sum_{a} q_{\pi}(s, a) \nabla \pi(a|s, \mathbf{\theta}) \]

where $\mu(s)$ is the stationary distribution of states under the policy $\pi_{\mathbf{\theta}}$.

\[ \mu(s) = \lim_{t \to \infty} p(s_t = s | s_0, \pi_{\mathbf{\theta}}) \]

Proof. Remeber that $J(\mathbf{\theta}) = v_{\pi}(s_0)$. The objective is to prove that \[ \nabla J(\mathbf{\theta}) = \sum_{s} \mu(s) \sum_{a} q_{\pi}(s, a) \nabla \pi(a|s, \mathbf{\theta}) \]

First, we can express $v_{\pi}(s)$ in terms of $q_{\pi}(s, a)$: \[ v_{\pi}(s) = \sum_{a} \pi(a|s, \mathbf{\theta}) q_{\pi}(s, a) \]

Next, we can compute the gradient of $v_{\pi}(s)$ with respect to $\mathbf{\theta}$:

\[ \begin{align*} \nabla v_{\pi}(s) &= \nabla \left( \sum_{a} \pi(a|s, \mathbf{\theta}) q_{\pi}(s, a) \right) \\ &= \sum_{a} \left( \nabla \pi(a|s, \mathbf{\theta}) q_{\pi}(s, a) + \pi(a|s, \mathbf{\theta}) \nabla q_{\pi}(s, a) \right) \\ &= \sum_{a} \left( \nabla \pi(a|s, \mathbf{\theta}) q_{\pi}(s, a) + \pi(a|s, \mathbf{\theta}) \sum_{s'} p(s'|s, a) \nabla v_{\pi}(s') \right)\\ &= \sum_{a} \nabla \pi(a|s, \mathbf{\theta}) q_{\pi}(s, a) + \sum_{a} \pi(a|s, \mathbf{\theta}) \sum_{s'} p(s'|s, a) \nabla v_{\pi}(s') \\ &= \sum_{a} \nabla \pi(a|s, \mathbf{\theta}) q_{\pi}(s, a) + \sum_{s'} \sum_{a} \pi(a|s, \mathbf{\theta}) p(s'|s, a) \nabla v_{\pi}(s') \end{align*} \]

To simplify the expression, we define $\phi(s) = \sum_{a} \nabla \pi(a|s, \mathbf{\theta}) q_{\pi}(s, a)$. Then we can rewrite the equation as:

\[ \nabla v_{\pi}(s) = \phi(s) + \sum_{s'} \sum_{a} \pi(a|s, \mathbf{\theta}) p(s'|s, a) \nabla v_{\pi}(s') \]

Let $\rho(s \rightarrow x, k, \pi)$ be the probability of reaching state $x$ from state $s$ in $k$ steps under policy $\pi$. We know that

\[ \begin{align*} \rho(s \rightarrow s, 0, \pi) &= 1 \\ \rho(s \rightarrow s', 1, \pi) &= \sum_{a} \pi(a|s, \mathbf{\theta}) p(s'|s, a) \\ \rho(s \rightarrow x, k+1, \pi) &= \sum_{s'} \rho(s \rightarrow s', k, \pi) \rho(s' \rightarrow x, 1, \pi) \end{align*} \]

Using this definition, we can express $\nabla v_{\pi}(s)$ as:

\[ \begin{align*} \nabla v_{\pi}(s) &= \phi(s) + \sum_{s'} \sum_{a} \pi(a|s, \mathbf{\theta}) p(s'|s, a) \nabla v_{\pi}(s') \\ &= \phi(s) + \sum_{s'} \rho_\pi(s \rightarrow s', 1) \nabla v_{\pi}(s') \\ &= \phi(s) + \sum_{s'} \rho_\pi(s \rightarrow s', 1) \left( \phi(s') + \sum_{s''} \rho_\pi(s' \rightarrow s'', 1) \nabla v_{\pi}(s'') \right) \\ &= \phi(s) + \sum_{s'} \rho_\pi(s \rightarrow s', 1) \phi(s') + \sum_{s''} \rho_\pi(s \rightarrow s'', 2) \nabla v_{\pi}(s'') \\ &= \phi(s) + \sum_{s'} \rho_\pi(s \rightarrow s', 1) \phi(s') + \sum_{s''} \rho_\pi(s \rightarrow s'', 2) \phi(s'') + \sum_{s'''} \rho_\pi(s \rightarrow s''', 3) \nabla v_{\pi}(s''') \\ &= \sum_{x \in \mathcal{S}} \sum_{k=0}^{\infty} \rho_\pi(s \rightarrow x, k) \phi(x) \end{align*} \]

Note that $\rho_\pi(s \rightarrow s, 0) = 1$ and $\rho_\pi(s \rightarrow s', 0) = 0$ for $s' \neq s$.

Let $\eta(s) = \sum_{k=0}^{\infty} \rho_\pi(s_0 \rightarrow s, k)$. We have $\mu(s) = \frac{\eta(s)}{\sum_{s} \eta(s)}$. Then we can express $\nabla J(\mathbf{\theta})$ as:

\[ \begin{align*} \nabla J(\mathbf{\theta}) &= \nabla v_{\pi}(s_0) \\ &= \sum_{x \in \mathcal{S}} \sum_{k=0}^{\infty} \rho_\pi(s_0 \rightarrow x, k) \phi(x) \\ &= \sum_{x \in \mathcal{S}} \eta(x) \phi(x) \\ &= \sum_{s} \eta(s) \frac{\sum_{s} \eta(s)}{\sum_{s} \eta(s)} \sum_{a} q_{\pi}(s, a) \nabla \pi(a|s, \mathbf{\theta}) \\ &= \sum_{s} \eta(s) \sum_{s} \frac{\eta(s)}{\sum_{s} \eta(s)} \sum_{a} q_{\pi}(s, a) \nabla \pi(a|s, \mathbf{\theta}) \\ &= \sum_{s} \eta(s) \sum_{s} \mu(s) \sum_{a} q_{\pi}(s, a) \nabla \pi(a|s, \mathbf{\theta}) \\ \end{align*} \]

In the episodic case, $\sum_{s} \eta(s)$ is the averange episode length, which is a constant.

\[ \sum_{s} \mu(s) = \mathbb{E}[\tau] \]

In the continuing case, $\sum_{s} \eta(s) = 1$.

\[ \sum_{s} \mu(s) = \sum_{s} \lim_{t \to \infty} p(s_t = s) = 1 \]

Then we have

\[ \nabla J(\mathbf{\theta}) \propto \sum_{s} \mu(s) \sum_{a} q_{\pi}(s, a) \nabla \pi(a|s, \mathbf{\theta}) \]

The gradient $\nabla J(\mathbf{\theta})$ can also be expressed as

\[ \begin{align*} \nabla J(\mathbf{\theta}) & \propto \sum_{s} \mu(s) \sum_{a} q_{\pi}(s, a) \nabla \pi(a|s, \mathbf{\theta}) \\ &= \mathbb{E}_{s \sim \mu} \left[\sum_{a} q_{\pi}(s, a) \nabla \pi(a|s, \mathbf{\theta})\right] \\ &= \mathbb{E}_{s \sim \mu} \left[\sum_{a} \pi(a|s, \mathbf{\theta}) q_{\pi}(s, a) \frac{\nabla \pi(a|s, \mathbf{\theta})}{\pi(a|s, \mathbf{\theta})}\right] \\ &= \mathbb{E}_{s \sim \mu, a \sim \pi} \left[q_{\pi}(s, a) \frac{\nabla \pi(a|s, \mathbf{\theta})}{\pi(a|s, \mathbf{\theta})}\right] \\ &= \mathbb{E}_{s \sim \mu, a \sim \pi} \left[q_{\pi}(s, a) \nabla \ln \pi(a|s, \mathbf{\theta})\right] \end{align*} \]

Here, $\nabla \ln x = \nabla x / x$ is used to derive the last line.

4.3 REINFORCE Algorithm

Since $q_{\pi}(S_t, A_t) = \mathbb{E}[G_t | S_t, A_t]$, we have

\[ \begin{align*} \nabla J(\mathbf{\theta}) &\propto [q_{\pi}(s, a) \nabla \ln \pi(a|s, \mathbf{\theta})] \\ &= \mathbb{E} \left[\mathbb{E}[G_t | S_t, A_t] \nabla \ln \pi(a|s, \mathbf{\theta})\right] \\ &= \mathbb{E}\left[G_t \nabla \ln \pi(a|s, \mathbf{\theta})\right] \end{align*} \]

\begin{algorithm} \caption{REINFORCE} \begin{algorithmic} \State Initialize policy parameters $\mathbf{\theta}$ \While{True} \State Generate an episode $S_0, A_0, R_1 \dots, S_{T-1}, A_{T-1}, R_T$ using policy $\pi_{\mathbf{\theta}}$ \For{$t = 0, 1, \dots, T-1$} \State $G_t \gets \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$ \State $\mathbf{\theta} \gets \mathbf{\theta} + \alpha \gamma^t G_t \nabla \ln \pi(A_t|S_t, \mathbf{\theta})$ \EndFor \EndWhile \end{algorithmic} \end{algorithm}

\begin{algorithm} \caption{REINFORCE with Baseline} \begin{algorithmic} \State Initialize policy parameters $\mathbf{\theta}$ \State Initialize baseline parameters $\mathbf{w}$ \While{True} \State Generate an episode $S_0, A_0, R_1 \dots, S_{T-1}, A_{T-1}, R_T$ using policy $\pi_{\mathbf{\theta}}$ \For{$t = 0, 1, \dots, T-1$} \State $G_t \gets \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$ \State $\delta \gets G_t - \hat{v}(S_t, \mathbf{w})$ \State $\mathbf{\theta} \gets \mathbf{\theta} + \alpha \gamma^t \delta \nabla \ln \pi(A_t|S_t, \mathbf{\theta})$ \State $\mathbf{w} \gets \mathbf{w} + \beta \delta \nabla \hat{v}(S_t, \mathbf{w})$ \EndFor \EndWhile \end{algorithmic} \end{algorithm}

4.1 Introduction

4.2 The Policy Gradient Theorem

4.3 REINFORCE Algorithm

4.4 Actor-Critic Methods