4 Policy Gradient Methods
4.1 Introduction
Discrete action spaces
\[ \pi(a|s, \mathbf{\theta}) = \frac{e^{h(s, a, \mathbf{\theta})}}{\sum_{b} e^{h(s, b, \mathbf{\theta})}} \]
Continuous action spaces
The probability density function of the normal distribution is given by:
\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) \]
The policy can be parameterized as a normal distribution:
\[ \pi(a|s, \mathbf{\theta}) = \frac{1}{\sigma(s, \mathbf{\theta})} \exp\left(-\frac{(a - \mu(s, \mathbf{\theta}))^2}{2\sigma(s, \mathbf{\theta})^2}\right) \]
\[ J(\mathbf{\theta}) = v_{\pi}(s_0) \]
The idea is to change the policy parameters \(\mathbf{\theta}\) to improve \(J(\mathbf{\theta})\). However, \(\theta\) have an effect on both the policy and the distribution of states.
Luckily, the Policy Gradient Theorem shows that \(\nabla J(\mathbf{\theta})\) can only involve the derivative of the policy, and not the derivative of the state distribution.
4.2 The Policy Gradient Theorem
\[ \nabla J(\mathbf{\theta}) \propto \sum_{s} \mu(s) \sum_{a} q_{\pi}(s, a) \nabla \pi(a|s, \mathbf{\theta}) \]
where \(\mu(s)\) is the stationary distribution of states under the policy \(\pi_{\mathbf{\theta}}\).
\[ \mu(s) = \lim_{t \to \infty} p(s_t = s | s_0, \pi_{\mathbf{\theta}}) \]
Proof. Remeber that \(J(\mathbf{\theta}) = v_{\pi}(s_0)\). The objective is to prove that \[ \nabla J(\mathbf{\theta}) = \sum_{s} \mu(s) \sum_{a} q_{\pi}(s, a) \nabla \pi(a|s, \mathbf{\theta}) \]
First, we can express \(v_{\pi}(s)\) in terms of \(q_{\pi}(s, a)\): \[ v_{\pi}(s) = \sum_{a} \pi(a|s, \mathbf{\theta}) q_{\pi}(s, a) \]
Next, we can compute the gradient of \(v_{\pi}(s)\) with respect to \(\mathbf{\theta}\):
\[ \begin{align*} \nabla v_{\pi}(s) &= \nabla \left( \sum_{a} \pi(a|s, \mathbf{\theta}) q_{\pi}(s, a) \right) \\ &= \sum_{a} \left( \nabla \pi(a|s, \mathbf{\theta}) q_{\pi}(s, a) + \pi(a|s, \mathbf{\theta}) \nabla q_{\pi}(s, a) \right) \\ &= \sum_{a} \left( \nabla \pi(a|s, \mathbf{\theta}) q_{\pi}(s, a) + \pi(a|s, \mathbf{\theta}) \sum_{s'} p(s'|s, a) \nabla v_{\pi}(s') \right)\\ &= \sum_{a} \nabla \pi(a|s, \mathbf{\theta}) q_{\pi}(s, a) + \sum_{a} \pi(a|s, \mathbf{\theta}) \sum_{s'} p(s'|s, a) \nabla v_{\pi}(s') \\ &= \sum_{a} \nabla \pi(a|s, \mathbf{\theta}) q_{\pi}(s, a) + \sum_{s'} \sum_{a} \pi(a|s, \mathbf{\theta}) p(s'|s, a) \nabla v_{\pi}(s') \end{align*} \]
To simplify the expression, we define \(\phi(s) = \sum_{a} \nabla \pi(a|s, \mathbf{\theta}) q_{\pi}(s, a)\). Then we can rewrite the equation as:
\[ \nabla v_{\pi}(s) = \phi(s) + \sum_{s'} \sum_{a} \pi(a|s, \mathbf{\theta}) p(s'|s, a) \nabla v_{\pi}(s') \]
Let \(\rho(s \rightarrow x, k, \pi)\) be the probability of reaching state \(x\) from state \(s\) in \(k\) steps under policy \(\pi\). We know that
\[ \begin{align*} \rho(s \rightarrow s, 0, \pi) &= 1 \\ \rho(s \rightarrow s', 1, \pi) &= \sum_{a} \pi(a|s, \mathbf{\theta}) p(s'|s, a) \\ \rho(s \rightarrow x, k+1, \pi) &= \sum_{s'} \rho(s \rightarrow s', k, \pi) \rho(s' \rightarrow x, 1, \pi) \end{align*} \]
Using this definition, we can express \(\nabla v_{\pi}(s)\) as:
\[ \begin{align*} \nabla v_{\pi}(s) &= \phi(s) + \sum_{s'} \sum_{a} \pi(a|s, \mathbf{\theta}) p(s'|s, a) \nabla v_{\pi}(s') \\ &= \phi(s) + \sum_{s'} \rho_\pi(s \rightarrow s', 1) \nabla v_{\pi}(s') \\ &= \phi(s) + \sum_{s'} \rho_\pi(s \rightarrow s', 1) \left( \phi(s') + \sum_{s''} \rho_\pi(s' \rightarrow s'', 1) \nabla v_{\pi}(s'') \right) \\ &= \phi(s) + \sum_{s'} \rho_\pi(s \rightarrow s', 1) \phi(s') + \sum_{s''} \rho_\pi(s \rightarrow s'', 2) \nabla v_{\pi}(s'') \\ &= \phi(s) + \sum_{s'} \rho_\pi(s \rightarrow s', 1) \phi(s') + \sum_{s''} \rho_\pi(s \rightarrow s'', 2) \phi(s'') + \sum_{s'''} \rho_\pi(s \rightarrow s''', 3) \nabla v_{\pi}(s''') \\ &= \sum_{x \in \mathcal{S}} \sum_{k=0}^{\infty} \rho_\pi(s \rightarrow x, k) \phi(x) \end{align*} \]
Note that \(\rho_\pi(s \rightarrow s, 0) = 1\) and \(\rho_\pi(s \rightarrow s', 0) = 0\) for \(s' \neq s\).
Let \(\eta(s) = \sum_{k=0}^{\infty} \rho_\pi(s_0 \rightarrow s, k)\). We have \(\mu(s) = \frac{\eta(s)}{\sum_{s} \eta(s)}\). Then we can express \(\nabla J(\mathbf{\theta})\) as:
\[ \begin{align*} \nabla J(\mathbf{\theta}) &= \nabla v_{\pi}(s_0) \\ &= \sum_{x \in \mathcal{S}} \sum_{k=0}^{\infty} \rho_\pi(s_0 \rightarrow x, k) \phi(x) \\ &= \sum_{x \in \mathcal{S}} \eta(x) \phi(x) \\ &= \sum_{s} \eta(s) \frac{\sum_{s} \eta(s)}{\sum_{s} \eta(s)} \sum_{a} q_{\pi}(s, a) \nabla \pi(a|s, \mathbf{\theta}) \\ &= \sum_{s} \eta(s) \sum_{s} \frac{\eta(s)}{\sum_{s} \eta(s)} \sum_{a} q_{\pi}(s, a) \nabla \pi(a|s, \mathbf{\theta}) \\ &= \sum_{s} \eta(s) \sum_{s} \mu(s) \sum_{a} q_{\pi}(s, a) \nabla \pi(a|s, \mathbf{\theta}) \\ \end{align*} \]
In the episodic case, \(\sum_{s} \eta(s)\) is the averange episode length, which is a constant.
\[ \sum_{s} \mu(s) = \mathbb{E}[\tau] \]
In the continuing case, \(\sum_{s} \eta(s) = 1\).
\[ \sum_{s} \mu(s) = \sum_{s} \lim_{t \to \infty} p(s_t = s) = 1 \]
Then we have
\[ \nabla J(\mathbf{\theta}) \propto \sum_{s} \mu(s) \sum_{a} q_{\pi}(s, a) \nabla \pi(a|s, \mathbf{\theta}) \]
The gradient \(\nabla J(\mathbf{\theta})\) can also be expressed as
\[ \begin{align*} \nabla J(\mathbf{\theta}) & \propto \sum_{s} \mu(s) \sum_{a} q_{\pi}(s, a) \nabla \pi(a|s, \mathbf{\theta}) \\ &= \mathbb{E}_{s \sim \mu} \left[\sum_{a} q_{\pi}(s, a) \nabla \pi(a|s, \mathbf{\theta})\right] \\ &= \mathbb{E}_{s \sim \mu} \left[\sum_{a} \pi(a|s, \mathbf{\theta}) q_{\pi}(s, a) \frac{\nabla \pi(a|s, \mathbf{\theta})}{\pi(a|s, \mathbf{\theta})}\right] \\ &= \mathbb{E}_{s \sim \mu, a \sim \pi} \left[q_{\pi}(s, a) \frac{\nabla \pi(a|s, \mathbf{\theta})}{\pi(a|s, \mathbf{\theta})}\right] \\ &= \mathbb{E}_{s \sim \mu, a \sim \pi} \left[q_{\pi}(s, a) \nabla \ln \pi(a|s, \mathbf{\theta})\right] \end{align*} \]
Here, \(\nabla \ln x = \nabla x / x\) is used to derive the last line.
4.3 REINFORCE Algorithm
Since \(q_{\pi}(S_t, A_t) = \mathbb{E}[G_t | S_t, A_t]\), we have
\[ \begin{align*} \nabla J(\mathbf{\theta}) &\propto [q_{\pi}(s, a) \nabla \ln \pi(a|s, \mathbf{\theta})] \\ &= \mathbb{E} \left[\mathbb{E}[G_t | S_t, A_t] \nabla \ln \pi(a|s, \mathbf{\theta})\right] \\ &= \mathbb{E}\left[G_t \nabla \ln \pi(a|s, \mathbf{\theta})\right] \end{align*} \]