Gradient Estimation for Policy Gradient Methods

This post reviews two ways to estimate the gradient of an expectation, the likelihood ratio estimator and the reparameterization trick, and how they appear in policy gradient methods such as REINFORCE, DPG, and SAC.

Consider the problem of estimating the gradient of an expectation, $\frac{\partial}{\partial\theta}\mathrm{E}_{x\sim p_\theta}[f(x)]$, where $f$ is a function, $x$ is a random variable, and $p_\theta$ is a distribution parameterized by $\theta$.

There are two main ways to estimate this gradient (see [1] for a summary). The first is the likelihood ratio (or score function) estimator: $$ \frac{\partial}{\partial\theta}\mathrm{E}_{x\sim p_\theta}\left[f(x)\right] = \mathrm{E}_{x\sim p_\theta}\left[f(x)\frac{\partial}{\partial\theta}\log p_\theta(x)\right]. $$

The equation is derived as follows: $$\begin{align} \frac{\partial}{\partial\theta}\mathrm{E}_{x\sim p_\theta}\left[f(x)\right] &= \frac{\partial}{\partial\theta} \int f(x) p_\theta(x) dx \\ &= \int f(x) \frac{\partial}{\partial\theta} p_\theta(x) dx \\ &= \int f(x) p_\theta(x) \frac{\partial}{\partial\theta} \log p_\theta(x) dx \\ &= \mathrm{E}_{x\sim p_\theta}\left[f(x)\frac{\partial}{\partial\theta}\log p_\theta(x)\right]. \end{align}$$ The interchange of differentiation and integration is justified by the Leibniz integral rule, which requires that $p_\theta(x)$ and $\frac{\partial}{\partial\theta} p_\theta(x)$ be continuous in $\theta$ and $x$. When $x$ is discrete, the equation still holds because we can interchange differentiation and summation. The third equality follows from the identity $\frac{\partial}{\partial\theta} \log p_\theta(x) = \frac{1}{p_\theta(x)} \frac{\partial}{\partial\theta}p_\theta(x)$.
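To make this concrete, here is a small PyTorch sketch on a toy problem (an illustrative example, not code from the references): take $x\sim\mathcal{N}(\theta,1)$ and $f(x)=x^2$, so that $\mathrm{E}[f(x)]=\theta^2+1$ and the true gradient is $2\theta$.

import torch

# Toy check of the likelihood ratio estimator:
# x ~ N(theta, 1), f(x) = x^2, so E[f(x)] = theta^2 + 1 and the true gradient is 2 * theta.
theta = torch.tensor(1.5, requires_grad=True)
m = torch.distributions.Normal(theta, 1.0)
x = m.sample((100_000,))                       # samples carry no gradient w.r.t. theta
surrogate = (x.pow(2) * m.log_prob(x)).mean()  # Monte Carlo estimate of E[f(x) log p_theta(x)]
surrogate.backward()                           # its gradient is the likelihood ratio estimate
print(theta.grad)                              # close to 2 * 1.5 = 3.0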

The second is the pathwise derivative estimator, also known as the reparameterization trick. Let $x=g_\theta(z)$ be a deterministic, differentiable function of $\theta$ and of another random variable $z$ that we can sample. Then, $$ \frac{\partial}{\partial\theta}\mathrm{E}_{z}\left[f(g_\theta(z))\right] = \mathrm{E}_{z}\left[\frac{\partial}{\partial\theta}f(g_\theta(z))\right]. $$ Here we again interchange differentiation and integration. Note that this estimator requires $f$ to be differentiable.
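The same toy problem with the reparameterization trick, writing $x = g_\theta(z) = \theta + z$ with $z\sim\mathcal{N}(0,1)$ (again just an illustrative sketch):

import torch

# Same toy problem with the pathwise derivative: x = g_theta(z) = theta + z, z ~ N(0, 1).
theta = torch.tensor(1.5, requires_grad=True)
z = torch.randn(100_000)
x = theta + z                   # x is a differentiable function of theta
objective = x.pow(2).mean()     # Monte Carlo estimate of E_z[f(g_theta(z))]
objective.backward()
print(theta.grad)               # again close to 3.0, with much lower variance in this example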

Now let’s return to reinforcement learning. Recall that the policy gradient theorem states that $$\begin{align} \frac{\partial}{\partial\theta}J(\theta) &\propto \sum_s \mu(s) \sum_a Q^\pi(s,a) \nabla \pi_\theta(a|s) \\ &= \sum_{s,a} \mu(s)\pi_\theta(a|s) Q^\pi(s,a) \nabla \log\pi_\theta(a|s), \end{align}$$ where $\mu$ is the on-policy state distribution.

REINFORCE [2] uses the likelihood ratio estimator. Fix a state $s$; the policy $\pi_\theta(\cdot|s)$ is the parameterized distribution, the action $a$ is the random variable, and $Q$ is a function of the state-action pair, e.g., the true Q-function $Q^\pi$ or an estimate $\hat Q$. We want to estimate $\frac{\partial}{\partial\theta}\mathrm{E}_{a\sim \pi_\theta(\cdot|s)}[Q(s,a)]$: $$ \frac{\partial}{\partial\theta}\mathrm{E}_{a\sim \pi_\theta(\cdot|s)}\left[Q(s,a)\right] = \mathrm{E}_{a\sim \pi_\theta(\cdot|s)}\left[Q(s,a)\frac{\partial}{\partial\theta}\log \pi_\theta(a|s)\right]. $$

DPG [3] and SAC [4] use the reparameterization trick. Let $a = g_\theta(z,s)$ be the output of the policy, where $z$ is a noise random variable that we can sample: $$\begin{align} \frac{\partial}{\partial\theta}\mathrm{E}_{z}\left[Q(s,g_\theta(z,s))\right] &= \mathrm{E}_{z}\left[\frac{\partial}{\partial\theta}Q(s,g_\theta(z,s))\right]\\ &= \mathrm{E}_{z}\left[\frac{\partial}{\partial a}Q(s,a) |_{a=g_\theta(z,s)} \frac{\partial}{\partial \theta}g_\theta(z,s)\right]. \end{align}$$ The second equality follows from the chain rule.

SAC has an additional entropy term, but we can still estimate the gradient: $$\begin{align} &\frac{\partial}{\partial\theta}\mathrm{E}_{z}\left[\alpha\log\pi_\theta(g_\theta(z,s)|s) + Q(s,g_\theta(z,s))\right] \\ &=\frac{\partial}{\partial\theta}\mathrm{E}_{z}\left[\alpha\log\pi_\theta(g_\theta(z,s)|s)\right] + \frac{\partial}{\partial\theta}\mathrm{E}_{z}\left[ Q(s,g_\theta(z,s))\right] \\ &=\mathrm{E}_{z}\Big[\frac{\partial}{\partial\theta}\alpha\log\pi_\theta(a|s)|_{a=g_\theta(z,s)} \\ &\ \ \ \ + \frac{\partial}{\partial a} \alpha\log\pi_\theta(a|s)|_{a=g_\theta(z,s)}\frac{\partial}{\partial \theta}g_\theta(z,s) \\ &\ \ \ \ +\frac{\partial}{\partial a}Q(s,a) |_{a=g_\theta(z,s)} \frac{\partial}{\partial \theta}g_\theta(z,s)\Big]. \end{align}$$ The first two terms arise because $\log\pi_\theta(a|s)$ depends on $\theta$ both directly through the density and indirectly through the action $a = g_\theta(z,s)$.

I provide some PyTorch examples with Gaussian policies below. The PyTorch implementation can be found here.

The likelihood ratio estimator:

import torch  # policy_network, critic_network, and state are assumed to be defined elsewhere

mean, std = policy_network(state)
m = torch.distributions.Normal(mean, std)
action = m.sample()                               # sample() does not track gradients through the action
log_prob = m.log_prob(action)
q_value = critic_network(state, action).detach()  # treat Q(s,a) as a constant weight
loss = -(log_prob * q_value).sum()                # reduce to a scalar before calling backward()
loss.backward()

The reparameterization trick:

mean, std = policy_network(state)
m = torch.distributions.Normal(mean, std)
action = m.rsample()                     # rsample() keeps the graph: action = mean + std * z
log_prob = m.log_prob(action)            # not needed here; used for the entropy term below
q_value = critic_network(state, action)  # critic_network needs to be differentiable; do not detach
loss = -q_value.sum()                    # reduce to a scalar before calling backward()
loss.backward()
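
With the entropy term added, here is a minimal sketch of a SAC-style policy loss (assuming a one-dimensional action; alpha is the entropy temperature, and the tanh squashing used in the full SAC algorithm is omitted). The SAC paper writes the policy update as minimizing $\alpha\log\pi_\theta(a|s) - Q(s,a)$:

alpha = 0.2                                  # entropy temperature (illustrative value)
mean, std = policy_network(state)
m = torch.distributions.Normal(mean, std)
action = m.rsample()                         # reparameterized, differentiable sample
log_prob = m.log_prob(action)                # log pi_theta(a|s), differentiable w.r.t. theta
q_value = critic_network(state, action)
loss = (alpha * log_prob - q_value).mean()   # minimize E[alpha * log pi - Q]
loss.backward()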

[1] Gradient Estimation Using Stochastic Computation Graphs
[2] Simple statistical gradient-following algorithms for connectionist reinforcement learning
[3] Deterministic Policy Gradient Algorithms
[4] Soft Actor-Critic Algorithms and Applications

Other related work:
[5] Learning Continuous Control Policies by Stochastic Value Gradients
[6] Action-dependent Control Variates for Policy Optimization via Stein’s Identity
