## NIPS 2015 – RL Tutorial

Posted: 2015-12-08 in conferences
Tags:

Rich Sutton gave a tutorial on function approximation in RL. Rich being one of the pioneers of RL, I was looking forward to his insights. He started off with some inherent properties of reinforcement learning, which include:

• Evaluative Feedback
• Delayed consequences
• Trial and error learning
• Non-stationarity
• Sequential problems
• No human instruction

There are a bunch of interesting RL success stories, such as

• TD Gammon (Tesauro 1992)
• A-B tests
• Learning to fly helicopters (Ng et al., 2006)
• IBM Watson System (2011)
• Atari Games (Mnih et al., 2015)

Rich’s talk was frequently spiced up with various demos. The first one was running a simple MDP with the audience, where the audience played the role of a reinforcement learning system. I really liked it, and it got the audience engaged. This demo of an MDP was followed by the definition of an MDP (policies, states, actions, rewards).

A key component of the talk was the action-value function $Q^\pi(s,a)$: How good is it to be in state $\latex s$, apply action $a$, and thereafter follow a policy $\pi$. The action-value function is interesting because it characterizes optimal policies in the sense that optimal policies maximize the action-value function. The optimal action-value function $Q^*$ is shared by all optimal policies.
Once the optimal policy is found, acting optimally is easy by “greedification”, i.e., $\pi^*(s) = \arg\max_a Q^*(s,a)$. There is always (at least) one deterministic optimal policy.

### Exact Solution Methods (Tabular Methods)

The talk continued with exact solution methods (tabular methods) for RL problems, and we focused on Q-learning, an off-policy learning method, which directly implements the connection between optimal policies and state-action value function.

Looking at policies, we distinguish between a target policy (optimal policy) and a behavioral policy (for generating the data). This is useful for trading off exploration and exploitation, but finding the right balance is an open problem in the field. Q-learning sidesteps this to some degree, because the behavioral problem is different from the target policy (off-policy).

He went on and formulated the policy improvement theorem. In words:
Given the value function for any policy, it can always be made better $\geq$ by greedification. If $=$ then both policies are optimal.

This leads to policy iteration, which converges in a finite number of steps (in a tabular setting with finitely many states and actions).

A key concept in RL is bootstrapping, and the Bellman equation is the prominent example for bootstrapping in RL.

### Approximate Solutions (Function Approximation)

As soon as we have too many states or actions or continuous-valued spaces, we need to do something about it.

We need to approximate the policy, the model of the world, but most importantly the action-value function: $Q(s,a,\theta) \approx Q(s,a)$. The focus in the tutorial was on linear-in-the-parameters models, i.e., linear feature weighting. It is important that the computations scale linearly with the number of the parameters.

How do we do function approximation in practice? He gave the example of Robocup keepaway (Stone et al., 2005)

• We define state features
• We use a function approximator to create giant feature vectors (e.g., tile coding)
• We find a linear map from features to actions

Q-learning often works with function approximation, but there are counter-examples, which hints at some sort of problem.

Semi-gradient methods (Watkins 1989, Sutton 1988, Rummery 1994) ignore parts of the gradient when updating the value function. Q-learning approximates $Q^*$, whereas SARSA approximates $Q^\pi$, and the policy is $\epsilon$-greedy. Semi-gradient SARSA has good convergence properties (unlike Q-learning).
An interesting point he made was that on-policy methods often perform better than off-policy methods, but their policies are often worse. I am not sure whether this is an empirical statement or proven. Nevertheless, as an example, Rich gave the cliff example where the optimal policy would be walking along the cliff. On policy methods (SARSA) choose to stay away from the cliff and find a “safe” path, whereas off-policy RL (Q-learning) is quite happy to walk along the cliff.
It was not quite evident to me why this is the case, but here is an attempt for an explanation: SARSA approximates the expected action-value function and therefore accounts for some uncertainty, whereas Q-learning is only interested in approximating the optimal value function.

He then got back to discussing why there are instabilities with semi-gradient Q-learning: It is not a problem with learning/sampling/greedification/exploration, but it a problem with combining function approximation, bootstrapping, off-policy learning. Either two of them is fine (empirically).

He then analyzed this a bit: on-policy instead of off-policy gives SARSA, which is stable. If we replace bootstrapping, we get Monte-Carlo RL: Monte-Carlo has high variance, but the degree of bootstrapping can be controlled by a parameter $\lambda$, which results in semi-gradient SARSA($\lambda$). However, no bootstrapping is not a good idea if we only have a finite amount of time – he gave a few examples.

There are potentially some other ways out of the problem with semi-gradient Q-learning, including

• Good feature
• High $\lambda$
• Maybe replay (see Double Q-learning, van Hasselt 2010)
• Least-squares methods (but computational issues)
• Emphatic TD.

He also mentioned that value functions are useful to reduce the variance (compared to policy gradient methods without value functions). We see this quite a bit when looking at “compatible function approximation”.

Although the tutorial largely focused on value function approximation, but why should we no not approximate policy directly because

• Often, the policy is simpler to model than the value function
• Smoother change in policy would be possible
• We would avoid the max operator

However, should we choose to approximate the policy, we should never discount when optimizing approximate policies  because it breaks the definition of an optimal policy. Instead we should look at average rewards

At the end, Rich briefly touched upon model-based RL, where one learns a model of the environment, which is subsequently used to generate simulated trajectories. Once the model is learned, optimal policies are found quickly. Unfortunately, Rich did not have the time to discuss model-based RL in more detail and hint at some of the problems with model errors etc.

Rich ended with a slide on Eligibility Traces, which unify bootstrapping and Monte Carlo and allow an interaction with off-policy learning via importance sampling.

Overall, an interesting and good tutorial that gave some details and insights into various issues in RL.