## NIPS 2015 – Deep RL Workshop

Posted: 2015-12-13 in conferences, research

This is a brief summary of the first part of the Deep RL workshop at NIPS 2015. I couldn’t get a seat for the second half…

### Deep RL with Predictions (Honglak Lee)

How to use predictions from a simulator to predict rewards and optimal policies:

The idea is to use deep learning for generalization. Since deep learning is primarily good at perception problems, the focus is on “visual RL”, specifically the Arcade Learning Environment (ALE).

Honglak gave a brief history of algorithms for the ALE. DeepMind made the ALE widely known through their Nature paper: they used convolutional networks to map the input space to a low-dimensional feature space and approximate the Q-function, then used Q-learning to find good policies (DQN). Before that paper, the state of the art used Monte-Carlo Tree Search (UCT) to compute approximate Q-values for the actions at the current frame. This achieves very good performance, but it is not a real-time player.

Now, we want to combine these two ideas: training the CNN on data generated by the UCT (tree-search) player yields a real-time player. Honglak discussed two methods: UCT for regression (the CNN predicts Q-values) or UCT for classification (the CNN predicts the policy).
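To make the difference between the two variants concrete, here is a minimal sketch of how the training targets could be built from UCT output. The function name `uct_targets` and the array layout are my own assumptions for illustration, not something shown in the talk.

```python
import numpy as np

def uct_targets(q_estimates):
    """Turn per-state UCT action-value estimates into two kinds of targets.

    q_estimates: array of shape (num_states, num_actions); a hypothetical
    layout for the values the tree search returns at each visited frame.
    """
    q_estimates = np.asarray(q_estimates, dtype=float)
    regression_targets = q_estimates                     # CNN regresses every Q-value
    classification_targets = q_estimates.argmax(axis=1)  # CNN predicts UCT's chosen action
    return regression_targets, classification_targets

# Two example states with three actions each (made-up numbers).
q = np.array([[0.1, 0.9, 0.3],
              [2.0, 1.5, 1.9]])
reg, cls = uct_targets(q)
```

Note that the classification target only keeps which action is best, so small errors in the magnitude of the Q-values no longer matter.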

Potential issue: the state distribution encountered by the trained player differs from that of the UCT player. The remedy is to interleave training and data collection (a special instance of the DAGGER framework [Ross & Bagnell]).
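The interleaving idea can be sketched on a toy problem. Everything below is an assumption made for illustration: a 1-D world with states 0 to 9 stands in for the game, `expert_action` stands in for the slow UCT player, and a per-state majority vote in `fit_policy` stands in for training the CNN classifier.

```python
from collections import Counter, defaultdict

GOAL = 5  # toy 1-D world with states 0..9; an episode ends at the goal

def expert_action(state):
    # Stand-in for the slow UCT player: always step toward the goal.
    return 1 if state < GOAL else -1

def env_step(state, action):
    next_state = max(0, min(9, state + action))
    return next_state, next_state == GOAL

def fit_policy(dataset):
    # Stand-in for training the CNN classifier: majority vote per state.
    votes = defaultdict(Counter)
    for state, action in dataset:
        votes[state][action] += 1
    table = {s: c.most_common(1)[0][0] for s, c in votes.items()}
    return lambda state: table.get(state, 1)  # unseen states default to +1

def interleaved_training(start_states, horizon=20):
    dataset = []             # aggregated (state, expert label) pairs
    policy = expert_action   # the first rollout follows the expert itself
    for start in start_states:
        state = start
        for _ in range(horizon):
            dataset.append((state, expert_action(state)))  # expert labels the state
            state, done = env_step(state, policy(state))   # current player chooses the move
            if done:
                break
        policy = fit_policy(dataset)  # retrain on all data collected so far
    return policy, dataset

policy, dataset = interleaved_training(start_states=[0, 9, 9, 9, 9])
```

In this toy run, the states are visited under the trained player's own behaviour but labelled by the expert, so each round of retraining corrects the player on the states where it actually ends up.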

The classification-based method works well. The regression-based technique does not seem to work as well, which raises the question of whether regressing Q-values is worthwhile.

How to predict future video frames:

In the flavor of a Deep Dynamical Model (Wahlstroem et al., 2014), Honglak uses a convolutional auto-encoder network. Another option is to feed in a single frame at a time and let a recurrent network give the latent state a memory.

Honglak mentioned “uncertainty reduction” by using actions (more information available compared to just using the state), but it seems they do not explicitly represent uncertainty by means of probability distributions. They use predictions as input into a pre-trained Deep Learning Controller. The predictive model can also be used for informed exploration using a visit-frequency heuristic.
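A visit-frequency heuristic of this kind could look roughly like the sketch below. It assumes that states are hashable and can be counted exactly, which raw frames generally cannot, so a real system would need some notion of frame similarity instead. The names `informed_exploration_action` and `predict_next` are hypothetical.

```python
import random
from collections import Counter

def informed_exploration_action(state, actions, predict_next, visit_counts, rng=random):
    """Pick the action whose predicted next state has been visited least often."""
    predicted = {a: predict_next(state, a) for a in actions}
    fewest = min(visit_counts[s] for s in predicted.values())
    candidates = [a for a, s in predicted.items() if visit_counts[s] == fewest]
    return rng.choice(candidates)  # break ties at random

# Toy usage: states are integers and the "predictive model" just adds the action.
visit_counts = Counter({1: 3})          # state 1 seen three times, state -1 never
chosen = informed_exploration_action(
    state=0,
    actions=[-1, 1],
    predict_next=lambda s, a: s + a,
    visit_counts=visit_counts,
)
```

Here the agent steps toward state -1 because the model predicts it leads somewhere it has not been yet.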

### On General Problem Solving & How to Learn an Algorithm (Juergen Schmidhuber)

Juergen has been working on AI for three decades and considers RL the proper framework for it. He looks at a more general setting than textbook RL, though: he is more interested in the “full thing”.

Juergen gave a brief history of RNNs. RNNs are general-purpose computers, and learning a program boils down to learning a weight matrix. An LSTM is an instance of an RNN that is good at learning long-term relationships. Juergen gave a short history of LSTMs, starting with supervised learning, but also covering RL (e.g., Bakker et al., IROS 2003). LSTMs are now used for all kinds of things (speech recognition, translation, …). Then he made this incredibly funny comment:

Google already existed 13 years ago, but it was not yet aware that it was becoming a big LSTM.

Juergen’s first deep network back in 1991 was a stack of RNNs (Hierarchical Temporal Memory). Deep learning was started by Ivakhnenko in 1965, backpropagation by Linnainmaa in 1970, and Werbos was the first one to apply this idea to NNs back in 1982.

Unsupervised learning is basically nothing but compression.

In the context of life-long meta learning, Juergen mentioned the Goedel machine. The Goedel machine (2003) is a theoretically optimal self-improver, which ticks many of the boxes for life-long learning. It provides a solution to the towers-of-Hanoi problem: learn context-free and context-sensitive languages, and finally learn to solve this problem.

In 2013, he used RNNs for learning from raw videos (“RL in partially observable worlds”) using compressed network search.

The talk was more an interesting history lesson than a discussion of current research activities, but nevertheless very interesting.

Future directions: Learn subgoals in a hierarchical fashion; learns faster.

### Michael Bowling

Michael is one of the fathers of the Arcade Learning Environment (ALE), and he started with an overview of the original intentions behind it. The ALE contains a large number of very different games. It is deterministic, and therefore planning and prediction are quite easy. The main idea is to use the ALE for generalizing across games: learn games you have never seen before by using knowledge from old games.

DeepMind’s Nature paper gave a substantial boost to the ALE and to the RL community in general, and RL is very hip again.

DQN reaches human-level performance on about 50% of the games. DQN does policy optimization; how is this different from “continual learning”? He gave an example where linear function approximation on the game “Up’n Down” performs better than DQN at some point during learning. Food for thought.

One of the most interesting points Michael highlighted:

There is an implicit bias created by the deep network architecture: spatial invariance, non-Markov features, object detection (due to the size of the convolution [8×8] and the size of the screen).

If we build the same biases into linear function approximation, it works very well, at the same level as DQN (but with less variance than DQN). They call it “shallow RL”, and it runs significantly faster than DQN. There is some room for thinking about these issues.
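For reference, Q-learning with linear function approximation boils down to a very small update. The sketch below is the generic textbook update, not the specific feature set from the talk; the feature vector is a placeholder for whatever hand-designed features encode the biases listed above, and the function name and default step sizes are my own.

```python
import numpy as np

def linear_q_update(weights, features, action, reward, next_features, done,
                    alpha=0.1, gamma=0.99):
    """One Q-learning step for Q(s, a) = weights[a] @ features(s).

    weights: array of shape (num_actions, num_features), updated in place.
    """
    q_sa = weights[action] @ features
    bootstrap = 0.0 if done else gamma * np.max(weights @ next_features)
    td_error = reward + bootstrap - q_sa
    weights[action] += alpha * td_error * features  # gradient of q_sa w.r.t. weights[action]
    return td_error

# Toy usage with two actions and three placeholder features.
weights = np.zeros((2, 3))
phi = np.array([1.0, 0.0, 1.0])
td = linear_q_update(weights, phi, action=1, reward=1.0, next_features=phi, done=True)
```

Each step costs only a few dot products, which is one reason such a learner can run much faster than a deep network.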

Coming back to the goal of continual learning, Michael highlighted that self-driven exploration is critical. One way to do transfer learning is to share parts of the weight space, e.g., the middle layers.

### Faster Deep Reinforcement Learning (V Mnih)

A problem is that visual deep RL is computationally super expensive. One remedy is distributed training over multiple machines (Nair et al., 2015). Ideally, though, one would achieve fast training on a single machine, have a flag for on-/off-policy learning, and add more flexibility with regard to discrete/continuous actions. DeepMind has developed such a system:

AsyncRL

• Shared model
• Parallel actor learners (threads) have a stabilizing effect
• Runs on CPU
• Family of RL algorithms

Some more insights:

• Optimization: async SGD can be unstable, RMSProp with shared statistics improves stability significantly.
• Exploration should be diverse among actors: e.g., different $\epsilon$ values in $\epsilon$-greedy exploration.
• Policy gradient methods and Q-learning are implemented and can be turned on/off.
• Seems to perform at least as well as DQN, and in some situations it can drastically outperform DQN.
• For 1-step methods, using more threads improves data efficiency because state distributions are much smoother and better distributed.
• For n-step methods, there is no gain in terms of data efficiency, but it is quite stable, even with nonlinear function approximation.
• Overall, async Actor Critic performed best.
• Games that can be solved to a reasonable degree are Montezuma’s Revenge, TORCS, a 1st-person exploration game, and MuJoCo (incl. robot control).
• Works on both features and pixels.
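For the n-step methods mentioned above, the targets for a short rollout segment can be computed backward from a bootstrap value. This is a minimal sketch of the standard n-step return, assuming each actor thread collects a few rewards before updating; the function name and signature are my own.

```python
def n_step_targets(rewards, bootstrap_value, gamma=0.99):
    """Discounted n-step targets for one rollout segment.

    rewards: rewards r_t, ..., r_{t+n-1} collected by one actor thread.
    bootstrap_value: value estimate of the state reached after the last
    reward (use 0.0 if that state is terminal).
    """
    targets = []
    ret = bootstrap_value
    for r in reversed(rewards):   # walk backward so each target reuses the next one
        ret = r + gamma * ret
        targets.append(ret)
    targets.reverse()             # restore chronological order
    return targets

# Toy usage with made-up numbers.
targets = n_step_targets([1.0, 0.0, 2.0], bootstrap_value=10.0, gamma=0.5)
# targets == [2.75, 3.5, 7.0]
```

The earliest step in the segment gets the longest lookahead, while the last step reduces to the usual 1-step target.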

### Deep RL in Games Research @IBM (G Tesauro)

Two weeks of CPU time in 1990 are 5 minutes on a laptop now. Running it for longer now with more hidden units gets TD-Gammon up to world-champion performance with no engineered features.

Simplified version of Wolfenstein 3D. This is a POMDP, but with a clear measure of progress and a high-quality simulator.

IBM uses a deep feedforward network + Q-learning (similar to DeepMind’s DQN). This works to some degree, but not too well. A fix is to add some small hints, e.g., partial credits or limiting the use of some actions.