NIPS 2015 Posner Lecture – Zoubin Ghahramani: Probabilistic Machine Learning

Posted: 2015-12-08 in research

At the beginning of the talk, Zoubin took an interesting look back at the early 90s, when he attended NIPS for the first time:

  • At that time, neural networks were hip, Hamiltonian Monte Carlo was introduced (Radford Neal), Laplace approximations for neural networks were introduced (David MacKay), and SVMs were on the rise.
  • Neural networks had the same problems we have today: local optima, choice of architectures, long training times, …
  • Radford Neal showed that Bayesian neural networks with a single hidden layer converge to a Gaussian process in the limit of infinitely many hidden units. He also analyzed infinitely deep neural networks.
  • New ideas that came about at that time: EM, graphical models, variational inference.

Since then, many of these ideas have gained/lost/re-gained momentum, but they have definitely shaped machine learning.

Part I: Machine Learning as Probabilistic Modeling

A model describes data that one could observe from a system. Inverse probability (Bayes’ theorem) is then used for inference about hypotheses from data.

Probabilistic modeling rests on two simple rules from which everything else follows: the sum rule and the product rule.
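
In symbols (standard notation, not copied from the slides), with x and y random variables, θ parameters and D data:

    p(x) = Σ_y p(x, y)                       (sum rule)
    p(x, y) = p(x) p(y | x)                  (product rule)
    p(θ | D) = p(D | θ) p(θ) / p(D)          (Bayes’ theorem, a consequence of the two)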

  • Learning. Find good model parameters (more precisely: a posterior over parameters) given some data.
  • Prediction. Make predictions about future or unseen data by averaging over that posterior.
  • Model Comparison. Compare alternative models of the same data, e.g. via their marginal likelihoods (a small worked example follows this list).
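
To make these three tasks concrete, here is a minimal Beta–Bernoulli coin-flipping sketch (a stock textbook example rather than one from the talk; the data, the uniform prior, and the fair-coin alternative model are made up for illustration):

```python
# Beta-Bernoulli sketch of learning, prediction and model comparison (illustrative).
# Model m1: coin with unknown bias theta ~ Beta(1, 1); model m0: fair coin (theta = 0.5).
import numpy as np
from scipy.special import betaln

data = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 1])  # ten flips, 1 = heads
n, k = len(data), int(data.sum())

# Learning: the posterior over theta under m1 is Beta(1 + k, 1 + n - k).
alpha_post, beta_post = 1 + k, 1 + (n - k)

# Prediction: posterior predictive probability that the next flip is heads.
p_next_heads = alpha_post / (alpha_post + beta_post)

# Model comparison: log marginal likelihoods log p(D | m) and their difference.
log_ml_m1 = betaln(1 + k, 1 + n - k) - betaln(1, 1)  # Beta-Bernoulli evidence
log_ml_m0 = n * np.log(0.5)                          # fair-coin evidence
print(p_next_heads, log_ml_m1 - log_ml_m0)           # log Bayes factor of m1 vs m0
```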

Model comparison is important in all sorts of contexts:

  • # clusters
  • intrinsic dimensionality
  • order of dynamical systems
  • architecture of a neural network
  • # states in an HMM

It’s all model comparison.

Bayesian Occam’s razor: If we want to compare two models, the marginal likelihood is the key ingredient. One interpretation of the marginal likelihood is the following:
“Probability of the data under the model, averaging over all possible parameter values.”
Bayesian Occam’s razor rejects models that are too simple or too complicated to explain the data.
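
In symbols (a standard formulation, not a quote from the slides), the marginal likelihood of a model m averages the likelihood over the prior on its parameters θ:

    p(D | m) = ∫ p(D | θ, m) p(θ | m) dθ

A very flexible model spreads this average over many possible data sets, so it can receive a lower marginal likelihood than a simpler model that concentrates on the observed data.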

A common problem is that the marginal likelihood cannot be computed analytically, because it involves integrals that are usually intractable. We do have a lot of tools for approximating such integrals (a naive Monte Carlo sketch follows the list):

  • Variational approximations
  • Expectation propagation
  • MCMC
  • SMC
  • Laplace approximations
  • Bayesian information criterion
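
As a minimal illustration of the kind of integral these tools target, here is a naive Monte Carlo estimate that simply averages the likelihood over samples from the prior (the toy Gaussian model, the prior, and the sample size are my own choices; the methods listed above are what one would actually use at scale):

```python
# Naive Monte Carlo estimate of the marginal likelihood p(D | model) (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, scale=1.0, size=20)  # toy data set

def log_likelihood(theta, x):
    # Gaussian likelihood with unknown mean theta and unit variance.
    return -0.5 * np.sum((x - theta) ** 2) - 0.5 * len(x) * np.log(2 * np.pi)

# Prior: theta ~ N(0, 2^2). Average the likelihood over prior samples (log-sum-exp for stability).
thetas = rng.normal(0.0, 2.0, size=50_000)
log_liks = np.array([log_likelihood(t, data) for t in thetas])
log_evidence = log_liks.max() + np.log(np.mean(np.exp(log_liks - log_liks.max())))
print(log_evidence)
```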

When do we need probabilities?

There are lots of areas:

  • Forecasting
  • Decision making
  • Learning from limited, noisy and missing data
  • Learning complex personalized models
  • Data compression
  • Automatic scientific modeling, discovery and experimental design

Part II: Applications

From here, Zoubin gave brief overviews of various research areas in a problem-solution scheme:

Bayesian Nonparametrics

Problem: Need flexible probabilistic models to deal with complicated problems
Solution: Define infinite-dimensional probabilistic models (Gaussian processes, Dirichlet processes, …). Such models become richer as the amount of data increases.

Application areas: function approximation, classification, clustering, time series, feature discovery

Example: Gaussian Processes

GPs allow us to define distributions over functions. They can be used for regression, classification, ranking, dimensionality reduction, …

Neural networks and GPs are recently being brought back together and analyzed jointly.
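
As a rough sketch of what GP regression looks like in code (numpy only; the RBF kernel, toy data, and noise level are made up for illustration, and a real implementation would also learn the kernel hyperparameters):

```python
# Minimal Gaussian process regression sketch (standard equations, illustrative setup).
import numpy as np

def rbf(a, b, lengthscale=1.0, variance=1.0):
    # Squared-exponential covariance between two sets of 1-D inputs.
    return variance * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale ** 2)

rng = np.random.default_rng(0)
X = np.array([-4.0, -2.0, 0.0, 1.5, 3.0])          # training inputs
y = np.sin(X) + 0.1 * rng.standard_normal(len(X))  # noisy observations
noise = 0.1 ** 2

# Posterior at test inputs: mean = Ks^T (K + noise*I)^-1 y, var = diag(Kss - Ks^T (K + noise*I)^-1 Ks).
Xs = np.linspace(-5, 5, 100)
K = rbf(X, X) + noise * np.eye(len(X))
Ks, Kss = rbf(X, Xs), rbf(Xs, Xs)
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
mean = Ks.T @ alpha
v = np.linalg.solve(L, Ks)
var = np.diag(Kss) - np.sum(v ** 2, axis=0) + noise
print(mean[:3], var[:3])
```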

Probabilistic Programming

Problem: Probabilistic model development and the derivation of inference algorithms are error-prone and time-consuming. Currently, this is done by hand.

Solution: Develop a programming language in which a model is a computer program that generates data. Then derive a universal inference engine for this language that does inference over program traces given observed data (practically, this is running Bayes’ theorem on computer programs [i.e., models]).

Such languages and inference engines already exist, but they are not yet super-efficient; inference is typically based on MCMC.

There are probabilistic programming languages built on nearly all “standard” programming languages (Python, Haskell, C, Julia, …).
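
The following toy example is meant to capture the idea rather than any real probabilistic programming language: the “program” generates data, and a small generic likelihood-weighting engine inverts it. The Trace class, the model, and the observed value are all invented for illustration.

```python
# Toy probabilistic-programming sketch: a generative program plus a generic
# likelihood-weighting (importance sampling) inference engine over its traces.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
observed = 2.3  # a single observed data point

class Trace:
    def __init__(self):
        self.log_weight = 0.0
        self.values = {}
    def sample(self, name, sampler):
        self.values[name] = sampler()       # record a latent random choice
        return self.values[name]
    def observe(self, logpdf, value):
        self.log_weight += logpdf(value)    # weight the trace by the data likelihood

def program(trace):
    # Generative program: draw a latent mean, then generate an observation around it.
    mu = trace.sample("mu", lambda: rng.normal(0.0, 1.0))
    trace.observe(lambda x: norm.logpdf(x, loc=mu, scale=0.5), observed)
    return mu

# Generic inference: run the program many times and weight each trace by its likelihood.
traces = []
for _ in range(20_000):
    t = Trace()
    program(t)
    traces.append(t)
w = np.exp([t.log_weight for t in traces])
mus = np.array([t.values["mu"] for t in traces])
print(np.sum(w * mus) / np.sum(w))  # approximately the analytic posterior mean 2.3 / 1.25 = 1.84
```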

Bayesian Optimization

Problem: Global optimization of black-box functions, which are expensive to evaluate

Solution: Treat optimization as a sequential decision-making problem under uncertainty. Model uncertainty about the function explicitly. Lots of applications (robotics, drug design, finding hyper-parameters of neural networks).
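
A sketch of the basic loop, using a GP surrogate with an expected-improvement acquisition function (the black-box objective, kernel, grid, and budget are made up; practical systems also learn hyperparameters, handle constraints, and work in higher dimensions):

```python
# Bayesian optimization sketch: GP surrogate + expected improvement (minimization).
import numpy as np
from scipy.stats import norm

def rbf(a, b, ls=0.5):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    # Standard GP regression equations with a small jitter term.
    K = rbf(X, X) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    v = np.linalg.solve(L, rbf(X, Xs))
    mean = rbf(X, Xs).T @ alpha
    std = np.sqrt(np.clip(np.diag(rbf(Xs, Xs)) - np.sum(v ** 2, axis=0), 1e-12, None))
    return mean, std

def expensive_black_box(x):
    return np.sin(3 * x) + 0.5 * x ** 2  # stand-in for an expensive objective

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=3)           # a few initial evaluations
y = expensive_black_box(X)
grid = np.linspace(-2, 2, 400)

for _ in range(10):
    mean, std = gp_posterior(X, y, grid)
    best = y.min()
    z = (best - mean) / std
    ei = (best - mean) * norm.cdf(z) + std * norm.pdf(z)  # expected improvement
    x_next = grid[np.argmax(ei)]                          # evaluate where EI is largest
    X = np.append(X, x_next)
    y = np.append(y, expensive_black_box(x_next))

print(X[np.argmin(y)], y.min())  # best input and value found within the budget
```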

Data Compression

Problem: Compress data

Solution: Compression algorithms are governed by Shannon’s source coding theorem and are implicitly based on probabilistic models. Thus, develop better sequential, adaptive, nonparametric models of the data; the better we can predict the data, the cheaper it becomes to store or transmit it.
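
A small sketch of this prediction–compression link: an adaptive model assigns each upcoming symbol a probability p, and Shannon’s theorem says it can be encoded in about -log2(p) bits (e.g. via arithmetic coding). The Laplace-smoothed counting model and the toy message below are my own simplifications, not a real compressor:

```python
# Better prediction -> shorter codes: ideal code length under a simple adaptive model.
import math
from collections import Counter

message = "abracadabra abracadabra abracadabra"
counts = Counter({s: 1 for s in set(message)})  # Laplace-smoothed adaptive symbol counts

total_bits = 0.0
for symbol in message:
    p = counts[symbol] / sum(counts.values())   # model's predictive probability
    total_bits += -math.log2(p)                 # ideal code length for this symbol
    counts[symbol] += 1                         # update the model after seeing the symbol

print(f"{total_bits:.1f} bits vs {8 * len(message)} bits for plain 8-bit encoding")
```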

Automatic Statistician

Problem: Lots of data available, but not enough people to analyze it

Solution: Automate data analysis (focus on model discovery)

  • A language of models. Here, the atoms of the language are Gaussian process kernels (a tiny search sketch follows this list)
  • Systematic search procedure within this language
  • Principled method to evaluate models
  • Automatic analysis generation in a human-understandable format
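
A greatly simplified sketch of the search idea: a handful of base kernels, one level of + / * composition, and the GP log marginal likelihood (with fixed hyperparameters) as the score. The real system searches much deeper, optimizes hyperparameters, and writes a natural-language report; everything below is an illustrative stand-in.

```python
# Toy kernel-grammar search scored by GP log marginal likelihood (illustrative).
import numpy as np
from itertools import product

def rbf(a, b):      return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)
def linear(a, b):   return a[:, None] * b[None, :]
def periodic(a, b): return np.exp(-2 * np.sin(np.pi * (a[:, None] - b[None, :])) ** 2)

base = {"RBF": rbf, "LIN": linear, "PER": periodic}

def log_marginal_likelihood(kernel, X, y, noise=0.1):
    K = kernel(X, X) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * len(X) * np.log(2 * np.pi)

rng = np.random.default_rng(0)
X = np.linspace(0, 4, 40)
y = np.sin(2 * np.pi * X) + 0.3 * X + 0.1 * rng.standard_normal(len(X))  # periodic + trend

# One level of the grammar: all base kernels plus their pairwise sums and products.
candidates = dict(base)
for (n1, k1), (n2, k2) in product(base.items(), base.items()):
    candidates[f"{n1}+{n2}"] = lambda a, b, k1=k1, k2=k2: k1(a, b) + k2(a, b)
    candidates[f"{n1}*{n2}"] = lambda a, b, k1=k1, k2=k2: k1(a, b) * k2(a, b)

scores = {name: log_marginal_likelihood(k, X, y) for name, k in candidates.items()}
print(max(scores, key=scores.get))  # highest-scoring kernel structure on this data
```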

Rational Allocation of Computational Resources

The key idea here is to build predictive models of how things will perform in the long term (extension of Freeze-Thaw Bayesian Optimization [Swersky et al., 2014]) and allocate computational resources accordingly.
