At the beginning of the talk, Zoubin gave an interesting look back at the early 90s, when he attended NIPS for the first time:

- At that time, neural networks were hip, Hamiltonian Monte Carlo was introduced (Radford Neal), Laplace approximations for neural networks were introduced (David MacKay), and SVMs were on the rise.
- Neural networks had the same problems we have today: local optima, choice of architectures, long training times, …
- Radford Neal showed that a Bayesian neural network with a single hidden layer converges to a Gaussian process in the limit of infinitely many hidden units. He also analyzed infinitely deep neural networks.
- New ideas that came about at that time: EM, graphical models, variational inference.

Since then, many of these ideas have gained/lost/re-gained momentum, but they have definitely shaped machine learning.

## Part I: Machine Learning as Probabilistic Modeling

A model describes data that one could observe from a system. We then use inverse probability (Bayes’ theorem) to make inferences about hypotheses from data.

Probabilistic modeling rests on two simple rules from which everything follows: the **sum rule and product rule**.
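Written out (a standard statement, not verbatim from the talk), with Bayes’ theorem as their direct consequence:

```latex
P(x) = \sum_y P(x, y)                % sum rule
P(x, y) = P(y \mid x)\, P(x)         % product rule
P(\theta \mid \mathcal{D})
  = \frac{P(\mathcal{D} \mid \theta)\, P(\theta)}{P(\mathcal{D})}
                                     % Bayes' theorem
```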

- Learning: find good model parameters (more precisely, a posterior over parameters) given some data.
- Prediction: compute the probability of unseen data given the data observed so far.
- Model comparison: compare how well different models explain the data.

**Model comparison** is important in all sorts of contexts:

- # clusters
- intrinsic dimensionality
- order of dynamical systems
- architecture of a neural network
- # states in an HMM

It’s all model comparison.

**Bayesian Occam’s razor:** If we want to compare two models, the marginal likelihood is the key ingredient. One interpretation of the **marginal likelihood** is the following:

“Probability of the data under the model, *averaging* over all possible parameter values.”

Bayesian Occam’s razor rejects models that are too simple or too complicated to explain the data.
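A toy sketch of this effect (my own illustrative example, not from the talk): compare a fixed fair-coin model against a model that averages over all possible biases. The flexible model is penalized on unsurprising data and rewarded on data that the simple model cannot explain.

```python
from math import comb

def ml_fixed(heads, flips, p=0.5):
    # No free parameters, so the marginal likelihood is just the likelihood.
    return comb(flips, heads) * p**heads * (1 - p)**(flips - heads)

def ml_uniform(heads, flips):
    # Uniform(0, 1) prior on the bias; the average has a closed form:
    # integral of C(n,k) p^k (1-p)^(n-k) dp = 1 / (n + 1).
    return 1.0 / (flips + 1)

# Balanced data: the simpler fixed-bias model wins (Occam's razor).
print(ml_fixed(5, 10), ml_uniform(5, 10))   # 0.246... vs 0.0909...
# Skewed data: averaging over the bias pays off.
print(ml_fixed(9, 10), ml_uniform(9, 10))   # 0.00976... vs 0.0909...
```

Neither model is "right"; the marginal likelihood simply scores how well each one, as a whole, predicts the observed data.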

A common problem is that the marginal likelihood cannot be computed analytically, since it involves (often high-dimensional) integrals. Fortunately, we have a lot of tools for approximating such integrals:

- Variational approximations
- Expectation propagation
- MCMC
- SMC
- Laplace approximations
- Bayesian information criterion
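To make one of these concrete, here is a Laplace approximation on a conjugate toy model where the exact answer is available for comparison (the numbers are illustrative, not from the talk): 7 heads in 10 flips, Beta(2, 2) prior on the coin bias.

```python
from math import lgamma, log, pi

k, n, a, b = 7, 10, 2, 2  # heads, flips, Beta prior parameters

def betaln(x, y):
    # log of the Beta function via log-gamma.
    return lgamma(x) + lgamma(y) - lgamma(x + y)

def log_joint(theta):
    # log p(D | theta) + log p(theta)
    log_prior = (a - 1) * log(theta) + (b - 1) * log(1 - theta) - betaln(a, b)
    log_lik = k * log(theta) + (n - k) * log(1 - theta)
    return log_lik + log_prior

# Posterior mode (MAP), available in closed form for this model.
theta_map = (k + a - 1) / (n + a + b - 2)

# Curvature (negative second derivative of log_joint) at the mode.
A = (k + a - 1) / theta_map**2 + (n - k + b - 1) / (1 - theta_map)**2

# Laplace: log p(D) ~ log_joint(MAP) + 0.5 * log(2*pi / A)
log_ml_laplace = log_joint(theta_map) + 0.5 * log(2 * pi / A)

# Exact marginal likelihood, thanks to conjugacy.
log_ml_exact = betaln(k + a, n - k + b) - betaln(a, b)

print(log_ml_laplace, log_ml_exact)
```

Here the Gaussian approximation around the mode lands within a few hundredths of a nat of the exact log evidence; in non-conjugate, higher-dimensional models it is often the only cheap option.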

**When do we need probabilities?**

There are lots of areas:

- Forecasting
- Decision making
- Learning from limited, noisy and missing data
- Learning complex personalized models
- Data compression
- Automatic scientific modeling, discovery and experimental design

## Part II: Applications

From here, Zoubin gave brief overviews of various research areas in a problem-solution scheme:

**Bayesian Nonparametrics**

Problem: Need flexible probabilistic models to deal with complicated problems

Solution: Define infinite-dimensional probabilistic models (Gaussian process, Dirichlet process, …). Models become richer if the amount of data increases.

Application areas: function approximation, classification, clustering, time series, feature discovery

Example: Gaussian Processes

GP allows us to define a distribution over functions. They can be used for regression, classification, ranking, dimensionality reduction, …

Neural networks and GPs are coming back together and have recently been analyzed jointly.
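A minimal sketch of GP regression (my own toy example with a squared-exponential kernel, not code from the talk): condition a prior over functions on a few observations and read off a posterior mean and variance at a test point.

```python
import numpy as np

def rbf(X1, X2, lengthscale=1.0, variance=1.0):
    # Squared-exponential (RBF) covariance between two sets of 1-D inputs.
    d = X1[:, None] - X2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(X_train, y_train, X_test, noise=1e-2):
    # Standard GP regression equations via a Cholesky factorization.
    K = rbf(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = rbf(X_train, X_test)
    K_ss = rbf(X_test, X_test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v
    return mean, cov

# Toy data: noisy-free observations of sin(x).
X = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.sin(X)
mean, cov = gp_posterior(X, y, np.array([0.5]))
print(mean, np.diag(cov))
```

The posterior mean interpolates the data, and the posterior variance quantifies how uncertain the prediction is between observations — the property that makes GPs useful far beyond plain regression.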

**Probabilistic Programming**

Problem: developing probabilistic models and deriving their inference algorithms is error-prone and time-consuming. Currently, this is done by hand.

Solution: Develop a programming language in which a model is a computer program that generates data. Then derive a universal inference engine for this language that does inference over program traces given observed data (in practice, this amounts to running Bayes’ theorem on computer programs, i.e., on models).

Currently, such systems exist, but they are not yet very efficient. Inference is typically based on MCMC.

There are probabilistic programming languages built on nearly all “standard” programming languages (Python, Haskell, C, Julia, …).
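The core idea fits in a few lines of plain Python (a hand-rolled sketch, not one of the actual languages): the model is a forward-sampling program, and a generic engine weighs its traces by how well they explain the observation (likelihood weighting, the simplest such inference scheme).

```python
import random

def model():
    # Generative program: is the coin from the (hypothetical) biased batch?
    biased = random.random() < 0.3
    p = 0.9 if biased else 0.5
    return biased, p

def likelihood(p, heads, flips):
    # Probability of the observed flips given the trace's coin bias.
    return p**heads * (1 - p)**(flips - heads)

def infer(heads, flips, n_samples=20000):
    # Generic engine: run the program forward, weight traces by the data.
    random.seed(0)
    weighted, total = 0.0, 0.0
    for _ in range(n_samples):
        biased, p = model()
        w = likelihood(p, heads, flips)
        weighted += w * biased
        total += w
    return weighted / total  # estimate of P(biased | data)

posterior = infer(9, 10)
print(posterior)
```

Nothing in `infer` knows what the model means — it only runs the program and weights traces — which is exactly the "universal inference engine" idea (real systems use far more efficient schemes, typically MCMC).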

**Bayesian Optimization**

Problem: Global optimization of black-box functions, which are expensive to evaluate

Solution: Treat optimization as a sequential decision-making problem under uncertainty. Model uncertainty about the function explicitly. Lots of applications (robotics, drug design, finding hyper-parameters of neural networks).
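One step of that loop can be sketched from scratch (a toy of my own: GP surrogate plus the expected-improvement acquisition function; the black-box function and evaluated points are made up):

```python
import math
import numpy as np

def rbf(a, b):
    # Unit-lengthscale squared-exponential kernel for 1-D inputs.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

def gp(X, y, Xs, noise=1e-6):
    # GP posterior mean and variance at query points Xs.
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    mu = Ks.T @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mu, np.maximum(var, 1e-12)

def expected_improvement(mu, var, best):
    # EI for minimization: E[max(best - f(x), 0)] under the GP posterior.
    s = np.sqrt(var)
    z = (best - mu) / s
    cdf = 0.5 * (1.0 + np.array([math.erf(v / math.sqrt(2)) for v in z]))
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2 * math.pi)
    return (best - mu) * cdf + s * pdf

f = lambda x: (x - 0.7) ** 2           # the "expensive" black box (a toy here)
X = np.array([0.0, 0.5, 1.0])          # points evaluated so far
y = f(X)
grid = np.linspace(0.0, 1.0, 101)
mu, var = gp(X, y, grid)
ei = expected_improvement(mu, var, y.min())
x_next = grid[np.argmax(ei)]           # next point to spend an evaluation on
print(x_next)
```

The acquisition function trades off exploitation (low predicted value) against exploration (high predictive variance), so each expensive evaluation is spent where it is expected to help most.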

**Data Compression**

Problem: Compress data

Solution: Compression algorithms need to follow Shannon’s source coding theorem and are implicitly based on probabilistic models. Thus, develop better sequential adaptive nonparametric models of the data that allow us to predict the data better, making it cheaper to store or transmit data.
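The link between prediction and code length is easy to demonstrate (my own toy, assuming an ideal arithmetic coder that spends about -log2 P(symbol) bits per symbol):

```python
import math

def code_length_adaptive(data, alphabet):
    # Sequential adaptive model: predict each symbol from
    # Laplace-smoothed counts of the symbols seen so far.
    counts = {s: 1 for s in alphabet}  # add-one smoothing
    bits = 0.0
    for s in data:
        total = sum(counts.values())
        bits += -math.log2(counts[s] / total)  # ideal code length
        counts[s] += 1
    return bits

data = "aaaaabaaaaabaaaa"
uniform_bits = len(data) * math.log2(2)        # fixed uniform model over {a, b}
adaptive_bits = code_length_adaptive(data, "ab")
print(uniform_bits, adaptive_bits)
```

The adaptive model quickly learns that 'a' is frequent and assigns it short codes, beating the fixed model — better predictions mean fewer bits, which is the whole argument for richer sequential models.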

**Automatic Statistician**

Problem: Lots of data available, but not enough people to analyze it

Solution: Automate data analysis (focus on model discovery)

- Language of models. Here: Atoms of language are the kernels of a Gaussian process
- Systematic search procedure within this language
- Principled method to evaluate models
- Automatic analysis generation in a human-understandable format
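A drastically simplified sketch of the search step (the kernels and hyperparameters below are my own illustrative choices, not the actual system's grammar): score a few candidate GP kernels on the data by their log marginal likelihood and keep the best.

```python
import numpy as np

def log_marginal_likelihood(K, y):
    # Standard GP evidence: -0.5 y' K^-1 y - 0.5 log|K| - n/2 log(2 pi).
    n = len(y)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * n * np.log(2 * np.pi))

# Toy data: a smooth signal.
x = np.linspace(-3, 3, 20)
y = np.sin(x)
noise = 0.1 * np.eye(len(x))

d = x[:, None] - x[None, :]
kernels = {
    "RBF": np.exp(-0.5 * d**2) + noise,     # smooth functions
    "linear": np.outer(x, x) + noise,       # linear trends
    "white": np.eye(len(x)),                # pure noise
}

scores = {name: log_marginal_likelihood(K, y) for name, K in kernels.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

The real system searches compositionally (sums and products of base kernels, with fitted hyperparameters) and then translates the winning structure into a natural-language report, but the scoring principle is this marginal-likelihood comparison.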

**Rational Allocation of Computational Resources**

The key idea here is to build predictive models of how things will perform in the long term (extension of Freeze-Thaw Bayesian Optimization [Swersky et al., 2014]) and allocate computational resources accordingly.
