## NIPS 2015 – Bayesian Optimization Workshop

Posted: 2015-12-13 in conferences, research

I attended the Bayesian Optimization workshop at NIPS 2015, and the following summarizes what was going on in the workshop from my perspective. This post primarily serves my self-interest in not losing these notes. But it may be useful for others as well.


The workshop was effectively run by Bobak Shahriari and Roberto Calandra. At the beginning of the workshop, Bobak Shahriari gave a brief introduction to Bayesian Optimization (BO), motivating the setting of data-efficient global black-box optimization and the gap the workshop aimed to address.

Applications of Bayesian optimization include drug trials in healthcare (Thompson, 1933), model selection (i.e., minimizing generalization loss), A/B testing and personalized advertising, recommender systems (e.g., Scott, 2010), experimental design and sensor networks (Krause & Guestrin, 2008), posterior estimation (Kandasamy et al., 2015), mining, …

Following this, he introduced Bayesian Optimization from a more technical point of view, including the surrogate GP model, exploration strategies, Bayes’ theorem for model updates and the importance of the acquisition functions. The kernel can encode lots of structure, which makes modeling quite powerful.
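To make the pieces concrete, here is a minimal, self-contained sketch of one BO step with a GP surrogate and a UCB-style acquisition function. The toy objective, RBF kernel, grid search and all parameter values are illustrative choices of mine, not from the talk:

```python
import numpy as np

def rbf(a, b, ell=0.3):
    # RBF kernel between two 1-D point sets
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def gp_posterior(X, y, Xs, noise=1e-4):
    # Standard GP regression posterior mean/std at test points Xs
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.diag(rbf(Xs, Xs)) - np.sum(v ** 2, axis=0)
    return mu, np.sqrt(np.maximum(var, 0.0))

f = lambda x: np.sin(3 * x)       # toy stand-in for the expensive black box
X = np.array([0.1, 0.5, 0.9])     # points evaluated so far
y = f(X)

grid = np.linspace(0, 1, 200)     # candidate set
mu, sigma = gp_posterior(X, y, grid)
beta = 2.0                        # exploration parameter
ucb = mu + beta * sigma           # acquisition: mean + beta * std
x_next = grid[np.argmax(ucb)]     # next point to evaluate
```

The acquisition function turns the posterior into a cheap proxy problem: instead of querying the expensive function everywhere, we maximize `ucb` over the candidates.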

Unfortunately, there are some issues, in particular the computational cost of GPs. Various sparse GP approximations and feature-learning techniques try to address this problem. Mondrian forests seem promising as well.

### Bayesian Optimization (Zoubin Ghahramani)

Optimization is a sequential decision-making problem, and quantifying uncertainty is important for doing it well. Bayesian optimization is useful for all kinds of optimization problems, but it is especially useful if

• evaluating the function is expensive
• we have no gradients (or they are difficult to obtain)
• there is noise in the function evaluation
• there are (possibly noisy) constraints
• there is some prior information about the function to be optimized
• one needs to optimize many similar functions

The use of GPs within Bayesian optimization implies various advantages (good uncertainty estimates, generality), but also some disadvantages (expensive, limited by the kernel).

After this introduction, Zoubin talked about the relationship between Bayesian optimization and the Automatic Statistician (AS).

• The AS can be used within BO to find a good model for the BO problem (find a good kernel).
• BO can be used to help the AS to do model selection and to allocate computational resources.

Allocation of Resources: Especially when dealing with large models or under resource constraints, we need to find a trade-off between statistical and computational efficiency. Therefore, treat the allocation of computational resources as part of the overall problem (see Freeze-Thaw Bayesian Optimization by Swersky et al., 2015): Try a few models and look at their learning curves. Then fit a model to the learning curves that predicts how well one would do if more computational resources were thrown at the problem. This can be generalized to ensemble methods, where the utility depends on the ensemble. Ideally, also do some transfer learning by looking at the correlation structure between individual models.
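As a toy illustration of the learning-curve idea (a curve fit, not the actual Freeze-Thaw GP model), one can fit a simple exponential curve to early training observations and read off the predicted asymptote; all the numbers here are synthetic:

```python
import numpy as np
from scipy.optimize import curve_fit

def curve(t, a, b, c):
    # a: asymptotic performance, b: initial gap, c: convergence rate
    return a - b * np.exp(-c * t)

t = np.arange(1, 11)  # epochs observed so far
# Synthetic "accuracy" observations from a known curve plus small noise
acc = curve(t, 0.9, 0.5, 0.3) \
      + 0.005 * np.random.default_rng(0).standard_normal(10)

# Fit the parametric learning curve and extrapolate to t -> infinity
(a, b, c), _ = curve_fit(curve, t, acc, p0=[0.8, 0.5, 0.1])
predicted_final = a  # estimated asymptotic performance
```

If `predicted_final` is below the best model seen so far, the remaining compute budget is better spent elsewhere; Freeze-Thaw makes this decision with a proper probabilistic model over curves instead of a point fit.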

At the end, Zoubin linked BO to probabilistic numerics (e.g., quadrature, solving differential equations, …), which treats numerical problems as inference/learning/decision-making problems under uncertainty.

There is not much work on multi-step optimization in BO; the current focus is largely on myopic optimization. Mike Osborne does have some work on multi-step BO in his 2009 LION3 paper.

### Safe Exploration for Bayesian Optimization (Andreas Krause)

Andreas started off with motivating applications:

• Therapeutic spinal cord stimulation, where some electrodes can be implanted in patients. There is no good bio-physical model, i.e., things need to be learned from data.
• Autonomous controller tuning in robotics. You need to learn controller settings that do not break the platform.

Safe automatic tuning: you can evaluate the objective anywhere, but you must not fall below a safety threshold.

Start with an informed initial guess (maybe with the help of engineers), then exploit some smoothness properties. Realistically, this leads to the problem of finding local optima. Research areas within BO where this shows up:

• Bayesian optimization under unknown constraints (Snoek 2013; Gelbart et al., 2014)
• Level set identification/multi-objective BO (Knowles et al., 2006; Zuluaga et al., 2013; Gotovos et al., 2013)
• Safe exploration in control and RL (e.g., Moldovan & Abbeel, 2013)

Plausible maximizers: look at the best lower bound of the GP model (with high confidence the true function value is better). This idea shows up in the Bandit literature (upper confidence sampling, e.g., Auer et al., 2002). UCB exploration parameter $\beta$ is usually not chosen with safety in mind:

$\alpha(x) = \mu(x) + \beta \sigma(x)$

However, we can statistically “certify” safety where the lower confidence bound exceeds the safety threshold. This is the idea behind SafeUCB and SafeOpt (Sui et al., ICML 2015). SafeUCB can get stuck in bad local optima. SafeOpt instead reasons about potential expansions of the safe region and tracks these sets over time: keep sampling, picking the most uncertain point that is certified safe. This fully explores a “safe” part of the parameter space and comes with guarantees on safety and on how long it takes to reach a good solution within the safe region.
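The safe-sampling rule can be sketched in a few lines. The posterior mean/std arrays below are synthetic stand-ins for a fitted GP, and the rule is a simplified caricature of SafeOpt (it omits SafeOpt’s expander/maximizer set machinery):

```python
import numpy as np

grid = np.linspace(0, 1, 100)
mu = np.sin(3 * grid)        # stand-in GP posterior mean
sigma = 0.1 + 0.2 * grid     # stand-in GP posterior std
beta = 2.0                   # confidence parameter
h = -0.5                     # safety threshold on f

lcb = mu - beta * sigma      # high-confidence lower bound on f
safe = lcb >= h              # candidates certified safe with high probability

# Simplified rule: among certified-safe points, sample where the model is
# most uncertain, to expand the known-safe region over time.
idx = np.argmax(np.where(safe, sigma, -np.inf))
x_next = grid[idx]
```

The key contrast with plain UCB is that candidates failing the `lcb >= h` check are never queried, no matter how promising their upper bound looks.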

Applications: Movie recommendations (MovieLens dataset), spinal cord therapy, robot control (UAV).

Next steps: Go from bandits to RL tasks and do safe exploration in MDPs.

Open questions: Lower bounds (sample complexity), high dimensions, non-GP models?

### Scalable & Flexible Bayesian Optimization for Algorithm Configuration (Frank Hutter)

Problem setting: Algorithm with some parameters $\theta$ and a distribution over problem instances $D$. Solve:

$\theta^*\in\arg\min_\theta E_D[m(\theta, D)]$

where $m$ is a performance metric.

After the problem setting, Frank explained how to use random forests for BO. Look at the random forest as a mixture model of trees and obtain a predictive distribution, where the mean is the sample mean of the tree predictions and the variance is computed using the law of total variance.

• Scales in $O(N(\log N)^2)$, where $N$ is the number of observations.
• Scales well with high dimensions
• Handles conditional inputs (e.g., number of units in a hidden layer of a NN conditioned on the number of layers)
• Natively handles continuous and categorical inputs
• Non-Gaussian noise
• Non-stationarity
• Robustness
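The mixture-of-trees predictive distribution reduces to a two-line computation; the per-tree numbers below are made up purely for illustration:

```python
import numpy as np

# Per-tree predictive means and variances at one test point (illustrative)
tree_means = np.array([0.8, 1.1, 0.9, 1.0])
tree_vars = np.array([0.05, 0.04, 0.06, 0.05])

# Forest mean: average of tree means
mean = tree_means.mean()
# Law of total variance: E[tree variance] + Var[tree mean]
var = tree_vars.mean() + tree_means.var()
```

The second term is what gives the forest its disagreement-based uncertainty: where the trees agree, `var` shrinks towards the average within-tree variance; where they disagree, it grows.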

SMAC (Hutter et al., 2009–2015): Use RFs inside the BO loop. It is about 1000x faster than the DNGO approach (Snoek et al., ICML 2015), which uses a deep neural network as adaptive basis functions for a Bayesian linear-regression surrogate. You can also do learning-curve predictions (similar to Freeze-Thaw).

Go beyond black-box optimization, e.g., by looking at

• Warm-start by meta-learning
• Early termination
• Reasoning across subsets of data

### Poster Spotlights

• Predictive Entropy Search for Multi-objective Bayesian Optimization: Find Pareto set within a small number of evaluations. Each objective modeled by GP, therefore the Pareto set is random. Minimize the entropy of optimal parameters. Overall acquisition function is the sum of individual acquisitions.
• Optimization with Gaussian Processes via Chaining: Proves theoretical properties on the regret. Uses UCB for this. The canonical pseudo-distance gives a bound on the regret but does not incorporate the variance. A union bound over all parameters works for finite parameter sets; the extension to infinite sets is in the poster.
• Towards efficient Bayesian Optimization for Big Data: Automatically adjust the size of the data during optimization. Works well if the performance on small subsets is representative. Model the performance across data set sizes, similar to MultiTaskBO (Swersky et al.)
• UAVs doing Bayesian Optimization to Locate WiFi Devices: Drones do BO to find a cell-phone based on signal strength (very noisy) of the phone.
• Locally-biased Bayesian Optimization using Nonstationary Gaussian Processes: Kernels in GPs are designed for regression, but not for optimization. Optimization trades off exploitation and exploration. The surrogate model should provide information about the location of the optimum. Define the Spartan kernel, which is a combination of a global and a local kernel. Works also for RL problems.
• Active Contextual Entropy Search: Apply to multi-task robotic behavior learning (policy search framework; e.g., adapt hyper-parameters of movement primitives). Look at contextual entropy search; allows for actively selecting contexts via active learning.
• Designing Engaging Games Using Bayesian Optimization: Find the most engaging game designs. Example: Flappy Bird game. Measure engagement (voluntary time on task)
• Adaptive Bayesian Optimization for Online Portfolio Selection: time-varying problem. Solve several related optimization problems.
• Optimization as Estimation with Gaussian Processes in Bandit Settings: There are hyper-parameters to be set in BO (e.g., exploration parameter). Estimate the posterior of $\arg\max f(x)$, similar to entropy search. Adaptively set the BO-hyper-parameters. Generalizes PI and UCB. Tighter regret bounds than GP-UCB.
• Batch Bayesian Optimization via Local Penalization: BO where evaluations can be done in parallel. Designing batches optimally is computationally demanding. Use local penalization (good heuristic using Lipschitz continuity) to determine exploration/exploitation within batch.
• GLASSES: Relieving the Myopia of Bayesian Optimization: Look-ahead through stochastic simulation and expected-loss search gives an acquisition function that takes tens of steps ahead into account. Jointly model the epistemic uncertainty about the future steps. Side-effect: very explorative if many evaluations remain.

### Bayesian Optimization and Embedded Learning Systems (Jeff Schneider)

Jeff started off with a history of autonomous driving at CMU, from ALVINN (late 80s) via Crusher and the DARPA Urban Challenge (2007) to Caterpillar autonomous trucks and Uber. Nowadays, Uber is collecting data using their cars (which look similar to the Google Maps cars), but data from those cameras is tricky (cluttered, difficult scenes). There is a lot of room for machine learning, including safety, efficiency, active learning and optimization in embedded systems, such as an autonomous car.

An issue with BO is that it normally only works in low dimensions (see Srinivas et al., 2010), i.e., we have to turn the problem into a lower-dimensional search space. There are two challenges: statistical and computational. Jeff gave a brief overview of related work in this area.

They decided to use an additive model of the form

$f(x) = \sum_k f_k(x^{(k)})$

where the $f_k$ operate on disjoint subspaces of the overall search space. One can then optimize the components separately/independently; if the components overlapped, this would no longer work. Jeff gave some theoretical results and mentioned several open directions:

• Generalization to other acquisition functions (not only UCB, EI)
• How to choose the number of components
• Non-axis-aligned models
• Multi-fidelity optimization
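The payoff of the additive decomposition with disjoint subspaces can be shown with a toy objective, where one 2-D maximization splits into two independent 1-D maximizations (the functions and grid are illustrative):

```python
import numpy as np

# Toy additive objective f(x) = f1(x1) + f2(x2) over disjoint coordinates
f1 = lambda x1: -(x1 - 0.3) ** 2
f2 = lambda x2: -(x2 - 0.7) ** 2

grid = np.linspace(0, 1, 101)
x1_best = grid[np.argmax(f1(grid))]  # maximize component 1 alone
x2_best = grid[np.argmax(f2(grid))]  # maximize component 2 alone
# The joint maximizer of f is just the concatenation of component optima
x_best = np.array([x1_best, x2_best])
```

With $k$ components of dimension $d/k$ each, a grid of resolution $m$ costs $k \cdot m^{d/k}$ evaluations instead of $m^d$, which is the statistical/computational win behind the additive model.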

### Applications of Bayesian Optimization to Systems (Marc Deisenroth)

The focus of the talk was the application of Bayesian optimization to various systems. The beginning of the talk was on robot locomotion, highlighting the usefulness of Bayesian optimization, which sped up learning controller parameters from weeks (manual search) to hours using BO (Calandra et al., 2014-2015).

In the second part, the talk focused on optimizing hyper-parameters of simulators of real systems to make them more realistic. Here, we were looking at a bioprocess simulator for a micro-algae metabolism with 12 free parameters. A challenge with BO in these settings is that experiments can be conducted with little effort (sub-second), leading to large data sets. A first attempt to solve this by using additive GPs (Kandasamy, 2015) was not successful, and the Dimension Scheduling Algorithm (Ulmasov et al., 2015) was introduced, which samples parameter dimensions randomly without imposing a direct-sum decomposition of the parameter space. The number of data points per GP (one GP per subset of dimensions) is low and the search problem is relatively simple, which leads to a tractable optimization problem. In the experiments, the DSA found a good solution quickly, and the computational speedup compared to standard BO was substantial (20% of the computational resources without any parallelization).
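A very rough sketch of the dimension-scheduling idea is below; plain random search stands in for the per-subspace BO step, and the actual DSA details (its per-subset GPs and scheduling distribution) are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 12                                  # full parameter dimension
f = lambda x: -np.sum((x - 0.5) ** 2)   # toy objective to maximize

x = rng.uniform(0, 1, D)                # incumbent parameter vector
f_start = f(x)
for _ in range(200):
    # Schedule a random subset of dimensions for this iteration
    dims = rng.choice(D, size=3, replace=False)
    cand = x.copy()
    # Search only over the scheduled subspace, keeping the rest fixed
    cand[dims] = rng.uniform(0, 1, size=3)
    if f(cand) > f(x):                  # keep improvements only
        x = cand
```

Each iteration only ever models and searches a 3-D subspace, which is what keeps the per-step cost low even though the full problem is 12-dimensional.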

### Information-based Methods for Black-box (Bayesian) Optimization (Matthew Hoffman)

Matt started by making clear that Bayesian optimization is optimization with Bayesian tools, but it is primarily a black-box optimization problem. Two questions:

• What is my model?
• What is my exploration strategy (given the model)?

Exploration (in this talk) may be myopic, but with respect to something that is close to a value function (see RL), in which case myopic is a good choice.

The model chosen in this talk is a Gaussian process, but independent of this any model in BO must be able to

• Make predictions
• Maintain a measure of uncertainty

There are various exploration strategies, and they are usually encoded by means of the acquisition function. Matt focused on the entropy of the latent maximizer $x_*\in\arg\max_x f(x)$. The corresponding acquisition function is (Hennig & Schuler, 2013)

$\alpha(x) = H[x_*|D] - E_y[H[x_*|D\cup\{(x,y)\}]|D]$

This last expectation is tricky to compute. But the expression above is the mutual information, which allows us to flip the arguments around (symmetry), which leads to

$\alpha(x) = H[y|D,x] - E_{x_*}[H[y|D,x,x_*]|D]$

where the last expectation is relatively easy to compute using Monte Carlo. To compute the acquisition function, we just have to sample maximizers $x_*^i$ and approximate $p(y|x_*^i)$ by a Gaussian. Sampling is done via Thompson sampling; the conditional is approximated using EP (Minka, 2001).
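Schematically, the Monte Carlo estimate of this acquisition boils down to differences of Gaussian entropies; the conditioned standard deviations below are placeholders for what the EP approximation would actually produce:

```python
import numpy as np

def gauss_entropy(sigma):
    # Differential entropy of a 1-D Gaussian with standard deviation sigma
    return 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)

# Predictive std of y at a candidate x (placeholder value)
sigma_pred = 0.5
# Stds of y after conditioning on each sampled maximizer x_*^i
# (placeholders for the EP-approximated conditionals)
sigma_cond = np.array([0.3, 0.25, 0.35])

# PES acquisition: predictive entropy minus average conditioned entropy
alpha = gauss_entropy(sigma_pred) - np.mean(gauss_entropy(sigma_cond))
```

Conditioning on a maximizer can only reduce the predictive entropy, so the estimate is a non-negative mutual-information value; candidates where knowing $x_*$ changes the prediction most score highest.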

Matt gave a bunch of examples that highlight that this “trick” of using the symmetry of the mutual information is actually helpful.

This idea can be extended to constrained black-box optimization, i.e., where the constraints themselves can only be evaluated by querying the system. Each black-box constraint is modeled by a GP. They used this for tuning a neural network constrained to make fast predictions, and for tuning Hamiltonian MCMC subject to constraints on the convergence diagnostics.

Interesting ideas of applying this to model search within the Automatic Statistician.

Matt was also discussing some problems of predictive entropy search:

• Non-conjugate likelihoods
• Disjoint input spaces
• Slightly incorrect marginalization of the hyper-parameters

Output predictive entropy search addresses some of these problems.

At the end of the talk, Matt advertised pybo, the BO toolbox Bobak and he developed.

### Some questions that came up during the workshop and panel discussion

• Stopping rules?
• What happens if the model is wrong?
• Sensitivity analysis?
• Is it possible to integrate other optimization techniques (e.g., swarm optimization) into BO?
• Should we focus on model learning or exploration strategies (acquisition functions)?
• How do we scale BO up in terms of number of data points and number of parameters (if we use GPs)?
• Is the Gaussian process the right “default” model?
• What are the properties of the model that we really need?
• Should we incorporate prior information from experts if this is available?
• What are promising directions for learning with lots of parameters?

It turns out that many of these questions were already (at least partially) addressed by David MacKay in his papers from the 90s. The “Query by Committee” paper by Seung, Opper and Sompolinsky (1992) should also be mentioned in this context. It is definitely worthwhile having a look at these papers.