Kernel Interpolation for Scalable Structured Gaussian Processes

Posted: 2015-07-20 in paper of the day, research

Andrew Gordon Wilson and Hannes Nickisch:
Kernel Interpolation for Scalable Structured Gaussian Processes
ICML 2015

This paper was clearly one of my highlights at ICML and falls into the category of large-scale kernel machines, one of the trends at this year's conference. Wilson and Nickisch combine the advantages of inducing-point and structure-exploiting (e.g., Kronecker/Toeplitz) approaches.

The key idea behind structured kernel interpolation (SKI) is to (a) view current inducing-point approaches as a global kernel interpolation, and (b) use local interpolation strategies to approximate the cross-covariance between the inducing points and the “real” training points. This way, the inducing points can be placed on a grid (to exploit Kronecker/Toeplitz structure) without requiring the training data themselves to lie on a grid. The framework also lets us say something about how the quality of an inducing-point approach depends on the number of inducing (= interpolation) points, the interpolation strategy, and the GP kernel.

Let us have a look at how the kernel-interpolation approach is derived: in sparse GPs based on inducing inputs, we often compute expressions like

K_{XX}\approx K_{XU}K_{UU}^{-1}K_{UX}

to approximate the kernel matrix. The cross-covariance K_{XU} still depends on the number of data points, i.e., computing this n x m matrix is expensive.
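To make this concrete, here is a minimal NumPy sketch of the standard inducing-point approximation; the RBF kernel, the random inputs, and the inducing-input locations are my own illustration, not taken from the paper.

import numpy as np

# Hypothetical RBF kernel, only for illustration.
def rbf(A, B, lengthscale=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))    # n = 500 training inputs
U = np.linspace(0, 10, 30)[:, None]      # m = 30 inducing inputs

K_XU = rbf(X, U)                                   # n x m, still O(nm) to compute
K_UU = rbf(U, U) + 1e-8 * np.eye(len(U))           # jitter for numerical stability
K_approx = K_XU @ np.linalg.solve(K_UU, K_XU.T)    # K_XX ≈ K_XU K_UU^{-1} K_UX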

Wilson and Nickisch propose an interpolation-based approach to approximate K_{XX}. As an illustration, assume that a data point x_i lies “between” two inducing inputs u_a and u_b. This motivates approximating the cross-covariance by a convex combination, k(x_i, u_j) \approx w_i k(u_j, u_a) + (1-w_i) k(u_j, u_b), which no longer requires evaluating the kernel at x_i explicitly. This leads to the central approximation in this paper:

K_{XU} \approx W K_{UU},

where W is a matrix of interpolation weights determined by the positions of the training points x_i relative to the inducing inputs u_j. Note that this matrix can be very sparse: with local interpolation, each row of W has only a few non-zero entries. With this approach we arrive at the alternative approximation of the kernel matrix:

K_{XX} \approx W K_{UU} W^T.
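As a rough sketch of how such a sparse W could be built, here is the simplest possible case: 1D linear interpolation onto a regular grid of inducing inputs. The paper itself uses local cubic interpolation, so this only illustrates the idea; it reuses the variables from the sketch above.

from scipy import sparse

u = U.ravel()                      # regular 1D grid of inducing inputs
h = u[1] - u[0]                    # grid spacing
x = X.ravel()
idx = np.clip(np.floor((x - u[0]) / h).astype(int), 0, len(u) - 2)
frac = (x - u[idx]) / h            # where x_i falls between u[idx] and u[idx + 1]

rows = np.repeat(np.arange(len(x)), 2)
cols = np.stack([idx, idx + 1], axis=1).ravel()
vals = np.stack([1.0 - frac, frac], axis=1).ravel()
W = sparse.csr_matrix((vals, (rows, cols)), shape=(len(x), len(u)))  # 2 non-zeros per row

K_ski = W @ (W @ K_UU).T                 # K_XX ≈ W K_UU W^T (K_UU is symmetric)
err = np.abs(K_ski - rbf(X, X)).max()    # sanity check against the exact kernel matrix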

This approximation can be combined with structure-exploiting approaches (e.g., Toeplitz, Kronecker) by placing the inducing inputs on a regular grid.
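To see why structure helps: with a stationary kernel and inducing inputs on a regular 1D grid, K_{UU} is a symmetric Toeplitz matrix, and matrix-vector products with it can be computed in O(m log m) via circulant embedding and the FFT. This is a standard technique, not code from the paper; a minimal sketch:

def toeplitz_matvec(c, v):
    # Multiply the symmetric Toeplitz matrix with first column c by the
    # vector v, using circulant embedding and the FFT (O(m log m)).
    m = len(c)
    circ = np.concatenate([c, c[-2:0:-1]])        # first column of the circulant embedding
    v_pad = np.concatenate([v, np.zeros(m - 2)])
    return np.fft.ifft(np.fft.fft(circ) * np.fft.fft(v_pad)).real[:m]

# First column of K_UU for an RBF kernel on a regular 1D grid of inducing inputs.
grid = np.linspace(0, 10, 1000)
col = np.exp(-0.5 * (grid - grid[0]) ** 2)
v = rng.standard_normal(len(grid))
out = toeplitz_matvec(col, v)      # same result as the dense K_UU @ v, without forming K_UU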

Even without structure-exploitation, the computational complexity is O(n + m^2), an improvement over the typical O(nm^2) of sparse GP methods based on inducing points.
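The complexity is easiest to see through matrix-vector products, which are the key primitive for iterative (e.g., conjugate gradient) GP inference. Reusing the variables from the sketches above, a product with the approximate covariance W K_{UU} W^T + \sigma^2 I looks like this, without ever forming an n x n matrix:

sigma2 = 0.1                       # hypothetical noise variance, just for illustration
v = rng.standard_normal(len(X))

tmp = W.T @ v                      # O(n): W has only a few non-zeros per row
tmp = K_UU @ tmp                   # O(m^2) dense, or O(m log m) with Toeplitz structure
mv = W @ tmp + sigma2 * v          # O(n)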

Kernel interpolation is an interesting take on sparse GPs. It seems that we currently still depend on some sort of “distance” to compute the weight matrix W, i.e., this approach might not directly work with the full breadth of kernels that are available. However, it might be possible to learn an inducing-input representation that addresses this problem; and the inducing inputs could potentially lie in a significantly lower-dimensional space than the original inputs. Let’s see what follow-ups will come.
