## Kernel Interpolation for Scalable Structured Gaussian Processes

Posted: 2015-07-20 in paper of the day, research

Andrew Gordon Wilson and Hannes Nickisch:
Kernel Interpolation for Scalable Structured Gaussian Processes
ICML 2015

This paper was clearly one of my highlights at ICML and falls into the category of large-scale kernel machines, one of this year’s trends at the conference. Wilson and Nickisch combine the advantages of inducing-point and structure-exploiting (e.g., Kronecker/Toeplitz) approaches.

The key idea behind structured kernel interpolation is to (a) interpret current inducing-point approaches as a global kernel interpolation, and (b) use local interpolation strategies to approximate the cross-covariance between inducing inputs and the “real” training inputs, which removes the requirement that the training data lie on a grid (the inducing inputs themselves can still be placed on a grid to exploit algebraic structure). This perspective also lets us analyze how the quality of an inducing-point approximation depends on the number of inducing (= interpolation) points, the interpolation strategy, and the GP kernel.

Let us have a look at how the kernel-interpolation approach is derived. In sparse GPs based on inducing inputs, we often compute expressions like

$K_{XX}\approx K_{XU}K_{UU}^{-1}K_{UX}$

to approximate the kernel matrix. The cross-covariance $K_{XU}$ still scales with the number of training points, i.e., computing this quantity remains expensive.
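As a concrete toy illustration of this style of approximation; the kernel choice, inputs, and sizes below are my own illustrative assumptions, not from the paper:

```python
import numpy as np

def rbf(A, B, ell=1.0):
    """Squared-exponential kernel matrix between 1-D input arrays A and B."""
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-0.5 * d2 / ell**2)

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=50)        # n = 50 training inputs
U = np.linspace(0, 10, 10)             # m = 10 inducing inputs

K_XU = rbf(X, U)                       # n x m cross-covariance (still O(nm) to form)
K_UU = rbf(U, U) + 1e-8 * np.eye(10)   # m x m, jittered for numerical stability
K_approx = K_XU @ np.linalg.solve(K_UU, K_XU.T)   # approximates K_XX
```

Note that forming $K_{XU}$ already touches every training point, which is the cost Wilson and Nickisch target with interpolation.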

Wilson and Nickisch propose an interpolation-based approach to approximate $K_{XX}$. As an illustration, let’s assume that a data point $x_i$ lies “between” inducing inputs $u_a, u_b$. This motivates approximating the cross-covariance by a convex combination $k(x_i, u_j) \approx w_i k(u_j, u_a) + (1-w_i) k(u_j, u_b)$, which no longer requires evaluating the kernel at $x_i$ explicitly. This leads to the central approximation in the paper:

$K_{XU} \approx W K_{UU},$

where $W$ is a matrix of interpolation weights determined by the distances between the training points $x_i$ and the inducing inputs $u_j$. Note that $W$ is extremely sparse: with local linear interpolation, each row contains only two nonzero entries (four with local cubic interpolation). With this approach, we arrive at an alternative approximation of the kernel matrix:

$K_{XX} \approx W K_{UU} W^T$.

This approximation can be combined with structure-exploiting approaches (e.g., Toeplitz, Kronecker).
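For instance, on a regular 1-D grid with a stationary kernel, $K_{UU}$ is Toeplitz and therefore fully determined by a single column; Toeplitz matrix-vector products can then be computed in $O(m \log m)$ via the FFT. A small sketch verifying the structure, with an illustrative RBF kernel:

```python
import numpy as np
from scipy.linalg import toeplitz

U = np.linspace(0.0, 10.0, 21)   # inducing inputs on a regular 1-D grid
K_UU = np.exp(-0.5 * (U[:, None] - U[None, :]) ** 2 / 2.0**2)  # stationary kernel

# For a stationary kernel on a regular grid, K_UU is Toeplitz:
# it is fully determined by its first column.
assert np.allclose(K_UU, toeplitz(K_UU[:, 0]))
```

In higher dimensions, a product kernel on a Cartesian grid analogously gives $K_{UU}$ Kronecker structure.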

Even without structure exploitation, a matrix-vector product with this approximate kernel matrix costs only $O(n + m^2)$, an improvement over the typical $O(nm^2)$ of sparse GP methods based on inducing points.
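To see where the $O(n + m^2)$ comes from: inference can be run with iterative solvers that only need matrix-vector products with $W K_{UU} W^T$, and evaluating such a product right to left never forms the $n \times n$ matrix. A sketch with an illustrative sparse $W$ (random weights here, two nonzeros per row as in local linear interpolation):

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
n, m = 1000, 50

# Illustrative sparse interpolation matrix: two nonzeros per row,
# as in local linear interpolation (weights random, for demonstration only).
left = rng.integers(0, m - 1, size=n)
w = rng.uniform(size=n)
rows = np.repeat(np.arange(n), 2)
cols = np.stack([left, left + 1], axis=1).ravel()
vals = np.stack([w, 1.0 - w], axis=1).ravel()
W = csr_matrix((vals, (rows, cols)), shape=(n, m))

U = np.linspace(0.0, 1.0, m)
K_UU = np.exp(-0.5 * (U[:, None] - U[None, :]) ** 2)

v = rng.standard_normal(n)

# Evaluate (W K_UU W^T) v right to left; the n x n matrix is never formed.
t = W.T @ v      # O(n): two nonzeros per row of W
t = K_UU @ t     # O(m^2): dense m x m multiply
Kv = W @ t       # O(n)
```

With Toeplitz or Kronecker structure in $K_{UU}$, the middle multiply drops below $O(m^2)$ as well.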

Kernel interpolation is an interesting take on sparse GPs. It seems that, for now, we still depend on some notion of “distance” to compute the weight matrix $W$, i.e., the approach may not directly work with the full breadth of kernels that are available. However, it might be possible to learn an inducing-input representation that addresses this problem; and the inducing inputs may potentially lie in a significantly lower-dimensional space than the original inputs. Let’s see what follow-ups will come.