using FrankWolfe
using Plots
using LinearAlgebra
using Random
using KernelHerding
include(joinpath(dirname(pathof(FrankWolfe)), "../examples/plot_utils.jl"))

Kernel herding: Frank-Wolfe algorithms in an infinite-dimensional setting
In this example, we illustrate how the Frank-Wolfe algorithm can be applied to infinite-dimensional kernel herding problems. First, we present a quick primer on kernel herding.
Kernel herding: a primer
Kernel herding is known to be equivalent to solving a quadratic optimization problem in a Reproducing Kernel Hilbert Space (RKHS) with the Frank-Wolfe algorithm (Bach et al.). Here, we explain kernel herding following the presentation of Wirth et al..
Let $\mathcal{Y} \subseteq \mathbb{R}$ be an observation space, $\mathcal{H}$ a RKHS with inner product $\langle \cdot, \cdot \rangle_\mathcal{H}$, and $\Phi \colon \mathcal{Y} \to \mathcal{H}$ a feature map such that any element $x\in \mathcal{H}$ has an associated real function defined via
\[x(y) = \langle x, \Phi(y)\rangle_\mathcal{H}\]
for $y\in \mathcal{Y}$. The feature map $\Phi$ has an associated positive definite kernel
\[k \colon (y, z) \mapsto k(y, z) := \langle \Phi(y), \Phi(z) \rangle_\mathcal{H}\]
for $y,z\in \mathcal{Y}$. Here, the feasible region is the marginal polytope
\[\mathcal{C} : = \text{conv}\left(\lbrace \Phi(y) \mid y \in \mathcal{Y} \rbrace\right) \subseteq \mathcal{H}.\]
We consider a probability distribution $\rho(y)$ over $\mathcal{Y}$ with associated mean element
\[\mu(z) : = \mathbb{E}_{\rho(y)} \Phi(y)(z) = \int_{\mathcal{Y}} k(z, y) \rho(y) dy \in \mathcal{C},\]
where $\mu \in \mathcal{C}$ due to the support of $\rho(y)$ being in $\mathcal{Y}$. Then, kernel herding is equivalent to solving the minimization problem
\[\min_{x\in \mathcal{C}} f(x), \qquad \qquad \text{(OPT-KH)}\]
where $f(x) := \frac{1}{2} \left\|x - \mu \right\|_\mathcal{H}^2$, with the Frank-Wolfe algorithm and open loop step-size rule $\eta_t = \frac{1}{t + 1}$. The well-definedness of kernel herding is guaranteed if there exists a constant $R > 0$ such that $\|\Phi(y)\|_\mathcal{H} = R$ for all $y\in \mathcal{Y}$. In that case, all extreme points of $\mathcal{C}$ are of the form $\Phi(y)$ for $y\in \mathcal{Y}$. Thus, iterates constructed with the Frank-Wolfe algorithm are convex combinations of the form $x_t = \sum_{i=1}^t w_i \Phi(y_i)$, where $w =(w_1, \ldots, w_t)^\intercal \in \mathbb{R}^t$ is a weight vector such that $w_i \geq 0$ for all $i \in \{1, \ldots, t\}$ and $\sum_{i=1}^t w_i = 1$. Observe that the iterate $x_t$ is the mean element of the associated empirical distribution $\tilde{\rho}_t(y)$ over $\mathcal{Y}$, that is,
\[\tilde{\mu}_t(z) = \mathbb{E}_{\tilde{\rho}_t(y)}\Phi(y)(z) = \sum_{i=1}^tw_i \Phi(y_i)(z) = x_t(z).\]
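To make the convex-combination structure concrete, note that the open-loop rule $\eta_t = \frac{1}{t+1}$ produces uniform weights $w_i = \frac{1}{t}$, so $x_t$ is exactly the empirical mean of the herded atoms. A minimal sketch of the weight recursion in plain Julia (independent of the packages used in this example):

```julia
# Frank-Wolfe weight update under the open-loop step-size η_t = 1/(t+1):
# all existing weights shrink by (1 - η_t) and the new atom receives η_t.
function herding_weights(T::Int)
    w = [1.0]                    # x_1 = Φ(y_1) carries all the weight
    for t in 1:(T - 1)
        η = 1 / (t + 1)
        w = [(1 - η) .* w; η]    # x_{t+1} = (1 - η_t) x_t + η_t Φ(y_{t+1})
    end
    return w
end

herding_weights(5)               # five equal weights of 1/5
```

Starting from $w = (1)$, each update rescales the old weights by $\frac{t}{t+1}$, so uniformity is preserved by induction.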
Then, Bach et al. showed that
\[\sup_{x\in \mathcal{H}, \|x\|_\mathcal{H} = 1} \lvert \mathbb{E}_{\rho (y)} [x(y)] - \mathbb{E}_{\tilde{\rho}_t(y)} [x(y)] \rvert = \|\mu - \tilde{\mu}_t\|_\mathcal{H}.\]
Thus, with kernel herding, by finding a good bound on $\|\mu - \tilde{\mu}_t\|_\mathcal{H}$, we can bound the error when computing the expectation of $x\in \mathcal{H}$ with $\|x\|_\mathcal{H} = 1$.
Infinite-dimensional kernel herding
Now that we have introduced the general kernel herding setting, we focus on a specific kernel studied in Wahba and later in Bach et al. and Wirth et al. that maps into a Hilbert space whose ambient dimension is infinite. Let $\mathcal{Y} = [0, 1]$ and
\[\mathcal{H}:= \left\lbrace x \colon [0, 1] \to \mathbb{R} \mid x(y) = \sum_{j = 1}^\infty (a_j \cos(2\pi j y) + b_j \sin(2 \pi j y)), x'(y) \in L^2([0,1]), \text{ and } a_j,b_j\in\mathbb{R}\right\rbrace.\]
For $w, x \in \mathcal{H}$,
\[\langle w, x \rangle_\mathcal{H} := \int_{[0,1]} w'(y)x'(y)dy\]
defines an inner product. Thus, $(\mathcal{H}, \langle \cdot, \cdot \rangle_{\mathcal{H}})$ is a Hilbert space. Wirth et al. showed that $\mathcal{H}$ is a Reproducing Kernel Hilbert Space (RKHS) with associated kernel
\[k(y, z) = \frac{1}{2} B_2(| y - z |),\]
where $y,z\in [0, 1]$ and $B_2(y) = y^2 - y + \frac{1}{6}$ is the Bernoulli polynomial of degree 2.
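As a quick sanity check in plain Julia (not part of the KernelHerding API), the kernel is symmetric and has the constant diagonal $k(y, y) = \frac{1}{2} B_2(0) = \frac{1}{12}$, so $\|\Phi(y)\|_\mathcal{H} = \frac{1}{\sqrt{12}}$ for all $y \in [0, 1]$, which is exactly the constant-norm condition required for well-definedness:

```julia
# Wahba kernel on [0, 1]: k(y, z) = ½ B₂(|y - z|), with B₂ the
# degree-2 Bernoulli polynomial.
B₂(y) = y^2 - y + 1 / 6
k(y, z) = B₂(abs(y - z)) / 2

k(0.3, 0.8) == k(0.8, 0.3)   # symmetry
k(0.25, 0.25) ≈ 1 / 12       # constant diagonal, i.e. ‖Φ(y)‖²_H = 1/12
```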
Set-up
Below, we compare different Frank-Wolfe algorithms for kernel herding in the Hilbert space $\mathcal{H}$: the Frank-Wolfe algorithm with open loop step-size rule $\eta_t = \frac{2}{t+2}$ (FW-OL), the Frank-Wolfe algorithm with short-step (FW-SS), and the Blended Pairwise Frank-Wolfe algorithm with short-step (BPFW-SS). We do not use line search because it is equivalent to short-step for the squared loss used in kernel herding.
The linear minimization oracle (LMO) for this kernel herding problem is implemented via exhaustive search over $\mathcal{Y} = [0, 1]$.
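Such an oracle can be sketched as a grid search (the name `exhaustive_lmo` and the objective callback are illustrative, not the KernelHerding implementation): the LMO returns the grid point $y$ minimizing the linearization $\langle \nabla f(x_t), \Phi(y)\rangle_\mathcal{H}$.

```julia
# Hypothetical exhaustive-search LMO: evaluate the linearized objective
# y ↦ ⟨∇f(x_t), Φ(y)⟩ on a uniform grid over [0, 1] and return the best y.
function exhaustive_lmo(linearized_objective, n_points::Int)
    grid = range(0, 1; length=n_points)
    return argmin(linearized_objective, grid)  # requires Julia ≥ 1.7
end

# Toy usage with a stand-in objective: the minimizer of (y - 0.3)² is ≈ 0.3.
exhaustive_lmo(y -> (y - 0.3)^2, 301)
```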
max_iterations = 200
max_iterations_lmo = 1.5 * max_iterations
lmo = MarginalPolytopeWahba(max_iterations_lmo)

Uniform distribution
First, we consider the uniform distribution $\rho = 1$, which results in the mean element being zero, that is, $\mu = 0$.
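A short numerical check in plain Julia (independent of the package types used here) that the mean element indeed vanishes: $\mu(z) = \int_0^1 k(z, y)\,dy = 0$ for every $z$, because $B_2$ integrates to zero over $[0, 1]$.

```julia
# Riemann-sum approximation of μ(z) = ∫₀¹ k(z, y) dy for the Wahba kernel.
B₂(y) = y^2 - y + 1 / 6
k(y, z) = B₂(abs(y - z)) / 2
μ(z; n=10_000) = sum(k(z, y) for y in range(0, 1; length=n + 1)[1:end-1]) / n

μ(0.37)   # ≈ 0, up to quadrature error, for any z ∈ [0, 1]
```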
mu = ZeroMeanElement()
iterate = KernelHerdingIterate([1.0], [0.0])
gradient = KernelHerdingGradient(iterate, mu)
f, grad = create_loss_function_gradient(mu)
FW_OL = FrankWolfe.frank_wolfe(f, grad, lmo, iterate, line_search=FrankWolfe.Agnostic(), verbose=true, gradient=gradient, memory_mode=FrankWolfe.OutplaceEmphasis(), max_iteration=max_iterations, trajectory=true)
FW_SS = FrankWolfe.frank_wolfe(f, grad, lmo, iterate, line_search=FrankWolfe.Shortstep(1), verbose=true, gradient=gradient, memory_mode=FrankWolfe.OutplaceEmphasis(), max_iteration=max_iterations, trajectory=true)
BPFW_SS = FrankWolfe.blended_pairwise_conditional_gradient(f, grad, lmo, iterate, line_search=FrankWolfe.Shortstep(1), verbose=true, gradient=gradient, memory_mode=FrankWolfe.OutplaceEmphasis(), max_iteration=max_iterations, trajectory=true)

We plot the results.
data = [FW_OL[end], FW_SS[end], BPFW_SS[end - 1]]
labels = ["FW-OL", "FW-SS", "BPFW-SS"]
plot_trajectories(data, labels, xscalelog=true)
Observe that FW-OL converges faster than FW-SS and BPFW-SS. Wirth et al. proved an accelerated convergence rate of $\mathcal{O}(1/t^2)$ for FW-OL with step-size rule $\eta_t = \frac{1}{t+1}$, but it remains an open problem to prove that FW-SS and BPFW-SS do not admit this accelerated rate.
Non-uniform distribution
Second, we consider a non-uniform distribution
\[\rho(y) \sim \left(\sum_{i = 1}^n \left(a_i \cos(2\pi i y) + b_i \sin (2\pi i y)\right) \right)^2,\]
where $n\in \mathbb{N}$, $a_i,b_i \in \mathbb{R}$ for all $i \in \{1, \ldots, n\}$, and $a_i, b_i$ are chosen such that
\[\int_{\mathcal{Y}} \rho(y) dy = 1.\]
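The normalization step can be sketched in plain Julia (the helper `normalize_density` is illustrative; the example itself uses the package's `construct_rho`): divide the squared trigonometric polynomial by a quadrature estimate of its integral.

```julia
# Normalize g(y) = (Σᵢ aᵢ cos(2πiy) + bᵢ sin(2πiy))² into a density on [0, 1].
function normalize_density(a, b; n=10_000)
    g(y) = sum(a[i] * cos(2π * i * y) + b[i] * sin(2π * i * y) for i in eachindex(a))^2
    ys = range(0, 1; length=n + 1)[1:end-1]
    Z = sum(g, ys) / n          # Riemann-sum estimate of ∫₀¹ g(y) dy
    return y -> g(y) / Z
end

ρ = normalize_density([0.4, 0.1], [0.2, 0.3])  # now ∫₀¹ ρ(y) dy ≈ 1
```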
To obtain such a $\rho$, we start with an arbitrary tuple of vectors:
rho = (rand(Float64, (1, 5)), rand(Float64, (1, 8)))

We then normalize the vectors to obtain a $\rho$ that is indeed a distribution.
normalized_rho = construct_rho(rho)

We then run the experiments.
mu = mu_from_rho(normalized_rho)
iterate = KernelHerdingIterate([1.0], [0.0])
gradient = KernelHerdingGradient(iterate, mu)
f, grad = create_loss_function_gradient(mu)
FW_OL = FrankWolfe.frank_wolfe(f, grad, lmo, iterate, line_search=FrankWolfe.Agnostic(), verbose=true, gradient=gradient, memory_mode=FrankWolfe.OutplaceEmphasis(), max_iteration=max_iterations, trajectory=true)
FW_SS = FrankWolfe.frank_wolfe(f, grad, lmo, iterate, line_search=FrankWolfe.Shortstep(1), verbose=true, gradient=gradient, memory_mode=FrankWolfe.OutplaceEmphasis(), max_iteration=max_iterations, trajectory=true)
BPFW_SS = FrankWolfe.blended_pairwise_conditional_gradient(f, grad, lmo, iterate, line_search=FrankWolfe.Shortstep(1), verbose=true, gradient=gradient, memory_mode=FrankWolfe.OutplaceEmphasis(), max_iteration=max_iterations, trajectory=true)

We plot the results.
data = [FW_OL[end], FW_SS[end], BPFW_SS[end - 1]]
labels = ["FW-OL", "FW-SS", "BPFW-SS"]
plot_trajectories(data, labels, xscalelog=true)
Observe that FW-OL converges with a rate of $\mathcal{O}(1/t^2)$, which is faster than the $\mathcal{O}(1/t)$ convergence rate of FW-SS and BPFW-SS. To the best of our knowledge, an explanation for this acceleration is yet to be given.
Conclusion
We presented two experiments which show how to use Frank-Wolfe algorithms to solve optimization problems in infinite-dimensional Hilbert spaces.
References
Bach, F., Lacoste-Julien, S. and Obozinski, G., 2012. On the equivalence between herding and conditional gradient algorithms. In Proceedings of the International Conference on Machine Learning (ICML 2012).
Wahba, G., 1990. Spline Models for Observational Data. Society for Industrial and Applied Mathematics.
Wirth, E., Kerdreux, T. and Pokutta, S., 2022. Acceleration of Frank-Wolfe Algorithms with Open-Loop Step-Sizes. arXiv preprint arXiv:2205.12838.
This page was generated using Literate.jl.