Alignment Based Kernel Learning with a Continuous Set of Base Kernels
Arash Afkanpour, Csaba Szepesvári, Michael Bowling
Department of Computing Science
University of Alberta
Edmonton, AB T6G 1K7
{afkanpou, szepesva, mbowling}@ualberta.ca

Abstract
The success of kernel-based learning methods depends on the choice of kernel. Recently, kernel learning methods have been proposed that use data to select the most appropriate kernel, usually by combining a set of base kernels. We introduce a new algorithm for kernel learning that combines a continuous set of base kernels, without the common step of discretizing the space of base kernels. We demonstrate that our new method achieves state-of-the-art performance across a variety of real-world datasets. Furthermore, we explicitly demonstrate the importance of combining the right dictionary of kernels, which is problematic for methods based on a finite set of base kernels chosen a priori. Our method is not the first approach to work with continuously parameterized kernels. However, we show that our method requires substantially less computation than previous such approaches, and so is more amenable to multi-dimensional parameterizations of base kernels, which we demonstrate.
1 Introduction

A well-known fact in machine learning is that the choice of features heavily influences the performance of learning methods. Similarly, the performance of a learning method that uses a kernel function is highly dependent on the choice of kernel function. The idea of kernel learning is to use data to select the most appropriate kernel function for the learning task.

In this paper we consider kernel learning in the context of supervised learning. In particular, we consider the problem of learning positive-coefficient linear combinations of base kernels, where the base kernels belong to a parameterized family of kernels, (κ_σ)_{σ∈Σ}. Here Σ is a "continuous" parameter space, i.e., some subset of a Euclidean space. A prime example (and extremely popular choice) is when κ_σ is a Gaussian kernel, where σ can be a single common bandwidth or a vector of bandwidths, one per coordinate. One approach then is to discretize the parameter space Σ and then find an appropriate non-negative linear combination of the resulting set of base kernels, N = {κ_{σ_1}, ..., κ_{σ_p}}. The advantage of this approach is that once the set N is fixed, any of the many efficient methods available in the literature can be used to find the coefficients for combining the base kernels in N (see the papers by Lanckriet et al. 2004; Sonnenburg et al. 2006; Rakotomamonjy et al. 2008; Cortes et al. 2009a; Kloft et al. 2011 and the references therein). One potential drawback of this approach is that it requires an appropriate, a priori choice of N. This might be problematic, e.g., if Σ is contained in a Euclidean space of moderate or large dimension (say, a dimension over 20), since the number of base kernels, p, grows exponentially with dimensionality even for moderate discretization accuracies. Furthermore, independent of the dimensionality of the parameter space, the need to choose the set N independently of the data is at best inconvenient, and selecting an appropriate resolution might be far from trivial. In this paper we explore an alternative method which avoids the need for discretizing the space Σ.

We are not the first to realize that discretizing a continuous parameter space might be troublesome: the method of Argyriou et al. (2005, 2006) can also work with continuously parameterized spaces of kernels. The main issue with this method, however, is that it may get stuck in local optima, since it is based on alternating minimization and the objective function is not jointly convex. Nevertheless, in the initial publications of Argyriou et al. (2005, 2006) this method was empirically found to have excellent and robust performance, showing that, despite the potential difficulties, the idea of avoiding discretizations might have some traction.

Our new method is similar to that of Argyriou et al. (2005, 2006) in that it is still based on local search. However, our local search is used within a boosting, or more precisely, forward-stagewise additive modeling (FSAM) procedure, a method that is known to be quite robust to how its "greedy step" is implemented (Hastie et al., 2001, Section 10.3). Thus, we expect to suffer minimally from issues related to local minima. A second difference to Argyriou et al. (2005, 2006) is that our method belongs to the group of two-stage kernel learning methods. The decision to use a two-stage kernel learning approach was motivated by the recent success of the two-stage method of Cortes et al. (2010).
In fact, our kernel learning method uses the centered kernel alignment metric of Cortes et al. (2010) (derived from the uncentered alignment metric of Cristianini et al. (2002)) in its first stage as the objective function of the FSAM procedure, while in the second stage a standard supervised learning technique is used.

The technical difficulty of implementing FSAM is that one needs to compute the functional gradient of the chosen objective function. We show that in our case this problem is equivalent to solving an optimization problem over σ ∈ Σ with an objective function that is a linear function of the Gram matrix derived from the kernel κ_σ. Because of the nonlinear dependence of this matrix on σ, this is the step where we need to resort to local optimization: this optimization problem is in general non-convex. However, as we shall demonstrate empirically, even if we use local solvers for this optimization step, the algorithm still shows an overall excellent performance as compared to other state-of-the-art methods. This is not completely unexpected: one of the key ideas underlying boosting is that it is designed to be robust even when the individual "greedy" steps are imperfect (cf. Chapter 12 of Bühlmann and van de Geer, 2011). Given the new kernel to be added to the existing dictionary, we give a computationally efficient, closed-form expression that can be used to determine the coefficient of the new kernel to be added to the previous kernels.

The empirical performance of our proposed method is explored in a series of experiments. Our experiments serve multiple purposes. Firstly, we explore the potential advantages, as well as the limitations, of the proposed technique. In particular, we demonstrate that the procedure is indeed reliable (despite the potential difficulty of implementing the greedy step) and that it can be successfully used even when Σ is a subset of a multi-dimensional space. Secondly, we demonstrate that in some cases kernel learning can yield a very large improvement over simpler alternatives, such as combining some fixed dictionary of kernels with uniform weights. Whether this is true is an important issue that is given weight by the fact that it recently became a subject of dispute (Cortes, 2009). Finally, we compare the performance of our method, both from the perspective of its generalization capability and its computational cost, to its natural, state-of-the-art alternatives, such as the two-stage method of Cortes et al. (2010) and the algorithm of Argyriou et al. (2005, 2006). For this, we compared our method on datasets used in previous kernel-learning work. To give further weight to our results, we compare on more datasets than any of the previous papers that proposed new kernel learning methods.

Our experiments demonstrate that our new method is competitive in terms of its generalization performance, while its computational cost is significantly less than that of the competitors that enjoy similarly good generalization performance. In addition, our experiments also revealed an interesting novel insight into the behavior of two-stage methods: we noticed that two-stage methods can "overfit" the performance metric of the first stage. In some problems we observed that our method could find kernels that gave rise to better (test-set) performance on the first-stage metric, while the method's overall performance degraded when compared to using kernel combinations whose performance on the first metric is worse.
The explanation of this is that the metric of the first stage is a surrogate performance measure; thus, just as in the case of choosing a surrogate loss in classification, better performance according to this surrogate metric does not necessarily transfer into better performance in the primary metric, as there is no monotonicity relation between the two metrics. We also show that with proper capacity control, the problem of overfitting the surrogate metric can be overcome. Finally, our experiments show a clear advantage to using kernel learning methods as opposed to combining kernels with uniform weights, although it seems that the advantage mainly comes from the ability of our method to discover the right set of kernels. This conclusion is strengthened by the fact that the closest competitor to our method was found to be the method of Argyriou et al. (2006), which also searches the continuous parameter space, avoiding discretizations. Our conclusion is that the choice of the base dictionary seems to be more important than how the dictionary elements are combined, and that the a priori choice of this dictionary may not be trivial. This is certainly true already when the number of parameters is moderate. Moreover, when the number of parameters is larger, simple discretization methods are infeasible, whereas our method can still produce meaningful dictionaries.

2 Method

The purpose of this section is to describe our new method. Let us start with the introduction of the problem setting and the notation. We consider binary classification problems, where the data D = ((X_1, Y_1), ..., (X_n, Y_n)) is a sequence of independent, identically distributed random variables, with (X_i, Y_i) ∈ ℝ^d × {−1, +1}. For convenience, we introduce two other pairs of random variables, (X, Y), (X′, Y′), which are independent of each other and share the same distribution with (X_i, Y_i). The goal of classifier learning is to find a predictor g : ℝ^d → {−1, +1} such that the predictor's risk, L(g) = P(g(X) ≠ Y), is close to the Bayes risk, inf_g L(g). We will consider a two-stage method, as noted in the introduction. The first stage of our method will pick some kernel k : ℝ^d × ℝ^d → ℝ from some set of kernels K based on D, which is then used in the second stage, using the same data D, to find a good predictor. (One could consider splitting the data, but we see no advantage to doing so. Also, the methods for the second stage are not a focus of this work; the particular methods used in the experiments are described later.)

Consider a parametric family of base kernels, (κ_σ)_{σ∈Σ}. The kernels considered by our method belong to the set

  K = { ∑_{i=1}^r μ_i κ_{σ_i} : r ∈ ℕ, μ_i ≥ 0, σ_i ∈ Σ, i = 1, ..., r },

i.e., we allow non-negative linear combinations of a finite number of base kernels. For example, the base kernel could be a Gaussian kernel, where σ > 0 is its bandwidth: κ_σ(x, x′) = exp(−‖x − x′‖²/σ²), where x, x′ ∈ ℝ^d. However, one could also have a separate bandwidth for each coordinate.

The "ideal" kernel underlying the common distribution of the data is k*(x, x′) = E[Y Y′ | X = x, X′ = x′]. Our new method attempts to find a kernel k ∈ K which is maximally aligned to this ideal kernel, where, following Cortes et al. (2010), the alignment between two kernels k, k̃ is measured by the centered alignment metric (the word metric is used here in its everyday sense, not in its mathematical sense)

  A_c(k, k̃) = ⟨k_c, k̃_c⟩ / (‖k_c‖ ‖k̃_c‖),

where k_c is the kernel underlying k centered in the feature space (similarly for k̃_c), ⟨k, k̃⟩ = E[k(X, X′) k̃(X, X′)] and ‖k‖ = ⟨k, k⟩^{1/2}. A kernel k centered in the feature space is, by definition, the unique kernel k_c such that for any x, x′,

  k_c(x, x′) = ⟨Φ(x) − E[Φ(X)], Φ(x′) − E[Φ(X)]⟩,

where Φ is a feature map underlying k. By considering the centered kernels k_c, k̃_c in the alignment metric, one implicitly matches the mean responses E[k(X, X′)], E[k̃(X, X′)] before considering the alignment between the kernels (thus, centering depends on the distribution of X). An alternative way of stating this is that centering cancels mismatches of the mean responses between the two kernels. When one of the kernels is the ideal kernel, centered alignment effectively standardizes the alignment by cancelling the effect of imbalanced class distributions. For further discussion of the virtues of centered alignment, see the paper by Cortes et al. (2010).
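For concreteness, here is a minimal numpy sketch (ours, not the authors' code) of the Gaussian base-kernel family in both its shared-bandwidth and per-coordinate-bandwidth forms; the function names are our own.

```python
import numpy as np

def gaussian_kernel_matrix(X, Xp, sigma):
    """Gram matrix of kappa_sigma(x, x') = exp(-||x - x'||^2 / sigma^2)
    for a single shared bandwidth sigma > 0."""
    sq = ((X[:, None, :] - Xp[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / sigma ** 2)

def ard_gaussian_kernel_matrix(X, Xp, sigmas):
    """Per-coordinate bandwidths: exp(-sum_i (x_i - x'_i)^2 / sigma_i^2).
    `sigmas` is a vector with one bandwidth per input dimension."""
    diffs = (X[:, None, :] - Xp[None, :, :]) ** 2 / np.asarray(sigmas) ** 2
    return np.exp(-diffs.sum(axis=2))
```

A kernel k = ∑_i μ_i κ_{σ_i} from the set K then corresponds to the weighted sum of such Gram matrices with non-negative weights μ_i.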
A kernel k centered in the feature space, by definition,is the unique kernel k c , such that for any x, x ′ , k c ( x, x ′ ) = h Φ( x ) − E [Φ( X )] , Φ( x ′ ) − E [Φ( X )] i ,where Φ is a feature map underlying k . By considering centered kernels k c , ˜ k c in the alignmentmetric, one implicitly matches the mean responses E [ k ( X, X ′ )] , E [˜ k ( X, X ′ )] before consideringthe alignment between the kernels (thus, centering depends on the distribution of X ). An alterna-tive way of stating this is that centering cancels mismatches of the mean responses between the twokernels. When one of the kernels is the ideal kernel, centered alignment effectively standardizes thealignment by cancelling the effect of imbalanced class distributions. For further discussion of thevirtues of centered alignment, see the paper by Cortes et al. (2010). One could consider splitting the data, but we see no advantage to doing so. Also, the methods for thesecond stage are not a focus of this work and the particular methods used in the experiments are described later. Note that the word metric is used in its everyday sense and not in its mathematical sense. lgorithm 1 Forward stagewise additive modeling for kernel learning with a continuouslyparametrized set of kernels. For the definitions of f , F , F ′ and K : K → R n × n , see the text. Inputs: data D , kernel initialization parameter ε , the number of iterations T , tolerance θ , max-imum stepsize η max > . K ← εI n . for t = 1 to T do P ← F ′ ( K t − ) P ← C n P C n σ ∗ = arg max σ ∈ Σ h P, K ( κ σ ) i F K ′ = C n K ( κ σ ∗ ) C n η ∗ = arg max ≤ η ≤ η max F ( K t − + ηK ′ ) K t ← K t − + η ∗ K ′ if F ( K t ) ≤ F ( K t − ) + θ then terminate end for Since the common distribution underlying the data is unknown, one resorts to empirical approxima-tions to alignment and centering, resulting in the empirical alignment metric, A c ( K, ˜ K ) = h K c , ˜ K c i F k K c k F k ˜ K c k F , where, K = ( k ( X i , X j )) ≤ i,j ≤ n , and ˜ K = (˜ k ( X i , X j )) ≤ i,j ≤ n are the kernel matrices underlying k and ˜ k , and for a kernel matrix, K , K c = C n KC n , where C n is the so-called centering matrixdefined by C n = I n × n − ⊤ /n , I n × n being the n × n identity matrix and = (1 , . . . , ⊤ ∈ R n . The empirical counterpart of maximizing A c ( k, k ∗ ) is to maximize A c ( K, ˆ K ∗ ) , where ˆ K ∗ def = YY T , and Y = ( Y , . . . , Y n ) ⊤ collects the responses into an n -dimensional vector. Here, K is thekernel matrix derived from a kernel k ∈ K . To make this connection clear, we will write K = K ( k ) .Define f : K → R by f ( k ) = A c ( K ( k ) , ˆ K ∗ ) .To find an approximate maximizer of f , we propose a steepest ascent approach to forward stagewiseadditive modeling (FSAM). FSAM (Hastie et al., 2001) is an iterative method for optimizing anobjective function by sequentially adding new basis functions without changing the parameters andcoefficients of the previously added basis functions. In the steepest ascent approach, in iteration t ,we search for the base kernel in ( κ σ ) defining the direction in which the growth rate of f is thelargest, locally in a small neighborhood of the previous candidate k t − : σ ∗ t = arg max σ ∈ Σ lim ε → f ( k t − + ε κ σ ) − f ( k t − ) ε . (1)Once σ ∗ t is found, the algorithm finds the coefficient ≤ η t ≤ η max such that f ( k t − + η t κ σ ∗ t ) is maximized and the candidate is updated using k t = k t − + η t κ σ ∗ t . 
The process stops when the objective function f ceases to increase by an amount larger than θ > 0, or when the number of iterations becomes larger than the predetermined limit T, whichever happens earlier.
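The empirical quantities defined above are straightforward to compute; the following minimal numpy sketch (ours, not the authors' code) evaluates the empirical centered alignment:

```python
import numpy as np

def center(K):
    """Center a kernel matrix: K_c = C_n K C_n with C_n = I - 11^T / n."""
    n = K.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n
    return C @ K @ C

def alignment(K, K_target):
    """Empirical centered alignment A_c(K, K_target) in Frobenius geometry."""
    Kc, Tc = center(K), center(K_target)
    return (Kc * Tc).sum() / (np.linalg.norm(Kc) * np.linalg.norm(Tc))

# The target is the ideal response kernel K* = Y Y^T built from labels in {-1, +1}:
# Y = np.array([...]); K_star = np.outer(Y, Y); alignment(K, K_star)
```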
Proposition 1. The value of σ*_t can be obtained by

  σ*_t = argmax_{σ∈Σ} ⟨K(κ_σ), F′((K(k_{t−1}))_c)⟩_F,  (2)

where for a kernel matrix K,

  F′(K) = [K̂*_c − ‖K‖_F^{−2} ⟨K, K̂*_c⟩_F K] / (‖K‖_F ‖K̂*_c‖_F).  (3)

The proof can be found in the supplementary material. The crux of the proposition is that the directional derivative in (1) can be calculated and gives the expression maximized in (2).
Table 1: The kernel learning methods compared in the experiments.

  Abbr.  Method
  CA     Our new method
  CR     From Argyriou et al. (2005)
  DA     From Cortes et al. (2010)
  D1     ℓ1-norm MKL (Kloft et al., 2011)
  D2     ℓ2-norm MKL (Kloft et al., 2011)
  DU     Uniform weights over kernels

In general, the optimization problem (2) is not convex, and the cost of obtaining a (good approximate) solution is hard to predict. Evidence that, at least in some cases, the function to be optimized is not ill-behaved is presented in Section B.1 of the supplementary material. In our experiments, an approximate solution to (2) is found using numerical methods (in particular, the fmincon function of Matlab, with the interior-point algorithm option). As a final remark on this issue, note that, as is usual in boosting, finding the global optimizer in (2) might not be necessary for achieving good statistical performance.

The other parameter, η_t, is easy to find, since the underlying optimization problem has a closed-form solution:

Proposition 2. The value of η_t is given by η_t = argmax_{η ∈ {0, η*, η_max}} f(k_{t−1} + η κ_{σ*_t}), where η* = max(0, (ad − bc)/(bd − ae)) if bd − ae ≠ 0 and η* = 0 otherwise, with

  a = ⟨K, K̂*_c⟩_F,  b = ⟨K′, K̂*_c⟩_F,  c = ⟨K, K⟩_F,  d = ⟨K, K′⟩_F,  e = ⟨K′, K′⟩_F,

and K = (K(k_{t−1}))_c, K′ = (K(κ_{σ*_t}))_c.
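Putting (2), (3), and Proposition 2 together, a single FSAM iteration can be sketched in a few lines of Python. This is our own illustrative code, not the authors' implementation (the paper uses Matlab's fmincon for the greedy step; scipy's L-BFGS-B stands in for it here), and all function names are ours.

```python
import numpy as np
from scipy.optimize import minimize

def center(K):
    n = K.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n
    return C @ K @ C

def grad_F(K, K_star_c):
    """Equation (3), evaluated at a centered kernel matrix K."""
    nK = np.linalg.norm(K)
    return (K_star_c - (K * K_star_c).sum() / nK**2 * K) / (nK * np.linalg.norm(K_star_c))

def fsam_step(K_prev, Y, kernel_matrix, sigma0, eta_max=1.0):
    """One iteration of Algorithm 1. `kernel_matrix` maps a parameter
    vector sigma to the Gram matrix K(kappa_sigma) on the training set."""
    K_star_c = center(np.outer(Y, Y))
    P = center(grad_F(center(K_prev), K_star_c))
    # Greedy step (2): local maximization of <K(kappa_sigma), P>_F over sigma.
    res = minimize(lambda s: -(kernel_matrix(s) * P).sum(),
                   x0=np.atleast_1d(sigma0), method='L-BFGS-B',
                   bounds=[(1e-6, None)] * np.size(sigma0))
    K_sigma = kernel_matrix(res.x)
    K_c, K_new_c = center(K_prev), center(K_sigma)
    # Closed-form candidate step size from Proposition 2.
    a = (K_c * K_star_c).sum(); b = (K_new_c * K_star_c).sum()
    c = (K_c * K_c).sum(); d = (K_c * K_new_c).sum(); e = (K_new_c * K_new_c).sum()
    denom = b * d - a * e
    eta_star = max(0.0, (a * d - b * c) / denom) if denom != 0 else 0.0
    def f(eta):  # empirical centered alignment of K_prev + eta * K_sigma with K*
        M = center(K_prev + eta * K_sigma)
        return (M * K_star_c).sum() / (np.linalg.norm(M) * np.linalg.norm(K_star_c))
    eta = max((0.0, min(eta_star, eta_max), eta_max), key=f)
    return K_prev + eta * K_sigma, res.x, eta
```

In practice the local search would be restarted from several initial values of σ, much as the experiments below do with fmincon.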
The pseudocode of the full algorithm is presented in Algorithm 1. The algorithm needs the data, the number of iterations (T) and a tolerance (θ) parameter, in addition to a parameter ε used in the initialization phase and η_max. The parameter ε is used in the initialization step to avoid division by zero, and its value has little effect on performance. Note that the cost of computing a kernel matrix, or the inner product of two such matrices, is O(n²). Therefore, the complexity of the algorithm (with a naive implementation) is at least quadratic in the number of samples. The actual cost will be strongly influenced by how many of these kernel-matrix evaluations (or inner-product computations) are needed in (2). In the absence of a better understanding of this, we include actual running times in the experiments, which give a rough indication of the computational limits of the procedure.

3 Experiments

In this section we compare our kernel learning method with several other kernel learning methods on synthetic and real data; see Table 1 for the list of methods. Our method is labeled CA, for Continuous Alignment-based kernel learning. In all of the experiments, we use the following values with CA: T = 50, ε = 10^−…, and θ = 10^−…. The first two methods, i.e., our algorithm and CR (Argyriou et al., 2005), are able to pick kernel parameters from a continuous set, while the rest of the algorithms work with a finite number of base kernels.

In Section 3.1 we use synthetic data to illustrate the potential advantage of methods that work with a continuously parameterized set of kernels and the importance of combining multiple kernels. We also illustrate in a toy example that multi-dimensional kernel parameter search can improve performance. These are followed by the evaluation of the above listed methods on several real datasets in Section 3.2.

3.1 Synthetic data

The purpose of these experiments is mainly to provide empirical proof for the following hypotheses: (H1) the combination of multiple kernels can lead to improved performance as compared to what can be achieved with the best single kernel; (H2) methods that search a continuous parameter space can find good kernels without an a priori discretization; (H3) multi-dimensional kernel parameter search can improve performance when the features are of unequal relevance.

To investigate (H1) and (H2), we generated data points from the interval [−…, …]. The label of each data point is determined by the function y(x) = sign(f(x)), where f(x) = sin(√… x) + sin(√… x) + sin(√… x). Training and validation sets include … data points each, while the test set includes … instances. Figure 1(a) shows the functions f (blue curve) and y (red dots). For this experiment we use Dirichlet kernels of degree one, parameterized with a frequency parameter σ: κ_σ(x, x′) = 1 + 2 cos(σ‖x − x′‖).

In order to investigate (H1), we trained classifiers with a single-frequency kernel, with the frequency taken from the set {√…, √…, √…} (which we thought were good guesses of the single best frequencies). The trained classifiers achieved misclassification error rates of …, …, and …, respectively. Classifiers trained with pairs of these frequencies achieved error rates of …, …, and …, respectively (the kernels were combined using uniform weights). Finally, a classifier that was trained with all three frequencies achieved an error rate of ….

Let us now turn to (H2).
As shown in Figure 1(b), the CA and CR methods both achieved a misclassification error close to what was seen when the three best frequencies were used, showing that they are indeed effective. (We repeated the experiments using Gaussian kernels, with nearly identical results.) Furthermore, Figure 1(c) shows that the discovered frequencies are close to the frequencies used to generate the data. For the sake of illustration, we also tested the methods which require the discretization of the parameter space. We chose ten Dirichlet kernels with σ ∈ {…, …, ..., …}, covering the range of frequencies defining f. As can be seen from Figure 1(b), in this example the chosen discretization accuracy is insufficient. (Further experimentation found that a discretization below … is necessary in this example.) Although it would be easy to increase the discretization accuracy to improve the results of these methods, the point is that if a high resolution is needed already in a single-dimensional problem, then these methods are likely to face serious difficulties in problems where the space of kernels is more complex (e.g., the parameterization is multi-dimensional). Nevertheless, we are not suggesting that the methods which require discretization are universally inferior; we merely wish to point out that an "appropriate discrete kernel set" might not always be available.

Figure 1: (a): The function f(x) = sin(√… x) + sin(√… x) + sin(√… x) used for generating the synthetic data, along with sign(f). (b): Misclassification percentages obtained by each algorithm. (c): The kernel frequencies found by the CA method.

To illustrate (H3), we designed a second set of problems: the instances for the positive class are generated from a d = 50-dimensional Gaussian distribution with covariance matrix C = I_{d×d} and mean μ = ρ θ/‖θ‖ (respectively, −μ for the negative class). Here ρ = 1.…. The vector θ ∈ [0, 1]^d determines the relevance of each feature in the classification task; e.g., θ_i = 0 implies that the distributions of the two classes have zero mean in the i-th feature, which renders this feature irrelevant. The value of each component of θ is calculated as θ_i = (i/d)^γ, where γ is a constant that determines the relative importance of the elements of θ. We generate seven datasets with γ ∈ {…, ..., …}. For each value of γ, the training set consists of … data points (the prior distribution of the two classes is uniform). The test error values are measured on a test set with … instances. We repeated each experiment … times and report the average misclassification error and alignment measured over the test set, along with the running time. (In all of the experiments in this paper, the classifiers for the two-stage methods were trained using the soft-margin SVM method, where the regularization coefficient of SVM was chosen by cross-validation from the set {…}.)

We test two versions of our method: one that uses a family of Gaussian kernels with a common bandwidth (denoted by CA-1D), and another one (denoted by CA-nD) that searches in the space (κ_σ)_{σ∈(0,∞)^d}, where each coordinate has a separate bandwidth parameter: κ_σ(x, x′) = exp(−∑_{i=1}^d (x_i − x′_i)²/σ_i²). Since the training set is small, one can easily overfit while optimizing the alignment. Hence, we modify the algorithm to shrink the values of the bandwidth parameters to
their common average value, by modifying (2):

  σ*_t = argmin_{σ∈Σ} −⟨K(κ_σ), F′((K(k_{t−1}))_c)⟩_F + λ‖σ − σ̄1‖²,  (4)

where σ̄ = (1/d) ∑_{i=1}^d σ_i and λ is a regularization parameter. We also include results obtained for the finite kernel learning methods. For these methods, we generate Gaussian kernels whose bandwidths form a geometric sequence, σ ∈ m·g^{0,...,…}, where m = 10^−… and g ≈ …. Further details of the experimental setup can be found in Section B.2 of the supplementary material.

Figure 2 shows the results. Recall that the larger the value of γ, the larger the number of nearly irrelevant features. Since methods which search only a one-dimensional space cannot differentiate between relevant and irrelevant features, their misclassification rate increases with γ. Only CA-nD is able to cope with this situation and even improves its performance. We observed that without regularization, though, CA-nD drastically overfits (for small values of γ). We also show the running times of the methods to give the reader an idea about their scalability. The running time of CA-nD is larger than that of CA-1D, both because of the use of cross-validation to tune λ and because of the increased cost of the multi-dimensional search. Although the large running time might be a problem, for some problems CA-nD might be the only method among those studied to deliver good performance. (We have not attempted to run a multi-dimensional version of the CR method, since already the one-dimensional version of this method is at least one order of magnitude slower than our CA-1D method.)

3.2 Real data

We evaluate the methods listed in Table 1 on several binary classification tasks from MNIST and the UCI Letter recognition dataset, along with several other datasets from the UCI machine learning repository (Frank and Asuncion, 2010) and the Delve datasets.
Figure 2: Performance and running time of various methods for the 50-dimensional synthetic problem, as a function of the relevance parameter γ. Note that the number of irrelevant features increases with γ. For details of the experiments, see the text.

Table 2: Median rank and running time (sec.) of the kernel learning methods obtained in the experiments.

                  CA-1D  CA-nD  CR   DA   D1   D2   DU
Rank  MNIST         1     N/A    2   4.5  4.5   5    4
      Letter        1     4.5    2   3.5   7    6    5
      11 datasets   3      2     3    3    4    6    6
Time  MNIST         …     N/A    …    …    …    …    …
      Letter        …      …     …    …    …    …    …
MNIST. In the first experiment, following Argyriou et al. (2005), we chose 8 handwritten digit recognition tasks of various difficulty from the MNIST dataset (LeCun and Cortes, 2010). This dataset consists of 28 × 28 images with pixel values ranging between 0 and 255. In these experiments, we used Gaussian kernels with parameter σ: G_σ(x, x′) = exp(−‖x − x′‖²/σ²). Due to the large number of attributes (784) in the MNIST dataset, we only evaluate the one-dimensional version of our method. For the algorithms that work with a finite kernel set, we pick 20 kernels, with the value of σ picked from an equidistant discretization of the interval [500, …]. In each experiment, the training and validation sets consist of … and … data points, while the test set has … data points. We repeated each experiment 10 times. Due to the lack of space, the test-set error plots for all of the problems can be found in the supplementary material (see Section B.3). In order to give an overall impression of the algorithms' performance, we ranked them based on the results obtained in the above experiment. Table 2 reports the median ranks of the methods for the experiment just described. Overall, methods that choose σ from a continuous set outperformed their finite counterparts. This suggests again that, for the finite kernel learning methods, the range of σ and the discretization of this range are important to the accuracy of the resulting classifier.
UCI Letter Recognition. In another experiment, we evaluated these methods on 12 binary classification tasks from the UCI Letter recognition dataset. This dataset includes 20,000 data points of the 26 capital letters of the English alphabet. For each binary classification task, the training and validation sets include … and … data points, respectively. The misclassification errors are measured over … test points. As with MNIST, we used Gaussian kernels. However, in this experiment we ran our method with both the one-dimensional and the n-dimensional search procedures. The rest of the methods learn a single parameter, and the finite kernel learning methods were provided with 20 kernels with σ's chosen from the interval [1, …] in an equidistant manner. The plots of misclassification error and alignment are available in the supplementary material (see Section B.3). We report the median rank of each method in Table 2. While the one-dimensional version of our method outperforms the rest of the methods, the classifier built on the kernel found by the multi-dimensional version of our method did not perform well. We examined the value of the alignment between the learned kernel and the target label kernel on the test set achieved by each method; the results are available in the supplementary material (see Section B.3). The multi-dimensional version of our method achieved the highest value of alignment in every task in this experiment. Thus, a higher value of alignment between the learned kernel and the ideal kernel does not necessarily translate into higher classification accuracy. Aside from this observation, the same trends observed on the MNIST data can be seen here: the continuous kernel learning methods (CA-1D and CR) outperform the finite kernel learning methods.
Miscellaneous datasets. In the last experiment, we evaluate all methods on 11 datasets chosen from the UCI machine learning repository and the Delve datasets. Most of these datasets were used previously to evaluate kernel learning algorithms (Lanckriet et al., 2004; Cortes et al., 2009a,b, 2010; Rakotomamonjy et al., 2008). The specification of each dataset and the performance of each method are available in the supplementary material (see Section B.3). The median rank of each method is shown in Table 2. Contrary to the Letter experiment, in this case the multi-dimensional version of our method outperforms the rest of the methods.
Running Times.
We measured the time required for each run of each kernel learning method in the MNIST and UCI Letter experiments. In each case, we took the average of the running time of each method over all tasks. The average running times, along with standard errors, are shown in Table 2. Among all methods, the DU method is the fastest, which is expected, as it requires no additional time to compute kernel weights. CA-1D is the fastest among the remaining methods. In these experiments our method converges in fewer than 10 iterations (kernels). The general trend is that the one-stage kernel learning methods, i.e., D1, D2, and CR, are slower than the two-stage methods, CA and DA. Among all methods, the other continuous kernel learning method, CR, is the slowest, since (1) it is a one-stage algorithm and (2) it usually requires more iterations (around …) to converge. We also examined the DC-programming version of the CR method (Argyriou et al., 2006). While it is faster than the original gradient-based approach (roughly three times faster), it is still significantly slower than the rest of the methods in our experiments.

4 Conclusion

We presented a novel method for kernel learning. This method addresses the problem of learning a kernel in the positive linear span of some continuously parameterized kernel family. The algorithm implements a steepest ascent approach to forward stagewise additive modeling, to maximize an empirical centered correlation measure between the kernel and the empirical approximation to the ideal response-kernel. The method was shown to perform well in a series of experiments, both with synthetic and real data. We showed that in single-dimensional kernel parameter search, our method outperforms standard multiple kernel learning methods, without the need to discretize the parameter space. While the method of Argyriou et al. (2005) also benefits from searching in a continuous space, it was seen to require significantly more computation time than our method. We also showed that our method can successfully deal with high-dimensional kernel parameter spaces, which, at least in our experiments, the method of Argyriou et al. (2005, 2006) had problems with. The main lesson of our experiments is that methods that start by discretizing the kernel space without using the data might lose the potential to achieve good performance before any learning happens.

We think that currently our method is the most efficient method to design data-dependent dictionaries that provide competitive performance. It remains an interesting problem for future work whether there exist methods that are provably efficient and yet remain competitive in performance. Although in this work we directly compared our method to finite-kernel methods, it is also natural to combine dictionary search methods (like ours) with finite-kernel learning methods. However, the thorough investigation of this option remains for future work.

A secondary outcome of our experiments is the observation that, although test-set alignment is generally a good indicator of good predictive performance, a larger test-set alignment does not necessarily translate into a smaller misclassification error. Although this is not completely unexpected, we think it will be important to thoroughly explore the implications of this observation.
References
Argyriou, A., Hauser, R., Micchelli, C., and Pontil, M. (2006). A DC-programming algorithm for kernel selection. In Proceedings of the 23rd International Conference on Machine Learning, pages 41–48.

Argyriou, A., Micchelli, C., and Pontil, M. (2005). Learning convex combinations of continuously parameterized basic kernels. In Proceedings of the 18th Annual Conference on Learning Theory, pages 338–352.

Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer.

Cortes, C. (2009). Invited talk: Can learning kernels help performance? In ICML '09, pages 1–1.

Cortes, C., Mohri, M., and Rostamizadeh, A. (2009a). L2 regularization for learning kernels. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, pages 109–116.

Cortes, C., Mohri, M., and Rostamizadeh, A. (2009b). Learning non-linear combinations of kernels. In Advances in Neural Information Processing Systems 22, pages 396–404.

Cortes, C., Mohri, M., and Rostamizadeh, A. (2010). Two-stage learning kernel algorithms. In Proceedings of the 27th International Conference on Machine Learning, pages 239–246.

Cristianini, N., Kandola, J., Elisseeff, A., and Shawe-Taylor, J. (2002). On kernel-target alignment. In Advances in Neural Information Processing Systems 15, pages 367–373. MIT Press.

Frank, A. and Asuncion, A. (2010). UCI machine learning repository.

Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. Springer Series in Statistics. Springer-Verlag New York.

Kloft, M., Brefeld, U., Sonnenburg, S., and Zien, A. (2011). ℓp-norm multiple kernel learning. Journal of Machine Learning Research, 12:953–997.

Lanckriet, G., Cristianini, N., Bartlett, P., Ghaoui, L., and Jordan, M. (2004). Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72.

LeCun, Y. and Cortes, C. (2010). MNIST handwritten digit database.

Rakotomamonjy, A., Bach, F., Canu, S., and Grandvalet, Y. (2008). SimpleMKL. Journal of Machine Learning Research, 9:2491–2521.

Sonnenburg, S., Rätsch, G., Schäfer, C., and Schölkopf, B. (2006). Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1565.
A Proofs
A.1 Proof of Proposition 1
First, notice that the limit in (1) is a directional derivative, D_{κ_σ} f(k_{t−1}). By the chain rule,

  D_{κ_σ} f(k_{t−1}) = ⟨K(κ_σ), F′_c(K(k_{t−1}))⟩_F,

where, for convenience, we defined F_c(K) = A_c(K, K̂*). Define F(K) = ⟨K, K̂*_c⟩_F / (‖K‖_F ‖K̂*_c‖_F), so that F_c(K) = F(K_c). Some calculations give that

  F′(K) = [K̂*_c − ‖K‖_F^{−2} ⟨K, K̂*_c⟩_F K] / (‖K‖_F ‖K̂*_c‖_F)

(which is the function defined in (3)). We claim that the following holds:

Lemma 3. F′_c(K) = C_n F′(K_c) C_n.

Proof. By the definition of derivatives, as H → 0,

  F(K + H) − F(K) = ⟨F′(K), H⟩_F + o(‖H‖).

Also, F_c(K + H) − F_c(K) = ⟨F′_c(K), H⟩_F + o(‖H‖). Now,

  F_c(K + H) − F_c(K) = F(C_n K C_n + C_n H C_n) − F(C_n K C_n)
                      = ⟨F′(K_c), C_n H C_n⟩_F + o(‖H‖)
                      = ⟨C_n F′(K_c) C_n, H⟩_F + o(‖H‖),

where the last equality follows from the cyclic property of the trace. Therefore, by the uniqueness of the derivative, F′_c(K) = C_n F′(K_c) C_n.

Now, notice that C_n F′(K_c) C_n = F′(K_c). Thus, we see that the value of σ*_t can be obtained by

  σ*_t = argmax_{σ∈Σ} ⟨K(κ_σ), F′((K(k_{t−1}))_c)⟩_F,

which was the statement to be proved.
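As a sanity check (our own addition, not part of the paper), formula (3) and Lemma 3 can be verified numerically with finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.normal(size=(n, n)); K = A @ A.T            # a random PSD kernel matrix
Y = rng.choice([-1.0, 1.0], size=n)
C = np.eye(n) - np.ones((n, n)) / n                 # centering matrix C_n
Ks_c = C @ np.outer(Y, Y) @ C                       # centered ideal kernel

def F(K):
    # F(K) = <K, K*_c>_F / (||K||_F ||K*_c||_F)
    return (K * Ks_c).sum() / (np.linalg.norm(K) * np.linalg.norm(Ks_c))

def dF(K):
    # Equation (3)
    nK = np.linalg.norm(K)
    return (Ks_c - (K * Ks_c).sum() / nK**2 * K) / (nK * np.linalg.norm(Ks_c))

H = rng.normal(size=(n, n)); H = (H + H.T) / 2      # random symmetric direction
eps = 1e-6
print((F(K + eps * H) - F(K)) / eps, (dF(K) * H).sum())        # ~equal: checks (3)
Fc = lambda M: F(C @ M @ C)
print((Fc(K + eps * H) - Fc(K)) / eps,
      (C @ dF(C @ K @ C) @ C * H).sum())                        # ~equal: checks Lemma 3
```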
Let g(η) = f(k_{t−1} + η κ_{σ*_t}). Using the definition of f, we find that, with some constant ρ > 0,

  g(η) = ρ (a + bη) / (c + 2dη + eη²)^{1/2}.

Notice that here the denominator is bounded away from zero (this follows from the form of the denominator of f); in particular, e > 0. Further,

  lim_{η→∞} g(η) = −lim_{η→−∞} g(η) = ρ b/√e.  (5)

Taking the derivative of g, we find that

  g′(η) = ρ (bc − ad + (bd − ae)η) / (c + 2dη + eη²)^{3/2}.

Therefore, g′ has at most one root, and g has at most one global extremum, from which the result follows by solving for the root of g′ (if g′ does not have a root, g is constant).
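Similarly, the closed-form maximizer of Proposition 2 can be checked (again our own sketch, not the authors' code) against a brute-force line search over [0, η_max]:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
C = np.eye(n) - np.ones((n, n)) / n

def rand_centered_psd():
    A = rng.normal(size=(n, n))
    return C @ (A @ A.T) @ C

K, Kp, Ks_c = rand_centered_psd(), rand_centered_psd(), rand_centered_psd()
a = (K * Ks_c).sum();  b = (Kp * Ks_c).sum()
c = (K * K).sum();     d = (K * Kp).sum();     e = (Kp * Kp).sum()

def g(eta):
    # alignment of K + eta*Kp with the target, up to the positive constant rho
    return (a + b * eta) / np.sqrt(c + 2 * d * eta + e * eta**2)

denom = b * d - a * e
eta_star = max(0.0, (a * d - b * c) / denom) if denom != 0 else 0.0
eta_max = 1.0
best = max((0.0, min(eta_star, eta_max), eta_max), key=g)
grid = np.linspace(0.0, eta_max, 100001)
print(best, grid[np.argmax(g(grid))])               # the two maximizers agree
```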
[Figure 3: eight panels plotting h(σ); (a), (b): odd vs. even, iterations 1 and 2; (c), (d): 0 vs. 6, iterations 1 and 2; (e)-(h): B vs. E, iterations 1-4.]

Figure 3: The flipped objective function underlying (2) as a function of σ, the parameter of a Gaussian kernel, in selected MNIST and UCI Letter problems. Our algorithm needs to find the minimum of these functions (and similar ones).

B Details of the numerical experiments
In this section we provide further details and data for the numerical results.
B.1 Non-Convexity Issue
As we mentioned in Section 2, our algorithm may need to solve a non-convex optimization problem in each iteration to find the best kernel parameter. Here we explore this problem numerically, by plotting the function to be optimized in the case of a Gaussian kernel with a single bandwidth parameter. In particular, we plot the objective function of Equation (2) with its sign flipped; we are therefore interested in the local minima of the function

  h(σ) = −⟨K(κ_σ), F′((K(k_{t−1}))_c)⟩_F;

see Figure 3. The function h is shown for some iterations of some of the tasks from both the MNIST and UCI Letter experiments. The number inside parentheses in the caption specifies the corresponding iteration of the algorithm. In these plots, the objective function does not have more than two local minima. Although in some cases the functions have some steep parts (at the scales shown), their optimization does not seem very difficult.
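Plots of this kind are easy to reproduce; the following is our own illustrative Python sketch on toy data (the paper's figures come from the MNIST and Letter tasks), reusing the gradient formula (3):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 5))                      # toy data standing in for MNIST/Letter
Y = np.sign(X[:, 0] + 0.3 * rng.normal(size=80))
C = np.eye(80) - np.ones((80, 80)) / 80
Ks_c = C @ np.outer(Y, Y) @ C

def gram(sigma):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / sigma**2)

def grad_F(K):                                    # equation (3)
    nK = np.linalg.norm(K)
    return (Ks_c - (K * Ks_c).sum() / nK**2 * K) / (nK * np.linalg.norm(Ks_c))

K_prev = 1e-3 * np.eye(80)                        # first iteration: K_0 = eps * I
P = C @ grad_F(C @ K_prev @ C) @ C
sigmas = np.linspace(0.1, 20, 400)
h = [-(gram(s) * P).sum() for s in sigmas]        # the flipped objective h(sigma)
plt.plot(sigmas, h); plt.xlabel('sigma'); plt.ylabel('h(sigma)'); plt.show()
```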
B.2 Details of the 50-dimensional synthetic dataset experiment

The one-dimensional version of our algorithm, CA-1D, and the CR method employ Matlab's fmincon function with multiple restarts, with starting points from the set {…}, to choose the kernel parameters. The multi-dimensional version of our algorithm, CA-nD, uses fmincon only once, since in this particular example the search method runs over a 50-dimensional search space, which makes the search an expensive operation. The starting point of the CA-nD method is a vector of equal elements, where this element is the weighted average of the kernel parameters found by the CA-1D method, weighted by the coefficients of the corresponding kernels.

The soft-margin SVM regularization parameter is tuned from the set {…} using an independent validation set with … instances. We also tuned the value of the regularization parameter λ in Equation (4) from the set {…} using the same validation set (the best value of λ is the one that achieves the highest value of alignment on the validation set). We decided to use a large validation set, essentially following the practice of Kloft et al. (2011, Section 6.1), to make sure that reasonably good regularization parameters are used in the experiments, i.e., to factor out the choice of the regularization parameters. This might bias our results towards CA-nD, as compared to CA-1D,
though similar results were achieved with a smaller validation set of size …. As a final detail, note that D1, D2 and CR also use the validation set for choosing the value of their regularization factor, and, together with the regularizer, the weights as well. Hence, their results might also be positively biased (though we do not think this is significant in this case).

The running times shown in Figure 2 include everything from the beginning to the end, i.e., from learning the kernels to training the final classifiers (the extra cross-validation step is what makes CA-nD expensive).

Figure 4 shows the (centered) alignment values for the learned kernels (on the test data) as a function of the relevance parameter γ. It can readily be seen that, in terms of kernel alignment, the multi-dimensional method has a real edge over the other methods when the number of irrelevant features is large. As seen in Figure 2, this edge is also transformed into an edge in terms of the test-set performance. Note also that the discretization is fine enough that the alignment-maximizing finite kernel learning method DA can achieve the same alignment as the method CA-1D.

Figure 4: Alignment values in the 50-dimensional synthetic dataset experiment.

B.3 Detailed results for the real datasets
[Figure 5: one panel per MNIST task, e.g., odd vs. even.]

Figure 5: Misclassification percentages in different tasks of the MNIST dataset.
[Figure 6: one panel per task: B vs. E, B vs. F, C vs. G, C vs. O, E vs. F, I vs. J, I vs. L, K vs. X, O vs. Q, P vs. R, U vs. V, V vs. Y.]
Figure 6: Misclassification percentages in different tasks of the UCI Letter recognition dataset.
[Figure 7: one panel per task: B vs. E, B vs. F, C vs. G, C vs. O, E vs. F, I vs. J, I vs. L, K vs. X, O vs. Q, P vs. R, U vs. V, V vs. Y.]
Figure 7: Alignment values in different tasks of the UCI Letter recognition dataset.

Table 3: Datasets used in the experiments.
Dataset              #features  #instances  training  validation  test
Banana                    …          …          …          …         …
Breast Cancer             …          …          …          …         …
Diabetes                  …          …          …          …         …
German                   20        1000        200        300       500
Heart                    13         270         54         81       135
Image Segmentation       18        2086        400        600      1000
Ringnorm                 20        7400        500       1000      2000
Sonar                    60         208         41         62       105
Splice                   60        2991        500       1000      1491
Thyroid                   …          …          …          …         …
Waveform                 21        5000        500       1000      2000