On Sampling Random Features From Empirical Leverage Scores: Implementation and Theoretical Guarantees
Shahin Shahrampour and Soheil Kolouri

Abstract
Random features provide a practical framework for large-scale kernel approximation and supervised learning. It has been shown that data-dependent sampling of random features using leverage scores can significantly reduce the number of features required to achieve optimal learning bounds. Leverage scores introduce an optimized distribution for features based on an infinite-dimensional integral operator (depending on the input distribution), which is impractical to sample from. Focusing on empirical leverage scores in this paper, we establish an out-of-sample performance bound, revealing an interesting trade-off between the approximated kernel and the eigenvalue decay of another kernel in the domain of random features defined based on the data distribution. Our experiments verify that the empirical algorithm consistently outperforms vanilla Monte Carlo sampling, and with a minor modification the method is even competitive with supervised data-dependent kernel learning, without using the output (label) information.
I. INTRODUCTION
Supervised learning is a fundamental machine learning problem, where a learner is given input-output data samples (from an unknown distribution), and the objective is to find a mapping from inputs to outputs [1]. Kernel methods are powerful tools to capture the nonlinear relationship between inputs and outputs. These methods implicitly map the inputs (features) to a high-dimensional space, without the need for knowledge of the feature map, an idea known as the kernel trick. While kernel methods are theoretically well-justified, their practical applicability to large datasets is limited in that they require memory (and time) complexity that can scale quadratically (and cubically) with the size of data samples.

In the past few years, this computational bottleneck has motivated a large body of research on (low-rank) kernel approximation [2]-[4] for efficient learning. In these scenarios, the training can scale linearly with respect to data, introducing a dramatic decrease in the computational cost. In this line of work, an elegant idea has been the use of the so-called random features for kernel approximation [4] as well as training shallow networks [5]. In this approach, random features are sampled from a stochastic oracle to form the nonlinear basis functions used to describe the input-output relationship. Replacing optimization with randomization circumvents the non-convexity in training and comes with a theoretical generalization guarantee [5].

Since its development, the randomized-feature approach has been successfully used for a wide range of problems (see e.g. [6] for matrix completion, [7] for the correlation analysis of random variables, and [8] for non-parametric
Shahin Shahrampour is with Texas A&M University, College Station, TX, USA (e-mail: [email protected]).
Soheil Kolouri is with HRL Laboratories, LLC., Malibu, CA, USA (e-mail: [email protected]).
statistical learning), but as pointed out in [9], since the basis functions are sampled from a distribution that is independent of data, the number of features required to learn the data subspace may be large. Therefore, a natural question is whether a data-dependent stochastic oracle can prove to be useful in improving the out-of-sample performance.

Recently, a number of works have developed supervised data-dependent methods for sampling random features with the goal of improving generalization [10]-[12]. This objective is achieved by pre-processing the random features (e.g. via optimizing a metric) and focusing on promising features, which amounts to learning a "good" kernel based on input-output pairs. We provide a comprehensive review of these works in Section IV, but the focus of this work is on an unsupervised data-dependent method relying on leverage scores, calculated based only on inputs [13]-[15]. [14] discussed the impact of leverage scores for ridge regression, [15] addressed the problem in the case of SVM, and [13] established theoretical guarantees for Lipschitz continuous loss functions. Common to all these results is the fact that using leverage scores for sampling random features can significantly reduce the number of features required to achieve optimal learning bounds. The bounds are particularly useful when the eigenvalues of the integral operator corresponding to the underlying kernel decay fast enough. Nevertheless, these works do not aim to change the underlying base kernel.

There are two practical hurdles in using leverage scores: (i) they introduce an optimized distribution for re-sampling features, which is based on the infinite-dimensional integral operator associated with the underlying kernel, and (ii) the support set (the domain of random features) is infinite-dimensional, making the optimized distribution impractical to sample from. An empirical sampling scheme is proposed in the experiments of [15] without theoretical analysis, and as noted in [15], a result in this theoretical direction will be useful for guiding practitioners.

In this paper, we aim to address the problem above using empirical leverage scores. In this scenario, we must construct a finite counterpart of the optimized distribution to use for training. Interestingly, the out-of-sample performance of the algorithm (Theorem 1) reveals an interesting trade-off between two errors: (i) the approximation error of the kernel caused by finiteness of random features, and (ii) the eigenvalue decay of another kernel in the domain of random features defined based on the data distribution. The proof of our main result uses a combination of the approximation error result of [13] as well as the spectral approximation result of [16] for ridge leverage functions (which builds on recent works on matrix concentration inequalities [17]). We also verify with numerical experiments (on practical datasets) that the empirical leverage score idea consistently outperforms vanilla Monte Carlo sampling [5], and with a minor modification in the sampling scheme, it can be even competitive with supervised data-dependent methods, without using outputs (labels).
II. PROBLEM FORMULATION

a) Preliminaries:
We denote by $[N]$ the set of positive integers $\{1, \ldots, N\}$, by $\mathrm{Tr}[\cdot]$ the trace operator, by $\|\cdot\|$ the spectral (respectively, Euclidean) norm of a matrix (respectively, vector), by $\mathbb{E}[\cdot]$ the expectation operator, and by $\mathrm{var}(\cdot)$ the variance operator. Boldface lowercase variables (e.g. $\mathbf{a}$) are used for vectors, and boldface uppercase variables (e.g. $\mathbf{A}$) are used for matrices. $[\mathbf{A}]_{ij}$ denotes the $ij$-th entry of matrix $\mathbf{A}$. The vectors are all in column form.

$L_2(dp, \mathcal{X})$ represents the set of square integrable functions with respect to the Borel probability measure $dp$ on the domain $\mathcal{X}$. We use $\langle \cdot, \cdot \rangle_{\mathcal{F}}$ to denote the inner product associated with an inner product space $\mathcal{F}$ and $\|\cdot\|_{\mathcal{F}}$ for its corresponding norm. The subscript may be dropped when it is clear from the context (e.g. for the Euclidean space). For a positive semi-definite linear operator $\Sigma$, the sequence $\{\sigma_i(\Sigma)\}_{i=1}^{\infty}$ denotes the set of (non-negative) eigenvalues in descending order. The sequence is finite if $\Sigma$ is finite-dimensional.
A. Supervised Learning

In the supervised learning problem, we are given a training set $\{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$ in the form of input-output pairs, which are i.i.d. samples from an unknown distribution. The input feature space is $d$-dimensional, i.e., for $n \in [N]$, we have $\mathbf{x}_n \in \mathcal{X} \subset \mathbb{R}^d$, where $\mathcal{X}$ is closed and convex. For regression, we assume $y_n \in \mathcal{Y} \subseteq [-1, 1]$, whereas for classification we have $y_n \in \{-1, 1\}$. The goal of supervised learning is to find a function $f: \mathcal{X} \to \mathbb{R}$ based on the training set, which can generalize well, i.e., which can accurately predict the outputs of previously unseen inputs.

The problem above can be formulated as minimizing a risk functional $R(f)$, whose empirical counterpart is $\widehat{R}(f)$, defined as
$$R(f) \triangleq \mathbb{E}[L(f(\mathbf{x}), y)] \qquad \widehat{R}(f) \triangleq \frac{1}{N} \sum_{n=1}^{N} L(f(\mathbf{x}_n), y_n),$$
where $L(\cdot, \cdot)$ is a task-dependent loss function (e.g. hinge loss for SVM), and the expectation is taken with respect to data. As this distribution is unknown, we can only minimize the empirical risk $\widehat{R}(f)$, instead of the true risk $R(f)$, and calculate the gap between the two using standard arguments from measures of function space complexity (e.g. VC dimension, Rademacher complexity, etc.). We will discuss two related function classes in the next section.
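As a concrete illustration, the following minimal Python sketch (our own illustration, not the paper's released code) evaluates the empirical risk $\widehat{R}(f)$ for a squared loss on synthetic data; the loss and predictor are placeholders of our choosing.

```python
import numpy as np

def empirical_risk(f, X, y, loss=lambda p, t: (p - t) ** 2):
    """Empirical risk: the average loss of predictor f over the training set."""
    preds = np.array([f(x) for x in X])
    return np.mean(loss(preds, y))

# Toy example: N = 100 samples in d = 3 dimensions, with a linear predictor.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.tanh(X @ np.array([1.0, -0.5, 0.2]))      # outputs inside [-1, 1]
f = lambda x: x @ np.array([0.9, -0.4, 0.1])
print(empirical_risk(f, X, y))
```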
B. Kernels and Random Features

To minimize the risk functional, we need to focus on a function class for $f(\cdot)$. Let us consider a symmetric positive-definite function $k(\cdot, \cdot)$ such that $\sum_{i,j=1}^{N} \alpha_i \alpha_j k(\mathbf{x}_i, \mathbf{x}_j) \geq 0$ for all $\boldsymbol{\alpha} \in \mathbb{R}^N$. $k(\cdot, \cdot)$ is then called a positive (semi-)definite kernel, and a possible class to consider is the Reproducing Kernel Hilbert Space (RKHS) associated with $k(\cdot, \cdot)$, defined as follows
$$\mathcal{F}_k \triangleq \left\{ f(\cdot) = \sum_{n=1}^{N} \alpha_n k(\mathbf{x}_n, \cdot) : \boldsymbol{\alpha} \in \mathbb{R}^N \right\}. \qquad (1)$$
Minimizing the empirical risk $\widehat{R}(f)$ over this class of functions by optimizing over $\boldsymbol{\alpha}$ is theoretically well-understood and justified; however, since this approach requires $O(N^2)$ in space and $O(N^3)$ in time (e.g. training ridge regression with naive matrix inversion), the practical applicability of kernel methods to large datasets is limited.

It is often useful to study the RKHS through the following integral operator $\Sigma: L_2(dp, \mathcal{X}) \to L_2(dp, \mathcal{X})$
$$(\Sigma f)(\cdot) = \int_{\mathcal{X}} f(\mathbf{x}) k(\mathbf{x}, \cdot)\, dp(\mathbf{x}). \qquad (2)$$
The spectral properties of the kernel matrix $[\mathbf{K}]_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)/N$ are related to those of $\Sigma$ (see e.g. [18]). When $\sup_{\mathbf{x} \in \mathcal{X}} k(\mathbf{x}, \mathbf{x}) < \infty$, $\Sigma$ is self-adjoint, positive semi-definite, and trace-class. Note that this is a sufficient (but not a necessary) condition: $\int_{\mathcal{X}} k(\mathbf{x}, \mathbf{x})\, dp(\mathbf{x}) < \infty$ is a weaker condition under which the same properties hold [13].
Let us now restrict our attention to kernels that can be written as
$$k(\mathbf{x}, \mathbf{x}') = \int_{\Omega} \phi(\mathbf{x}, \boldsymbol{\omega}) \phi(\mathbf{x}', \boldsymbol{\omega})\, d\tau(\boldsymbol{\omega}), \qquad (3)$$
for all $\mathbf{x}, \mathbf{x}' \in \mathcal{X}$ and a measure $d\tau$ on $\Omega$. Many common kernels take the form above; examples include shift-invariant kernels [4] and dot product (e.g. polynomial) kernels [19].

The integral form (3) can be approximated using Monte Carlo sampling of so-called random features $\{\boldsymbol{\omega}_m\}_{m=1}^{M}$, which are i.i.d. vectors generated from $d\tau$. Then,
$$\widehat{k}_M(\mathbf{x}, \mathbf{x}') \triangleq \frac{1}{M} \sum_{m=1}^{M} \phi(\mathbf{x}, \boldsymbol{\omega}_m) \phi(\mathbf{x}', \boldsymbol{\omega}_m) = \langle \phi_M(\mathbf{x}), \phi_M(\mathbf{x}') \rangle, \qquad (4)$$
where
$$\phi_M(\mathbf{x}) \triangleq \frac{1}{\sqrt{M}} [\phi(\mathbf{x}, \boldsymbol{\omega}_1), \ldots, \phi(\mathbf{x}, \boldsymbol{\omega}_M)]^\top, \qquad (5)$$
naturally leading to the function class
$$\widehat{\mathcal{F}} \triangleq \left\{ f(\cdot) = \sum_{m=1}^{M} \theta_m \phi(\cdot, \boldsymbol{\omega}_m) : \boldsymbol{\theta} \in \mathbb{R}^M \right\}. \qquad (6)$$
The advantage of optimizing the risk function over $\widehat{\mathcal{F}}$ (rather than $\mathcal{F}_k$) is that the training can be considerably more efficient if we can keep $M \ll N$. For example, in the case of ridge regression, the $O(N^3)$ time would reduce to $O(M^3 + NM^2)$. In fact, recently [14] showed that to achieve the same statistical accuracy as kernel ridge regression (i.e., $O(1/\sqrt{N})$ risk error), we only require $M = O(\sqrt{N} \log N)$ random features using vanilla Monte Carlo sampling. Note that the randomized-feature approach also reduces the computation time of the test phase from $O(N)$ to $O(M)$.
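To make (3)-(5) concrete, here is a minimal sketch (our own illustration, with parameter choices of our own) of the classic Monte Carlo construction for the Gaussian kernel, using the cosine feature map $\phi(\mathbf{x}, \boldsymbol{\omega}) = \sqrt{2}\cos(\mathbf{x}^\top \boldsymbol{\omega} + b)$ with $\boldsymbol{\omega} \sim \mathcal{N}(0, \sigma^{-2} \mathbf{I}_d)$ and $b \sim \mathrm{Unif}[0, 2\pi)$, as in [4] and in Section V of this paper.

```python
import numpy as np

def rff_features(X, M, sigma=1.0, rng=None):
    """phi_M(x) of (5): Monte Carlo random Fourier features for a Gaussian kernel."""
    rng = rng or np.random.default_rng(0)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, M))   # omega_m ~ N(0, sigma^{-2} I_d)
    b = rng.uniform(0.0, 2 * np.pi, size=M)          # b_m ~ Unif[0, 2*pi)
    return np.sqrt(2.0 / M) * np.cos(X @ W + b)      # rows are phi_M(x_n)^T

# Sanity check: <phi_M(x), phi_M(x')> of (4) approximates exp(-||x-x'||^2 / (2 sigma^2)).
rng = np.random.default_rng(1)
X = rng.normal(size=(2, 5))
Z = rff_features(X, M=5000, sigma=1.5, rng=rng)
approx = Z[0] @ Z[1]
exact = np.exp(-np.sum((X[0] - X[1]) ** 2) / (2 * 1.5 ** 2))
print(approx, exact)  # the two values agree for large M
```

The Monte Carlo error in (4) shrinks as $M$ grows, which is exactly the variance term that reappears in the generalization bound of Section III.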
C. Leverage Scores and Data-Dependent Sampling

The function class (6) can also be viewed as a one-(hidden-)layer neural network (i.e., a perceptron) with an activation function $\phi(\cdot, \cdot)$. To minimize the empirical risk over (6), we can (in general) consider three possible paths: (1) joint optimization over $\boldsymbol{\theta}$ and $\{\boldsymbol{\omega}_i\}_{i=1}^{M}$, which fully trains the neural network by solving a non-convex optimization problem; (2) Monte Carlo sampling of $\{\boldsymbol{\omega}_i\}_{i=1}^{M}$ and optimizing over $\boldsymbol{\theta}$ [5], which was discussed in the previous section; and (3) data-dependent sampling of $\{\boldsymbol{\omega}_i\}_{i=1}^{M}$ and optimizing over $\boldsymbol{\theta}$. Though (1) seems to be the most powerful technique, the main advantage of (2) and (3) is dealing with a convex problem that avoids (potentially bad) local minima. Another potential advantage is that we do not require the gradient of the activation function for training, which broadens the scope of applicability.

Recently, a number of works have proposed supervised data-dependent sampling of random features to enhance generalization [10]-[12]. This objective is achieved by pre-processing the random features (e.g. via optimizing a metric) and focusing on "good" ones (for the generalization purpose). We provide a comprehensive review of these works in Section IV, and here, we focus on presenting a promising unsupervised data-dependent method that relies upon leverage scores [13]-[15]. We refer the reader to Table 1 in [20] as well as Table 1 in [21] for an exhaustive list.
[14], [15] have discussed the impact of leverage scores for ridge regression and SVM. For $\boldsymbol{\omega} \in \Omega$, the leverage score is defined as [13]
$$s(\boldsymbol{\omega}) \triangleq \left\langle \phi(\cdot, \boldsymbol{\omega}), (\Sigma + \lambda I)^{-1} \phi(\cdot, \boldsymbol{\omega}) \right\rangle_{L_2(dp, \mathcal{X})}, \qquad (7)$$
where $\Sigma$ is the integral operator in (2). In turn, the optimized distribution for random features is derived as follows
$$q(\boldsymbol{\omega}) = \frac{s(\boldsymbol{\omega})}{\int_{\Omega} s(\boldsymbol{\omega})\, d\tau(\boldsymbol{\omega})}. \qquad (8)$$
Notice that if we have access to $q(\boldsymbol{\omega})$, the unbiased approximation of the kernel takes the form
$$k(\mathbf{x}, \mathbf{x}') \approx \frac{1}{M} \sum_{m=1}^{M} \frac{\phi(\mathbf{x}, \boldsymbol{\omega}_m) \phi(\mathbf{x}', \boldsymbol{\omega}_m)}{q(\boldsymbol{\omega}_m)}, \qquad (9)$$
with respect to the new measure $q(\boldsymbol{\omega}) d\tau(\boldsymbol{\omega})$. All of the aforementioned works have established theoretical results showing that if the eigenvalues of $\Sigma$ decay fast enough, the number of random features needed to achieve $O(1/\sqrt{N})$ error can decrease significantly (to $\log(N)$ and even constant!).
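The reweighting in (9) is ordinary importance sampling; the following sketch (our own illustration, with a hypothetical discrete proposal over a finite candidate pool) shows that dividing by the sampling density keeps the kernel estimate unbiased when features are drawn from $q(\boldsymbol{\omega})\, d\tau(\boldsymbol{\omega})$ instead of $d\tau$.

```python
import numpy as np

rng = np.random.default_rng(0)
P, d, M = 2000, 5, 200
pool = rng.normal(size=(P, d))                # candidate features (the pool stands in for d tau)
x, xp = rng.normal(size=d), rng.normal(size=d)

phi = lambda v, W: np.sqrt(2) * np.cos(W @ v)  # a bias-free cosine map, for illustration only

# Reference kernel value under the uniform measure on the pool.
k_pool = np.mean(phi(x, pool) * phi(xp, pool))

# Draw M features from a non-uniform distribution p over the pool and reweight as in (9):
# the density of p relative to the uniform base measure on P atoms is P * p_i.
p = rng.dirichlet(np.ones(P))
idx = rng.choice(P, size=M, p=p)
w = 1.0 / (P * p[idx])                         # 1 / q(omega_m)
k_is = np.mean(w * phi(x, pool[idx]) * phi(xp, pool[idx]))
print(k_pool, k_is)                            # the reweighted estimate is unbiased
```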
There are, however, practical challenges to consider.

Practical challenges: We can observe that sampling random features according to $s(\boldsymbol{\omega})$ gives rise to two issues [13]: (i) we require knowledge of the infinite-dimensional operator $\Sigma$ (which is not available), and (ii) the set $\Omega$ might be large and impractical to sample from. An empirical mechanism of sampling has been proposed in the experiments of [15] without theoretical analysis. As noted in [15], a result in this theoretical direction would be extremely useful for guiding practitioners, and we will discuss that in Section III after outlining the empirical leverage scores next.

D. Sampling Based on Empirical Leverage Scores
To start, let us first define the matrix
$$\Phi_{M,N} \triangleq \frac{1}{\sqrt{N}} \left[ \phi_M(\mathbf{x}_1), \ldots, \phi_M(\mathbf{x}_N) \right] \in \mathbb{R}^{M \times N}, \qquad (10)$$
where $\phi_M(\cdot)$ is given in (5). Observe that $\Phi_{M,N}$ can be related to the kernel function $k(\cdot, \cdot)$ as follows,
$$\mathbf{K} \triangleq \frac{1}{N} \begin{bmatrix} k(\mathbf{x}_1, \mathbf{x}_1) & \cdots & k(\mathbf{x}_1, \mathbf{x}_N) \\ \vdots & \ddots & \vdots \\ k(\mathbf{x}_N, \mathbf{x}_1) & \cdots & k(\mathbf{x}_N, \mathbf{x}_N) \end{bmatrix} = \mathbb{E}_{d\tau} \left[ \Phi_{M,N}^\top \Phi_{M,N} \right]. \qquad (11)$$
Now, consider another kernel $g: \Omega \times \Omega \to \mathbb{R}$ defined as $g(\boldsymbol{\omega}, \boldsymbol{\omega}') = \int_{\mathcal{X}} \phi(\mathbf{x}, \boldsymbol{\omega}) \phi(\mathbf{x}, \boldsymbol{\omega}')\, dp(\mathbf{x})$, which measures the similarity of random features. Then, the following relationship holds
$$\mathbf{G} \triangleq \frac{1}{M} \begin{bmatrix} g(\boldsymbol{\omega}_1, \boldsymbol{\omega}_1) & \cdots & g(\boldsymbol{\omega}_1, \boldsymbol{\omega}_M) \\ \vdots & \ddots & \vdots \\ g(\boldsymbol{\omega}_M, \boldsymbol{\omega}_1) & \cdots & g(\boldsymbol{\omega}_M, \boldsymbol{\omega}_M) \end{bmatrix} = \mathbb{E}_{dp} \left[ \Phi_{M,N} \Phi_{M,N}^\top \right]. \qquad (12)$$
It is shown in [13] that sampling random features using leverage scores (7) corresponds to selecting (re-weighting) them according to the diagonal elements of the matrix
$$\tilde{\mathbf{G}} \triangleq \mathbf{G} (\mathbf{G} + \lambda \mathbf{I})^{-1}. \qquad (13)$$
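The duality in (11)-(12) is easy to verify numerically; the quick check below is our own illustration (not from the paper): it builds $\Phi_{M,N}$ from the cosine feature map and confirms that $\Phi_{M,N}^\top \Phi_{M,N}$ approaches $\mathbf{K}$ as $M$ grows, while $\Phi_{M,N} \Phi_{M,N}^\top$ estimates $\mathbf{G}$ from data.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, M, sigma = 200, 4, 4000, 1.0
X = rng.normal(size=(N, d))
W = rng.normal(scale=1.0 / sigma, size=(M, d))   # omega_m ~ d tau
b = rng.uniform(0, 2 * np.pi, size=M)

# Phi_{M,N} as in (10): rows indexed by features, columns by data points.
Phi = np.sqrt(2.0 / (M * N)) * np.cos(W @ X.T + b[:, None])

# K of (11): the Gaussian kernel matrix divided by N.
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq / (2 * sigma ** 2)) / N

print(np.max(np.abs(Phi.T @ Phi - K)))  # -> 0 as M grows (Monte Carlo over d tau)
G_hat = Phi @ Phi.T                      # empirical estimate of G in (12), over data
```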
Though the dependence on the operator $\Sigma$ is relaxed, the dimension of $\mathbf{G}$ still grows with the number of random features, which suggests that the more features we evaluate (from the set $\Omega$), the more computational cost we incur. More importantly, another hurdle is that the data distribution is unknown and $\mathbf{G}$ cannot be calculated. Therefore, appealing to $\Phi_{M,N} \Phi_{M,N}^\top$ seems to be a natural solution in practice. The outline of such a method is given in Algorithm 1.
Algorithm 1: Empirical Leverage Score Sampling (ELSS)
Input: a sub-sample $\{\mathbf{x}_n\}_{n=1}^{N_0}$ of inputs, the feature map $\phi(\cdot, \cdot)$, an integer $M_0$, the sampling distribution $d\tau$, and the parameter $\lambda$.
1) Draw $M_0$ i.i.d. samples $\{\tilde{\boldsymbol{\omega}}_m\}_{m=1}^{M_0}$ according to $d\tau$.
2) Construct the matrix
$$\mathbf{Q} = \Phi_{M_0,N_0} \Phi_{M_0,N_0}^\top \left( \Phi_{M_0,N_0} \Phi_{M_0,N_0}^\top + \lambda \mathbf{I} \right)^{-1}, \qquad (14)$$
where $\Phi_{M_0,N_0}$ is defined as in (10).
3) For $i \in [M_0]$, let
$$\widehat{q}(\tilde{\boldsymbol{\omega}}_i) = \frac{[\mathbf{Q}]_{ii}}{\mathrm{Tr}[\mathbf{Q}]}. \qquad (15)$$
Output:
The new weights $\widehat{\mathbf{q}} = [\widehat{q}(\tilde{\boldsymbol{\omega}}_1), \ldots, \widehat{q}(\tilde{\boldsymbol{\omega}}_{M_0})]^\top$.

After running ELSS, we can use $\widehat{\mathbf{q}}$ as a discrete probability distribution to draw $M \leq M_0$ samples $\{\boldsymbol{\omega}_m\}_{m=1}^{M}$ and minimize the empirical risk over the function class
$$\widehat{\mathcal{F}}_{\widehat{q}} \triangleq \left\{ f(\cdot) = \sum_{m=1}^{M} \theta_m \frac{\phi(\cdot, \boldsymbol{\omega}_m)}{\sqrt{\widehat{q}(\boldsymbol{\omega}_m)}} : \boldsymbol{\theta} \in \mathbb{R}^M \right\}. \qquad (16)$$
The function class $\widehat{\mathcal{F}}_{\widehat{q}}$ is defined consistently with the choice of approximated kernel given in (9). Note that (assuming we can calculate the inverse in (14)) ELSS with $\lambda = 0$ precisely recovers Random Kitchen Sinks (RKS) [5] and corresponds to uniform sampling. (This sampling scheme was also suggested in the experiments of [15].)

Figure 1 represents the histogram of weights with $M_0 = 2000$ features for the Year Prediction dataset. As expected, for a very small $\lambda$ the measure is uniform on all samples (bottom left), which translates to a delta plot for the histogram (top left). For $\lambda = 1e{+}4$ (right), the empirical leverage scores transform the distribution of weights into a completely non-uniform measure.

Fig. 1: The histogram of weights calculated for a very small $\lambda$ (top left) and $\lambda = 1e{+}4$ (top right) on the Year Prediction dataset, and the corresponding probability densities (bottom row) for randomly generated features.

The algorithm requires $O(N_0 M_0^2 + M_0^3)$ computations to form the matrix $\mathbf{Q}$ in (14) and calculate the empirical leverage scores (assuming naive matrix inversion). Parameters $N_0$ and $M_0$ can be selected using a rule of thumb (without exhaustive tuning); we elaborate on this in the numerical experiments (Section V). Furthermore, the choice of the initial distribution $d\tau$ and the feature map $\phi(\cdot, \cdot)$ depend on the kernel that we want to use for training. For instance, cosine feature maps and a Gaussian distribution can be used for approximating a Gaussian kernel [4].
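A compact implementation of Algorithm 1 might look as follows; this is our own sketch (function and parameter names are illustrative), not the authors' released code.

```python
import numpy as np

def elss_weights(X_sub, M0, sigma=1.0, lam=1.0, rng=None):
    """Empirical Leverage Score Sampling (Algorithm 1).

    X_sub: (N0, d) sub-sample of inputs; M0: number of candidate features;
    lam: the ridge parameter lambda. Returns the candidate features and
    the discrete distribution q_hat over them.
    """
    rng = rng or np.random.default_rng(0)
    N0, d = X_sub.shape
    # Step 1: draw M0 i.i.d. candidates from d tau (Gaussian, for a Gaussian kernel).
    W = rng.normal(scale=1.0 / sigma, size=(M0, d))
    b = rng.uniform(0, 2 * np.pi, size=M0)
    # Phi_{M0,N0} as in (10).
    Phi = np.sqrt(2.0 / (M0 * N0)) * np.cos(W @ X_sub.T + b[:, None])
    # Step 2: Q = Phi Phi^T (Phi Phi^T + lam I)^{-1}, as in (14).
    A = Phi @ Phi.T
    Q = np.linalg.solve((A + lam * np.eye(M0)).T, A.T).T   # A (A + lam I)^{-1}
    # Step 3: empirical leverage-score weights (15).
    q_hat = np.diag(Q) / np.trace(Q)
    return (W, b), q_hat

# Usage: resample M <= M0 features from q_hat and train on the rescaled features, cf. (9), (16).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8))
(W, b), q_hat = elss_weights(X[:200], M0=400, sigma=2.0, lam=1e-2, rng=rng)
idx = rng.choice(len(q_hat), size=100, p=q_hat, replace=True)
# Rescale by the sampling density (M0 * q_hat) so the implied kernel matches (9).
Z = np.sqrt(2.0 / 100) * np.cos(X @ W[idx].T + b[idx]) / np.sqrt(400 * q_hat[idx])
```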
Remark 1. (Column Sampling) To improve efficiency, column sampling ideas have been previously used in the approximation of large kernel matrices [22], [23] in order to deal with (the ridge-type) matrix $\tilde{\mathbf{G}}$ in (13); however, those approaches are useful when the closed form of the kernel matrix $\mathbf{G}$ is readily available, which is not the case in our setup, as the data distribution is unknown, and the kernel function $g(\cdot, \cdot)$ (in the domain of random features) must be approximated, i.e., we need to deal with (14).

Remark 2. (Block Diagonal Approximation) Following Remark 1, another technique to improve the time cost in (13) is block kernel approximation. It has been shown in [24] that for shift-invariant kernels, block kernel approximation can improve the approximation error (depending on the kernel parameter). For our setup, we still have the same problem as in Remark 1 (unknown $\tilde{\mathbf{G}}$).
III. THEORETICAL GUARANTEES
We now provide the generalization guarantees of ELSS. The following assumptions are used for the derivation of our result.
Assumption 1.
The loss function $y \mapsto L(y, \cdot)$ is uniformly $G$-Lipschitz continuous in its first argument.

A number of commonly used loss functions satisfy the assumption above. Examples include the logistic loss $L(y, y') = \log(1 + \exp(-yy'))$ and the hinge loss $L(y, y') = \max\{0, 1 - yy'\}$ for classification, and the quadratic loss $L(y, y') = (y - y')^2$ for regression.

Assumption 2.
The feature map $\phi(\cdot, \cdot)$ satisfies $\sup_{\mathbf{x}, \boldsymbol{\omega}} |\phi(\mathbf{x}, \boldsymbol{\omega})| \leq 1$. This also implies $\sup_{\mathbf{x}, \mathbf{x}'} |k(\mathbf{x}, \mathbf{x}')| \leq 1$ due to (3).

The boundedness assumption is also standard (see e.g. [5]). For example, cosine or sigmoidal feature maps (activation functions) satisfy the assumption. In general, when $\mathcal{X}$ and $\Omega$ are compact, the feature map can be normalized to satisfy Assumption 2.
Algorithm 1 aims to approximate a distribution on an infinite-dimensional set ($\Omega$) depending on an infinite-dimensional operator ($\Sigma$). The two main challenges in analyzing ELSS are that we construct such a distribution with finite data and finite random features. After the following definition, we state our main result, in which we use $\tilde{O}(\cdot)$ to hide poly-log factors.

Definition 1. (Degrees of freedom [13]) For a positive-definite operator $\Sigma$, the degrees of freedom is defined as $\deg_\lambda(\Sigma) \triangleq \mathrm{Tr}\left[\Sigma(\Sigma + \lambda I)^{-1}\right]$.

Theorem 1.
Let Assumptions 1-2 hold. For a fixed parameter $\Delta \in (0, 1]$, let $N_0 \geq \Delta^{-2}\, \tilde{O}(M_0)$, $M_0 = o(N)$, and $\lambda = \frac{1}{N}$ in Algorithm 1. Let $M$ random features sampled from $\widehat{q}$ (the output of Algorithm 1) define the class $\widehat{\mathcal{F}}_{\widehat{q}}$ in (16), and let $\widehat{f}_{\widehat{\boldsymbol{\theta}}}$ be the minimizer of the empirical risk over $\widehat{\mathcal{F}}_{\widehat{q}}$. If $M \geq \deg_\lambda(\widehat{\Sigma}) \log\left[16 N \deg_\lambda(\widehat{\Sigma})\right]$, then for $g_{\boldsymbol{\gamma}} \in \mathcal{F}_k$ we have
$$\mathbb{E}\left[ R(\widehat{f}_{\widehat{\boldsymbol{\theta}}}) \right] - \inf_{\|\boldsymbol{\gamma}\|^2 \leq \frac{F^2}{N}} R(g_{\boldsymbol{\gamma}}) \leq \mathrm{Er}(N) + \mathrm{Er}(k) + \mathrm{Er}(g),$$
where the expectation is over data and random features, $\widehat{\Sigma}$ is the integral operator defined with respect to $\widehat{k}_{M_0}(\cdot, \cdot)$ (the construction (4) applied to the $M_0$ candidate features), $F$ is a constant factor, and
$$\mathrm{Er}(N) = O\left(\frac{1}{\sqrt{N}}\right), \quad \mathrm{Er}(k) = O\left(\sqrt{\mathbb{E}_{dp, dp'}\left[\mathrm{var}_{d\tau}\left(\widehat{k}_{M_0}(\mathbf{x}, \mathbf{x}')\right)\right]}\right), \quad \mathrm{Er}(g) = O\left(\sqrt{\frac{\Delta}{N}\, \mathbb{E}_{d\tau}\left[\frac{1}{\sigma_{M_0}(\mathbf{G})}\right]}\right).$$

Interpretation:
The bound in Theorem 1 consists of three terms. $\mathrm{Er}(N)$ appears from calculating the Rademacher complexity (estimation error due to the finite sample size $N$) and the choice of $\lambda = 1/N$ (which turns out to be optimal in view of (28)). $\mathrm{Er}(k)$ depends on the variance of the approximation of the kernel $k$; not only does it scale inversely with $M_0$, but it also depends on the feature map. On the other hand, $\mathrm{Er}(g)$ captures the impact of the minimum eigenvalue of $\mathbf{G}$ in (12). This quantity is a random number based on the choice of random features (but it is averaged out over $d\tau$ in the bound). In general, the trade-off between $\mathrm{Er}(g)$ and $\mathrm{Er}(k)$ is structure-dependent and non-trivial: increasing $M_0$ can improve $\mathrm{Er}(k)$ at the cost of making $\mathrm{Er}(g)$ looser.

As defined in Section II-D, both $k(\cdot, \cdot)$ and $g(\cdot, \cdot)$ depend on the feature map $\phi(\cdot, \cdot)$, but the important insight is that "$\mathrm{Er}(g)$ (which depends on $g$) is characterized by the data distribution $dp$, whereas $\mathrm{Er}(k)$ (which depends on $k$) is characterized by the distribution of random features $d\tau$."

Example 1.
For the Gaussian kernel using Monte Carlo sampling (see Lemma 2 in [25]), the variance is $\frac{1}{2M_0}\left(1 - e^{-z^2}\right)^2 \approx \frac{z^4}{2M_0}$ for small $z$, where $z = \|\mathbf{x} - \mathbf{x}'\| \sqrt{\mathrm{var}_{d\tau}(\boldsymbol{\omega})}$. If $z = O\left((M_0/N)^{1/4}\right)$ (small variance for random features), then $\mathrm{Er}(k) = O(1/\sqrt{N})$. If $\sigma_i(\mathbf{G}) = \Theta(e^{-i})$, by choosing $M_0 = \varepsilon \log N$ for $\varepsilon \in (0, 1]$ and letting $\Delta = N^{-\varepsilon}$, we can maintain the optimal bound as $\mathrm{Er}(g) = O(1/\sqrt{N})$. Similarly, if $\sigma_i(\mathbf{G}) = \Theta(i^{-2})$, $\Delta = M_0^{-2}$ can guarantee $\mathrm{Er}(g) = O(1/\sqrt{N})$.

The detailed proof of the theorem is in the supplementary material. It combines several ideas with prior results in the literature. To analyze the difference between $q(\boldsymbol{\omega})$ and $\widehat{q}(\boldsymbol{\omega})$, we use the spectral approximation result of [16] for ridge leverage functions (based on recent works on matrix concentration inequalities [17]). We further employ the approximation error result of [13] for bounding the error caused by selecting $M$ random features out of the $M_0$ possible samples $\{\tilde{\boldsymbol{\omega}}_m\}_{m=1}^{M_0}$.
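As a worked check of the exponential-decay case (our own verification, under the reconstructed form of $\mathrm{Er}(g)$ stated in Theorem 1, so the constants are ours):

```latex
% sigma_i(G) = Theta(e^{-i}) and M_0 = eps * log N give
% sigma_{M_0}(G) = Theta(e^{-M_0}) = Theta(N^{-eps}); with Delta = N^{-eps}:
\mathrm{Er}(g)
  = O\!\left(\sqrt{\frac{\Delta}{N}\cdot\frac{1}{\sigma_{M_0}(\mathbf{G})}}\right)
  = O\!\left(\sqrt{\frac{N^{-\varepsilon}}{N}\cdot N^{\varepsilon}}\right)
  = O\!\left(N^{-1/2}\right),
% which is why the choice Delta = N^{-eps} restores the optimal 1/sqrt(N) rate.
```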
Remark 3. Notice that the bound in Theorem 1 is for a Lipschitz continuous loss (similar to that of [13]), whereas the results in [14] and [15] are focused on ridge regression and SVM, respectively. On the other hand, our bound is in expectation, whereas the results in [14], [15] hold with high probability. The major difference (our contribution) is that all three prior works assumed (i) availability of $q(\cdot)$ in (8) and (ii) the possibility of sampling from it. Our work relaxes these two assumptions by constructing and sampling from $\widehat{q}(\cdot)$ (Algorithm 1).

Remark 4.
Given that Theorem 1 relates the generalization bound via $\mathrm{Er}(k)$ to the variance of the kernel approximation, methods for variance reduction in sampling the initial $M_0$ random features may be more effective than Monte Carlo. For example, Orthogonal Random Features (ORF) [25] is a potential technique for variance reduction in the approximation of the Gaussian kernel. In general, assuming more structure on the kernel (beyond the integral form (3)) can result in a more explicit error term $\mathrm{Er}(k)$ in the generalization bound, but pursuing this direction is outside the scope of this work.
IV. RELATED LITERATURE
Random features:
The idea of randomized features was proposed as an elegant technique for improving the computational efficiency of kernel methods [4]. As previously mentioned, a wide variety of kernels (of the form (3)) can be approximated using random features (e.g. shift-invariant kernels using Monte Carlo [4] or Quasi-Monte Carlo [26] sampling, and dot product kernels [19]). To further increase the efficiency with respect to the input dimension, a number of methods have been developed based on the properties of dense Gaussian random matrices (see e.g. Fastfood [27] and Structured Orthogonal Random Features [25]). These methods can decrease the time complexity by a factor of $O((\log d)/d)$. To study supervised learning, [28] showed that using $\ell_1$-regularization combined with randomized coordinate descent, random features can be made more efficient. More specifically, to achieve $\epsilon$ error on the risk, $O(1/\epsilon)$ random features are required, in contrast to $O(1/\epsilon^2)$ in the early work of [5]. In a similar spirit and more recently, [14] showed that to achieve $O(1/\sqrt{N})$ learning error in ridge regression, only $M = O(\sqrt{N} \log N)$ random features are required.

Data-dependent random features:
A number of recent works have focused on kernel approximation techniques based on data-dependent sampling of random features. Examples include [29] on compact nonlinear feature maps, [30], [31] on approximation of shift-invariant/translation-invariant kernels, [32] on the Stein effect in kernel approximation, and [33] on data-dependent approximation using greedy approaches (e.g. Frank-Wolfe).
Another line of research has focused on the generalization properties of data-dependent sampling. We discussed the unsupervised techniques based on leverage scores in the Introduction (e.g. [13]-[15]). On the other hand, there are supervised methods [10]-[12] with the goal of improving test error. [10] develop an optimization-based method to re-weight random features and sample important ones for better generalization. This method outperforms [5] in the experiments, but the theoretical bound still indicates the need for $O(N)$ features to achieve $O(1/\sqrt{N})$ learning error. In a similar fashion, [11] propose a (supervised) score function for resampling of random features. While effective in practice compared to prior works, the method does not have a theoretical generalization guarantee. [12] study data-dependent approximation of translation-invariant/rotation-invariant kernels with a focus on SVM. Their technique works by maximizing the kernel alignment in the Fourier domain. The theoretical bound is derived by applying no-regret learning to solve the SVM dual. We finally remark that recently [34] have provided an analysis of ELSS in the case of ridge regression. However, our results are valid for Lipschitz losses and the generalization bound is different. In particular, our bound depends on the eigenvalue decay of the (random) feature Gram matrix, highlighting the role of the data distribution.

Taylor (explicit) features:
Besides random features, explicit feature maps have also been used to speed up kernel methods. Cotter et al. [35] discuss the Taylor approximation of the Gaussian kernel in training SVM and provide empirical comparisons with random features. Low-dimensional Taylor approximation has also been addressed in [36], [37] for the Gaussian kernel, as well as in [38] for other practical kernels. Furthermore, the authors of [39] quantify the approximation error of additive homogeneous kernels. Finally, greedy approximation using explicit features has been discussed in [40]. In general, the experiments of [35] for the Gaussian kernel suggest that in the comparison of Taylor versus random features, neither clearly dominates the other, as the structure of the data indeed plays an important role in achieving a better fit.
Nyström method:
This work is also relevant to the Nyström method, which offers a data-dependent sampling scheme for kernel approximation [41], [42]. In this approach, we use a subset of the training data to approximate a surrogate kernel matrix, and then we transform the data points using the approximated kernel. Though it is data-dependent, the main difference between this line of research and our work is that we are concerned with learning good features for generalization.
V. EMPIRICAL EVALUATIONS
In this section, we provide numerical experiments on four datasets from the UCI Machine Learning Repository.
Benchmark algorithms:
We use the following methods in randomized kernel approximation as baselines:
1) RKS [5], with approximated Gaussian kernel: $\phi = \cos(\mathbf{x}^\top \boldsymbol{\omega}_m + b_m)$ in (4), where $\{\boldsymbol{\omega}_m\}_{m=1}^{M}$ are sampled from a Gaussian distribution, and $\{b_m\}_{m=1}^{M}$ are sampled from the uniform distribution on $[0, 2\pi)$.
2) LKRF [10], with approximated Gaussian kernel: $\phi = \cos(\mathbf{x}^\top \boldsymbol{\omega}_m + b_m)$. $M_0$ random features ($M_0 > M$) are sampled and re-weighted by solving a kernel alignment optimization. The top $M$ features are used in the training.
3) EERF [11], with approximated Gaussian kernel: $\phi = \cos(\mathbf{x}^\top \boldsymbol{\omega}_m + b_m)$; similar to LKRF, $M_0 > M$ initial random features are sampled and then re-weighted according to a score function. The top $M$ random features are used in the training.
Fig. 2: Comparison of the test error of Algorithm 1 and Algorithm 2 against the randomized-feature baselines RKS, LKRF, and EERF.
The selection of the baselines above allows us to evaluate the generalization performance of one data-independent method ([5]) and two supervised data-dependent methods ([10], [11]) for sampling random features, and compare them to ELSS, which is an unsupervised data-dependent method. It should be noted that LKRF and EERF learn a new kernel based on input-output pairs, but in view of (9), ELSS only performs importance sampling and does not change the kernel. To change the kernel (still in an unsupervised manner), we modify ELSS (Algorithm 1) to choose the top $M$ features (out of $M_0$) that have the largest weights, without actually sampling them; a sketch of this variant is given below. We present this as ELSS (Algorithm 2) in the experiments.
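A minimal sketch of this top-$M$ variant (our own illustration, reusing the hypothetical `elss_weights` helper from the earlier sketch):

```python
import numpy as np

def elss_top_m(X_sub, M, M0, sigma=1.0, lam=1.0, rng=None):
    """ELSS Algorithm 2: keep the M candidates with the largest empirical
    leverage weights instead of resampling from q_hat (this changes the kernel)."""
    (W, b), q_hat = elss_weights(X_sub, M0, sigma=sigma, lam=lam, rng=rng)
    top = np.argsort(q_hat)[-M:]          # indices of the top-M weights
    return W[top], b[top]

# The selected features are then used as ordinary (unweighted) random features.
```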
Practical considerations: The Python code of our paper is available on Github (https://github.com/.../ELSS; suppressed for double-blind review). Grid search was performed to obtain the optimal hyper-parameters of each method for each dataset. For instance, to determine the width of the Gaussian kernel $K(\mathbf{x}, \mathbf{x}') = \exp(-\|\mathbf{x} - \mathbf{x}'\|^2 / (2\sigma^2))$, we obtain the value of $\sigma$ for each dataset using grid search in $[1e{-}3, 1e{+}3]$. Notice that for randomized approaches, this amounts to sampling random features from $\sigma^{-1} \mathcal{N}(0, \mathbf{I}_d)$. Following the work of [11] and as a rule of thumb, we set $M_0 = 10M$ for all algorithms. Theorem 1 suggests that $N_0 > \Delta^{-2} M_0$; given that in our experiments $M_0$ can go up to 2000, even a moderate $\Delta$ would require an $N_0$ larger than the available training sets, so we simply use $N_0 = N$ for each dataset. The hyper-parameters of the optimization step in LKRF [10] are tuned and the best results are reported.
Datasets: In this work we used four datasets from the UCI Machine Learning Repository, namely the Year Prediction, Online News Popularity, Adult, and Epileptic Seizure Detection datasets, where the former two are regression tasks and the latter two are binary classification tasks. Table I tabulates the information for each dataset. If the training and test samples are not provided separately for a dataset, we split it randomly. We standardize the data by scaling the features to have zero mean and unit variance, and the responses in regression to lie inside $[-1, 1]$.
Fig. 3: The change in train and test error rates with respect to $\lambda$ for the Year Prediction dataset. Top and bottom rows show ELSS Algorithm 1 and Algorithm 2, respectively.

TABLE I: Input dimension, number of training samples, and number of test samples are denoted by $d$, $N_{\mathrm{train}}$, and $N_{\mathrm{test}}$, respectively.

Dataset                        Task            d    N_train  N_test
Year Prediction                Regression      90   46371    5163
Online News Popularity         Regression      58   26561    13083
Adult                          Classification  122  32561    16281
Epileptic Seizure Recognition  Classification  178  8625     2875
Performance:
The results on the datasets in Table I are reported in Figure 2. Each experiment was repeated several times and the average generalization performance (i.e., test accuracy/error) of each method is reported. It can be seen that ELSS Algorithm 1 performs better than RKS, which is data-independent. ELSS Algorithm 2 boosts the performance even further and brings it closer to that of the supervised data-dependent methods, i.e., EERF and LKRF, especially for larger values of $M$. It is interesting to observe that on Year Prediction and Seizure Detection, ELSS Algorithm 2, which changes the kernel in an unsupervised manner, outperforms LKRF.
Fig. 4: The results of few-shot learning with ELSS with $M = 25$ random features, logistic regression (Linear), and a perceptron with $M = 25$ latent nodes for the Seizure Detection and Adult datasets.

Sensitivity to $\lambda$: Naturally, a question arises regarding the sensitivity of the generalization performance of ELSS with respect to $\lambda$. From a theoretical point of view, and for shift-invariant kernels, one expects to see uniformly distributed weights (i.e., equivalent to RFF) for too large and too small values of $\lambda$, with a sweet spot in between. To confirm this, we calculated the train and test error of ELSS for the Year Prediction dataset and report the results in Figure 3. The top and bottom rows correspond to ELSS Algorithm 1 and Algorithm 2. Each experiment for each $\lambda$ was repeated 50 times and the means and standard deviations are reported.

Learning with Less Labels (LwLL):
The existing state-of-the-art machine learning models, and specifically deep learning architectures, are data hungry and require a large number of labeled samples for training. Learning with few labels has been a long-standing goal in the ML community. Semi-supervised learning, active learning, and more recently zero-shot, one-shot, and few-shot learning paradigms study different aspects of this problem. Here, we show that the unsupervised nature of ELSS enables us to perform efficient LwLL.

We consider the scenario in which we have lots of unlabeled data with only a few labeled samples as our training set. To that end, for the Seizure Detection and Adult datasets we use only $K$ labeled samples per class for training, for a range of small values of $K$. We then perform classification using Logistic Regression (LR), ELSS+LR, and a neural network (i.e., a perceptron). For ELSS we used $M = 25$ random features, and for the perceptron we used $M = 25$ latent neurons. We repeated the experiments for each classifier multiple times (with randomized sets of training samples) and measured the test accuracy. The means and standard deviations of the test accuracy of these models for different values of $K$ are depicted in Figure 4. Note that the comparison of ELSS+LR and LR serves as an ablation study and shows the benefit of our proposed approach. In addition, the comparison of ELSS+LR with the perceptron shows the benefit of our proposed method compared to neural networks in the LwLL setting.
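For the LwLL experiment, the pipeline is simply unsupervised ELSS feature extraction on the unlabeled pool followed by a linear classifier on the few labeled points. A hedged sketch of this pipeline follows (our own illustration with synthetic stand-in data, scikit-learn's LogisticRegression, and the hypothetical `elss_weights` helper from the earlier sketch):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: a large unlabeled pool and K labeled samples per class.
rng = np.random.default_rng(0)
X_unlabeled = rng.normal(size=(2000, 10))
X_few = rng.normal(size=(20, 10)); y_few = (X_few[:, 0] > 0).astype(int)
X_test = rng.normal(size=(500, 10)); y_test = (X_test[:, 0] > 0).astype(int)

# Unsupervised step: ELSS on the unlabeled pool (no labels used).
(W, b), q_hat = elss_weights(X_unlabeled, M0=250, sigma=1.0, lam=1e-2, rng=rng)
idx = rng.choice(len(q_hat), size=25, p=q_hat)   # M = 25 random features

def featurize(X):
    return np.sqrt(2.0 / 25) * np.cos(X @ W[idx].T + b[idx])

# Supervised step: a linear classifier trained on the few labeled samples.
clf = LogisticRegression().fit(featurize(X_few), y_few)
print(clf.score(featurize(X_test), y_test))
```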
Concluding remarks: A main distinction between leverage scores and the existing literature on data-dependent random feature generation is the unsupervised nature of ELSS. More interestingly, ELSS can provide generalization performance that is on par with supervised methods, which use input-output pairs for random feature generation, e.g., [10] and [11]. But why is it important to have an unsupervised data-dependent feature generator, specifically when the final task is supervised learning? The answer lies in the realm of learning with less labels (LwLL). In supervised LwLL, one cannot afford to train complex classifiers due to the lack of enough labeled data, while linear
classifiers generally perform poorly on "complex" datasets. In such scenarios, ELSS can leverage a large number of unlabeled data points and extract features that provide generalization performance similar to that of features extracted with full supervision. A linear classifier can then be trained on the extracted features with few labels.
VI. APPENDIX
A. Estimation error
We start with the definition of Rademacher complexity, which is quite standard, but we provide it for completeness.
Definition 2. (Rademacher complexity [43]) For a finite sample set $\{\mathbf{x}_n\}_{n=1}^{N}$, the empirical Rademacher complexity of a class $\mathcal{F}$ is defined as
$$\widehat{\mathfrak{R}}(\mathcal{F}) \triangleq \frac{1}{N}\, \mathbb{E}_{\boldsymbol{\sigma}} \left[ \sup_{f \in \mathcal{F}} \sum_{n=1}^{N} \sigma_n f(\mathbf{x}_n) \right],$$
where the expectation is taken over $\{\sigma_i\}_{i=1}^{N}$, which are independent samples uniformly distributed on the set $\{-1, 1\}$. The Rademacher complexity is then $\mathfrak{R}(\mathcal{F}) \triangleq \mathbb{E}_{dp}\big[\widehat{\mathfrak{R}}(\mathcal{F})\big]$.

B. The error of approximating $\tilde{\mathbf{G}}$ with finite data $N_0$

Next, we have the notion of spectral approximation, which will be used in the proof of our main result.
Definition 3. ($\Delta$-spectral approximation [16]) A matrix $\mathbf{A}$ is a $\Delta$-spectral approximation of another matrix $\mathbf{B}$ if the following relationship holds:
$$(1 - \Delta)\, \mathbf{B} \preceq \mathbf{A} \preceq (1 + \Delta)\, \mathbf{B}.$$

We now provide the following theorem by [16] on spectral approximation. Note that to avoid confusion, we re-write the theorem with the notation of this work. In particular, observe that in [16] the kernel matrix is defined for data with respect to random features (similar to $\mathbf{K}$ in (11)), whereas we state the result for $\mathbf{G}$, which is defined for random features with respect to data (the roles of random features and inputs are interchanged). For the sake of simplicity in presentation, we use $\Phi$ instead of $\Phi_{M_0,N_0}$.

Theorem 2. [16] Let $\Delta \in (0, 1]$ and $\delta \in (0, 1]$. Assume that $\|\mathbf{G}\| \geq \mu$. If we use $N_0 \geq \Delta^{-2} \frac{M_0}{\mu} \log \delta^{-1}$ random samples from $dp$, then $\Phi \Phi^\top + \mu \mathbf{I}$ is a $\Delta$-spectral approximation of $\mathbf{G} + \mu \mathbf{I}$ with probability at least $1 - \delta$. (We dropped an $o(M_0)$ term inside the logarithm argument above, which does not affect our result.)

We now use Theorem 2 to obtain the spectral approximation of the kernel matrix $\mathbf{G}$ in (12). Observe that $\mathbf{G}$ is normalized by its dimension (unlike [16]), which necessitates refining some parameters in the theorem above before applying it.

Lemma 3.
Let $\Delta \in (0, 1]$ and $\delta \in (0, 1]$. Given a fixed scalar $C$, let $\lambda = \frac{C}{N}$. Then, with probability at least $1 - \delta$, we have that $\left(\frac{\Phi \Phi^\top}{\lambda} + \mathbf{I}\right)^{-1}$ is a $2\Delta$-spectral approximation of $\left(\frac{\mathbf{G}}{\lambda} + \mathbf{I}\right)^{-1}$, when $N_0 \geq \Delta^{-2} M_0 \log \delta^{-1}$ and $M_0 = o(N)$.
Proof.
Observe that
$$\mathbb{E}\left[\frac{\Phi \Phi^\top}{\lambda}\right] = \frac{\mathbf{G}}{\lambda},$$
so we should apply Theorem 2 with $\mu = 1$. First, we should check the condition $\left\|\frac{\mathbf{G}}{\lambda}\right\| \geq 1$. Since
$$M_0 \mathbf{G} = \begin{bmatrix} g(\boldsymbol{\omega}_1, \boldsymbol{\omega}_1) & \cdots & g(\boldsymbol{\omega}_1, \boldsymbol{\omega}_{M_0}) \\ \vdots & \ddots & \vdots \\ g(\boldsymbol{\omega}_{M_0}, \boldsymbol{\omega}_1) & \cdots & g(\boldsymbol{\omega}_{M_0}, \boldsymbol{\omega}_{M_0}) \end{bmatrix},$$
and the right-hand side is a Gram matrix, it has a positive norm $\kappa$ independent of $N$. We should then verify $\kappa \geq \lambda M_0 = \frac{C M_0}{N}$, which holds since $M_0 = o(N)$. Now with $\mu = 1$, we need $N_0 \geq \Delta^{-2} M_0 \log \delta^{-1}$ samples to have
$$(1 - \Delta)\left(\frac{\mathbf{G}}{\lambda} + \mathbf{I}\right) \preceq \left(\frac{\Phi \Phi^\top}{\lambda} + \mathbf{I}\right) \preceq (1 + \Delta)\left(\frac{\mathbf{G}}{\lambda} + \mathbf{I}\right),$$
which by simple algebra implies
$$(1 - 2\Delta)\left(\frac{\mathbf{G}}{\lambda} + \mathbf{I}\right)^{-1} \preceq \left(\frac{\Phi \Phi^\top}{\lambda} + \mathbf{I}\right)^{-1} \preceq (1 + 2\Delta)\left(\frac{\mathbf{G}}{\lambda} + \mathbf{I}\right)^{-1},$$
when $\Delta \in (0, \frac{1}{2}]$. $\blacksquare$
Lemma 4. Let $\Delta \in (0, 1]$ and $\delta \in (0, 1]$. Given a fixed scalar $C$, let $\lambda = \frac{C}{N}$. Recall the definitions of $\tilde{\mathbf{G}}$ and $\mathbf{Q}$ in (13) and (14), respectively. Then, with probability at least $1 - \delta$ (over $N_0$ data points), we have that
$$\sum_{i=1}^{M_0} \left| [\mathbf{Q}]_{ii} - [\tilde{\mathbf{G}}]_{ii} \right| \leq 2\Delta \sum_{i=1}^{M_0} \frac{\lambda}{\lambda + \sigma_i(\mathbf{G})},$$
as long as $N_0 \geq \Delta^{-2} M_0 \log \delta^{-1}$ and $M_0 = o(N)$.

Proof. Let us start with the fact that
$$\tilde{\mathbf{G}} = \mathbf{G}(\mathbf{G} + \lambda \mathbf{I})^{-1} = \mathbf{I} - \left(\frac{\mathbf{G}}{\lambda} + \mathbf{I}\right)^{-1},$$
and
$$\mathbf{Q} = \Phi\Phi^\top \left(\Phi\Phi^\top + \lambda \mathbf{I}\right)^{-1} = \mathbf{I} - \left(\frac{\Phi\Phi^\top}{\lambda} + \mathbf{I}\right)^{-1}.$$
Therefore, since
$$\tilde{\mathbf{G}} - \mathbf{Q} = \left(\frac{\Phi\Phi^\top}{\lambda} + \mathbf{I}\right)^{-1} - \left(\frac{\mathbf{G}}{\lambda} + \mathbf{I}\right)^{-1},$$
due to Lemma 3, we derive
$$-2\Delta \left(\frac{\mathbf{G}}{\lambda} + \mathbf{I}\right)^{-1} \preceq \tilde{\mathbf{G}} - \mathbf{Q} \preceq 2\Delta \left(\frac{\mathbf{G}}{\lambda} + \mathbf{I}\right)^{-1}, \qquad (17)$$
entailing
$$\sum_{i=1}^{M_0} \left| [\mathbf{Q}]_{ii} - [\tilde{\mathbf{G}}]_{ii} \right| \leq 2\Delta\, \mathrm{Tr}\left[\left(\frac{\mathbf{G}}{\lambda} + \mathbf{I}\right)^{-1}\right] = 2\Delta \sum_{i=1}^{M_0} \frac{\lambda}{\lambda + \sigma_i(\mathbf{G})},$$
which finishes the proof. $\blacksquare$

The above lemma is used to bound the total variation distance between $\widehat{q}$ and $q$, as probability mass functions over $\{\tilde{\boldsymbol{\omega}}_m\}_{m=1}^{M_0}$. Note that from Section 4.2 of [13], the leverage score is $s(\tilde{\boldsymbol{\omega}}_i) = [\tilde{\mathbf{G}}]_{ii}$ for $i \in [M_0]$. As a result, the optimized distribution $q(\boldsymbol{\omega})$ with respect to the uniform measure on $\{\tilde{\boldsymbol{\omega}}_m\}_{m=1}^{M_0}$ is
$$q(\tilde{\boldsymbol{\omega}}_i) = \frac{[\tilde{\mathbf{G}}]_{ii}}{\mathrm{Tr}[\tilde{\mathbf{G}}]}, \quad \text{for all } i \in [M_0], \qquad (18)$$
and from Algorithm 1 recall that
$$\widehat{q}(\tilde{\boldsymbol{\omega}}_i) = \frac{[\mathbf{Q}]_{ii}}{\mathrm{Tr}[\mathbf{Q}]}, \quad \text{for all } i \in [M_0].$$

Corollary 5.
Let $\Delta \in (0, 1]$ and $\delta \in (0, 1]$. Given a fixed scalar $C$, let $\lambda = \frac{C}{N}$. Recall the definition of $\mathbf{G}$ in (12). Then, with probability at least $1 - \delta$ (over $N_0$ data points), we have that
$$\sum_{m=1}^{M_0} |q(\tilde{\boldsymbol{\omega}}_m) - \widehat{q}(\tilde{\boldsymbol{\omega}}_m)| \leq 4\Delta\, \frac{\sum_{m=1}^{M_0} \frac{\lambda}{\lambda + \sigma_m(\mathbf{G})}}{\sum_{m=1}^{M_0} \frac{\sigma_m(\mathbf{G})}{\lambda + \sigma_m(\mathbf{G})}},$$
as long as $N_0 \geq \Delta^{-2} M_0 \log \delta^{-1}$ and $M_0 = o(N)$.

Proof. Let us start with
$$\sum_{m=1}^{M_0} |q(\tilde{\boldsymbol{\omega}}_m) - \widehat{q}(\tilde{\boldsymbol{\omega}}_m)| = \sum_{i=1}^{M_0} \left| \frac{[\mathbf{Q}]_{ii}}{\mathrm{Tr}[\mathbf{Q}]} - \frac{[\tilde{\mathbf{G}}]_{ii}}{\mathrm{Tr}[\tilde{\mathbf{G}}]} \right| \leq \sum_{i=1}^{M_0} \left| \frac{[\tilde{\mathbf{G}}]_{ii}}{\mathrm{Tr}[\tilde{\mathbf{G}}]} - \frac{[\mathbf{Q}]_{ii}}{\mathrm{Tr}[\tilde{\mathbf{G}}]} \right| + \sum_{i=1}^{M_0} \left| \frac{[\mathbf{Q}]_{ii}}{\mathrm{Tr}[\mathbf{Q}]} - \frac{[\mathbf{Q}]_{ii}}{\mathrm{Tr}[\tilde{\mathbf{G}}]} \right|.$$
Since $[\mathbf{Q}]_{ii} \geq 0$, the second term in the bound above simplifies to
$$\sum_{i=1}^{M_0} [\mathbf{Q}]_{ii} \left| \frac{1}{\mathrm{Tr}[\mathbf{Q}]} - \frac{1}{\mathrm{Tr}[\tilde{\mathbf{G}}]} \right| = \left| \frac{\mathrm{Tr}[\mathbf{Q}] - \mathrm{Tr}[\tilde{\mathbf{G}}]}{\mathrm{Tr}[\tilde{\mathbf{G}}]} \right|,$$
which is smaller than the first term. Thus, we get
$$\sum_{m=1}^{M_0} |q(\tilde{\boldsymbol{\omega}}_m) - \widehat{q}(\tilde{\boldsymbol{\omega}}_m)| \leq 2 \sum_{i=1}^{M_0} \left| \frac{[\tilde{\mathbf{G}}]_{ii}}{\mathrm{Tr}[\tilde{\mathbf{G}}]} - \frac{[\mathbf{Q}]_{ii}}{\mathrm{Tr}[\tilde{\mathbf{G}}]} \right| \leq \frac{4\Delta}{\mathrm{Tr}[\tilde{\mathbf{G}}]} \sum_{i=1}^{M_0} \frac{\lambda}{\lambda + \sigma_i(\mathbf{G})},$$
where we applied Lemma 4. Observing that
$$\mathrm{Tr}[\tilde{\mathbf{G}}] = \sum_{m=1}^{M_0} \frac{\sigma_m(\mathbf{G})}{\lambda + \sigma_m(\mathbf{G})},$$
and plugging it into the bound completes the proof. $\blacksquare$

C. The error of approximation using $M$ random features out of $M_0$

For bounding the error caused by selecting $M$ random features out of the $M_0$ possible samples $\{\tilde{\boldsymbol{\omega}}_m\}_{m=1}^{M_0}$, we use the approximation error result of [13], adapted to our notation following the subsequent definitions. For any probability mass function $q$, we can define the following class of functions:
$$\widehat{\mathcal{F}}_q \triangleq \left\{ f(\cdot) = \sum_{m=1}^{M} \theta_m \frac{\phi(\cdot, \boldsymbol{\omega}_m)}{\sqrt{q(\boldsymbol{\omega}_m)}} : \boldsymbol{\theta} \in \mathbb{R}^M \right\}. \qquad (19)$$
Also, let $\widehat{k}_{M_0}(\mathbf{x}, \mathbf{x}') \triangleq \langle \phi_{M_0}(\mathbf{x}), \phi_{M_0}(\mathbf{x}') \rangle$ be the kernel approximated using the $M_0$ random features sampled from $d\tau$. Then,
$$\mathcal{F}_{\widehat{k}_{M_0}} \triangleq \left\{ f(\cdot) = \sum_{n=1}^{N} \alpha_n \widehat{k}_{M_0}(\mathbf{x}_n, \cdot) : \boldsymbol{\alpha} \in \mathbb{R}^N \right\}. \qquad (20)$$
The following result of [13] characterizes the (minimum) distance between these two classes.

Proposition 6. (Approximation of the unit ball of $\mathcal{F}_{\widehat{k}_{M_0}}$ for the optimized distribution [13]) Let $\lambda > 0$ and consider the distribution with density $q(\boldsymbol{\omega})$ defined in (18) with respect to $d\widehat{\tau}$ (the uniform measure on $\{\tilde{\boldsymbol{\omega}}_m\}_{m=1}^{M_0}$). Let $\{\boldsymbol{\omega}_m\}_{m=1}^{M}$ be sampled i.i.d. from the density $q(\boldsymbol{\omega})$, defining the kernel $\frac{1}{M} \sum_{m=1}^{M} q^{-1}(\boldsymbol{\omega}_m) \phi(\mathbf{x}, \boldsymbol{\omega}_m) \phi(\mathbf{x}', \boldsymbol{\omega}_m)$ and its associated RKHS $\widehat{\mathcal{F}}_q$ in (19).
Then, for any $\delta \in (0, 1)$, with probability $1 - \delta$ with respect to the samples $\{\boldsymbol{\omega}_m\}_{m=1}^{M}$, we have
$$\sup_{\|f\|_{\mathcal{F}_{\widehat{k}_{M_0}}} \leq 1}\ \inf_{\|\widehat{f}\|_{\widehat{\mathcal{F}}_q} \leq 2} \left\| f - \widehat{f} \right\|_{L_2(dp, \mathcal{X})}^2 \leq 4\lambda, \qquad (21)$$
if $M \geq 5 \deg_\lambda(\widehat{\Sigma}) \log \frac{16 \deg_\lambda(\widehat{\Sigma})}{\delta}$, where $\deg_\lambda(\widehat{\Sigma})$ is defined in Definition 1, and $\widehat{\Sigma}$ is the integral operator defined with respect to $\widehat{k}_{M_0}(\mathbf{x}, \mathbf{x}')$.

We remark that in [13], the result above has been stated for comparing $\widehat{\mathcal{F}}_q$ with $\mathcal{F}_k$ under the assumption of denseness of $\mathcal{F}_k$ in $L_2(dp, \mathcal{X})$, to avoid zero eigenvalues of the operator. However, as mentioned in Section 2.1 of [13], this assumption can be relaxed (one can generate a sequence of nonzero positive numbers that sum to an infinitesimal number). Since our base class is derived from the uniform measure on $\{\tilde{\boldsymbol{\omega}}_m\}_{m=1}^{M_0}$ (rather than the whole of $d\tau$), we can only compare $\widehat{\mathcal{F}}_q$ with $\mathcal{F}_{\widehat{k}_{M_0}}$ using [13]. In the next section, we compare $\mathcal{F}_{\widehat{k}_{M_0}}$ with $\mathcal{F}_k$.
D. The approximation error of sampling $M_0$ random features from $d\tau$

Lemma 7.
Recall from (1) that
$$\mathcal{F}_k \triangleq \left\{ f(\cdot) = \sum_{n=1}^{N} \alpha_n k(\mathbf{x}_n, \cdot) : \boldsymbol{\alpha} \in \mathbb{R}^N \right\},$$
and from (20) that
$$\mathcal{F}_{\widehat{k}_{M_0}} \triangleq \left\{ f(\cdot) = \sum_{n=1}^{N} \alpha_n \widehat{k}_{M_0}(\mathbf{x}_n, \cdot) : \boldsymbol{\alpha} \in \mathbb{R}^N \right\}.$$
Then, for any $g_{\boldsymbol{\alpha}} \in \mathcal{F}_k$ and $\widehat{g}_{\boldsymbol{\alpha}} \in \mathcal{F}_{\widehat{k}_{M_0}}$ such that $\|\boldsymbol{\alpha}\|^2 \leq \frac{F^2}{N}$, we have
$$\mathbb{E} \left\| g_{\boldsymbol{\alpha}} - \widehat{g}_{\boldsymbol{\alpha}} \right\|_{L_2(dp, \mathcal{X})} \leq \sqrt{F^2\, \mathbb{E}_{dp, dp'} \left[ \mathrm{var}_{d\tau} \left( \widehat{k}_{M_0}(\mathbf{x}, \mathbf{x}') \right) \right]},$$
where the expectation on the left-hand side is over data and random features, and the variance on the right-hand side is over random features.

Proof. For any $g_{\boldsymbol{\alpha}} \in \mathcal{F}_k$ and $\widehat{g}_{\boldsymbol{\alpha}} \in \mathcal{F}_{\widehat{k}_{M_0}}$, we have
$$\|g_{\boldsymbol{\alpha}} - \widehat{g}_{\boldsymbol{\alpha}}\|_{L_2(dp, \mathcal{X})} = \left\| \sum_{n=1}^{N} \alpha_n k(\mathbf{x}_n, \mathbf{x}) - \sum_{n=1}^{N} \alpha_n \widehat{k}_{M_0}(\mathbf{x}_n, \mathbf{x}) \right\|_{L_2(dp, \mathcal{X})}.$$
Let $\mathbf{e}(\mathbf{x}) = [e_1(\mathbf{x}), \ldots, e_N(\mathbf{x})]^\top$, where for $n \in [N]$,
$$e_n(\mathbf{x}) \triangleq k(\mathbf{x}_n, \mathbf{x}) - \widehat{k}_{M_0}(\mathbf{x}_n, \mathbf{x}).$$
Then, since $\boldsymbol{\alpha}^\top \mathbf{e} \leq \|\boldsymbol{\alpha}\| \|\mathbf{e}\|$, for any $\|\boldsymbol{\alpha}\|^2 \leq \frac{F^2}{N}$ we get
$$\|g_{\boldsymbol{\alpha}} - \widehat{g}_{\boldsymbol{\alpha}}\|_{L_2(dp, \mathcal{X})} = \left\| \boldsymbol{\alpha}^\top \mathbf{e}(\mathbf{x}) \right\|_{L_2(dp, \mathcal{X})} \leq \sqrt{\mathbb{E}_{dp} \left[ \|\boldsymbol{\alpha}\|^2 \|\mathbf{e}(\mathbf{x})\|^2 \right]} \leq \sqrt{\frac{F^2}{N}\, \mathbb{E}_{dp} \|\mathbf{e}(\mathbf{x})\|^2}.$$
Observe that $\mathbb{E}_{d\tau}[e_n(\mathbf{x})] = 0$ and so $\mathbb{E}_{d\tau}[e_n^2(\mathbf{x})] = \mathrm{var}_{d\tau}\big(\widehat{k}_{M_0}(\mathbf{x}_n, \mathbf{x})\big)$. Taking expectation with respect to $d\tau$ above and using Jensen's inequality, we get
$$\mathbb{E}_{d\tau} \|g_{\boldsymbol{\alpha}} - \widehat{g}_{\boldsymbol{\alpha}}\|_{L_2(dp, \mathcal{X})} \leq \sqrt{\frac{F^2}{N}\, \mathbb{E}_{dp} \sum_{n=1}^{N} \mathrm{var}_{d\tau}\big(\widehat{k}_{M_0}(\mathbf{x}_n, \mathbf{x})\big)}.$$
Noting that $\{\mathbf{x}_n\}_{n=1}^{N}$ are i.i.d. and taking another expectation with respect to the randomness of the data, we have
$$\mathbb{E} \|g_{\boldsymbol{\alpha}} - \widehat{g}_{\boldsymbol{\alpha}}\|_{L_2(dp, \mathcal{X})} \leq \sqrt{\frac{F^2}{N}\, \mathbb{E}_{dp} \mathbb{E}_{dp'} \sum_{n=1}^{N} \mathrm{var}_{d\tau}\big(\widehat{k}_{M_0}(\mathbf{x}', \mathbf{x})\big)} = \sqrt{F^2\, \mathbb{E}_{dp, dp'} \left[ \mathrm{var}_{d\tau} \left( \widehat{k}_{M_0}(\mathbf{x}, \mathbf{x}') \right) \right]},$$
which completes the proof. $\blacksquare$
E. Proof of Theorem 1
Recall the definition of $\widehat{\mathcal{F}}_{\widehat{q}}$ in (16) and of $\widehat{\mathcal{F}}_q$ in (19). Let us define
$$\widehat{f}_{\widehat{\boldsymbol{\theta}}} \triangleq \operatorname*{argmin}_{f \in \widehat{\mathcal{F}}_{\widehat{q}} :\ \|\boldsymbol{\theta}\| \leq \frac{F}{\sqrt{M}}} \widehat{R}(f).$$
For any $\|\boldsymbol{\beta}\|^2 \leq \frac{F^2}{M}$, let $\widehat{f}_{\boldsymbol{\beta}}$ be another function in $\widehat{\mathcal{F}}_{\widehat{q}}$ and $f_{\boldsymbol{\beta}} \in \widehat{\mathcal{F}}_q$. We now have for any $g_{\boldsymbol{\gamma}} \in \mathcal{F}_k$ and $\widehat{g}_{\boldsymbol{\alpha}} \in \mathcal{F}_{\widehat{k}_{M_0}}$ that
$$\begin{aligned} R(\widehat{f}_{\widehat{\boldsymbol{\theta}}}) &= R(\widehat{f}_{\widehat{\boldsymbol{\theta}}}) - \widehat{R}(\widehat{f}_{\widehat{\boldsymbol{\theta}}}) + \widehat{R}(\widehat{f}_{\widehat{\boldsymbol{\theta}}}) \leq R(\widehat{f}_{\widehat{\boldsymbol{\theta}}}) - \widehat{R}(\widehat{f}_{\widehat{\boldsymbol{\theta}}}) + \widehat{R}(\widehat{f}_{\boldsymbol{\beta}}) \\ &= \underbrace{R(\widehat{f}_{\widehat{\boldsymbol{\theta}}}) - \widehat{R}(\widehat{f}_{\widehat{\boldsymbol{\theta}}}) + \sup_{\|\boldsymbol{\beta}\|^2 \leq \frac{F^2}{M}} \big[ \widehat{R}(\widehat{f}_{\boldsymbol{\beta}}) - R(\widehat{f}_{\boldsymbol{\beta}}) \big]}_{e_1} + \underbrace{\sup_{\|\boldsymbol{\beta}\|^2 \leq \frac{F^2}{M}} \big[ R(\widehat{f}_{\boldsymbol{\beta}}) - R(f_{\boldsymbol{\beta}}) \big]}_{e_2} \\ &\quad + \underbrace{\sup_{\boldsymbol{\alpha}^\top \mathbf{K} \boldsymbol{\alpha} \leq \frac{F^2}{N}}\ \inf_{\|\boldsymbol{\beta}\|^2 \leq \frac{F^2}{M}} \big[ R(f_{\boldsymbol{\beta}}) - R(\widehat{g}_{\boldsymbol{\alpha}}) \big]}_{e_3} + \underbrace{\sup_{\|\boldsymbol{\gamma}\|^2 \leq \frac{F^2}{N}}\ \inf_{\|\boldsymbol{\alpha}\|^2 \leq \frac{F^2}{N}} \big[ R(\widehat{g}_{\boldsymbol{\alpha}}) - R(g_{\boldsymbol{\gamma}}) \big]}_{e_4} + \inf_{\|\boldsymbol{\gamma}\|^2 \leq \frac{F^2}{N}} R(g_{\boldsymbol{\gamma}}). \qquad (22) \end{aligned}$$
The rest of the proof follows by bounding the terms above.

Bounding $e_1$: Standard arguments for the Rademacher complexity of kernel classes (see e.g. Lemma 22 in [43]) combined with Assumption 1 imply
$$e_1 \leq \frac{4GF}{\sqrt{M}\, N} \sqrt{\sum_{n=1}^{N} \sum_{m=1}^{M} \frac{\phi^2(\mathbf{x}_n, \boldsymbol{\omega}_m)}{\widehat{q}(\boldsymbol{\omega}_m)}}.$$
Since we sample $M_0$ random features $\{\tilde{\boldsymbol{\omega}}_m\}_{m=1}^{M_0}$ according to $d\tau$, it is useful to define the following measure
$$d\widehat{\tau}(\boldsymbol{\omega}) \triangleq \frac{1}{M_0} \sum_{m=1}^{M_0} \delta(\boldsymbol{\omega} - \tilde{\boldsymbol{\omega}}_m), \qquad (23)$$
where $\delta(\cdot)$ is the Dirac delta function. Taking expectation with respect to the measure $\widehat{q}(\boldsymbol{\omega})\, d\widehat{\tau}(\boldsymbol{\omega})$ and using Jensen's inequality, we get
$$\mathbb{E}_{\widehat{q} d\widehat{\tau}}[e_1] \leq \frac{4GF}{N} \sqrt{\frac{1}{M_0} \sum_{n=1}^{N} \sum_{m=1}^{M_0} \phi^2(\mathbf{x}_n, \tilde{\boldsymbol{\omega}}_m)}.$$
Let us define $C \triangleq \mathbb{E}_{dp}[k(\mathbf{x}, \mathbf{x})]$. Taking expectation with respect to both $d\tau$ (from which the $\tilde{\boldsymbol{\omega}}_m$'s are sampled) and $dp(\mathbf{x})$, we have
$$\mathbb{E}[e_1] \leq \frac{4GF\sqrt{C}}{\sqrt{N}}, \qquad (24)$$
where we applied Jensen's inequality again.
Bounding $e_2$: To bound $e_2$, we start from Assumption 1 (the $G$-Lipschitz loss) to get
$$R(\widehat{f}_{\boldsymbol{\beta}}) - R(f_{\boldsymbol{\beta}}) \leq G \big\| \widehat{f}_{\boldsymbol{\beta}} - f_{\boldsymbol{\beta}} \big\|_{L_2(dp, \mathcal{X})} = G \left\| \sum_{m=1}^{M} \beta_m \phi(\cdot, \boldsymbol{\omega}_m) \left( \frac{1}{\sqrt{\widehat{q}(\boldsymbol{\omega}_m)}} - \frac{1}{\sqrt{q(\boldsymbol{\omega}_m)}} \right) \right\|_{L_2(dp, \mathcal{X})} \leq G \sqrt{\frac{F^2}{M} \sum_{m=1}^{M} \left( \frac{1}{\sqrt{\widehat{q}(\boldsymbol{\omega}_m)}} - \frac{1}{\sqrt{q(\boldsymbol{\omega}_m)}} \right)^2}, \qquad (25)$$
where the last step follows from $\|\boldsymbol{\beta}\|^2 \leq \frac{F^2}{M}$ and the fact that $\sup_{\mathbf{x}, \boldsymbol{\omega}} |\phi(\mathbf{x}, \boldsymbol{\omega})| \leq 1$ (Assumption 2). Taking expectation with respect to $\widehat{q}(\boldsymbol{\omega})\, d\widehat{\tau}(\boldsymbol{\omega})$, we have by Jensen's inequality that
$$\mathbb{E}_{\widehat{q} d\widehat{\tau}}[e_2] \leq G \sqrt{\frac{F^2}{M_0} \sum_{m=1}^{M_0} \frac{\left( \sqrt{q(\tilde{\boldsymbol{\omega}}_m)} - \sqrt{\widehat{q}(\tilde{\boldsymbol{\omega}}_m)} \right)^2}{q(\tilde{\boldsymbol{\omega}}_m)}} \leq G \sqrt{\frac{F^2}{M_0} \sum_{m=1}^{M_0} \frac{|q(\tilde{\boldsymbol{\omega}}_m) - \widehat{q}(\tilde{\boldsymbol{\omega}}_m)|}{q(\tilde{\boldsymbol{\omega}}_m)}}, \qquad (26)$$
where the last step follows from the simple inequality $|\sqrt{a} - \sqrt{b}| \leq \sqrt{|a - b|}$ for $a, b \geq 0$.

Now, let $\mathbf{p}_i$ be the $i$-th standard unit vector in $\mathbb{R}^{M_0}$. From the relationship (18), we can conclude that for any $i \in [M_0]$,
$$q(\tilde{\boldsymbol{\omega}}_i) = \frac{[\tilde{\mathbf{G}}]_{ii}}{\mathrm{Tr}[\tilde{\mathbf{G}}]} = \frac{\mathbf{p}_i^\top \tilde{\mathbf{G}} \mathbf{p}_i}{\mathrm{Tr}[\tilde{\mathbf{G}}]} \geq \frac{\sigma_{M_0}(\tilde{\mathbf{G}})}{\mathrm{Tr}[\tilde{\mathbf{G}}]},$$
which allows us to simplify (26) to get
$$\mathbb{E}_{\widehat{q} d\widehat{\tau}}[e_2] \leq G \sqrt{\frac{F^2\, \mathrm{Tr}[\tilde{\mathbf{G}}]}{M_0\, \sigma_{M_0}(\tilde{\mathbf{G}})} \sum_{m=1}^{M_0} |q(\tilde{\boldsymbol{\omega}}_m) - \widehat{q}(\tilde{\boldsymbol{\omega}}_m)|},$$
which, combined with Corollary 5, results in
$$\mathbb{E}_{\widehat{q} d\widehat{\tau}}[e_2] \leq G \sqrt{\frac{4\Delta F^2}{M_0\, \sigma_{M_0}(\tilde{\mathbf{G}})} \sum_{m=1}^{M_0} \frac{\lambda}{\lambda + \sigma_m(\mathbf{G})}} \leq G \sqrt{\frac{4\Delta F^2}{\sigma_{M_0}(\tilde{\mathbf{G}})} \cdot \frac{\lambda}{\lambda + \sigma_{M_0}(\mathbf{G})}} = 2G \sqrt{\frac{\Delta F^2 \lambda}{\sigma_{M_0}(\mathbf{G})}},$$
with probability $1 - \delta$ (over the $N_0$ data samples that are sub-sampled out of $N$ in Algorithm 1). Assuming $\frac{F^2 \lambda}{\sigma_{M_0}(\mathbf{G})} < 1$ and letting $\delta = 4\Delta \frac{F^2 \lambda}{\sigma_{M_0}(\mathbf{G})}$, the in-expectation bound over the data is easily obtained, and we have
$$\mathbb{E}_{dp} \mathbb{E}_{\widehat{q} d\widehat{\tau}}[e_2] \leq 4G \sqrt{\frac{\Delta F^2 \lambda}{\sigma_{M_0}(\mathbf{G})}}.$$
Finally, we take expectation with respect to $d\tau$ to get
$$\mathbb{E}[e_2] \leq 4G \sqrt{\Delta F^2\, \mathbb{E}_{d\tau} \left[ \frac{\lambda}{\sigma_{M_0}(\mathbf{G})} \right]}, \qquad (27)$$
as long as $N_0 \geq \Delta^{-2} M_0 \log \delta^{-1}$ and $M_0 = o(N)$.

Bounding $e_3$: We can use the $G$-Lipschitz continuity and apply the in-expectation version of Proposition 6 to get
$$\mathbb{E}[e_3] \leq 2G \sqrt{\lambda}. \qquad (28)$$
We should note that Proposition 6 is stated with respect to the measure $q\, d\widehat{\tau}$, while we generate the samples in the algorithm from $\widehat{q}\, d\widehat{\tau}$. This can cause an additional error in (28), which is of the order of the total variation distance $\sum_{m=1}^{M_0} |q(\tilde{\boldsymbol{\omega}}_m) - \widehat{q}(\tilde{\boldsymbol{\omega}}_m)|$. However, we can safely disregard this error term, as we have already bounded a larger error (in order terms) when bounding $e_2$ in equation (26).

Bounding $e_4$: First, notice the change of feasible set for $\boldsymbol{\alpha}$ from $e_3$ to $e_4$.
Since $|k(\cdot, \cdot)| \leq 1$,
$$\forall \boldsymbol{\alpha} :\ \|\boldsymbol{\alpha}\|^2 \leq \frac{F^2}{N} \implies \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j k(\mathbf{x}_i, \mathbf{x}_j) \leq N \|\boldsymbol{\alpha}\|^2 \leq F^2,$$
so the feasible set involved in $e_4$ is a subset of the one in $e_3$, and since we are taking an infimum, this can only loosen the bound. Since
$$R(\widehat{g}_{\boldsymbol{\alpha}}) - R(g_{\boldsymbol{\gamma}}) \leq G \|g_{\boldsymbol{\gamma}} - \widehat{g}_{\boldsymbol{\alpha}}\|_{L_2(dp, \mathcal{X})},$$
we have
$$e_4 \leq G \sup_{\|\boldsymbol{\gamma}\|^2 \leq \frac{F^2}{N}}\ \inf_{\|\boldsymbol{\alpha}\|^2 \leq \frac{F^2}{N}} \left[ \|g_{\boldsymbol{\gamma}} - \widehat{g}_{\boldsymbol{\alpha}}\|_{L_2(dp, \mathcal{X})} \right] \leq G \sup_{\|\boldsymbol{\gamma}\|^2 \leq \frac{F^2}{N}} \left[ \|g_{\boldsymbol{\gamma}} - \widehat{g}_{\boldsymbol{\gamma}}\|_{L_2(dp, \mathcal{X})} \right].$$
Taking expectation and applying Lemma 7, we obtain
$$\mathbb{E}[e_4] \leq G \sqrt{F^2\, \mathbb{E}_{dp, dp'} \left[ \mathrm{var}_{d\tau} \left( \widehat{k}_{M_0}(\mathbf{x}, \mathbf{x}') \right) \right]}. \qquad (29)$$
Plugging (24), (27), (28), and (29) into (22) completes the proof.R
EFERENCES[1] J. Friedman, T. Hastie, and R. Tibshirani,
The elements of statistical learning . Springer series in statistics New York, NY, USA:, 2001,vol. 1.[2] A. J. Smola and B. Sch¨okopf, “Sparse greedy matrix approximation for machine learning,” in
Proceedings of the Seventeenth InternationalConference on Machine Learning , 2000, pp. 911–918.[3] S. Fine and K. Scheinberg, “Efficient SVM training using low-rank kernel representations,”
Journal of Machine Learning Research , vol. 2,no. Dec, pp. 243–264, 2001.[4] A. Rahimi and B. Recht, “Random features for large-scale kernel machines.” in
Advances in Neural Information Processing Systems , 2007.[5] ——, “Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning,” in
Advances in Neural InformationProcessing Systems , 2009, pp. 1313–1320.[6] S. Si, K.-Y. Chiang, C.-J. Hsieh, N. Rao, and I. S. Dhillon, “Goal-directed inductive matrix completion,” in
Proceedings of the 22nd ACMSIGKDD International Conference on Knowledge Discovery and Data Mining . ACM, 2016, pp. 1165–1174.[7] D. Lopez-Paz, P. Hennig, and B. Sch¨olkopf, “The randomized dependence coefficient,” in
Advances in neural information processingsystems , 2013, pp. 1–9.[8] L. Carratino, A. Rudi, and L. Rosasco, “Learning with sgd and random features,” in
Advances in Neural Information Processing Systems ,2018, pp. 10 212–10 223.[9] T. Yang, Y.-F. Li, M. Mahdavi, R. Jin, and Z.-H. Zhou, “Nystr¨om method vs random fourier features: A theoretical and empirical comparison,”in
Advances in Neural Information Processing Systems , 2012, pp. 476–484.[10] A. Sinha and J. C. Duchi, “Learning kernels with random features,” in
Advances In Neural Information Processing Systems , 2016, pp.1298–1306.[11] S. Shahrampour, A. Beirami, and V. Tarokh, “On data-dependent random features for improved generalization in supervised learning,” in
AAAI Conference on Artificial Intelligence , 2018.[12] B. Bullins, C. Zhang, and Y. Zhang, “Not-so-random features,”
International Conference on Learning Representations , 2018.[13] F. Bach, “On the equivalence between kernel quadrature rules and random feature expansions,”
Journal of Machine Learning Research ,vol. 18, no. 21, pp. 1–38, 2017.[14] A. Rudi and L. Rosasco, “Generalization properties of learning with random features,” in
Advances in Neural Information ProcessingSystems , 2017, pp. 3218–3228.[15] Y. Sun, A. Gilbert, and A. Tewari, “But how does it work in theory? linear svm with random features,” in
Advances in Neural InformationProcessing Systems , 2018, pp. 3383–3392.[16] H. Avron, M. Kapralov, C. Musco, C. Musco, A. Velingker, and A. Zandieh, “Random fourier features for kernel ridge regression:Approximation bounds and statistical guarantees,” in
Proceedings of the 34th International Conference on Machine Learning-Volume 70 ,2017, pp. 253–262.[17] J. A. Tropp et al. , “An introduction to matrix concentration inequalities,”
Foundations and Trends R (cid:13) in Machine Learning , vol. 8, no. 1-2,pp. 1–230, 2015. March 21, 2019 DRAFT3 [18] M. L. Braun et al. , “Spectral properties of the kernel matrix and their relation to kernel methods in machine learning,” Ph.D. dissertation,Universit¨ats-und Landesbibliothek Bonn, 2005.[19] P. Kar and H. Karnick, “Random feature maps for dot product kernels,” in
International conference on Artificial Intelligence and Statistics ,2012, pp. 583–591.[20] J. Yang, V. Sindhwani, Q. Fan, H. Avron, and M. W. Mahoney, “Random laplace feature maps for semigroup kernels on histograms,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , 2014, pp. 971–978.[21] Z. Liao and R. Couillet, “On the spectrum of random features maps of high dimensional data,” arXiv preprint arXiv:1805.11916 , 2018.[22] F. Bach, “Sharp analysis of low-rank kernel matrix approximations,” in
Conference on Learning Theory , 2013, pp. 185–209.[23] A. Rudi, D. Calandriello, L. Carratino, and L. Rosasco, “On fast leverage score sampling and optimal learning,” in
Advances in NeuralInformation Processing Systems , 2018, pp. 5677–5687.[24] S. Si, C.-J. Hsieh, and I. S. Dhillon, “Memory efficient kernel approximation,”
The Journal of Machine Learning Research , vol. 18, no. 1,pp. 682–713, 2017.[25] X. Y. Felix, A. T. Suresh, K. M. Choromanski, D. N. Holtmann-Rice, and S. Kumar, “Orthogonal random features,” in
Advances in NeuralInformation Processing Systems , 2016, pp. 1975–1983.[26] J. Yang, V. Sindhwani, H. Avron, and M. Mahoney, “Quasi-monte carlo feature maps for shift-invariant kernels,” in
International Conferenceon Machine Learning , 2014, pp. 485–493.[27] Q. Le, T. Sarl´os, and A. Smola, “Fastfood-approximating kernel expansions in loglinear time,” in
International Conference on MachineLearning , vol. 85, 2013.[28] I. E.-H. Yen, T.-W. Lin, S.-D. Lin, P. K. Ravikumar, and I. S. Dhillon, “Sparse random feature algorithm as coordinate descent in hilbertspace,” in
Advances in Neural Information Processing Systems , 2014, pp. 2456–2464.[29] F. X. Yu, S. Kumar, H. Rowley, and S.-F. Chang, “Compact nonlinear maps and circulant extensions,” arXiv preprint arXiv:1503.03893 ,2015.[30] Z. Yang, A. Wilson, A. Smola, and L. Song, “A la carte–learning fast kernels,” in
Artificial Intelligence and Statistics , 2015, pp. 1098–1106.[31] J. B. Oliva, A. Dubey, A. G. Wilson, B. P´oczos, J. Schneider, and E. P. Xing, “Bayesian nonparametric kernel-learning,” in
ArtificialIntelligence and Statistics , 2016, pp. 1078–1086.[32] W.-C. Chang, C.-L. Li, Y. Yang, and B. Poczos, “Data-driven random fourier features using stein effect,”
Proceedings of the Twenty-SixthInternational Joint Conference on Artificial Intelligence (IJCAI-17) , 2017.[33] R. Agrawal, T. Campbell, J. H. Huggins, and T. Broderick, “Data-dependent compression of random features for large-scale kernelapproximation,” arXiv preprint arXiv:1810.04249 , 2018.[34] Z. Li, J.-F. Ton, D. Oglic, and D. Sejdinovic, “A unified analysis of random fourier features,” arXiv preprint arXiv:1806.09178 , 2018.[35] A. Cotter, J. Keshet, and N. Srebro, “Explicit approximations of the gaussian kernel,” arXiv preprint arXiv:1109.4603 , 2011.[36] C. Yang, R. Duraiswami, and L. Davis, “Efficient kernel machines using the improved fast gauss transform,” in
Proceedings of the 17thInternational Conference on Neural Information Processing Systems , 2004, pp. 1561–1568.[37] J.-W. Xu, P. P. Pokharel, K.-H. Jeong, and J. C. Principe, “An explicit construction of a reproducing gaussian kernel Hilbert space,” in
IEEE International Conference on Acoustics, Speech and Signal Processing , vol. 5, 2006.[38] H. Q. Minh, P. Niyogi, and Y. Yao, “Mercer’s theorem, feature maps, and smoothing,” in
International Conference on ComputationalLearning Theory . Springer, 2006, pp. 154–168.[39] A. Vedaldi and A. Zisserman, “Efficient additive kernels via explicit feature maps,”
IEEE Transactions on Pattern Analysis and MachineIntelligence , vol. 34, no. 3, pp. 480–492, 2012.[40] S. Shahrampour and V. Tarokh, “Learning bounds for greedy approximation with explicit feature maps from multiple kernels,” in
Advancesin Neural Information Processing Systems , 2018, pp. 4695–4706.[41] C. Williams and M. Seeger, “Using the Nystr¨om method to speed up kernel machines,” in
Advances in Neural Information ProcessingSystems , 2001.[42] P. Drineas and M. W. Mahoney, “On the Nystr¨om method for approximating a gram matrix for improved kernel-based learning,”
Journal ofMachine Learning Research , vol. 6, no. Dec, pp. 2153–2175, 2005.[43] P. L. Bartlett and S. Mendelson, “Rademacher and gaussian complexities: Risk bounds and structural results,”
Journal of Machine LearningResearch , vol. 3, no. Nov, pp. 463–482, 2002., vol. 3, no. Nov, pp. 463–482, 2002.