Block Basis Factorization for Scalable Kernel Matrix Evaluation
RUOXI WANG∗, YINGZHOU LI†, MICHAEL W. MAHONEY‡, AND ERIC DARVE§

∗Institute for Computational and Mathematical Engineering, Stanford University, [email protected]
†Department of Mathematics, Duke University, [email protected]
‡International Computer Science Institute and Department of Statistics, University of California, Berkeley, [email protected]
§Department of Mechanical Engineering, Stanford University, [email protected]

Abstract.
Kernel methods are widespread in machine learning; however, they are limited by the quadratic complexity of the construction, application, and storage of kernel matrices. Low-rank matrix approximation algorithms are widely used to address this problem and reduce the arithmetic and storage cost. However, we observed that for some datasets with wide intra-class variability, the optimal kernel parameter for smaller classes yields a matrix that is less well approximated by low-rank methods. In this paper, we propose an efficient structured low-rank approximation method, the Block Basis Factorization (BBF), and its fast construction algorithm to approximate radial basis function (RBF) kernel matrices. Our approach has linear memory cost and linear floating-point operation count. BBF works for a wide range of kernel bandwidth parameters and significantly extends the domain of applicability of low-rank approximation methods. Our empirical results demonstrate its stability and its superiority over state-of-the-art kernel approximation algorithms.
1. Introduction.
Kernel methods are mathematically well-founded nonparametric methods for learning. The essential part of a kernel method is a kernel function K : X × X → ℝ. It is associated with a feature map Ψ from the original input space X ⊂ ℝᵈ to a higher-dimensional Hilbert space H, such that K(x, y) = ⟨Ψ(x), Ψ(y)⟩_H. Presumably, the underlying function for data in the feature space is linear. Therefore, the kernel function enables us to build expressive nonlinear models based on the machinery of linear models. In this paper, we consider the radial basis function (RBF) kernel, which is widely used in machine learning.

The kernel matrix is an essential part of most kernel methods and is defined as follows. Given n data points {x_i}_{i=1}^n, the (i, j)-th entry of the kernel matrix is K_ij = K(x_i, x_j). For example, the solution to a kernel ridge regression is the same as the solution to the linear system (K + δI)α = y.

Regrettably, any operation involving kernel matrices can be computationally expensive. Their construction, application, and storage complexities are quadratic in the number of data points n. Moreover, for solving linear systems involving these matrices, the complexity is even higher: it is O(n³) for direct solvers [19] and O(n²T) for iterative solvers [28, 33], where T is the number of iterations. This is prohibitive in large-scale applications. One popular way to address this problem and reduce the arithmetic and storage cost is matrix approximation. If we can approximate the matrix so that the number of entries that need to be stored is reduced, then iterative solvers are accelerated (assuming memory is a close proxy for the running time of a matrix-vector multiplication).

In machine learning, low-rank matrix approximations are widely used [19, 32, 27, 31, 34, 15, 18, 14, 43]. When the kernel matrix has a large spectral gap, a high approximation accuracy is guaranteed by theoretical results. Even when the matrix does not have a large spectral gap or fast spectral decay, these low-rank algorithms are still popular practical choices to reduce the computational cost; however, the approximation is then less accurate.

The motivation for our algorithm is that in many machine learning applications, RBF kernel matrices cannot be well approximated by low-rank matrices; nonetheless, they are not arbitrary high-rank matrices and often have a certain structure. In the rest of the introduction, we first discuss the importance of higher-rank matrices and then introduce the main idea of our algorithm, which takes advantage of that structure. The RBF kernel f(‖x − y‖/h) has a bandwidth parameter h that controls the size of the neighborhood, i.e., how many nearby points are considered for interactions. The numerical rank of a kernel matrix depends strongly on this parameter. As h decreases from large to small, the rank of a low-rank matrix approximating the kernel matrix increases from O(1) to O(n).
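To make this dependence concrete, the sketch below (our illustration, not the paper's code) builds a Gaussian kernel matrix on synthetic data and reports its numerical rank, here taken as the number of singular values above tol times the largest; the exponent convention exp(−‖x − y‖²/h²) and the data sizes are assumptions for illustration.

```python
import numpy as np

def gaussian_kernel(X, Y, h):
    # Gaussian RBF (assumed form): K_ij = exp(-||x_i - y_j||^2 / h^2)
    d2 = (np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-np.maximum(d2, 0.0) / h**2)

def numerical_rank(K, tol):
    # number of singular values with sigma_k >= tol * sigma_1
    s = np.linalg.svd(K, compute_uv=False)
    return int(np.sum(s >= tol * s[0]))

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 8))            # n = 500 synthetic points, d = 8
for h in [0.25, 0.5, 1.0, 2.0, 4.0]:
    K = gaussian_kernel(X, X, h)
    print(h, numerical_rank(K, tol=1e-4))    # rank drops as h grows
```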
In the large-h regime, traditional low-rank methods are efficient; however, in the small-h regime, these methods fall back to quadratic complexity. The bandwidth parameter is often chosen to maximize the overall performance of regression/classification tasks, and its value is closely related to the smoothness of the underlying function. For kernel regressions and kernelized classifiers, the hypothesis function classes are Σᵢ αᵢ K_h(x, xᵢ) and Σᵢ αᵢ yᵢ K_h(x, xᵢ), respectively. Both can be viewed as interpolations on the training data points. Clearly, the optimal value of h should align with the smoothness of the underlying function.

Although many real-world applications have found that a large h leads to good overall performance, in many cases a large h hurts performance. For example, in kernel regression, when the underlying function is non-smooth, such as one with sharp local changes, a large bandwidth smooths out the local structure; in kernelized classifiers, when the true decision surfaces separating two classes are highly nonlinear, a large bandwidth imposes smooth decision surfaces on the model and ignores local information near the decision surfaces. In practice, situations where relatively small bandwidths are needed are very common. For example, for classification datasets with imbalanced classes, the optimal h for smaller classes is often relatively small. Hence, if we are particularly interested in the properties of smaller classes, a small h is appropriate. As a consequence, matrices of higher rank occur frequently in practice.

Therefore, for certain machine learning problems, low-rank approximations of dense kernel matrices are inefficient. This motivates the development of approximation algorithms that extend the applicability of low-rank algorithms to matrices of higher rank, i.e., that work efficiently for a wider range of kernel bandwidth parameters.

In the field of scientific computing (which also considers kernel matrices, but typically for very different ends), hierarchical algorithms [20, 21, 12, 13, 17, 41] efficiently approximate the forward application of full-rank PDE kernel matrices in low dimensions. These algorithms partition the data space recursively using a tree structure and separate the interactions into near- and far-field interactions, where the near-field interactions are calculated hierarchically and the far-field interactions are approximated using low-rank factorizations. Later, hierarchical matrices (H-matrix, H²-matrix, HSS matrix, HODLR matrix) [24, 26, 25, 8, 39, 4] were proposed as algebraic variants of these hierarchical algorithms. Based on the algebraic representation, the application of the kernel matrix, its inverse, or its factorization can be computed in quasi-linear (O(n logᵏ n)) operations. Due to the tree partitioning of the data space, however, the cost of these algorithms grows quickly with the data dimension, which restricts the O(n) or O(n log n) complexity of those algorithms to low-dimensional problems.

In this paper, we adopt ideas from hierarchical matrices and butterfly factorizations [29, 30], and propose a Block Basis Factorization (BBF) structure that generalizes the traditional low-rank matrix format, along with an efficient construction algorithm. We apply this scientific-computing-based method to a range of problems, with an emphasis on machine learning problems.
We will show that the BBF structure achieves significantly higher accuracy than plain low-rank matrices given the same memory budget, and that the construction algorithm has a complexity linear in n for most machine learning kernels.

The key observation behind our structure is that in most machine learning applications, the sub-matrices representing the interactions from one cluster to the entire dataset are numerically low-rank. For example, Wang et al. [38] proved that if the diameter of a cluster C is smaller than that of the entire dataset X, then the rank of the sub-matrix K(C, X) is lower than the rank of the entire matrix K(X, X). If we partition the data such that each cluster has a small diameter, and the clusters are as far apart as possible from each other, then we can take advantage of the low-rank property of the sub-matrix K(C, X) to obtain a representation that is more memory-efficient than low-rank representations.

The application of our BBF structure is not limited to RBF kernels or machine learning applications. There are many other types of structured matrices for which conventional low-rank approximations may not be satisfactory. Examples include, but are not limited to, covariance matrices from spatial data [37], frontal matrices in the multifrontal method for sparse matrix factorizations [3], and kernel methods in dynamical systems [7].

Our main contribution is threefold. First, we showed that for classification datasets whose decision surfaces have a small radius of curvature, a small kernel bandwidth parameter is needed for high accuracy. Second, we proposed a novel matrix approximation structure that extends the applicability of low-rank methods to matrices of higher rank. Third, we developed a corresponding construction algorithm that produces errors with small variance (the algorithm uses randomized steps) and that has linear, i.e., O(n), complexity for most machine learning kernels. Specifically, our contributions are as follows.
• For several datasets with imbalanced classes, we observed an improvement in accuracy for smaller classes when we set the kernel bandwidth parameter to be smaller than that selected by a cross-validation procedure. We attribute this to the nonlinear decision surfaces, whose nonlinearity we quantify as the smallest radius of curvature of the decision boundary.
• We proposed a novel matrix structure called the Block Basis Factorization (BBF) for machine learning applications. BBF approximates the kernel matrix with linear memory and is efficient for a wide range of bandwidth parameters.
• We proposed a construction algorithm for the BBF structure that is accurate, stable, and linear for most machine learning kernels. This is in contrast to most algorithms based on the singular value decomposition (SVD), which are more accurate but have cubic complexity, or to naïve random sampling algorithms (e.g., uniform sampling), which are linear but often inaccurate or unstable for incoherent matrices. We also provided a fast pre-computation
algorithm to search for suggested input parameters for BBF.

Our algorithm involves three major steps. First, it divides the data into k distinct clusters and permutes the matrix according to these clusters; the permuted matrix has k² blocks, each representing the interactions between two clusters. Second, it computes the column basis for every row-submatrix (the interactions between one cluster and the entire dataset) by first selecting representative columns using a randomized sampling procedure and then compressing those columns using a randomized SVD. Last, it uses the corresponding column and row bases to compress each of the k² blocks, again using a randomized sub-sampling algorithm. Consequently, our method computes an approximation for the k² blocks using a set of only k bases. The resulting framework yields a rank-R approximation and achieves an accuracy similar to the best rank-R approximation. The memory complexity of BBF is O(nR/k + R²), where k is upper bounded by R. This should be contrasted with a low-rank scheme that gives a rank-R approximation with O(nR) memory complexity: BBF achieves a similar approximation accuracy with a factor-of-k saving in memory.

There is a large body of research that aims to accelerate kernel methods by low-rank approximations [19]. Given a matrix K ∈ ℝⁿˣⁿ, a rank-r approximation of K is given by K ≈ U Vᵀ, where
U, V ∈ ℝⁿˣʳ and r is related to the accuracy. The SVD provides the most accurate rank-r approximation of a matrix in terms of both the 2-norm and the Frobenius norm; however, it has a cubic cost. Recent work [34, 31, 27, 32] has reduced the cost to O(n²r) using randomized projections. These methods require the construction of the entire matrix to proceed. Another line of low-rank approximation research is the Nyström method [15, 18, 5], which avoids constructing the entire matrix. A naïve Nyström algorithm uniformly samples columns and reconstructs the matrix from the sampled columns, which is computationally inexpensive but works well only when the matrix has uniform leverage scores, i.e., low coherence. Improved versions of Nyström [16, 43, 14, 18, 1] have been proposed to provide more sophisticated ways of sampling columns.

Several methods have been proposed to address the same problem as this paper. The clustered low-rank approximation (CLRA) [35] performs a block-wise low-rank approximation of kernel matrices from social network data with quadratic construction complexity. The memory efficient kernel approximation (MEKA) [36] successfully avoids the quadratic complexity of CLRA. Importantly, these previous methods did not consider the class-size and parameter-size issues that we examine in detail. Also, in our benchmark, we found that over multiple trials MEKA is not robust, i.e., it often failed to be accurate and produced large errors. This is due to its less accurate structure and its simple construction algorithm. We briefly discuss some significant differences between MEKA and our algorithm. In terms of the structure, the basis in MEKA is computed from a smaller column space and is inherently a less accurate representation, which makes it more straightforward to achieve a linear complexity; in terms of the algorithm, the uniform sampling used in MEKA is less accurate and less stable than the more sophisticated sampling used in BBF, which is strongly supported by theory.

There is also a strong connection between our algorithm and the improved fast Gauss transform (IFGT) [40], an improved version of the fast Gauss transform [22]. Both BBF and IFGT use a clustering approach for space partitioning. The IFGT, however, approximates the kernel function by an analytic expansion, while BBF uses an algebraic approach based on sampling the kernel matrix. This difference makes BBF more adaptive and allows it to achieve higher approximation accuracy.
2. Motivation: kernel bandwidth and class size.
In this section, we discuss the motivations behind designing an algorithm that remains computationally efficient when the matrix rank increases. There are three main motivations. First, the matrix rank depends strongly on the kernel bandwidth parameter (chosen based on the particular problem): the smaller the parameter, the higher the matrix rank. Second, a small bandwidth parameter (a higher-rank matrix) imposes high nonlinearity on the model; hence, it is useful for regression problems with non-smooth function surfaces and classification problems with complex decision boundaries. Third, when the properties of smaller classes are of particular interest, a smaller bandwidth parameter is appropriate, and the resulting matrix is of higher rank. In the following we focus on the first two motivations.
We consider first the bandwidth parameter, and we will show that the matrix rank depends strongly on it. Take the Gaussian kernel exp(−‖x − y‖²/h²) as an example. The bandwidth h controls the function smoothness. As h increases, the function becomes smoother, and consequently, the numerical rank of the matrix decreases. Figure 1 constructs a matrix from a real dataset and shows the numerical rank versus h for varying tolerances tol. As h increases, the numerical rank decreases from full (4177) to low (66 with tol = 10⁻⁵, 28 with tol = 10⁻⁴, and 11 with tol = 10⁻³).

Figure 1: Numerical ranks of the kernel matrix versus h ((a) numerical ranks; (b) singular value decay). The data used is Abalone, normalized in each dimension. The numerical rank is computed as argmin_k (σ_k < tol · σ₁), where σ₁ ≥ σ₂ ≥ ... ≥ σ_n are the singular values and tol is the tolerance. The plot on the right shows the decay of the singular values for varying h.

Low-rank matrix approximations are efficient in the large-h regime, where the matrix rank is low. Unfortunately, in the small-h regime, they fall back to models with quadratic complexity. One natural question is whether situations where a relatively small h is useful occur in machine learning, or whether low-rank methods are sufficient. We answer this question in what follows, where we study kernel classifiers on real datasets and investigate the influence of h on accuracy.

We study the optimal bandwidth parameters used in practical situations, and in particular, we consider kernel classifiers. In practice, the parameter h is selected by a cross-validation procedure combined with a grid search; we denote this parameter by h_CV. For datasets with wide intra-class variability, we observed that the optimal parameters of some small classes turned out to be smaller than h_CV. By small classes, we refer to those with fewer points or smaller diameters.

Table 1 lists some classification datasets with wide intra-class variability. This class imbalance motivated us to study the individual performance of each class. We found that there can be a significant discrepancy between h_CV, which is optimal overall for the entire dataset, and the optimal h for a specific class. In Figure 2, we used a kernel SVM classifier under a wide range of h and measured the performance by the F score on the test data. The F score is the harmonic mean of the precision and recall, i.e.,

    F = 2 × precision × recall / (precision + recall).

The data was randomly divided into an 80% training set and a 20% testing set. Figure 2 shows the test F score versus h for selected classes. We see that for some smaller classes, represented by darker colors, the F score peaks at a value of h that is smaller than h_CV. Specifically, for the smallest class (black curve) of each dataset, as h increases from its own optimal value to h_CV, the test F score drops by 21%, 100%, 16%, and 5% for the EMG, CTG, Otto, and Gesture datasets, respectively. To interpret the value of h in terms of matrix rank, we plotted the singular values for different values of h for the CTG and Gesture datasets in Figure 3.
We see that when using h_CV, the numerical rank is much lower than when using the smaller h that gives better performance on the smaller classes.

The above observation suggests that the value of h_CV is mostly influenced by the large classes and that using h_CV may degrade the performance of smaller classes. Therefore, one way to improve the prediction accuracy for smaller classes is to reduce the bandwidth h. Unfortunately, a decrease in h increases the rank of the corresponding kernel matrix, making low-rank algorithms inefficient. Moreover, even if we create the model using h_CV, as discussed previously, the rank of the kernel matrix will not be very low in most cases. Altogether, this stresses the importance of developing an algorithm that extends the domain of applicability of low-rank algorithms.

Table 1: Classification datasets with wide intra-class variability. For each selected class i, n_i is the number of points and r_i is the median distance to the class center.

  Data   n       d   Selected classes (n_i)
  EMG    28,500  8   ...
  CTG    2,126   23  51, 71, 241, 318, 555
  Otto   20,000  93  870, 1602, 2736, ...

Figure 2: Test F score of selected classes for each dataset: (a) EMG; (b) CTG, h_CV = 1.42; (c) Otto; (d) Gesture, h_CV = 0.25. Each curve represents one class, and the solid circle marks the maximum of the curve. The legend gives the pair (n_i, r_i), where n_i is the number of points in the class and r_i is the median distance to the class center; for CTG, the pairs are (51, 1.0), (71, 1.2), (241, 1.4), (318, 1.3), and (555, 1.4). h_CV is the parameter obtained from cross-validation. For the smaller classes (darker colors), the F score peaks at an h that is smaller than h_CV.

Figure 3: Singular value decay for the CTG and Gesture datasets at selected bandwidths. In subplot (a), h_CV = 1.42 and h = 0.5 are, respectively, the optimal parameter for the full dataset and for the smallest class. In subplot (b), h_CV = 0.25 is the optimal parameter for the full dataset, and h = 0.088 achieves a good F score for clusters of small radii.

This section complements the previous one by investigating data properties that influence the optimal kernel bandwidth parameter h. We studied synthetic two-dimensional data, and our experiments suggested that the optimal h depends strongly on the smallest radius of curvature of the decision surface (depicted in Figure 4). By optimal we mean the parameter that yields the highest accuracy; if multiple such parameters exist, we refer to the largest one as optimal and denote it by h*.

Figure 4: Decision boundaries with varying smallest radii of curvature. Dots represent data points, and the classes are color coded. The curves represent decision boundaries whose radii of curvature are large, medium, and small for the orange, blue, and black (surrounding the small clusters) curves, respectively.

We first experimentally study the relation between h* and the smallest radius of curvature of the decision boundary. Figure 5 shows Gaussian clusters with alternating, color-coded labels. We decrease the radius of curvature of the decision boundary by decreasing the radius of each cluster while keeping the box size fixed. We quantify the smallest radius of curvature of the decision boundary approximately by the standard deviation σ of each cluster. Figure 5b shows a linear correlation between σ and h*.

Figure 5: Left: data (4 clusters) with two alternating labels (the 20- and 100-cluster cases are not shown). Each cluster is generated from a Gaussian distribution with standard deviation σ; the black curve shows the decision boundary at h*. Right: linear relation between the standard deviation σ (≈ half the cluster radius) of each cluster and h*.

We study a couple more examples. Figure 6 shows two smaller circles with different radii surrounded by a large circle. For this example, the smallest radius of curvature of the decision boundary depends strongly on the cluster radius. Hence, the optimal h for the smaller class (pink) should be smaller than that for the larger class (orange), which was verified by the F score: compared to the large cluster, the F score for the small cluster peaks at a smaller h and drops faster as h increases. A similar observation was made for higher-dimensional data. We generated two clusters of different radii surrounded by a larger cluster in dimension 10; Figure 7 shows that the intuition from dimension 2 extends nicely to dimension 10. Another cluster example is shown in Figure 4, with multiple small clusters overlapping a larger cluster at the boundary. The three reference decision boundaries correspond to h = 1.5 (orange), 0.2 (blue), and 0.02 (black), respectively. The highest accuracy was achieved at an h close to the small cluster radius 0.2 yet large enough to tolerate the noise in the overlapping region.

Figure 6: Left: two classes surrounded by a larger one; the legend shows the number of points in each class, and the black curve shows the decision boundary. Right: the test F score versus h.

Figure 7: Left: numerical ranks of the kernel matrix evaluated on the training data versus h (tolerances 10⁻³, 10⁻⁴, 10⁻⁵). Right: the F score for the test data versus h. The synthetic data is in dimension 10 and has three clusters of different radii. The legend gives the pair (n_i, r_i), where n_i is the number of points in cluster i and r_i is the cluster radius.

The above examples, along with others not shown in this paper, experimentally suggest that the optimal parameter h and the smallest radius of curvature of the decision surface are positively correlated. Hence, for datasets whose decision surfaces are highly nonlinear, i.e.
, of small radius of curvature, a relatively small h is very likely needed to achieve high accuracy.

In the following section, we introduce our novel scheme to accelerate kernel evaluations, which remains efficient in cases where traditional low-rank methods are inefficient.
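Before moving on, we note that the per-class F score used in the experiments above can be computed as in the following sketch (ours, not the paper's code); the toy labels are illustrative, and in the paper's experiments the predictions would come from a kernel SVM trained at a given h.

```python
import numpy as np

def per_class_f1(y_true, y_pred, classes):
    scores = {}
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        # F = 2 * precision * recall / (precision + recall)
        scores[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores

y_true = np.array([0, 0, 0, 1, 1, 2])   # class 2 plays the "small class" role
y_pred = np.array([0, 0, 1, 1, 1, 0])
print(per_class_f1(y_true, y_pred, classes=[0, 1, 2]))
```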
3. Block Basis Factorization.
In this section, we propose the Block Basis Factorization (BBF), which extends the applicability of traditional low-rank structures. Section 3.1 describes the BBF structure, and Section 3.2 proposes a fast construction algorithm.
This section defines and analyzes the Block Basis Factorization (BBF). Given a symmetric matrix M ∈ ℝⁿˣⁿ partitioned into k by k blocks, let M_{i,j} denote the (i, j)-th block for i, j = 1, ..., k. Then, the BBF of M is defined as

(1)    M = Ũ C̃ Ũᵀ,

where Ũ is a block-diagonal matrix whose i-th diagonal block U_i is the column basis of M_{i,j} for all j, and C̃ is a k by k block matrix whose (i, j)-th block is C_{i,j} = U_iᵀ M_{i,j} U_j. The BBF structure is depicted in Figure 8.

Figure 8: The BBF structure M = Ũ C̃ Ũᵀ.

We now discuss the memory cost of the BBF structure. If the numerical ranks of all the bases U_i are bounded by r, then the memory cost of the BBF is O(nr + (rk)²). Further, if k ≤ √n and r is a constant independent of n, then the BBF gives a data-sparse representation of the matrix M. In this case, the complexity of both storing the BBF structure and applying it to a vector is linear in n.

It is important to distinguish between our BBF and a block low-rank (BLR) structure [2]. There are two main differences. 1) The memory usage of BBF is much lower than that of BLR: BBF has one basis for all the blocks in the same row, while BLR has a separate basis for each block. The memory for BBF is nr + (rk)², whereas for BLR it is nkr. 2) It is more challenging to construct BBF in linear complexity while remaining accurate: a direct approach using the SVD to construct the low-rank bases has a cubic cost, while a simple randomized approach would be inaccurate and unstable.

In the next section, we propose an efficient method to construct the BBF structure, which uses randomized methods to reduce the cost while still providing a robust approach. Our method is linear in n for most kernels used in machine learning applications. In Section 3.2, we first introduce a theorem (Section 3.2.1) that reveals the motivation behind the BBF structure and characterizes the applicable kernel functions; we then propose a fast construction algorithm for BBF in Section 3.2.2.
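The sketch below (ours; block sizes and ranks are arbitrary) stores the factors of equation (1) explicitly and applies M ≈ Ũ C̃ Ũᵀ to a vector, so the O(nr + (rk)²) application cost is visible in the two block products.

```python
import numpy as np

class BBF:
    """M ≈ Ũ C̃ Ũᵀ with Ũ block diagonal; a sketch of the structure in (1)."""
    def __init__(self, U_blocks, C):
        self.U = U_blocks                      # U_i of shape (n_i, r_i)
        self.C = C                             # inner matrix, (sum r_i) square
        self.row_off = np.cumsum([0] + [U.shape[0] for U in U_blocks])
        self.col_off = np.cumsum([0] + [U.shape[1] for U in U_blocks])

    def matvec(self, x):
        z = np.concatenate([U.T @ x[self.row_off[i]:self.row_off[i + 1]]
                            for i, U in enumerate(self.U)])        # Ũᵀ x
        z = self.C @ z                                             # C̃ (Ũᵀ x)
        return np.concatenate([U @ z[self.col_off[i]:self.col_off[i + 1]]
                               for i, U in enumerate(self.U)])     # Ũ (...)

rng = np.random.default_rng(0)
U = [np.linalg.qr(rng.standard_normal((100, 5)))[0] for _ in range(3)]
C = rng.standard_normal((15, 15))
C = (C + C.T) / 2                              # symmetric inner matrix
y = BBF(U, C).matvec(rng.standard_normal(300))  # n = 300, k = 3, r = 5
```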
Consider an RBF kernel function K : ℝᵈ × ℝᵈ → ℝ. The following theorem from [38] provides an upper bound on the error of a low-rank representation of the kernel K. The error is expressed in terms of the smoothness of the function and the diameters of the source and target domains.

Theorem 3.1. Consider a function f and the kernel K(x, y) = f(‖x − y‖) with x = (x₁, ..., x_d) and y = (y₁, ..., y_d). We assume that x_i ∈ [0, D/√d] and y_i ∈ [0, D/√d], where D is a constant independent of d; this implies that ‖x − y‖ ≤ D. We assume further that there exist D_x < D and D_y < D such that the source points lie in a domain of diameter at most D_x and the target points in a domain of diameter at most D_y.

Let f_p(x) = Σ_n (T ∘ f)(x + 4nD) be a 4D-periodic extension of f(x), where T(·) is 1 on [−D, D] and smoothly decays to 0 outside of this interval. We assume that f_p and its derivatives through f_p^(q−1) are continuous, and that the q-th derivative is piecewise continuous with its total variation over one period bounded by V_q.

Then, for all M_f, M_t > 0 with M_f ≤ M_t, the kernel K can be approximated in a separable form whose rank is at most R = R(M_f, M_t, d) = 4 M_f (M_t + d choose d):

K(x, y) = Σ_{i=1}^{R} g_i(x) h_i(y) + ε_{M_f, M_t}.
The L∞ error is bounded by

|ε_{M_f, M_t}| ≤ ‖f‖_∞ (D_x D_y / D²)^{M_t + 1} + (V_q / (π q)) (D / (π M_f))^q.

In Theorem 3.1, the error is upper bounded by the sum of two terms. We first study the second term, which is independent of D_x and D_y. It depends on the smoothness of the function and decays exponentially as the smoothness increases. Most kernel functions used in machine learning are sufficiently smooth; hence, the second term is usually smaller than the first. Regarding the first term, the domain diameters influence the error through the factor (D_x D_y / D²)^{M_t + 1}, which suggests that for a fixed rank (positively related to M_t), reducing either D_x or D_y reduces the error bound. It also suggests that for a fixed error, reducing either D_x or D_y reduces the rank. This motivated us to cluster points into distinct clusters of small diameter; by the theorem, the rank of the submatrix representing the interactions from one cluster to the entire dataset is then lower than the rank of the entire matrix.

Hence, we seek linear-complexity clustering algorithms that separate points into clusters of small diameter. The k-means and k-centers algorithms are natural choices. Both partition n data points in dimension d into k clusters at a cost of O(nkd). Moreover, they are based on the Euclidean distance between points, which is consistent with RBF kernels, which are functions of the Euclidean distance. In practice, the two algorithms converge to slightly different clusters due to their different objective functions, but neither is absolutely superior. Importantly, the clustering results from these algorithms yield a more memory-efficient BBF structure than random clusters. A more task-specific clustering algorithm could possibly yield better results; however, the main focus of this paper is on factorizing the matrix efficiently rather than on proposing new approaches to identify good clusters.

We experimentally verify this motivation on a real-world dataset. We clustered the pendigits dataset into 10 clusters (C₁, C₂, ..., C₁₀) using the k-means algorithm and report the statistics of each cluster in Figure 9a. The radius of each cluster is smaller than that of the full dataset. We further plot the normalized singular values of the entire matrix M and of its sub-matrices M(C_i, :) in Figure 9b. Notably, the normalized singular values of the sub-matrices decay significantly faster than those of the entire matrix. This suggests that the ranks of the sub-matrices are much lower than the rank of the entire matrix. Hence, by clustering the data into clusters of smaller radius, we capture the local interactions that are missed by conventional low-rank algorithms, which only consider global interactions. As a result, we achieve a similar level of accuracy at a much lower memory cost.

Figure 9: Left (a): clustering result for the pendigits dataset. Right (b): normalized singular value decay; the solid curve corresponds to the entire matrix M and the dashed curves to the row-submatrices M(C_i, :). The kernel is Gaussian with bandwidth parameter h = 2. The cluster statistics in (a) are:

  Cluster:  1     2     3     4     5     6     7     8     9     10    Full
  Radius:   42.4  41.3  37.6  24.3  24.2  23.0  21.8  21.6  21.4  21.3  62.8
  Size:     752   603   729   733   1571  588   1006  421   236   855   7494

This section proposes a fast construction algorithm for the BBF structure. For simplicity, we assume the data points are evenly partitioned into k clusters C₁, ..., C_k and that the numerical rank of each submatrix is r. We first permute the matrix according to the clusters:

(2)    M = P K Pᵀ =
           ⎡ M_{1,1}  M_{1,2}  ···  M_{1,k} ⎤
           ⎢ M_{2,1}  M_{2,2}  ···  M_{2,k} ⎥
           ⎢    ⋮        ⋮     ⋱       ⋮   ⎥
           ⎣ M_{k,1}  M_{k,2}  ···  M_{k,k} ⎦

where P is a permutation matrix and M_{i,j} = K(C_i, C_j) is the interaction matrix between cluster C_i and cluster C_j.

Our fast construction algorithm consists of two components: basis construction and inner-matrix construction. In the following, we adopt MATLAB notation for submatrices. We use a colon to represent all rows or columns, e.g., M_{i,:} = (M_{i,1} ··· M_{i,k}), and use index vectors I and J to represent sub-rows and sub-columns, e.g., M(I, J) is the intersection of the rows and columns whose indices are in I and J, respectively.
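As an illustration of equation (2), the following sketch (ours; it reuses the gaussian_kernel helper from the earlier sketch and scikit-learn's k-means) reorders synthetic points by cluster label so that the kernel matrix becomes a k × k block matrix with blocks M_{i,j} = K(C_i, C_j).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.standard_normal((600, 10))
k = 8
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
perm = np.argsort(labels, kind="stable")       # realizes the permutation P
Kp = gaussian_kernel(X[perm], X[perm], h=1.0)  # P K Pᵀ, now block structured
sizes = np.bincount(labels, minlength=k)       # n_i = |C_i|
off = np.concatenate(([0], np.cumsum(sizes)))
M_12 = Kp[off[0]:off[1], off[1]:off[2]]        # block M_{1,2} = K(C_1, C_2)
```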
1. Basis construction
We consider first the basis construction algorithm. The most accurate approach is to explicitly construct the submatrix M_{i,:} and apply an SVD to obtain the column basis; regrettably, computing all the bases this way has a cubic cost. The randomized SVD [27] reduces the cost to quadratic while remaining accurate; however, a quadratic complexity is still expensive in practice. In the following, we describe a linear algorithm that is accurate and stable. Since the proposed algorithm is randomized, by "stable" we mean that the variance of the output is small over multiple runs. The key idea is to restrict ourselves to a subspace by sampling columns of large volume.

The algorithm is composed of two parts. In the first part, we select columns of M_{i,:} that are representative of its column space. By representative, we mean that the r sampled columns have a volume approximating the maximum r-dimensional volume among all column sets of size r. In the second part, we apply the randomized SVD algorithm to the representative columns to extract the column basis.

Part 1: Randomized sampling algorithm
We seek a sampling method that samples columns with approximately maximum volume. Strong rank-revealing QR (RRQR) [23] returns columns whose volume is proportional to the maximum volume obtained by the SVD. QR with column pivoting (pivoted QR) is a practical replacement for strong RRQR due to its lower computational cost. To ensure a linear complexity, we use the pivoted QR factorization within a randomized approach.

We now describe the randomized sampling method [16] used in our BBF algorithm; the details are in Algorithm 1, and the procedure is depicted in Figure 10. The complexity of sampling r columns from an m × n matrix is O(r²(m + n)). The size of the output index sets Π_r and Π_c can grow as large as qr, but it can be controlled by some practical algorithmic modifications. One is that, given a tolerance, we truncate the top columns based on the magnitudes of the diagonal entries of the matrix R from the pivoted QR. Another is to stop early once the important column index set does not change for two consecutive iterations. For the numerical results reported in this paper, we used q = 2. Note that any linear sampling algorithm can substitute for Algorithm 1; in practice, Algorithm 1 returns columns whose volume is very close to the largest.

Algorithm 1: Randomized sampling algorithm for each submatrix

Function Randomized_Sampling(M_{i,:}, r_i, q)
  input : (1) row-submatrix M_{i,:} to sample from, in its implicit form (given data and kernel function); (2) sample size r_i; (3) iteration parameter q
  output: important column index set Π_c for M_{i,:}
  Π_r = ∅
  for iter = 1, ..., q do
    Important columns: uniformly sample r_i rows, denote the index set by Γ_r, and update Π_r = Π_r ∪ Γ_r; apply a pivoted QR factorization to M_{i,:}(Π_r, :) to get the important column index set Π_c.
    Important rows: uniformly sample r_i columns, denote the index set by Γ_c, and update Π_c = Γ_c ∪ Π_c; apply a pivoted LQ factorization to M_{i,:}(:, Π_c) to get the important row index set Π_r.
  end
  return Π_c

Note: the pivoted QR is the QR factorization with column pivoting based on the largest column norm.
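A condensed NumPy/SciPy rendering of Algorithm 1 is given below as a sketch; it forms the submatrix explicitly (the paper evaluates kernel entries on demand) and omits the truncation and early-stopping refinements, so the returned index set may exceed r as noted above.

```python
import numpy as np
from scipy.linalg import qr

def randomized_sampling(M, r, q, rng):
    """Simplified Algorithm 1: important column indices of an m x n matrix."""
    m, n = M.shape
    pi_r = np.array([], dtype=int)
    pi_c = np.array([], dtype=int)
    for _ in range(q):
        # Important columns: pivoted QR on uniformly sampled rows.
        pi_r = np.union1d(pi_r, rng.choice(m, size=min(r, m), replace=False))
        _, _, piv = qr(M[pi_r, :], mode='economic', pivoting=True)
        pi_c = piv[:r]
        # Important rows: pivoted LQ, i.e. pivoted QR of the transpose.
        pi_c = np.union1d(pi_c, rng.choice(n, size=min(r, n), replace=False))
        _, _, piv = qr(M[:, pi_c].T, mode='economic', pivoting=True)
        pi_r = piv[:r]
    return pi_c

rng = np.random.default_rng(0)
M = rng.standard_normal((300, 40)) @ rng.standard_normal((40, 800))  # rank 40
cols = randomized_sampling(M, r=40, q=2, rng=rng)
```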
Algorithm 2: BBF sampling algorithm

Function BBF_Sampling({M_{i,:}}_{i=1}^k, {r_i}_{i=1}^k, q)
  input : (1) sub-matrices {M_{i,:}}_{i=1}^k to sample from, in their implicit forms (given data and kernel function); (2) sample size r_i for each sub-matrix M_{i,:}; (3) iteration parameter q
  output: important column index set Π_i for each row-submatrix
  for i = 1, ..., k do
    Π_i = Randomized_Sampling(M_{i,:}, r_i, q)   (using Algorithm 1)
  end
  return Π_i for i = 1, ..., k

Applying Algorithm 1 to the k submatrices {M_{i,:}}_{i=1}^k returns the desired k index sets; however, the cost of the factorizations then grows with k (see Section 3.2.3 for details). We can remove this dependence by applying Algorithm 1 to a pre-selected and refined set of columns instead of all the columns. This leads to the more efficient procedure for sampling columns for the k submatrices described in Algorithm 3. Our final BBF construction algorithm will use Algorithm 2 for column sampling.

Algorithm 3: More efficient BBF sampling algorithm
Function BBF_Sampling_I({M_{i,:}}_{i=1}^k, {r_i}_{i=1}^k, q)
  input : (1) sub-matrices {M_{i,:}}_{i=1}^k to sample from, in their implicit forms (given data and kernel function); (2) sample size r_i for each sub-matrix M_{i,:}; (3) iteration parameter q
  output: important column index set Π_i for each row-submatrix
  for i = 1, ..., k do
    Randomly sample r_i columns from M_{i,:} and denote the index set by Π_i
    Apply a pivoted LQ factorization to M_{i,:}(:, Π_i) to obtain r_i important rows, with index set Γ_i
  end
  Stack all the sampled rows: Γ = [Γ₁, ..., Γ_k]
  for i = 1, ..., k do
    Π_i = Randomized_Sampling(M_{i,:}(:, Γ), r_i, q)   (using Algorithm 1)
  end
  return Π_i for i = 1, ..., k

Figure 10: A pictorial description of the sampling algorithm. We start by sampling random columns and iterate between important rows (via pivoted LQ) and important columns (via pivoted QR) to obtain the refined important columns. This procedure is usually repeated a few times to ensure the stability of the important indices.
Part 2: Orthogonalization algorithm
Having sampled the representative columns M_{i,:}(:, Π_i), the next step is to obtain the column basis. This can be achieved by any orthogonalization method, e.g., pivoted QR, SVD, randomized SVD [27], etc. According to Algorithm 1, the size of the sampled index set Π_c can be as large as q r_i. In practice, we found that the randomized SVD works efficiently. The randomized SVD reduces the cost of computing a rank-r approximation of an m × n matrix to O(mnr). The algorithm is described in Algorithm 4. Practical implementations of Algorithm 4 involve an oversampling parameter ℓ to reduce the iteration parameter q; for simplicity, we omit ℓ from the pseudocode.
2. Inner matrix construction
Algorithm 4: Randomized SVD

Function Randomized_SVD(M, r, q)
  input : (1) matrix M ∈ ℝ^{m×n}; (2) desired rank r; (3) iteration parameter q
  output: U, Σ, and V such that M ≈ U Σ Vᵀ
  Randomly generate a Gaussian matrix Ω ∈ ℝ^{n×r}
  Compute the QR factorization M Ω = Q R
  for i = 1, ..., q do
    Compute the QR factorization Mᵀ Q = Q̂ R̂
    Compute the QR factorization M Q̂ = Q R
  end
  Compute the SVD Û Σ Vᵀ = Qᵀ M
  U = Q Û
  return U, Σ, V

We now consider the inner-matrix construction. Given column bases U_i and U_j, we seek the matrix C_{i,j} that minimizes ‖M_{i,j} − U_i C_{i,j} U_jᵀ‖. The minimizer is given by C_{i,j} = U_i† M_{i,j} (U_jᵀ)†. Computing C_{i,j} exactly has a quadratic cost. Again, we restrict ourselves to a subspace and propose a sampling-based approach that is efficient yet accurate. The following proposition provides the key theoretical insight behind our algorithm.
Proposition 3.2. Suppose a matrix M ∈ ℝ^{m×n} can be written as M = U C Vᵀ, where U ∈ ℝ^{m×r} and V ∈ ℝ^{n×r}. If, for some index sets I and J, U(I, :) and V(J, :) are full rank, then the inner matrix C is given by

(3)    C = (U(I, :))† M(I, J) ((V(J, :))ᵀ)†,

where † denotes the pseudo-inverse.

Proof. To simplify notation, write Û = U(I, :), V̂ = V(J, :), and M̂ = M(I, J), where I is the sampled row index set for U and J is the sampled row index set for V. We apply the sampling matrices P_I and P_J (matrices of 0's and 1's) to both sides of the equation M = U C Vᵀ and obtain P_I M P_Jᵀ = P_I U C Vᵀ P_Jᵀ, i.e., M̂ = Û C V̂ᵀ. The assumption that Û and V̂ are tall-and-skinny matrices with full column rank implies that Û† Û = I and V̂ᵀ (V̂ᵀ)† = I. Multiplying by Û† and (V̂ᵀ)† on both sides gives the desired result: Û† M̂ (V̂ᵀ)† = Û† Û C V̂ᵀ (V̂ᵀ)† = C.

Proposition 3.2 provides insights into an efficient, stable, and accurate construction of the inner matrix. In practice, the equality in Proposition 3.2 often holds only up to an error term, and we seek index sets I and J for which the computation of C is accurate and stable, i.e., a submatrix M(I, J) with a large volume. However, finding such a set can be computationally expensive, and a heuristic is required for efficiency. We use a simplified approach in which we sample I (resp. J) such that U(I, :) (resp. V(J, :)) has a large volume. This leads to good numerical stability, because having a large volume is equivalent to being nearly orthogonal, which implies a good condition number. In principle, a pivoted QR strategy could be used, but, fortunately, we can skip it by reusing the results from the basis construction: there, the important rows were sampled using a pivoted LQ factorization, so they already have large volume.

The inner-matrix construction therefore proceeds as follows. We first uniformly sample r column indices Γ_j and r row indices Γ_i from C_j and C_i, respectively. Then the index sets are I = Π_i ∪ Γ_i and J = Π_j ∪ Γ_j, where Π_i and Π_j are the important row index sets from the basis construction. Finally, C_{i,j} is given by (U_i(I, :))† M_{i,j}(I, J) ((U_j(J, :))ᵀ)†.

We also observed small entries in some off-diagonal blocks of the inner matrix. These blocks normally represent far-range interactions. We can set the blocks whose norm is below a preset threshold to 0; the dense inner matrix then becomes a block-wise sparse matrix, further reducing the memory.

Having discussed the details of the construction algorithm, we summarize the procedure in Algorithm 5, which is the algorithm used for all the numerical results. In this section, for simplicity, we present BBF only for symmetric kernel matrices. The extension to general non-symmetric cases is straightforward using similar ideas, with roughly double the computational cost; an asymmetric BBF can be useful for compressing the kernel matrix in the testing phase.
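The sketch below (ours; the sizes and the sampled index sets are illustrative) applies the formula of Proposition 3.2 to an exactly factorable block and recovers the inner matrix up to rounding error.

```python
import numpy as np

def inner_matrix(M_ij, U_i, U_j, I, J):
    # C_ij = (U_i(I,:))^+  M_ij(I,J)  ((U_j(J,:))^T)^+
    return (np.linalg.pinv(U_i[I, :]) @ M_ij[np.ix_(I, J)]
            @ np.linalg.pinv(U_j[J, :].T))

rng = np.random.default_rng(0)
U_i = np.linalg.qr(rng.standard_normal((200, 6)))[0]
U_j = np.linalg.qr(rng.standard_normal((150, 6)))[0]
M_ij = U_i @ rng.standard_normal((6, 6)) @ U_j.T   # exactly factorable block
I = rng.choice(200, size=12, replace=False)        # plays the role of Π_i ∪ Γ_i
J = rng.choice(150, size=12, replace=False)        # plays the role of Π_j ∪ Γ_j
C_ij = inner_matrix(M_ij, U_i, U_j, I, J)
print(np.linalg.norm(M_ij - U_i @ C_ij @ U_j.T))   # ~ machine precision
```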
3. Pre-computation: Parameter Selection
We present a heuristic algorithm to identify input parameters for BBF. The algorithm takes n input points {x_i}_{i=1}^n and a requested error tolerance ε, and outputs the suggested parameters for the BBF construction algorithm: the number of clusters k, the index set for each cluster, and the estimated rank r_i for the submatrix corresponding to each cluster. We seek a set of parameters that minimizes the memory cost while keeping the approximation error below ε.

Choice of column ranks.
Given the tolerance ε and the number of clusters k, we describe our method for identifying the column ranks. To maintain a low cost, the key idea is to consider only the diagonal blocks instead of the entire row-submatrices. For each row-submatrix of an RBF kernel matrix (after permutation), the diagonal block, which represents the interactions within a cluster, usually has a slower spectral decay than the off-diagonal blocks, which represent the interactions between clusters. Hence, we determine the minimal input rank for the diagonal block and use it as the rank for the off-diagonal blocks in the same row.

Specifically, let σ_{1,i} ≥ σ_{2,i} ≥ ... ≥ σ_{n_i,i} denote the singular values of M_{i,i}. Then for the block M_{i,i} ∈ ℝ^{n_i×n_i}, the rank r_i is chosen as

r_i = min { m | Σ_{p=m+1}^{n_i} σ_{p,i} < (n_i/n) ‖M_{i,i}‖_F ε }.

Choice of number of clusters k. Given the tolerance ε, we consider the number of clusters k. For k clusters, the upper bound on the memory usage of BBF is

Σ_{i=1}^k n_i r_i + (Σ_{i=1}^k r_i)²,

where r_i is computed as described above. Hence, the optimal k is the solution to the following optimization problem:

minimize_k  g(k) = Σ_{i=1}^k n_i r_i + (Σ_{i=1}^k r_i)²
subject to  r_i = min { m | Σ_{p=m+1}^{n_i} σ_{p,i} < (n_i/n) ‖M_{i,i}‖_F ε }  for all i.

We observed empirically that g(k) is, in general, close to convex on the interval [1, O(√n)], which enables a dichotomy search with complexity O(log n) for the minimal point.
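A hedged sketch of this pre-computation is shown below; it recomputes k-means for each candidate k with scikit-learn and scans k directly instead of using the dichotomy search, and block_rank stands in for the tolerance rule above.

```python
import numpy as np
from sklearn.cluster import KMeans

def block_rank(K_ii, eps, n_i, n):
    # r_i = min{ m : sum_{p > m} sigma_p < (n_i / n) * ||M_ii||_F * eps }
    s = np.linalg.svd(K_ii, compute_uv=False)
    budget = (n_i / n) * np.linalg.norm(K_ii, 'fro') * eps
    tails = np.cumsum(s[::-1])[::-1]          # tails[m] = sum of s[m:]
    ok = tails < budget
    return int(np.argmax(ok)) if ok.any() else len(s)

def g(X, kernel, eps, k):
    # memory objective g(k) = sum_i n_i r_i + (sum_i r_i)^2
    labels = KMeans(n_clusters=k, n_init=5, random_state=0).fit_predict(X)
    n = len(X)
    ranks, sizes = [], []
    for c in range(k):
        Xc = X[labels == c]
        ranks.append(block_rank(kernel(Xc, Xc), eps, len(Xc), n))
        sizes.append(len(Xc))
    return sum(ni * ri for ni, ri in zip(sizes, ranks)) + sum(ranks) ** 2

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 6))
kern = lambda A, B: np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
best_k = min(range(2, 32), key=lambda k: g(X, kern, 1e-3, k))
print(best_k)
```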
Algorithm 5: Fast construction algorithm for BBF (main algorithm)

Function BBF_Construction(k, {C_i}_{i=1}^k, {r_i}_{i=1}^k, M, q_Samp, q_SVD)
  input : (1) number of clusters k; (2) clustering assignments {C_i}_{i=1}^k; (3) rank r_i for each column basis; (4) matrix M in its implicit form; (5) iteration parameter q_Samp for the randomized sampling; (6) iteration parameter q_SVD for the randomized SVD
  output: block-diagonal matrix Ũ and block-wise sparse matrix C̃ such that M ≈ Ũ C̃ Ũᵀ
  [Π₁, ..., Π_k] = BBF_Sampling({M_{i,:}}_{i=1}^k, {r_i}_{i=1}^k, q_Samp)   (Algorithm 2)
  for i = 1, ..., k do
    U_i = Randomized_SVD(M_{i,:}(:, Π_i), r_i, q_SVD)   (Algorithm 4)
  end
  for i = 1, ..., k do
    for j = 1, ..., i do
      if the cutoff criterion is not satisfied then
        uniformly sample Γ_i and Γ_j from C_i and C_j, respectively
        I = Π_i ∪ Γ_i and J = Π_j ∪ Γ_j
        C_{i,j} = (U_i(I, :))† M_{i,j}(I, J) ((U_j(J, :))ᵀ)†
        C_{j,i} = C_{i,j}ᵀ
      else
        C_{i,j} = 0 and C_{j,i} = 0
      end
    end
  end
  return Ũ, C̃

In this section, we analyze the complexity of the algorithm. We provide a detailed analysis of the factorization step, including the basis construction and the inner-matrix construction, and skip the analysis of the pre-computation step. We first introduce some notation.
Notations.
Let k denote the number of clusters, {n_i}_{i=1}^k the number of points in each cluster, {r_i}_{i=1}^k the requested rank for the blocks in the i-th submatrix, and ℓ the oversampling parameter.

Basis construction.
The cost comes from two parts: the column sampling and the randomized SVD for each row-submatrix. For the column sampling of the i-th row-submatrix, the cost is n_i(r_i + ℓ)² + n(r_i + ℓ)², where the first term comes from the pivoted LQ factorization and the second from the pivoted QR factorization. For the randomized SVD, the cost is n_i(r_i + ℓ)². Summing the costs over all the submatrices, we obtain the overall complexity

O( Σ_{i=1}^k [ n_i(r_i + ℓ)² + n(r_i + ℓ)² + n_i(r_i + ℓ)² ] ).

We simplify the result by assuming that the clusters are of equal size and that the numerical rank of each block is r; the above complexity then simplifies to O(nkr²).

Inner matrix construction. The cost of computing the inner matrix C_{i,j} from the sampled M_{i,j}, U_i, and U_j is r_i²r_j + r_ir_j². Summing over all the blocks, the overall complexity is

Σ_{i=1}^k Σ_{j=1}^k ( r_i²r_j + r_ir_j² ).

With the same assumptions as above, the simplified complexity is O(k²r³). Note that k can reach up to O(√n) while still maintaining a linear complexity for this step.

Finally, we summarize the complexity of our algorithm in Table 2.

Table 2: Complexity table. n is the total number of points, k is the number of clusters, and r is the numerical rank of the column bases.

  Pre-computation:  each block O(n_i r);  compute g(k) O(nr);  total O(rn log n)
  Factorization:    basis O(nkr²);  inner matrix O(k²r³);  total O(nkr² + k²r³)
  Application:      O(nr + (rk)²)

From Table 2, we note that the factorization and application costs (storage) depend quadratically on the number of clusters k. This suggests that a large k will spoil the linearity of the algorithm. However, this may not be the case for most machine learning kernels, and we discuss the influence of k on three types of kernel matrices: 1) well approximated by a low-rank matrix; 2) full-rank but approximately sparse; 3) full-rank and dense.
1. Well approximated by a low-rank matrix. When the kernel matrix can be well approximated by a low-rank matrix, kr is upper bounded by a constant (up to the approximation accuracy). In this case, both the factorization and application costs are linear.
2. Full-rank but approximately sparse. When the kernel matrix is full-rank (kr = O(n)) but approximately sparse, the application cost (storage) remains linear due to the sparsity. By "sparsity", we mean that as h decreases, the entries in the off-diagonal blocks of the inner matrices become small enough that setting them to 0 does not cause much accuracy loss. The factorization cost, however, becomes quadratic when using Algorithm 2. One solution is to use Algorithm 3 for column sampling, which removes the dependence on k, assuming k < O(√n).
3. Full-rank and dense. In this case, BBF would be sub-optimal. However, we experimentally observed that most kernel matrices generated by RBF functions on high-dimensional data fall in case 1 or 2.

In the end, we empirically verify the linear complexity of our method. Figure 11 shows the factorization time (in seconds) versus the number of data points for real datasets. The trend is linear, confirming the linear complexity of our algorithm.

Figure 11: Factorization time (log-log scale) for kernel matrices from real datasets ((a) Census housing, h = 2; (b) Forest covertype, h = 2). To illustrate the linear growth of the complexity, we generated datasets with a varying number of points as follows: we first clustered the data into 15 groups, sampled a portion p% from each group, and then increased p. To avoid the influence of other factors on the timing, we fixed the input rank for each block. The timing grows linearly with the data size (matrix size).
4. Experimental Results.
In this section, we experimentally verify the advantages of the BBF structure (Section 4.1) and the BBF algorithm (Section 4.2). By the BBF algorithm we refer to the BBF structure together with the proposed fast construction algorithm. The datasets are listed in Table 3 and Table 1; they can be downloaded from the UCI repository [6], the libsvm website [9], and Kaggle. All data were normalized so that each dimension has mean 0 and standard deviation 1. All experiments were performed on a computer with a 2.4 GHz CPU and 8 GB of memory.

Table 3: Real datasets used in the experiments.
  Abalone   Mushroom   Cpusmall   Pendigits   Census house   Forest covertype
In this section, we experimentally analyze the key factors that give our BBF structure its advantages over competing methods. Many factors contribute; we focus our discussion on the following two: 1) the BBF structure has its column bases constructed from entire row-submatrices, which is an inherently more accurate representation than using diagonal blocks only (as in MEKA); 2) the BBF structure considers local interactions instead of only the global interactions used by a low-rank scheme.
We verifythat computing the column basis from the entire row-submatrix M i, : is generally moreaccurate than from the diagonal blocks M i,i only. Column basis computed from thediagonal blocks only preserves the column space information in the diagonal blocks,and will be less accurate in approximating the off-diagonal blocks. Figure 12 showsthat computing the basis from the entire row-submatrix is more accurate. We compare the BBFstructure and the low-rank structure. The BBF structure refers to Figure 8, and thelow-rank structure means K ≈ U U (cid:62) , where U is a tall and skinny matrix. For afair comparison, we fixed all the factors to be the same except for the structure. Forexample, for both the BBF and the low-rank schemes, we used the same samplingmethod for the column selection, and computed the inner matrices exactly to avoidrandomness introduced in that step. The columns for BBF and low-rank scheme,respectively, were sampled from each row-submatrix M i, : ∈ R n i × n and the entirematrix M ∈ R n × n . For BBF with leverage-score sampling, we sampled columns of M i, : based on its column leverage scores computed from the algorithm in [14].Figure 13 shows the relative error versus the memory cost for different samplingmethods. The relative error is computed by (cid:107) ˆ K − K (cid:107) F (cid:107) K (cid:107) F , where ˆ K is the approximatedkernel matrix, K is the exact kernel matrix, and (cid:107)·(cid:107) F denotes the Frobenius norm. Ascan be seen, the BBF structure is strictly a generalization of the low-rank scheme, andachieves lower approximation error regardless of the sampling method used. More-over, for most sampling methods, the BBF structure outperforms the best low rankapproximation computed by an SVD, which strongly implies that the BBF structureis favorable. In this section, we experimentally evaluate the perfor-mance of our BBF algorithm with other state-of-art kernel approximation methods.Section 4.2.1 and Section 4.2.2 examine the matrix reconstruction error under varyingmemory budget and kernel bandwidth parameters. Section 4.2.3 applies the approx-imations to the kernel ridge regression problem. Finally, Section 4.2.4 compares thelinear complexity of BBF with the IFGT [40]. Throughout the experiments, we useBBF to denote our algorithm, whose input parameters are computed from our pre-computation algorithm.In what follows, we briefly introduce some implementation and input parameterdetails for the methods we are comparing to. • The naïve Nyström (Nys) . We uniformly sampled k columns without re-placement for a rank k approximation. • k -means Nyström (kNys) . It uses k -means clustering and sets the centroidsto be the landmark points. We used the code provided by the author. • Leverage score Nyström (lsNys) . It samples columns with probabilities pro-portional to the statistical leverage scores. We calculated the approximatedleverage scores [14] and sampled k columns with replacement for a rank- k R. WANG, Y. LI, M. MAHONEY AND E. DARVE (a1) M i, : , SVD (a2) M i,i , SVD(b1) M i, : , randomized (b2) M i,i , randomized Figure 12: Errors for each block in the approximated matrix from the Abalone dataset.Warmer color represents larger error. Subplot (a1) and (a2) shares the same colorbar;and (b1) and (b2) shares the same colorbar. The error for block ( i, j ) is computed as (cid:107) M i,j − (cid:99) M i,j (cid:107) F / (cid:107) M (cid:107) F , where (cid:99) M i,j is the approximation of M i,j . 
In this section, we experimentally evaluate the performance of our BBF algorithm against other state-of-the-art kernel approximation methods. Section 4.2.1 and Section 4.2.2 examine the matrix reconstruction error under varying memory budgets and kernel bandwidth parameters. Section 4.2.3 applies the approximations to the kernel ridge regression problem. Finally, Section 4.2.4 compares the linear complexity of BBF with that of the IFGT [40]. Throughout the experiments, we use BBF to denote our algorithm, whose input parameters are computed by our pre-computation algorithm.

In what follows, we briefly describe implementation and input-parameter details for the methods we compare against.
• The naïve Nyström (Nys). We uniformly sampled k columns without replacement for a rank-k approximation (see the sketch after this list).
• k-means Nyström (kNys). It uses k-means clustering and sets the centroids as the landmark points. We used the code provided by the author.
• Leverage score Nyström (lsNys). It samples columns with probabilities proportional to the statistical leverage scores. We calculated approximate leverage scores [14] and sampled k columns with replacement for a rank-k approximation.
• Memory Efficient Kernel Approximation (MEKA). We used the code provided by the author.
• Random Kitchen Sinks (RKS). We used our own MATLAB implementation of their algorithm.
• Improved Fast Gauss Transform (IFGT). We used the C++ code provided by the author.
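As a reference point for these baselines, the following is a minimal sketch of the naïve Nyström approximation described in the first bullet (our own illustrative Python, not any author's release; the data, bandwidth, and rank are arbitrary choices).

```python
import numpy as np

def nystrom(K, k, rng):
    """Rank-k naive Nystrom approximation of a symmetric kernel matrix K.

    Uniformly samples k columns without replacement and forms
    Khat = C @ pinv(W) @ C.T, where C = K[:, idx] and W = K[idx, idx].
    """
    idx = rng.choice(K.shape[0], size=k, replace=False)
    C = K[:, idx]
    W = K[np.ix_(idx, idx)]
    return C @ np.linalg.pinv(W) @ C.T

rng = np.random.default_rng(0)
h = 1.0
X = rng.standard_normal((1000, 5))
K = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / (2 * h**2))  # Gaussian kernel
Khat = nystrom(K, 100, rng)
print(np.linalg.norm(Khat - K) / np.linalg.norm(K))  # relative Frobenius error
```

The same relative-error measure is used for all methods in the comparisons below.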
We consider the reconstruction errors of the different methods as the memory cost varies. The memory cost (storage) is a close approximation of the running time of a matrix-vector multiplication. In addition, measuring memory is more reliable than measuring running time,
which is sensitive to implementation and algorithmic details. In our experiments, we indirectly increased the memory cost by requesting a lower tolerance in BBF. The memory footprints of all methods were fixed to be roughly the same in the following way: for the low-rank methods, the input rank was set to the memory of BBF divided by the matrix size; for MEKA, the input number of clusters was set to be the same as for BBF, and the "eta" parameter (the percentage of blocks set to zero) was also set to be similar to BBF's.

Figure 14 and Figure 15 show the reconstruction error versus memory cost on real datasets and 2D synthetic datasets, respectively. BBF achieves comparable, and often significantly lower, error than the competing methods regardless of the memory cost. Two observations are worth noting. First, BBF outperforms the exact SVD, which gives the best rank-r approximation, and it does so with a factorization complexity that is only linear rather than cubic; this demonstrates the superiority of the BBF structure over the low-rank structure. Second, even compared with MEKA, which uses a similar structure, BBF achieves a lower error with a smaller variance, at a similar factorization complexity; this verifies that the BBF representation is more accurate and its construction algorithm is more stable.

Figure 14: Comparisons of BBF (our algorithm) with competing methods (BBF, MEKA, Nys, kNys, lsNys, RKS, SVD). Panels (loglog scale): (a) Abalone, h = 1; (b) Pendigits, h = 2; (c) CTG; (d) EMG; (e) singular values (CTG); (f) singular values (EMG). The top four plots share the same legend. For each method, we show the improvement in error as more memory is available. For each memory footprint, we report the error of 5 runs of each algorithm; each run is shown with a marker, and the lines represent the average error.

We next consider the reconstruction errors under varying decay patterns of the singular values, which we obtain by choosing a wide range of kernel bandwidth parameters. The memory for all methods is fixed to be roughly the same.

The plots on the left of Figure 16 show the average matrix reconstruction error versus 1/h. For all the low-rank methods, the error increases as h decreases: as h becomes smaller, the kernel function becomes less smooth, and consequently the matrix rank increases. The relation between h and the matrix rank is reflected in the statistics listed in Table 4, which are consistent with the results reported in [18] for varying kernel bandwidth parameters.
For the CTG and EMG datasets, the parameter h was chosen to achieve a higher F score on the smaller classes, which leads to matrices of higher rank, as shown by the plateau or slow decay of the singular values in subplots (e) and (f) of Figure 14.

Figure 15: The left plots report the numerical rank of the kernel matrix versus h (at tolerances 10^{-3} and 10^{-4}), and the right plots report the relative error versus memory. The data for the top plots has alternating labels (see Figure 5, with 100 clusters), while the data for the bottom plots has smaller clusters surrounded by larger ones (see Figure 6). The values of h reported in subplots (a) and (c) yield test accuracy greater than 0.99. The h in (b) and (d) are the largest optimal h, with values 0.0127 and 0.25, respectively. For each memory cost, we report the relative error of 5 runs of each algorithm. The number of clusters for BBF was fixed at 20 for subplot (b) and selected automatically for subplot (d).

In the large-h regime, the gap in error between BBF and the other methods is small: the matrix is low-rank, the low-rank algorithms work effectively, and the difference in error is therefore not significant. In the small-h regime, the gap starts to increase. Here the matrix becomes close to diagonally dominant, and the low-rank structure, being global, cannot efficiently capture the information along the diagonal; for BBF, by contrast, the pre-computation procedure increases the number of clusters to better approximate the diagonal part, and the off-diagonal blocks can be set to zero because their entries are small. By using memory efficiently, BBF is favorable in all cases, from low-rank to nearly diagonal.
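As a concrete illustration of this rank growth, the sketch below (synthetic 2-D data; the tolerance 10^{-4} mirrors one of the tolerances in Figure 15, everything else is an assumption) computes the numerical rank and the stable rank ⌈‖K‖_F² / ‖K‖₂²⌉ of a Gaussian kernel matrix as h decreases.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2))

def gaussian_kernel(X, h):
    d2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * h**2))

for h in [4.0, 1.0, 0.25, 0.05]:
    K = gaussian_kernel(X, h)
    s = np.linalg.svd(K, compute_uv=False)
    num_rank = int((s > 1e-4 * s[0]).sum())             # numerical rank at tol = 1e-4
    stable_rank = int(np.ceil((s**2).sum() / s[0]**2))  # ceil(||K||_F^2 / ||K||_2^2)
    print(f"h = {h:5.2f}: numerical rank {num_rank:4d}, stable rank {stable_rank:4d}")
```

Both quantities grow as h shrinks, which is the regime where purely low-rank methods lose efficiency.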
Table 4: Summary statistics for the Abalone and Pendigits datasets with the Gaussian kernel, where r is the rank, M is the exact matrix, and M_r is the best rank-r approximation of M. ⌈‖M‖_F² / ‖M‖₂²⌉ is referred to as the stable rank and is an underestimate of the rank; l_r denotes the r-th largest leverage score scaled by n/r. For each dataset (Abalone with r = 100, Pendigits with r = 252), the table reports, for several values of h, the stable rank, the ratio ‖M_r‖_F / ‖M‖_F, and l_r.

Figure 16: Relative error versus 1/h for different kernel approximation methods (BBF, Nys, kNys, lsNys, SVD); subplots (a) and (c) show relative error and (b) and (d) show RMSE, for Abalone and Pendigits, respectively. The memory costs for all methods and all kernel parameters are fixed to be roughly the same. The kernel function is the Gaussian kernel. As the rank of the matrix increases (h decreases), the gap in error between the low-rank approximation methods and BBF increases.

We now consider kernel ridge regression. The standard optimization problem for kernel ridge regression is

(4)  min_α ‖Kα − y‖₂² + λ α^⊤ K α,

where K is a kernel matrix, y is the target, and λ > 0 is the regularization parameter. The minimizer is given by the solution of the following linear system:

(5)  (K + λI) α = y.
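To make the step from (4) to (5) explicit, set the gradient of the objective to zero and use the symmetry of K:

∇_α ( ‖Kα − y‖₂² + λ α^⊤Kα ) = 2K(Kα − y) + 2λKα = 2K( (K + λI)α − y ),

which vanishes whenever (K + λI)α = y.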
The linear system can be solved by an iterative solver, e.g., MINRES [33], with complexity O(n²T), where the n² comes from the matrix-vector multiplications and T denotes the number of iterations. If we can approximate K by a K̂ that can be represented in lower memory, then the solve is accelerated accordingly, because the memory is a close approximation of the running time of a matrix-vector multiplication. We can also solve the approximated system directly when the matrix is well approximated by a low-rank matrix: we first form the inverse of K̂ via the Woodbury formula (https://en.wikipedia.org/wiki/Woodbury_matrix_identity) and then apply it to y.

In the experiments, we approximated K by K̂ and solved the following approximated system with MINRES:

(6)  (K̂ + λI) α = y.

The dataset was randomly divided into a training set (80%) and a testing set (20%). We report the test root-mean-square error (RMSE), defined as

(7)  RMSE = √(1/n_test) ‖K_test α̂ − y_test‖_F,

where K_test is the interaction matrix between the test data and the training data, α̂ is the solution of Equation (6), and y_test is the true test target. Figure 17 shows the test RMSE for varying memory costs of the approximation. With the same memory footprint, BBF achieves a lower test error.

Figure 17: Test RMSE versus memory for kernel ridge regression on (a) Abalone, (b) Pendigits, (c) Cpusmall, and (d) Mushroom, with h = 4, 22.6, 4.94, and 4.63, respectively. For each memory cost, we report results averaged over 5 runs. The black line at the bottom of each plot is the test RMSE obtained with the exact matrix. The kernel parameter h and the regularization parameter λ were selected by 5-fold cross-validation with a grid search.
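The following is a minimal sketch of the two solution strategies for (6), assuming a generic factored surrogate K̂ = U W U^⊤ with a fast matrix-vector product (BBF provides such a product; the sizes, regularization value, and random data here are illustrative, not the paper's setup): an iterative MINRES solve through a LinearOperator, and a direct solve via the Woodbury identity.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, minres

rng = np.random.default_rng(0)
n, k, lam = 2000, 50, 1e-2

# Hypothetical factored surrogate Khat = U @ W @ U.T; BBF's factors play
# the same role (a matvec far cheaper than the dense O(n^2) product).
U, _ = np.linalg.qr(rng.standard_normal((n, k)))
W = np.diag(rng.uniform(0.5, 2.0, size=k))
y = rng.standard_normal(n)

# Iterative solve of (Khat + lam*I) alpha = y: each matvec costs O(nk).
A = LinearOperator((n, n), matvec=lambda v: U @ (W @ (U.T @ v)) + lam * v,
                   dtype=float)
alpha_it, info = minres(A, y)

# Direct solve via the Woodbury identity:
# (lam*I + U W U^T)^{-1} y = y/lam - U (W^{-1} + U^T U / lam)^{-1} (U^T y) / lam^2
inner = np.linalg.inv(W) + (U.T @ U) / lam
alpha_wb = y / lam - U @ np.linalg.solve(inner, U.T @ y) / lam**2

# info == 0 indicates MINRES converged; the two solutions should agree.
print(info, np.linalg.norm(alpha_it - alpha_wb) / np.linalg.norm(alpha_wb))
```

The Woodbury route only applies when K̂ is globally low-rank; the iterative route needs nothing beyond a fast matvec, which is why it pairs naturally with BBF.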
We benchmarked the linear complexity of the Improved Fast Gauss Transform (IFGT) [40] and of BBF. The IFGT was proposed to alleviate the dimensionality issue for the Gaussian kernel. For a fixed dimension d, the IFGT has linear complexity in application time and memory cost; regrettably, as d increases, the algorithm requires a large number of data points n before this linear behavior becomes visible. BBF, on the other hand, does not require a large n to exhibit linear growth.

We verify the influence of the dimension d on the complexity of BBF and the IFGT on synthetic datasets, with a fixed tolerance throughout the experiments. Figure 18 shows the time versus the number of points. We focus on the trend of the timings rather than their absolute values, because the IFGT is implemented in C++ while BBF is in MATLAB. The growth rate of the IFGT is linear when d = 5 but falls back to quadratic when d = 40; the growth rate of BBF remains linear in both cases.

Figure 18: Timing (loglog scale) for the IFGT and BBF on synthetic datasets of dimension d = 5 and d = 40. We generated 10 centers uniformly at random in a unit cube and, around each center, randomly generated data with standard deviation 0.1 along each dimension. The kernel parameter h was set to 0.5. All plots share the same legend. The top plots, (a) and (b), show the application time (matrix-vector product); the bottom plots, (c) and (d), show the total time (factorization plus application). The timing of BBF is linear in all dimensions, while the timing of the IFGT falls back to quadratic as d increases.

5. Conclusions and future work.
In this paper, we observed that for classification datasets whose decision boundaries are complex, i.e., of small radius of curvature, a small bandwidth parameter is needed to obtain high prediction accuracy. In practical datasets, such complex decision boundaries occur frequently when there is large variability in class sizes or radii. These small bandwidths result in kernel matrices whose ranks are not low, so traditional low-rank methods are no longer efficient; more generally, for many machine-learning applications, low-rank approximations of dense kernel matrices are inefficient. We are therefore interested in extending the domain of applicability of low-rank methods while retaining computational efficiency. Specifically, we proposed a structured low-rank algorithm with linear memory cost and linear floating point operations that remains accurate even when the kernel bandwidth parameter is small, i.e., when the matrix rank is not low. We experimentally demonstrated that the algorithm works for a wide range of kernel parameters. With the same memory cost, our algorithm achieves accuracy comparable to, and often orders of magnitude higher than, other state-of-the-art kernel approximation methods. It also produces errors with smaller variance, thanks to the careful randomized construction; this is in contrast with other randomized methods, whose errors fluctuate much more significantly. Applying our algorithm to kernel ridge regression likewise shows that our method competes favorably with state-of-the-art approximation methods.
There are several future directions. One direction concerns the efficiency and performance of downstream inference tasks. The focus of this paper is the approximation of the kernel matrix itself. Although the experimental results demonstrate good performance on real-world regression tasks, the downstream tasks could be further improved by relaxing the orthogonality constraints on the U_i matrices; that is, we can generate U_i from an interpolative decomposition [11], which would let BBF share the advantages of algorithms based on Nys and RKS. Another direction concerns the evaluation metric for kernel matrix approximation. This paper used the conventional Frobenius norm to measure approximation quality. Zhang et al. [42] proposed a new metric that better reflects downstream performance; it suggests that achieving good generalization requires a high-rank approximation, which aligns well with the design of BBF and further emphasizes its advantage. Evaluating BBF under this new metric is left for future work. Last but not least, the BBF construction did not take into account the regularization parameter used in many learning algorithms. We believe that the regularization parameter could facilitate the low-rank compression of the kernel matrix in our BBF, but this strategy requires further exploration.
REFERENCES
[1] A. E. Alaoui and M. W. Mahoney, Fast randomized kernel ridge regression with statistical guarantees, in Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, Cambridge, MA, USA, 2015, MIT Press, pp. 775–783.
[2] P. Amestoy, C. Ashcraft, O. Boiteau, A. Buttari, J.-Y. L'Excellent, and C. Weisbecker, Improving multifrontal methods by means of block low-rank representations, SIAM Journal on Scientific Computing, 37 (2015), pp. A1451–A1474.
[3] P. Amestoy, C. Ashcraft, O. Boiteau, A. Buttari, J.-Y. L'Excellent, and C. Weisbecker, Improving multifrontal methods by means of block low-rank representations, SIAM Journal on Scientific Computing, 37 (2015), pp. A1451–A1474.
[4] A. Aminfar, S. Ambikasaran, and E. Darve, A fast block low-rank dense solver with applications to finite-element matrices, Journal of Computational Physics, 304 (2016), pp. 170–188.
[5] F. Bach, Sharp analysis of low-rank kernel matrix approximations, in Proceedings of the 26th Annual Conference on Learning Theory, 2013, pp. 185–209.
[6] K. Bache and M. Lichman, UCI machine learning repository, 2013.
[7] J. Bouvrie and B. Hamzi, Kernel methods for the approximation of nonlinear systems, SIAM Journal on Control and Optimization, 55 (2017), pp. 2460–2492.
[8] S. Chandrasekaran, M. Gu, and T. Pals, A fast ULV decomposition solver for hierarchically semiseparable representations, SIAM Journal on Matrix Analysis and Applications, 28 (2006), pp. 603–622.
[9] C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, ACM Transactions on Intelligent Systems and Technology, 2 (2011), pp. 27:1–27:27.
[10] J. Chen, H. Avron, and V. Sindhwani, Hierarchically compositional kernels for scalable nonparametric learning, J. Mach. Learn. Res., 18 (2017), pp. 2214–2255.
[11] H. Cheng, Z. Gimbutas, P.-G. Martinsson, and V. Rokhlin, On the compression of low rank matrices, SIAM J. Sci. Comput., 26 (2005), pp. 1389–1404.
[12] E. Darve, The fast multipole method I: error analysis and asymptotic complexity, SIAM Journal on Numerical Analysis, 38 (2000), pp. 98–128.
[13] E. Darve, The fast multipole method: numerical implementation, Journal of Computational Physics, 160 (2000), pp. 195–240.
[14] P. Drineas, M. Magdon-Ismail, M. W. Mahoney, and D. P. Woodruff, Fast approximation of matrix coherence and statistical leverage, The Journal of Machine Learning Research, 13 (2012), pp. 3475–3506.
[15] P. Drineas and M. W. Mahoney, On the Nyström method for approximating a Gram matrix for improved kernel-based learning, The Journal of Machine Learning Research, 6 (2005), pp. 2153–2175.
[16] B. Engquist, L. Ying, et al., A fast directional algorithm for high frequency acoustic scattering in two dimensions, Communications in Mathematical Sciences, 7 (2009), pp. 327–345.
[17] W. Fong and E. Darve, The black-box fast multipole method, Journal of Computational Physics, 228 (2009), pp. 8712–8725.
[18] A. Gittens and M. W. Mahoney, Revisiting the Nyström method for improved large-scale machine learning, The Journal of Machine Learning Research, 17 (2016), pp. 3977–4041.
[19] G. H. Golub and C. F. Van Loan, Matrix computations, vol. 3, JHU Press, 2012.
[20] L. Greengard and V. Rokhlin, A fast algorithm for particle simulations, Journal of Computational Physics, 73 (1987), pp. 325–348.
[21] L. Greengard and V. Rokhlin, A new version of the fast multipole method for the Laplace equation in three dimensions, Acta Numerica, 6 (1997), pp. 229–269.
[22] L. Greengard and J. Strain, The fast Gauss transform, SIAM Journal on Scientific and Statistical Computing, 12 (1991), pp. 79–94.
[23] M. Gu and S. C. Eisenstat, Efficient algorithms for computing a strong rank-revealing QR factorization, SIAM Journal on Scientific Computing, 17 (1996), pp. 848–869.
[24] W. Hackbusch, A sparse matrix arithmetic based on H-matrices. Part I: Introduction to H-matrices, Computing, 62 (1999), pp. 89–108.
[25] W. Hackbusch and S. Börm, Data-sparse approximation by adaptive H²-matrices, Computing, 69 (2002), pp. 1–35.
[26] W. Hackbusch and B. N. Khoromskij, A sparse H-matrix arithmetic. Part II: Application to multi-dimensional problems, Computing, 64 (2000), pp. 21–47.
[27] N. Halko, P.-G. Martinsson, and J. A. Tropp, Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions, SIAM Review, 53 (2011), pp. 217–288.
[28] M. R. Hestenes and E. Stiefel, Methods of conjugate gradients for solving linear systems, vol. 49, NBS Washington, DC, 1952.
[29] Y. Li, H. Yang, E. R. Martin, K. L. Ho, and L. Ying, Butterfly factorization, Multiscale Modeling & Simulation, 13 (2015), pp. 714–732.
[30] Y. Li, H. Yang, and L. Ying, Multidimensional butterfly factorization, Applied and Computational Harmonic Analysis, 44 (2018), pp. 737–758.
[31] E. Liberty, F. Woolfe, P.-G. Martinsson, V. Rokhlin, and M. Tygert, Randomized algorithms for the low-rank approximation of matrices, Proceedings of the National Academy of Sciences, 104 (2007), pp. 20167–20172.
[32] M. W. Mahoney, Randomized algorithms for matrices and data, Foundations and Trends in Machine Learning, 3 (2011), pp. 123–224.
[33] C. C. Paige and M. A. Saunders, Solution of sparse indefinite systems of linear equations, SIAM Journal on Numerical Analysis, 12 (1975), pp. 617–629.
[34] T. Sarlos, Improved approximation algorithms for large matrices via random projections, in Foundations of Computer Science, 2006. FOCS'06. 47th Annual IEEE Symposium on, IEEE, 2006, pp. 143–152.
[35] B. Savas, I. S. Dhillon, et al., Clustered low rank approximation of graphs in information science applications, in SDM, SIAM, 2011, pp. 164–175.
[36] S. Si, C.-J. Hsieh, and I. Dhillon, Memory efficient kernel approximation, in Proceedings of the 31st International Conference on Machine Learning, 2014, pp. 701–709.
[37] M. L. Stein, Limitations on low rank approximations for covariance matrices of spatial data, Spatial Statistics, 8 (2014), pp. 1–19.
[38] R. Wang, Y. Li, and E. Darve, On the numerical rank of radial basis function kernels in high dimensions, SIAM J. Matrix Anal. Appl., 39 (2018), pp. 1810–1835.
[39] J. Xia, S. Chandrasekaran, M. Gu, and X. S. Li, Fast algorithms for hierarchically semiseparable matrices, Numerical Linear Algebra with Applications, 17 (2010), pp. 953–976.
[40] C. Yang, R. Duraiswami, N. A. Gumerov, and L. Davis, Improved fast Gauss transform and efficient kernel density estimation, in Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on, IEEE, 2003, pp. 664–671.
[41] L. Ying, G. Biros, and D. Zorin, A kernel-independent adaptive fast multipole algorithm in two and three dimensions, Journal of Computational Physics, 196 (2004), pp. 591–626.
[42] J. Zhang, A. May, T. Dao, and C. Ré, Low-precision random Fourier features for memory-constrained kernel approximation, March 2019.
[43]