Correlated random features for fast semi-supervised learning
Brian McWilliams
ETH Zürich, Switzerland [email protected]
David Balduzzi
ETH Zürich, Switzerland [email protected]
Joachim M. Buhmann
ETH Zürich, Switzerland [email protected]
Abstract
This paper presents Correlated Nyström Views (XNV), a fast semi-supervised algorithm for regression and classification. The algorithm draws on two main ideas. First, it generates two views consisting of computationally inexpensive random features. Second, multiview regression, using Canonical Correlation Analysis (CCA) on unlabeled data, biases the regression towards useful features. It has been shown that CCA regression can substantially reduce variance with a minimal increase in bias if the views contain accurate estimators. Recent theoretical and empirical work shows that regression with random features closely approximates kernel regression, implying that the accuracy requirement holds for random views. We show that XNV consistently outperforms a state-of-the-art algorithm for semi-supervised learning: it substantially improves predictive performance and reduces the variability of performance on a wide variety of real-world datasets, whilst also reducing runtime by orders of magnitude.
1 Introduction

As the volume of data collected in the social and natural sciences increases, the computational cost of learning from large datasets has become an important consideration. For learning non-linear relationships, kernel methods achieve excellent performance but naïvely require operations cubic in the number of training points.

Randomization has recently been considered as an alternative to optimization that, surprisingly, can yield comparable generalization performance at a fraction of the computational cost [1, 2]. Random features have been introduced to approximate kernel machines when the number of training examples is very large, rendering exact kernel computation intractable. Among several different approaches, the Nyström method for low-rank kernel approximation [1] exhibits good theoretical properties and empirical performance [3–5].

A second problem arising with large datasets concerns obtaining labels, which often requires a domain expert to manually assign a label to each instance. This can be very expensive, requiring significant investments of both time and money, as the size of the dataset increases. Semi-supervised learning aims to improve prediction by extracting useful structure from the unlabeled data points and using it in conjunction with a function learned on a small number of labeled points.
Contribution.
This paper proposes a new semi-supervised algorithm for regression and classification, Correlated Nyström Views (XNV), that addresses both problems simultaneously. The method consists essentially of two steps. First, we construct two "views" using random features. We investigate two ways of doing so: one based on the Nyström method and another based on random Fourier features (so-called kitchen sinks) [2, 6]. It turns out that the Nyström method almost always outperforms Fourier features by a quite large margin, so we only report these results in the main text. The second step, following [7], uses Canonical Correlation Analysis (CCA, [8, 9]) to bias the optimization procedure towards features that are correlated across the views. Intuitively, if both views contain accurate estimators, then penalizing uncorrelated features reduces variance without increasing the bias by much. Recent theoretical work by Bach [5] shows that Nyström views can be expected to contain accurate estimators.

We perform an extensive evaluation of XNV on 18 real-world datasets, comparing against a modified version of the SSSL (simple semi-supervised learning) algorithm introduced in [10]. We find that XNV outperforms SSSL by around 10-15% on average, depending on the number of labeled points available, see §3. We also find that the performance of XNV exhibits dramatically less variability than SSSL, with a typical reduction of 30%.

We chose SSSL since it was shown in [10] to outperform a state-of-the-art algorithm, Laplacian Regularized Least Squares [11]. However, since SSSL does not scale up to large sets of unlabeled data, we modify SSSL by introducing a Nyström approximation to improve runtime performance. This reduces runtime by roughly three orders of magnitude on datasets with N ≈ 10,000 points, with further improvements as N increases. Our approximate version of SSSL outperforms kernel ridge regression (KRR) by between 48% and 60% on average over the 18 datasets (see §SI.2), in line with the results reported in [10], suggesting that we lose little by replacing the exact SSSL with our approximate implementation.
Related work.
Multiple view learning was first introduced in the co-training method of [12] and has also recently been extended to unsupervised settings [13, 14]. Our algorithm builds on an elegant proposal for multi-view regression introduced in [7]. Surprisingly, despite guaranteeing improved prediction performance under a relatively weak assumption on the views, CCA regression has not been widely used since its proposal; to the best of our knowledge this is the first empirical evaluation of multi-view regression's performance. A possible reason for this is the difficulty in obtaining naturally occurring data equipped with multiple views that can be shown to satisfy the multi-view assumption. We overcome this problem by constructing random views that satisfy the assumption by design.
2 Method

This section introduces XNV, our semi-supervised learning method. The method builds on two main ideas. First, given two equally useful but sufficiently different views on a dataset, penalizing regression using the canonical norm (computed via CCA) can substantially improve performance [7]. The second is the Nyström method for constructing random features [1], which we use to construct the views.
Suppose we have data T = ((x_1, y_1), . . . , (x_n, y_n)) with x_i ∈ R^D and y_i ∈ R, sampled according to a joint distribution P(x, y). Further suppose we have two views on the data,

\[ z^{(\nu)} : \mathbb{R}^D \longrightarrow \mathcal{H}^{(\nu)} = \mathbb{R}^M : \mathbf{x} \mapsto z^{(\nu)}(\mathbf{x}) =: \mathbf{z}^{(\nu)} \quad \text{for } \nu \in \{1, 2\}. \]

We make the following assumption about the linear regressors which can be learned on these views.
Assumption 1 (Multi-view assumption [7]). Define the mean-squared error loss function ℓ(g, x, y) = (g(x) − y)^2 and let loss(g) := E_P[ℓ(g(x), y)]. Further let L(Z) denote the space of linear maps from a linear space Z to the reals, and define

\[ f^{(\nu)} := \operatorname*{argmin}_{g \in L(\mathcal{H}^{(\nu)})} \mathrm{loss}(g) \;\;\text{for } \nu \in \{1, 2\}, \qquad\text{and}\qquad f := \operatorname*{argmin}_{g \in L(\mathcal{H}^{(1)} \oplus \mathcal{H}^{(2)})} \mathrm{loss}(g). \]

The multi-view assumption is that

\[ \mathrm{loss}\big(f^{(\nu)}\big) - \mathrm{loss}(f) \;\leq\; \epsilon \quad \text{for } \nu \in \{1, 2\}. \tag{1} \]

In short, the best predictor in each view is within ε of the best overall predictor.
Canonical correlation analysis. Canonical correlation analysis [8, 9] extends principal component analysis (PCA) from one to two sets of variables. CCA finds bases for the two sets of variables such that the correlation between projections onto the bases is maximized. The first pair of canonical basis vectors (b^(1)_1, b^(2)_1) is found by solving

\[ \operatorname*{argmax}_{\mathbf{b}^{(1)}, \mathbf{b}^{(2)} \in \mathbb{R}^M} \; \mathrm{corr}\Big( \mathbf{b}^{(1)\top} \mathbf{z}^{(1)}, \; \mathbf{b}^{(2)\top} \mathbf{z}^{(2)} \Big). \tag{2} \]

Subsequent pairs are found by maximizing correlations subject to being orthogonal to previously found pairs. The result of performing CCA is two sets of bases, B^(ν) = [b^(ν)_1, . . . , b^(ν)_M] for ν ∈ {1, 2}, such that the projection of z^(ν) onto B^(ν), which we denote z̄^(ν), satisfies:

1. Orthogonality: E_T[ z̄^(ν)_j z̄^(ν)_k ] = δ_jk, where δ_jk is the Kronecker delta, and
2. Correlation: E_T[ z̄^(1)_j z̄^(2)_k ] = λ_j · δ_jk, where w.l.o.g. we assume 1 ≥ λ_1 ≥ λ_2 ≥ · · · ≥ 0.

Here λ_j is referred to as the j-th canonical correlation coefficient.

Definition 1 (canonical norm). Given a vector z̄^(ν) in the canonical basis, define its canonical norm as

\[ \big\| \bar{\mathbf{z}}^{(\nu)} \big\|_{\mathrm{CCA}} := \sqrt{ \sum_{j=1}^{M} \frac{1 - \lambda_j}{\lambda_j} \big( \bar{z}^{(\nu)}_j \big)^2 }. \]
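To make the construction concrete, the following is a minimal NumPy sketch of CCA between two feature matrices (rows are samples), computing the canonical bases and correlations via an SVD of the whitened cross-covariance. It is an illustrative sketch, not the implementation used in the paper: the function names, the ridge term `reg` added for numerical stability, and the assumption that both views are centred are ours.

```python
import numpy as np

def cca(Z1, Z2, reg=1e-6):
    """Canonical bases B1, B2 and canonical correlations lam for two centred views."""
    n = Z1.shape[0]
    C11 = Z1.T @ Z1 / n + reg * np.eye(Z1.shape[1])  # within-view covariance, view 1
    C22 = Z2.T @ Z2 / n + reg * np.eye(Z2.shape[1])  # within-view covariance, view 2
    C12 = Z1.T @ Z2 / n                              # cross-covariance between the views
    # Whiten each view; the singular values of the whitened cross-covariance
    # are the canonical correlations 1 >= lambda_1 >= lambda_2 >= ... >= 0.
    W1 = np.linalg.inv(np.linalg.cholesky(C11)).T    # satisfies W1.T @ C11 @ W1 = I
    W2 = np.linalg.inv(np.linalg.cholesky(C22)).T
    U, lam, Vt = np.linalg.svd(W1.T @ C12 @ W2)
    k = lam.size                                     # number of canonical pairs
    B1, B2 = W1 @ U[:, :k], W2 @ Vt.T[:, :k]         # one basis column per canonical pair
    return B1, B2, lam

def canonical_norm(zbar, lam, eps=1e-12):
    """Canonical norm of Definition 1 for a vector already in the canonical basis."""
    return np.sqrt(np.sum((1.0 - lam) / np.maximum(lam, eps) * zbar ** 2))
```

Projecting a view onto its canonical basis is then simply `Zbar = Z1 @ B1`, which is the representation used by the canonical ridge regression described next.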
Canonical ridge regression. Suppose we observe n pairs of views coupled with real-valued labels, {(z^(1)_i, z^(2)_i, y_i)}_{i=1}^n. Canonical ridge regression finds coefficients β̂^(ν) = [β̂^(ν)_1, . . . , β̂^(ν)_M]^⊤ such that

\[ \widehat{\boldsymbol{\beta}}^{(\nu)} := \operatorname*{argmin}_{\boldsymbol{\beta}^{(\nu)}} \; \frac{1}{n} \sum_{i=1}^{n} \Big( y_i - \boldsymbol{\beta}^{(\nu)\top} \bar{\mathbf{z}}^{(\nu)}_i \Big)^2 + \big\| \boldsymbol{\beta}^{(\nu)} \big\|^2_{\mathrm{CCA}}. \tag{3} \]

The resulting estimator, referred to as the canonical shrinkage estimator, is

\[ \widehat{\beta}^{(\nu)}_j = \frac{\lambda_j}{n} \sum_{i=1}^{n} \bar{z}^{(\nu)}_{i,j} \, y_i. \tag{4} \]

Penalizing with the canonical norm biases the optimization towards features that are highly correlated across the views. Good regressors exist in both views by Assumption 1. Thus, intuitively, penalizing uncorrelated features significantly reduces variance without increasing the bias by much. More formally:

Theorem 1 (canonical ridge regression, [7]). Assume E[y | x] ≤ 1 and that Assumption 1 holds. Let f^(ν)_β̂ denote the estimator constructed with the canonical shrinkage estimator, Eq. (4), on training set T, and let f denote the best linear predictor across both views. For ν ∈ {1, 2} we have

\[ \mathbb{E}_T\Big[ \mathrm{loss}\big( f^{(\nu)}_{\widehat{\boldsymbol{\beta}}} \big) \Big] - \mathrm{loss}(f) \;\leq\; \epsilon + \frac{\sum_{j=1}^{M} \lambda_j}{n}, \]

where the expectation is with respect to training sets T sampled from P(x, y).

The first term, ε, bounds the bias of the canonical estimator, whereas the second, (1/n) Σ_j λ_j, bounds the variance. The sum Σ_j λ_j can be thought of as a measure of the "intrinsic dimensionality" of the unlabeled data, which controls the rate of convergence. If the canonical correlation coefficients decay sufficiently rapidly, then the increase in bias is more than made up for by the decrease in variance.
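As a quick illustration of Eq. (4), the sketch below computes the canonical shrinkage estimator for one view, assuming `B1` and `lam` come from the `cca()` sketch above (fitted on unlabeled data) and that `Z1_lab` holds the view-1 features of the n labeled points; the names are illustrative, not the paper's.

```python
import numpy as np

def canonical_shrinkage_fit(Z1_lab, y, B1, lam):
    """Canonical shrinkage estimator of Eq. (4) for view 1."""
    Zbar = Z1_lab @ B1                 # labeled points expressed in the canonical basis
    beta_ols = Zbar.T @ y / len(y)     # (1/n) * sum_i zbar_{i,j} y_i for each coordinate j
    return lam * beta_ols              # shrink coordinate j by its canonical correlation

def canonical_shrinkage_predict(Z1_new, B1, beta):
    return (Z1_new @ B1) @ beta
```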
Constructing random views. We construct two views satisfying Assumption 1 in expectation, see Theorem 3 below. To ensure our method scales to large sets of unlabeled data, we use random features generated using the Nyström method [1].

Suppose we have data {x_i}_{i=1}^N. When N is very large, constructing and manipulating the N × N Gram matrix [K]_{ii'} = ⟨φ(x_i), φ(x_{i'})⟩ = κ(x_i, x_{i'}) is computationally expensive. Here φ(x) defines a mapping from R^D to a high-dimensional feature space and κ(·, ·) is a positive semi-definite kernel function. The idea behind random features is to instead define a lower-dimensional mapping z(x_i) : R^D → R^M through a random sampling scheme such that [K]_{ii'} ≈ z(x_i)^⊤ z(x_{i'}) [6, 15]. Thus, using random features, non-linear functions in x can be learned as linear functions in z(x), leading to significant computational speed-ups. Here we give a brief overview of the Nyström method, which uses random subsampling to approximate the Gram matrix.

The Nyström method. Fix an M ≪ N and randomly (uniformly) sample a subset M = {x̂_i}_{i=1}^M of M points from the data {x_i}_{i=1}^N. Let K̂ denote the Gram matrix [K̂]_{ii'} with i, i' ∈ M. The Nyström method [1, 3] constructs a low-rank approximation to the Gram matrix as

\[ [\mathbf{K}]_{ii'} \approx [\widetilde{\mathbf{K}}]_{ii'} := \big[ \kappa(\mathbf{x}_i, \hat{\mathbf{x}}_1), \ldots, \kappa(\mathbf{x}_i, \hat{\mathbf{x}}_M) \big] \, \widehat{\mathbf{K}}^{\dagger} \, \big[ \kappa(\mathbf{x}_{i'}, \hat{\mathbf{x}}_1), \ldots, \kappa(\mathbf{x}_{i'}, \hat{\mathbf{x}}_M) \big]^{\top}, \tag{5} \]

where K̂† ∈ R^{M×M} is the pseudo-inverse of K̂. Vectors of random features can be constructed as

\[ \mathbf{z}(\mathbf{x}_i) = \widehat{\mathbf{D}}^{-1/2} \widehat{\mathbf{V}}^{\top} \big[ \kappa(\mathbf{x}_i, \hat{\mathbf{x}}_1), \ldots, \kappa(\mathbf{x}_i, \hat{\mathbf{x}}_M) \big]^{\top}, \]

where the columns of V̂ are the eigenvectors of K̂ and D̂ is the diagonal matrix whose entries are the corresponding eigenvalues. Constructing features in this way reduces the time complexity of learning a non-linear prediction function from O(N^3) to O(N) [15].

An alternative perspective on the Nyström approximation, which will be useful below, is as follows. Consider the integral operators

\[ L_N[f](\cdot) := \frac{1}{N} \sum_{i=1}^{N} \kappa(\mathbf{x}_i, \cdot) f(\mathbf{x}_i) \qquad\text{and}\qquad L_M[f](\cdot) := \frac{1}{M} \sum_{i=1}^{M} \kappa(\hat{\mathbf{x}}_i, \cdot) f(\hat{\mathbf{x}}_i), \tag{6} \]

and introduce the Hilbert space Ĥ = span{φ̂_1, . . . , φ̂_r}, where r is the rank of K̂ and the φ̂_i are the first r eigenfunctions of L_M. The following proposition shows that using the Nyström approximation is equivalent to performing linear regression in the feature space ("view") z : X → Ĥ spanned by the eigenfunctions of the linear operator L_M in Eq. (6):

Proposition 2 (random Nyström view, [3]). Solving

\[ \min_{\mathbf{w} \in \mathbb{R}^r} \; \frac{1}{N} \sum_{i=1}^{N} \ell\big( \mathbf{w}^{\top} \mathbf{z}(\mathbf{x}_i), y_i \big) + \gamma \| \mathbf{w} \|^2 \tag{7} \]

is equivalent to solving

\[ \min_{f \in \widehat{\mathcal{H}}} \; \frac{1}{N} \sum_{i=1}^{N} \ell\big( f(\mathbf{x}_i), y_i \big) + \gamma \| f \|^2_{\mathcal{H}_\kappa}. \tag{8} \]
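The following sketch shows one way to build Nyström features with a Gaussian kernel, following the construction above; the kernel choice and all function names are illustrative assumptions rather than the paper's code.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def nystroem_features(X, M, sigma, rng=None):
    """Nystrom features z(x_i) = D^{-1/2} V^T [k(x_i, xhat_1), ..., k(x_i, xhat_M)]^T."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(len(X), size=M, replace=False)   # uniform subsample of M landmarks
    Xm = X[idx]
    Kmm = gaussian_kernel(Xm, Xm, sigma)              # M x M Gram matrix on the landmarks
    vals, vecs = np.linalg.eigh(Kmm)
    keep = vals > 1e-10                               # discard numerically-zero eigenvalues
    Knm = gaussian_kernel(X, Xm, sigma)               # N x M cross-kernel matrix
    Z = Knm @ vecs[:, keep] / np.sqrt(vals[keep])     # rows are z(x_i)^T; Z @ Z.T approximates K
    return Z, idx
```

By Proposition 2, ridge regression on the rows of `Z` approximates kernel ridge regression with κ.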
The XNV algorithm. Algorithm 1 details our approach to semi-supervised learning, based on generating two views consisting of Nyström random features and penalizing features which are weakly correlated across views. The setting is that we have labeled data {x_i, y_i}_{i=1}^n and a large amount of unlabeled data {x_i}_{i=n+1}^N. Step 1 generates a set of random features. The next two steps implement multi-view regression using the randomly generated views z^(1)(x) and z^(2)(x).

Algorithm 1 Correlated Nyström Views (XNV).
Input: Labeled data {x_i, y_i}_{i=1}^n and unlabeled data {x_i}_{i=n+1}^N.
1. Generate features. Sample x̂_1, . . . , x̂_{2M} uniformly from the dataset, compute the eigendecompositions of the sub-sampled kernel matrices K̂^(1) and K̂^(2), constructed from the samples 1, . . . , M and M+1, . . . , 2M respectively, and featurize the input:
\[ \mathbf{z}^{(\nu)}(\mathbf{x}_i) \leftarrow \widehat{\mathbf{D}}^{(\nu),-1/2} \widehat{\mathbf{V}}^{(\nu)\top} \big[ \kappa(\mathbf{x}_i, \hat{\mathbf{x}}_{(\nu-1)M+1}), \ldots, \kappa(\mathbf{x}_i, \hat{\mathbf{x}}_{\nu M}) \big]^{\top} \quad \text{for } \nu \in \{1, 2\}. \]
2. Unlabeled data. Compute CCA bases B^(1), B^(2) and canonical correlations λ_1, . . . , λ_M for the two views and set z̄_i ← B^(1) z^(1)(x_i).
3. Labeled data. Solve
\[ \widehat{\boldsymbol{\beta}} = \operatorname*{argmin}_{\boldsymbol{\beta}} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big( \boldsymbol{\beta}^{\top} \bar{\mathbf{z}}_i, y_i \big) + \| \boldsymbol{\beta} \|^2_{\mathrm{CCA}} + \gamma \| \boldsymbol{\beta} \|^2_2. \tag{9} \]
Output: β̂.

Eq. (9) yields a solution for which unimportant features are heavily downweighted in the CCA basis without introducing an additional tuning parameter. The further penalty on the ℓ2 norm (in the CCA basis) is introduced as a practical measure to control the variance of the estimator β̂, which can become large if there are many highly correlated features (i.e. the ratio (1 − λ_j)/λ_j ≈ 0 for large j). In practice most of the shrinkage is due to the CCA norm: cross-validation typically selects small values of γ.
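Putting the pieces together, the sketch below is one way to implement Algorithm 1 using the helpers defined earlier; it is an illustrative reading of the algorithm, not the authors' code. It solves Eq. (9) in closed form as a penalized least-squares problem, and the split of the 2M landmarks into two views is handled simply by drawing two independent subsamples.

```python
import numpy as np

def xnv_fit(X, y, n_labeled, M, sigma, gamma, seed=0):
    """Fit XNV: X holds all N points, the first n_labeled of which carry labels y."""
    rng = np.random.default_rng(seed)
    # Step 1: two Nystrom views of M random features each (two independent landmark draws).
    Z1, _ = nystroem_features(X, M, sigma, rng)
    Z2, _ = nystroem_features(X, M, sigma, rng)
    # Step 2: CCA on all (mostly unlabeled) points; project view 1 onto its canonical basis.
    B1, B2, lam = cca(Z1, Z2)
    Zbar = Z1 @ B1
    # Step 3: penalized least squares on the labeled points, Eq. (9):
    # squared loss + CCA norm + gamma * l2 penalty, solved in closed form.
    Zl, yl = Zbar[:n_labeled], y[:n_labeled]
    cca_pen = np.diag((1.0 - lam) / np.maximum(lam, 1e-12))
    A = Zl.T @ Zl / n_labeled + cca_pen + gamma * np.eye(Zl.shape[1])
    beta = np.linalg.solve(A, Zl.T @ yl / n_labeled)
    return beta, B1

def xnv_predict(Z1_new, B1, beta):
    """Predict from the view-1 Nystrom features of new points."""
    return (Z1_new @ B1) @ beta
```

In practice σ and γ would be chosen by cross-validation on the labeled points, as described in the experiments.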
Computational complexity. XNV is extremely fast. Nyström sampling, step 1, reduces the O(N^3) operations required for exact kernel learning to O(N). Computing the CCA basis, step 2, using standard algorithms requires O(NM^2) operations; however, we reduce the runtime to O(NM) by applying the recently proposed randomized CCA algorithm of [16]. Finally, step 3 is a computationally cheap regularized linear regression on n samples and M features.
Performance guarantees. The quality of the kernel approximation in Eq. (5) has been the subject of detailed study in recent years, leading to a number of strong empirical and theoretical results [3–5, 15]. Recent work by Bach [5] provides theoretical guarantees on the quality of Nyström estimates in the fixed design setting that are relevant to our approach.

Theorem 3 (Nyström generalization bound, [5]). Let ξ ∈ R^N be a random vector with finite variance and zero mean, let y = [y_1, . . . , y_N]^⊤, and define the smoothed estimate ŷ_kernel := (K + NγI)^{-1} K (y + ξ) and the smoothed Nyström estimate ŷ_Nyström := (K̃ + NγI)^{-1} K̃ (y + ξ), both computed by minimizing the MSE with ridge penalty γ. Let η ∈ (0, 1). For sufficiently large M (depending on η, see [5]), we have

\[ \mathbb{E}_{\mathcal{M}} \, \mathbb{E}_{\xi} \big[ \| \mathbf{y} - \hat{\mathbf{y}}_{\text{Nyström}} \|^2 \big] \;\leq\; (1 + 4\eta) \cdot \mathbb{E}_{\xi} \big[ \| \mathbf{y} - \hat{\mathbf{y}}_{\text{kernel}} \|^2 \big], \]

where E_M refers to the expectation over the subsampled columns used to construct K̃.

In short, the best smoothed estimators in the Nyström views are close to the optimal smoothed estimator. Since the kernel estimate is consistent, loss(f) → 0 as n → ∞. Thus, Assumption 1 holds in expectation and the generalization performance of XNV is controlled by Theorem 1. (The theorem is stated for the fixed design setting; extending it to a random design requires techniques from [17].)
Random Fourier features. An alternative approach to constructing random views is to use Fourier features instead of Nyström features in Step 1. We refer to this approach as Correlated Kitchen Sinks (XKS) after [2]. It turns out that the performance of XKS is consistently worse than that of XNV, in line with the detailed comparison presented in [3]. We therefore do not discuss Fourier features in the main text; see §SI.3 for details on implementation and experimental results.
3 Experiments

Table 1: Datasets (C: classification, R: regression).

Name          Task   N        D
abalone       C      …,089    6
adult         C      …,561    14
ailerons      R      …,154    40
bank8         C      …,096    8
bank32        C      …,096    32
cal housing   R      …,320    8
census        R      …,186    119
CPU           R      …,554    21
CT            R      …,000    385
elevators     R      …,752    18
HIVa          C      …,339    1,…
house         R      …,392    16
ibn Sina      C      …,361    92
orange        C      …,000    230
sarcos 1      R      …,484    21
sarcos 5      R      …,484    21
sarcos 7      R      …,484    21
sylva         C      …,626    216
SSSL. The SSSL (simple semi-supervised learning) algorithm proposed in [10] finds the first s eigenfunctions φ_j of the integral operator L_N in Eq. (6) and then solves

\[ \operatorname*{argmin}_{\mathbf{w} \in \mathbb{R}^s} \; \sum_{i=1}^{n} \bigg( \sum_{j=1}^{s} w_j \, \phi_j(\mathbf{x}_i) - y_i \bigg)^2, \tag{10} \]

where s is set by the user. SSSL outperforms Laplacian Regularized Least Squares [11], a state-of-the-art semi-supervised learning method, see [10]. It also has good generalization guarantees under reasonable assumptions on the distribution of eigenvalues of L_N. However, since SSSL requires computing the full N × N Gram matrix, it is extremely computationally intensive for large N. Moreover, tuning s is difficult since it is discrete.

We therefore propose SSSL_M, an approximation to SSSL. First, instead of constructing the full Gram matrix, we construct a Nyström approximation by sampling M points from the labeled and unlabeled training set. Second, instead of thresholding eigenfunctions, we use the easier-to-tune ridge penalty, which penalizes directions proportionally to the inverse square of their eigenvalues [18]. As justification, note that Proposition 2 states that the Nyström approximation to kernel regression actually solves a ridge regression problem in the span of the eigenfunctions of L_M. As M increases, the span of L_M tends towards that of L_N [15]. We will also refer to the Nyström approximation to SSSL using 2M features as SSSL_2M. See the experiments below for further discussion of the quality of the approximation.
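To make the comparison baseline concrete, here is a minimal sketch of SSSL_M as described above: Nyström features built from the pooled labeled and unlabeled points, followed by ordinary ridge regression on the labeled ones. It reuses `nystroem_features()` from the earlier sketch, and the names are again illustrative.

```python
import numpy as np

def sssl_m_fit(X, y, n_labeled, M, sigma, gamma, seed=0):
    """SSSL_M sketch: Nystrom features on all points + ridge regression on the labeled ones."""
    rng = np.random.default_rng(seed)
    Z, _ = nystroem_features(X, M, sigma, rng)       # landmarks drawn from labeled + unlabeled data
    Zl, yl = Z[:n_labeled], y[:n_labeled]
    A = Zl.T @ Zl / n_labeled + gamma * np.eye(Z.shape[1])
    w = np.linalg.solve(A, Zl.T @ yl / n_labeled)    # ridge penalty replaces eigenfunction thresholding
    return w
```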
Setup. We evaluate the performance of XNV on 18 real-world datasets, see Table 1. The datasets cover a variety of regression (denoted by R) and two-class classification (C) problems, taken from the UCI repository (http://archive.ics.uci.edu/ml/datasets.html) and other public benchmark repositories. The sarcos dataset involves predicting the joint position of a robot arm; following convention we report results on the 1st, 5th and 7th joint positions.

The SSSL algorithm was shown to exhibit state-of-the-art performance over fully and semi-supervised methods in scenarios where few labeled training examples are available [10]. However, as discussed above, SSSL does not scale up to large sets of unlabeled data, so we compare XNV to the Nyström approximations SSSL_M and SSSL_2M.

We used a Gaussian kernel for all datasets. We set the kernel width σ and the ℓ2 regularisation strength γ for each method using 5-fold cross validation on the labeled training examples. We trained all methods using a squared error loss function, ℓ(f(x_i), y_i) = (f(x_i) − y_i)^2, with M = 200 random features and n = 100, 200, . . . , 1000 randomly selected labeled training examples.
Runtime performance. The exact SSSL algorithm of [10] is not computationally feasible on large datasets, since it has time complexity O(N^3). For illustrative purposes, we compare the runtime in seconds of the exact SSSL algorithm against SSSL_M and XNV on three datasets of different sizes (computed in Matlab 7.14 on a Core i5 with 4GB memory).

[Table: runtimes in seconds of SSSL, SSSL_M and XNV on bank8, cal housing and sylva; exact SSSL is intractable on sylva.]

For the cal housing dataset, XNV exhibits a speed-up of roughly three orders of magnitude over exact SSSL. For the largest dataset, sylva, exact SSSL is computationally intractable. Importantly, the computational overhead of XNV over SSSL_M is small.
Generalization performance. We report prediction performance averaged over 100 experiments. For regression tasks we report the mean squared error (MSE) on the test set, normalized by the variance of the test outputs. For classification tasks we report the percentage of the test set that was misclassified.

The table below shows the improvement in performance of XNV over SSSL_M and SSSL_2M (taking whichever performs better out of M or 2M features on each dataset), averaged over all 18 datasets. Observe that XNV is considerably more accurate and more robust than SSSL_{M/2M}.

XNV vs SSSL_{M/2M}           n = 100   n = 200   n = 300   n = 400   n = 500
Avg reduction in error         11%       16%       15%       12%        9%
Avg reduction in std err       15%       30%       31%       33%       30%

The reduced variability is to be expected from Theorem 1.
[Figure 1 panels: (a) adult, (b) cal housing, (c) census, (d) elevators, (e) ibn Sina, (f) sarcos 5; each panel plots the prediction error of SSSL_M, SSSL_2M and XNV against the number of labeled training points (100 to 1000).]
Figure 1: Comparison of mean prediction error and standard deviation on a selection of datasets.

Table 2 presents a more detailed comparison of performance on individual datasets for n = 200 and n = 400. The plots in Figure 1 show a representative comparison of mean prediction errors for several datasets for n = 100, . . . , 1000; error bars represent one standard deviation. Observe that XNV almost always improves prediction accuracy and reduces variance compared with SSSL_M and SSSL_2M when the labeled training set contains between 100 and 500 labeled points. A complete set of results is provided in §SI.1.
Discussion of SSSL_2M. Our experiments show that going from M to 2M features does not improve generalization performance in practice. This suggests that when there are few labeled points, the coarser approximation SSSL_M to SSSL suffices.

Finally, §SI.2 compares the performance of SSSL_M and XNV to fully supervised kernel ridge regression (KRR). We observe dramatic improvements, between 48% and 63%, consistent with the results observed in [10] for the exact SSSL algorithm.
Random Fourier features. Nyström features significantly outperform Fourier features, in line with the observations in [3]. The table below shows the relative improvement of XNV over XKS:

XNV vs XKS                   n = 100   n = 200   n = 300   n = 400   n = 500
Avg reduction in error         30%       28%       26%       25%       24%
Avg reduction in std err       36%       44%       34%       37%       36%

Further results and discussion for XKS are included in the supplementary material.
Table 2: Performance (normalized MSE / classification error rate) of SSSL_M, SSSL_2M and XNV on the individual datasets for n = 200 and n = 400, with standard errors in parentheses. [The same per-dataset results appear, together with n = 100, 300 and 500, in Table 3 of the supplementary material.]

4 Conclusion
We have introduced the XNV algorithm for semi-supervised learning. By combining two randomly generated views of Nyström features via an efficient implementation of CCA, XNV outperforms the prior state-of-the-art, SSSL, by 10-15% (depending on the number of labeled points) on average over 18 datasets. Furthermore, XNV is over 3 orders of magnitude faster than SSSL on medium-sized datasets (N ≈ 10,000), with further gains as N increases. An interesting research direction is to investigate using the recently developed deep CCA algorithm, which extracts higher order correlations between views [19], as a preprocessing step.

In this work we use a uniform sampling scheme for the Nyström method for computational reasons, since it has been shown to perform well empirically relative to more expensive schemes [20]. Since CCA gives us a criterion by which to measure the importance of random features, in the future we aim to investigate active sampling schemes based on canonical correlations, which may yield better performance by selecting the most informative indices to sample.
Acknowledgements. We thank Haim Avron for help with implementing randomized CCA and Patrick Pletscher for drawing our attention to the Nyström method.

References
[1] Williams C, Seeger M: Using the Nyström method to speed up kernel machines. In NIPS, 2001.
[2] Rahimi A, Recht B: Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Adv in Neural Information Processing Systems (NIPS), 2008.
[3] Yang T, Li YF, Mahdavi M, Jin R, Zhou ZH: Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison. In NIPS, 2012.
[4] Gittens A, Mahoney MW: Revisiting the Nyström method for improved large-scale machine learning. In ICML, 2013.
[5] Bach F: Sharp analysis of low-rank kernel matrix approximations. In COLT, 2013.
[6] Rahimi A, Recht B: Random Features for Large-Scale Kernel Machines. In Adv in Neural Information Processing Systems (NIPS), 2007.
[7] Kakade SM, Foster DP: Multi-view Regression Via Canonical Correlation Analysis. In Computational Learning Theory (COLT), 2007.
[8] Hotelling H: Relations between two sets of variates. Biometrika 1936, 28:321–377.
[9] Hardoon DR, Szedmak S, Shawe-Taylor J: Canonical Correlation Analysis: An Overview with Application to Learning Methods. Neural Comp 2004, 16(12):2639–2664.
[10] Ji M, Yang T, Lin B, Jin R, Han J: A Simple Algorithm for Semi-supervised Learning with Improved Generalization Error Bound. In ICML, 2012.
[11] Belkin M, Niyogi P, Sindhwani V: Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. JMLR 2006, 7:2399–2434.
[12] Blum A, Mitchell T: Combining labeled and unlabeled data with co-training. In COLT, 1998.
[13] Chaudhuri K, Kakade SM, Livescu K, Sridharan K: Multiview clustering via Canonical Correlation Analysis. In ICML, 2009.
[14] McWilliams B, Montana G: Multi-view predictive partitioning in high dimensions. Statistical Analysis and Data Mining 2012, 5:304–321.
[15] Drineas P, Mahoney MW: On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning. JMLR 2005, 6:2153–2175.
[16] Avron H, Boutsidis C, Toledo S, Zouzias A: Efficient Dimensionality Reduction for Canonical Correlation Analysis. In ICML, 2013.
[17] Hsu D, Kakade SM, Zhang T: An Analysis of Random Design Linear Regression. In COLT, 2012.
[18] Dhillon PS, Foster DP, Kakade SM, Ungar LH: A Risk Comparison of Ordinary Least Squares vs Ridge Regression. Journal of Machine Learning Research 2013, 14:1505–1511.
[19] Andrew G, Arora R, Bilmes J, Livescu K: Deep Canonical Correlation Analysis. In ICML, 2013.
[20] Kumar S, Mohri M, Talwalkar A: Sampling methods for the Nyström method. JMLR 2012, 13:981–1006.

Supplementary Information

SI.1 Complete XNV results
[Table 3: Performance (normalized MSE / classification error rate) of SSSL_M, SSSL_2M and XNV on each of the 18 datasets for n = 100, 200, 300, 400 and 500, with standard errors in parentheses.]
[Figure 2: Comparison of mean prediction error and standard deviation of SSSL_M, SSSL_2M and XNV against the number of labeled training points (100 to 1000) on all 18 datasets: (a) abalone, (b) adult, (c) ailerons, (d) bank8, (e) bank32, (f) cal housing, (g) census, (h) CPU, (i) CT, (j) elevators, (k) HIVa, (l) house, (m) ibn Sina, (n) orange, (o) sarcos 1, (p) sarcos 5, (q) sarcos 7, (r) sylva.]
SI.2 Comparison with Kernel Ridge Regression

We compare SSSL_M and XNV to fully supervised kernel ridge regression (KRR). The table below reports the percentage improvement in mean error of both methods over KRR, averaged over the 18 datasets according to the experimental procedure detailed in §3; a minimal KRR baseline is sketched after the table. The parameters σ (kernel width) and γ (ridge penalty) for KRR were chosen by 5-fold cross validation. We observe that both SSSL_M and XNV far outperform KRR, by 48–63%. Importantly, this shows that our approximation to SSSL far outperforms the fully supervised baseline.
SSSL_M and XNV vs KRR                n = 100   n = 200   n = 300   n = 400   n = 500
Avg reduction in error for SSSL_M      48%       52%       56%       58%       60%
Avg reduction in error for XNV         56%       62%       63%       63%       63%
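For reference, a minimal exact KRR baseline corresponding to the comparison above might look as follows; it reuses `gaussian_kernel()` from the earlier sketch and is an illustrative implementation, not the one used for the reported numbers. Exact KRR is cubic in the number of labeled points, which is feasible here because n ≤ 500 in this comparison.

```python
import numpy as np

def krr_fit(X_lab, y_lab, sigma, gamma):
    """Exact kernel ridge regression on the labeled points only."""
    n = len(y_lab)
    K = gaussian_kernel(X_lab, X_lab, sigma)
    return np.linalg.solve(K + gamma * n * np.eye(n), y_lab)   # dual coefficients alpha

def krr_predict(X_lab, alpha, X_new, sigma):
    return gaussian_kernel(X_new, X_lab, sigma) @ alpha
```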
SI.3 Random Fourier features

Random Fourier features are a method for approximating shift-invariant kernels [6], i.e. kernels for which κ(x_i, x_{i'}) = κ(x_i − x_{i'}). Such a kernel function can be represented in terms of its inverse Fourier transform as

\[ \kappa(\mathbf{x}_i - \mathbf{x}_{i'}) = \int_{\mathbb{R}^D} P(\boldsymbol{\omega}) \, e^{j \boldsymbol{\omega}^{\top} (\mathbf{x}_i - \mathbf{x}_{i'})} \, d\boldsymbol{\omega}. \]

Here P(ω) is the Fourier transform of κ, which is guaranteed to be a proper probability distribution, and so for real-valued features κ(x_i, x_{i'}) can be equivalently interpreted as E_ω[z(x_i)^⊤ z(x_{i'})], where z(x_i) = √2 cos(ω^⊤ x_i + b). Replacing the expectation by a sample average leads to a scheme for constructing random features. In particular, a Gaussian kernel of width σ has a Fourier transform which is also Gaussian. Sampling ω_m ∼ N(0, σ^{−2} I_D) and b_m ∼ Unif[−π, π], we can construct features whose inner product approximates this kernel as

\[ \mathbf{z}_i = \sqrt{\tfrac{2}{M}} \, \big[ \cos(\boldsymbol{\omega}_1^{\top} \mathbf{x}_i + b_1), \ldots, \cos(\boldsymbol{\omega}_M^{\top} \mathbf{x}_i + b_M) \big]. \]

It was recently shown how both random Fourier features and the Nyström approximation can be cast in the same framework [3]. A major difference between the methods lies in the sampling scheme employed. Random Fourier features are constructed in a data-independent fashion, which makes them extremely cheap to compute. Nyström features are constructed in a data-dependent way, which leads to improved performance but is, in the case of semi-supervised learning, more expensive, since we need to evaluate the approximate kernel for all unlabeled points we wish to use.
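A minimal sketch of this construction for a Gaussian kernel of width sigma, assuming the N(0, σ^{−2} I_D) spectral distribution above; the function name and defaults are illustrative.

```python
import numpy as np

def fourier_features(X, M, sigma, rng=None):
    """Random Fourier features: z(x)^T z(x') approximates a Gaussian kernel of width sigma."""
    rng = np.random.default_rng() if rng is None else rng
    D = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(D, M))   # omega_m ~ N(0, sigma^{-2} I_D)
    b = rng.uniform(-np.pi, np.pi, size=M)           # b_m ~ Unif[-pi, pi]
    return np.sqrt(2.0 / M) * np.cos(X @ W + b)
```

Swapping these features for the Nyström ones in the earlier XNV sketch gives the XKS variant described in Algorithm 2.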
Algorithm 2 details Correlated Kitchen Sinks (XKS). This algorithm generates random views using the random Fourier features procedure in step 1. Steps 2 and 3 proceed exactly as in Algorithm 1.

Algorithm 2 Correlated Kitchen Sinks (XKS).
Input: Labeled data {x_i, y_i}_{i=1}^n and unlabeled data {x_i}_{i=n+1}^N.
1. Generate features. Draw ω_1, . . . , ω_{2M} i.i.d. from P(ω) and featurize the input:
\[ \mathbf{z}^{(1)}_i \leftarrow \big[ \phi(\mathbf{x}_i; \boldsymbol{\omega}_1), \ldots, \phi(\mathbf{x}_i; \boldsymbol{\omega}_M) \big], \qquad \mathbf{z}^{(2)}_i \leftarrow \big[ \phi(\mathbf{x}_i; \boldsymbol{\omega}_{M+1}), \ldots, \phi(\mathbf{x}_i; \boldsymbol{\omega}_{2M}) \big]. \]
2. Unlabeled data. Compute CCA bases B^(1), B^(2) and canonical correlations λ_1, . . . , λ_M for the two views and set z̄_i ← B^(1) z^(1)_i.
3. Labeled data. Solve
\[ \widehat{\boldsymbol{\beta}} = \operatorname*{argmin}_{\boldsymbol{\beta}} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big( \boldsymbol{\beta}^{\top} \bar{\mathbf{z}}_i, y_i \big) + \| \boldsymbol{\beta} \|^2_{\mathrm{CCA}} + \gamma \| \boldsymbol{\beta} \|^2_2. \tag{11} \]
Output: β̂.
XKS is consistently outperformed by
XNV in practice.12
SI.4 Complete XKS results

For completeness we report on the performance of the XKS algorithm. We use the same experimental setup as in Section 3. We compare the performance of XKS against a linear machine learned using M and 2M random Fourier features respectively (RFF_M and RFF_2M).

Table 4: Average performance of XKS against RFF_{M/2M} on 18 datasets.

XKS vs RFF_{M/2M}            n = 100   n = 200   n = 300   n = 400   n = 500
Avg reduction in error         15%       30%       34%       31%       28%
Avg reduction in std err       -1%       35%       47%       43%       44%

Table 4 shows the performance improvement of XKS over RFF_{M/2M}, averaged across the 18 datasets. Table 6 compares the prediction error and standard deviation for each of the datasets individually. Figure 3 shows the performance across the full range of values of n for all datasets. The relative performance of XKS against RFF_M and RFF_2M follows the same trend seen in Section 3, suggesting that CCA-based regression consistently improves on regression over single and joint views.

Table 5: Number of datasets (out of 18) on which XNV outperforms XKS.

                n = 100   n = 200   n = 300   n = 400   n = 500
                  16        16        15        16        16

Finally, Table 5 compares the performance of correlated Nyström features against correlated kitchen sinks. XNV typically outperforms XKS on 16 out of 18 datasets, with XKS only ever outperforming XNV on bank8, house and orange. Since XNV almost always outperforms XKS, we only discuss Nyström features in the main text.
[Table 6: Performance of XKS (normalized MSE / classification error rate) against RFF_M and RFF_2M on each of the 18 datasets for n = 100, 200, 300, 400 and 500, with standard errors in parentheses.]
[Figure 3: Comparison of mean prediction error and standard deviation of RFF_M, RFF_2M and XKS against the number of labeled training points (100 to 1000) on all 18 datasets: (a) abalone, (b) adult, (c) ailerons, (d) bank8, (e) bank32, (f) cal housing, (g) census, (h) CPU, (i) CT, (j) elevators, (k) HIVa, (l) house, (m) ibn Sina, (n) orange, (o) sarcos 1, (p) sarcos 5, (q) sarcos 7, (r) sylva.]