Correlated random features for fast semi-supervised learning
Brian McWilliams
ETH Zürich, Switzerland [email protected]
David Balduzzi
ETH Zürich, Switzerland [email protected]
Joachim M. Buhmann
ETH Zürich, Switzerland [email protected]
Abstract
This paper presents Correlated Nyström Views (XNV), a fast semi-supervised algorithm for regression and classification. The algorithm draws on two main ideas. First, it generates two views consisting of computationally inexpensive random features. Second, multiview regression, using Canonical Correlation Analysis (CCA) on unlabeled data, biases the regression towards useful features. It has been shown that CCA regression can substantially reduce variance with a minimal increase in bias if the views contain accurate estimators. Recent theoretical and empirical work shows that regression with random features closely approximates kernel regression, implying that the accuracy requirement holds for random views. We show that XNV consistently outperforms a state-of-the-art algorithm for semi-supervised learning: it substantially improves predictive performance and reduces the variability of performance on a wide variety of real-world datasets, whilst also reducing runtime by orders of magnitude.
1 Introduction

As the volume of data collected in the social and natural sciences increases, the computational cost of learning from large datasets has become an important consideration. For learning non-linear relationships, kernel methods achieve excellent performance but naïvely require operations cubic in the number of training points.

Randomization has recently been considered as an alternative to optimization that, surprisingly, can yield comparable generalization performance at a fraction of the computational cost [1, 2]. Random features have been introduced to approximate kernel machines when the number of training examples is very large, rendering exact kernel computation intractable. Among several different approaches, the Nyström method for low-rank kernel approximation [1] exhibits good theoretical properties and empirical performance [3–5].

A second problem arising with large datasets concerns obtaining labels, which often requires a domain expert to manually assign a label to each instance. This can be very expensive, requiring significant investments of both time and money, as the size of the dataset increases. Semi-supervised learning aims to improve prediction by extracting useful structure from the unlabeled data points and using it in conjunction with a function learned on a small number of labeled points.
Contribution.
This paper proposes a new semi-supervised algorithm for regression and classification, Correlated Nyström Views (XNV), that addresses both problems simultaneously. The method consists essentially of two steps. First, we construct two "views" using random features. We investigate two ways of doing so: one based on the Nyström method and another based on random Fourier features (so-called kitchen sinks) [2, 6]. It turns out that the Nyström method almost always outperforms Fourier features by a quite large margin, so we only report these results in the main text. The second step, following [7], uses Canonical Correlation Analysis (CCA, [8, 9]) to bias the optimization procedure towards features that are correlated across the views. Intuitively, if both views contain accurate estimators, then penalizing uncorrelated features reduces variance without increasing the bias by much. Recent theoretical work by Bach [5] shows that Nyström views can be expected to contain accurate estimators.

We perform an extensive evaluation of XNV on 18 real-world datasets, comparing against a modified version of the SSSL (simple semi-supervised learning) algorithm introduced in [10]. We find that XNV outperforms SSSL by around 10-15% on average, depending on the number of labeled points available, see §3. We also find that the performance of XNV exhibits dramatically less variability than SSSL, with a typical reduction of 30%.

We chose SSSL since it was shown in [10] to outperform a state-of-the-art algorithm, Laplacian Regularized Least Squares [11]. However, since SSSL does not scale up to large sets of unlabeled data, we modify SSSL by introducing a Nyström approximation to improve runtime performance. This reduces runtime by roughly three orders of magnitude on datasets with N ≈ 10,000 points, with further improvements as N increases. Our approximate version of SSSL outperforms kernel ridge regression (KRR) by between 48% and 60% on average over the 18 datasets (see §SI.2), in line with the results reported in [10], suggesting that we lose little by replacing the exact SSSL with our approximate implementation.
Related work.
Multiple view learning was first introduced in the co-training method of [12] and has also recently been extended to unsupervised settings [13, 14]. Our algorithm builds on an elegant proposal for multi-view regression introduced in [7]. Surprisingly, despite guaranteeing improved prediction performance under a relatively weak assumption on the views, CCA regression has not been widely used since its proposal; to the best of our knowledge this is the first empirical evaluation of multi-view regression's performance. A possible reason for this is the difficulty in obtaining naturally occurring data equipped with multiple views that can be shown to satisfy the multi-view assumption. We overcome this problem by constructing random views that satisfy the assumption by design.
2 Method

This section introduces XNV, our semi-supervised learning method. The method builds on two main ideas. First, given two equally useful but sufficiently different views on a dataset, penalizing regression using the canonical norm (computed via CCA) can substantially improve performance [7]. The second is the Nyström method for constructing random features [1], which we use to construct the views.
Suppose we have data T = ((x_1, y_1), . . . , (x_n, y_n)) with x_i ∈ R^D and y_i ∈ R, sampled according to a joint distribution P(x, y). Further suppose we have two views on the data,

\[ z^{(\nu)} : \mathbb{R}^D \longrightarrow \mathcal{H}^{(\nu)} = \mathbb{R}^M : \mathbf{x} \mapsto z^{(\nu)}(\mathbf{x}) =: \mathbf{z}^{(\nu)} \quad \text{for } \nu \in \{1, 2\}. \]

We make the following assumption about the linear regressors which can be learned on these views.
Assumption 1 (Multi-view assumption [7]). Define the mean-squared error loss function ℓ(g, x, y) = (g(x) − y)^2 and let loss(g) := E_P[ℓ(g(x), y)]. Further let L(Z) denote the space of linear maps from a linear space Z to the reals, and define

\[ f^{(\nu)} := \operatorname*{argmin}_{g \in L(\mathcal{H}^{(\nu)})} \mathrm{loss}(g) \;\;\text{for } \nu \in \{1, 2\}, \qquad\text{and}\qquad f := \operatorname*{argmin}_{g \in L(\mathcal{H}^{(1)} \oplus \mathcal{H}^{(2)})} \mathrm{loss}(g). \]

The multi-view assumption is that

\[ \mathrm{loss}\big(f^{(\nu)}\big) - \mathrm{loss}(f) \;\leq\; \epsilon \quad \text{for } \nu \in \{1, 2\}. \tag{1} \]

In short, the best predictor in each view is within ε of the best overall predictor.
Canonical correlation analysis. Canonical correlation analysis [8, 9] extends principal component analysis (PCA) from one to two sets of variables. CCA finds bases for the two sets of variables such that the correlation between projections onto the bases is maximized. The first pair of canonical basis vectors (b^(1)_1, b^(2)_1) is found by solving

\[ \operatorname*{argmax}_{\mathbf{b}^{(1)}, \mathbf{b}^{(2)} \in \mathbb{R}^M} \; \mathrm{corr}\Big( \mathbf{b}^{(1)\top} \mathbf{z}^{(1)}, \; \mathbf{b}^{(2)\top} \mathbf{z}^{(2)} \Big). \tag{2} \]

Subsequent pairs are found by maximizing correlations subject to being orthogonal to previously found pairs. The result of performing CCA is two sets of bases, B^(ν) = [b^(ν)_1, . . . , b^(ν)_M] for ν ∈ {1, 2}, such that the projection of z^(ν) onto B^(ν), which we denote z̄^(ν), satisfies:

1. Orthogonality: E_T[ z̄^(ν)_j z̄^(ν)_k ] = δ_jk, where δ_jk is the Kronecker delta, and
2. Correlation: E_T[ z̄^(1)_j z̄^(2)_k ] = λ_j · δ_jk, where w.l.o.g. we assume 1 ≥ λ_1 ≥ λ_2 ≥ · · · ≥ 0.

Here λ_j is referred to as the j-th canonical correlation coefficient.

Definition 1 (canonical norm). Given a vector z̄^(ν) in the canonical basis, define its canonical norm as

\[ \big\| \bar{\mathbf{z}}^{(\nu)} \big\|_{\mathrm{CCA}} := \sqrt{ \sum_{j=1}^{M} \frac{1 - \lambda_j}{\lambda_j} \big( \bar{z}^{(\nu)}_j \big)^2 }. \]
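To make the construction concrete, the following is a minimal NumPy sketch of CCA between two feature matrices (rows are samples), computing the canonical bases and correlations via an SVD of the whitened cross-covariance. It is an illustrative sketch, not the implementation used in the paper: the function names, the ridge term `reg` added for numerical stability, and the assumption that both views are centred are ours.

```python
import numpy as np

def cca(Z1, Z2, reg=1e-6):
    """Canonical bases B1, B2 and canonical correlations lam for two centred views."""
    n = Z1.shape[0]
    C11 = Z1.T @ Z1 / n + reg * np.eye(Z1.shape[1])  # within-view covariance, view 1
    C22 = Z2.T @ Z2 / n + reg * np.eye(Z2.shape[1])  # within-view covariance, view 2
    C12 = Z1.T @ Z2 / n                              # cross-covariance between the views
    # Whiten each view; the singular values of the whitened cross-covariance
    # are the canonical correlations 1 >= lambda_1 >= lambda_2 >= ... >= 0.
    W1 = np.linalg.inv(np.linalg.cholesky(C11)).T    # satisfies W1.T @ C11 @ W1 = I
    W2 = np.linalg.inv(np.linalg.cholesky(C22)).T
    U, lam, Vt = np.linalg.svd(W1.T @ C12 @ W2)
    k = lam.size                                     # number of canonical pairs
    B1, B2 = W1 @ U[:, :k], W2 @ Vt.T[:, :k]         # one basis column per canonical pair
    return B1, B2, lam

def canonical_norm(zbar, lam, eps=1e-12):
    """Canonical norm of Definition 1 for a vector already in the canonical basis."""
    return np.sqrt(np.sum((1.0 - lam) / np.maximum(lam, eps) * zbar ** 2))
```

Projecting a view onto its canonical basis is then simply `Zbar = Z1 @ B1`, which is the representation used by the canonical ridge regression described next.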
Canonical ridge regression. Suppose we observe n pairs of views coupled with real-valued labels, {(z^(1)_i, z^(2)_i, y_i)}_{i=1}^n. Canonical ridge regression finds coefficients β̂^(ν) = [β̂^(ν)_1, . . . , β̂^(ν)_M]^⊤ such that

\[ \widehat{\boldsymbol{\beta}}^{(\nu)} := \operatorname*{argmin}_{\boldsymbol{\beta}^{(\nu)}} \; \frac{1}{n} \sum_{i=1}^{n} \Big( y_i - \boldsymbol{\beta}^{(\nu)\top} \bar{\mathbf{z}}^{(\nu)}_i \Big)^2 + \big\| \boldsymbol{\beta}^{(\nu)} \big\|^2_{\mathrm{CCA}}. \tag{3} \]

The resulting estimator, referred to as the canonical shrinkage estimator, is

\[ \widehat{\beta}^{(\nu)}_j = \frac{\lambda_j}{n} \sum_{i=1}^{n} \bar{z}^{(\nu)}_{i,j} \, y_i. \tag{4} \]

Penalizing with the canonical norm biases the optimization towards features that are highly correlated across the views. Good regressors exist in both views by Assumption 1. Thus, intuitively, penalizing uncorrelated features significantly reduces variance without increasing the bias by much. More formally:

Theorem 1 (canonical ridge regression, [7]). Assume E[y | x] ≤ 1 and that Assumption 1 holds. Let f^(ν)_β̂ denote the estimator constructed with the canonical shrinkage estimator, Eq. (4), on training set T, and let f denote the best linear predictor across both views. For ν ∈ {1, 2} we have

\[ \mathbb{E}_T\Big[ \mathrm{loss}\big( f^{(\nu)}_{\widehat{\boldsymbol{\beta}}} \big) \Big] - \mathrm{loss}(f) \;\leq\; \epsilon + \frac{\sum_{j=1}^{M} \lambda_j}{n}, \]

where the expectation is with respect to training sets T sampled from P(x, y).

The first term, ε, bounds the bias of the canonical estimator, whereas the second, (1/n) Σ_j λ_j, bounds the variance. The sum Σ_j λ_j can be thought of as a measure of the "intrinsic dimensionality" of the unlabeled data, which controls the rate of convergence. If the canonical correlation coefficients decay sufficiently rapidly, then the increase in bias is more than made up for by the decrease in variance.
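As a quick illustration of Eq. (4), the sketch below computes the canonical shrinkage estimator for one view, assuming `B1` and `lam` come from the `cca()` sketch above (fitted on unlabeled data) and that `Z1_lab` holds the view-1 features of the n labeled points; the names are illustrative, not the paper's.

```python
import numpy as np

def canonical_shrinkage_fit(Z1_lab, y, B1, lam):
    """Canonical shrinkage estimator of Eq. (4) for view 1."""
    Zbar = Z1_lab @ B1                 # labeled points expressed in the canonical basis
    beta_ols = Zbar.T @ y / len(y)     # (1/n) * sum_i zbar_{i,j} y_i for each coordinate j
    return lam * beta_ols              # shrink coordinate j by its canonical correlation

def canonical_shrinkage_predict(Z1_new, B1, beta):
    return (Z1_new @ B1) @ beta
```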
Constructing random views. We construct two views satisfying Assumption 1 in expectation, see Theorem 3 below. To ensure our method scales to large sets of unlabeled data, we use random features generated using the Nyström method [1].

Suppose we have data {x_i}_{i=1}^N. When N is very large, constructing and manipulating the N × N Gram matrix [K]_{ii'} = ⟨φ(x_i), φ(x_{i'})⟩ = κ(x_i, x_{i'}) is computationally expensive. Here φ(x) defines a mapping from R^D to a high-dimensional feature space and κ(·, ·) is a positive semi-definite kernel function. The idea behind random features is to instead define a lower-dimensional mapping z(x_i) : R^D → R^M through a random sampling scheme such that [K]_{ii'} ≈ z(x_i)^⊤ z(x_{i'}) [6, 15]. Thus, using random features, non-linear functions in x can be learned as linear functions in z(x), leading to significant computational speed-ups. Here we give a brief overview of the Nyström method, which uses random subsampling to approximate the Gram matrix.

The Nyström method. Fix an M ≪ N and randomly (uniformly) sample a subset M = {x̂_i}_{i=1}^M of M points from the data {x_i}_{i=1}^N. Let K̂ denote the Gram matrix [K̂]_{ii'} with i, i' ∈ M. The Nyström method [1, 3] constructs a low-rank approximation to the Gram matrix as

\[ [\mathbf{K}]_{ii'} \approx [\widetilde{\mathbf{K}}]_{ii'} := \big[ \kappa(\mathbf{x}_i, \hat{\mathbf{x}}_1), \ldots, \kappa(\mathbf{x}_i, \hat{\mathbf{x}}_M) \big] \, \widehat{\mathbf{K}}^{\dagger} \, \big[ \kappa(\mathbf{x}_{i'}, \hat{\mathbf{x}}_1), \ldots, \kappa(\mathbf{x}_{i'}, \hat{\mathbf{x}}_M) \big]^{\top}, \tag{5} \]

where K̂† ∈ R^{M×M} is the pseudo-inverse of K̂. Vectors of random features can be constructed as

\[ \mathbf{z}(\mathbf{x}_i) = \widehat{\mathbf{D}}^{-1/2} \widehat{\mathbf{V}}^{\top} \big[ \kappa(\mathbf{x}_i, \hat{\mathbf{x}}_1), \ldots, \kappa(\mathbf{x}_i, \hat{\mathbf{x}}_M) \big]^{\top}, \]

where the columns of V̂ are the eigenvectors of K̂ and D̂ is the diagonal matrix whose entries are the corresponding eigenvalues. Constructing features in this way reduces the time complexity of learning a non-linear prediction function from O(N^3) to O(N) [15].

An alternative perspective on the Nyström approximation, which will be useful below, is as follows. Consider the integral operators

\[ L_N[f](\cdot) := \frac{1}{N} \sum_{i=1}^{N} \kappa(\mathbf{x}_i, \cdot) f(\mathbf{x}_i) \qquad\text{and}\qquad L_M[f](\cdot) := \frac{1}{M} \sum_{i=1}^{M} \kappa(\hat{\mathbf{x}}_i, \cdot) f(\hat{\mathbf{x}}_i), \tag{6} \]

and introduce the Hilbert space Ĥ = span{φ̂_1, . . . , φ̂_r}, where r is the rank of K̂ and the φ̂_i are the first r eigenfunctions of L_M. The following proposition shows that using the Nyström approximation is equivalent to performing linear regression in the feature space ("view") z : X → Ĥ spanned by the eigenfunctions of the linear operator L_M in Eq. (6):

Proposition 2 (random Nyström view, [3]). Solving

\[ \min_{\mathbf{w} \in \mathbb{R}^r} \; \frac{1}{N} \sum_{i=1}^{N} \ell\big( \mathbf{w}^{\top} \mathbf{z}(\mathbf{x}_i), y_i \big) + \gamma \| \mathbf{w} \|^2 \tag{7} \]

is equivalent to solving

\[ \min_{f \in \widehat{\mathcal{H}}} \; \frac{1}{N} \sum_{i=1}^{N} \ell\big( f(\mathbf{x}_i), y_i \big) + \gamma \| f \|^2_{\mathcal{H}_\kappa}. \tag{8} \]
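The following sketch shows one way to build Nyström features with a Gaussian kernel, following the construction above; the kernel choice and all function names are illustrative assumptions rather than the paper's code.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def nystroem_features(X, M, sigma, rng=None):
    """Nystrom features z(x_i) = D^{-1/2} V^T [k(x_i, xhat_1), ..., k(x_i, xhat_M)]^T."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(len(X), size=M, replace=False)   # uniform subsample of M landmarks
    Xm = X[idx]
    Kmm = gaussian_kernel(Xm, Xm, sigma)              # M x M Gram matrix on the landmarks
    vals, vecs = np.linalg.eigh(Kmm)
    keep = vals > 1e-10                               # discard numerically-zero eigenvalues
    Knm = gaussian_kernel(X, Xm, sigma)               # N x M cross-kernel matrix
    Z = Knm @ vecs[:, keep] / np.sqrt(vals[keep])     # rows are z(x_i)^T; Z @ Z.T approximates K
    return Z, idx
```

By Proposition 2, ridge regression on the rows of `Z` approximates kernel ridge regression with κ.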
The XNV algorithm. Algorithm 1 details our approach to semi-supervised learning, based on generating two views consisting of Nyström random features and penalizing features which are weakly correlated across views. The setting is that we have labeled data {x_i, y_i}_{i=1}^n and a large amount of unlabeled data {x_i}_{i=n+1}^N. Step 1 generates a set of random features. The next two steps implement multi-view regression using the randomly generated views z^(1)(x) and z^(2)(x).

Algorithm 1 Correlated Nyström Views (XNV).
Input: Labeled data {x_i, y_i}_{i=1}^n and unlabeled data {x_i}_{i=n+1}^N.
1. Generate features. Sample x̂_1, . . . , x̂_{2M} uniformly from the dataset, compute the eigendecompositions of the sub-sampled kernel matrices K̂^(1) and K̂^(2), constructed from the samples 1, . . . , M and M+1, . . . , 2M respectively, and featurize the input:
\[ \mathbf{z}^{(\nu)}(\mathbf{x}_i) \leftarrow \widehat{\mathbf{D}}^{(\nu),-1/2} \widehat{\mathbf{V}}^{(\nu)\top} \big[ \kappa(\mathbf{x}_i, \hat{\mathbf{x}}_{(\nu-1)M+1}), \ldots, \kappa(\mathbf{x}_i, \hat{\mathbf{x}}_{\nu M}) \big]^{\top} \quad \text{for } \nu \in \{1, 2\}. \]
2. Unlabeled data. Compute CCA bases B^(1), B^(2) and canonical correlations λ_1, . . . , λ_M for the two views and set z̄_i ← B^(1) z^(1)(x_i).
3. Labeled data. Solve
\[ \widehat{\boldsymbol{\beta}} = \operatorname*{argmin}_{\boldsymbol{\beta}} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big( \boldsymbol{\beta}^{\top} \bar{\mathbf{z}}_i, y_i \big) + \| \boldsymbol{\beta} \|^2_{\mathrm{CCA}} + \gamma \| \boldsymbol{\beta} \|^2_2. \tag{9} \]
Output: β̂.

Eq. (9) yields a solution for which unimportant features are heavily downweighted in the CCA basis without introducing an additional tuning parameter. The further penalty on the ℓ2 norm (in the CCA basis) is introduced as a practical measure to control the variance of the estimator β̂, which can become large if there are many highly correlated features (i.e. the ratio (1 − λ_j)/λ_j ≈ 0 for large j). In practice most of the shrinkage is due to the CCA norm: cross-validation typically selects small values of γ.
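Putting the pieces together, the sketch below is one way to implement Algorithm 1 using the helpers defined earlier; it is an illustrative reading of the algorithm, not the authors' code. It solves Eq. (9) in closed form as a penalized least-squares problem, and the split of the 2M landmarks into two views is handled simply by drawing two independent subsamples.

```python
import numpy as np

def xnv_fit(X, y, n_labeled, M, sigma, gamma, seed=0):
    """Fit XNV: X holds all N points, the first n_labeled of which carry labels y."""
    rng = np.random.default_rng(seed)
    # Step 1: two Nystrom views of M random features each (two independent landmark draws).
    Z1, _ = nystroem_features(X, M, sigma, rng)
    Z2, _ = nystroem_features(X, M, sigma, rng)
    # Step 2: CCA on all (mostly unlabeled) points; project view 1 onto its canonical basis.
    B1, B2, lam = cca(Z1, Z2)
    Zbar = Z1 @ B1
    # Step 3: penalized least squares on the labeled points, Eq. (9):
    # squared loss + CCA norm + gamma * l2 penalty, solved in closed form.
    Zl, yl = Zbar[:n_labeled], y[:n_labeled]
    cca_pen = np.diag((1.0 - lam) / np.maximum(lam, 1e-12))
    A = Zl.T @ Zl / n_labeled + cca_pen + gamma * np.eye(Zl.shape[1])
    beta = np.linalg.solve(A, Zl.T @ yl / n_labeled)
    return beta, B1

def xnv_predict(Z1_new, B1, beta):
    """Predict from the view-1 Nystrom features of new points."""
    return (Z1_new @ B1) @ beta
```

In practice σ and γ would be chosen by cross-validation on the labeled points, as described in the experiments.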
Computational complexity. XNV is extremely fast. Nyström sampling, step 1, reduces the O(N^3) operations required for exact kernel learning to O(N). Computing the CCA basis, step 2, using standard algorithms requires O(NM^2) operations; however, we reduce the runtime to O(NM) by applying the recently proposed randomized CCA algorithm of [16]. Finally, step 3 is a computationally cheap regularized linear regression on n samples and M features.
Performance guarantees. The quality of the kernel approximation in Eq. (5) has been the subject of detailed study in recent years, leading to a number of strong empirical and theoretical results [3–5, 15]. Recent work by Bach [5] provides theoretical guarantees on the quality of Nyström estimates in the fixed design setting that are relevant to our approach.

Theorem 3 (Nyström generalization bound, [5]). Let ξ ∈ R^N be a random vector with finite variance and zero mean, let y = [y_1, . . . , y_N]^⊤, and define the smoothed estimate ŷ_kernel := (K + NγI)^{-1} K (y + ξ) and the smoothed Nyström estimate ŷ_Nyström := (K̃ + NγI)^{-1} K̃ (y + ξ), both computed by minimizing the MSE with ridge penalty γ. Let η ∈ (0, 1). For sufficiently large M (depending on η, see [5]), we have

\[ \mathbb{E}_{\mathcal{M}} \, \mathbb{E}_{\xi} \big[ \| \mathbf{y} - \hat{\mathbf{y}}_{\text{Nyström}} \|^2 \big] \;\leq\; (1 + 4\eta) \cdot \mathbb{E}_{\xi} \big[ \| \mathbf{y} - \hat{\mathbf{y}}_{\text{kernel}} \|^2 \big], \]

where E_M refers to the expectation over the subsampled columns used to construct K̃.

In short, the best smoothed estimators in the Nyström views are close to the optimal smoothed estimator. Since the kernel estimate is consistent, loss(f) → 0 as n → ∞. Thus, Assumption 1 holds in expectation and the generalization performance of XNV is controlled by Theorem 1. (The theorem is stated for the fixed design setting; extending it to a random design requires techniques from [17].)
Random Fourier features. An alternative approach to constructing random views is to use Fourier features instead of Nyström features in Step 1. We refer to this approach as Correlated Kitchen Sinks (XKS) after [2]. It turns out that the performance of XKS is consistently worse than that of XNV, in line with the detailed comparison presented in [3]. We therefore do not discuss Fourier features in the main text; see §SI.3 for details on implementation and experimental results.
3 Experiments

Table 1: Datasets (C: classification, R: regression).

Name          Task   N        D
abalone       C      …,089    6
adult         C      …,561    14
ailerons      R      …,154    40
bank8         C      …,096    8
bank32        C      …,096    32
cal housing   R      …,320    8
census        R      …,186    119
CPU           R      …,554    21
CT            R      …,000    385
elevators     R      …,752    18
HIVa          C      …,339    1,…
house         R      …,392    16
ibn Sina      C      …,361    92
orange        C      …,000    230
sarcos 1      R      …,484    21
sarcos 5      R      …,484    21
sarcos 7      R      …,484    21
sylva         C      …,626    216
SSSL. The SSSL (simple semi-supervised learning) algorithm proposed in [10] finds the first s eigenfunctions φ_j of the integral operator L_N in Eq. (6) and then solves

\[ \operatorname*{argmin}_{\mathbf{w} \in \mathbb{R}^s} \; \sum_{i=1}^{n} \bigg( \sum_{j=1}^{s} w_j \, \phi_j(\mathbf{x}_i) - y_i \bigg)^2, \tag{10} \]

where s is set by the user. SSSL outperforms Laplacian Regularized Least Squares [11], a state-of-the-art semi-supervised learning method, see [10]. It also has good generalization guarantees under reasonable assumptions on the distribution of eigenvalues of L_N. However, since SSSL requires computing the full N × N Gram matrix, it is extremely computationally intensive for large N. Moreover, tuning s is difficult since it is discrete.

We therefore propose SSSL_M, an approximation to SSSL. First, instead of constructing the full Gram matrix, we construct a Nyström approximation by sampling M points from the labeled and unlabeled training set. Second, instead of thresholding eigenfunctions, we use the easier-to-tune ridge penalty, which penalizes directions proportionally to the inverse square of their eigenvalues [18]. As justification, note that Proposition 2 states that the Nyström approximation to kernel regression actually solves a ridge regression problem in the span of the eigenfunctions of L_M. As M increases, the span of L_M tends towards that of L_N [15]. We will also refer to the Nyström approximation to SSSL using 2M features as SSSL_2M. See the experiments below for further discussion of the quality of the approximation.
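To make the comparison baseline concrete, here is a minimal sketch of SSSL_M as described above: Nyström features built from the pooled labeled and unlabeled points, followed by ordinary ridge regression on the labeled ones. It reuses `nystroem_features()` from the earlier sketch, and the names are again illustrative.

```python
import numpy as np

def sssl_m_fit(X, y, n_labeled, M, sigma, gamma, seed=0):
    """SSSL_M sketch: Nystrom features on all points + ridge regression on the labeled ones."""
    rng = np.random.default_rng(seed)
    Z, _ = nystroem_features(X, M, sigma, rng)       # landmarks drawn from labeled + unlabeled data
    Zl, yl = Z[:n_labeled], y[:n_labeled]
    A = Zl.T @ Zl / n_labeled + gamma * np.eye(Z.shape[1])
    w = np.linalg.solve(A, Zl.T @ yl / n_labeled)    # ridge penalty replaces eigenfunction thresholding
    return w
```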
Setup. We evaluate the performance of XNV on 18 real-world datasets, see Table 1. The datasets cover a variety of regression (denoted by R) and two-class classification (C) problems, taken from the UCI repository (http://archive.ics.uci.edu/ml/datasets.html) and other public benchmark repositories. The sarcos dataset involves predicting the joint position of a robot arm; following convention we report results on the 1st, 5th and 7th joint positions.

The SSSL algorithm was shown to exhibit state-of-the-art performance over fully and semi-supervised methods in scenarios where few labeled training examples are available [10]. However, as discussed above, SSSL does not scale up to large sets of unlabeled data, so we compare XNV to the Nyström approximations SSSL_M and SSSL_2M.

We used a Gaussian kernel for all datasets. We set the kernel width σ and the ℓ2 regularisation strength γ for each method using 5-fold cross validation on the labeled training examples. We trained all methods using a squared error loss function, ℓ(f(x_i), y_i) = (f(x_i) − y_i)^2, with M = 200 random features and n = 100, 200, . . . , 1000 randomly selected labeled training examples.
Runtime performance. The exact SSSL algorithm of [10] is not computationally feasible on large datasets, since it has time complexity O(N^3). For illustrative purposes, we compare the runtime in seconds of the exact SSSL algorithm against SSSL_M and XNV on three datasets of different sizes (computed in Matlab 7.14 on a Core i5 with 4GB memory).

[Table: runtimes in seconds of SSSL, SSSL_M and XNV on bank8, cal housing and sylva; exact SSSL is intractable on sylva.]

For the cal housing dataset, XNV exhibits a speed-up of roughly three orders of magnitude over exact SSSL. For the largest dataset, sylva, exact SSSL is computationally intractable. Importantly, the computational overhead of XNV over SSSL_M is small.
Generalization performance. We report prediction performance averaged over 100 experiments. For regression tasks we report the mean squared error (MSE) on the test set, normalized by the variance of the test outputs. For classification tasks we report the percentage of the test set that was misclassified.

The table below shows the improvement in performance of XNV over SSSL_M and SSSL_2M (taking whichever performs better out of M or 2M features on each dataset), averaged over all 18 datasets. Observe that XNV is considerably more accurate and more robust than SSSL_{M/2M}.

XNV vs SSSL_{M/2M}           n = 100   n = 200   n = 300   n = 400   n = 500
Avg reduction in error         11%       16%       15%       12%        9%
Avg reduction in std err       15%       30%       31%       33%       30%

The reduced variability is to be expected from Theorem 1.
[Figure 1 panels: (a) adult, (b) cal housing, (c) census, (d) elevators, (e) ibn Sina, (f) sarcos 5; each panel plots the prediction error of SSSL_M, SSSL_2M and XNV against the number of labeled training points (100 to 1000).]
Figure 1: Comparison of mean prediction error and standard deviation on a selection of datasets.

Table 2 presents a more detailed comparison of performance on individual datasets for n = 200 and n = 400. The plots in Figure 1 show a representative comparison of mean prediction errors for several datasets for n = 100, . . . , 1000; error bars represent one standard deviation. Observe that XNV almost always improves prediction accuracy and reduces variance compared with SSSL_M and SSSL_2M when the labeled training set contains between 100 and 500 labeled points. A complete set of results is provided in §SI.1.
Discussion of SSSL_2M. Our experiments show that going from M to 2M features does not improve generalization performance in practice. This suggests that when there are few labeled points, the coarser approximation SSSL_M to SSSL suffices.

Finally, §SI.2 compares the performance of SSSL_M and XNV to fully supervised kernel ridge regression (KRR). We observe dramatic improvements, between 48% and 63%, consistent with the results observed in [10] for the exact SSSL algorithm.
Random Fourier features. Nyström features significantly outperform Fourier features, in line with the observations in [3]. The table below shows the relative improvement of XNV over XKS:

XNV vs XKS                   n = 100   n = 200   n = 300   n = 400   n = 500
Avg reduction in error         30%       28%       26%       25%       24%
Avg reduction in std err       36%       44%       34%       37%       36%

Further results and discussion for XKS are included in the supplementary material.
Table 2: Performance (normalized MSE / classification error rate) of SSSL_M, SSSL_2M and XNV on the individual datasets for n = 200 and n = 400, with standard errors in parentheses. [The same per-dataset results appear, together with n = 100, 300 and 500, in Table 3 of the supplementary material.]

4 Conclusion
We have introduced the XNV algorithm for semi-supervised learning. By combining two randomly generated views of Nyström features via an efficient implementation of CCA, XNV outperforms the prior state-of-the-art, SSSL, by 10-15% (depending on the number of labeled points) on average over 18 datasets. Furthermore, XNV is over 3 orders of magnitude faster than SSSL on medium-sized datasets (N ≈ 10,000), with further gains as N increases. An interesting research direction is to investigate using the recently developed deep CCA algorithm, which extracts higher order correlations between views [19], as a preprocessing step.

In this work we use a uniform sampling scheme for the Nyström method for computational reasons, since it has been shown to perform well empirically relative to more expensive schemes [20]. Since CCA gives us a criterion by which to measure the importance of random features, in the future we aim to investigate active sampling schemes based on canonical correlations, which may yield better performance by selecting the most informative indices to sample.
Acknowledgements. We thank Haim Avron for help with implementing randomized CCA and Patrick Pletscher for drawing our attention to the Nyström method.

References
[1] Williams C, Seeger M: Using the Nyström method to speed up kernel machines. In NIPS, 2001.
[2] Rahimi A, Recht B: Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Adv in Neural Information Processing Systems (NIPS), 2008.
[3] Yang T, Li YF, Mahdavi M, Jin R, Zhou ZH: Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison. In NIPS, 2012.
[4] Gittens A, Mahoney MW: Revisiting the Nyström method for improved large-scale machine learning. In ICML, 2013.
[5] Bach F: Sharp analysis of low-rank kernel matrix approximations. In COLT, 2013.
[6] Rahimi A, Recht B: Random Features for Large-Scale Kernel Machines. In Adv in Neural Information Processing Systems (NIPS), 2007.
[7] Kakade SM, Foster DP: Multi-view Regression Via Canonical Correlation Analysis. In Computational Learning Theory (COLT), 2007.
[8] Hotelling H: Relations between two sets of variates. Biometrika 1936, 28:321–377.
[9] Hardoon DR, Szedmak S, Shawe-Taylor J: Canonical Correlation Analysis: An Overview with Application to Learning Methods. Neural Comp 2004, 16(12):2639–2664.
[10] Ji M, Yang T, Lin B, Jin R, Han J: A Simple Algorithm for Semi-supervised Learning with Improved Generalization Error Bound. In ICML, 2012.
[11] Belkin M, Niyogi P, Sindhwani V: Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. JMLR 2006, 7:2399–2434.
[12] Blum A, Mitchell T: Combining labeled and unlabeled data with co-training. In COLT, 1998.
[13] Chaudhuri K, Kakade SM, Livescu K, Sridharan K: Multiview clustering via Canonical Correlation Analysis. In ICML, 2009.
[14] McWilliams B, Montana G: Multi-view predictive partitioning in high dimensions. Statistical Analysis and Data Mining 2012, 5:304–321.
[15] Drineas P, Mahoney MW: On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning. JMLR 2005, 6:2153–2175.
[16] Avron H, Boutsidis C, Toledo S, Zouzias A: Efficient Dimensionality Reduction for Canonical Correlation Analysis. In ICML, 2013.
[17] Hsu D, Kakade SM, Zhang T: An Analysis of Random Design Linear Regression. In COLT, 2012.
[18] Dhillon PS, Foster DP, Kakade SM, Ungar LH: A Risk Comparison of Ordinary Least Squares vs Ridge Regression. Journal of Machine Learning Research 2013, 14:1505–1511.
[19] Andrew G, Arora R, Bilmes J, Livescu K: Deep Canonical Correlation Analysis. In ICML, 2013.
[20] Kumar S, Mohri M, Talwalkar A: Sampling methods for the Nyström method. JMLR 2012, 13:981–1006.

Supplementary Information

SI.1 Complete XNV results
[Table 3: Performance (normalized MSE / classification error rate) of SSSL_M, SSSL_2M and XNV on each of the 18 datasets for n = 100, 200, 300, 400 and 500, with standard errors in parentheses.]
[Figure 2: Comparison of mean prediction error and standard deviation of SSSL_M, SSSL_2M and XNV against the number of labeled training points (100 to 1000) on all 18 datasets: (a) abalone, (b) adult, (c) ailerons, (d) bank8, (e) bank32, (f) cal housing, (g) census, (h) CPU, (i) CT, (j) elevators, (k) HIVa, (l) house, (m) ibn Sina, (n) orange, (o) sarcos 1, (p) sarcos 5, (q) sarcos 7, (r) sylva.]
SI.2 Comparison with Kernel Ridge Regression

We compare SSSL_M and XNV to fully supervised kernel ridge regression (KRR). The table below reports the percentage improvement in mean error of both methods over KRR, averaged over the 18 datasets according to the experimental procedure detailed in §3; a minimal KRR baseline is sketched after the table. The parameters σ (kernel width) and γ (ridge penalty) for KRR were chosen by 5-fold cross validation. We observe that both SSSL_M and XNV far outperform KRR, by 48–63%. Importantly, this shows that our approximation to SSSL far outperforms the fully supervised baseline.
SSSL_M and XNV vs KRR                n = 100   n = 200   n = 300   n = 400   n = 500
Avg reduction in error for SSSL_M      48%       52%       56%       58%       60%
Avg reduction in error for XNV         56%       62%       63%       63%       63%
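For reference, a minimal exact KRR baseline corresponding to the comparison above might look as follows; it reuses `gaussian_kernel()` from the earlier sketch and is an illustrative implementation, not the one used for the reported numbers. Exact KRR is cubic in the number of labeled points, which is feasible here because n ≤ 500 in this comparison.

```python
import numpy as np

def krr_fit(X_lab, y_lab, sigma, gamma):
    """Exact kernel ridge regression on the labeled points only."""
    n = len(y_lab)
    K = gaussian_kernel(X_lab, X_lab, sigma)
    return np.linalg.solve(K + gamma * n * np.eye(n), y_lab)   # dual coefficients alpha

def krr_predict(X_lab, alpha, X_new, sigma):
    return gaussian_kernel(X_new, X_lab, sigma) @ alpha
```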
SI.3 Random Fourier features

Random Fourier features are a method for approximating shift-invariant kernels [6], i.e. kernels for which κ(x_i, x_{i'}) = κ(x_i − x_{i'}). Such a kernel function can be represented in terms of its inverse Fourier transform as

\[ \kappa(\mathbf{x}_i - \mathbf{x}_{i'}) = \int_{\mathbb{R}^D} P(\boldsymbol{\omega}) \, e^{j \boldsymbol{\omega}^{\top} (\mathbf{x}_i - \mathbf{x}_{i'})} \, d\boldsymbol{\omega}. \]

Here P(ω) is the Fourier transform of κ, which is guaranteed to be a proper probability distribution, and so for real-valued features κ(x_i, x_{i'}) can be equivalently interpreted as E_ω[z(x_i)^⊤ z(x_{i'})], where z(x_i) = √2 cos(ω^⊤ x_i + b). Replacing the expectation by a sample average leads to a scheme for constructing random features. In particular, a Gaussian kernel of width σ has a Fourier transform which is also Gaussian. Sampling ω_m ∼ N(0, σ^{−2} I_D) and b_m ∼ Unif[−π, π], we can construct features whose inner product approximates this kernel as

\[ \mathbf{z}_i = \sqrt{\tfrac{2}{M}} \, \big[ \cos(\boldsymbol{\omega}_1^{\top} \mathbf{x}_i + b_1), \ldots, \cos(\boldsymbol{\omega}_M^{\top} \mathbf{x}_i + b_M) \big]. \]

It was recently shown how both random Fourier features and the Nyström approximation can be cast in the same framework [3]. A major difference between the methods lies in the sampling scheme employed. Random Fourier features are constructed in a data-independent fashion, which makes them extremely cheap to compute. Nyström features are constructed in a data-dependent way, which leads to improved performance but is, in the case of semi-supervised learning, more expensive, since we need to evaluate the approximate kernel for all unlabeled points we wish to use.
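A minimal sketch of this construction for a Gaussian kernel of width sigma, assuming the N(0, σ^{−2} I_D) spectral distribution above; the function name and defaults are illustrative.

```python
import numpy as np

def fourier_features(X, M, sigma, rng=None):
    """Random Fourier features: z(x)^T z(x') approximates a Gaussian kernel of width sigma."""
    rng = np.random.default_rng() if rng is None else rng
    D = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(D, M))   # omega_m ~ N(0, sigma^{-2} I_D)
    b = rng.uniform(-np.pi, np.pi, size=M)           # b_m ~ Unif[-pi, pi]
    return np.sqrt(2.0 / M) * np.cos(X @ W + b)
```

Swapping these features for the Nyström ones in the earlier XNV sketch gives the XKS variant described in Algorithm 2.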
Algorithm 2 details Correlated Kitchen Sinks (XKS). This algorithm generates random views using the random Fourier features procedure in step 1. Steps 2 and 3 proceed exactly as in Algorithm 1.

Algorithm 2 Correlated Kitchen Sinks (XKS).
Input: Labeled data {x_i, y_i}_{i=1}^n and unlabeled data {x_i}_{i=n+1}^N.
1. Generate features. Draw ω_1, . . . , ω_{2M} i.i.d. from P(ω) and featurize the input:
\[ \mathbf{z}^{(1)}_i \leftarrow \big[ \phi(\mathbf{x}_i; \boldsymbol{\omega}_1), \ldots, \phi(\mathbf{x}_i; \boldsymbol{\omega}_M) \big], \qquad \mathbf{z}^{(2)}_i \leftarrow \big[ \phi(\mathbf{x}_i; \boldsymbol{\omega}_{M+1}), \ldots, \phi(\mathbf{x}_i; \boldsymbol{\omega}_{2M}) \big]. \]
2. Unlabeled data. Compute CCA bases B^(1), B^(2) and canonical correlations λ_1, . . . , λ_M for the two views and set z̄_i ← B^(1) z^(1)_i.
3. Labeled data. Solve
\[ \widehat{\boldsymbol{\beta}} = \operatorname*{argmin}_{\boldsymbol{\beta}} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big( \boldsymbol{\beta}^{\top} \bar{\mathbf{z}}_i, y_i \big) + \| \boldsymbol{\beta} \|^2_{\mathrm{CCA}} + \gamma \| \boldsymbol{\beta} \|^2_2. \tag{11} \]
Output: β̂.
XKS is consistently outperformed by
XNV in practice.12
SI.4 Complete XKS results

For completeness we report on the performance of the XKS algorithm. We use the same experimental setup as in Section 3. We compare the performance of XKS against a linear machine learned using M and 2M random Fourier features respectively (RFF_M and RFF_2M).

Table 4: Average performance of XKS against RFF_{M/2M} on 18 datasets.

XKS vs RFF_{M/2M}            n = 100   n = 200   n = 300   n = 400   n = 500
Avg reduction in error         15%       30%       34%       31%       28%
Avg reduction in std err       -1%       35%       47%       43%       44%

Table 4 shows the performance improvement of XKS over RFF_{M/2M}, averaged across the 18 datasets. Table 6 compares the prediction error and standard deviation for each of the datasets individually. Figure 3 shows the performance across the full range of values of n for all datasets. The relative performance of XKS against RFF_M and RFF_2M follows the same trend seen in Section 3, suggesting that CCA-based regression consistently improves on regression over single and joint views.

Table 5: Number of datasets (out of 18) on which XNV outperforms XKS.

                n = 100   n = 200   n = 300   n = 400   n = 500
                  16        16        15        16        16

Finally, Table 5 compares the performance of correlated Nyström features against correlated kitchen sinks. XNV typically outperforms XKS on 16 out of 18 datasets, with XKS only ever outperforming XNV on bank8, house and orange. Since XNV almost always outperforms XKS, we only discuss Nyström features in the main text.
[Table 6: Performance of XKS (normalized MSE / classification error rate) against RFF_M and RFF_2M on each of the 18 datasets for n = 100, 200, 300, 400 and 500, with standard errors in parentheses.]
[Figure 3: Comparison of mean prediction error and standard deviation of RFF_M, RFF_2M and XKS against the number of labeled training points (100 to 1000) on all 18 datasets: (a) abalone, (b) adult, (c) ailerons, (d) bank8, (e) bank32, (f) cal housing, (g) census, (h) CPU, (i) CT, (j) elevators, (k) HIVa, (l) house, (m) ibn Sina, (n) orange, (o) sarcos 1, (p) sarcos 5, (q) sarcos 7, (r) sylva.]