Hypothesis Transfer Learning via Transformation Functions

Simon S. Du, Jayanth Koushik, Aarti Singh, Barnabás Póczos
Carnegie Mellon University
Abstract
We consider the Hypothesis Transfer Learning (HTL) problem, where one incorporates a hypothesis trained on the source domain into the learning procedure of the target domain. Existing theoretical analyses either study only specific algorithms, or present only upper bounds on the generalization error but not on the excess risk. In this paper, we propose a unified algorithm-dependent framework for HTL through a novel notion of transformation function, which characterizes the relation between the source and the target domains. We conduct a general risk analysis of this framework and, in particular, we show for the first time that if the two domains are related, HTL enjoys faster convergence rates of the excess risk for Kernel Smoothing and Kernel Ridge Regression than those of the classical non-transfer learning settings. Experiments on real world data demonstrate the effectiveness of our framework.
In a classical transfer learning setting, we have a large amount of data from a source domain and a relatively small amount of data from a target domain. These two domains are related but not necessarily identical, and the usual assumption is that the hypothesis learned from the source domain is useful in the learning task of the target domain. In this paper, we focus on the regression problem where the functions we want to estimate in the source and the target domains are different but related. Figure 1a shows a 1D toy example of this setting, where the source function is $f^{so}(x) = \sin(4\pi x)$ and the target function is $f^{ta}(x) = \sin(4\pi x) + 4\pi x$. Many real world problems can be formulated as transfer learning problems. For example, in the task of predicting the reaction time of an individual from his/her fMRI images, we have data from many subjects, but each subject contributes only a small number of data points. To learn the mapping from neural images to the reaction time of a specific subject, we can treat all but this subject as the source domain, and this subject as the target domain. In Section 6, we show how our proposed method helps us learn this mapping more accurately.

This paradigm, hypothesis transfer learning (HTL), has been explored empirically with success in many applications [Fei-Fei et al., 2006, Yang et al., 2007, Orabona et al., 2009, Tommasi et al., 2010, Kuzborskij et al., 2013, Wang and Schneider, 2014]. Kuzborskij and Orabona [2013, 2016] pioneered the theoretical analysis of HTL for linear regression, and recently Wang and Schneider [2015] analyzed Kernel Ridge Regression. However, most existing works only provide generalization bounds, i.e. bounds on the difference between the true risk and the training error or the leave-one-out error. These analyses are not complete because minimizing the generalization error does not necessarily reduce the true risk. Further, these works often rely on a particular form of transformation from the source domain to the target domain. For example, Wang and Schneider [2015] studied the offset transformation: instead of estimating the target domain function directly, they learn the residual between the target domain function and the source domain function. It is natural to ask what happens if we use other transformation functions, and how that choice affects the risk on the target domain.

[Figure 1: Experimental results on synthetic data. (a) A toy example of transfer learning: many more samples are available from the source domain than from the target domain. (b) Transfer learning with the Offset Transformation. (c) Transfer learning with the Scale Transformation.]

In this paper, we propose a general framework for HTL. Instead of analyzing a specific form of transfer, we treat it as an input of our learning algorithm. We call this input a transformation function since, intuitively, it captures the relevance between the two domains. This framework unifies many previous works [Wang and Schneider, 2014, Kuzborskij and Orabona, 2013, Wang et al., 2016] and naturally induces a class of new learning procedures. Theoretically, we develop an excess risk analysis for this framework. The performance depends on the stability [Bousquet and Elisseeff, 2002] of the algorithm used as a subroutine: if the algorithm is stable, then the estimation error in the source domain will not affect the estimation in the target domain much. To our knowledge, this connection was first established by Kuzborskij et al. [2013] in the linear regression setting, but here we generalize it to a broader context. In particular, we provide explicit risk bounds for two widely used nonlinear estimators, Kernel Smoothing (KS) and Kernel Ridge Regression (KRR), as subroutines.
To the best of our knowledge, these are the first results showing that when two domains are related, transfer learning techniques achieve a faster statistical convergence rate of the excess risk than non-transfer learning for kernel based methods. Further, we accompany this framework with a theoretical analysis showing that a small amount of data for cross-validation enables us to (1) avoid using HTL when it is not useful, and (2) choose the best transformation function as input from a large pool.

The rest of the paper is organized as follows. In Section 2 we introduce HTL and provide the necessary background for KS and KRR. We formalize our transformation function based framework in Section 3. Our main theoretical results are in Section 4; specifically, in Section 4.1 and Section 4.2 we provide explicit risk bounds for KS and KRR, respectively. In Section 5 we analyze cross-validation in the HTL setting, and in Section 6 we conduct experiments on real world data. We conclude with a brief discussion of avenues for future work.
In this paper, we assume both $X \in \mathbb{R}^d$ and $Y \in \mathbb{R}$ lie in compact subsets: $\|X\| \le \Delta_X$, $|Y| \le \Delta_Y$ for some $\Delta_X, \Delta_Y \in \mathbb{R}_+$. Throughout the paper, we use $T = \{(X_i, Y_i)\}_{i=1}^n$ to denote a set of samples. Let $(X^{so}, Y^{so})$ be a sample from the source domain, and $(X^{ta}, Y^{ta})$ a sample from the target domain. In our setting, there are $n_{so}$ samples drawn i.i.d. from the source distribution, $T^{so} = \{(X_i^{so}, Y_i^{so})\}_{i=1}^{n_{so}}$, and $n_{ta}$ samples drawn i.i.d. from the target distribution, $T^{ta} = \{(X_i^{ta}, Y_i^{ta})\}_{i=1}^{n_{ta}}$. In addition, we also use $n_{val}$ samples drawn i.i.d. from the target domain for cross-validation. We model the joint relation between $X$ and $Y$ by $Y^{so} = f^{so}(X^{so}) + \epsilon^{so}$ and $Y^{ta} = f^{ta}(X^{ta}) + \epsilon^{ta}$, where $f^{so}$ and $f^{ta}$ are regression functions and the noise terms are i.i.d. with $\mathbb{E}[\epsilon^{so}] = \mathbb{E}[\epsilon^{ta}] = 0$. We formally define the transformation functions in Section 3. We use $A: T \mapsto \hat{f}$ to denote an algorithm that takes a set of samples and produces an estimator. Given an estimator $\hat{f}$, we define the integrated $L_2$ risk as $R(\hat{f}) = \mathbb{E}[(\hat{f}(X) - Y)^2]$, where the expectation is taken over the distribution of $(X, Y)$. Similarly, the empirical $L_2$ risk on a set of samples $T$ is defined as $\hat{R}(\hat{f}) = \frac{1}{n}\sum_{i=1}^n (Y_i - \hat{f}(X_i))^2$. In the HTL setting, we use $\hat{f}^{so}$, an estimator from the source domain, to facilitate the learning procedure for $f^{ta}$.

We say a function $f$ is in the $(\lambda, \alpha)$ Hölder class [Wasserman, 2006] if for any $x, x' \in \mathbb{R}^d$, $f$ satisfies $|f(x) - f(x')| \le \lambda \|x - x'\|^\alpha$ for some $\alpha \in (0, 1]$. The kernel smoothing method uses a positive kernel $K$ on $[0, 1]$ that is highest at $0$, decreasing on $[0, 1]$, zero outside $[0, 1]$, and satisfies $\int_{\mathbb{R}^d} u^2 K(u)\,du < \infty$.
Using $T = \{(X_i, Y_i)\}_{i=1}^n$, the kernel smoothing estimator is defined as $\hat{f}(x) = \sum_{i=1}^n w_i(x) Y_i$, where $w_i(x) = \frac{K(\|x - X_i\|/h)}{\sum_{j=1}^n K(\|x - X_j\|/h)} \in [0, 1]$.

Another popular nonlinear estimator is kernel ridge regression (KRR), which uses the theory of reproducing kernel Hilbert spaces (RKHS) for regression [Vovk, 2013]. Any symmetric positive semidefinite kernel function $K: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ defines an RKHS $\mathcal{H}$. For each $x \in \mathbb{R}^d$, the function $z \mapsto K(z, x)$ is contained in the Hilbert space $\mathcal{H}$; moreover, the Hilbert space is endowed with an inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ such that $K(\cdot, x)$ acts as the kernel of the evaluation functional, meaning $\langle f, K(x, \cdot) \rangle_{\mathcal{H}} = f(x)$ for $f \in \mathcal{H}$. In this paper we assume $K$ is bounded: $\sup_{x \in \mathbb{R}^d} K(x, x) = k < \infty$. Given the inner product, the $\mathcal{H}$ norm of a function $g \in \mathcal{H}$ is defined as $\|g\|_{\mathcal{H}} \triangleq \sqrt{\langle g, g \rangle_{\mathcal{H}}}$, and similarly the $L_2$ norm is $\|g\|_2 \triangleq \left(\int_{\mathbb{R}^d} g(x)^2 \, dP_X\right)^{1/2}$ for a given $P_X$. The kernel also induces an integral operator $T_K: L_2(P_X) \to L_2(P_X)$, $T_K[f](x) = \int_{\mathbb{R}^d} K(x', x) f(x') \, dP_X(x')$, with countably many non-zero eigenvalues $\{\mu_i\}_{i \ge 1}$. For a given function $f$, the approximation error is defined as $A_f(\lambda) \triangleq \inf_{h \in \mathcal{H}} \left(\|h - f\|^2_{L_2(P_X)} + \lambda \|h\|^2_{\mathcal{H}}\right)$ for $\lambda \ge 0$. Finally, the estimated function evaluated at a point $x$ can be written as $\hat{f}(x) = K(\mathbf{X}, x)^\top (K(\mathbf{X}, \mathbf{X}) + n\lambda I)^{-1} \mathbf{Y}$, where $\mathbf{X} \in \mathbb{R}^{n \times d}$ contains the inputs of the training samples and $\mathbf{Y} \in \mathbb{R}^{n \times 1}$ the training labels [Vovk, 2013].

Before we present our framework, it is helpful to give a brief overview of the existing literature on theoretical analyses of transfer learning. Many previous works focused on settings where only unlabeled data from the target domain are available [Huang et al., 2006, Sugiyama et al., 2008, Yu and Szepesvári, 2012].
In particular, a line of research has been established based on distribution discrepancy, a loss-induced metric between the source and target distributions [Mansour et al., 2009, Ben-David et al., 2007, Blitzer et al., 2008, Cortes and Mohri, 2011, Mohri and Medina, 2012]. For example, recently Cortes and Mohri [2014] gave generalization bounds for kernel based methods under convex losses in terms of discrepancy.

In many real world applications, such as yield prediction from pictures [Nuske et al., 2014] or prediction of response time from fMRI [Verstynen, 2014], some labeled data from the target domain is also available. Cortes et al. [2015] used such data to improve their discrepancy minimization algorithm. Zhang et al. [2013] focused on modeling target shift ($P(Y)$ changes), conditional shift ($P(X|Y)$ changes), and a combination of both. Recently, Wang and Schneider [2014] proposed a kernel mean embedding method to match the conditional probability in the kernel space and later derived a generalization bound for this problem [Wang and Schneider, 2015]. Kuzborskij and Orabona [2013, 2016] and Kuzborskij et al. [2016] gave excess risk bounds for a target domain estimator in the form of a linear combination of estimators from multiple source domains plus an additional linear function. Ben-David and Urner [2013] showed a similar bound in the same setting with different quantities capturing the relatedness. Wang et al. [2016] showed that if the features of the source and target domains lie in $[0,1]^d$, then, using an orthonormal basis function estimator, transfer learning achieves a better excess risk guarantee if $f^{ta} - f^{so}$ can be approximated by the basis functions more easily than $f^{ta}$. Their work can be viewed as a special case of our framework with the transformation function $G(a, b) = a + b$.

In this section, we first define our class of models and give a meta-algorithm to learn the target regression function.
Our models are based on the idea that transfer learning is helpful when one transforms the target domain regression problem into a simpler regression problem using source domain knowledge. Consider the following example.
Example: Offset Transfer.
Let $f^{so}(x) = \sqrt{x(1-x)} \sin\left(\frac{2.1\pi}{x + 0.05}\right)$ and $f^{ta}(x) = f^{so}(x) + x$. Here $f^{so}$ is the so-called Doppler function. It requires a large number of samples to estimate well because of its lack of smoothness [Wasserman, 2006]. For the same reason, $f^{ta}$ is also difficult to estimate directly. However, if we have enough data from the source domain, we can obtain a fairly good estimate of $f^{so}$. Further, notice that the offset function $w(x) = f^{ta}(x) - f^{so}(x) = x$ is just a linear function. Thus, instead of directly using $T^{ta}$ to estimate $f^{ta}$, we can use the target domain samples to find an estimate of $w(x)$, denoted by $\hat{w}(x)$, and our estimator for the target domain is simply $\hat{f}^{ta}(x) = \hat{f}^{so}(x) + \hat{w}(x)$. Figure 1b shows that this technique gives an improved fit for $f^{ta}$.

The previous example exploits the fact that the function $w(x) = f^{ta}(x) - f^{so}(x)$ is simpler than $f^{ta}$. Now we generalize this idea. Formally, we define the transformation function as $G(a, b): \mathbb{R}^2 \to \mathbb{R}$, where we assume that given $a \in \mathbb{R}$, $G(a, \cdot)$ is invertible. Here $a$ will be the regression function of the source domain evaluated at some point, and the output of $G$ will be the regression function of the target domain evaluated at the same point. Let $G_a^{-1}(\cdot)$ denote the inverse of $G(a, \cdot)$, so that $G(a, G_a^{-1}(c)) = c$. For example, if $G(a, b) = a + b$, then $G_a^{-1}(c) = c - a$. For a given $G$ and a pair $(f^{so}, f^{ta})$, they together induce a function $w_G(x) = G_{f^{so}(x)}^{-1}(f^{ta}(x))$. In the offset transfer example, $w_G(x) = x$. By this definition, for any $x$ we have $G(f^{so}(x), w_G(x)) = f^{ta}(x)$. We call $w_G$ the auxiliary function of the transformation function $G$. In the HTL setting, $G$ is a user-defined transformation that represents the user's prior knowledge of the relation between the source and target domains. We now list some other examples. Example: Scale-Transfer.
Consider $G(a, b) = ab$. This transformation function is useful when $f^{so}$ and $f^{ta}$ are related by a smooth scaling. For example, if $f^{ta} = c f^{so}$ for some constant $c$, then $w_G(x) = c$, because $f^{ta}(x) = G(f^{so}(x), w_G(x)) = f^{so}(x) w_G(x) = f^{so}(x)\, c$. See Figure 1c. Example: Non-Transfer.
Consider $G(a, b) = b$. Notice that $f^{ta}(x) = w_G(x)$, so $f^{so}$ is irrelevant. Thus this model is equivalent to traditional regression on the target domain, since data from the source domain does not help.

Given the transformation $G$ and data, we provide a general procedure to estimate $f^{ta}$. The spirit of the algorithm is to turn learning a complex function $f^{ta}$ into learning an easier function $w_G$. First, we use an algorithm $A^{so}$ that takes $T^{so}$ to obtain $\hat{f}^{so}$. Since we have sufficient data from the source domain, $\hat{f}^{so}$ should be close to the true regression function $f^{so}$. Second, we construct a new data set using the $n_{ta}$ data points from the target domain: $T^{w_G} = \{(X_i^{ta}, H_G(\hat{f}^{so}(X_i^{ta}), Y_i^{ta}))\}_{i=1}^{n_{ta}}$, where $H_G: \mathbb{R}^2 \to \mathbb{R}$ satisfies $\mathbb{E}[H_G(f^{so}(X_i^{ta}), Y_i^{ta})] = G_{f^{so}(X_i^{ta})}^{-1}(f^{ta}(X_i^{ta})) = w_G(X_i^{ta})$, where the expectation is taken over $\epsilon^{ta}$. Thus, we can use these newly constructed data to learn $w_G$ with an algorithm $A^{w_G}$: $\hat{w}_G = A^{w_G}(T^{w_G})$. Finally, we plug the trained $\hat{f}^{so}$ and $\hat{w}_G$ into the transformation $G$ to obtain an estimator for $f^{ta}$: $\hat{f}^{ta}(X) = G(\hat{f}^{so}(X), \hat{w}_G(X))$. Pseudocode is shown in Algorithm 1.
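As an illustration, the whole procedure fits in a few lines of code. The following is a minimal sketch, not the paper's implementation: it uses kernel smoothing with a box kernel as both subroutines and the offset transformation $G(a, b) = a + b$ from Figure 1; all function names, bandwidths, and sample sizes are our own choices.

```python
import numpy as np

def kernel_smooth(X_train, y_train, h):
    """Kernel smoothing with a box kernel K(u) = 1{u <= 1}: the estimate at x
    is the average of the Y_i whose X_i lie within distance h of x."""
    def f_hat(x):
        w = (np.abs(X_train - x) / h <= 1.0).astype(float)
        if w.sum() == 0:                 # empty window: fall back to global mean
            return float(y_train.mean())
        return float(np.dot(w, y_train) / w.sum())
    return np.vectorize(f_hat)

def transform_transfer(X_so, y_so, X_ta, y_ta, G, H_G, h_so, h_w):
    """Sketch of the meta-algorithm: fit f_so on the source data, construct the
    labels W_i = H_G(f_so_hat(X_i), Y_i), fit the auxiliary function w_G on
    them, and return the composition G(f_so_hat(.), w_G_hat(.))."""
    f_so_hat = kernel_smooth(X_so, y_so, h_so)      # step 1: train on source
    W = H_G(f_so_hat(X_ta), y_ta)                   # step 2: construct T^{w_G}
    w_hat = kernel_smooth(X_ta, W, h_w)             # step 3: train w_G
    return lambda x: G(f_so_hat(x), w_hat(x))       # step 4: compose

# Offset transfer G(a, b) = a + b, for which H_G(a, y) = y - a is unbiased.
rng = np.random.default_rng(0)
f_so = lambda x: np.sin(4 * np.pi * x)
f_ta = lambda x: np.sin(4 * np.pi * x) + 4 * np.pi * x
X_so = rng.uniform(0, 1, 500); y_so = f_so(X_so) + 0.1 * rng.standard_normal(500)
X_ta = rng.uniform(0, 1, 40);  y_ta = f_ta(X_ta) + 0.1 * rng.standard_normal(40)

f_ta_hat = transform_transfer(X_so, y_so, X_ta, y_ta,
                              G=lambda a, b: a + b, H_G=lambda a, y: y - a,
                              h_so=0.03, h_w=0.3)
grid = np.linspace(0.05, 0.95, 50)
mse = float(np.mean((f_ta_hat(grid) - f_ta(grid)) ** 2))
```

Note that the auxiliary function $w(x) = 4\pi x$ can be estimated with a much larger bandwidth than the oscillating $f^{ta}$ itself, which is exactly the source of the gains analyzed in Section 4.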
Unbiased Estimator $H_G(f^{so}(X^{ta}), Y^{ta})$: In Algorithm 1, we require an unbiased estimator for $w_G(X^{ta})$. Note that if $G(a, b)$ is linear in $b$ or $\epsilon^{ta} = 0$, we can simply set $H_G(f^{so}(X), Y) = G_{f^{so}(X)}^{-1}(Y)$. For other scenarios, $G_{f^{so}(X_i^{ta})}^{-1}(Y_i^{ta})$ is biased, $\mathbb{E}[G_{f^{so}(X_i^{ta})}^{-1}(Y_i^{ta})] \ne G_{f^{so}(x)}^{-1}(f^{ta}(x))$, and we need to design an estimator using the structure of $G$.

Algorithm 1: Transformation Function based Transfer Learning
Inputs: source domain data $T^{so} = \{(X_i^{so}, Y_i^{so})\}_{i=1}^{n_{so}}$; target domain data $T^{ta} = \{(X_i^{ta}, Y_i^{ta})\}_{i=1}^{n_{ta}}$; transformation function $G$; algorithm $A^{so}$ to train $f^{so}$; algorithm $A^{w_G}$ to train $w_G$; unbiased estimator $H_G$ for estimating $w_G$.
Output: regression function $\hat{f}^{ta}$ for the target domain.
1. Train the source domain regression function $\hat{f}^{so} = A^{so}(T^{so})$.
2. Construct new data using $\hat{f}^{so}$ and $T^{ta}$: $T^{w_G} = \{(X_i^{ta}, W_i)\}_{i=1}^{n_{ta}}$, where $W_i = H_G(\hat{f}^{so}(X_i^{ta}), Y_i^{ta})$.
3. Train the auxiliary function $\hat{w}_G = A^{w_G}(T^{w_G})$.
4. Output the estimated regression function for the target domain: $\hat{f}^{ta}(X) = G(\hat{f}^{so}(X), \hat{w}_G(X))$.

Remark 1:
Many transformation functions are equivalent to a transformation function $G'(a, b)$ that is linear in $b$. For example, for $G(a, b) = ab^2$, i.e., $f^{ta}(x) = f^{so}(x) w_G(x)^2$, consider $G'(a, b) = ab$, where $b$ in $G'$ stands for $b^2$ in $G$, i.e., $f^{ta}(x) = f^{so}(x) w_{G'}(x)$. Therefore $w_{G'} = w_G^2$, and we only need to estimate $w_{G'}$ well instead of estimating $w_G$. More generally, if $G(a, b)$ can be factorized as $G(a, b) = g_1(a) g_2(b)$, i.e., $f^{ta}(x) = g_1(f^{so}(x)) g_2(w_G(x))$, we only need to estimate $g_2(w_G(x))$, and the convergence rate depends on the structure of $g_2(w_G(x))$. Remark 2:
When $G$ is not linear in $b$ and $\epsilon^{ta} \ne 0$, observe that in Algorithm 1 we treat the $Y_i^{ta}$'s as noisy covariates to estimate the $w_G(X_i)$'s. This problem is called error-in-variables or measurement error, and has been widely studied in the statistics literature. For details, we refer the reader to the seminal work by Carroll et al. [2006]. There is no universal estimator for the measurement error problem. In Section B, we provide a common technique, regression calibration, to deal with the measurement error problem.

In this section, we present theoretical analyses for the proposed class of models and estimators. First, we need to impose some conditions on $G$. The first assures that if the estimates of $f^{so}$ and $w_G$ are close to the source regression and auxiliary functions, then our estimator for $f^{ta}$ is close to the true target regression function. The second assures that we are estimating a regular function.

Assumption 1 $G(a, b)$ is $L$-Lipschitz, $|G(a, b) - G(a', b')| \le L \|(a, b) - (a', b')\|$, and is invertible with respect to $b$ given $a$, i.e. if $G(x, y) = z$ then $G_x^{-1}(z) = y$. Assumption 2
Given $G$, the induced auxiliary function $w_G$ is bounded: for all $x$ with $\|x\| \le \Delta_X$, $|w_G(x)| \le B$ for some $B > 0$.

Offset Transfer and Non-Transfer satisfy these conditions with $L = 1$ and $B = \Delta_Y$. Scale Transfer satisfies these assumptions when $f^{so}$ is bounded away from $0$. Lastly, we assume our unbiased estimator is also regular. Assumption 3
For all $x$ with $\|x\| \le \Delta_X$ and $y$ with $|y| \le \Delta_Y$, $|H_G(x, y)| \le B$ for some $B > 0$, and $H_G$ is Lipschitz continuous in its first argument: $|H_G(x, y) - H_G(x', y)| \le L |x - x'|$ for some $L > 0$.

We begin with a general result which only requires the stability of $A^{w_G}$. Theorem 1
Suppose that for any two sets of samples with the same features but different labels, $T = \{(X_i^{ta}, W_i)\}_{i=1}^{n_{ta}}$ and $\widetilde{T} = \{(X_i^{ta}, \widetilde{W}_i)\}_{i=1}^{n_{ta}}$, the algorithm $A^{w_G}$ for training $w_G$ satisfies
$$\left\| A^{w_G}(T) - A^{w_G}(\widetilde{T}) \right\|_\infty \le \sum_{i=1}^{n_{ta}} c_i(X_i^{ta}) \left| W_i - \widetilde{W}_i \right|, \qquad (1)$$
where $c_i$ depends only on $X_i^{ta}$. Then for any $x$,
$$\left| \hat{f}^{ta}(x) - f^{ta}(x) \right| = O\left( \left| \hat{f}^{so}(x) - f^{so}(x) \right| + \left| \widetilde{w}_G(x) - w_G(x) \right| + \sum_{i=1}^{n_{ta}} c_i(X_i^{ta}) \left| \hat{f}^{so}(X_i^{ta}) - f^{so}(X_i^{ta}) \right| \right),$$
where $\widetilde{w}_G = A^{w_G}\left(\{(X_i^{ta}, H_G(f^{so}(X_i^{ta}), Y_i^{ta}))\}_{i=1}^{n_{ta}}\right)$ is the estimated auxiliary function trained using the true source domain regression function.

Theorem 1 shows how the estimation error in the source domain propagates to our estimate of the target domain function. Notice that if we happen to know $f^{so}$, then the error is bounded by $O(|\widetilde{w}_G(x) - w_G(x)|)$, the estimation error of $w_G$. However, since we use the estimated $f^{so}$ to construct training samples for $w_G$, the error might accumulate as $n_{ta}$ increases.
Though the third term in Theorem 1 might increase with $n_{ta}$, it also depends on the estimation error of $f^{so}$, which is relatively small because of the large amount of source domain data. The stability condition (1) we use is related to the uniform stability introduced by Bousquet and Elisseeff [2002]: they consider how much the output changes when one of the training instances is removed or replaced by another, whereas our condition compares two different training data sets. The connection between transfer learning and stability has been observed by Kuzborskij and Orabona [2013], Liu et al. [2016], and Zhang [2015] in different settings, but they only showed bounds for generalization, not for the excess risk. We first analyze the kernel smoothing method.
Theorem 2
Suppose the support of $X^{ta}$ is a subset of the support of $X^{so}$, and the probability densities of $P_{X^{so}}$ and $P_{X^{ta}}$ are uniformly bounded away from $0$ on their supports. Further assume $f^{so}$ is $(\lambda_{so}, \alpha_{so})$ Hölder and $w_G$ is $(\lambda_{w_G}, \alpha_{w_G})$ Hölder. If we use kernel smoothing estimators for $f^{so}$ and $w_G$ with bandwidths $h_{so} \asymp n_{so}^{-1/(2\alpha_{so} + d)}$ and $h_{w_G} \asymp n_{ta}^{-1/(2\alpha_{w_G} + d)}$, then with probability at least $1 - \delta$ the risk satisfies
$$\mathbb{E}\left[ R\left(\hat{f}^{ta}\right) \right] - R\left(f^{ta}\right) = O\left( \left( n_{so}^{-\frac{2\alpha_{so}}{2\alpha_{so} + d}} + n_{ta}^{-\frac{2\alpha_{w_G}}{2\alpha_{w_G} + d}} \right) \log\left(\frac{1}{\delta}\right) \right),$$
where the expectation is taken over $T^{so}$ and $T^{ta}$.

Theorem 2 suggests that the risk depends on two sources of error, one from the estimation of $f^{so}$ and one from the estimation of $w_G$. The first term is relatively small in the setting we focus on, since in typical transfer learning scenarios $n_{so} \gg n_{ta}$. The second term shows the power of transfer learning in transforming a possibly complex target regression function into a simpler auxiliary function. It is well known that learning $f^{ta}$ using only the target domain incurs a risk of order $\Omega\left(n_{ta}^{-2\alpha_{f^{ta}}/(2\alpha_{f^{ta}} + d)}\right)$. Thus, if the auxiliary function is smoother than the target regression function, i.e. $\alpha_{w_G} > \alpha_{f^{ta}}$, we obtain a better statistical rate. Next, we give an upper bound on the excess risk using KRR:
Theorem 3
Suppose $P_{X^{so}} = P_{X^{ta}}$, the eigenvalues of the integral operator $T_K$ satisfy $\mu_i \le a i^{-1/p}$ for $i \ge 1$, $a \ge \Delta_Y$, $p \in (0, 1]$, and there exists a constant $C \ge 1$ such that for all $f \in \mathcal{H}$, $\|f\|_\infty \le C \|f\|_{\mathcal{H}}^p \cdot \|f\|_{L_2(P_X)}^{1-p}$. Further assume that $A_{f^{so}}(\lambda) \le c\lambda^{\beta_{so}}$ and $A_{w_G}(\lambda) \le c\lambda^{\beta_{w_G}}$. If we use KRR to estimate $f^{so}$ and $w_G$ with regularization parameters $\lambda_{so} \asymp n_{so}^{-1/(\beta_{so} + p)}$ and $\lambda_{w_G} \asymp n_{ta}^{-1/(\beta_{w_G} + p)}$, then with probability at least $1 - \delta$ the excess risk satisfies
$$\mathbb{E}\left[ R\left(\hat{f}^{ta}\right) \right] - R\left(f^{ta}\right) = O\left( \left( n_{ta}^{\frac{1}{\beta_{w_G} + p}} \log(n_{ta}) \cdot n_{so}^{-\frac{\beta_{so}}{\beta_{so} + p}} + n_{ta}^{-\frac{\beta_{w_G}}{\beta_{w_G} + p}} \right) \log\left(\frac{1}{\delta}\right) \right),$$
where the expectation is taken over $T^{so}$ and $T^{ta}$.

Similar to Theorem 2, Theorem 3 suggests that the estimation error comes from two sources. For estimating the auxiliary function $w_G$, the statistical rate depends on the properties of the kernel-induced RKHS and on how far the auxiliary function is from this space. For ease of presentation, we assume $P_{X^{so}} = P_{X^{ta}}$, so the approximation errors $A_{f^{so}}$ and $A_{f^{ta}}$ are defined on the same domain. The error of estimating $f^{so}$ is amplified by $O(\lambda_{w_G}^{-1} \log(n_{ta}))$, which is worse than for nonparametric kernel smoothing. We believe this $\lambda_{w_G}^{-1}$ factor is nearly tight, because Bousquet and Elisseeff [2002] have shown that the uniform algorithmic stability parameter for KRR is $O(\lambda_{w_G}^{-1})$. Steinwart et al. [2009] showed that for non-transfer learning the optimal statistical rate for the excess risk is $\Omega\left(n_{ta}^{-\beta_{ta}/(\beta_{ta} + p)}\right)$, so if $\beta_{w_G} \ge \beta_{ta}$ and $n_{so}$ is sufficiently large, then we achieve an improved convergence rate through transfer learning. Remark:
Theorems 2 and 3 are not directly comparable because the assumptions on the function spaces in the two theorems are different. In general, a Hölder space is only a Banach space but not a Hilbert space. We refer readers to Theorem 1 in Zhou [2008] for details.
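The KRR subroutine analyzed in Theorem 3 has the closed form $\hat{f}(x) = K(\mathbf{X}, x)^\top (K(\mathbf{X}, \mathbf{X}) + n\lambda I)^{-1}\mathbf{Y}$ recalled in Section 2, and can serve as both subroutines of Algorithm 1. A minimal sketch: the Gaussian kernel, its width, the regularization values, and the sample sizes below are our own assumptions, not the paper's.

```python
import numpy as np

def krr_fit(X, y, lam, gamma=20.0):
    """Kernel ridge regression with a Gaussian kernel, via the closed form
    f_hat(x) = K(X, x)^T (K(X, X) + n*lam*I)^{-1} y."""
    n = len(X)
    K = np.exp(-gamma * (X[:, None] - X[None, :]) ** 2)       # Gram matrix
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    def f_hat(x):
        Kx = np.exp(-gamma * (X[:, None] - np.atleast_1d(x)[None, :]) ** 2)
        return Kx.T @ alpha
    return f_hat

# Offset transfer with KRR as both subroutines (G(a, b) = a + b).
rng = np.random.default_rng(1)
f_so = lambda x: np.sin(6 * x)
f_ta = lambda x: np.sin(6 * x) + 2 * x        # auxiliary function w_G(x) = 2x
X_so = rng.uniform(0, 1, 300); y_so = f_so(X_so) + 0.1 * rng.standard_normal(300)
X_ta = rng.uniform(0, 1, 30);  y_ta = f_ta(X_ta) + 0.1 * rng.standard_normal(30)

f_so_hat = krr_fit(X_so, y_so, lam=1e-3)                  # accurate: n_so is large
w_hat = krr_fit(X_ta, y_ta - f_so_hat(X_ta), lam=1e-2)    # fit the offset on T^{w_G}
f_ta_hat = lambda x: f_so_hat(x) + w_hat(x)

grid = np.linspace(0, 1, 100)
excess = float(np.mean((f_ta_hat(grid) - f_ta(grid)) ** 2))
```

Because the auxiliary function $2x$ sits comfortably in the RKHS while $f^{ta}$ must otherwise be learned from only 30 points, the two-stage fit mirrors the two error terms in the bound of Theorem 3.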
In the previous section we showed that for a specific transformation function $G$, if the auxiliary function is smoother than the target regression function, then we obtain a smaller excess risk. In practice, we would like to try out a class of transformation functions $\mathcal{G}$, which is possibly uncountable. We can construct a finite subset $\mathcal{G}_\epsilon \subset \mathcal{G}$ such that for each $G \in \mathcal{G}$ there is a $G_\epsilon \in \mathcal{G}_\epsilon$ that is close to $G$. Here we give an example. Consider the transformation functions of the form $\mathcal{G} = \{G(a, b) = \alpha a + b : |\alpha| \le L_\alpha, |a| \le L_a\}$. We can quantize this set by considering the subset $\mathcal{G}_\epsilon = \{G(a, b) = k\epsilon a + b\}$, where $\epsilon = L_\alpha / K$, $k = -K, \dots, 0, \dots, K$, and $|a| \le L_a$. Here $\epsilon$ is the quantization unit. The next theorem shows that we only need to search for the transformation function $\hat{G}$ in $\mathcal{G}_\epsilon$ whose corresponding estimator $\hat{f}_{\hat{G}}^{ta}$ has the lowest empirical risk on the validation dataset. Theorem 4
Let $\mathcal{G}$ be a class of transformation functions and $\mathcal{G}_\epsilon$ be its $\|\cdot\|_\infty$ norm $\epsilon$-cover. Suppose every $w_G$ satisfies the assumptions of Theorem 1, and for any two $G_1, G_2 \in \mathcal{G}$, $\|w_{G_1} - w_{G_2}\|_\infty \le L \|G_1 - G_2\|_\infty$ for some constant $L$. Denote $G^\star = \operatorname{argmin}_{G \in \mathcal{G}} R(\hat{f}_G^{ta})$ and $\hat{G} = \operatorname{argmin}_{G \in \mathcal{G}_\epsilon} \hat{R}(\hat{f}_G^{ta})$. If we choose $\epsilon = O\left( R(\hat{f}_{G^\star}^{ta}) \big/ \sum_{i=1}^{n_{ta}} c_i \right)$ and $n_{val} = \Omega\left(\log\left(|\mathcal{G}_\epsilon|/\delta\right)\right)$, then with probability at least $1 - \delta$, $\mathbb{E}\left[R(\hat{f}_{\hat{G}}^{ta})\right] - R(f^{ta}) = O\left( \mathbb{E}\left[R(\hat{f}_{G^\star}^{ta})\right] - R(f^{ta}) \right)$, where the expectation is taken over $T^{so}$ and $T^{ta}$. Remark 1:
This theorem implies that if the non-transfer function ($G(a, b) = b$) is in $\mathcal{G}_\epsilon$, then we will end up choosing a transformation function whose excess risk is of the same order as that of the non-transfer learning algorithm, thus avoiding negative transfer. Remark 2:
Note that the number of validation samples depends only logarithmically on the size of the set of transformation functions. Therefore, we only need a very small amount of data from the target domain for cross-validation.
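The selection procedure behind Theorem 4 can be sketched for the family $G(a, b) = \alpha a + b$ as follows: quantize $\alpha$, fit one estimator per grid point, and keep the one with the lowest empirical risk on the held-out validation set. This is our own illustration, not the paper's code; the linear fit for the auxiliary function and all constants are hypothetical choices.

```python
import numpy as np

def select_transformation(f_so_hat, X_ta, y_ta, X_val, y_val, alphas, fit):
    """Pick alpha for G(a, b) = alpha*a + b by empirical risk on a held-out
    validation set. Including alpha = 0 (non-transfer) in the grid guards
    against negative transfer."""
    best_risk, best_alpha, best_f = np.inf, None, None
    for alpha in alphas:
        W = y_ta - alpha * f_so_hat(X_ta)              # H_G for this (linear) G
        w_hat = fit(X_ta, W)                           # subroutine A^{w_G}
        f_hat = lambda x, a=alpha, w=w_hat: a * f_so_hat(x) + w(x)
        risk = float(np.mean((f_hat(X_val) - y_val) ** 2))  # validation risk
        if risk < best_risk:
            best_risk, best_alpha, best_f = risk, alpha, f_hat
    return best_risk, best_alpha, best_f

# Toy demo: the true relation is f_ta = 1.0 * f_so + (a linear term).
rng = np.random.default_rng(2)
f_so_hat = lambda x: np.cos(5 * np.asarray(x))    # pretend the source fit is exact
f_ta = lambda x: np.cos(5 * x) + x
X_ta = rng.uniform(0, 1, 60);  y_ta = f_ta(X_ta) + 0.05 * rng.standard_normal(60)
X_val = rng.uniform(0, 1, 30); y_val = f_ta(X_val) + 0.05 * rng.standard_normal(30)

def fit_linear(X, W):                             # auxiliary function is linear here
    coef = np.polyfit(X, W, 1)
    return lambda x: np.polyval(coef, np.asarray(x))

risk, alpha, f_hat = select_transformation(
    f_so_hat, X_ta, y_ta, X_val, y_val,
    alphas=np.linspace(-2, 2, 9), fit=fit_linear)   # quantized grid includes 0 and 1
```

On this toy problem the procedure recovers the correct scale $\alpha = 1$; since the grid contains $\alpha = 0$, it would instead fall back to plain target-domain regression if the source function were unrelated.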
In this section we use robotics and neural imaging data to demonstrate the effectiveness of the proposed framework. We conduct experiments on real-world data sets with the following procedures:
• Directly training on the target data $T^{ta}$ (Only Target KS, Only Target KRR).
• Only training on the source data $T^{so}$ (Only Source KS, Only Source KRR).
• Training on the combined source and target data (Combined KS, Combined KRR).
• The CDM algorithm proposed by Wang and Schneider [2014] with KRR (CDM).
• The algorithm described in this paper with $G(a, b) = (a + \alpha) b$, where $\alpha$ is a hyper-parameter (Scale KS, Scale KRR).
• The algorithm described in this paper with $G(a, b) = \alpha a + b$, where $\alpha$ is a hyper-parameter (Offset KS, Offset KRR).

For the first experiment, we vary the size of the target domain to study the effect of $n_{ta}$ relative to $n_{so}$. We use two datasets from the 'kin' family in Delve [Rasmussen et al., 1996]: 'kin-8fm' and 'kin-8nh', both with 8-dimensional inputs. kin-8fm has fairly linear output and low noise; kin-8nh, on the other hand, has non-linear output and high noise. We consider the task of transfer learning from kin-8fm to kin-8nh. In this experiment, we set $n_{so}$ to 320 and vary $n_{ta}$ in $\{10, 20, 40, 80, 160, 320\}$. Hyper-parameters were picked using grid search with 10-fold cross-validation on the target data (or source domain data when not using the target domain data).

[Table 1: Mean squared errors with standard deviation intervals for the algorithms above (Only Target KS/KRR, Only Source KRR, Combined KS/KRR, CDM, Offset KS/KRR, Scale KS/KRR) when transferring from kin-8fm to kin-8nh, for $n_{ta} \in \{10, 20, 40, 80, 160, 320\}$; numeric entries omitted. The values in bold are the smallest errors for each $n_{ta}$. Only Source KS has much worse performance than the other algorithms, so we do not show its results.]

Table 1 shows the mean squared errors on the target data. To better understand the results, we show a box plot of the mean squared errors from $n_{ta} = 40$ onwards in Figure 2(a). The results for $n_{ta} = 10$ and $n_{ta} = 20$ have high variance, so we do not show them in the plot. We also omit the results of Only Source KRR because of its poor performance. We note that our proposed algorithm outperforms the other methods across nearly all values of $n_{ta}$, especially when $n_{ta}$ is small.
Only when there are as many points in the target as in the source does simply training on the target give the best performance. This is to be expected, since the primary purpose of transfer learning is to alleviate the lack of data in the target domain. Though quite comparable, the performance of the scale methods was worse than that of the offset methods in this experiment. In general, we would use cross-validation to choose between the two.

We now consider another real-world dataset where the covariates are fMRI images taken while subjects perform a Stroop task [Stroop, 1935]. We use the dataset collected by Verstynen [2014], which contains fMRI data of 28 subjects. A total of 120 trials were presented to each participant; fMRI data was collected throughout the trials and went through a standard post-processing scheme. The result is a feature vector, corresponding to each trial, that describes the activity of brain regions (voxels), and the goal is to use this to predict the response time.

To frame the problem in the transfer learning setting, we consider as source the data of all but one subject. The goal is to predict on the remaining subject. We performed five repetitions for each algorithm by drawing $n_{so} = 300$ data points randomly from the points in the source domain. We used $n_{ta} = 80$ points from the target domain for training and cross-validation; evaluation was done on the remaining points in the target domain. Figure 2(b) shows a box plot of the coefficient of determination values (R-squared) for the best performing algorithms. R-squared is defined as $1 - SS_{res}/SS_{tot}$, where $SS_{res}$ is the sum of squared residuals and $SS_{tot}$ is the total sum of squares. Note that R-squared can be negative when predicting on unseen samples, which were not used to fit the model, as in our case. When positive, it indicates the proportion of explained variance in the dependent variable (higher is better).
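As a concrete check of this definition, a small sketch (the helper below is our own, not part of the paper's pipeline):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot. On held-out samples this can be negative,
    meaning the model predicts worse than the constant mean of y_true."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)       # sum of squared residuals
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
perfect = r_squared(y, y)                          # 1.0: perfect prediction
mean_only = r_squared(y, np.full(4, y.mean()))     # 0.0: predicting the mean
reversed_pred = r_squared(y, y[::-1])              # -3.0: worse than the mean
```

The last case illustrates why several of the baselines in the fMRI experiment report negative R-squared values on the held-out subject.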
From the plot, it is clear that Offset KRR and Only Target KRR have the best performance on average, and Offset KRR has the smaller variance.

Figure 2: Box plots of experimental results on real datasets: (a) results of transferring from kin-8fm to kin-8nh (mean squared error vs. number of training points from the target domain); (b) results of the fMRI experiment (R-squared). Each box extends from the first to the third quartile, and the horizontal lines in the middle are medians. For the robotics data, we report mean squared error (the lower the better), and for the fMRI data, we report R-squared (the higher the better). For ease of presentation, we only show results of algorithms with good performance.

                     Mean      Median    Standard Deviation
Only Target KS     -0.0096     0.0444        0.1041
Only Target KRR     0.1041     0.1186        0.2361
Only Source KS     -0.4932    -0.5366        0.4555
Only Source KRR    -0.8763    -0.9363        0.6265
Combined KS        -0.7540    -0.2023        1.5109
Combined KRR       -0.5868    -0.0691        1.3223
CDM                -3.1183    -3.4510        2.6473
Offset KS
Offset KRR          0.1080
In this paper, we proposed a general transfer learning framework for the HTL regression problem when some data is available from the target domain. Theoretical analysis shows that it is possible to achieve a better statistical rate using transfer learning than with standard supervised learning.

We now list two future directions in which our results could be further improved. First, in many real world applications, a large amount of unlabeled data from the target domain is also available. Combining our proposed framework with previous works for this scenario [Cortes and Mohri, 2014, Huang et al., 2006] is a promising direction to pursue. Second, we only present upper bounds in this paper. It is an interesting direction to obtain lower bounds for HTL and other transfer learning scenarios.
S.S.D. and B.P. were supported by NSF grant IIS1563887 and the ARPA-E Terra program. A.S. was supported by AFRL grant FA8750-17-2-0212.

References
Shai Ben-David and Ruth Urner. Domain adaptation as learning with auxiliary information. In New Directions in Transfer and Multi-Task Workshop @ NIPS, 2013.
Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. Advances in Neural Information Processing Systems, 19:137, 2007.
John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman. Learning bounds for domain adaptation. In Advances in Neural Information Processing Systems, pages 129–136, 2008.
Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.
Raymond J. Carroll, David Ruppert, Leonard A. Stefanski, and Ciprian M. Crainiceanu. Measurement Error in Nonlinear Models: A Modern Perspective. CRC Press, 2006.
Corinna Cortes and Mehryar Mohri. Domain adaptation in regression. In Algorithmic Learning Theory, pages 308–323. Springer, 2011.
Corinna Cortes and Mehryar Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519:103–126, 2014.
Corinna Cortes, Mehryar Mohri, and Andrés Muñoz Medina. Adaptation algorithm and theory based on generalized discrepancy. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 169–178. ACM, 2015.
Cecil C. Craig. On the Tchebychef inequality of Bernstein. The Annals of Mathematical Statistics, 4(2):94–102, 1933.
Li Fei-Fei, Rob Fergus, and Pietro Perona. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):594–611, 2006.
Jiayuan Huang, Arthur Gretton, Karsten M. Borgwardt, Bernhard Schölkopf, and Alex J. Smola. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems, pages 601–608, 2006.
Samory Kpotufe and Vikas Garg. Adaptivity to local smoothness and dimension in kernel regression. In Advances in Neural Information Processing Systems, pages 3075–3083, 2013.
Ilja Kuzborskij and Francesco Orabona. Stability and hypothesis transfer learning. In ICML, pages 942–950, 2013.
Ilja Kuzborskij and Francesco Orabona. Fast rates by transferring from auxiliary hypotheses. Machine Learning, pages 1–25, 2016.
Ilja Kuzborskij, Francesco Orabona, and Barbara Caputo. From N to N+1: Multiclass transfer incremental learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3358–3365, 2013.
Ilja Kuzborskij, Francesco Orabona, and Barbara Caputo. Scalable greedy algorithms for transfer learning. Computer Vision and Image Understanding, 2016.
Tongliang Liu, Dacheng Tao, Mingli Song, and Stephen Maybank. Algorithm-dependent generalization bounds for multi-task learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. arXiv preprint arXiv:0902.3430, 2009.
Mehryar Mohri and Andres Munoz Medina. New analysis and algorithm for learning with drifting distributions. In Algorithmic Learning Theory, pages 124–138. Springer, 2012.
Stephen Nuske, Kamal Gupta, Srinivasa Narasimhan, and Sanjiv Singh. Modeling and calibrating visual yield estimates in vineyards. In Field and Service Robotics, pages 343–356. Springer, 2014.
Francesco Orabona, Claudio Castellini, Barbara Caputo, Angelo Emanuele Fiorilla, and Giulio Sandini. Model adaptation with least-squares SVM for adaptive hand prosthetics. In IEEE International Conference on Robotics and Automation (ICRA), pages 2897–2903. IEEE, 2009.
Carl Edward Rasmussen, Radford M. Neal, Geoffrey Hinton, Drew van Camp, Michael Revow, Zoubin Ghahramani, Rafal Kustra, and Rob Tibshirani. Delve data for evaluating learning in valid experiments, 1996.
Ingo Steinwart, Don R. Hush, and Clint Scovel. Optimal rates for regularized least squares regression. In COLT, 2009.
J. Ridley Stroop. Studies of interference in serial verbal reactions. Journal of Experimental Psychology, 18(6):643, 1935.
Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul V. Buenau, and Motoaki Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in Neural Information Processing Systems, pages 1433–1440, 2008.
Tatiana Tommasi, Francesco Orabona, and Barbara Caputo. Safety in numbers: Learning categories from few examples with multi model knowledge transfer. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3081–3088. IEEE, 2010.
Timothy D. Verstynen. The organization and dynamics of corticostriatal pathways link the medial orbitofrontal cortex to future behavioral responses. Journal of Neurophysiology, 112(10):2457–2469, 2014.
Vladimir Vovk. Kernel ridge regression. In Empirical Inference, pages 105–116. Springer, 2013.
Xuezhi Wang and Jeff Schneider. Flexible transfer learning under support and model shift. In Advances in Neural Information Processing Systems, pages 1898–1906, 2014.
Xuezhi Wang and Jeff Schneider. Generalization bounds for transfer learning under model shift. 2015.
Xuezhi Wang, Junier B. Oliva, Jeff Schneider, and Barnabás Póczos. Nonparametric risk and stability analysis for multi-task learning problems. 2016.
Larry Wasserman. All of Nonparametric Statistics. Springer Science & Business Media, 2006.
Jun Yang, Rong Yan, and Alexander G. Hauptmann. Cross-domain video concept detection using adaptive SVMs. In Proceedings of the 15th ACM International Conference on Multimedia, pages 188–197. ACM, 2007.
Yaoliang Yu and Csaba Szepesvári. Analysis of kernel mean matching under covariate shift. arXiv preprint arXiv:1206.4650, 2012.
Kun Zhang, Krikamol Muandet, and Zhikun Wang. Domain adaptation under target and conditional shift. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 819–827, 2013.
Yu Zhang. Multi-task learning and algorithmic stability. In AAAI, 2015.
Ding-Xuan Zhou. Derivative reproducing properties for kernel methods in learning theory. Journal of Computational and Applied Mathematics, 220(1):456–463, 2008.
A Proofs
A.1 Proof of Theorem 1
Proof of Theorem 1. The proof only uses the assumptions on the transformation function and the stability of the training algorithm. We have
$$ \big|\hat f_{ta}(x) - f_{ta}(x)\big|^2 = \big|G\big(\hat f_{so}(x), \hat w_G(x)\big) - G\big(f_{so}(x), w_G(x)\big)\big|^2 \quad (2) $$
$$ \le 2L^2 \big|\hat f_{so}(x) - f_{so}(x)\big|^2 + 2L^2 \big|\hat w_G(x) - w_G(x)\big|^2 \quad (3) $$
$$ \le 2L^2 \big|\hat f_{so}(x) - f_{so}(x)\big|^2 + 4L^2 \big|\hat w_G(x) - \tilde w_G(x)\big|^2 + 4L^2 \big|\tilde w_G(x) - w_G(x)\big|^2 \quad (4) $$
$$ \le 2L^2 \big|\hat f_{so}(x) - f_{so}(x)\big|^2 + 4L^2 \Big(\sum_{i=1}^{n_{ta}} c_i\big(X_i^{ta}\big) \big|W_i - \widetilde W_i\big|\Big)^2 + 4L^2 \big|\tilde w_G(x) - w_G(x)\big|^2, \quad (5) $$
where (2) is by the requirement on $G$, (3) is by the Lipschitz condition on $G$ together with $(a+b)^2 \le 2a^2 + 2b^2$, (4) is because $(a-b)^2 \le 2(a-c)^2 + 2(c-b)^2$, and (5) is by our stability assumption on $A_{w_G}$. We are left with bounding $\sum_{i=1}^{n_{ta}} c_i \big|W_i - \widetilde W_i\big|$. Notice that by the assumption on $H_G$,
$$ \big|W_i - \widetilde W_i\big| = \Big|H_G\big(\hat f_{so}(X_i^{ta}), Y_i^{ta}\big) - H_G\big(f_{so}(X_i^{ta}), Y_i^{ta}\big)\Big| \le L \big|\hat f_{so}(X_i^{ta}) - f_{so}(X_i^{ta})\big|. \quad (6) $$
Plugging (6) into (5), we obtain the desired result. □

A.2 Proof of Theorem 2
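For reference while reading this proof: the kernel smoothing estimator $\hat f(x) = \sum_i w_i(x) Y_i$ with weights $w_i(x) = K_h(\|x - X_i\|) / \sum_j K_h(\|x - X_j\|)$ can be sketched in a few lines (a Nadaraya-Watson estimator; the Gaussian kernel, bandwidth, and test point below are illustrative, not the paper's choices):

```python
import numpy as np

def kernel_smooth(X, y, h):
    # Kernel smoothing (Nadaraya-Watson): f_hat(x) = sum_i w_i(x) y_i,
    # with w_i(x) = K_h(|x - X_i|) / sum_j K_h(|x - X_j|).
    def predict(xq):
        w = np.exp(-((xq - X) ** 2) / (2 * h ** 2))  # Gaussian K_h
        return np.dot(w / w.sum(), y)
    return predict

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, 500)
y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(500)
f_hat = kernel_smooth(X, y, h=0.05)
err = abs(f_hat(0.25) - 1.0)   # true value is sin(pi/2) = 1
```

The bandwidth h controls the bias/variance trade-off appearing in Lemmas 1 and 2 below (bias of order h^α against variance of order 1/(n h^d)).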
For simplicity, let $K_h(\cdot) = K(\cdot/h)$ and define the expected regression estimate $\tilde f = \sum_{i=1}^n w_i f(X_i)$. To prove Theorem 2, we first give two standard supporting lemmas for kernel smoothing.

Lemma 1 (Lemma 1 of Kpotufe and Garg [2013]). Under the same assumptions as in Theorem 2, for all $x$ with $\|x\| \le \Delta_X$, if $f$ is $(\lambda, \alpha)$-Hölder, then for any $h > 0$ we have $|\tilde f(x) - f(x)| \le \lambda h^{\alpha}$.

Lemma 2 (Corollary of Lemmas 3 and 7 of Kpotufe and Garg [2013]). Under the same assumptions as in Theorem 2, let $0 < \delta < 1/2$. For all $x$ with $\|x\| \le \Delta_X$ and $h > 0$, with probability at least $1 - \delta$ we have
$$ |\hat f(x) - \tilde f(x)|^2 = O\Big(\frac{\log(1/\delta)}{n h^d}\Big). $$

Proof of Theorem 2. We prove Theorem 2 by bounding each corresponding term in Theorem 1. First, by Lemma 1 and Lemma 2, for all $x$, with probability at least $1 - \delta$,
$$ \big|\hat f_{so}(x) - f_{so}(x)\big|^2 = O\Big(h_{so}^{2\alpha_{so}} + \frac{\log(1/\delta)}{n_{so} h_{so}^{d_{so}}}\Big). $$
In particular, for $X_1^{ta}, \ldots, X_{n_{ta}}^{ta}$ we have
$$ \max_{i=1,\ldots,n_{ta}} \big|\hat f_{so}(X_i^{ta}) - f_{so}(X_i^{ta})\big|^2 = O\Big(h_{so}^{2\alpha_{so}} + \frac{\log(n_{ta}/\delta)}{n_{so} h_{so}^{d_{so}}}\Big). \quad (7) $$

Next, according to Assumptions 1 and 2, $H_G$ is bounded and unbiased and $w_G$ is bounded, so we can view $\{(X_i^{ta}, \widetilde W_i)\}_{i=1}^{n_{ta}}$ as a training set for the function $w_G$, with $\widetilde W_i = w_G(X_i^{ta}) + \epsilon_{w_G}$, where $E[\epsilon_{w_G}] = 0$ and $|\epsilon_{w_G}| \le B$. Based on this observation, using Lemma 1 and Lemma 2 again, for all $x$ with $\|x\| \le \Delta_X$, with probability at least $1 - \delta$,
$$ |\tilde w_G(x) - w_G(x)|^2 = O\Big(h_{w_G}^{2\alpha_{w_G}} + \frac{\log(1/\delta)}{n_{ta} h_{w_G}^{d_{w_G}}}\Big). $$

It remains to bound the stability term $\big\|A_{w_G}(T) - A_{w_G}(\widetilde T)\big\|_{\infty}$. Notice that for $T, \widetilde T$ in Theorem 1, and for all $x$ with $\|x\| \le \Delta_X$,
$$ \big|A_{w_G}(T)(x) - A_{w_G}(\widetilde T)(x)\big| = \frac{\big|\sum_{i=1}^{n_{ta}} K_h(\|x - X_i^{ta}\|)(W_i - \widetilde W_i)\big|}{\sum_{i=1}^{n_{ta}} K_h(\|x - X_i^{ta}\|)} \le \frac{\sum_{i=1}^{n_{ta}} K_h(\|x - X_i^{ta}\|)\big|W_i - \widetilde W_i\big|}{\sum_{i=1}^{n_{ta}} K_h(\|x - X_i^{ta}\|)} \triangleq \sum_{i=1}^{n_{ta}} c_i \big|W_i - \widetilde W_i\big| $$
for $c_i = K_h(\|x - X_i^{ta}\|) / \sum_{j=1}^{n_{ta}} K_h(\|x - X_j^{ta}\|)$. By Theorem 1, we only need to bound $\big(\sum_{i=1}^{n_{ta}} c_i \big|\hat f_{so}(X_i^{ta}) - f_{so}(X_i^{ta})\big|\big)^2$. With probability at least $1 - \delta$, we have
$$ \Big(\sum_{i=1}^{n_{ta}} c_i \big|\hat f_{so}(X_i^{ta}) - f_{so}(X_i^{ta})\big|\Big)^2 \le \Big(\sum_{i=1}^{n_{ta}} c_i\Big)^2 \Big(\max_{i=1,\ldots,n_{ta}} \big|\hat f_{so}(X_i^{ta}) - f_{so}(X_i^{ta})\big|\Big)^2 \quad (8) $$
$$ = \max_{i=1,\ldots,n_{ta}} \big|\hat f_{so}(X_i^{ta}) - f_{so}(X_i^{ta})\big|^2 \quad (9) $$
$$ = O\Big(h_{so}^{2\alpha_{so}} + \frac{\log(n_{ta}/\delta)}{n_{so} h_{so}^{d_{so}}}\Big), \quad (10) $$
where (8) is because the maximum dominates each term, (9) is because $\sum_{i=1}^{n_{ta}} c_i = 1$ by definition, and (10) is by (7). Putting these together, using Theorem 1, and choosing the bandwidths according to Theorem 2, we can show that for all $x$ with $\|x\| \le \Delta_X$,
$$ \big|f_{ta}(x) - \hat f_{ta}(x)\big|^2 = O\bigg(\Big(n_{so}^{-\frac{2\alpha_{so}}{2\alpha_{so} + d_{so}}} + n_{ta}^{-\frac{2\alpha_{w_G}}{2\alpha_{w_G} + d_{w_G}}}\Big)\log\Big(\frac{n_{ta}}{\delta}\Big)\bigg). $$
Integrating with respect to $P_{X^{ta}}$, we obtain the desired result. □

A.3 Proof of Theorem 3
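For reference, the kernel ridge regression estimator $\hat f(x) = Y^{\top}(K + n\lambda I)^{-1} k(x)$ analyzed in this proof can be sketched as follows (the Gaussian kernel, the constants, and the test point are illustrative):

```python
import numpy as np

def rbf(A, B, bw=0.2):
    # Gaussian kernel on 1-D inputs.
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-d2 / (2 * bw ** 2))

def krr(X, y, lam, kernel=rbf):
    # KRR closed form: f_hat(x) = k(x, X) @ (K + n*lam*I)^{-1} y.
    n = len(X)
    K = kernel(X, X)
    coef = np.linalg.solve(K + n * lam * np.eye(n), y)
    return lambda Xq: kernel(Xq, X) @ coef

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, 200)
y = X ** 2 + 0.05 * rng.standard_normal(200)
f_hat = krr(X, y, lam=1e-3)
err = abs(f_hat(np.array([0.5]))[0] - 0.25)  # regression target is x^2
```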
The proof strategy is similar to that of Theorem 2. Using Theorem 1, we have
$$ E\Big[\big|\hat f_{ta}(X) - f_{ta}(X)\big|^2\Big] = O\bigg(E\bigg[\big|\hat f_{so}(X) - f_{so}(X)\big|^2 + |\tilde w_G(X) - w_G(X)|^2 + \Big(\sum_{i=1}^{n_{ta}} c_i\big(X_i^{ta}\big) \big|\hat f_{so}(X_i^{ta}) - f_{so}(X_i^{ta})\big|\Big)^2\bigg]\bigg), $$
where the expectation is taken over $P_{X^{ta}}$ and $T^{ta}$. We bound the three terms on the right-hand side separately. By Corollary 3 of Steinwart et al. [2009], with probability at least $1 - \delta$,
$$ E\Big[\big|\hat f_{so}(X) - f_{so}(X)\big|^2\Big] = O\Big(\lambda_{so}^{\beta_{so}} + \frac{\log(1/\delta)}{\lambda_{so}^{p_{so}} n_{so}}\Big), \quad (11) $$
where the expectation is taken over $P_{X^{ta}}$. Taking a union bound over $X_1^{ta}, \ldots, X_{n_{ta}}^{ta}$, we have
$$ \max_{i=1,\ldots,n_{ta}} E\Big[\big|\hat f_{so}(X_i^{ta}) - f_{so}(X_i^{ta})\big|^2\Big] = O\Big(\lambda_{so}^{\beta_{so}} + \frac{\log(n_{ta}/\delta)}{\lambda_{so}^{p_{so}} n_{so}}\Big), \quad (12) $$
where the expectation is taken over $T^{ta}$. Next, by exactly the same argument as in the proof of Theorem 2, we can view $\{(X_i^{ta}, \widetilde W_i)\}_{i=1}^{n_{ta}}$ as a training set for the function $w_G$, with $\widetilde W_i = w_G(X_i^{ta}) + \epsilon_{w_G}$, where $E[\epsilon_{w_G}] = 0$ and $|\epsilon_{w_G}| \le B$. Thus, applying Corollary 3 of Steinwart et al. [2009] again, with probability at least $1 - \delta$,
$$ E\big[|\tilde w_G(X) - w_G(X)|^2\big] = O\Big(\lambda_{w_G}^{\beta_{w_G}} + \frac{\log(1/\delta)}{\lambda_{w_G}^{p_{w_G}} n_{ta}}\Big), $$
where the expectation is taken over $P_{X^{ta}}$.

Now we analyze the stability of KRR. Let $\Phi(x)$ denote the feature map corresponding to the given kernel $K$, so that $K(x, y) = \Phi(x)^{\top}\Phi(y)$. For simplicity, denote by $\Phi_{ta} = \big(\Phi(x_1^{ta}) \mid \cdots \mid \Phi(x_{n_{ta}}^{ta})\big)$ the feature matrix of the target domain data, and let $W = (W_1, \ldots, W_{n_{ta}})^{\top}$ and $\widetilde W = (\widetilde W_1, \ldots, \widetilde W_{n_{ta}})^{\top}$. With these notations, we can write
$$ \big|A_{w_G}(T_{w_G})(x) - A_{w_G}(\widetilde T_{w_G})(x)\big| = \Big|\big(W - \widetilde W\big)^{\top}\big(\Phi_{ta}^{\top}\Phi_{ta} + n_{ta}\lambda_{w_G} I\big)^{-1}\Phi_{ta}^{\top}\Phi(x)\Big| $$
$$ = \Big|\big(\Phi_{ta}(W - \widetilde W)\big)^{\top}\big(\Phi_{ta}\Phi_{ta}^{\top} + n_{ta}\lambda_{w_G} I\big)^{-1}\Phi(x)\Big| $$
$$ \le \Big\|\sum_{i=1}^{n_{ta}}\big(W_i - \widetilde W_i\big)\Phi(X_i^{ta})\Big\| \cdot \Big\|\big(\Phi_{ta}\Phi_{ta}^{\top} + n_{ta}\lambda_{w_G} I\big)^{-1}\Big\|_{op} \cdot \|\Phi(x)\| $$
$$ \le \Big(\sum_{i=1}^{n_{ta}} k^{1/2}\big|W_i - \widetilde W_i\big|\Big) \cdot \frac{1}{n_{ta}\lambda_{w_G}} \cdot k^{1/2} = \sum_{i=1}^{n_{ta}} \frac{k}{n_{ta}\lambda_{w_G}}\big|W_i - \widetilde W_i\big| \triangleq \sum_{i=1}^{n_{ta}} c_i \big|W_i - \widetilde W_i\big|. $$
The second equality uses the identity $(\Phi^{\top}\Phi + \lambda I)^{-1}\Phi^{\top} = \Phi^{\top}(\Phi\Phi^{\top} + \lambda I)^{-1}$, valid for any $\Phi$ and $\lambda > 0$. The first inequality uses the sub-multiplicativity of the operator norm and the assumption $\|\Phi(x)\|_{\mathcal H} \le k^{1/2}$. The second inequality uses the fact that the least eigenvalue of $\Phi_{ta}\Phi_{ta}^{\top} + n_{ta}\lambda_{w_G} I$ is lower bounded by $n_{ta}\lambda_{w_G}$.
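The push-through identity $(\Phi^{\top}\Phi + \lambda I)^{-1}\Phi^{\top} = \Phi^{\top}(\Phi\Phi^{\top} + \lambda I)^{-1}$ used in this step can be checked numerically on a random feature matrix (the dimensions below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
Phi = rng.standard_normal((8, 5))   # D = 8 feature dimensions, n = 5 samples
lam = 0.3

# (Phi^T Phi + lam I_n)^{-1} Phi^T  ==  Phi^T (Phi Phi^T + lam I_D)^{-1}
left = np.linalg.solve(Phi.T @ Phi + lam * np.eye(5), Phi.T)
right = Phi.T @ np.linalg.inv(Phi @ Phi.T + lam * np.eye(8))
gap = np.abs(left - right).max()
```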
Therefore, applying the Cauchy-Schwarz inequality and using the bound in (12), with probability at least $1 - \delta$ we have
$$ E\bigg[\Big(\sum_{i=1}^{n_{ta}} c_i \big|\hat f_{so}(X_i^{ta}) - f_{so}(X_i^{ta})\big|\Big)^2\bigg] \le \Big(\sum_{i=1}^{n_{ta}} c_i\Big) \cdot E\bigg[\sum_{i=1}^{n_{ta}} c_i \big|\hat f_{so}(X_i^{ta}) - f_{so}(X_i^{ta})\big|^2\bigg] $$
$$ \le \frac{k^2}{\lambda_{w_G}^2} \cdot \max_{i=1,\ldots,n_{ta}} E\Big[\big|\hat f_{so}(X_i^{ta}) - f_{so}(X_i^{ta})\big|^2\Big] = O\bigg(\frac{k^2}{\lambda_{w_G}^2}\Big(\lambda_{so}^{\beta_{so}} + \frac{\log(n_{ta}/\delta)}{\lambda_{so}^{p_{so}} n_{so}}\Big)\bigg), $$
using $c_i = k/(n_{ta}\lambda_{w_G})$, so that $\sum_{i=1}^{n_{ta}} c_i = k/\lambda_{w_G}$. Putting these together and choosing $\lambda_{so}$ and $\lambda_{w_G}$ according to Theorem 3, we obtain the desired result. □

A.4 Proof of Theorem 4
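The validation-based selection rule analyzed below ($\hat\theta$ minimizing the empirical squared error on a held-out validation set, as in Theorem 5) amounts to the following sketch; the candidate predictors here are hypothetical stand-ins for the fitted $\hat f_{\theta}$:

```python
import numpy as np

def select_by_validation(candidates, X_val, y_val):
    # Return the candidate minimizing empirical squared error on the validation
    # set, mirroring theta_hat = argmin_theta sum_i (f_theta(X_i^val) - Y_i^val)^2.
    def risk(f):
        return np.mean((f(X_val) - y_val) ** 2)
    return min(candidates, key=lambda name: risk(candidates[name]))

rng = np.random.default_rng(4)
X_val = rng.uniform(0, 1, 50)
y_val = 2 * X_val + 0.1 * rng.standard_normal(50)

# Hypothetical stand-ins for fitted predictors f_hat_theta:
candidates = {
    "offset": lambda X: 2 * X,
    "scale": lambda X: 3 * X,
    "zero": lambda X: np.zeros_like(X),
}
best = select_by_validation(candidates, X_val, y_val)
```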
We first prove a general theorem for cross-validation. This is a standard result, and we include the proof for completeness.

Theorem 5. Let $\Theta$ be the set of all hypotheses and let $\hat\theta = \operatorname{argmin}_{\theta\in\Theta} \sum_{i=1}^{n_{val}} \big(\hat f_{\theta}^{ta}(X_i^{val}) - Y_i^{val}\big)^2$ be the estimator that minimizes the error on the cross-validation set. Then with probability at least $1 - \delta$,
$$ E\Big[R\big(\hat f_{\hat\theta}^{ta}\big)\Big] - R\big(f^{ta}\big) = O\bigg(E\Big[R\big(\hat f_{\theta^\star}^{ta}\big)\Big] - R\big(f^{ta}\big) + \frac{\log(|\Theta|/\delta)}{n_{val}}\bigg), $$
where $\theta^{\star} = \operatorname{argmin}_{\theta\in\Theta} R\big(\hat f_{\theta}\big)$ and the expectation is taken over $T^{so}$ and $T^{ta}$.

To prove Theorem 5, we use the following form of Bernstein's inequality [Craig, 1933].

Lemma 3. Let $X_1, \ldots, X_n$ be random variables and suppose that for all $k \ge 2$,
$$ E\big[|X_i - E[X_i]|^k\big] \le \frac{\operatorname{Var}[X_i]}{2}\, k!\, r^{k-2} $$
for some $r > 0$. Then with probability at least $1 - \delta$,
$$ \frac{1}{n}\sum_{i=1}^n \big(X_i - E[X_i]\big) \le \frac{\log(1/\delta)}{nt} + \frac{t \operatorname{Var}[X_i]}{2(1 - c)}, \quad \text{for } 0 \le tr \le c < 1. $$

Proof of Theorem 5. For a given $\theta \in \Theta$, we obtain a corresponding estimated regression function $\hat f_{\theta}$. Define
$$ U_i^{\theta} \triangleq -\big(Y_i^{val} - \hat f_{\theta}^{ta}(X_i^{val})\big)^2 + \big(Y_i^{val} - f^{ta}(X_i^{val})\big)^2. $$
Computing the expectation,
$$ E\big[U_i^{\theta}\big] = -E\Big[\big(\hat f_{\theta}^{ta}(X_i^{val}) - f^{ta}(X_i^{val})\big)^2\Big] = R\big(f^{ta}\big) - R\big(\hat f_{\theta}^{ta}\big), $$
where we used that the cross terms involving the noise vanish in expectation. Also, by definition,
$$ \frac{1}{n_{val}} \sum_{i=1}^{n_{val}} U_i^{\theta} = \hat R\big(f^{ta}\big) - \hat R\big(\hat f_{\theta}^{ta}\big). $$
To apply Bernstein's inequality, we first bound the variance of $U_i^{\theta}$. Writing $U_i^{\theta} = \big(f^{ta}(X_i^{val}) - \hat f_{\theta}^{ta}(X_i^{val})\big)\big(2Y_i^{val} - f^{ta}(X_i^{val}) - \hat f_{\theta}^{ta}(X_i^{val})\big)$, we get
$$ \operatorname{Var}\big[U_i^{\theta}\big] \le E\big[(U_i^{\theta})^2\big] \le 16\Delta_Y^2\, E\Big[\big(f^{ta}(X_i^{val}) - \hat f_{\theta}^{ta}(X_i^{val})\big)^2\Big] = -16\Delta_Y^2\, E\big[U_i^{\theta}\big], $$
where we used that the domain of $Y$ is bounded. Since $U_i^{\theta}$ is a bounded random variable, the moment condition is satisfied with $r = 4\Delta_Y^2$. Now applying the Craig-Bernstein inequality to the $U_i^{\theta}$'s, with probability at least $1 - \delta$,
$$ \frac{1}{n_{val}}\sum_{i=1}^{n_{val}} \big(U_i^{\theta} - E[U_i^{\theta}]\big) \le \frac{\log(1/\delta)}{n_{val}\, t} - \frac{8 t \Delta_Y^2\, E\big[U_i^{\theta}\big]}{1 - c}. $$
We need to ensure that $c < 1$. To do this, let $c = tr = 4t\Delta_Y^2$ and choose $t$ small enough that $a \triangleq \frac{8 t \Delta_Y^2}{1 - c} < 1$. Grouping terms, we get
$$ (1 - a)\big(-E[U_i^{\theta}]\big) + \frac{1}{n_{val}}\sum_{i=1}^{n_{val}} U_i^{\theta} \le \frac{\log(1/\delta)}{n_{val}\, t}, $$
$$ (1 - a)\Big(R\big(\hat f_{\theta}^{ta}\big) - R\big(f^{ta}\big)\Big) - \Big(\hat R\big(\hat f_{\theta}^{ta}\big) - \hat R\big(f^{ta}\big)\Big) \le \frac{\log(1/\delta)}{n_{val}\, t}, $$
$$ R\big(\hat f_{\theta}^{ta}\big) - R\big(f^{ta}\big) \le \frac{1}{1 - a}\bigg(\hat R\big(\hat f_{\theta}^{ta}\big) - \hat R\big(f^{ta}\big) + \frac{\log(1/\delta)}{n_{val}\, t}\bigg). $$
Taking a union bound over $\Theta$ and considering $\hat f_{\hat\theta}$,
$$ R\big(\hat f_{\hat\theta}^{ta}\big) - R\big(f^{ta}\big) \le \frac{1}{1 - a}\bigg(\hat R\big(\hat f_{\hat\theta}^{ta}\big) - \hat R\big(f^{ta}\big) + \frac{\log(|\Theta|/\delta)}{n_{val}\, t}\bigg). $$
Recalling that $\hat f_{\hat\theta}^{ta}$ is the minimizer of $\hat R$ among all estimators induced by $\Theta$, we have
$$ R\big(\hat f_{\hat\theta}^{ta}\big) - R\big(f^{ta}\big) \le \frac{1}{1 - a}\bigg(\hat R\big(\hat f_{\theta^\star}^{ta}\big) - \hat R\big(f^{ta}\big) + \frac{\log(|\Theta|/\delta)}{n_{val}\, t}\bigg). $$
Taking the expectation over $T^{val}$, and then over $T^{so}$ and $T^{ta}$, we obtain the desired result. □

Now we are ready to prove Theorem 4. Since $\bar{\mathcal G}$ is an $\epsilon$-cover of $\mathcal G$, there exists $G' \in \bar{\mathcal G}$ such that $\|G' - G^{\star}\|_{\infty} \le \epsilon$.
For any $x$,
$$ \big|f^{ta}(x) - \hat f_{G'}^{ta}(x)\big| = \big|G^{\star}\big(f^{so}(x), w_{G^{\star}}(x)\big) - G'\big(\hat f^{so}(x), \hat w_{G'}(x)\big)\big| $$
$$ \le \big|G^{\star}\big(f^{so}(x), w_{G^{\star}}(x)\big) - G^{\star}\big(\hat f^{so}(x), \hat w_{G^{\star}}(x)\big)\big| + \big|G^{\star}\big(\hat f^{so}(x), \hat w_{G^{\star}}(x)\big) - G'\big(\hat f^{so}(x), \hat w_{G^{\star}}(x)\big)\big| $$
$$ + \big|G'\big(\hat f^{so}(x), \hat w_{G^{\star}}(x)\big) - G'\big(\hat f^{so}(x), \hat w_{G'}(x)\big)\big|, \quad (13) $$
where $\hat w_{G^{\star}} = A_{w_G}\big(\{X_i^{ta}, W_i^{\star}\}\big)$ and $W_i^{\star} = H_{G'}\big(\hat f^{so}(X_i^{ta}), Y_i^{ta}\big) + w_{G^{\star}}(X_i^{ta}) - w_{G'}(X_i^{ta})$, i.e., an unbiased estimate of $w_{G^{\star}}(X_i^{ta})$. We bound the three terms in (13) separately. The first term is just the difference between the estimator based on $G^{\star}$ and the true $f^{ta}$, so after taking expectations it becomes the excess risk of $\hat f_{G^{\star}}^{ta}$. By the construction of the $\epsilon$-cover of $\mathcal G$, the second term is at most $\epsilon$. For the third term, notice that by the Lipschitz assumption on the $G$'s and our assumptions on the $G$'s in $\mathcal G$ in Theorem 4, we have
$$ \big|G'\big(\hat f^{so}(x), \hat w_{G^{\star}}(x)\big) - G'\big(\hat f^{so}(x), \hat w_{G'}(x)\big)\big| \le L\,\big|\hat w_{G^{\star}}(x) - \hat w_{G'}(x)\big| \le L \sum_{i=1}^{n_{ta}} c_i \|G^{\star} - G'\|_{\infty} = O\Big(\sum_{i=1}^{n_{ta}} c_i\, \epsilon\Big). $$
We have thus shown
$$ R\big(\hat f_{G'}^{ta}\big) - R(f^{ta}) = O\Big(R\big(\hat f_{G^{\star}}^{ta}\big) - R(f^{ta})\Big). $$
Let $\bar G^{\star} = \operatorname{argmin}_{G \in \bar{\mathcal G}} R\big(\hat f_G\big)$ be the best transformation function in $\bar{\mathcal G}$. By the optimality of $\bar G^{\star}$, we have
$$ R\big(\hat f_{\bar G^{\star}}^{ta}\big) - R(f^{ta}) = O\Big(R\big(\hat f_{G^{\star}}^{ta}\big) - R(f^{ta})\Big). $$
Applying Theorem 5 with our assumptions on $\epsilon$ and $n_{val}$, we know
$$ R\big(\hat f_{\hat G}^{ta}\big) - R(f^{ta}) = O\Big(R\big(\hat f_{\bar G^{\star}}^{ta}\big) - R(f^{ta})\Big). $$
Combining these facts, we have
$$ R\big(\hat f_{\hat G}^{ta}\big) - R(f^{ta}) = O\Big(R\big(\hat f_{G^{\star}}^{ta}\big) - R(f^{ta})\Big). \qquad \square $$
B Regression Calibration for the Measurement Error Problem

Given $f^{so}$, in this section we describe a standard technique for obtaining an unbiased estimate of the $w_G(X_i^{ta})$'s. Since we assume $Y^{ta} = f^{ta}(X^{ta}) + \epsilon^{ta}$, the measurement error model corresponds to the classical error model in Carroll et al. [2006]. Regression calibration is a widely used and reasonably well investigated method for the measurement error problem. The algorithm is as follows (we have adapted the general algorithm to our HTL problem):

• Compute an estimate $\tilde f^{ta}(X_i^{ta})$ of $f^{ta}(X_i^{ta})$. Note that directly using $Y_i^{ta}$ is one option for $\tilde f^{ta}(X_i^{ta})$.
• Compute $G^{-1}_{f^{so}(X_i^{ta})}\big(\tilde f^{ta}(X_i^{ta})\big)$.
• Calibrate the previously computed value by applying some function $F$:
$$ \widetilde W_i = F\Big(G^{-1}_{f^{so}(X_i^{ta})}\big(\tilde f^{ta}(X_i^{ta})\big)\Big), $$
where $F$ depends on $G$ and the specific distribution of the noise.

Now we consider the loglinear mean model as a concrete example. Suppose
$$ G\big(f^{so}(x), w_G(x)\big) = \beta f^{so}(x) \log\big(w_G(x)\big), $$
where $\beta$ is some constant, and further assume $\epsilon^{ta} \sim N(0, \sigma^2)$. Now we apply the regression calibration algorithm:

• First, we choose $Y_i^{ta}$ as our estimate $\tilde f^{ta}(X_i^{ta})$.
• Second, by our choice of $G$,
$$ G^{-1}_{f^{so}(X_i^{ta})}\big(Y_i^{ta}\big) = \exp\bigg(\frac{Y_i^{ta}}{\beta f^{so}(X_i^{ta})}\bigg). $$
• Last, for our choice of $G$ and our assumption on $\epsilon^{ta}$, the corresponding $F$ and the final estimate of $w_G(X_i^{ta})$ are
$$ \widetilde W_i = F\Big(G^{-1}_{f^{so}(X_i^{ta})}\big(\tilde f^{ta}(X_i^{ta})\big)\Big) = \exp\bigg(\log\Big(G^{-1}_{f^{so}(X_i^{ta})}\big(\tilde f^{ta}(X_i^{ta})\big)\Big) - \frac{\sigma^2}{2\beta^2 f^{so}(X_i^{ta})^2}\bigg) = \exp\bigg(\frac{Y_i^{ta}}{\beta f^{so}(X_i^{ta})} - \frac{\sigma^2}{2\beta^2 f^{so}(X_i^{ta})^2}\bigg). $$
The correction term removes the bias of the naive inverse, since $E\big[\exp\big(\epsilon^{ta}/(\beta f^{so})\big)\big] = \exp\big(\sigma^2/(2\beta^2 (f^{so})^2)\big)$.

The estimator for $w_G(X_i^{ta})$ depends on distribution-specific parameters which may be unknown, like $\sigma^2$ in the previous example. In such cases, we may replace these parameters by estimates. For example, in the previous Gaussian noise case, suppose for each $X_i^{ta}$ we have multiple observations $\{Y_{ij}^{ta}\}_{j=1}^{n_i}$. Then we can estimate $\sigma^2$ by
$$ \hat\sigma^2 = \frac{\sum_{i=1}^{n_{ta}} \sum_{j=1}^{n_i} \big(Y_{ij}^{ta} - \bar Y_i^{ta}\big)^2}{\sum_{i=1}^{n_{ta}} (n_i - 1)}, \quad \text{where } \bar Y_i^{ta} = \frac{1}{n_i}\sum_{j=1}^{n_i} Y_{ij}^{ta}. $$
Here we only provide one method for the measurement error problem. There are other techniques, such as simulation extrapolation and likelihood methods, which may also be applicable in many situations. The choice of method depends on the specific transformation $G$ and assumptions on the distribution of the noise. Again, interested readers are referred to Carroll et al. [2006] for details.
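A quick numerical check of the calibration step for the loglinear example, assuming $\beta$, $\sigma$, and $f^{so}(x)$ are known at a fixed point $x$ (all values below are illustrative): the naive inverse $\exp(Y/(\beta f^{so}))$ overestimates $w_G$ on average, while multiplying by $\exp(-\sigma^2/(2\beta^2 (f^{so})^2))$ removes the bias.

```python
import numpy as np

rng = np.random.default_rng(5)
beta, sigma = 1.5, 0.3      # illustrative model parameters
f_so_x = 2.0                # f_so(x) at a fixed point x (assumed known)
w_true = 4.0                # w_G(x): the quantity we want to estimate without bias

# Observations Y = G(f_so, w_G) + eps = beta * f_so * log(w_G) + eps
Y = beta * f_so_x * np.log(w_true) + sigma * rng.standard_normal(200_000)

naive = np.exp(Y / (beta * f_so_x))  # direct inverse G^{-1}: biased upward
calibrated = naive * np.exp(-sigma**2 / (2 * (beta * f_so_x) ** 2))

bias_naive = naive.mean() - w_true
bias_calibrated = calibrated.mean() - w_true
```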
Table 3: 1 standard deviation intervals for the mean squared errors of various algorithms (Only Target KS/KRR, Only Source KS/KRR, Combined KS/KRR, CDM, Offset KS/KRR, Scale KS/KRR) when transferring from kin-8nh to kin-8fm, for n_ta in {10, 20, 40, 80, 160, 320}. The values in bold are the best errors for each n_ta.

C Additional Experimental Results
C.1 Synthetic data
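The two synthetic examples described in this subsection can be generated as follows (a sketch: the Doppler-like source function and the offset/scale relations are from the text, while the noise level shown is illustrative rather than the paper's exact value):

```python
import numpy as np

rng = np.random.default_rng(6)

def f_so(x):
    # Doppler-like source regression function from the synthetic examples.
    return np.sqrt(x * (1 - x)) * np.sin(2.1 * np.pi / (x + 0.05))

def f_ta_offset(x):
    return f_so(x) + x        # offset example: G(a, b) = a + b, w_G(x) = x

def f_ta_scale(x):
    return 5 * f_so(x)        # scale example: G(a, b) = ab, w_G(x) = 5

n_so, n_ta = 10000, 100
X_so = rng.uniform(0, 1, n_so)
X_ta = rng.uniform(0, 1, n_ta)
noise = 0.1                   # illustrative noise level
y_so = f_so(X_so) + noise * rng.standard_normal(n_so)
y_ta = f_ta_offset(X_ta) + noise * rng.standard_normal(n_ta)
```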
This section gives details of the synthetic data. For both experiments, we use $n_{so} = 10000$ samples from the source domain and $n_{ta} = 100$ samples from the target domain. We put Gaussian noise on the labels, and we use KS with a Gaussian kernel for estimating $f^{so}$ and $w_G$.

Figure 1b shows the offset example in Section 3, where we consider
$$ f^{so}(x) = \sqrt{x(1-x)}\,\sin\Big(\frac{2.1\pi}{x + 0.05}\Big), \qquad f^{ta}(x) = f^{so}(x) + x. $$
We used the transformation function $G(a, b) = a + b$, with the bandwidths of the kernels chosen by cross-validation. Figure 1c shows the scale example in Section 3, where we consider the same source regression function and $f^{ta}(x) = 5 f^{so}(x)$. We tested the transformation function $G(a, b) = ab$; the bandwidth parameters were again chosen by cross-validation. The plots show that by using our proposed transfer learning framework with an appropriate transformation function, we can estimate the target regression function better, especially in regions where $f^{ta}$ is not smooth.

C.2 Transferring from kin-8nh to kin-8fm
Now we briefly discuss the results of the second transfer task with the robotic arm data described in Section 6. The source domain is kin-8nh and the target domain is kin-8fm. The results are shown in Table 3. Here we see the effect of trying to transfer to an "easy" domain: we do not gain any advantage by using the transfer algorithm, except for the smallest values of n_ta.