Transfer Learning for Linear Regression: a Statistical Test of Gain
David Obst, Badih Ghattas, Jairo Cugliari, Georges Oppenheim, Sandra Claudel, Yannig Goude
David Obst
EDF R&D, Aix-Marseille Université [email protected]
Badih Ghattas
Aix-Marseille Université [email protected]
Jairo Cugliari
Université Lumière Lyon 2
Georges Oppenheim
Université Paris-Est Marne-la-Vallée [email protected]
Sandra Claudel
EDF R&D [email protected]
Yannig Goude
EDF R&D [email protected]
February 19, 2021

Abstract
Transfer learning, also referred to as knowledge transfer, aims at reusing knowledge from a source dataset to improve performance on a similar target one. While many empirical studies illustrate the benefits of transfer learning, few theoretical results have been established, especially for regression problems. In this paper a theoretical framework for the problem of parameter transfer for the linear model is proposed. It is shown that the quality of transfer for a new input vector x depends on its representation in an eigenbasis involving the parameters of the problem. Furthermore a statistical test is constructed to predict whether a fine-tuned model has a lower prediction quadratic risk than the base target model for an unobserved sample. The efficiency of the test is illustrated on synthetic data as well as real electricity consumption data.

Keywords: Linear regression · Transfer learning · Statistical test · Fine-tuning · Transfer theory
1 Introduction

We consider the situation where we want to perform predictions for a target task T for which training data is limited, either because it is difficult to label or only newly available. We have at our disposal another source task S, related (in some sense) to T, for which we have plenty of data. Our goal, which is the one of transfer learning, is to leverage information from S to improve the results on T [Weiss et al., 2016]. For instance we could be interested in producing forecasts for a group of newly arrived customers T while we have a group of long-time customers S. We expect that the relatedness between S and T and the information at our disposal for S help to improve our forecasting model for the newly arrived customers. Transfer can be performed on multiple levels. [Pan and Yang, 2009] define four categories of transfer: parameter, instance, feature and relationship transfer. In this paper we focus on the first, more specifically in the framework of linear regression. Transfer for the linear model has been studied in the literature on multiple occasions, but the nature of the results varies. In [Lounici et al., 2009, Lounici et al., 2011] the authors consider the setting of sparse multi-task learning in high dimension with a common sparsity pattern within the regression vectors. They obtain oracle inequalities on the prediction error, albeit for the same data on which the parameters were learned. [Maurer, 2006] establishes bounds on the average prediction error over m tasks in the linear classification setting, but does not investigate in which situations learning on multiple sources could be beneficial for a specific target. Furthermore in both aforementioned papers the results remain mainly theoretical, and the setting of multi-task learning is slightly different from the one of transfer. Bayesian approaches, where a prior is constructed with the help of the source data and the posterior is then obtained with the target one, are also popular for parameter transfer.
For instance in [Launay et al., 2015] such an approach is applied to the problem of electricity load forecasting, with good results. [Bouveyron and Jacques, 2010] propose another method to transfer the estimated coefficients of a linear model when the number of target samples is highly limited. After estimating the regression coefficients on the source set, the target vector is obtained by a linear transformation of the previous one, with constraints on the transformation. They demonstrated the efficiency of their approach on house pricing data among others, showing significant improvement over learning from scratch on the target set. [Chen et al., 2015] suggest another approach to transfer, where the estimator of the target vector is built either by directly minimizing a combination of the quadratic losses of both sets (referred to as data pooling in their paper) or by constructing a convex combination of the individual estimators. They proceed to investigate the properties of these estimators, with one of their main results being the optimality of the convex combination under certain conditions. [Dar and Baraniuk, 2020] recently studied the transfer of a set of parameters for linear regression in the restricted setting of gaussian and independent features. They showed the existence of a "double double descent" phenomenon, with transfer being beneficial under certain conditions of under- or over-parametrization of the tasks. However in the aforementioned papers one common practice in transfer learning is missing, namely fine-tuning. It has shown a lot of success in recent years, notably for neural networks, and allows for more flexibility than just combining estimators. It consists in reusing a part of the parameters learned on the source (for instance neural network layers) and adjusting them on the target with a few gradient iterations [Shin et al., 2016].
Furthermore the problem of negative transfer, i.e. when the transfer procedure may be detrimental, is not fully addressed in the aforementioned papers, especially for new observations. [Fawaz et al., 2018] approach the problem empirically: after defining a distance between datasets based on dynamic time warping (DTW), they show that in general negative transfer happens when the defined distance between source and target is large. [Ben-David et al., 2010] indirectly address the issue of negative transfer for the problem of binary classification. Considering the transfer problem as a special case of a multi-task objective, not only do they obtain an upper bound on the transfer prediction error, but they also prove the existence of phases depending on the numbers of samples N_S and N_T available for source and target respectively. Finally a domination inequality is established in [Chen et al., 2015] for their optimal estimator, meaning that in certain cases transfer is bound to be beneficial. However it is valid only under specific assumptions on the observations that are usually not met in practice. Therefore we define the problem of negative transfer as follows. We have at our disposal two parametric models: M_T, of estimated parameter β̂_T, trained on the available target samples, and M_{T|S}, of estimated parameter β̂_{T|S}, trained on the source samples and then enriched on the same target ones as M_T. We would like to know whether M_{T|S} will have a better prediction error than M_T on a new sample (x, y) drawn from the target distribution, i.e. whether the transfer is positive or not.
Let f_{β̂_T}(x) be the corresponding prediction from M_T and f_{β̂_{T|S}}(x) the one from M_{T|S}. Following [Bosq and Blanke, 2008], we use the quadratic prediction error (QPE) as the metric of the quality of a prediction: R(f_β̂(x)) = E[(y − f_β̂(x))²], where the expectation is taken with respect to the noise of y and the distribution of β̂. The transfer will therefore be said to be negative for a given x when E[(y − f_{β̂_T}(x))²] < E[(y − f_{β̂_{T|S}}(x))²]. Hence we would like to know in advance whether this inequality stands or not, i.e. when negative transfer will happen. In our work we derive a new quantity, referred to as the gain, quantifying the benefits of transfer, without any assumption beyond the one of the linear model. While the hypothesis of a linear model may seem restrictive, it includes many variants such as generalized additive models (GAM) [Wood, 2017] that make it possible to capture highly nonlinear effects through the use of spline bases. We will also show that it is possible to derive a hypothesis test to predict in practice whether the transfer is positive or not. The contributions of the paper are the following:

1. We formalize the problem of negative transfer for the fine-tuning of a linear regression model. Our framework is moreover valid for a broad class of transfer procedures for the linear model found in the literature.
2. We show that the transfer gain for a new feature vector x depends on its representation in an eigenbasis depending on the parameters of the linear model.
3. We establish a link between transfer by data pooling and fine-tuning, and show that in the framework of linear regression they both yield estimators of the same form.
4. We suggest a statistical test to choose, for a new observation x, between the target model M_T and a fine-tuned one M_{T|S}.

The rest of the paper is organized as follows.
Section 2 introduces the fine-tuning transfer procedure considered for the linear model and establishes the equation of the transfer gain. Section 3 presents the statistical test to predict positive and negative transfer. In Section 4 we apply the test on synthetic data as well as a real-world electricity consumption dataset. Section 5 illustrates how the gain relates to the source and target sample sizes. Finally Section 6 concludes our work and suggests further research possibilities. The appendices contain mathematical proofs and additional figures.
2 Transfer by fine-tuning for the linear model

Let D_S = {(x_{S,i}, y_{S,i}), i = 1..N_S} be the source training dataset of size N_S, where the input vectors x_{S,i} ∈ R^D are supposed to be deterministic and the y_{S,i} are supposed to be independent. We make the assumption of the gaussian linear model, i.e. that there exists β_S ∈ R^D such that y_{S,i} = x_{S,i}⊤ β_S + ε_{S,i} with ε_{S,i} ∼ N(0, σ_S²). Similarly we define D_T = {(x_{T,i}, y_{T,i}), i = 1..N_T} the target training dataset, with x_{T,i} ∈ R^D and independent y_{T,i} = x_{T,i}⊤ β_T + ε_{T,i}, where β_T ∈ R^D and ε_{T,i} ∼ N(0, σ_T²). The source and target data are supposed to be independent. The data can be rewritten more conveniently under matrix form: Y_ν = X_ν β_ν + ε_ν, where ν ∈ {S, T} denotes either the source or the target. The design matrices X_S and X_T, respectively of size N_S × D and N_T × D, are assumed to be full rank, so that X_ν⊤ X_ν is invertible. This implies that D < N_ν, which corresponds to the low-dimensional setting. Many aspects of our work can be generalized to the high-dimensional setting, but this is out of the scope of this paper. The standard procedure to estimate the coefficients is the minimization of the least-squares error J_ν(β) := ‖Y_ν − X_ν β‖² on each training set. It yields the well-known solution β̂_ν = Σ_ν⁻¹ X_ν⊤ Y_ν, where Σ_ν = X_ν⊤ X_ν. In our framework we suppose that the number of target samples N_T is too low to obtain a decent estimator of β_T, and that leveraging information from S may improve performance. Therefore we start from the estimator β̂_S and fine-tune it by batch gradient descent (GD) of step size α on D_T. The following result gives the expression of the fine-tuned estimator.

Proposition 1.
At iteration k ∈ N the fine-tuned estimator of β_T is:

β̂_k = A^k β̂_S + (I_D − A^k) β̂_T   (1)

where A = I_D − αΣ_T and I_D is the identity matrix of size D.

We will refer to the resulting model as M_{T|S}. The fine-tuned estimator is therefore a matrix combination of the source and target estimators. In fact this observation can be taken further in the right vector basis to give more insight into this expression. Since Σ_T is symmetric and real-valued, let P be an orthogonal diagonalization basis matrix such that Σ_T = P Λ P⊤, with Λ = diag(λ_i, i = 1..D) the diagonal matrix of eigenvalues of Σ_T. Let β̃_ν denote the coordinates of β̂_ν in Σ_T's eigenbasis, so that β̂_ν = P β̃_ν. Reusing equation (1) yields:

β̃_k = (I_D − αΛ)^k β̃_S + (I_D − (I_D − αΛ)^k) β̃_T   (2)

which means that for every coordinate i in this basis we have:

β̃_k^(i) = (1 − αλ_i)^k β̃_S^(i) + (1 − (1 − αλ_i)^k) β̃_T^(i).   (3)

Hence when α is small enough, and in the right basis, each coordinate of the fine-tuned coefficient is a convex combination of the source and target coefficients, albeit with different weights depending on the eigenvalues λ_i. For small eigenvalues of Σ_T the fine-tuning procedure gives a larger weight to the source, whereas the opposite holds for larger ones. Note that these expressions relate this transfer strategy to the ones introduced in [Chen et al., 2015], where two types of transfer for the linear model are considered. The first one is the pooling of source and target data, leading to estimators of the form β̂_λ = W_λ β̂_S + (I_D − W_λ) β̂_T, where W_λ is a matrix depending on the penalty parameter λ > 0. The second one is a simple convex combination β̂(ω) = ω β̂_S + (1 − ω) β̂_T for a constant weight ω ∈ [0, 1].
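The closed form of equation (1) can be checked numerically against an explicit gradient-descent loop. Below is a minimal numpy sketch on synthetic data (sizes, seed and noise levels are illustrative), assuming the ½-scaled least-squares objective J_T(b) = ½‖Y_T − X_T b‖² so that one step reads β ← β − α(Σ_T β − X_T⊤ Y_T):

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_S, N_T, k = 4, 200, 30, 100

# Two related linear tasks (illustrative values, not from the paper).
X_S = rng.normal(size=(N_S, D))
X_T = rng.normal(size=(N_T, D))
beta_S_true = rng.normal(size=D)
beta_T_true = beta_S_true + 0.1 * rng.normal(size=D)
Y_S = X_S @ beta_S_true + rng.normal(size=N_S)
Y_T = X_T @ beta_T_true + rng.normal(size=N_T)

# Ordinary least squares on each dataset: beta_hat = Sigma^{-1} X^T Y.
Sigma_S, Sigma_T = X_S.T @ X_S, X_T.T @ X_T
beta_hat_S = np.linalg.solve(Sigma_S, X_S.T @ Y_S)
beta_hat_T = np.linalg.solve(Sigma_T, X_T.T @ Y_T)

# Step size small enough that I - alpha * Sigma_T has eigenvalues in (0, 1).
alpha = 0.5 / np.linalg.eigvalsh(Sigma_T).max()

# Fine-tune beta_hat_S with k gradient steps on the target loss.
beta = beta_hat_S.copy()
for _ in range(k):
    beta = beta - alpha * (Sigma_T @ beta - X_T.T @ Y_T)

# Closed form of equation (1): beta_k = A^k beta_hat_S + (I - A^k) beta_hat_T.
A_k = np.linalg.matrix_power(np.eye(D) - alpha * Sigma_T, k)
beta_closed = A_k @ beta_hat_S + (np.eye(D) - A_k) @ beta_hat_T
assert np.allclose(beta, beta_closed)
```

The loop and the matrix-combination formula agree to floating-point precision, illustrating that fine-tuning interpolates between β̂_S (k = 0) and β̂_T (k → ∞).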
Transfer by fine-tuning thus lies between those two approaches: in the right basis and for α small enough, each coefficient is a convex combination of the source and target ones, albeit with different weights depending on the eigenvalue λ_i. This allows for more adaptability than a constant ω, as will be shown in the simulations. It is interesting to note that in the end two popular transfer approaches, namely data pooling and fine-tuning, yield estimators of the same class β̂(W) = W β̂_S + (I_D − W) β̂_T with specific forms of W ∈ R^{D×D}. In the case of data pooling the expression of W is more complex and it is generally not symmetric (see [Chen et al., 2015]). To our knowledge such a strong relationship between the two approaches has never been highlighted in the literature before.

The quality of a model is evaluated on a new independent sample (x, y) drawn from the underlying distribution of T. We want to know whether for this given x the estimator β̂_k learned on S but fine-tuned on T is better than the basic estimator β̂_T. Following the discussion in the introduction, we introduce the algebraic gain ∆R_k(x) for the sample (x, y), defined by ∆R_k(x) = E[(y − ŷ_T)²] − E[(y − ŷ_k)²], where ŷ_T = x⊤β̂_T and ŷ_k = x⊤β̂_k. We have the following result in the case of fine-tuning.

Proposition 2. For transfer by fine-tuning as presented by equation (1), at iteration k the gain is:

∆R_k(x) = x⊤ H_k x where H_k = σ_T²(Σ_T⁻¹ − α² Ω_k Σ_T Ω_k) − σ_S² A^k Σ_S⁻¹ A^k − A^k B A^k   (4)

with Ω_k = α⁻¹ Σ_T⁻¹(I_D − A^k) and B = (β_T − β_S)(β_T − β_S)⊤. When the gain is positive, the transfer is beneficial for the sample (x, y), and negative otherwise.

It can therefore be seen that the matrix H_k plays a significant role in the transfer problem. The gain will be positive for vectors in the span of the eigenvectors of H_k associated to positive eigenvalues.
The roles of the noise in the data and of the distance between the regression parameters also become clear with this formula, and they match intuition. When ‖β_S − β_T‖ is large, i.e. when the means of the y_ν differ significantly, transfer is likely to be detrimental. When σ_T² is large (the target data is noisy), the gain increases, since learning from the target data alone may be difficult. Note that this expression of the gain does not require any hypothesis on x, which is a major difference with previous works. We also see that a uniformly positive transfer may be impossible, and that the benefit of transfer is a local property: for some x it may be beneficial to use a fine-tuned model, whereas for others not. From (4), bounds on the prediction error can easily be derived:

E[(y − ŷ_k)²] ≤ E[(y − ŷ_T)²] − λ_min(H_k)‖x‖²
E[(y − ŷ_k)²] ≥ E[(y − ŷ_T)²] − λ_max(H_k)‖x‖²   (5)

where λ_min and λ_max respectively denote the minimum and maximum eigenvalues of H_k. Those bounds hold for any x ∈ R^D and only require H_k to be symmetric, which is the case when performing transfer by fine-tuning. As one can see, the transfer is always positive when λ_min(H_k) > 0. More generally, an expression similar to (4) is possible for any estimator of the form β̂(W) = W β̂_S + (I_D − W) β̂_T. However when W is not symmetric, the interpretability of transfer in terms of eigenvector directions is lost and the bounds (5) cannot be established in the same way. Consequently, if H_k were accessible, one would know exactly which model to use. The issue is that many quantities in the matrix are unknown, namely the true regression parameters β_ν and the true noise variances σ_ν².
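In simulations, where β_ν, σ_ν² and the design matrices are all known, H_k can be computed directly. The sketch below (synthetic data, illustrative names) builds H_k as in (4), assuming Ω_k = α⁻¹ Σ_T⁻¹(I_D − A^k) as in Proposition 2, cross-checks it against the bias-variance decomposition of the gain (the two forms coincide because A^k and Σ_T commute), and verifies the eigenvalue bounds of (5):

```python
import numpy as np

rng = np.random.default_rng(1)
D, k = 4, 50
sigma2_S, sigma2_T = 1.0, 1.0            # true noise variances (known here)

X_S = rng.normal(size=(200, D))
X_T = rng.normal(size=(30, D))
Sigma_S, Sigma_T = X_S.T @ X_S, X_T.T @ X_T
alpha = 0.5 / np.linalg.eigvalsh(Sigma_T).max()

beta_S = rng.normal(size=D)
beta_T = beta_S + 0.05 * rng.normal(size=D)
B = np.outer(beta_T - beta_S, beta_T - beta_S)

I = np.eye(D)
A_k = np.linalg.matrix_power(I - alpha * Sigma_T, k)
inv_S, inv_T = np.linalg.inv(Sigma_S), np.linalg.inv(Sigma_T)
Omega_k = inv_T @ (I - A_k) / alpha      # Omega_k = (1/alpha) Sigma_T^{-1} (I - A^k)

# Gain matrix of equation (4).
H_k = (sigma2_T * (inv_T - alpha**2 * Omega_k @ Sigma_T @ Omega_k)
       - sigma2_S * A_k @ inv_S @ A_k
       - A_k @ B @ A_k)

# Same matrix from the bias-variance decomposition of the two predictors.
H_direct = (sigma2_T * inv_T
            - sigma2_S * A_k @ inv_S @ A_k
            - sigma2_T * (I - A_k) @ inv_T @ (I - A_k)
            - A_k @ B @ A_k)
assert np.allclose(H_k, H_direct)

# Bounds (5): lambda_min(H_k)||x||^2 <= Delta R_k(x) <= lambda_max(H_k)||x||^2.
x = rng.normal(size=D)
gain = x @ H_k @ x
lam = np.linalg.eigvalsh((H_k + H_k.T) / 2)
assert lam[0] * (x @ x) - 1e-9 <= gain <= lam[-1] * (x @ x) + 1e-9
```

In practice H_k is not accessible, which is precisely what motivates the test of the next section; the snippet above is only usable when the true parameters are known, as in controlled experiments.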
A naive approach would be to consider the "plug-in" estimate Ĥ_k obtained by replacing the parameters by their estimates, but experiments have shown that this is a rather poor choice in most situations. Another strategy is therefore proposed in the next section. Finally we emphasize again that x is potentially a novel observation on which we require no hypothesis. In the aforementioned papers the bounds hold only under specific conditions that did not allow for any x ∈ R^D, making our result broader.

3 A statistical test to predict positive transfer

We simplify our problem to knowing in advance whether the transfer will be beneficial or not, i.e. whether ∆R_k(x) > 0. An alternative is therefore to define the problem as hypothesis testing. Considering that M_{T|S} is likely to be biased, we choose the null hypothesis H_0: {∆R_k(x) ≤ 0} against the alternative H_1: {∆R_k(x) > 0}. This boils down to choosing between two models, the pure target one and the fine-tuned one, for a given target sample. The idea of achieving the best performance in transfer learning by taking advantage of multiple models can be related to [Gao et al., 2008], where classifiers are weighted according to the local properties of target observations. The main result of the paper is the following:
Theorem 1.
Let x ∈ R^D be any observation. Let σ̂_S² and σ̂_T² be the estimates of the noise variances defined by σ̂_ν² = ‖Y_ν − X_ν β̂_ν‖² / (N_ν − D). Let ρ be such that ρ ≥ ‖β_T − β_S‖²/σ_T². Then the following test is of approximate level a to test H_0 against H_1:

ψ_k(x) := (σ̂_T²/σ̂_S²) · [x⊤(Σ_T⁻¹ − α² Ω_k Σ_T Ω_k)x − ρ‖A^k x‖²] / (x⊤ A^k Σ_S⁻¹ A^k x) > q_{1−a}   (6)

where q_{1−a} is the quantile of order 1 − a of the Fisher-Snedecor distribution F(N_T − D, N_S − D) with degrees of freedom N_T − D and N_S − D. The p-value for the observed data is:

p_k(x) = P_{F ∼ F(N_T − D, N_S − D)}(F ≥ ψ_k(x))   (7)

The parameter ρ can be seen as a prior on the distance between the source and target distributions. Indeed, in the gaussian case one can easily prove that D_KL(N(x⊤β_S, σ_S²) || N(x⊤β_T, σ_T²)) ≤ g(σ_S²/σ_T²) + ρ‖x‖²/2, where D_KL denotes the Kullback-Leibler (KL) divergence and g(u) = (u − log(u) − 1)/2. The larger ρ is, the more significant the difference between the source and target distributions is allowed to be, and thus the less likely the transfer is to be beneficial. When ρ = 0 (i.e. β_S = β_T) only the variances differ. Note that the Cauchy-Schwarz approximation lowers the power of the test (see the supplementary material for more details). An issue is that p_k(x) → 0 when k → +∞. Hence when the number of gradient iterations goes to infinity, the test will almost systematically reject the null hypothesis, despite the gain converging to 0. Therefore the choice of a reasonable k is of crucial importance. Finally the test can only be obtained when using a symmetric weight matrix W.
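The test of Theorem 1 only needs the two fitted models and the triple (α, k, ρ). A possible implementation is sketched below (this is an illustrative sketch, not the authors' code; scipy's F distribution provides the quantile and the p-value):

```python
import numpy as np
from scipy.stats import f

def transfer_test(x, X_S, Y_S, X_T, Y_T, alpha, k, rho, a=0.05):
    """Return (reject H0, p-value) for the statistic psi_k(x) of equation (6)."""
    (N_S, D), N_T = X_S.shape, X_T.shape[0]
    Sigma_S, Sigma_T = X_S.T @ X_S, X_T.T @ X_T
    beta_S = np.linalg.solve(Sigma_S, X_S.T @ Y_S)
    beta_T = np.linalg.solve(Sigma_T, X_T.T @ Y_T)
    # Unbiased noise variance estimates.
    s2_S = np.sum((Y_S - X_S @ beta_S) ** 2) / (N_S - D)
    s2_T = np.sum((Y_T - X_T @ beta_T) ** 2) / (N_T - D)
    I = np.eye(D)
    A_k = np.linalg.matrix_power(I - alpha * Sigma_T, k)
    inv_T = np.linalg.inv(Sigma_T)
    Omega_k = inv_T @ (I - A_k) / alpha
    Akx = A_k @ x
    num = x @ (inv_T - alpha**2 * Omega_k @ Sigma_T @ Omega_k) @ x - rho * (Akx @ Akx)
    den = Akx @ np.linalg.solve(Sigma_S, Akx)
    psi = (s2_T / s2_S) * num / den
    p_value = f.sf(psi, N_T - D, N_S - D)          # P(F >= psi), equation (7)
    return psi > f.ppf(1 - a, N_T - D, N_S - D), p_value
```

With ρ = 0 the statistic only accounts for the variance terms; a larger ρ encodes a stronger prior distance between the tasks, shrinks the numerator and thus makes rejection (i.e. predicted positive transfer) harder, consistently with the discussion above.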
3.2 Choice of α, k and ρ

Figure 1: Comparison of the estimates of P_T and the theoretical gain values for the different models. (a) Overlap of P_T and its different estimates. (b) Theoretical gain for different k.

Three quantities must be tuned before using the test: the gradient step size α, the number of iterations k and the approximation parameter ρ. Equation (3) suggests taking 0 < α < 1/λ_max(Σ_T) so that the coordinates remain a convex combination of the source and target ones. Additionally, according to [Bertsekas and Scientific, 2015], the step size α∗ = 2/(λ_max(Σ_T) + λ_min(Σ_T)) allows gradient descent to converge at optimal speed. However in our case convergence to β̂_T is not desirable, since it would erase the benefits of β̂_S. Taking α to be a fraction of α∗ has proven to be a good choice in practice, since it ensures the first condition while remaining close to α∗. Moreover, we observed experimentally that a low value of α can be compensated by a larger k.

Ideally, one would choose the smallest k such that λ_min(H_k) ≥ 0 (which would ensure an exclusively positive gain). However this depends on unknown parameters, and again a plug-in estimate yields poor results. The following approach yields satisfactory results.
Let us denote by N_T = N(x⊤β_T; σ_T² x⊤Σ_T⁻¹x) and N_k = N(x⊤β_k; x⊤V_k x) the distributions of the predictions, where β_k = E[β̂_k] and V_k = σ_S² A^k Σ_S⁻¹ A^k + σ_T² α² Ω_k Σ_T Ω_k. It can be proved that:

∆R_k(x) = −2σ_T² x⊤Σ_T⁻¹x · D_KL(N_k || N_T) − σ_T² x⊤Σ_T⁻¹x · ln( x⊤V_k x / (σ_T² x⊤Σ_T⁻¹x) )   (8)

Let U_k(x) denote the second term of the right-hand side. Since the KL divergence is positive, U_k(x) needs to be large to ensure a positive gain, and maximizing U_k(x) is therefore likely to maximize ∆R_k(x). Since the amount of target data is limited, we cannot afford to perform this selection on a hold-out set. Therefore k is selected by maximizing U_k := (1/(N_S + N_T)) Σ_{i=1}^{N_S+N_T} U_k(x_i) over the joint training inputs, where the true variances have been replaced by their empirical counterparts. In the absence of a local maximum, the elbow rule is applied instead. Finally, the choice of ρ is performed by considering a range of possible values and checking the precision and recall of the test when used on the joint training data D_S ∪ D_T. We denote by k̂ and ρ̂ the choices of k and ρ made with this procedure.

Figure 2: Results on the test data (first scenario). (a) Test RMSE as a function of k. (b) P-value over time on the test set (k = k̂).

4 Experiments

In this section the theoretical results are illustrated by numerical experiments. The first one is performed on synthetic data where all the parameters are known. In the second example, real electricity consumption data is used, demonstrating the usefulness of the test for real-life applications.
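Before turning to the experiments, the Û_k-based selection of k from Section 3.2 can be sketched as follows (a plug-in sketch with empirical variances and an equivalent commuted form of V_k; the elbow-rule fallback is omitted and all names are illustrative):

```python
import numpy as np

def select_k(X_S, Y_S, X_T, Y_T, alpha, k_grid):
    """Pick k in k_grid maximizing the empirical average of
    U_k(x) = -s2_T * x'S_T^{-1}x * ln( x'V_k x / (s2_T * x'S_T^{-1}x) )
    over the pooled training inputs (heuristic of Section 3.2)."""
    (N_S, D), N_T = X_S.shape, X_T.shape[0]
    Sigma_S, Sigma_T = X_S.T @ X_S, X_T.T @ X_T
    beta_S = np.linalg.solve(Sigma_S, X_S.T @ Y_S)
    beta_T = np.linalg.solve(Sigma_T, X_T.T @ Y_T)
    s2_S = np.sum((Y_S - X_S @ beta_S) ** 2) / (N_S - D)
    s2_T = np.sum((Y_T - X_T @ beta_T) ** 2) / (N_T - D)
    inv_S, inv_T = np.linalg.inv(Sigma_S), np.linalg.inv(Sigma_T)
    X_all = np.vstack([X_S, X_T])
    I = np.eye(D)
    # Prediction variance under the target model, per input row.
    v_T = s2_T * np.einsum('ij,jk,ik->i', X_all, inv_T, X_all)
    scores = []
    for k in k_grid:
        A_k = np.linalg.matrix_power(I - alpha * Sigma_T, k)
        # V_k in its commuted form (A^k and Sigma_T commute).
        V_k = s2_S * A_k @ inv_S @ A_k + s2_T * (I - A_k) @ inv_T @ (I - A_k)
        v_k = np.einsum('ij,jk,ik->i', X_all, V_k, X_all)
        scores.append(np.mean(-v_T * np.log(v_k / v_T)))
    return k_grid[int(np.argmax(scores))]
```

The grid and the absence of a hold-out set mirror the procedure described above; a local maximum of the score plays the role of k̂.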
We consider the problem of the estimation of the coefficients of a target cubic polynomial P_T(x) = β_{T,0} + β_{T,1}x + β_{T,2}x² + β_{T,3}x³. The advantage of this example lies in how easily it can be visualized, as will be seen afterwards. We have N_T = 60 independent target observations y_{T,i} = P_T(x_{T,i}) + ε_{T,i} with ε_{T,i} ∼ N(0, σ_T²), the inputs x_{T,i} lying in an interval of negative values. Additionally we have N_S = 600 independent source observations y_{S,i} = P_S(x_{S,i}) + ε_{S,i} with ε_{S,i} ∼ N(0, σ_S²), the inputs x_{S,i} lying in an interval of positive values. The coefficients of P_S are the ones of P_T plus a small zero-mean gaussian perturbation, and we set σ_T = σ_S = 1. Considering the locations of the samples for the source and target, the transfer is expected to be beneficial for x ≥ 0. As suggested earlier the step size is set to a fraction of α∗, whereas the number of gradient iterations k is chosen by the strategy proposed in Section 3.2, leading to k̂ = 405. In order to illustrate the benefits of our tuning procedure, the results for k̂ are compared with the ones of a sub-optimal k = 50, but also with the estimators using W = ω_orcl I_D or W_λ̂ from [Chen et al., 2015]. The true polynomial P_T as well as its estimates are represented in Figure 1 (a). The fine-tuned model with k̂ is the closest to the real curve, showing the advantage it takes from both the data of S and T. These estimates are consistent with the theoretical gain represented in Figure 1 (b). Furthermore, fine-tuning proves to be superior to both of Chen's estimators, even with a theoretically optimal coefficient inaccessible in practice. This means that the fine-tuning procedure allows a more flexible use of both source and target data than simple data pooling or a convex combination of estimators. The benefit of the tuning procedure for k is also visible: for k = 50 the gain is still significantly negative on part of the negative axis, whereas it is positive almost everywhere for k = k̂.
The p-values from (7) are represented in Figure 3 as a function of x for two different values of ρ. Additionally we represent the indicator 1(∆R_k(x) ≤ 0), i.e. when the theoretical gain is negative. As one can see, the ranges of large p-values and of the indicator usually coincide. The choice ρ = ρ̂ yields the best results, since for ρ = 4ρ̂ the p-value suddenly increases between 0 and 1 despite the gain being positive.

Figure 3: P-value as a function of x (k = k̂), for ρ = ρ̂ and ρ = 4ρ̂, together with the indicator 1(∆R_k(x) < 0).

Figure 4: Results on the test data (second scenario). (a) Test RMSE as a function of k. (b) P-value over time on the test set (k = k̂).
This dataset was used during the GEFCom2012 electricity consumption forecasting competition [Hong et al., 2014]. It consists of the electricity consumption of 21 areas (zones) located in the United States, available from the 1st of January 2004 to the 31st of December 2007 with a 1-hour temporal resolution, which we normalized. Input variables include calendar ones such as the day of the week and the time of the year, but also the temperature measurements of 10 meteorological stations over the same period. The nature of our transfer is twofold: across time (the period on which a model has been trained) and space (from one area to another). We will use the measured load at 8 a.m. of zone 13 as source S and of zone 2 as target T. To focus on the benefits of our test and to avoid time series stationarity issues, both load time series have been detrended. The corresponding time series are represented in the supplementary material. In our work we focus on the use of our hypothesis test and not on achieving pure predictive performance. Therefore the model we consider for both source and target is very simple:

y_{ν,t} = β_{ν,0} + β_{ν,1}|sin(ωt)| + β_{ν,2}WE_t + Σ_{j=3}^{5} β_{ν,j} θ_t 1(θ_t ∈ I_j) + ε_{ν,t},  ν ∈ {S, T}   (9)

where y_{ν,t} is the load demand at 8 a.m. for day t and ω = π/365. The sine term accounts for the annual periodicity, WE_t is a binary variable equal to 1 on weekends, and θ_t is the temperature, whose effect is cut into three intervals I_j to translate the impact of heating and cooling on the electricity demand [Pierrot and Goude, 2011]. Whether for the source or the target, the training data will be included within the year 2004, whereas the test data on which performance is finally evaluated will be the whole year 2005 of zone 2. In order to evaluate the performance brought by our test, we consider the oracle prediction which knows in advance whether to use M_T or M_{T|S}, and denote by M*_{T|S} the model which chooses between them according to our test. Thus the closer M*_{T|S} is to the oracle, the better.
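A design matrix for a model of the form (9) can be assembled as below. The temperature knots, the helper name and ω = π/365 (daily data) are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def design_matrix(day_of_year, is_weekend, temperature, knots=(10.0, 20.0)):
    """Feature map of the simple load model (9): intercept, annual |sin| term,
    weekend dummy, and temperature split into three intervals (knots illustrative)."""
    t = np.asarray(day_of_year, dtype=float)
    theta = np.asarray(temperature, dtype=float)
    lo, hi = knots
    cols = [
        np.ones_like(t),                        # beta_0: intercept
        np.abs(np.sin(np.pi * t / 365.0)),      # beta_1: annual periodicity
        np.asarray(is_weekend, dtype=float),    # beta_2: weekend indicator
        theta * (theta < lo),                   # beta_3: heating regime
        theta * ((theta >= lo) & (theta < hi)), # beta_4: intermediate regime
        theta * (theta >= hi),                  # beta_5: cooling regime
    ]
    return np.column_stack(cols)
```

Fitting (9) then reduces to ordinary least squares on this matrix for each zone, after which the fine-tuning and the test apply exactly as in the synthetic case.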
The metric of evaluation is the root mean squared error (RMSE), defined as RMSE = sqrt( (1/T) Σ_{t=1}^{T} (y_t − ŷ_t)² ), where T is the number of test samples.

In the first scenario we suppose that the data for the source S is available for the whole year 2004, while the target training data is only available from October the 1st to the end of the year. Hence N_S = 366 and N_T = 92. The RMSE for different values of k is represented in Figure 2 (a), with a vertical line corresponding to our chosen k̂. Here the improvement brought by the test is only marginal for k below a threshold. This is due to the phenomenon discussed at the end of Section 3, whereby when k → ∞ the test tends to systematically reject H_0. Note that the RMSE is minimal for k̂. The errors for ω̂_plug and W_λ̂ from Chen et al. have been calculated but are not represented, because both always fared poorer than M_T. Most importantly however, the model M*_{T|S} is always as good as M_{T|S}: the test can thus be used safely in practice. The p-value over time on the test set is also represented in Figure 2 (b). One sees that it is almost always close to 0, except locally during cold months. Since M_T was trained on a similar period the year before, such a behavior is logical.

In the second scenario we consider the case where the training data from S is available between April the 1st and September the 30th of 2004, while the data from T is available between the 1st of September and the 31st of December 2004, thus N_S = 182 and N_T = 122. In practice this could correspond to the case where a customer breaks his contract and a new one arrives. Results on the test data for the different approaches are given in Figure 4 (a). We see that this time the test significantly improves upon the individual forecasts, lowering the RMSE by more than 0.02 compared to M_T for the chosen k̂. The test efficiently detects the situations of positive and negative transfer, thus taking advantage of each model's specificities. Furthermore the prediction for k̂ is very close to the oracle.
Again both of Chen et al.’s predictors yielded poor results,being only marginally better than M T . The p-value over time on the test set is also plotted fig. 4 (b). It is close to 0 ona period similar to the one the source model was trained the year before, and large during the cold months where themodel T is expected to be better. A natural question to ask is how the gain evolves with the sample sizes N S and N T . One would for instance expectthe gain to increase when the number of source samples is order of magnitudes higher than the target one. In orderto analyze these dependencies, we consider the following experimental framework. We suppose that the sourceand target data are i.i.d. x ν,i ∼ N (0 , I D ) (thus Σ ν ∼ W D ( I D , N ν ) where W D ( I D , Ψ) ). For ( N S , N T ) in a grid I S × I T , we calculate and average the gain ∆ R k ( x ) over B = 50 simulations for x ∼ N (0 , I D ) as well. Algorithm 1summarizes the procedure. This experiment is conducted for a dimension size D = 15 , k ∈ { , , } , α = α ∗ / and (cid:107) β S − β T (cid:107) = 0 . (both coefficients have been randomly sampled). In order to improve the readability, the gain hasbeen thresholded to the range [ − . , . . The results are represented in Fig. 5. −0.4−0.20.00.20.4200 400 600 800 1000100200300400500 N S N T || b S - b T ||=0.25, k=0 (a) k = 0 −0.4−0.20.00.20.4200 400 600 800 1000100200300400500 N S N T || b S - b T ||=0.25, k=10 (b) k = 10 −0.4−0.20.00.20.4200 400 600 800 1000100200300400500 N S N T || b S - b T ||=0.25, k=50 (c) k = 50 Figure 5: Transfer phases in function of N S , N T and k .8hases are observed depending on the values of N S and N T , and follow the intuition. When no fine-tuning is performed(i.e. k = 0 ) the gain will be positive only when the number of target samples N T is small and the number of sourceones N S is large enough. For N T above a certain threshold, negative transfer will systematically happen. 
When $k$ increases, the blue areas corresponding to negative gain fade away, meaning that at worst the transfer procedure has a neutral impact, even for large values of $N_T$. The benefits of transfer through fine-tuning are particularly visible for $k = 10$, with an increase in the size of the positive-transfer areas: the fine-tuning procedure takes advantage of both source and target samples. However, as emphasized before, an excessive number of gradient iterations may erase the benefits of transfer, as seen in Figure 5 (c) obtained for $k = 50$, where transfer can be beneficial only for extremely small values of $N_T$. This is because the fine-tuned estimator $\hat\beta_k$ has come too close to the pure target one $\hat\beta_T$. Note that these figures are reminiscent of Fig. 1 from [Ben-David et al., 2010]. Algorithm 1:
Gain simulation for varying $N_S$ and $N_T$.
Initialisation: $D$, $\beta_\nu$, $\sigma_\nu$, $k$; $I_S = \{200, \ldots, 1000\}$ and $I_T = \{100, \ldots, 500\}$.
Recursion: for $(N_S, N_T)$ in $I_S \times I_T$:
1. $\Delta R_k(N_S, N_T) \leftarrow 0$.
2. For $b = 1, \ldots, B$:
   (a) Generate $X_\nu \sim \mathcal{N}(0, I_D)$; deduce $\Sigma_\nu$.
   (b) Generate $x \sim \mathcal{N}(0, I_D)$.
   (c) $\Delta R_k(N_S, N_T) \leftarrow \Delta R_k(N_S, N_T) + (1/B)\, x^\top H_k x$.

In this paper a novel framework for the problem of transfer learning for the linear model was proposed. By defining the gain of transfer as a difference of quadratic prediction errors, we obtain a quantity that measures how beneficial or detrimental transfer by gradient descent is for a new (potentially unobserved) $x$. The framework of the gain is in fact applicable to any estimator of the form $\hat\beta(W) = W\hat\beta_S + (I_D - W)\hat\beta_T$, which encompasses many found in the literature. Since this gain depends on unknown parameters in practice, we derived a statistical test relying on the Fisher-Snedecor $F$ distribution to predict negative transfer. The test was applied on synthetic as well as real-world electricity demand data, where it proved its ability to predict negative transfer for new observations. Despite this success, some points remain to be investigated. How to choose the right number of gradient iterations $k$ remains problematic, although an empirical approach has been suggested. Furthermore, in order to obtain a tractable calculation and satisfying empirical results, we had to rely on an approximation. Another possibility would be to transfer only a subset of parameters. This is often the case for neural networks, where only certain layers are transferred [Laptev et al., 2018], but it could be adapted to linear models. [Dar and Baraniuk, 2020] investigate the benefits of transfer depending on the number of parameters transferred, but do not indicate how to choose the subset to transfer. Moreover they do so in the static setting, i.e. with no fine-tuning.
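Algorithm 1 can be sketched in Python as follows. This is a minimal illustration assuming the quantities of Proposition 2 ($A^k$, $V_k$, $B$ and $H_k$); all function names, the step size and the seed are chosen here for exposition only:

```python
import numpy as np

def gain_draw(rng, beta_S, beta_T, sigma_S, sigma_T, N_S, N_T, k, alpha):
    """One Monte Carlo draw of the gain Delta R_k(x) = x' H_k x with
    i.i.d. N(0, I_D) designs, following the quantities of Proposition 2."""
    D = beta_S.size
    X_S = rng.standard_normal((N_S, D))
    X_T = rng.standard_normal((N_T, D))
    Sig_S, Sig_T = X_S.T @ X_S, X_T.T @ X_T          # empirical Gram matrices
    A_k = np.linalg.matrix_power(np.eye(D) - alpha * Sig_T, k)
    B = np.outer(beta_T - beta_S, beta_T - beta_S)
    V_k = sigma_S**2 * A_k @ np.linalg.inv(Sig_S) @ A_k \
        + sigma_T**2 * (np.eye(D) - A_k) @ np.linalg.inv(Sig_T) @ (np.eye(D) - A_k)
    H_k = sigma_T**2 * np.linalg.inv(Sig_T) - V_k - A_k @ B @ A_k
    x = rng.standard_normal(D)                       # new input x ~ N(0, I_D)
    return x @ H_k @ x

def average_gain(N_S, N_T, k, D=15, B_sims=50, alpha=1e-3, seed=0):
    """Average the gain over B_sims simulations for one (N_S, N_T) cell,
    with ||beta_S - beta_T|| rescaled to 0.25 as in the experiment."""
    rng = np.random.default_rng(seed)
    beta_S, beta_T = rng.standard_normal(D), rng.standard_normal(D)
    beta_T = beta_S + 0.25 * (beta_T - beta_S) / np.linalg.norm(beta_T - beta_S)
    return np.mean([gain_draw(rng, beta_S, beta_T, 1.0, 1.0, N_S, N_T, k, alpha)
                    for _ in range(B_sims)])
```

The phase diagrams of Fig. 5 are then obtained by looping `average_gain` over the grid $I_S \times I_T$ for each value of $k$.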
We have also supposed that the matrices $\Sigma_\nu$ are invertible. Defining the gain without this hypothesis is still possible, although its form is slightly more complex, which makes it difficult to adapt the test directly. Finally, in this paper we made the hypothesis of linearity, which could seem restrictive. However, nonlinearity can be achieved through generalized additive models (GAM) for instance. Since they boil down to a linear model, the formula of the gain is valid for them as well. As such, however, the test we introduced cannot be used with GAM yet, and how to extrapolate it is currently under investigation.

Appendices
The first section of the appendix presents proofs of the results of the main article (Sections 1 through 3). The second part consists of figures relating to the experimental Section 4.
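As a quick numerical sanity check of Proposition 1 (proved below), one can verify on simulated data that running $k$ gradient steps on the target least-squares objective, starting from the source estimator, matches the closed form $\hat\beta_k = A^k\hat\beta_S + (I_D - A^k)\hat\beta_T$. The dimensions, step size and random source vector below are arbitrary stand-ins chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
D, N_T, alpha, k = 3, 50, 1e-3, 25                 # illustrative sizes
X_T = rng.standard_normal((N_T, D))
Y_T = rng.standard_normal(N_T)
Sig_T = X_T.T @ X_T
beta_T_hat = np.linalg.solve(Sig_T, X_T.T @ Y_T)   # target OLS estimator
beta_S_hat = rng.standard_normal(D)                # stands in for the source fit

b = beta_S_hat.copy()
for _ in range(k):                                 # gradient descent on J_T
    b -= alpha * (Sig_T @ b - X_T.T @ Y_T)

A_k = np.linalg.matrix_power(np.eye(D) - alpha * Sig_T, k)
closed_form = A_k @ beta_S_hat + (np.eye(D) - A_k) @ beta_T_hat
print(np.allclose(b, closed_form))                 # → True
```

The agreement holds for any $k$ as long as the step size keeps the iteration stable, i.e. $\alpha\lambda_{\max}(\Sigma_T) < 1$ here.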
A Proofs
This appendix presents proofs of the results in the paper.

A.1 Proposition 1
Proof.
We proceed by mathematical induction.
• For $k = 0$ the property is trivial: $\hat\beta_0 = \hat\beta_S = A^0\hat\beta_S + (I_D - A^0)\hat\beta_T$.
• Let $k \in \mathbb{N}$. We suppose the property true at rank $k$. We have
$$\hat\beta_{k+1} = \hat\beta_k - \alpha\nabla J_T(\hat\beta_k) = \hat\beta_k - \alpha\Sigma_T\hat\beta_k + \alpha X_T^\top Y_T.$$
By definition of $A = I_D - \alpha\Sigma_T$ and because $X_T^\top Y_T = \Sigma_T\hat\beta_T$, we obtain
$$\hat\beta_{k+1} = A\hat\beta_k + \alpha\Sigma_T\hat\beta_T.$$
Finally, by the induction hypothesis,
$$\hat\beta_{k+1} = A\left[A^k\hat\beta_S + (I_D - A^k)\hat\beta_T\right] + \alpha\Sigma_T\hat\beta_T = A^{k+1}\hat\beta_S + (I_D - A^{k+1})\hat\beta_T,$$
which concludes the induction.

A.2 Equations (2) and (3)
Proof.
Let $P$ be the orthogonal matrix of eigenvectors of $\Sigma_T$, i.e. such that $\Sigma_T = P\Lambda P^\top$ with $\Lambda = \mathrm{diag}(\lambda_i,\ i = 1..D)$ and $PP^\top = P^\top P = I_D$. Thus $\hat\beta_\nu = P\tilde\beta_\nu \Leftrightarrow \tilde\beta_\nu = P^\top\hat\beta_\nu$. One can also write $A = P(I_D - \alpha\Lambda)P^\top$. Hence reinjecting in (1) gives
$$\hat\beta_k = P(I_D - \alpha\Lambda)^k P^\top\hat\beta_S + P\left(I_D - (I_D - \alpha\Lambda)^k\right)P^\top\hat\beta_T.$$
Applying $P^\top$ on the left of this equation yields
$$\tilde\beta_k = (I_D - \alpha\Lambda)^k\tilde\beta_S + \left(I_D - (I_D - \alpha\Lambda)^k\right)\tilde\beta_T.$$
Finally, the matrices involved are diagonal with respective entries $(1 - \alpha\lambda_i)^k$ and $1 - (1 - \alpha\lambda_i)^k$, thus resulting in equation (3).

A.3 Proof of Proposition 2
Proof.
We recall that $\hat\beta_S \sim \mathcal{N}(\beta_S, \sigma_S^2\Sigma_S^{-1})$ and $\hat\beta_T \sim \mathcal{N}(\beta_T, \sigma_T^2\Sigma_T^{-1})$. By independence of $\hat\beta_S$ and $\hat\beta_T$ we thus have
$$\hat\beta_k \sim \mathcal{N}\left(A^k\beta_S + (I_D - A^k)\beta_T,\ \sigma_S^2 A^k\Sigma_S^{-1}A^k + \sigma_T^2(I_D - A^k)\Sigma_T^{-1}(I_D - A^k)\right).$$
It is easy to see that $\sigma_T^2(I_D - A^k)\Sigma_T^{-1}(I_D - A^k) = \sigma_T^2\alpha^2\Omega_k\Sigma_T\Omega_k$. We will note $\beta_k = \mathbb{E}[\hat\beta_k]$ and $V_k = \mathrm{Var}(\hat\beta_k)$. For an independent $y = x^\top\beta_T + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma_T^2)$ we obtain
$$y - \hat y_T = x^\top(\beta_T - \hat\beta_T) + \varepsilon \sim \mathcal{N}\left(0,\ \sigma_T^2(1 + x^\top\Sigma_T^{-1}x)\right),$$
$$y - \hat y_k = x^\top(\beta_T - \hat\beta_k) + \varepsilon \sim \mathcal{N}\left(x^\top(\beta_T - \beta_k),\ \sigma_T^2 + x^\top V_k x\right).$$
Thus $R(M_T) = \mathbb{E}[(y - \hat y_T)^2] = \mathrm{Var}(y - \hat y_T) + \mathbb{E}[y - \hat y_T]^2 = \sigma_T^2(1 + x^\top\Sigma_T^{-1}x)$ and $R(M_{T|S}) = \mathbb{E}[(y - \hat y_k)^2] = \sigma_T^2 + x^\top V_k x + \left(x^\top(\beta_T - \beta_k)\right)^2$. Therefore
$$\Delta R_k(x) = \sigma_T^2\, x^\top\Sigma_T^{-1}x - x^\top V_k x - \left(x^\top(\beta_T - \beta_k)\right)^2.$$
Since $\beta_T - \beta_k = A^k(\beta_T - \beta_S)$, we obtain $\left(x^\top(\beta_T - \beta_k)\right)^2 = x^\top A^k B A^k x$ where $B = (\beta_T - \beta_S)(\beta_T - \beta_S)^\top$, thus yielding the expected result.

A.4 Proof of equation (5)
Proof. $H_k$ is symmetric. Hence we can introduce $\{u_i\}_{i=1..D}$ an orthonormal basis of eigenvectors of $H_k$, with $\lambda_i(H_k)$ the associated eigenvalues. Let $x \in \mathbb{R}^D$ with coordinates $x_i$ in this basis, so that $x = \sum_{i=1}^D x_i u_i$. Since $\{u_i\}$ is orthonormal (i.e. $u_i^\top u_j = 1$ if $i = j$ and $0$ otherwise) it follows that
$$x^\top H_k x = \sum_{i,j=1}^D \lambda_i(H_k)\, x_i x_j\, u_i^\top u_j = \sum_{i=1}^D \lambda_i(H_k)\, x_i^2.$$
Since $\lambda_{\min}(H_k) \le \lambda_i(H_k) \le \lambda_{\max}(H_k)$, we get $\lambda_{\min}(H_k)\|x\|^2 \le x^\top H_k x \le \lambda_{\max}(H_k)\|x\|^2$. Finally, remembering that $x^\top H_k x = \mathbb{E}[(y - \hat y_T)^2] - \mathbb{E}[(y - \hat y_k)^2]$ yields (5).

A.5 Proof of Theorem 1
Proof.
It would be natural to reject $H_0$ if an estimator $\hat\delta(x)$ of the gain is above a certain threshold. Hence a natural form of such a decision rule is $(\hat\delta(x) > K_a)$, where $K_a$ is a constant depending on the desired level $a$ of the test. We consider the following estimator of $\Delta R_k(x)$:
$$\hat\delta(x) = \hat\sigma_T^2\, x^\top\left(\Sigma_T^{-1} - \alpha^2\Omega_k\Sigma_T\Omega_k\right)x - \hat\sigma_S^2\, x^\top A^k\Sigma_S^{-1}A^k x - x^\top A^k B A^k x.$$
While the matrix $B$ is not accessible in practice, we start from this estimator for the sake of simplicity of the calculations; we will address this issue later. It can be proved (see hereafter) that the type I error, the probability of wrongly rejecting the null hypothesis, is largest at the boundary $\Delta R_k(x) = 0$. At this boundary, $\hat\delta(x) > K_a$ is equivalent to
$$\frac{\hat\sigma_T^2/\sigma_T^2}{\hat\sigma_S^2/\sigma_S^2} + \frac{\hat\sigma_T^2}{\hat\sigma_S^2}\,\frac{x^\top A^k B A^k x}{\sigma_T^2\, x^\top A^k\Sigma_S^{-1}A^k x} > \frac{K_a + x^\top A^k B A^k x + \hat\sigma_S^2\, x^\top A^k\Sigma_S^{-1}A^k x}{\hat\sigma_S^2\, x^\top A^k\Sigma_S^{-1}A^k x}.$$
Since $\frac{\hat\sigma_T^2/\sigma_T^2}{\hat\sigma_S^2/\sigma_S^2} \sim \mathcal{F}(N_T - D, N_S - D)$, taking
$$K_a = q_{1-a}\,\hat\sigma_S^2\, x^\top A^k\Sigma_S^{-1}A^k x - \hat\sigma_S^2\, x^\top A^k\Sigma_S^{-1}A^k x - x^\top A^k B A^k x + \frac{\hat\sigma_T^2}{\sigma_T^2}\, x^\top A^k B A^k x$$
(where $q_{1-a}$ is the quantile of order $1 - a$ of the $\mathcal{F}(N_T - D, N_S - D)$ distribution) yields the test of level $a$:
$$\left(\phi_k(x) := \frac{\hat\sigma_T^2\, x^\top\left(\Sigma_T^{-1} - \alpha^2\Omega_k\Sigma_T\Omega_k\right)x - (\hat\sigma_T^2/\sigma_T^2)\, x^\top A^k B A^k x}{\hat\sigma_S^2\, x^\top A^k\Sigma_S^{-1}A^k x} > q_{1-a}\right).$$
However $B$ and $\sigma_T^2$ are unknown in practice; we will thus have to rely on a lower bound of $\phi_k(x)$ for the test. By hypothesis, we have $\|\beta_T - \beta_S\|^2/\sigma_T^2 \le \rho^2$. Since $B$ is symmetric, $x^\top A^k B A^k x \le \lambda_{\max}(B)\|A^k x\|^2$. Moreover $B$ is a rank-1 matrix and thus its sole nonzero eigenvalue is $\lambda_{\max}(B) = \|\beta_T - \beta_S\|^2$. The aforementioned hypothesis then leads to $\sigma_T^{-2}\, x^\top A^k B A^k x \le \rho^2\|A^k x\|^2$.
Therefore we have the following lower bound $\psi_k(x)$ of $\phi_k(x)$ that can be used in practice:
$$\psi_k(x) = \frac{\hat\sigma_T^2}{\hat\sigma_S^2}\cdot\frac{x^\top\left(\Sigma_T^{-1} - \alpha^2\Omega_k\Sigma_T\Omega_k\right)x - \rho^2\|A^k x\|^2}{x^\top A^k\Sigma_S^{-1}A^k x}.$$
What remains to prove is that the type I error is maximal at the frontier, i.e. where $\Delta R_k(x) = 0$. If $\Delta R_k(x) \le 0$ then
$$\phi_k(x) \le \frac{\hat\sigma_T^2}{\sigma_T^2}\cdot\frac{\sigma_S^2\, x^\top A^k\Sigma_S^{-1}A^k x + x^\top A^k B A^k x}{\hat\sigma_S^2\, x^\top A^k\Sigma_S^{-1}A^k x} - \frac{(\hat\sigma_T^2/\sigma_T^2)\, x^\top A^k B A^k x}{\hat\sigma_S^2\, x^\top A^k\Sigma_S^{-1}A^k x} = F,$$
where $F = \frac{\hat\sigma_T^2/\sigma_T^2}{\hat\sigma_S^2/\sigma_S^2}$. Thus finally
$$\mathbb{P}_{\Delta R_k(x) \le 0}\left(\phi_k(x) \ge q_{1-a}\right) \le \mathbb{P}_{F \sim \mathcal{F}(N_T - D, N_S - D)}\left(F \ge q_{1-a}\right) = \mathbb{P}_{\Delta R_k(x) = 0}\left(\phi_k(x) \ge q_{1-a}\right) = a,$$
which proves that the type I error is maximal at $\Delta R_k(x) = 0$ and that the level of the test is $a$. The p-value of the test relying on $\phi_k(x)$ can thus be upper bounded by $\mathbb{P}_{F \sim \mathcal{F}(N_T - D, N_S - D)}\left(F \ge \psi_k(x)\right)$, proving all the results of the theorem.

A.6 Proof of equation (8)
Proof.
The KL divergence between two univariate Gaussians directly yields
$$D_{KL}(\mathcal{N}_k\|\mathcal{N}_T) = \frac{1}{2}\left[\frac{x^\top V_k x + \left(x^\top(\beta_T - \beta_k)\right)^2 - \sigma_T^2\, x^\top\Sigma_T^{-1}x}{\sigma_T^2\, x^\top\Sigma_T^{-1}x} - \ln\left(\frac{x^\top V_k x}{\sigma_T^2\, x^\top\Sigma_T^{-1}x}\right)\right] = \frac{1}{2}\left[-\frac{\Delta R_k(x)}{\sigma_T^2\, x^\top\Sigma_T^{-1}x} - \ln\left(\frac{x^\top V_k x}{\sigma_T^2\, x^\top\Sigma_T^{-1}x}\right)\right],$$
hence the result.

B Additional experimental elements
The first subsection is dedicated to the calibration of the hyperparameters $k$ and $\rho$, discussed theoretically in Section 3 of the paper. The second one gives more insight into the GEFCOM2012 data.

B.1 Synthetic data - choice of the hyperparameters
The procedure to choose $k$ and $\rho$ is detailed further with Figure 6. The plot of $U_k$ used to choose $\hat k$ is represented in (a). Initially $U_k$ starts from a very high value, but the corresponding number of iterations is too low. Thus the local maximum at $\hat k = 4053$ is chosen instead. The tuning of $\rho$ is performed as follows. The training samples $(x_i, y_i) \in \mathcal{D}_S \cup \mathcal{D}_T$ such that $(y_i - \hat y_{T,i})^2 > (y_i - \hat y_{k,i})^2$ are labeled 1, and the others 0. This can be seen as using an empirical counterpart to the gain. The test is then applied on this data for different values of $\rho$, on which we check the recall and precision for the label 1. Finally $\hat\rho$ is taken to maximize the precision while keeping a good recall, i.e. right before the latter drops, as represented in (b).

Figure 6: Tuning of the test quantities $k$ and $\rho$: (a) choice of $\hat k$ ($U_k$ against the gradient iteration $k$); (b) choice of $\hat\rho$ (recall and precision against $\rho$).

B.2 GEFCOM2012 data representation

The electricity demand data of the GEFCOM2012 dataset is represented in Figure 7. Panel (a) depicts the annual trend of the demand for the two zones at 8 a.m., whereas (b) represents the average demand over a day, during the week or on weekends (WE). The behavior of the two series is similar: the annual trend is the same (higher consumption in winter and lower during the summer), and the daily peaks happen around the same hour.
Figure 7: Comparison of the load demand for zones 13 ($S$) and 2 ($T$): (a) load over 2004-2005; (b) daily load.

References

[Ben-David et al., 2010] Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. (2010). A theory of learning from different domains. Machine Learning, 79(1-2):151–175.

[Bertsekas, 2015] Bertsekas, D. P. (2015). Convex Optimization Algorithms. Athena Scientific, Belmont.

[Bosq and Blanke, 2008] Bosq, D. and Blanke, D. (2008). Inference and Prediction in Large Dimensions, volume 754. John Wiley & Sons.

[Bouveyron and Jacques, 2010] Bouveyron, C. and Jacques, J. (2010). Adaptive linear models for regression: improving prediction when population has changed. Pattern Recognition Letters, 31(14):2237–2247.

[Chen et al., 2015] Chen, A., Owen, A. B., Shi, M., et al. (2015). Data enriched linear regression. Electronic Journal of Statistics, 9(1):1078–1112.

[Dar and Baraniuk, 2020] Dar, Y. and Baraniuk, R. G. (2020). Double double descent: On generalization errors in transfer learning between linear regression tasks. arXiv preprint arXiv:2006.07002.

[Fawaz et al., 2018] Fawaz, H. I., Forestier, G., Weber, J., Idoumghar, L., and Muller, P.-A. (2018). Transfer learning for time series classification. In 2018 IEEE International Conference on Big Data, pages 1367–1376. IEEE.

[Gao et al., 2008] Gao, J., Fan, W., Jiang, J., and Han, J. (2008). Knowledge transfer via multiple model local structure mapping. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 283–291.

[Hong et al., 2014] Hong, T., Pinson, P., and Fan, S. (2014). Global energy forecasting competition 2012. International Journal of Forecasting, 30(2):357–363.

[Laptev et al., 2018] Laptev, N., Yu, J., and Rajagopal, R. (2018). Reconstruction and regression loss for time-series transfer learning. In Proc. SIGKDD MiLeTS.

[Launay et al., 2015] Launay, T., Philippe, A., and Lamarche, S. (2015). Construction of an informative hierarchical prior for a small sample with the help of historical data and application to electricity load forecasting. Test, 24(2):361–385.

[Lounici et al., 2009] Lounici, K., Pontil, M., Tsybakov, A. B., and Van De Geer, S. (2009). Taking advantage of sparsity in multi-task learning. arXiv preprint arXiv:0903.1468.

[Lounici et al., 2011] Lounici, K., Pontil, M., Van De Geer, S., Tsybakov, A. B., et al. (2011). Oracle inequalities and optimal inference under group sparsity. The Annals of Statistics, 39(4):2164–2204.

[Maurer, 2006] Maurer, A. (2006). Bounds for linear multi-task learning. Journal of Machine Learning Research, 7(Jan):117–139.

[Pan and Yang, 2009] Pan, S. J. and Yang, Q. (2009). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359.

[Pierrot and Goude, 2011] Pierrot, A. and Goude, Y. (2011). Short-term electricity load forecasting with generalized additive models. Proceedings of ISAP Power, 2011.

[Shin et al., 2016] Shin, H.-C., Roth, H. R., Gao, M., Lu, L., Xu, Z., Nogues, I., Yao, J., Mollura, D., and Summers, R. M. (2016). Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Transactions on Medical Imaging, 35(5):1285–1298.

[Weiss et al., 2016] Weiss, K., Khoshgoftaar, T. M., and Wang, D. (2016). A survey of transfer learning. Journal of Big Data, 3(1):9.

[Wood, 2017] Wood, S. N. (2017). Generalized Additive Models: An Introduction with R. CRC Press, second edition.