A General Approach to Domain Adaptation with Applications in Astronomy
Ricardo Vilalta, Kinjal Dhar Gupta, Dainis Boumber, Mikhail M. Meskhi
Ricardo Vilalta*, Kinjal Dhar Gupta, Dainis Boumber
Department of Computer Science, University of Houston
Houston TX, 77204-3010, USA
*Corresponding author email: [email protected]

Mikhail M. Meskhi
Department of Computer Science, North American University
Stafford TX, 77477, USA
Abstract—The ability to build a model on a source task and subsequently adapt such a model to a new target task is a pervasive need in many astronomical applications. The problem is generally known as transfer learning in machine learning, where domain adaptation is a popular scenario. An example is to build a predictive model on spectroscopic data to identify Supernovae Ia, while subsequently trying to adapt such a model to photometric data. In this paper we propose a new general approach to domain adaptation that does not rely on the proximity of source and target distributions. Instead, we simply assume a strong similarity in model complexity across domains, and use active learning to mitigate the dependency on source examples. Our work leads to a new formulation of the likelihood as a function of empirical error using a theoretical learning bound; the result is a novel mapping from generalization error to a likelihood estimate. Results on two real astronomical problems, Supernova Ia classification and identification of Mars landforms, show two main advantages of our approach: increased accuracy and substantial savings in computational cost.
Index Terms—Supervised Learning, Domain Adaptation, Maximum A Posteriori, Model Complexity.
I. INTRODUCTION
In this paper we propose a new approach to domain adaptation that is particularly well suited for astronomical applications. Our general setting assumes the learner is embedded in a domain adaptation framework [1]–[8], where the goal is to obtain a predictive model on a target domain where examples abound but labeled data is scarce. We assume the existence of a source domain with abundant labeled data, but with a different distribution, such that the naive approach of directly applying the source model to the target becomes inadequate. Instead, we follow a Maximum A Posteriori (MAP) approach to estimate model complexity by extracting the prior distribution from previous experience (i.e., from a previous task), and by taking the (scarce) target data as evidence to compute the likelihood. The result is a new approach to domain adaptation that is exempt from the common restriction demanding close proximity between source and target distributions [2].
We show that using a prior distribution from a previous task to estimate a posterior of model complexity on a new task not only yields an increase in accuracy, but also has an enormous impact on computational cost. Our focus is on astronomical problems, where we are witnessing a rapid growth of data volumes corresponding to a variety of astronomical surveys; data repositories have gone from gigabytes to terabytes, and we expect those repositories to reach petabytes in the coming years [9], [10]. Our proposed methodology assumes an exhaustive search for the right model complexity on a source domain, where we generate a prior distribution on model complexity. The arrival of a new target task dispenses with such an exhaustive search; instead, it generates a posterior distribution that directly leads to a near-optimal figure of model complexity.
This is particularly important for big-data applications where lengthy computational tasks are unavoidable, even with an efficient high-performance-computing infrastructure. We report on experiments using two real-world astronomical domains: classification of Supernovae Ia using photometric data, and characterization of landforms on Mars using Digital Elevation Maps (DEMs). Both domains can produce massive amounts of data with a strong need for efficient computational solutions. Results show how the use of a source prior to guide the search for an optimal value of model complexity can significantly improve generalization performance.
The rationale for assuming similar model-complexity values across tasks is based on the nature of distributional discrepancies in many physical domains. The idea is useful not only for astronomical data analysis, but for many other real-world problems where the shift in distribution originates from more sophisticated equipment (e.g., modern telescopes), different instrumentation, or different coverage of the feature space, while the complexity of the classification problem experiences little change. For example, while spectroscopic and photometric observations capture data at different levels of resolution, the identification itself of specific astronomical objects shares a similar degree of difficulty. In short, we assume that the change in distribution from source to target does not affect model complexity significantly.
This paper is organized as follows. We begin by providing basic concepts in classification and domain adaptation, followed by a detailed description of our proposed approach that shows how to extract a prior distribution from a source domain. We then show our experiments and empirical results. The last section provides a summary and conclusions.

II. PRELIMINARY CONCEPTS
A. Basic Notation
In supervised learning or classification, we assume the existence of a training set of examples, T = {(x_i, y_i)}_{i=1}^p, where vector x = (x_1, x_2, ..., x_n) is an instance of the input space X, and y is an instance of the output space Y. It is often assumed that sample T contains independently and identically distributed (i.i.d.) examples that come from a fixed but unknown joint probability distribution, P(x, y), in the input-output space X × Y. The output of the learning algorithm is a function f_θ(x) (parameterized by θ) mapping the input space to the output space, f_θ : X → Y. Function f_θ comes from a space of functions H. The idea is to search for the hypothesis that minimizes the expectation of a loss function L(y, f(x|θ)), a.k.a. the risk:

R(θ, P(x, y)) = E_{(x,y)∼P}[L(y, f(x|θ))]   (1)

where we usually employ the zero-one loss function:

L(y, f(x|θ)) = 1_{x | y(x) ≠ f(x|θ)}(x)   (2)

such that 1(·) is an indicator function, and y(x) is the true class of x.

Domain Adaptation. In domain adaptation, we assume the existence of a source domain, corresponding to a previous task from which experience can be leveraged, and a target domain, corresponding to the present task. Each domain enables us to draw a dataset: T_s = {(x_i, y_i)}_{i=1}^p for the source, and T_t = {x_i}_{i=1}^q for the target. T_s is an instantiation of a joint probability distribution, P_s(x, y), while T_t is an instantiation of the marginal distribution P_t(x) (from the joint distribution P_t(x, y), such that P_t(x) = ∫_y P_t(x, y) dy). The emphasis is always placed on the target domain, corresponding to the task at hand. The main objective is to induce a model from the target dataset; when building the model, one can exploit knowledge from the source dataset. A major difficulty in domain adaptation stems from the lack of labels on T_t.
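For concreteness, the empirical counterpart of the risk in Eq. (1) under the zero-one loss of Eq. (2) is just the misclassification rate on a sample. A minimal sketch (the function name is ours, not the paper's):

```python
import numpy as np

def empirical_risk(y_true, y_pred):
    """Empirical risk under the zero-one loss: the fraction of examples
    where the prediction f(x|theta) disagrees with the true class y(x)."""
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))
```

For instance, one disagreement out of four examples yields an empirical risk of 0.25.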
We will assume the possibility of querying some of those examples to attain a few labeled examples as part of an active learning setting [11].
Most domain adaptation methods assume similar class posteriors across source and target domains, i.e., P_s(y|x) = P_t(y|x), but different marginals, P_s(x) ≠ P_t(x); this is known as the covariate-shift assumption. Different from previous work, we will consider the case where source and target differ in both the marginal distributions and the class posteriors.

Parameter Estimation. We also consider the problem of parameter estimation, which can play a major role in classification as a means to estimate an optimal figure of model complexity. Examples include finding the number of hidden nodes in a neural network, or finding the degree of a polynomial kernel in support vector machines. In Maximum A Posteriori (MAP) estimation, the goal is to obtain a point estimate that maximizes the posterior distribution of the parameter given the data or evidence. The posterior probability is essentially a function of two main factors: the prior probability (i.e., the degree of belief in a model complexity before data analysis) and the likelihood (i.e., the probability of the data sample conditioned on model complexity). When data abounds, the likelihood bears a stronger influence on the posterior, while the opposite takes place when data is scarce; here the prior bears more influence on the posterior. An important question is how to obtain a reliable prior when data is scarce (i.e., when the prior plays a strong role in estimating the posterior).
B. Related Work
Domain adaptation induces a model by exploiting experience gathered from previous tasks [2]. It is considered a subfield of transfer learning [12], and has become increasingly popular in recent years due to the pervasive nature of task domains exhibiting differences in sample distribution [13], [14]. The central question is whether a previously constructed (source) model can be adapted to a new task, or whether it is better to build a new (target) model from scratch.
Domain adaptation methods can be classified into two types: instance-based and feature-based. The idea in instance-based methods is to assign high weights to source examples occupying regions of high density in the target domain. A popular approach is known as covariate shift [15]–[19]. The covariate-shift assumption is that one can build a model on the newly-weighted source sample and apply it directly to the target domain [20], [21]. A stringent requirement is that source and target distributions must be close to each other.
Feature-based domain adaptation methods attempt to project source and target datasets into a latent feature space where the covariate-shift assumption holds. A model is then built on the transformed space and used as the classifier on the target. Examples are structural correspondence learning [3] and subspace alignment methods [22], among others [8], [23], [24].
From a theoretical view, previous work has tried to estimate the distance between source and target distributions [1], [2], [4], and to employ regularization terms to find models with good generalization performance on both source and target domains [25].

III. METHODOLOGY
We begin by providing a general description of the proposed methodology. The main idea is to assist in finding the right configuration (model complexity) for a learning algorithm by leveraging information from a previous similar task. For example, when trying to find a predictive model to classify supernovae, or to predict the class of a transient star, searching for a model with the right degree of complexity by varying one or more configuration parameters may turn frustratingly cumbersome. As an example, setting the architecture of a (shallow) neural network by varying the number of hidden nodes would lead to a huge number of experiments to assess model quality for each architecture. To alleviate this situation, we learn a range of values of model complexity from a previous task using a Maximum A Posteriori approach, where there is a high likelihood of finding a good value of model complexity (e.g., number of hidden nodes) on the new task. Moreover, our method disregards many assumptions made by previous work: we do not follow the covariate-shift assumption; no data projection is required to transform the feature space (thus incurring no loss of information); and the dependence on the source is not based on transferring source examples to build the target model. To summarize, the main idea is to learn about the model-building process employed in a previous task (source domain), and to transfer that experience to the new task (target domain). Experience is here understood as a distribution of optimal values of model complexity.
A. Active Learning
Our basic strategy is to step aside from the common approach followed by many domain-adaptation techniques that selectively gather source examples to enlarge the set of target examples. When source and target distributions differ significantly, such an approach can lead to a biased model. Under high distribution discrepancy, an optimal strategy would simply rely on target instances. But the classical setting in domain adaptation provides no (or very few) class labels on the target dataset. A solution to this conundrum is to provide target class labels using active learning [11], [26], [27], where a selective mechanism queries an expert for (target) class labels under a limited budget (i.e., a limited number of queries).
The use of active learning in domain adaptation obviates using source examples while building the target model, opening new research avenues in the field of transfer learning. Here we investigate a mechanism that generates a distribution of model complexity on the source domain, and re-utilizes such a distribution as a prior in a Bayesian setting over the target domain. The resulting point estimate over the posterior distribution of model complexity depends on the prior (source domain) and the likelihood, or evidence (target domain).
B. Model Complexity as a Transferable Item
We assume that optimal predictive models for both source and target domains share a similar degree of model complexity. For example, assuming both domains are best modeled using Support Vector Machines with a polynomial kernel parameterized by θ (corresponding to the degree of a polynomial), we can then further assume that P_s(θ*) ∼ P_t(θ*), where θ* is the polynomial degree that minimizes a loss function. Such an assumption focuses on the similarity of complexity-parameter distributions, and not on the similarity of joint input-output distributions.
Specifically, iteratively sampling and building predictive models on the source domain leads to a distribution of model parameters, P_s(θ). Our goal is then to estimate an optimal parameter value θ* that maximizes the posterior distribution on the target domain, P_t(θ|D), where D is the data or evidence (i.e., target sample T_t). By using the distribution gathered from the source domain as a reliable prior, we can formulate the posterior using Bayes' formula:

P_t(θ|D) = P_t(D|θ) P_s(θ) / Σ_i P_t(D|θ_i) P_s(θ_i)   (3)

where P_t(D|θ) is the likelihood, and P_s(θ) the prior. This is precisely how we propose to adapt a model across domains; assuming the complexity of the model built on the source domain is similar to that on the target domain, we look at the source prior P_s(θ) as the transferable item to be used in the new target domain.
Since the denominator in Eq. (3) is constant, we can simplify the equation as follows:

P_t(θ|D) = Z P_t(D|θ) P_s(θ) ∝ P_t(D|θ) P_s(θ)   (4)

where Z is a normalization factor. To optimize P_t(θ|D), we optimize the product of P_t(D|θ) and P_s(θ), and disregard the value of Z, as it is not a function of parameter θ.
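As a minimal sketch (not the authors' code), the MAP selection over Eq. (4) can be written as a grid search over complexity values. Here `best_thetas` stands for the per-sample optimal complexities gathered on the source, and `likelihoods` for the values of P_t(D|θ) on the target; both names, and the assumption that the prior is the Gaussian fitted later in this section, are ours:

```python
import numpy as np

def fit_source_prior(best_thetas):
    """Fit the Gaussian prior P_s(theta) to optimal complexities found
    on source-domain samples; returns (mu, sigma)."""
    t = np.asarray(best_thetas, dtype=float)
    return t.mean(), t.std()

def map_estimate(thetas, likelihoods, mu, sigma):
    """theta* = argmax_theta P_t(D|theta) * P_s(theta)  (Eq. 4, with Z dropped)."""
    thetas = np.asarray(thetas, dtype=float)
    prior = np.exp(-0.5 * ((thetas - mu) / sigma) ** 2)   # unnormalized Gaussian
    posterior = np.asarray(likelihoods) * prior
    return thetas[int(np.argmax(posterior))]
```

Note that normalizing the prior is unnecessary here for the same reason Z can be dropped: constants do not change the arg max.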
Hence, our goal is not to obtain a distribution for the posterior P_t(θ|D), but only to estimate the value of θ that maximizes the product of the likelihood and prior, a technique known as Maximum A Posteriori (MAP). We now explain how to compute the prior P_s(θ) and likelihood P_t(D|θ) to obtain a point estimate of model complexity on the target domain.

C. Estimating the Likelihood
Our approach to estimating the likelihood is as follows. We first use active learning to obtain a (small) labeled sample from the target domain. We then introduce a novel mechanism to compute P_t(D|θ) by mapping generalization error to a likelihood probability. We explain both steps next.
1) Active Learning:
To lessen the dependence on the source domain, we resort to active learning to produce an informative sample of labeled instances from the target domain. We use pool-based active learning with margin sampling [31] as the uncertainty sampling technique [32], [33]. Specifically, the algorithm randomly selects an initial set of instances from the unlabeled target dataset T_t and queries their class labels; it then iteratively builds a model f_t(x|θ) on the labeled target instances as follows. At every iteration, the algorithm identifies the instance x_i from the remaining unlabeled target instances with the minimum margin (i.e., minimum distance to the decision boundary), queries x_i to obtain class label y_i, and adds (x_i, y_i) to the set of labeled target instances. The process repeats until a budget (i.e., maximum number of allowed queries) is exhausted. The result is a labeled sample that will be used to compute the likelihood P(D|θ).
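The loop above can be sketched as follows; this is an illustrative implementation under our own assumptions (a linear classifier standing in for f_t(x|θ), and an `oracle` callback playing the role of the human expert), not the paper's exact code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def margin_sampling_al(X_pool, oracle, seed_idx, budget):
    """Pool-based active learning with margin sampling: repeatedly query the
    unlabeled instance closest to the current decision boundary.
    oracle(i) returns the true label of pool instance i (the expert query)."""
    labels = {i: oracle(i) for i in seed_idx}          # initial labeled set
    for _ in range(budget):
        idx = sorted(labels)
        clf = LogisticRegression().fit(X_pool[idx], [labels[i] for i in idx])
        unlabeled = [i for i in range(len(X_pool)) if i not in labels]
        if not unlabeled:
            break
        # margin for a linear model: |signed distance to the boundary|
        margins = np.abs(clf.decision_function(X_pool[unlabeled]))
        i_star = unlabeled[int(np.argmin(margins))]    # least-confident instance
        labels[i_star] = oracle(i_star)                # spend one query
    return labels
```

Each pass through the loop spends exactly one unit of the budget b, so the returned labeled set has |seed| + b instances (or fewer if the pool is exhausted).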
2) Mapping Error to a Likelihood:
In general, the likelihood P(D|θ) estimates the probability of seeing data D given parameter θ. This estimation is particularly complex in our study because θ is normally understood as a parameter of a probabilistic or generative model: P(D|θ) indicates how likely it is to obtain D from a probabilistic model parameterized by θ. In our case, θ is unconventionally defined as a (complexity) parameter of a predictive model f(x|θ) (e.g., the degree of a polynomial kernel). This is different from Bayesian estimation, where the output is a full posterior probability distribution over model parameters [28]–[30].
We contend that P(D|θ) can be re-interpreted as the probability of D given the empirical error of f(x|θ) on D. In general, the lower the empirical error, the higher the likelihood that model f(·) can reproduce the class labels contained in D. Our definition of empirical error on D refers to the error incurred on sample D alone, and is denoted as a function of θ and D, ĝ(θ|D). Empirical error is also known as in-sample error.
Our re-interpretation of the likelihood leads naturally to the assumption that the probability of D being classified correctly by a hypothesis f(x|θ) is inversely proportional to the error made by f(·) on D. We formulate this inverse relation assuming an exponential distribution:

P(D|θ) = λ exp(−λ ĝ(θ|D))   (5)

where λ is the rate parameter. This formulation simply states that the likelihood P(D|θ) decreases exponentially with error ĝ(θ|D), but it is clearly handicapped, as different values of complexity θ are mapped to the same likelihood as long as the empirical error ĝ(θ|D) is identical. However, under equal error rates, we would like to assign a higher likelihood to simpler models. We propose a solution to this next.

Fig. 1. Likelihood of the data given 1) empirical error ĝ(θ|D) and 2) a scaled version of error variance as a function of the VC-dimension d_VC(H). Equal values of ĝ(θ|D) do not map into the same likelihood.
3) Adding Robustness to the New Likelihood:
Our formulation of the likelihood as a function of empirical error (Eq. (5)) can be made more robust by taking into account the variance component of error induced by models that belong to families exhibiting high VC-dimension (Vapnik-Chervonenkis dimension [34]). In short, we suggest penalizing those scenarios where the VC-dimension is high. To start, we define g(θ|D) as the expected error across the whole input-output distribution:

g(θ|D) = ∫_X ∫_Y L(y, f(x|θ)) P(x, y) dx dy   (6)

where the loss L(y, f(x|θ)) is the zero-one loss function. g(θ|D) is also known as generalization error. Now, we know from the Vapnik-Chervonenkis inequality [35] that, with probability 1 − δ, an upper bound on g(θ|D) is given as follows:

g(θ|D) ≤ ĝ(θ|D) + sqrt( (8/N) ln( 4 m_H(2N) / δ ) )   (7)

where δ is user defined, N is the number of training instances, and m_H(q) is a polynomial function that defines the largest number of dichotomies on q training instances given the class of hypotheses H:

m_H(q) ≤ Σ_{i=0}^{d_VC(H)} q! / (i! (q − i)!)   (8)

where d_VC(H) is the VC-dimension [34], defined as the maximum number of examples that can be shattered by H [35], and depends on the complexity of the hypothesis (i.e., on θ). The VC-dimension of various classes of hypotheses is well known. For example, the VC-dimension d_VC(H) of neural networks with a sigmoid gate function has a lower bound of Ω(w log w) and an upper bound of O(w⁴) [36], where w is the number of weights in the network. In this example, parameter θ can be interpreted as the number of hidden nodes h in a feed-forward neural network (NN). The d_VC(H) of a NN with h hidden nodes can then be estimated using the lower bound w log w and defining w = (i + 1) × h + (h + 1) × o, where i and o are the number of input features and output classes respectively.
We now show how to strengthen our definition of the likelihood.
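Numerically, the quantities above (the weight count w, the d_VC estimate, and the confidence term of Eq. (7)) might be computed as in the sketch below, which also folds that term into the rate of the exponential likelihood of Eq. (5), anticipating the definition given next. The polynomial bound m(q) ≤ q^d_VC + 1, evaluated in log space to avoid overflow, and all function names are our assumptions:

```python
import math

def nn_weight_count(h, n_in, n_out):
    """w = (i + 1) x h + (h + 1) x o for a one-hidden-layer network."""
    return (n_in + 1) * h + (h + 1) * n_out

def vc_confidence(d_vc, N, delta):
    """sqrt((8/N) ln(4 m(2N)/delta)), using m(q) <= q^d_vc + 1 in log space."""
    log_m = d_vc * math.log(2 * N)          # ln(q^d_vc), dominant term of ln m(q)
    return math.sqrt((8.0 / N) * (math.log(4.0) + log_m - math.log(delta)))

def vc_likelihood(emp_err, h, n_in, n_out, N, alpha=0.01, delta=0.05):
    """Exponential likelihood with rate lam(theta) = alpha * VC confidence term."""
    w = nn_weight_count(h, n_in, n_out)
    d_vc = w * math.log(w)                  # lower-bound estimate w log w
    lam = alpha * vc_confidence(d_vc, N, delta)
    return lam * math.exp(-lam * emp_err)
```

With this rate, two architectures showing the same empirical error no longer receive the same likelihood, since λ now depends on the network's capacity.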
In essence, we keep the exponential distribution the same (Eq. (5)), but make parameter λ a function of model complexity θ, λ(θ):

P_t(D|θ) = λ(θ) exp(−λ(θ) ĝ(θ|D))   (9)

and define function λ(θ) as a scaled version of the second term of the Vapnik-Chervonenkis inequality (Eq. (7)):

λ(θ) = α sqrt( (8/N) ln( 4 m_θ(2N) / δ ) )   (10)

where α is a user-defined scale factor that decides how much weight is placed on the variance component of error. By transforming parameter λ into a function parameterized by θ, λ(θ), we achieve our goal of assigning higher likelihoods to simpler models when comparing hypotheses showing similar empirical error. We illustrate these concepts in Figure 1.
The ideas mentioned above have been tried using different strategies. Examples include a full Bayesian approach in transfer learning that finds a common subspace across tasks using a kernel-based dimensionality-reduction technique [37]; transferring priors across multiple tasks using a hierarchical Bayesian approach [38]; finding clusters of tasks under a Dirichlet process prior [39]; finding a Gaussian prior from previous tasks [40]; and theoretical studies using the PAC learning framework in a Bayesian setting [41]. All these methods are contingent on the proximity of source and target distributions, whereas our approach relies primarily on the similarity of model complexity.
Embedding the empirical error in an exponential function to compute the likelihood has been tried before [42], albeit without considering the capacity of each hypothesis. The novelty of our approach lies in transforming the likelihood function according to the VC-dimension of H.

D. Estimating the Prior
Regarding the prior distribution of θ (model complexity) on the source domain, we adopt a parametric model assuming a univariate Gaussian distribution,

P_s(θ) = (1 / (σ sqrt(2π))) exp( −(θ − µ)² / (2σ²) )

where µ and σ are the mean and standard deviation respectively. Specifically, our methodology generates k samples of the source dataset T_s using uniform random sampling without replacement; k is user-defined, and can be regarded as an experimental design parameter.
We construct classifiers on each of the k samples using a range of model complexity values, θ ∈ {θ_1, θ_2, θ_3, ..., θ_m}. For each sample S_i, we find a value θ*_i, 1 ≤ i ≤ k, that minimizes the expected loss (i.e., maximizes accuracy). The result is a sample of k optimal values of model complexity. Our estimate of the prior is finally obtained by fitting these values to the Gaussian model.

E. Estimating the Posterior
Once we have estimated the prior P_s(θ) and likelihood P_t(D|θ), we can estimate the numerator of the posterior distribution (Eq. (4)): P_t(θ|D) = Z P_t(D|θ) P_s(θ) ∝ P_t(D|θ) P_s(θ). Since we are interested in obtaining a point estimate for the posterior, we look for an optimal value θ* = arg max_θ P_t(D|θ) P_s(θ).
To reduce the space of complexity values during optimization, we limit the values of θ to the range [µ − σ, µ + σ], where µ and σ are the mean and standard deviation of the source prior distribution P_s(θ). The final value θ* is used to build a classifier f_t(x|θ*) on the target domain using the queried instances as our training dataset.
Our methodology is outlined in Algorithm 1. The algorithm takes as input the labeled source dataset T_s, the unlabeled target dataset T_t, the size r of a small labeled sample to generate a model on the target, and a budget b of possible queries to obtain additional labeled instances on the target. The first step is to build a prior distribution P_s(θ) ∼ N(µ, σ) on the source domain by exhaustively looking for an optimal figure of model complexity; this step can be computationally expensive, but can save substantial amounts of time when it is re-used on a future (target) task. The next steps compute the likelihood P_t(D|θ_i) in an iterative manner, using labeled instances from the target obtained through active learning. The search is made narrow by limiting values of model complexity to just one standard deviation away from the source mean. The last steps build a (proportional) posterior distribution, and a predictive model using our optimal point estimate for model complexity.
Algorithm 1: Model Complexity Estimation Using Domain Adaptation and Active Learning

Input: Source dataset T_s, target dataset T_t, budget b, initial sample size r.
Output: Predictive target model f_t(x|θ*).

Estimate prior P_s(θ) ∼ N(µ, σ) using source dataset T_s
Set θ_min = µ − σ and θ_max = µ + σ
Use the small set of r labeled instances from T_t to build model f_t(x|θ)
Use f_t(x|θ) and active learning to label b target instances from T_t
for θ_i ← θ_min to θ_max do
  Build model f_t^i(x|θ_i) with θ_i as model complexity
  Compute ĝ(θ_i|D) and λ(θ_i) to estimate likelihood P_t(D|θ_i)
  Estimate (proportional) posterior: P_t(D|θ_i) P_s(θ_i)
end for
Let θ* = arg max_{θ_i} P_t(D|θ_i) P_s(θ_i)
Build f_t(x|θ*)
return f_t(x|θ*)

IV. EXPERIMENTS
We describe our experiments in detail next. All our code and datasets have been made available for reproducibility as a GitHub project at https://github.com/PAL-UH/transferAL.
We report empirical results in two different scientific areas to validate our methodology. The first area refers to the automatic classification of supernovae using photometric light curves. The second area is centered on the classification of landforms on planet Mars using digital elevation maps.

A. Supernova Datasets
The automatic identification of Supernovae Ia (SNe Ia) has become a key step in many astronomical endeavors [43], [44]. Among different types of supernovae, SNe Ia are of particular relevance because they can be used as standard candles to probe large cosmological distances. The classification goal here is to identify SNe Ia (positive class) among other types (SNe Ib and Ic, negative class).
When analyzing light from a supernova, one can exploit spectroscopic measurements to take advantage of the wealth of information that can be obtained from spectral data. Such an approach, however, is laborious and cumbersome. Another, more common, approach is to exploit photometric measurements; these are easier to obtain, but limited to a summarization of light intensity in bands or filters. The domain adaptation framework fits this scenario as follows [45]: the source dataset corresponds to spectroscopic measurements where class labels (SNe Ia, Ib and Ic) are known with high confidence (but data is scarce), whereas the target domain corresponds to photometric measurements where class labels are missing (but data abounds).
Our experiments use simulations to generate samples that resemble the type of measurements expected when using spectroscopic or photometric observations. This brings the advantage of having samples with the same set of features (i.e., the same input space X). Specifically, we use data from the Supernova Photometric Classification Challenge [46], consisting of supernova light curves simulated according to Dark Energy Survey specifications using the SNANA light-curve simulator [47]. The data comes from simulations that approximate the characteristics of the Dark Energy Survey (DES).
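In these simulations, light curves are irregularly sampled in time; the per-filter Gaussian-process fitting and daily resampling described next might be sketched with scikit-learn as follows (the kernel choice and function names are our assumptions, not the paper's exact pipeline):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def resample_light_curve(epochs, flux, grid):
    """Fit one filter's light curve with GP regression and resample it
    on a regular grid (one sample per day)."""
    kernel = 1.0 * RBF(length_scale=10.0) + WhiteKernel(noise_level=1.0)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(np.asarray(epochs, dtype=float).reshape(-1, 1),
           np.asarray(flux, dtype=float))
    return gp.predict(np.asarray(grid, dtype=float).reshape(-1, 1))

# e.g. one sample per day over a window around maximum brightness:
# grid = np.arange(-3, 25)
```

Resampling every filter on the same daily grid is what yields fixed-length feature vectors across objects, regardless of each light curve's original sampling.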
Simulations include both spectroscopic (source) and photometric (target) samples; these are created with the biases found in true datasets, where spectroscopic data are in general smaller, brighter, closer, and less noisy than photometric data.
Regarding the construction of simulated samples, we follow the data processing steps specified in [47]. We only take objects with a minimum of three observed epochs per filter; at least one of them occurs before −3 days and at least one after +24 days since maximum brightness. On each filter, we use Gaussian process regression to do light-curve model fitting [48]; the resulting function is sampled using a window of size one day. No quality cuts on SNR are imposed. At the end, the spectroscopic (source) sample has 718 SNe, while the photometric (target) sample has 11946 SNe. Both samples have 108 columns or features (27 epochs × 4 filters).

B. Mars Landforms
The second area corresponds to the automatic geomorphic mapping of planetary surfaces. Specifically, the goal is to label segments of specific regions on planet Mars with their corresponding landforms [50]–[52]. The input data comes in the form of raster data or digital elevation models (DEMs) produced by orbiting satellites (the Mars Orbiter Laser Altimeter instrument on board the Mars Global Surveyor spacecraft). Each DEM is first subdivided into meaningful segments (groups of adjacent pixels with similar terrain properties) that are subsequently classified into the following landforms: crater floors, convex crater walls, concave crater walls, convex ridges, concave ridges, and inter-crater plateau. Domain adaptation is important to attain accurate predictive models on new target sites that exhibit a different distribution from the original source site.
Figure 2 illustrates the sequence of steps needed to classify landforms on Mars using DEMs. The original map (A) is first divided into small segments amenable to labeling (B). A model is then trained on a fraction of all segments and applied to the rest. Models of different complexity yield different classifications (C-E).
Fig. 2. DEMs on Mars are processed and classified into different landforms. The original DEM (A), corresponding to a site known as Tissia Valles, is segmented for labeling processes (red-to-blue gradient indicates high-to-low elevation). (B) The DEM is then classified using models of different complexity (C-E; color labels explained on bottom-right). This site, Tissia Valles, acts as the source domain.
The source domain is a region on Mars where we know the labels for all landforms; it is shown in Figure 2(A) and is known as Tissia Valles. It was chosen primarily because, in a relatively small area, most landforms of interest are present. The region is heavily cratered, and many different crater morphologies are present in a range of sizes. The goal here is to leverage experience gained during the model-building process on Tissia Valles to find the right complexity for a model induced on a new site on Mars, corresponding to the target domain, where the landform distribution is different. In our experiments, the target site corresponds to a region known as Evos, shown in Figure 3. Notice the difference in distribution, where the shape and number of craters differ significantly from Tissia Valles.
Fig. 3. A digital elevation map (DEM) of a region on Mars known as Evos, corresponding to our target domain (red-to-blue gradient indicates high-to-low elevation).
The input to the learning algorithm is not the DEM, but a training set made of feature vectors, one for each segment in the map. A segment is made of adjacent pixels with similar feature values. Each segment is characterized by three features, which are averages over the pixels contained in the segment: slope, computed as the maximum rate of change in elevation from a cell to its neighbors (indicative of the steepness of the terrain); curvature, computed as the second derivative of the surface elevation, useful to distinguish between convex (e.g., ridge) and concave (e.g., channel) surfaces; and flow, computed as the degree of flow accumulated on each cell (high values are indicative of stream or river channels).
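A rough sketch of how the per-segment slope and curvature features might be computed from a DEM grid is given below; this is our illustration, not the paper's pipeline (flow accumulation requires a hydrological routing step and is omitted, and all names are ours):

```python
import numpy as np

def terrain_maps(dem, cell=1.0):
    """Per-pixel slope and curvature from a DEM (2-D elevation array)."""
    gy, gx = np.gradient(dem, cell)          # first derivatives of elevation
    slope = np.hypot(gx, gy)                 # rate of elevation change (steepness)
    gyy, _ = np.gradient(gy, cell)
    _, gxx = np.gradient(gx, cell)
    curvature = gxx + gyy                    # second derivative (convex vs concave)
    return slope, curvature

def segment_features(dem, mask):
    """Average the per-pixel maps over one segment (boolean mask of its pixels)."""
    slope, curv = terrain_maps(dem)
    return np.array([slope[mask].mean(), curv[mask].mean()])
```

On a uniformly tilted plane, for example, this yields a constant slope and zero curvature for every segment, as expected.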
C. Experimental Settings
To estimate the prior P_s(θ), we create k = 100 bootstrap samples of the source dataset T_s under uniform random sampling without replacement. We then record the θ∗ that yields the highest accuracy on each sample. In our experiments, θ corresponds to the number of hidden nodes in a neural network, θ ∈ [2, ]. For active learning, we divide dataset T_t randomly into two (equal) parts: a pool of target instances from which data can be queried, and a pool of test instances that remains unknown during training. We then randomly generate 10 pairs of training and testing pools. We first limit our active-learning budget to b = 100 queries, and subsequently study how accuracy varies with different budgets. Regarding the posterior, we calculate the value of θ∗ that maximizes the product of prior and likelihood. We then use θ∗ to build an optimal model f(x|θ∗) on the target pool, and subsequently test it on the test pool. In both domains, we have perfect knowledge of class labels (for both source and target samples); class labels are hidden on the target sample to validate model performance.
Our hardware is a 3712-core computer cluster with 8 Tesla C2075 GPUs, 22 GTX 570 GPUs, a 120TB Lustre filesystem, and 127TB of storage space.

D. Methods for Comparison
For comparison purposes, our experiments include other domain adaptation techniques, which we list and describe next.

Subspace Alignment [22]. The goal is to find separate subspaces for the source and target domains using Principal Component Analysis, followed by a linear transformation that maps the source domain into the target domain. The result is an alignment of both spaces through the basis vectors. The number of principal components is optimized based on a theoretical bound.

Joint Distribution Optimal Transportation (JDOT) for Domain Adaptation [53]. The technique assumes a map that aligns the joint distributions (X × Y) of the source and target domains. The optimization function combines both the distance between samples and the discrepancy in the loss between class labels.

Adaptation Regularization based Transfer Learning (ARTL) [54]. The central idea is to combine different strategies for transfer learning within a single framework: it simultaneously optimizes the structural risk functional over the source domain, the joint distribution matching of both marginal and conditional distributions, and the consistency of the geometric manifold corresponding to the marginal distribution.

Transfer Joint Matching (TJM) [55]. The technique combines a shared representation between source and target domains with the concept of instance re-weighting, where source instances that fall within high-density regions of the target domain see their weight increased.

Transfer Feature Learning with Joint Distribution Adaptation (JDA) [56]. The technique jointly adapts both marginal and class-conditional probabilities using Principal Component Analysis.

Domain-Adversarial Training of Neural Networks (DANN) [57]. This is a neural network framework that implements domain adaptation by finding features that provide low error on the training set, while the features remain invariant across the two domains (i.e., across source and target domains). The architecture combines two learners that play in an adversarial manner: while one adjusts parameters to reduce training error, the other adjusts parameters to discriminate (increase error) between source and target examples. The result is a regularized deep neural network that generates an informative abstract feature representation.

Geodesic Flow Kernel (GFK) for Unsupervised Domain Adaptation [58]. This technique integrates an infinite number of subspaces between source and target domains by paying attention to geometric and statistical properties of both domains. The technique focuses on those subspaces that are domain-invariant.
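To make the first of these baselines concrete: the core of subspace alignment fits PCA bases Ps and Pt separately for the two domains and then maps the source subspace onto the target one with the linear transform M = Ps^T Pt. The sketch below is a minimal NumPy rendition of that step; the published method additionally selects the subspace dimension d via a theoretical bound, which is omitted here.

```python
import numpy as np

def pca_basis(X, d):
    """Top-d principal directions (as columns) of the centered data matrix X."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:d].T                                  # shape (n_features, d)

def subspace_alignment(Xs, Xt, d):
    """Align the source PCA subspace to the target one (core step of [22])."""
    Ps, Pt = pca_basis(Xs, d), pca_basis(Xt, d)
    M = Ps.T @ Pt                                    # linear alignment of the bases
    Xs_aligned = (Xs - Xs.mean(axis=0)) @ Ps @ M     # source data in aligned coordinates
    Xt_proj = (Xt - Xt.mean(axis=0)) @ Pt            # target data in its own subspace
    return Xs_aligned, Xt_proj
```

A classifier (e.g., 1-NN) is then trained on Xs_aligned with the source labels and applied to Xt_proj. When source and target coincide, M reduces to the identity and both projections agree.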
E. Results
Prior. After estimating the sampling distribution for the optimal parameter θ∗ on the source domain, our experiments show the following results. For Supernova: µ = 33. and σ = 9.; for Mars landforms: µ = 23. and σ = 12.. The range of values for the prior is set to θ∗ ∈ [24, ] for Supernova and θ∗ ∈ [10, ] for Mars landforms.
As an illustration, Figures 4 and 5 show the histogram and corresponding univariate Gaussian approximation of optimal values of model complexity for the Supernova domain. Model complexity is measured in terms of the number of hidden nodes in a neural network. Approximating the prior distribution helps to narrow the search of values for the posterior, yielding substantial savings in computational cost (as shown in the following section).

Accuracy. Table I shows average accuracy comparing our approach with two blocks of techniques. One block, labeled "Domain Adaptation", corresponds to the techniques described in Section IV-D (except for the first technique, labeled "Source Model", which simply builds a model on the source domain and applies it directly to the target domain). The other block, labeled "Source Model + Active Learning", contains results for methods using active learning, with the initial model built on the source dataset. Our technique (Bayesian DA or
TABLE I
CLASSIFICATION ACCURACY (NUMBERS ENCLOSED IN PARENTHESES REPRESENT STANDARD DEVIATIONS).

General Method                   Learning technique    Supernova Ia    Mars Landforms
Domain Adaptation                Source Model          .13 (0.00)      74.36 (9. )
                                 Subspace Alignment    .56 (7.98)      85.16 (2. )
                                 JDOT SVM              .57 (0.13)      85.  ( . )
                                 JDOT NN               .05 (0.08)      80.96 (0. )
                                 DANN                  .   ( .3)       88.61 (0. )
                                 TJM                   .56 (0.01)      82.28 (0. )
                                 JDA                   .64 (0.03)      80.40 (0. )
                                 ARTL                  .21 (0.01)      88.12 (0. )
                                 GFK                   .98 (0.02)      83.56 (0. )
Source Model + Active Learning   NN + AL               .75 (0.04)      80.41 (0. )
                                 SVM + AL              .33 (0.17)      85.90 (0. )
                                 LR + AL               .70 (0.03)      85.18 (0. )
Bayesian DA                      NN-DA-AL              .   ( . )       .   ( . )

Fig. 4. Histogram of optimal values of model complexity for the Supernova domain.
Fig. 5. Gaussian approximation to the (prior) distribution of optimal model complexity values for the Supernova domain.
NN-DA-AL) is shown as the last row of Table I. It combines domain adaptation with active learning using a neural network architecture (budget b = 100 queries; the initial labeled pool is of size r = 10).

For the first block, results show a significant increase in accuracy with our approach. For a statistical test, we use Welch's t-test at the p = 0. level. We also perform a multiple-comparison test by adjusting the statistical test using a Bonferroni adjustment [59]; results remain significant at the p = 0. level after the adjustment; this is true on both astronomical problems. These results show that using a posterior distribution of model complexity yields better classification accuracy on the target dataset than using the best prior. Results also show a major limitation of domain adaptation techniques founded on the assumption of the existence of feature-invariant subspaces between source and target domains. Real-world applications either do not guarantee the existence of such subspaces, or exhibit a complex subspace landscape where finding common subspaces turns out to be difficult. Under the general assumption where both marginal and posterior probabilities between source and target domains differ (Ps(x) ≠ Pt(x) and Ps(y|x) ≠ Pt(y|x)), a better strategy is to sample directly from the target domain under a framework that limits the cost of class queries. If the posterior class probability on the target domain follows a smooth distribution, a limited number of queries should suffice to attain an accurate predictive model.

The second block shows average accuracy with techniques that use active learning, where the initial model is built on the source domain.
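To make the selection step concrete, the sketch below discretizes θ, places the source-derived Gaussian prior over it, and scores each candidate with a likelihood that decays with empirical error on the labeled target pool. The exponential form exp(-m · err) is only a stand-in for our bound-based likelihood derivation, and all names are illustrative.

```python
import numpy as np

def gaussian_prior(theta_stars):
    """Fit the prior from optimal complexities found on source bootstrap samples."""
    theta_stars = np.asarray(theta_stars, dtype=float)
    return theta_stars.mean(), theta_stars.std()

def map_complexity(thetas, target_err, theta_stars, m):
    """MAP choice of model complexity: argmax of prior(theta) * likelihood(theta).

    thetas     : candidate complexities (e.g., hidden-node counts)
    target_err : empirical error of each candidate on the labeled target pool
    m          : size of that pool (controls how much the evidence dominates)
    """
    mu, sigma = gaussian_prior(theta_stars)
    log_prior = -0.5 * ((np.asarray(thetas, float) - mu) / sigma) ** 2
    log_like = -m * np.asarray(target_err, float)    # stand-in likelihood
    return np.asarray(thetas)[np.argmax(log_prior + log_like)]
```

With few target labels (small m) the prior dominates and the choice stays near the source optimum; as the budget grows, the target evidence takes over.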
We report on a neural network (NN) with logistic activation units and 25 hidden nodes, Support Vector Machines (SVM) with a radial basis function kernel, and Logistic Regression (LR), all with budget b = 100 and r = 10. We applied the same statistical test (Welch's t-test at the p = 0. level with a Bonferroni adjustment). Our approach is significantly more accurate in all cases. This is true even with the use of a plain neural network, where there is no search for optimal complexity parameters (e.g., number of hidden units) and no source task to guide such a search. Our method is also more efficient, as finding the best model complexity from scratch on the target domain requires a new exhaustive search for an optimal value.

The next experiment tests the impact of active learning within our approach as the budget is increased. Figure 6 shows results for the Supernova task, and Figure 7 shows results for the Mars landforms task. For the Supernova task, there is a significant increase in accuracy as the budget grows. This is expected, as a large labeled sample on the target set provides enough evidence to generate accurate predictive models. Results tend to converge at around 2,000 instances. For the Mars landforms task, results tend to converge after only about 100 instances. In practical real-world scenarios, such results can be used to set a trade-off between the size of the budget and the cost of labeling new target instances. In the Supernova domain, for example, labeling new instances is extremely expensive as it involves running a full spectroscopic analysis; in such a case, a lower budget may be preferred at the cost of some accuracy loss. The opposite is true on the Mars domain, where the cost of labeling segments with their correct landforms is relatively cheap, thus allowing for a budget increase.

Fig. 6. Accuracy on Supernova improves significantly with increasing budget, and tends to converge after about 2,000 queries.
Fig. 7. Accuracy on Mars Landforms improves significantly with a budget of about 100 queries, and tends to converge after that threshold.
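The budget experiments above follow the usual pool-based loop: train on the current labeled set, query the label of the most uncertain unlabeled instance, and repeat until the budget b is exhausted. The sketch below uses margin-based uncertainty with a nearest-centroid classifier as a lightweight stand-in for the neural network used in our experiments; the function names and classifier choice are illustrative.

```python
import numpy as np

def fit_centroids(X, y):
    """Nearest-centroid 'model': one mean vector per class."""
    classes = np.unique(y)
    return classes, np.stack([X[y == c].mean(axis=0) for c in classes])

def margins(X, centroids):
    """Uncertainty score: gap between the two closest centroids (small = uncertain)."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    d.sort(axis=1)
    return d[:, 1] - d[:, 0]

def active_learn(X_pool, y_pool, X_init, y_init, budget):
    """Pool-based active learning: query the most uncertain instance b times."""
    Xl, yl = X_init.copy(), y_init.copy()
    unlabeled = list(range(len(X_pool)))
    for _ in range(budget):
        _, cents = fit_centroids(Xl, yl)
        pick = unlabeled[int(np.argmin(margins(X_pool[unlabeled], cents)))]
        unlabeled.remove(pick)                 # oracle reveals y_pool[pick]
        Xl = np.vstack([Xl, X_pool[pick]])
        yl = np.append(yl, y_pool[pick])
    return fit_centroids(Xl, yl)

def predict(X, classes, centroids):
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(d, axis=1)]
```

Increasing the budget trades labeling cost for accuracy, which is exactly the knob explored in Figures 6 and 7.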
Execution Time. A final experiment assesses the gains in computational cost of our approach over the common approach that finds a shared subspace to match source and target distributions. Experiments follow the assumption that domain adaptation generates a prior distribution in the past, and hence does not incur any additional time during the current target task; in addition, the search for an optimal value of model complexity on the target is limited to one standard deviation around the optimal value found on the source prior. The model built on the source domain incurs additional computational cost by searching for a common space over the two domains. We invoke Subspace Alignment as a representative case of feature-matching techniques. For the Supernova task, execution time is reduced from about  hrs to under  hours. For the Mars landforms task, execution time goes from about  hrs to about  minutes. Results show the advantage of generating a prior distribution of model complexity on the source domain that is readily available on a new target task: it obviates an exhaustive search for an optimal parameter value.

V. SUMMARY AND CONCLUSIONS
We propose a new direction in domain adaptation using a Maximum A Posteriori approach, where the prior distribution is obtained from a source task (previous experience), whereas the likelihood is obtained from the target (or current) task. Our methodology invokes active learning to compensate for the lack of (target) class labels, leaving the budget size as an experimental parameter. Our study leads to a new formulation of the likelihood as a function of empirical error and a term that depends on model complexity as estimated by the Vapnik-Chervonenkis dimension. Overall, our technique broadens the general applicability of domain adaptation by relaxing the stringent requirement of close proximity between source and target distributions.

Empirical results on two astronomical problems show a significant advantage in computational cost, as the range of complexity values on the target domain is limited to a small window; this is the result of using a prior distribution over the complexity parameter derived from the source domain. In terms of accuracy, results show a significant increase in performance with our approach; this holds for both astronomical domains. Our experiments also show a trade-off between budget size and the cost of labeling; in cases where labeling is relatively cheap, one can increase the budget to achieve an increase in accuracy performance.

As future work, we will investigate how to extend our work to cases where multiple source domains are available. One possibility is to simply choose the best prior based on domain knowledge, or through a ranking system that orders all source tasks based on spatial or temporal proximity to the astronomical event of interest. Another direction is to combine all priors by assigning a degree of relevance to each source task.
The posterior distribution can then be defined as a weighted combination of all available priors.

Additionally, we hope to stimulate the astronomical community to consider domain adaptation as a useful resource when analyzing different surveys of similar objects. For example, while providing class labels for transient objects or events contained in one single survey is still feasible (even though costly), the ability to label variable sources across the large number of available surveys is almost non-existent. The goal of acquiring predictive models from many surveys is a daunting task. It can be tackled by creating predictive models that adapt across the datasets under analysis using domain adaptation techniques. The need for domain adaptation lies in the distributional discrepancy between source and target domains.

ACKNOWLEDGMENTS
This work was partly supported by the Center for Advanced Computing and Data Systems (CACDS), and by the Texas Institute for Measurement, Evaluation, and Statistics (TIMES) at the University of Houston.

REFERENCES
[1] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira, "Analysis of representations for domain adaptation," in NIPS, B. Schölkopf, J. Platt, and T. Hofmann, Eds. MIT Press, 2006, pp. 137-144.
[2] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, and F. Pereira, "A theory of learning from different domains," Machine Learning, no. 79, pp. 151-175, 2010.
[3] J. Blitzer, R. McDonald, and F. Pereira, "Domain adaptation with structural correspondence learning," in Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, ACL, 2006, pp. 120-128.
[4] J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman, "Learning bounds for domain adaptation," in Advances in Neural Information Processing Systems, NIPS, 2007, pp. 129-136.
[5] H. Daume and D. Marcu, "Domain adaptation for statistical classifiers," Journal of Machine Learning Research, no. 26, pp. 101-126, 2006.
[6] Y. Mansour, M. Mohri, and A. Rostamizadeh, "Domain adaptation: Learning bounds and algorithms," in Proceedings of the 22nd Conference on Learning Theory, COLT, 2009.
[7] D. Hal, "Frustratingly easy domain adaptation," arXiv preprint arXiv:0907.1815, 2009.
[8] L. Bruzzone and M. Marconcini, "Domain adaptation problems: a DASVM classification technique and a circular validation strategy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 5, pp. 770-787, 2010.
[9] M. Brescia and G. Longo, "Astroinformatics, data mining and the future of astronomical research," Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, vol. 720, pp. 92-94, 2013.
[10] E. Feigelson, "The changing landscape of astrostatistics and astroinformatics," in Astroinformatics: Proceedings of the International Astronomical Union, Symposium No. 325, 2017.
[11] B. Settles, Active Learning. Morgan & Claypool, 2012.
[12] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345-1359, 2010.
[13] B. Liu, M. Huang, J. Sun, and X. Zhu, "Incorporating domain and sentiment supervision in representation learning for domain adaptation," in Proceedings of the 24th International Conference on Artificial Intelligence, ser. IJCAI'15. AAAI Press, 2015, pp. 1277-1283.
[14] J. Xu, S. Ramos, D. Vázquez, and A. M. López, "Hierarchical adaptive structural SVM for domain adaptation," CoRR, vol. abs/1408.5400, 2014.
[15] J. Quinonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence, Dataset Shift in Machine Learning. The MIT Press, 2009.
[16] H. Shimodaira, "Improving predictive inference under covariate shift by weighting the log-likelihood function," Journal of Statistical Planning and Inference, vol. 90, no. 2, pp. 227-244, Oct. 2000.
[17] T. Kanamori, S. Hido, and M. Sugiyama, "A least-squares approach to direct importance estimation," J. Mach. Learn. Res., vol. 10, pp. 1391-1445, Dec. 2009.
[18] M. Sugiyama, S. Nakajima, H. Kashima, P. V. Buenau, and M. Kawanabe, "Direct importance estimation with model selection and its application to covariate shift adaptation," in Advances in Neural Information Processing Systems, 2008, pp. 1433-1440.
[19] S. Bickel, M. Brückner, and T. Scheffer, "Discriminative learning under covariate shift," J. Mach. Learn. Res., vol. 10, pp. 2137-2155, Dec. 2009.
[20] M. Sugiyama, S. Nakajima, H. Kashima, P. V. Buenau, and M. Kawanabe, "Direct importance estimation with model selection and its application to covariate shift adaptation," in Advances in Neural Information Processing Systems, NIPS, 2008, pp. 1433-1440.
[21] A. Gretton, A. Smola, J. Huang, M. Schmittfull, K. Borgwardt, and B. Schölkopf, "Covariate shift by kernel mean matching," Dataset Shift in Machine Learning, vol. 3, no. 4, p. 5, 2009.
[22] F. Basura, A. Habrard, M. Sebban, and T. Tuytelaars, "Unsupervised visual domain adaptation using subspace alignment," in Proceedings of the IEEE International Conference on Computer Vision, ICCV, 2013, pp. 2960-2967.
[23] R. K. Ando and T. Zhang, "A framework for learning predictive structures from multiple tasks and unlabeled data," Journal of Machine Learning Research, vol. 6, pp. 1817-1853, 2005.
[24] X. Glorot, A. Bordes, and Y. Bengio, "Domain adaptation for large-scale sentiment classification: a deep learning approach," in Proceedings of the 28th International Conference on Machine Learning (ICML), 2011, pp. 513-520.
[25] A. Kumar, A. Saha, and H. Daume, "Co-regularization based semi-supervised domain adaptation," in Advances in Neural Information Processing Systems 23, J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, Eds. Curran Associates, Inc., 2010, pp. 478-486.
[26] M.-F. Balcan, A. Beygelzimer, and J. Langford, "Agnostic active learning," Journal of Computer and System Sciences, vol. 75, no. 1, pp. 78-89, 2009.
[27] D. A. Cohn, Z. Ghahramani, and M. I. Jordan, "Active learning with statistical models," Journal of Artificial Intelligence Research, vol. 4, no. 1, pp. 129-145, 1996.
[28] W. Bolstad, Introduction to Bayesian Statistics, 2nd ed. Wiley-Interscience, 2007.
[29] S. Goodman, "Introduction to Bayesian methods I: measuring the strength of evidence," Clinical Trials, vol. 2, pp. 281-290, 2005.
[30] T. Louis, "Introduction to Bayesian methods II: fundamental concepts," Clinical Trials, vol. 2, pp. 291-294, 2005.
[31] T. Scheffer, C. Decomain, and S. Wrobel, "Active hidden Markov models for information extraction," in International Symposium on Intelligent Data Analysis. Springer, 2001, pp. 309-318.
[32] D. D. Lewis and W. A. Gale, "A sequential algorithm for training text classifiers," in Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1994, pp. 3-12.
[33] D. D. Lewis and J. Catlett, "Heterogeneous uncertainty sampling for supervised learning," in Proceedings of the Eleventh International Conference on Machine Learning, ICML, 1994, pp. 148-156.
[34] D. Haussler, M. Kearns, and R. Schapire, "Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension," in Proceedings of the Fourth Annual Workshop on Computational Learning Theory, ser. COLT '91. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1991, pp. 61-74.
[35] Y. S. Abu-Mostafa, M. Magdon-Ismail, and H.-T. Lin, Learning from Data. AMLBook, 2012, vol. 4.
[36] W. Maass, "Vapnik-Chervonenkis dimension of neural nets," The Handbook of Brain Theory and Neural Networks, pp. 1000-1003, 1995.
[37] M. Gönen and A. A. Margolin, "Kernelized Bayesian transfer learning," in Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, ser. AAAI'14. AAAI Press, 2014, pp. 1831-1839.
[38] J. R. Finkel and C. D. Manning, "Hierarchical Bayesian domain adaptation," in Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, ser. NAACL '09. Stroudsburg, PA, USA: Association for Computational Linguistics, 2009, pp. 602-610.
[39] D. M. Roy and L. P. Kaelbling, "Efficient Bayesian task-level transfer learning," in Proceedings of the 20th International Joint Conference on Artificial Intelligence, ser. IJCAI'07. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2007, pp. 2599-2604.
[40] R. Raina, A. Y. Ng, and D. Koller, "Constructing informative priors using transfer learning," in Proceedings of the 23rd International Conference on Machine Learning, ser. ICML '06. New York, NY, USA: ACM, 2006, pp. 713-720.
[41] P. Germain, A. Habrard, F. Laviolette, and E. Morvant, "A new PAC-Bayesian perspective on domain adaptation," in Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ser. ICML'16. JMLR.org, 2016, pp. 859-868.
[42] P. Germain, F. Bach, A. Lacoste, and S. Lacoste-Julien, "PAC-Bayesian theory meets Bayesian inference," ArXiv e-prints, May 2016.
[43] S. Blondin, T. Matheson, R. P. Kirshner, K. S. Mandel, P. Berlind, M. Calkins, P. Challis, P. M. Garnavich, S. W. Jha, M. Modjaz, A. G. Riess, and B. P. Schmidt, "The spectroscopic diversity of type Ia supernovae," The Astronomical Journal, vol. 143, no. 5, p. 126, 2012.
[44] M. Sasdelli, E. E. O. Ishida, R. Vilalta, M. Aguena, V. C. Busti, H. Camacho, A. M. M. Trindade, F. Gieseke, R. S. de Souza, Y. T. Fantaye, and P. A. Mazzali, "Exploring the spectroscopic diversity of Type Ia supernovae with DRACULA: a machine learning approach," Monthly Notices of the Royal Astronomical Society, vol. 461, pp. 2044-2059, Sep. 2016.
[45] D. G. K., P. R., V. R., I. E. E. O., and de Souza R. S., "Automated supernova Ia classification using adaptive learning techniques," in Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining, ser. CIDM '16. Morgan Kaufmann Publishers Inc., 2016, pp. 61-74.
[46] R. Kessler, A. Conley, S. Jha, and S. Kuhlmann, "Supernova photometric classification challenge," arXiv:1001.5210, 2010.
[47] E. Ishida and R. S. de Souza, "Kernel PCA for type Ia supernovae photometric classification," Monthly Notices of the Royal Astronomical Society, vol. 430, no. 1, pp. 509-532, 2013.
[48] M. Chilenski, M. Greenwald, Y. Marzouk, N. Howard, A. White, J. Rice, and J. Walk, "Improved profile fitting and quantification of uncertainty in experimental measurements of impurity transport coefficients using Gaussian process regression," Nuclear Fusion, vol. 55, no. 2, 2015. [Online]. Available: http://stacks.iop.org/0029-5515/55/i=2/a=023012
[49] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, ser. Springer Series in Statistics. New York, NY, USA: Springer New York Inc., 2001.
[50] B. Bue and T. Stepinski, "Automated classification of landforms on Mars," Computers & Geosciences, vol. 32, no. 5, pp. 604-614, 2006.
[51] T. F. Stepinski, S. Ghosh, and R. Vilalta, "Automatic recognition of landforms on Mars using terrain segmentation and classification," in Proceedings of the International Conference on Discovery Science, LNAI 4265, 2006, pp. 255-266.
[52] T. Stepinski and R. Vilalta, "Digital topography models for Martian surfaces," IEEE Geoscience and Remote Sensing Letters, vol. 2, pp. 260-264, Jul. 2005.
[53] N. Courty, R. Flamary, A. Habrard, and A. Rakotomamonjy, "Joint distribution optimal transportation for domain adaptation," in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 3730-3739.
[54] M. Long, J. Wang, G. Ding, S. J. Pan, and P. S. Yu, "Adaptation regularization: A general framework for transfer learning," IEEE Transactions on Knowledge & Data Engineering, vol. 26, no. 5, pp. 1076-1089, May 2014.
[55] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu, "Transfer joint matching for unsupervised domain adaptation," in Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, ser. CVPR '14. Washington, DC, USA: IEEE Computer Society, 2014, pp. 1410-1417.
[56] ——, "Transfer feature learning with joint distribution adaptation," in The IEEE International Conference on Computer Vision (ICCV), December 2013.
[57] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, "Domain-adversarial training of neural networks," Journal of Machine Learning Research, vol. 17, no. 1, pp. 2096-2030, Jan. 2016.
[58] K. Grauman, "Geodesic flow kernel for unsupervised domain adaptation," in Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), ser. CVPR '12. IEEE Computer Society, 2012, pp. 2066-2073.
[59] D. D. Jensen and P. R. Cohen, "Multiple comparisons in induction algorithms,"