SRoll3: A neural network approach to reduce large-scale systematic effects in the Planck High Frequency Instrument maps
Astronomy & Astrophysics manuscript no. SRoll3, © ESO 2020, December 18, 2020
M. Lopez-Radcenco∗, J.-M. Delouis, and L. Vibert
Université Paris-Saclay, CNRS, Institut d'Astrophysique Spatiale, 91405 Orsay, France
Laboratoire d'Océanographie Physique et Spatiale, CNRS, 29238 Plouzané, France
ABSTRACT
In the present work, we propose a neural network based data inversion approach to reduce structured contamination sources, with a particular focus on the mapmaking for Planck High Frequency Instrument data and the removal of large-scale systematic effects within the produced sky maps. The removal of contamination sources is rendered possible by the structured nature of these sources, which is characterized by local spatiotemporal interactions producing couplings between different spatiotemporal scales. We focus on exploring neural networks as a means of exploiting these couplings to learn optimal low-dimensional representations, optimized with respect to the contamination source removal and mapmaking objectives, to achieve robust and effective data inversion. We develop multiple variants of the proposed approach, and consider the inclusion of physics-informed constraints and transfer learning techniques. Additionally, we focus on exploiting data augmentation techniques to integrate expert knowledge into an otherwise unsupervised network training approach. We validate the proposed method on Planck High Frequency Instrument 545 GHz Far Side Lobe simulation data, considering ideal and non-ideal cases involving partial, gap-filled and inconsistent datasets, and demonstrate the potential of the neural network based dimensionality reduction to accurately model and remove large-scale systematic effects. We also present an application to real Planck High Frequency Instrument 857 GHz data, which illustrates the relevance of the proposed method to accurately model and capture structured contamination sources, with reported gains of up to one order of magnitude in terms of contamination removal performance. Importantly, the methods developed in this work are to be integrated in a new version of the SRoll algorithm (SRoll3), and we describe here SRoll3 857 GHz detector maps that will be released to the community.

Key words.
cosmology: observations – methods: data analysis – surveys – techniques: image processing
1. Introduction
In the last few decades, scientific instruments have been producing ever increasing quantities of data. Moreover, as remote sensing and instrumentation technology develops, the processing complexity of the produced datasets increases dramatically. The ambitious objectives of several scientific projects are characterized by the reconstruction of the information present in these datasets, which is often mixed with instrumental effects and foreground signals (physical components of the data that mask or blur part of the signal of interest). The scientific community is confronted, in a wide variety of contexts, with the need to extract, from measurements, physical responses adapted to the different models considered, while at the same time ensuring an effective separation between these responses and instrumental effects and/or foreground signals. This separation is rendered possible by the structured nature (in a stochastic sense) of the instrumental effects and/or foreground signals, which, from a mathematical point of view, is characterized by local spatiotemporal interactions producing couplings between different spatiotemporal scales, as opposed to Gaussian signals where no correlation exists between observations produced at different spatiotemporal locations. The structured nature of such signals allows them to be accurately represented using a reduced number of degrees of freedom, which we aim to exploit for their removal from data in order to separate them from the signal of interest, usually less structured and/or Gaussian in nature. Given that the aforementioned problem exists in multiple scientific contexts, developing efficient dimensionality reduction methods to accurately extract relevant information from data appears as a key issue for the scientific community.

∗ Corresponding author: M. Lopez-Radcenco, [email protected]
It is therefore essential to identify representations involving a reduced number of degrees of freedom to achieve robust and effective data inversion, while providing enhanced capabilities to accurately describe the complexity of the processes and variabilities at play. In this regard, different strategies can be envisaged, with recent advances relying most notably on the exploitation of operators learned on data presenting some similarities with the problem of interest (e.g. transfer learning, as explained below). Alternatively, recent works explore the use of generic signal decomposition operators (e.g. the Scattering Transform (Bruna et al. 2015)). Efforts of this type have already yielded interesting results, for example, on the expected statistical description of galactic dust emissions (Allys et al. 2019). In the present work, we specifically consider a case study involving the processing of Planck High Frequency Instrument (Planck-HFI) data, with a particular interest in the separation and removal of the systematic effects and foregrounds. In this context, we aim here at exploiting machine learning and artificial intelligence approaches to minimize the number of degrees of freedom of the systematic effects to be reconstructed and separated, whereas previous works rely on exploiting spectral and bispectral representations (Prunet et al. 2001; Doré et al. 2001; Natoli et al. 2001; Maino et al. 2002; de Gasperis et al. 2005; Keihänen et al. 2005; Poutanen et al. 2006; Armitage-Caplan & Wandelt 2009; Keihänen et al. 2010; Planck Collaboration VIII 2014; Delouis et al. 2019), which lack the ability to properly capture spatiotemporal scale interactions, to tackle this issue. Indeed, systematic effects and foregrounds are usually represented using a large number of parameters, whereas more appropriate low-dimensional representations could be learned directly from data.
In particular, in the present work, we focus on exploring neural networks as a means of learning, from data, optimal low-dimensional representations that allow for an effective separation of the structured instrumental effects from the signal of interest, simultaneously with the data inversion. The algorithmic originality of this work lies in the integration of analysis methods issued from machine learning and artificial intelligence to extract the signals of interest from data by minimizing the degrees of freedom of the processes to reconstruct within a classic minimization framework (e.g. a least squares approach). As such, the objective of the proposed methods is to find the best low-dimensional description of the data while ensuring an optimal separation of the signal of interest from any instrumental effects and/or foreground signals. We illustrate the relevance of our approach on a case study involving the contamination source removal and mapmaking, i.e., the inversion of raw satellite measurements to produce a physically consistent spatial map, of Planck-HFI data. We consider both Far Side Lobe (FSL) pickup (an unwanted signal due to the non-ideal response of the satellite's antenna) simulations from the 545 GHz Planck-HFI channel and real observations from the 857 GHz Planck-HFI channel. This case study was chosen based on the fact that the FSL pickup is a large-scale systematic effect that currently remains difficult to model and remove, given that the complexity of the Planck optical system forces current FSL estimations to rely on simplified physical and mathematical models. In particular, the Planck-HFI 545 GHz and 857 GHz channels present a weak CMB signature and their sources of contamination are dominated by the FSL pickup, which makes them ideal candidates for the considered case study.
Importantly, this work builds on previously developed methods for the separation and removal of structured contamination sources, and particularly on the SRoll2 algorithm (Delouis et al. 2019), used for the production of the 2018 release of the Planck-HFI sky maps. As such, the methods developed in this work are to be integrated in a new version of the SRoll algorithm (SRoll3), and we describe here SRoll3 857 GHz detector maps that will be released to the community. Finally, whereas the application presented provides strong evidence of the relevance of the proposed approach for the processing of large-scale systematic effects, the proposed methodology provides a generic framework for addressing similar, yet complex, data inversion issues involving the separation and removal of structured noise, foregrounds and systematic effects from data in many other scientific domains.

We focus on Convolutional Neural Networks (CNNs), which can be used for extracting relevant information by finding low-dimensional representations of data. To this end, CNNs exploit the existence of invariances within the considered datasets (LeCun et al. 1998). Broadly speaking, a CNN relies on a cascade of multiple layers, or neurons, to incrementally build an increasingly complex model relating the CNN's inputs and outputs. Each layer consists of a convolution operator with a kernel, whose values are known as weights, followed by the addition of a set of biases. Typically, a non-linear activation function, usually a Rectified Linear Unit (ReLU), i.e., ReLU(x) = max(0, x), is introduced to allow for non-linear behaviour. Dimensionality reduction/expansion is then achieved by a pooling operator, typically a local averaging or a local maximum. By feeding the output of one layer as input to another layer, multiple layers are then stacked in cascade to build a larger CNN model.
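As an illustration, a single layer of the kind just described (convolution with a kernel, bias, ReLU, then pooling) can be sketched in a few lines of Python; the function names and sizes below are ours, purely illustrative, and are not part of the SRoll3 implementation:

```python
import numpy as np

def conv_layer_1d(x, kernel, bias):
    """One CNN layer: 'valid' 1-D correlation with a learned kernel,
    addition of a bias, then a ReLU non-linearity."""
    out = np.convolve(x, kernel[::-1], mode="valid") + bias
    return np.maximum(out, 0.0)  # ReLU(x) = max(0, x)

def max_pool_1d(x, size=2):
    """Local-maxima pooling: halves the resolution when size=2."""
    trimmed = x[: len(x) // size * size]
    return trimmed.reshape(-1, size).max(axis=1)

x = np.array([1.0, -2.0, 3.0, 0.5, -1.0, 2.0])
h = conv_layer_1d(x, kernel=np.array([0.5, 0.5]), bias=0.0)  # 5 activations
y = max_pool_1d(h)                                           # 2 pooled values
```

Stacking such layers, with the output of one fed as input to the next, yields the cascade described above.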
Network training then consists in learning, from a training dataset, the network weights and biases that minimize a specific cost function, adequately chosen given the problem of interest. Recent advances have yielded powerful algorithms capable of training large networks from massive datasets efficiently. However, neural network based models are not always invertible, in the sense that part of the (invariant) information fed to the network is lost. This implies that it is not possible to reconstruct an input exclusively from the output of a CNN designed to produce a low-dimensional representation of the data. Nonetheless, it is indeed possible to synthesize an input that would return a given output when fed to the considered network (Mordvintsev et al. 2015). This synthesized input is statistically similar to the original input that produced the output considered (in a sense relating to the neural network architecture and its training). Such results, however, cannot be used to accurately reconstruct the input data, which is why autoencoder networks (Bourlard & Kamp 1988; Hinton & Salakhutdinov 2006) were developed. Autoencoder networks are specifically designed and trained to keep enough information to be able to accurately reconstruct the input data from a low-dimensional representation. Nonetheless, adapting neural network approaches to our application of interest, which closely relates to the problem of source separation in signal processing (Choi et al. 2005), is not trivial, and would require the imposition of additional constraints on the network weights and biases. Unfortunately, considering additional constraints on the network parameters used for input data reconstruction, which are determined during network training, is not straightforward for autoencoders (or even for most neural networks).
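The autoencoder principle, keeping just enough information in a low-dimensional code to reconstruct the input, can be illustrated with its simplest linear instance, where the optimal encoder/decoder pair is given by a truncated SVD (a toy example of our own, not the networks used in this work):

```python
import numpy as np

# Toy linear "autoencoder": encode 8-D observations into a 2-D code and
# decode back. For linear maps the optimal code subspace is spanned by
# the leading right singular vectors (the link with PCA).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))   # two true degrees of freedom
mixing = rng.normal(size=(2, 8))
data = latent @ mixing               # 8-D observations of rank 2

_, _, Vt = np.linalg.svd(data, full_matrices=False)
k = 2                                # size of the low-dimensional code
encode = lambda x: x @ Vt[:k].T      # encoder: projection onto the subspace
decode = lambda z: z @ Vt[:k]        # decoder: reconstruction
rel_err = np.linalg.norm(data - decode(encode(data))) / np.linalg.norm(data)
```

Because the toy data truly have two degrees of freedom, the reconstruction error vanishes; with a code smaller than the intrinsic dimension, information is necessarily lost, which is the situation discussed above for generic CNN feature extractors.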
In order to consider additional constraints, it would be necessary to explicitly rewrite the inversion used by the autoencoder to learn the low-dimensional representation and include any desired constraints within such an inversion scheme. Moreover, autoencoder networks are often based on convolutional approaches that cannot effectively handle partial observations and incomplete data. In this regard, rather than using autoencoder networks, we exploit input-training (Tan & Mavrovouniotis 1995; Hassoun & Sudjianto 1997) to train a deconvolutional decoder network directly from data.

In the present work, both the decoder network parameters as well as the optimal low-dimensional representation of the considered dataset (which constitutes the input of the decoder network) are learned simultaneously during the data inversion, without any preliminary network training phase. The idea of a joint optimization of network parameters and inputs, known as input-training, was first introduced in (Tan & Mavrovouniotis 1995) and later revisited in (Hassoun & Sudjianto 1997) in the context of autoencoder training. Input-training, which closely relates to non-linear principal component analysis (Baldi & Hornik 1989; Kramer 1991; Schölkopf et al. 1998; Scholz 2002; Scholz et al. 2005), was subsequently exploited for multiple applications, including error and fault detection and diagnosis (Reddy et al. 1996; Reddy & Mavrovouniotis 1998; Jia et al. 1998; Böhme et al. 1999; Erguo & Jinshou 2002; Bouakkaz & Harkat 2012), chemical process control, monitoring and modeling (Böhme et al. 1999; Liu & Zhao 2004; Geng & Zhu 2005; Zhu & Li 2006), biogeochemical modeling (Nandi et al. 2002; Schryver et al. 2006), shape representation (Park et al. 2019) and matrix completion (Fan & Cheng 2018), among others. Recently, this idea was applied in (Bojanowski et al. 2018) to train generative adversarial networks (Goodfellow et al. 2014; Denton et al. 2015; Radford et al.
2015) without an adversarial training protocol. The choice of an input-training deconvolutional decoder network is further motivated by known limitations of classic CNN-based methods. Indeed, even though CNNs have been extensively used for inverse problems (McCann et al. 2017), most CNN-based approaches learn the optimal solution (in a probabilistic sense) of the considered problem from a very large training dataset that not only needs to accurately represent the complexity of the problem of interest, but that may also not take into account any known and well-understood or well-modeled parts of the underlying processes. As such, CNN-based methods are most effective for the analysis of processes where the solution of the inverse problem can be adequately characterized by exploiting a large ensemble of training data. Such approaches usually aim at exploiting a sufficiently large dataset allowing for the development of a complex model capable of generalization to similar observations outside the training dataset. In the context of the present study, however, we focus on cases where the signal to be reconstructed is badly known or modeled, and where a limited amount of training data is available. In this respect, we rather aim at exploiting all the available information to produce the most appropriate low-dimensional representation of the available dataset. The objective of the decoder network learning stage is then to identify an optimal low-dimensional subspace where both the signal of interest as well as the instrumental effects can be represented accurately, so that the inverse problem can be formulated as a constrained optimization on the projection of these signals onto the learned subspace. The idea is to produce synthesized data from a set of inputs defining a low-dimensional representation of the signals of interest and then compare the synthesized data with real observations.
In this regard, the decoder network parameters and the low-dimensional representation are optimized simultaneously, so that the difference between the synthesized dataset and the available observations is minimal. Importantly, this approach is robust to partial observations and incomplete datasets. This property is particularly relevant for remote sensing data, which is often derived from satellite or airborne partial surface measurements.

The basic idea behind transfer learning lies in exploiting knowledge gained by applying machine learning techniques to a specific problem to tackle a different but related problem. Formally, a learning task can be defined by a domain (or dataset) and a learning objective, usually defined by a cost function to be minimized. In transfer learning, knowledge gained from a source learning task is used to improve performance in a different target learning task. This implies that either the domain or the objective of these two distinct tasks are different (Pan & Yang 2009). One may consider, for example, training a galaxy classification algorithm on galaxies from a given survey and then applying the gained knowledge to either classify another set of galaxies from a different survey (different learning domain) (Tang et al. 2019) or to classify a set of galaxies pertaining to a different classification (different learning objective). It is important to underline that, to be considered as transfer learning, the source and target tasks must be different in either their learning domain and/or their learning objective. The idea of leveraging general knowledge learned from a specific task to improve a similar task is closely related to the concept of generalization. Indeed, using a specific task to extract information that is useful for a secondary task involves identifying specific information that pertains to more general, shared aspects of both tasks.
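As a toy numerical sketch of this source/target setup (all quantities below are illustrative, not Planck-HFI data): a shared, frozen feature representation is reused across two related tasks, and only the final task-specific layer is re-fitted on each:

```python
import numpy as np

# Transfer-learning sketch: a shared (frozen) non-linear representation
# is reused across a source and a target task; only the last,
# task-specific layer is re-fitted per task.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))
W_shared = rng.normal(size=(10, 4))            # shared "early layers", frozen
feats = np.maximum(X @ W_shared, 0.0)          # shared ReLU features

y_source = feats @ np.array([1.0, -2.0, 0.5, 3.0])   # source-task labels
y_target = feats @ np.array([-1.0, 0.0, 2.0, 1.0])   # related target labels

# Re-fit only the final linear layer (least squares) for each task:
head_source, *_ = np.linalg.lstsq(feats, y_source, rcond=None)
head_target, *_ = np.linalg.lstsq(feats, y_target, rcond=None)
```

Because both tasks are generated from the same shared features, the frozen representation suffices and only a small task-specific head needs re-training, which is the economy that transfer learning exploits.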
In traditional machine learning approaches, generalization is achieved by building a training dataset that accurately represents a majority of possible cases well enough to generalize to previously unseen observations. In transfer learning, generalization is achieved by means of a more subtle approach that relies on discriminating information specific to the task at hand from general information pertaining globally to both tasks. This may be particularly interesting for the processing of Planck-HFI data, where certain systematic effects are similar between detectors. While this prevents them from being removed by classic averaging-based methods (as they would be accumulated in the mean result used as the final product), it also allows for a very efficient modeling and transfer of shared characteristics between detectors. Transfer learning usually involves training a network to solve the source learning task, and then retraining the last layers of the network on the target learning task. The main idea behind this approach lies in the fact that, since the two learning tasks are related, the first layers of the network will involve more general learning pertaining to a more global aspect of the task (like recognizing edges or gradients in image classification), while the final layers exploit this knowledge to build upon it and learn more complex rules.

The integration of the decoder network training alongside the data inversion constitutes the most important original contribution of our approach, as it fundamentally differs from standard dimensionality reduction approaches (Kramer 1991; DeMers & Cottrell 1993; Roweis & Saul 2000; Tenenbaum et al. 2000; Saul & Roweis 2003; Aharon et al. 2006; Hinton & Salakhutdinov 2006; Lee & Verleysen 2007; Van Der Maaten et al. 2009; Bengio et al. 2013), which are typically used as independent pre-processing steps and produce low-dimensional representations that may not always be completely adapted to the data inversion considered.
As such, this dimensionality reduction helps better handle the lack of explicit information on certain instrumental effects and/or foreground signals to effectively separate them from the signal of interest. Finally, particular attention must be paid to the size of the low-dimensional representation, which will directly influence the final number of parameters to be estimated, as a high number of degrees of freedom could adversely affect the identifiability and numerical feasibility of the problem, which can lead to noisy, inaccurate or incorrect solutions. To tackle such an issue one may, for example, consider adding statistical or physically-motivated constraints to the loss function minimized during the data inversion. Here, we illustrate the importance of such dimensionality reduction by considering applications to both synthetic and real Planck-HFI data. In particular, we achieve considerable gains, of up to one order of magnitude, when considering a single input for the low-dimensional representation of the signals of interest.

The rest of the paper is organized as follows. In Sect. 2, we formally introduce the data inversion problem we are interested in, as well as the proposed input-training deconvolutional decoder neural network based formulation, and an alternative two-dimensional formulation of the decoder neural network architecture. Section 3 introduces applications to both synthetic and real Planck-HFI datasets, provides a comparison to state-of-the-art mapmaking methods and an exploration of the potential of the proposed framework to synthesize and remove FSL pickup. Additionally, it also illustrates how integrating transfer learning techniques into the proposed framework could improve contamination source removal performance. Results pertaining to these applications are presented in Sect. 4 and further discussed in Sect. 5. Finally, we present our concluding remarks and future work perspectives in Sect. 6.
2. Method
Following standard mapmaking formulations, we cast our data inversion problem as a linear inversion:

$$m_t = A_{tp}\, s_p + c_{tp} + \epsilon_t, \quad (1)$$

where:
– $m_t$ is the time-ordered observation data, indexed by a time-dependent index $t$,
– $s_p$ is the spatial signal to recover, indexed by a space-dependent index $p$,
– $A_{tp}$ is a projection matrix relating observations $m_t$ to signal $s_p$, encompassing the observation system's geometry and any raw data pre-processing,
– $c_{tp}$ is a spatio-temporal dependent signal comprising all structured, non-Gaussian foregrounds and/or systematic effects,
– $\epsilon_t$ is a time-dependent white noise process modeling instrument measurement uncertainty as well as model uncertainty.

The main objective of mapmaking approaches is to recover the spatial signal $s_p$ from time-ordered observations $m_t$, which also involves ensuring an effective separation between $s_p$ and the foregrounds and systematic effects $c_{tp}$, so that there is no cross-contamination in the final produced map.

As previously stated, our proposed approach relies on a deconvolutional decoder network to find a low-dimensional representation of the foregrounds and systematic effects $c_{tp}$, so that they can be effectively separated from the spatial signal $s_p$. We exploit a custom network training loss function to ensure the effective separation of the spatial signal $s_p$ from foreground and systematic effects, coupled with an input-training approach to allow for the simultaneous learning of both the network parameters and the low-dimensional representation of $c_{tp}$. Specifically, the proposed network architecture takes $N$ low-dimensional feature vectors $\alpha_n$, $n \in [\![1, N]\!]$, of size $2K$ as input, so that the input data is initially arranged in a 2D tensor of size $[N, 2K]$. Input feature vectors are then projected onto a higher-dimensional subspace by means of a deep neural network with multiple deconvolutional layers.
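The linear model of Eq. (1) can be instantiated on toy data as follows (a minimal sketch with illustrative sizes; here the pointing matrix $A$ simply assigns each time sample to one sky pixel):

```python
import numpy as np

# Toy instance of m_t = A_tp s_p + c_tp + eps_t: a pointing matrix A
# projects a small sky map onto a time stream, to which a structured
# (here, slowly varying) systematic c and white noise eps are added.
rng = np.random.default_rng(2)
n_pix, n_t = 16, 4000
sky = rng.normal(size=n_pix)                     # s_p, the map to recover
pointing = rng.integers(0, n_pix, size=n_t)      # p(t): pixel seen at time t
A = np.zeros((n_t, n_pix))
A[np.arange(n_t), pointing] = 1.0                # A_tp
c = np.sin(2 * np.pi * np.arange(n_t) / 1000.0)  # structured systematic c_tp
eps = 0.01 * rng.normal(size=n_t)                # white noise eps_t
m = A @ sky + c + eps                            # time-ordered data m_t
```

With this pointing convention, applying $A$ is equivalent to indexing the map with $p(t)$, and each pixel is observed many times, which is the spatial redundancy exploited below.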
For all feature vectors $\alpha_n$, $n \in [\![1, N]\!]$, a reshape operation followed by a non-linearity, provided by a ReLU operator, converts the input data into $K$ channels of size $n_0 = 2$, with the result of such an operation being

¹ A deconvolutional layer exploits a convolutional kernel to project a low-dimensional input into a higher-dimensional subspace by applying an "inverse" convolution (in the sense that the produced output would be projected onto the input by regular convolution with the considered convolutional kernel). Given that a thorough exploration of convolution arithmetic is outside the scope of this work, we refer the reader to (Dumoulin & Visin 2016) for an in-depth analysis of deconvolution in the context of deep neural networks.
[Figure: reshape + ReLU, followed by a cascade of circular deconvolution layers with kernels of shape [2K×K×1×Ks], [2K×2K×1×Ks], ..., [1×2K×1×Ks].] Fig. 1.
Considered Decoder CNN architecture.

a tensor of size $[N, 2, K]$. A first circular deconvolutional layer dilates these $K$ channels into $2K$ channels of size $n_1 = 8$. $M-2$ subsequent circular deconvolutional layers further dilate these $2K$ channels to sizes $n_2 = 32, \ldots, n_m = 2 \cdot 4^m, \ldots, n_{M-1} = 2 \cdot 4^{M-1}$, with the corresponding results of such operations being tensors of size $[N, 8, 2K], [N, 32, 2K], \ldots, [N, 2 \cdot 4^m, 2K], \ldots, [N, 2 \cdot 4^{M-1}, 2K]$, respectively. A final circular deconvolutional layer combines the existing $2K$ channels to produce the final output of the decoder network, a 2D tensor $o(n, b)$ of size $[N, 2 \cdot 4^{M-1}]$. Each of the $N$ lines of this output tensor corresponds to one of the $N$ low-dimensional input feature vectors $\alpha_n$. Finally, a piece-wise constant interpolation scheme is used to interpolate the $N$ network outputs of size $2 \cdot 4^{M-1}$ into $N$ outputs corresponding to $N$ higher dimensional output vectors relating to observations $m_t$. To this end, for each observation $n$, time-ordered data is binned into $2 \cdot 4^{M-1}$ bins, so that all data points corresponding to bin $b$ in observation $n$ are interpolated as $o(n, b)$. The binning strategy for this final step is directly dependent on the considered problem and dataset. For Planck-HFI 545 GHz data, this binning is detailed in Sect. 3. A schematic representation of the network structure is presented in Fig. 1.

As previously explained, neural network based dimensionality reduction is classically performed by exploiting autoencoders, which usually involve deep symmetrical architectures with a bottleneck central layer providing the low-dimensional representation. This is achieved by using observations as both input and output at training, so that the considered network learns the optimal low-dimensional representation space that minimizes reconstruction error. In the proposed approach, however, we rather exploit an input-training scheme to avoid training an encoder network.
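This input-training alternative can be sketched with a toy linear decoder (an illustrative setup of our own, not the SRoll3 network): both the decoder weights $W$ and its low-dimensional inputs $\alpha$ are updated by gradient descent on the reconstruction error, with no encoder involved:

```python
import numpy as np

# Input-training sketch: the decoder parameters W AND its inputs alpha
# are optimised jointly by gradient descent on the reconstruction error.
rng = np.random.default_rng(3)
target = rng.normal(size=(32, 20))        # "observations" to reproduce
alpha = 0.1 * rng.normal(size=(32, 3))    # trainable low-dim inputs
W = 0.1 * rng.normal(size=(3, 20))        # trainable decoder weights

lr = 0.05
for _ in range(2000):
    resid = alpha @ W - target            # decoder output minus data
    # Gradients of 0.5*||resid||^2 w.r.t. the inputs and the weights:
    alpha, W = alpha - lr * (resid @ W.T), W - lr * (alpha.T @ resid)

loss = np.mean((alpha @ W - target) ** 2)
```

In the actual method the decoder is the deconvolutional network of Fig. 1 and the loss is the map-aware function derived below, but the joint update of inputs and weights is the same idea.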
Input-training is achieved by optimizing the network input, in our case the low-dimensional representation $\alpha_n$, alongside the remaining network parameters. Provided that the considered loss function is differentiable with respect to inputs, classic neural network training approaches can be used to backpropagate gradients through the input layer and optimize the inputs themselves.

In our framework, we wish to ensure an efficient separation between the spatial signal $s_p$ and the foregrounds and systematic effects $c_{tp}$ modeled by the proposed neural network architecture. To this end, we follow classic mapmaking approaches and, under the hypothesis that the projected spatial signal $A_{tp} s_p$ for any given pixel $p$ remains constant in time, we exploit spatial redundancy in observations $m_t$, provided by spatial crossings or co-occurrences in observations at different times, to remove signal $s_p$ from observations $m_t$. For a given pixel $p$, this amounts to computing the mean observation $M_p$ and subtracting it from any and all observations $m_t$ corresponding to pixel $p$:

$$M_p = \frac{1}{H(p)} \sum_{t \,|\, p(t) = p} m_t = \frac{1}{H(p)} \sum_{t \,|\, p(t) = p} \left( A_{tp} s_p + c_{tp} + \epsilon_t \right) = A_{tp} s_p + \frac{1}{H(p)} \sum_{t \,|\, p(t) = p} \left( c_{tp} + \epsilon_t \right), \quad (2)$$

$$\hat{m}_t = m_t - M_p = m_t - \frac{1}{H(p)} \sum_{t \,|\, p(t) = p} m_t = c_{tp} - \frac{1}{H(p)} \sum_{t \,|\, p(t) = p} c_{tp}, \quad (3)$$

where $p(t)$ designates the pixel corresponding to observation $m_t$ at time $t$, $H(p)$, known as the hit-count, is the total number of observations at pixel $p$, and the zero-mean white noise contribution $\epsilon_t$ has been omitted in Eq. (3).

In the proposed decoder network based approach, observations $m_t$ are used for training by considering the output of the decoder network to provide a parametrization of the foregrounds and systematic effects:

$$c_{tp} = f(\alpha_n), \quad (4)$$

so that the network, including its inputs $\alpha_n$, can be trained to minimize reconstruction error with respect to the signal-free observations $\hat{m}_t$. The appropriate training loss function can then be directly derived from Eqs.
(3) and (4):

$$\mathcal{L} = \sum_p \sum_{t \,|\, p(t) = p} \left\| \left( m_t - M_p \right) - \left( c_{tp} - \frac{1}{H(p)} \sum_{t \,|\, p(t) = p} c_{tp} \right) \right\|^2 = \sum_p \sum_{t \,|\, p(t) = p} \left\| \left( m_t - M_p \right) - \left( f(\alpha_n) - \frac{1}{H(p)} \sum_{t \,|\, p(t) = p} f(\alpha_n) \right) \right\|^2. \quad (5)$$

From Eqs. (3), (4), and (5), it can be concluded that the time invariance hypothesis of the projected spatial signal $A_{tp} s_p$ ensures that all traces of signal $s_p$ can be adequately removed from observations $m_t$ during the data inversion. Even though this hypothesis may not always be formally respected depending on the considered dataset and application, it still remains a valid approximation for a large number of applications, provided that the appropriate spatio-temporal scales and sampling frequencies for observations are chosen.

Following recent trends in machine learning (Raissi et al. 2017a,b; Karpatne et al. 2017; Lusch et al.
2018; Nabian & Meidani 2018; Raissi & Karniadakis 2018a,b; Raissi et al. 2018; Yang & Perdikaris 2018; Erichson et al. 2019; Lutter et al. 2019; Roscher et al. 2019; Seo & Liu 2019; Yang & Perdikaris 2019), we design a custom loss function including a standard reconstruction error term (as in most machine learning applications) coupled with physically-derived terms introducing expert knowledge relating to the application and dataset considered (see Sect. 3 for detailed examples).

Besides the previously introduced Decoder CNN architecture, we propose an alternative two-dimensional (2D) formulation of the original Decoder CNN. The novel 2D formulation amounts to modifying the network so that the intermediate convolutional layers involve two-dimensional convolutional kernels. In this regard, this alternative formulation relies on a two-dimensional binning of observations $m_t$ for training. In particular, we exploit here a fully connected layer to allow us to considerably reduce the dimension of the low-dimensional representation of the signals of interest, at the expense of increasing the number of weights and biases to be learned during training. Contrary to a convolutional layer, which involves a convolution where each value of the produced multidimensional output depends only on a local subset of a multidimensional input (due to the convolution operation), a fully connected layer produces the output by means of a linear combination of all values in the input. In a fully connected layer, the weights and biases to be learned are those of the linear combination that produces the output. This implies that the trainable inputs, i.e., the low-dimensional representation $\alpha_n$, $n \in [\![1, N]\!]$, of size $K$ should now be arranged into a 2D tensor of size $[N, K]$, which will be converted by a fully connected layer into $K$ channels of size $[2, 2]$, i.e., a tensor of size $[2, 2, K]$. A first convolutional layer further expands this tensor into $2K$ channels, producing an output of size $[8, 8, 2K]$.
In a similar fashion to the original Decoder CNN, the subsequent circular deconvolutional layers progressively upsample the $2K$ channels along the first two dimensions, doubling their sizes at each of the $M-1$ intermediate layers. A final circular deconvolutional layer combines the existing $2K$ channels into a single two-dimensional output tensor. Finally, a piece-wise constant interpolation scheme is used to interpolate the network outputs into outputs corresponding to the observations $m_t$. To this end, time ordered data is binned two-dimensionally; the binning strategy for this final step is directly dependent on the considered problem and dataset.
Given that the proposed approach exploits spatial redundancy in the observations by minimizing loss function (5), which is computed on observation co-occurrences only, no strong constraint is imposed on the large-scale signature of the network output. In this regard, the network output may, in some cases, resort to adding a large-scale signal that remains close to zero around the ecliptic poles, where most signal crossings occur, in order to further minimize the loss function. Since few crossings exist in between the ecliptic poles, this large-scale signal will not be adequately constrained by observations and will rarely produce a physically sound reconstruction.
To prevent such behaviour, the following additional constraint on the final correction map $H(p) \sum_{t\,|\,p(t)=p} f(\alpha_n)$ is considered:

$$
\mathcal{L}_{\mathrm{map}} = \sum_p \left\| M_p - H(p) \sum_{t\,|\,p(t)=p} f(\alpha_n) \right\|^2 . \qquad (6)
$$

Such a constraint penalizes solutions where the final correction map diverges from the input map, thus avoiding the inclusion of a strong large-scale signature in the network correction.
The compromise between the original loss function (5) and the additional map constraint (6) is controlled by means of a user-set weight $W_{\mathrm{map}}$, so that the final modified loss function is given by:

$$
\mathcal{L}_{\mathrm{total}} = \mathcal{L} + W_{\mathrm{map}}\,\mathcal{L}_{\mathrm{map}} = \sum_p \sum_{t\,|\,p(t)=p} \left\| (m_t - M_p) - \left( f(\alpha_n) - H(p) \sum_{t'\,|\,p(t')=p} f(\alpha_n) \right) \right\|^2 + W_{\mathrm{map}} \sum_p \left\| M_p - H(p) \sum_{t\,|\,p(t)=p} f(\alpha_n) \right\|^2 . \qquad (7)
$$

In the context of the processing of Planck-HFI observations, transfer learning techniques appear particularly interesting given the limited amount of data available.
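The combined objective of Eq. (7) can be sketched in NumPy as follows (a toy illustration with invented names, where the network output $f(\alpha_n)$ is represented by a precomputed correction vector `c`):

```python
import numpy as np

def map_constraint(M_p, c, pix):
    """Sketch of Eq. (6): penalise a correction map drifting from the input map M_p."""
    hits = np.maximum(np.bincount(pix, minlength=M_p.size), 1).astype(float)
    corr_map = np.bincount(pix, weights=c, minlength=M_p.size) / hits
    return np.sum((M_p - corr_map) ** 2)

def total_loss(m, c, pix, W_map):
    """Sketch of Eq. (7): co-occurrence loss of Eq. (5) plus the weighted map term."""
    hits = np.bincount(pix).astype(float)
    M_p = np.bincount(pix, weights=m) / hits    # input map M_p
    C_p = np.bincount(pix, weights=c) / hits    # per-pixel mean correction
    L = np.sum(((m - M_p[pix]) - (c - C_p[pix])) ** 2)
    return L + W_map * map_constraint(M_p, c, pix)
```

Setting `W_map = 0` recovers the pure co-occurrence loss; increasing it trades reconstruction fidelity for a correction map that stays close to the input map.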
This is in perfect agreement with the main motivation behind transfer learning, which aims at leveraging knowledge from previously learned models to tackle new tasks, thus going beyond specific learning tasks and domains to discover more general knowledge shared among different problems. As illustrated below, transfer learning allows us to use data from different bandwidths and to exploit different detectors to complement each other and produce more accurate sky maps.
To further constrain the proposed decoder network, particularly for identifying and removing contamination sources shared among multiple detectors, we explore classic transfer learning techniques. As previously explained, transfer learning relies on learning and storing knowledge from a particular problem or case study and applying such knowledge to solve a similar but different problem or case study.
Given the specificities of the proposed Decoder CNN architecture, we train the whole network on a source task and only retrain the low-dimensional representation (i.e., the low-dimensional inputs $\alpha_n$) on the target task. Such an approach can be seen as a particular case of feature-representation transfer learning (Pan & Yang 2009), since the knowledge transferred between tasks lies in the way the signals and processes of interest are represented in the low-dimensional subspace of the inputs.
Indeed, since we may consider the Decoder CNN as a projection of the observations onto the low-dimensional space of the inputs, transferring the network weights and biases and only retraining the inputs amounts to considering that the projection onto the space of the inputs is shared between the two learning tasks considered. This means that the source learning task will learn a projection, defining a low-dimensional representation, that will then be used as is by the target task.
In the context of the proposed application, we exploit transfer learning to better learn structured systematic effects and/or foregrounds by training the proposed Decoder CNN on a dataset accurately depicting these foregrounds and systematic effects. Given that we focus on learning the projection that most accurately captures the structure of the foregrounds and systematic effects, the Decoder CNN is trained on the whole dataset rather than on observation co-occurrences (as is done with cost function (5)), which amounts to considering the following training cost function:

$$
\mathcal{L}_{\mathrm{TL}} = \sum_p \sum_{t\,|\,p(t)=p} \left\| m_t - c_{t,p} \right\|^2 = \sum_p \sum_{t\,|\,p(t)=p} \left\| m_t - f(\alpha_n) \right\|^2 . \qquad (8)
$$

After training, the resulting Decoder CNN is transferred to a new dataset, which amounts to retraining the Decoder CNN inputs only, using the original cost function (Eq. (5)), while keeping the previously trained weights and biases.
In this context, two distinct cases can be discerned. The first case involves two detectors that measure the sky signal in the same frequency band, while the second involves two detectors measuring the sky signal in different frequency bands. Both cases rely on training the Decoder CNN on a dataset from a specific detector, and then retraining inputs only on a different dataset pertaining to a different detector.
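The "retrain the inputs only" step can be emulated with a frozen linear decoder standing in for the trained Decoder CNN (a deliberately simplified sketch; `D`, `alpha`, the learning rate, and the toy target are ours): the weights learned on the source task are kept fixed, and only the low-dimensional input is optimized on the target data.

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.normal(size=(64, 4))       # decoder weights, frozen after source-task training
target = D @ rng.normal(size=4)    # target-task ring, representable by the frozen decoder

# Transfer step: keep D fixed and retrain only the low-dimensional input alpha
# by gradient descent on || D @ alpha - target ||^2.
alpha = np.zeros(4)
lr = 0.4 / np.linalg.norm(D, ord=2) ** 2   # safe step size from the spectral norm
for _ in range(500):
    alpha -= lr * 2 * D.T @ (D @ alpha - target)
```

Because the decoder is fixed, the target task only searches the low-dimensional subspace learned on the source task, which is the feature-representation transfer discussed above.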
Since the learning tasks for both detectors are different (the considered cost functions are different), such a procedure effectively amounts to transfer learning. This is further reinforced if the second detector dataset differs significantly from the first, i.e., if we choose, for example, to train the Decoder CNN on a 545 GHz detector dataset and retrain the inputs using a 857 GHz detector dataset. The simpler case where both detectors measure the sky signal in the same frequency band still amounts to transfer learning, but may involve more accurate knowledge transfer, given the strong similarities between the source and target datasets.
3. Applications
The Planck satellite scanning strategy, a clear schema of which can be found in Sect. 1.4 of Planck Collaboration ES (2018), is determined by a halo orbit around the Lagrange L2 point. The satellite rotates around an axis nearly perpendicular to the Sun (Tauber et al. 2010) and scans the sky in nearly great circles at around 1 rpm, which means that the ecliptic poles are observed considerably more frequently, and in many more directions, than the ecliptic equator. Thus, the ecliptic poles concentrate most of the observation crossings and co-occurrences, providing the required redundancy to ensure effective separation and removal of systematic effects and foregrounds. This can be clearly observed in
Fig. 2.
Full-mission observation hit-count map, i.e., the total number of observations at each pixel, for the Planck-HFI 545 GHz channel.
Fig. 2, where we present the Planck-HFI 545 GHz channel hit-count map, i.e., the number of observations at each pixel.
The redundancy pattern produced by the Planck-HFI scanning strategy is particularly relevant for our approach, given that the network training takes spatial redundancy into account for the removal of the signal $s_p$, to ensure that the CNN is trained to capture and model foregrounds and systematic effects $c_{t,p} = f(\alpha_n)$ only. In this regard, the choice of a scanning strategy is a critical point in the design of most remote sensing satellite missions, as it determines a compromise between spatial redundancy (necessary for an accurate removal of spatially redundant sources of contamination) and spatiotemporal sampling resolution (necessary to obtain accurate and reliable measurements of the signal of interest). For most remote sensing satellite missions, the choice of scanning strategy is usually the product of extensive research based on multiple end-to-end simulations of the observing system.
Time ordered data from the Planck satellite is sampled in consecutive 1 rpm rotations of the satellite. These observations can be naturally organized into discrete packages, with measurements corresponding to each full rotation being grouped together into units called circles.
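Conceptually, the hit-count map of Fig. 2 is just a histogram of the pointing: counting how many samples fall in each sky pixel. A minimal sketch on a hypothetical 12-pixel sky (the pointing values are invented, with pixels 0 and 11 playing the role of the heavily revisited polar pixels):

```python
import numpy as np

# Toy pointing p(t): pixels 0 and 11 are revisited often, the others rarely,
# mimicking the polar concentration of crossings in Fig. 2.
pix = np.array([0, 0, 0, 1, 2, 0, 11, 11, 5, 0, 11, 7])
hit_count = np.bincount(pix, minlength=12)   # hits per pixel, as in Fig. 2
```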
Given the relationship between the rotational velocity of the satellite and its orbital velocity around the Sun, consecutive circles can be grouped together every 60 rotations and averaged to produce a composite measurement, called a ring, under the approximation that the region of the sky observed by 60 consecutive circles remains essentially unchanged. Each ring is discretized into $B$ points, so that each measurement in a ring corresponds unequivocally to a phase bin of amplitude $2\pi/B$.
Further compression of the information present in the Planck-HFI 545 GHz and 857 GHz datasets is achieved by considering a HEALPix pixelization (Górski et al. 2005).
For the application considered in the present work, we focus specifically on one large angular scale systematic effect, namely the FSL pickup, which consists of radiation pickups far from the Planck telescope line of sight, primarily due to the existence of secondary lobes in the telescope's beam pattern, which creates what is commonly known as "straylight contamination" (Tauber et al. 2010; Planck Collaboration III 2016). Typically, FSL pickup is characterized by a highly structured large angular scale signature, which makes it an ideal candidate to evaluate the proposed method's ability to exploit such structure to project the signals of interest onto a low-dimensional subspace where such structured information is adequately represented with a reduced number of degrees of freedom. In this regard, we focus our analysis on larger spatial scales (low multipoles $\ell$), given that FSL pickup is primarily a large-scale systematic effect. Moreover, the dominant contamination source at small scales in Planck-HFI data is the detector noise, which can be modeled as an unstructured, Gaussian signal that cannot be effectively removed by the proposed method, which further motivates our choice to focus on large spatial scales.
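The circle-to-ring compression can be sketched as a plain average of 60 noisy re-observations of the same phase-binned signal (toy numbers and names are ours; the real pipeline averages actual circles): averaging suppresses white noise by roughly $\sqrt{60}$ while leaving the repeated sky signal untouched.

```python
import numpy as np

N_CIRCLES, B = 60, 128                 # 60 one-minute circles per ring, B phase bins
rng = np.random.default_rng(2)
sky = np.sin(2 * np.pi * np.arange(B) / B)   # sky signal, fixed over one ring

# Each circle re-observes the same sky with independent noise; averaging the
# 60 circles into a single ring beats the noise down by about sqrt(60).
circles = sky + 0.5 * rng.normal(size=(N_CIRCLES, B))
ring = circles.mean(axis=0)            # one composite ring of B phase bins
```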
It should be noted, however, that even though we focus here on the FSL pickup specifically, other structured contamination sources present at intermediate spatial scales not yet dominated by detector noise may also be removed with the proposed methodology; this is, however, beyond the scope of this work.
To illustrate the relevance of the framework introduced in Sect. 2, we consider data from the Planck-HFI 545 GHz dataset of the Planck mission (Tauber et al. 2010). As previously explained, the choice of the Planck-HFI 545 GHz channel is motivated by its weak CMB signature, which simplifies both the data processing and the interpretation of obtained results. In particular, we exploit FSL pickup synthetic 545 GHz data to validate our method's ability to learn suitable low-dimensional representations of the FSL pickup under both ideal and non-ideal settings, including cases considering incomplete, gap-filled and inconsistent datasets.
Besides Planck 545 GHz data, we also consider Planck 857 GHz data to evaluate how data augmentation techniques can be exploited to improve the contamination source removal performance of the Decoder CNN architecture, as explained in Sect. 4.2. Similarly to the Planck-HFI 545 GHz channel, the Planck-HFI 857 GHz channel presents a weak CMB signature, which simplifies both the data processing and the interpretation of obtained results.
Importantly, the Planck-HFI 857 GHz channel has the particularity that, given the position of its associated detectors on the focal plane, one of the 857 GHz detectors presents very little FSL pickup. This implies that the difference maps between different detectors will predominantly depict large-scale systematic effects, with the FSL pickup being the dominant systematic observed (Planck Collaboration III 2020). As such, Planck-HFI 857 GHz data provides an ideal setting to evaluate the ability of the proposed method to capture and remove large-scale systematic effects, and especially the FSL pickup. Moreover, the considered Planck 857 GHz data share many similarities with the previously introduced Planck 545 GHz dataset, including the circle-averaging used to produce rings and the HEALPix pixelization based information compression (presented in Sect. 3.2). Besides the difference in frequency bands, the main difference lies in the slightly different observation spatial distribution, produced by the differences in location and orientation of the detectors involved.
Taking Planck-HFI 545 GHz and 857 GHz data specificities into consideration for the proposed decoder network based approach, we chose to train our decoder network on compressed rings directly, so that the final step of the decoder network, considering M =
3, uses a piece-wise constant interpolation to interpolate the 128 larger bins produced as output by the decoder network. Using phase values as the independent interpolation variable, network outputs are thus interpolated to the length of the compressed rings, allowing for the effective removal of systematic effects and foregrounds by minimizing the custom loss function (Eq. (5)). Moreover, an additional map constraint term, as introduced in Sect. 2.3, is added to the loss function (Eq. (5)) to introduce physics informed constraints and leverage domain knowledge on the inversion problem. Finally, transfer learning strategies, as presented in Sect. 2.4, are also explored as a means to share and transfer relevant information between datasets.
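The final piece-wise constant interpolation step amounts to indexing the coarse network-output bins with the phase of each sample, as sketched below (the function name and bin count are illustrative, not the SRoll3 implementation):

```python
import numpy as np

def upsample_constant(bins, length):
    """Piece-wise constant interpolation of B network-output bins onto a ring
    of `length` samples, using the phase as the independent variable."""
    phase = np.arange(length) / length                 # phase in [0, 1)
    idx = np.floor(phase * len(bins)).astype(int)      # which coarse bin each sample hits
    return bins[idx]

out = upsample_constant(np.array([1.0, 2.0, 3.0, 4.0]), 8)
# out → [1, 1, 2, 2, 3, 3, 4, 4]: each coarse bin is held constant over its span
```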
4. Results
We explore the ideas introduced in Sect. 3 by training our Decoder CNN on FSL simulation data from a 545 GHz detector (detector 545). The objective of this validation stage is to demonstrate the method's ability to adequately learn a suitable low-dimensional representation for the signals of interest from data. In this regard, the learned representation embeds knowledge, learnt from the available dataset, that facilitates the separation and removal of structured systematic effects and foregrounds (provided that such knowledge effectively exists within the dataset), while also being optimized with respect to the data inversion itself. Moreover, transfer learning techniques are evaluated using phase-shifted data from the same detector. In this regard, considering phase-shifted data allows us to simulate either partially similar detector datasets and/or cases where the FSL pickup is partially or badly modeled. It should be noted that the use of phase-shifted data is purely a means to emulate missing knowledge within the training dataset.
In this regard, the considered phase shift is not necessarily a representation of the real physical phenomena occurring within the satellite's optical system; it is rather a simplified scheme to demonstrate how the proposed inversion method responds to training on incomplete or inconsistent datasets.
All considered datasets consist of 747 093 984 observations packed into rings of size 27664, for a total ring count of N = 27006. We evaluate performance by presenting and comparing results obtained for the following approaches:
– A classic destriping (Planck Collaboration VIII 2016) of detector 545 FSL simulation data (referred to as CD hereafter),
– A direct fit of a FSL template computed from 20° phase-shifted detector 545 FSL simulation data onto detector 545 FSL simulation data (referred to as TFIT hereafter),
– The Decoder CNN in its original 1D version trained and applied directly on detector 545 FSL simulation data (referred to as CNN1D hereafter),
– The Decoder CNN in its original 1D version trained and applied directly on detector 545 FSL simulation data and considering an additional weighted map constraint (referred to as CNN1D-W_map hereafter),
– The Decoder CNN in its original 1D version trained on 20° phase-shifted detector 545 FSL simulation data and applied to non-shifted detector 545 FSL simulation data by retraining inputs only (referred to as CNN1D-TL hereafter).
Subsequently, we also perform additional tests to evaluate:
– The performance of the 2D alternative formulation of the Decoder CNN for the original case studies and datasets (referred to as, respectively, CNN2D, CNN2D-W_map and CNN2D-TL hereafter),
– The performance of the proposed algorithms when applied to a gap-filled dataset generated by subsampling available observations.
For comparison and benchmarking purposes, we include, among the methods considered, a classic destriping approach. This result is used as a baseline for evaluating the performance of the proposed methodology.
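For reference, the CD baseline can be sketched as the classic alternating estimation of a sky map and one offset per ring (a toy destriper on invented data, not the Planck Collaboration VIII (2016) implementation): rings that cross the same pixels constrain each other's offsets, which is exactly the redundancy the real destriper exploits.

```python
import numpy as np

def destripe(m, pix, ring, n_iter=50):
    """Toy destriper: alternately estimate the sky map and one offset per ring."""
    offsets = np.zeros(ring.max() + 1)
    for _ in range(n_iter):
        clean = m - offsets[ring]                                   # remove current offsets
        sky = np.bincount(pix, weights=clean) / np.bincount(pix)    # per-pixel mean
        resid = m - sky[pix]                                        # what the map misses
        offsets = np.bincount(ring, weights=resid) / np.bincount(ring)
        offsets -= offsets.mean()          # offsets are only defined up to a constant
    return sky, offsets
```

With three toy rings sharing pixels pairwise, the true ring offsets and the sky are recovered up to the global constant fixed by the zero-mean convention.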
For the considered methods, results are evaluated qualitatively by means of final full mission output maps and half mission difference maps, which are presented for visual comparison.
Regarding the processing of Planck-HFI data, several ways of splitting the datasets for their analysis are described in Planck Collaboration III (2020). Here, we use half mission difference maps, which are computed by dividing the whole time ordered data series in two equal halves, processing each half independently, and then computing the difference between the obtained maps. As such, half mission difference maps remove all spatially redundant information, allowing for the analysis of the information remaining once structured spatial signals are removed. This implies that half mission difference maps provide relevant information regarding the training of the Decoder CNN, since it is trained using a custom cost function that explicitly removes redundant spatial information, but they do not provide much information regarding real contamination source removal performance.
Moreover, a quantitative performance evaluation is given by means of the power spectra of the presented maps, which are computed using a spherical harmonics decomposition. In this spectral representation, the multipole scale number $\ell$ relates to different spatial angular scales. As such, the power spectrum depicts how energy is distributed across angular scales, thus providing a multi-scale measurement of the power per surface unit within the analyzed map.
The first considered case study involves exploiting data from the Planck-HFI 545 GHz channel only. In this context, transfer learning amounts to training the Decoder CNN on phase-shifted data from detector 545, and exploiting this network to process non-shifted data from detector 545 by retraining inputs only.
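A minimal sketch of the half mission difference (names and toy data are ours): any signal that is purely a function of the sky pixel cancels in the difference, while time-dependent residuals survive, which is why these maps probe what is left after spatial redundancy is removed.

```python
import numpy as np

def half_mission_difference(m, pix, t, n_pix):
    """Map the two halves of the time ordered data independently, then difference."""
    first_half = t < np.median(t)
    maps = []
    for sel in (first_half, ~first_half):
        hits = np.maximum(np.bincount(pix[sel], minlength=n_pix), 1).astype(float)
        maps.append(np.bincount(pix[sel], weights=m[sel], minlength=n_pix) / hits)
    return maps[0] - maps[1]
```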
As previously stated, such an approach can be considered as transfer learning despite the similarities between the two datasets, since the source and target learning tasks are different.
For the considered case study, the 1D Decoder CNN considers K = 8 and M = 3, so that the 1D Decoder CNN architecture uses 4 deconvolutional layers to project a total number of 8N inputs onto time ordered data binned into 128 phase bins. For the map constrained version of the Decoder CNN (trained with loss function (7)), the weight W_map was chosen empirically, as it produced the best results when testing the method's sensitivity to this parameter.
We further complement our performance study by analyzing and comparing the results obtained on FSL simulation data with the 2D variants of the Decoder CNN introduced in Sect. 2.2.2. To this end, we exploit a 2D Decoder CNN to process 545 GHz data under identical conditions as those analyzed for the 1D Decoder CNN. In this regard, the results for the 2D Decoder CNN were also obtained by considering K = 8 and M = 3, so that the considered 2D Decoder CNN consists of four 2D deconvolutional layers projecting its inputs onto the two-dimensionally binned time ordered data, while W_map is once again set to the same value.
Figure 3 presents, for the different considered approaches, the power spectra of the full mission maps and the half mission difference maps for the different variants considered. For a qualitative analysis of these results, Fig. 4 presents these full mission and half mission difference maps themselves. Additionally, Figs. 3 and 4 also include maps, and their corresponding power spectra, for the best result obtained when exploiting the 2D formulation of the Decoder CNN, i.e., for CNN2D-TL (exploiting CNN weights and biases learned on 20° phase-shifted detector 545 FSL simulation data). For the sake of simplicity and readability, the lesser performing variants of the 2D formulation are not included in Figs. 3 and 4. Moreover, since we consider here idealized synthetic simulation data, numerical results have no real physical interpretation, and are thus presented using arbitrary units.
Concerning the 1D variants of the Decoder CNN, Fig. 3 shows that both CNN1D and CNN1D-W_map provide a substantial gain for the filtering of smaller scale FSL structures, while not being able to accurately remove the large-scale FSL signature. CNN1D, however, obtains the best performance in terms of large-scale contamination source removal at the lowest multipoles. Similarly, from Fig. 3, one can observe a considerable gain at all spatial scales in the half mission difference maps when considering CNN1D and CNN1D-W_map. Globally, CNN1D seems to provide the more substantial gain for most spatial scales. In agreement with these findings, the half mission difference map for CNN1D is less energetic and closer to Gaussian white noise in space (even though some residual signal can still be observed) than those of the other analyzed Decoder CNN variants.
The use of CNN1D-TL, however, does seem to provide some gain at all spatial scales, even though it is marginal when compared to CNN1D and CNN1D-W_map, especially for smaller spatial scales. TFIT produces an even smaller performance gain, remaining quite close to the performance levels of CD, while CNN2D-TL appears as the worst performing variant, overall, for half mission difference map contamination source removal. Indeed, CNN2D-TL, the best performing 2D variant of the Decoder CNN, does not seem to provide much gain in contamination source removal performance for half mission difference maps, with a performance level slightly worse than TFIT at larger spatial scales, and a clear degradation in contamination source removal performance, with respect to CD, for smaller spatial scales. Overall, the half mission difference maps are in strong agreement with this analysis.
Given that previous results showed little degradation in terms of half mission difference map contamination source removal performance, the results presented in the following sections focus specifically on full mission contamination source removal performance. Moreover, for the sake of readability, we focus exclusively on a quantitative analysis by means of power spectral plots, and do not include additional map plots.
To further illustrate the relevance of transfer learning techniques, we now consider the previously introduced Planck-HFI 545 GHz dataset but sub-sample one every ten rings, which amounts to considering a partial dataset involving large gaps. We consider an identical configuration for the Decoder CNN as the one used for previously presented results, namely K = 8 and M = 3, for a total of 128 phase bins for the 1D Decoder CNN and 128 × 128 bins in phase and time for the 2D Decoder CNN. For the map constrained versions of the Decoder CNN, W_map is kept at its original value. We present similar results as those introduced in Sect. 4.1, i.e., full mission map power spectra, in Fig. 5.
Our initial analysis of the obtained results indicates that, given the large gaps in the considered dataset, the Decoder CNN tends to add a considerable spatial offset to the whole map in order to fill in those gaps. During our tests, this effect was partially limited by the additional map constraint, even though this does not suffice to completely remove the offset. From the full mission maps before removing offsets, we observed that both CNN1D and CNN1D-W_map were unable to correctly capture and filter the FSL signal. CNN1D-TL, however, considerably improves performance when considering partial datasets involving large gaps, most notably for smaller spatial frequencies.
After subtracting the spatial mean, we observe that performance is considerably improved, particularly for CNN1D-TL, which, among all 1D Decoder CNN variants, produces the best results for larger spatial scales, closely followed by CNN1D-W_map, which also presents the best overall performance for smaller scales. On the other hand, CNN1D is poorly suited to handle incomplete datasets involving large gaps, as can be concluded from its subpar performance with respect to CNN1D-W_map and CNN1D-TL. Similarly to previous results, none of the 1D variants are capable of outperforming TFIT, which does indeed present a better contamination source removal performance for large spatial scales. CNN2D-TL, however, outperforms TFIT for larger spatial scales, at the expense of a slightly worse contamination source removal performance, with respect to CD, for smaller spatial scales.
Fig. 3. Power spectra (in arbitrary units) of full mission maps (top) and half mission difference maps (bottom) of detector 545 FSL simulations after contamination source removal using 1000 iterations of a classic destriping approach (CD), a direct fit of phase-shifted detector 545 FSL simulation data as a template (TFIT), the original 1D Decoder CNN (CNN1D), the 1D Decoder CNN variants using the additional map constraint (CNN1D-W_map) and transfer learning by training the Decoder CNN weights and biases on 20° phase-shifted data from detector 545 FSL simulations (CNN1D-TL), and the 2D variant of the Decoder CNN using transfer learning by training the Decoder CNN weights and biases on 20° phase-shifted data from detector 545 FSL simulations (CNN2D-TL).
To further illustrate the relevance of transfer learning strategies to improve the characterization of large-scale systematic effects, we consider a case study involving 545 GHz FSL simulations with additional phase shift values. The primary objective is to evaluate the ability of the proposed approach to extract knowledge from an incomplete or inconsistent dataset that, nonetheless, contains relevant information that may be exploited to learn a suitable low-dimensional representation of the signals of interest. As previously explained, considering phase-shifted data at different phase shift values allows us to emulate both partially similar detectors as well as inaccurate FSL templates. As such, a phase-shifted version of the original FSL simulation is exploited to learn the Decoder CNN weights and biases, which are then subsequently applied to the mapmaking and contamination source removal of the original FSL simulation. Moreover, we also explore the possibility of combining multiple phase-shifted datasets as a means to construct an enriched dataset that better represents the relevant information to be learnt by the CNN.
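Such an enriched catalog of phase-shifted datasets can be emulated with circular shifts of a phase-binned template (a toy Gaussian-shaped stand-in for the FSL signature; the function name and shift values are our assumptions): each shifted copy is an equally valid realization of the same underlying structure.

```python
import numpy as np

B = 128                                       # phase bins per ring
phase = np.arange(B) / B
fsl = np.exp(-((phase - 0.3) ** 2) / 0.01)    # toy structured FSL-like template

def phase_shift(ring, deg):
    """Circularly shift a phase-binned ring by `deg` degrees (bin resolution)."""
    return np.roll(ring, int(round(deg / 360.0 * len(ring))))

# Augmented training catalog emulating partially similar detectors or
# imperfect FSL templates, one entry per phase shift value.
catalog = np.stack([phase_shift(fsl, d) for d in range(6, 21)])
```

Circular shifting preserves the template's values exactly, so the catalog varies only in the quantity the network is meant to absorb: the phase shift itself.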
To this end, the weights and biases computed from the phase-shifted FSL are fixed, and only the low-dimensional inputs are retrained on the original (non-shifted) FSL observations.
We evaluate performance by presenting and comparing results obtained for the following approaches:
– A classic destriping of 5° phase-shifted detector 545 FSL simulation data (referred to as CD hereafter),
– A direct fit of a FSL template computed from non-shifted detector 545 FSL simulation data onto 5° phase-shifted detector 545 FSL simulation data (referred to as TFIT_{0→5} hereafter),
– The Decoder CNN in its original 1D version trained on non-shifted detector 545 FSL simulation data and applied to 5° phase-shifted detector 545 FSL simulation data by retraining inputs only (referred to as CNN1D_{0→5} hereafter),
– The Decoder CNN in its 2D version trained on non-shifted detector 545 FSL simulation data and applied to 5° phase-shifted detector 545 FSL simulation data by retraining inputs only (referred to as CNN2D_{0→5} hereafter),
– The Decoder CNN in its 2D version trained on a catalog built from detector 545 FSL simulation data shifted by [6°, 7°, ..., 20°] and applied to 5° phase-shifted detector 545 FSL simulation data by retraining inputs only (referred to as CNN2D_{[6,20]→5} hereafter),
– The Decoder CNN in its 2D version trained on a catalog built from detector 545 FSL simulation data shifted by [0°, 1°, ..., 20°] and applied to 5° phase-shifted detector 545 FSL simulation data by retraining inputs only (referred to as CNN2D_{[0,20]→5} hereafter).
Given that we are interested in evaluating the potential of the transfer learning based 2D Decoder CNN to accurately learn the shape of FSL pickups, all considered networks rely on a single input for the low-dimensional representation of the signals of interest.
The principle behind such an architecture is that the CNN weights and biases will capture the overall shape of FSL pickups, which implies that the free low-dimensional input should capture the phase shift between the different datasets considered. As such, the 2D architecture consists of an initial fully connected layer that projects the single input onto K channels, followed by successive circular deconvolutional layers that progressively upsample these channels along the ring and phase dimensions, and a final circular deconvolutional layer that combines the K channels into a single two-dimensional output. For training, time ordered data is binned in ring and phase space to match the network output. For the present case study, we consider M = 3, so that the proposed network output relies on a 1024 × 512 binning of time ordered data in ring and phase space. We present similar results as those introduced in previous sections, i.e., full mission power spectra, in Fig. 6.
Fig. 4.
Full mission and half mission difference maps (in arbitrary units) of detector 545 FSL simulations after contamination source removal using 1000 iterations of a classic destriping approach (CD), a direct fit of phase-shifted detector 545 FSL simulation data as a template (TFIT), the original 1D Decoder CNN (CNN1D), the 1D Decoder CNN variants using the additional map constraint (CNN1D-W_map) and transfer learning by training the Decoder CNN weights and biases on 20° phase-shifted data from detector 545 FSL simulations (CNN1D-TL), and the 2D variant of the Decoder CNN using transfer learning by training the Decoder CNN weights and biases on 20° phase-shifted data from detector 545 FSL simulations (CNN2D-TL). Leftmost columns present full mission maps, rightmost columns present half mission difference maps.
From Fig. 6, we conclude that CNN1D_{0→5} is only able to marginally improve contamination source removal performance (with respect to CD) at smaller spatial scales. This is expected, as the 1D variant of the Decoder CNN processes each ring independently and thus has a limited potential to model two-dimensional information, which appears essential to accurately capture and model the phase difference to be transferred between the datasets involved. CNN2D_{0→5} performs similarly to TFIT_{0→5}, and both approaches provide a significant contamination source removal performance improvement at all spatial scales. Such a result is explained by the fact that, since the CNN was trained on non-shifted data, it is unable to model phase shifts, as this phenomenon is not accurately represented in the training dataset. Indeed, contamination source removal performance is considerably increased when CNN2D_{[6,20]→5} is considered, which further supports the fact that the inclusion of phase-shifted data is necessary to ensure that the trained CNN learns to accurately represent phase shifts.
Contamination source removal performance, particularly at larger spatial scales, is further improved with CNN2D[0°,20°]→5°, when additional phase-shifted data (between 0° and 4°) is considered. This is to be expected, as deep neural networks perform well for interpolation, but lack the necessary information to perform similarly well for extrapolation. Adding phase-shifted data for smaller phase shift values means that the 5° phase shift of the target dataset now lies inside the phase shift training range, and the trained network is better capable of modeling such a phase shift.

Following the validation of the proposed methodology on Planck-HFI 545 GHz FSL synthetic data, we evaluate its performance on real Planck-HFI 857 GHz observations. As previously explained, the Planck-HFI 857 GHz channel provides an ideal setting for evaluating the ability of the proposed approach to model and remove large-scale systematic effects, and FSL pickup in particular, given that the difference between the four 857 GHz detectors will mostly depict large-scale systematic effects, and predominantly the FSL pickup (Planck Collaboration III 2020). The 857 GHz dataset consists of time ordered data from four independent detectors (named hereafter 857-d, with d = 1, . . . , 4). The 2D Decoder CNN uses a first layer with K = 32 channels to produce a tensor of size [4, 2, K], followed by transposed convolutional layers producing tensors of sizes [16, 8, K], [64, 32, K], . . . , [4^(m+1), 2·4^m, K], . . . , [4^(M+1), 2·4^M, K]. A circular deconvolutional layer combines the existing K channels to produce a tensor of size [4^(M+2), 2·4^(M+1)]. For training, time ordered data is binned into 4^(M+2) × 2·4^(M+1) bins in ring and phase space, respectively, to match the network output.

Fig. 5.
Power spectra (in arbitrary units) of full mission maps of detector 545 FSL simulations considering one every ten rings after contamination source removal using 1000 iterations of a classic destriping approach (CD), a direct fit of phase-shifted detector 545 FSL simulation data as a template (TFIT), the original 1D Decoder CNN (CNN1D), the 1D Decoder CNN variants using the additional map constraint (CNN1D-Wmap) and transfer learning by training the Decoder CNN weights and biases on 20° phase-shifted data from detector 545 FSL simulations (CNN1D-TL), and the 2D variant of the Decoder CNN using transfer learning by training the Decoder CNN weights and biases on 20° phase-shifted data from detector 545 FSL simulations (CNN2D-TL).

We also explore the potential of data augmentation to integrate expert knowledge into the training of the Decoder CNN and thus provide enhanced modeling capabilities for the FSL pickup. To this end, the training dataset is enhanced by integrating information from all four detectors into the contamination source removal procedure of each individual detector. Specifically, for each detector, the training dataset is enriched by integrating the residues of the three remaining detectors with respect to a common reference detector, chosen because its position within the detector array effectively reduces its FSL pickup. The computation of these residues is performed after the data is binned in ring and phase spaces. We consider M = 3, so that time ordered data is initially binned into 1024 bins in ring space and 512 bins in phase space. Once datasets for the four detectors have been binned, each detector dataset is enriched by adding the residue, i.e., the difference, between the reference detector binned data and binned data from the three remaining detectors. These residues are then subjected to a thresholding procedure, such that all data whose absolute value is below a user-set threshold is set to 0. The idea behind this procedure is that the considered residues will not only contain relevant FSL pickups that can be used for training the Decoder CNN, but also other noise signals that should not be taken into account and that should, ideally, be filtered by the thresholding operation. As such, a coarse value for the threshold is set empirically by taking into account the noise levels within the considered dataset, and then fine-tuned by performing multiple simulations at different threshold values.

Fig. 6.
Power spectra of full mission maps of 5° phase-shifted detector 545 FSL simulations after contamination source removal using 1000 iterations of a classic destriping approach (CD→5°), a direct fit of non-shifted detector 545 FSL simulations onto 5° phase-shifted detector 545 FSL simulations as a template (TFIT→5°), the 1D Decoder CNN trained on non-shifted detector 545 FSL simulation data and applied to 5° phase-shifted detector 545 FSL simulation data (CNN1D→5°), the 2D Decoder CNN trained on non-shifted detector 545 FSL simulation data and applied to 5° phase-shifted detector 545 FSL simulation data (CNN2D→5°), the 2D Decoder CNN trained on a catalog built from detector 545 FSL simulation data shifted by [6°, . . . , 20°] and applied to 5° phase-shifted detector 545 FSL simulation data (CNN2D[6°,20°]→5°), and the 2D Decoder CNN trained on a catalog built from detector 545 FSL simulation data shifted by [0°, . . . , 20°] and applied to 5° phase-shifted detector 545 FSL simulation data (CNN2D[0°,20°]→5°).

Given that the threshold is user-set, this procedure can be seen as the integration of expert knowledge into the otherwise non-supervised procedure of network training. The final approach could therefore be qualified as a weakly supervised network training method. The proposed augmented datasets are used to train the Decoder CNN weights and biases (independently for each detector), with network inputs then being retrained directly on the original non-augmented detector datasets.

As previously explained, Planck-HFI 857 GHz detector difference maps are dominated by the FSL pickup signal, which makes them an ideal gauge for the capacity of the proposed approach to remove the FSL pickup from the final maps.
Taking this into account, we illustrate our results by presenting the power spectra of Planck-HFI 857 GHz detector difference maps in Fig. 7, and the Planck-HFI 857 GHz detector difference maps themselves in Fig. 8. For visualization and comparison purposes, all detector difference maps are normalized to a common baseline amplitude level, and any existing CO difference map signatures are removed using the same template fit procedure used by SRoll2 to produce the 2018 release of the Planck-HFI sky maps. We compare results for three different cases, namely:
– The mapmaking of Planck-HFI 857 GHz real data using a classic destriping approach (referred to as CD hereafter),
– The mapmaking of Planck-HFI 857 GHz real data using SRoll2 (Delouis et al. 2019) to produce a direct fit of a synthetic FSL simulation as a template (referred to as SRoll2 hereafter),
– The mapmaking of Planck-HFI 857 GHz real data using 1000 iterations of the 2D Decoder CNN exploiting data augmentation to include inter-detector residuals in the learning dataset (referred to as CNN2D-DA hereafter).
From the power spectra depicted in Fig. 7, we can conclude that the inter-detector data augmentation strategy, coupled with the introduction of expert knowledge via the thresholding of binned data residues, allows for a considerable improvement in contamination source removal performance at all spatial scales and for most detector pairs, with a considerable gain at larger spatial scales. As such, as far as large-scale contamination source removal is concerned, CNN2D-DA outperforms SRoll2 for most detector pairs. Indeed, large-scale contamination source removal performance is only marginally degraded for a single detector pair, and only at the largest spatial scales (low multipoles ℓ). These conclusions are further supported by the difference maps presented in Fig. 8, where one can observe a considerable improvement in contamination source removal performance for most detector pairs, with respect to SRoll2, for CNN2D-DA. Interestingly, a particularly strong large-scale signal can be observed to the north of the galactic plane, near the galactic origin. Given that we are working with Planck-HFI 857 GHz real data, we hypothesize that this signal is caused by other contamination sources, which explains the inability of CNN2D-DA to completely remove it, as it has been extensively adapted, in the presented application, to deal specifically with FSL pickups.
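The inter-detector residual augmentation with expert thresholding used by CNN2D-DA can be sketched as follows. This is a hypothetical numpy helper: the function name, the stacking convention, and the example values are our own, not SRoll3 code.

```python
import numpy as np

def augment_with_residues(binned, reference, threshold):
    """Enrich one detector's binned dataset with its thresholded residue
    with respect to a reference detector (illustrative sketch).

    binned, reference : (n_ring, n_phase) arrays of binned TOD.
    threshold : user-set value encoding expert knowledge; residue values
        below it in absolute value are treated as noise and zeroed, so
        that mostly structured (FSL-like) signal survives.
    """
    residue = binned - reference
    residue[np.abs(residue) < threshold] = 0.0
    # stack the original data and its cleaned residue as one training sample
    return np.stack([binned, residue])
```

The threshold is the point where expert knowledge enters: it is set coarsely from the known noise levels and then fine-tuned by repeated simulations, as described above.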
5. Discussion
As explained in Sect. 2.3, the Decoder CNN may introduce an erroneous large-scale signal into its reconstructed output. Indeed, since the Decoder CNN is trained on signal co-occurrences only, cost function (5) may be artificially decreased by adding an adequately chosen large-scale offset, whereas the introduction of this offset does not necessarily relate to the contamination source removal of the final map. According to our results, such a large-scale signature may appear in the form of a large-scale offset, or even of higher order moments, such as a large-scale spatial dipole. In particular, this was observed for the results presented in Sect. 4.1, specifically for the case considering one every ten rings, i.e., for partial datasets involving large gaps. This is expected, given that in such cases the lack of observations between the ecliptic poles is exacerbated, thus further strengthening this effect. As can be observed in our results, the introduction of a map constraint (Eq. (6)) helps limit the introduction of a large-scale offset, given that it improves the network conditioning in difficult cases, such as those considering partial, gap-filled or irregularly sampled datasets. This demonstrates both the flexibility of the proposed framework, which can be adapted to the dataset and/or problem to be treated by incorporating appropriate additional terms into the custom cost function (5), and its capability to adequately handle partial, gap-filled datasets.

As observed, the exploitation of transfer learning techniques allows for the characterization of the “shape” of the systematic effects or foregrounds we are trying to separate from our signal of interest. This is achieved by constraining the lower dimensional subspace onto which the considered signals are projected. The “shape” of systematic effects and foregrounds is indeed encoded into a projection operator, which is parameterized by the Decoder CNN, by minimizing the loss function on the training dataset.
The trained Decoder CNN is then applied to a second dataset by retraining the inputs only. As previously stated, this can be seen as a way of identifying and learning the knowledge common to the different datasets (i.e., the projection) and transferring such knowledge between them. Such an approach is particularly relevant for applications where similar foregrounds or systematic effects exist between different datasets, as is the case for the FSL pickup. Indeed, in the presented application, the Decoder CNN training stage seems to learn general characteristics of the FSL pickup signal, such as its large-scale signature, which are then transferred to the second dataset (by retraining inputs) in order to improve contamination source removal performance. From a mathematical point of view, retraining the inputs can be thought of as finding the representation in the projection subspace that best approximates the second dataset. This amounts to finding the best fitting FSL pickup signal approximation, under the constraint that the characteristics of this approximation were previously learned on the first dataset and encoded in the Decoder CNN weights and biases.

Results obtained for Planck-HFI 857 GHz real data illustrate how data augmentation techniques, coupled with expert knowledge integration, can improve contamination source removal performance. Indeed, introducing, for each 857 GHz detector, inter-detector residuals with respect to a common reference detector amounts to exploiting data augmentation to transfer relevant information between datasets. As such, this procedure closely relates to the idea of transfer learning, since both seek to exploit information shared between datasets to improve contamination source removal performance. Moreover, the inclusion of a user-set threshold for inter-detector binned data residues allows us to integrate expert knowledge into an otherwise completely unsupervised learning scheme.
This is particularly relevant for the processing of data containing both well-known and poorly modeled signals, as is the case for the systematic effects and foregrounds present in Planck-HFI observations.
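The input-retraining step underlying this transfer learning scheme can be sketched as follows. This is a minimal PyTorch sketch: the `retrain_inputs` name, the Adam optimizer, and the plain quadratic fit term (standing in for the full custom cost (5)) are illustrative assumptions.

```python
import torch

def retrain_inputs(decoder, target, z_init, steps=500, lr=1e-2):
    """Transfer learning by input retraining: freeze the decoder weights
    and biases (the learned projection) and optimize only the latent
    input so that the decoded output best fits the second dataset."""
    for p in decoder.parameters():
        p.requires_grad_(False)          # the projection is kept fixed
    z = z_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # quadratic data-fit term; the actual cost would follow Eq. (5)
        loss = torch.mean((decoder(z) - target) ** 2)
        loss.backward()
        opt.step()
    return z.detach()
```

The returned latent vector is the representation, within the frozen projection subspace, that best approximates the second dataset, which is exactly the mathematical reading given above.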
6. Conclusions
In the present work, we propose a neural network based data inversion approach to reduce structured contamination sources, with a particular focus on the mapmaking for Planck-HFI data and the removal of large-scale systematic effects within the produced sky maps. The proposed approach relies on a generative decoder convolutional neural network to project the signals of interest onto a learned low-dimensional subspace simultaneously with the data inversion, so that the low-dimensional subspace is optimized with respect to the contamination source removal and mapmaking objectives. This optimization is achieved by means of a loss function that takes such objectives into account during the network training stage. The exploitation of such a custom loss function also allows for the introduction of physics-based constraints to further improve contamination source removal performance. The low-dimensional subspace learning is possible thanks to an input-training scheme, which also allows for the processing of incomplete and/or gap-filled datasets. We propose multiple variants of the proposed approach: a two-dimensional version capable of taking time dependencies into account, as well as variants exploiting transfer learning, data augmentation and the introduction of expert knowledge to further improve reconstruction performance. Importantly, the proposed method is capable of exploiting spatiotemporal scale couplings within contamination sources to learn, simultaneously with the data inversion, a low-dimensional representation that facilitates the removal of these contamination sources.

Fig. 7.
Power spectra for detector difference maps of Planck-HFI 857 GHz real data. For all detector pairs, power spectra are computed from detector difference maps for three distinct cases: the mapmaking of Planck-HFI 857 GHz real data using a classic destriping approach (CD), the mapmaking of Planck-HFI 857 GHz real data using SRoll2 to produce a direct fit of a synthetic FSL simulation as a template (SRoll2), and the mapmaking of Planck-HFI 857 GHz real data using 1000 iterations of the 2D Decoder CNN exploiting data augmentation to include inter-detector residuals in the learning dataset (CNN2D-DA).

Whereas this is illustrated here for an example considering Planck-HFI data, the method provides a general framework for structured contamination source removal, and may be used to tackle similar problems in other scientific contexts. Indeed, the proposed approach can potentially be applied to any data inversion problem dealing with contamination sources, provided that these sources are sufficiently structured to allow for the determination of a suitable low-dimensional subspace, optimized to facilitate the data inversion.

We validate the proposed approach on synthetic 545 GHz Planck-HFI data comprising simulated FSL pickups. This validation on synthetic datasets demonstrates the relevance of the two-dimensional variant of the proposed approach to better remove FSL pickup signals simultaneously with the data inversion, with respect to both a classic destriping approach and the direct fit of simulated FSL pickups as a template, particularly for partial, gap-filled observation datasets (comprising a subsampling of one every ten rings). Moreover, the relevance of the two-dimensional variant to efficiently exploit transfer learning approaches to model and capture phase shifts in observations is also demonstrated during the validation on synthetic simulated data.

Fig. 8.
Detector difference maps of Planck-HFI 857 GHz real data. For all detector pairs, detector difference maps are computed for three distinct cases: the mapmaking of Planck-HFI 857 GHz real data using a classic destriping approach (not shown), the mapmaking of Planck-HFI 857 GHz real data using SRoll2 to produce a direct fit of a synthetic FSL simulation as a template (SRoll2, two leftmost columns), and the mapmaking of Planck-HFI 857 GHz real data using 1000 iterations of the 2D Decoder CNN exploiting data augmentation to include inter-detector residuals in the learning dataset (CNN2D-DA, two rightmost columns).

Following validation, we further explore the proposed approach by applying it to the contamination source removal and mapmaking of real 857 GHz Planck-HFI observations. We exploit the two-dimensional variant of the proposed method, alongside data augmentation, to demonstrate the relevance of the proposed framework to outperform both a classic destriping approach and a direct fit of FSL pickup as a template for the removal of large-scale systematic effects in real data.
In particular, the case study clearly depicts how inter-detector data augmentation and the integration of expert knowledge, by means of a user-set threshold for noise removal in the augmented dataset, allow for a considerable gain in terms of FSL pickup removal, thus improving mapmaking and contamination source removal performance.

Generally speaking, the present work underlines the relevance of data-driven neural network based approaches to improve on current contamination source removal and mapmaking approaches and go beyond their limitations by providing enhanced capabilities for the separation and removal of structured, non-Gaussian information, such as systematic effects and foregrounds. This should allow for the creation of more accurate CMB maps and thus improve current parameter likelihood estimates in order to better constrain and/or validate cosmological models.

Importantly, this work builds on previously developed methods for the separation and removal of structured contamination sources, and particularly on the SRoll2 algorithm (Delouis et al. 2019). As such, the methods developed in this work are to be integrated in a new version of the SRoll algorithm (SRoll3), and we describe here SRoll3 857 GHz detector maps that will be released to the community.

The possible research avenues stemming from the proposed approach include a wide array of both theoretical and practical issues. Whereas, in this work, we illustrated the relevance of the proposed approach for the modeling and removal of systematic effects, we underline the suitability of the proposed methodology for the modeling and removal of any structured signal, including modeling errors, observation errors, and foregrounds, among others.
This implies that the proposed framework can be applied to a wide range of similar problems in multiple scientific domains, ranging from the mapmaking and contamination source removal of Planck-HFI data to the removal of structured noise sources in new generation ocean remote sensing satellite missions, or even the processing of ground-borne and balloon-borne sky observations. Furthermore, one may also consider, for example, exploiting the proposed Decoder CNN to apply transfer learning techniques to the component separation problem in Planck data. In this regard, a multi-channel Decoder CNN could be exploited to separate different components, with different channels representing different sources. In this context, transfer learning techniques could be used on specific channels to better capture the source considered, similarly to the approach illustrated above for the FSL foreground. The modeling and correction of Analog-to-Digital Converter (ADC) non-linearities (Planck Collaboration VII 2016) also appears as a current issue that could greatly benefit from the proposed transfer learning based formulation. Indeed, we expect that exploiting transfer learning should allow us to better understand and model the ADC non-linearities that exist within the Planck-HFI data by exploiting simulated and/or real data to learn a low-dimensional representation where such non-linearities may become easier to correct. Finally, the processing of ground-based cosmological observations may also be considered as a potential application of the proposed approach, particularly with respect to the removal of atmospheric turbulence related noise, given its slow temporal variation.

Acknowledgements.
This work is part of the Bware project supported by CNES, and part of the Deepsee project supported by the Programme National de Télédétection Spatiale of the CNRS Institut des Sciences de l'Univers.

References
Aharon, M., Elad, M., & Bruckstein, A. 2006, IEEE Transactions on Signal Processing, 54, 4311
Allys, E., Levrier, F., Zhang, S., et al. 2019, arXiv preprint arXiv:1905.01372
Armitage-Caplan, C. & Wandelt, B. D. 2009, ApJS, 181, 533
Baldi, P. & Hornik, K. 1989, Neural Networks, 2, 53
Bengio, Y., Courville, A., & Vincent, P. 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence, 35, 1798
Böhme, T. J., Fletcher, I., & Cox, C. S. 1999, e&i Elektrotechnik und Informationstechnik, 116, 375
Bojanowski, P., Joulin, A., Lopez-Pas, D., & Szlam, A. 2018, in International Conference on Machine Learning, 599–608
Bouakkaz, M. & Harkat, M.-F. 2012, in IJCCI
Bourlard, H. & Kamp, Y. 1988, Biological Cybernetics, 59, 291
Bruna, J., Mallat, S., Bacry, E., Muzy, J.-F., et al. 2015, The Annals of Statistics, 43, 323
Choi, S., Cichocki, A., Park, H.-M., & Lee, S.-Y. 2005, Neural Information Processing-Letters and Reviews, 6, 1
de Gasperis, G., Balbi, A., Cabella, P., Natoli, P., & Vittorio, N. 2005, A&A, 436, 1159
Delouis, J.-M., Pagano, L., Mottet, S., Puget, J.-L., & Vibert, L. 2019, A&A, 629, A38
DeMers, D. & Cottrell, G. W. 1993, in Advances in Neural Information Processing Systems, 580–587
Denton, E. L., Chintala, S., Fergus, R., et al. 2015, in Advances in Neural Information Processing Systems, 1486–1494
Doré, O., Teyssier, R., Bouchet, F., Vibert, D., & Prunet, S. 2001, A&A, 374, 358
Dumoulin, V. & Visin, F. 2016, arXiv preprint arXiv:1603.07285
Erguo, Y. & Jinshou, Y. 2002, in Proceedings of the 4th World Congress on Intelligent Control and Automation, Vol. 4, 2755–2759
Erichson, N. B., Muehlebach, M., & Mahoney, M. W. 2019, arXiv preprint arXiv:1905.10866
Fan, J. & Cheng, J. 2018, Neural Networks, 98, 34
Geng, Z. & Zhu, Q. 2005, Industrial & Engineering Chemistry Research, 44, 3585
Goodfellow, I., Pouget-Abadie, J., Mirza, M., et al. 2014, in Advances in Neural Information Processing Systems, 2672–2680
Górski, K. M., Hivon, E., Banday, A. J., et al. 2005, ApJ, 622, 759
Hassoun, M. H. & Sudjianto, A. 1997, in Workshop on Advances in Autoencoder/Autoassociator-Based Computations at NIPS, Vol. 97, 605–611
Hinton, G. E. & Salakhutdinov, R. R. 2006, Science, 313, 504
Jia, F., Martin, E., & Morris, A. 1998, Computers & Chemical Engineering, 22, S851, European Symposium on Computer Aided Process Engineering-8
Karpatne, A., Atluri, G., Faghmous, J. H., et al. 2017, IEEE Transactions on Knowledge and Data Engineering, 29, 2318
Keihänen, E., Keskitalo, R., Kurki-Suonio, H., Poutanen, T., & Sirviö, A. 2010, A&A, 510, A57
Keihänen, E., Kurki-Suonio, H., & Poutanen, T. 2005, MNRAS, 360, 390
Kramer, M. A. 1991, AIChE Journal, 37, 233
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. 1998, Proceedings of the IEEE, 86, 2278
Lee, J. A. & Verleysen, M. 2007, Nonlinear Dimensionality Reduction (Springer Science & Business Media)
Liu, F. & Zhao, Z. 2004, in Advances in Neural Networks – ISNN 2004, ed. F.-L. Yin, J. Wang, & C. Guo (Berlin, Heidelberg: Springer), 798–803
Lusch, B., Kutz, J. N., & Brunton, S. L. 2018, Nature Communications, 9, 4950
Lutter, M., Ritter, C., & Peters, J. 2019, in International Conference on Learning Representations
Maino, D., Burigana, C., Górski, K. M., Mandolesi, N., & Bersanelli, M. 2002, A&A, 387, 356
McCann, M. T., Jin, K. H., & Unser, M. 2017, IEEE Signal Processing Magazine, 34, 85
Mordvintsev, A., Olah, C., & Tyka, M. 2015, Google Research, 2
Nabian, M. A. & Meidani, H. 2018, arXiv preprint arXiv:1810.05547
Nandi, S., Mukherjee, P., Tambe, S. S., Kumar, R., & Kulkarni, B. D. 2002, Industrial & Engineering Chemistry Research, 41, 2159
Natoli, P., de Gasperis, G., Gheller, C., & Vittorio, N. 2001, A&A, 372, 346
Pan, S. J. & Yang, Q. 2009, IEEE Transactions on Knowledge and Data Engineering, 22, 1345
Park, J. J., Florence, P., Straub, J., Newcombe, R., & Lovegrove, S. 2019, arXiv preprint arXiv:1901.05103
Planck Collaboration ES. 2018, The Legacy Explanatory Supplement, http://wiki.cosmos.esa.int/planck-legacy-archive (ESA)
Planck Collaboration VIII. 2014, A&A, 571, A8
Planck Collaboration III. 2016, A&A, 594, A3
Planck Collaboration VII. 2016, A&A, 594, A7
Planck Collaboration VIII. 2016, A&A, 594, A8
Planck Collaboration III. 2020, A&A, 641, A3
Poutanen, T., de Gasperis, G., Hivon, E., et al. 2006, A&A, 449, 1311
Prunet, S., Ade, P. A. R., Bock, J. J., et al. 2001, e-print
Radford, A., Metz, L., & Chintala, S. 2015, arXiv preprint arXiv:1511.06434
Raissi, M. & Karniadakis, G. E. 2018a, Journal of Computational Physics, 357, 125
Raissi, M. & Karniadakis, G. E. 2018b, Journal of Computational Physics, 357, 125
Raissi, M., Perdikaris, P., & Karniadakis, G. E. 2017a, arXiv preprint arXiv:1711.10561
Raissi, M., Perdikaris, P., & Karniadakis, G. E. 2017b, arXiv preprint arXiv:1711.10566
Raissi, M., Yazdani, A., & Karniadakis, G. E. 2018, arXiv preprint arXiv:1808.04327
Reddy, V. & Mavrovouniotis, M. 1998, Chemical Engineering Research and Design, 76, 478, Process Operations and Control
Reddy, V. N., Riley, P. M., & Mavrovouniotis, M. L. 1996, Computers & Chemical Engineering, 20, S889, European Symposium on Computer Aided Process Engineering-6
Roscher, R., Bohn, B., Duarte, M. F., & Garcke, J. 2019, CoRR
ffner, S. M., et al. 2006, Microbial Ecology, 51, 177
Seo, S. & Liu, Y. 2019, arXiv preprint arXiv:1902.02950
Tan, S. & Mayrovouniotis, M. L. 1995, AIChE Journal, 41, 1471
Tang, H., Scaife, A. M. M., & Leahy, J. P. 2019, MNRAS, 488, 3358
Tauber, J. A., Mandolesi, N., Puget, J., et al. 2010, A&A, 520, A1
Tenenbaum, J. B., Silva, V. d., & Langford, J. C. 2000, Science, 290, 2319
Van Der Maaten, L., Postma, E., & Van den Herik, J. 2009, Journal of Machine Learning Research, 10, 13
Yang, Y. & Perdikaris, P. 2018, arXiv preprint arXiv:1812.03511
Yang, Y. & Perdikaris, P. 2019, Journal of Computational Physics, 394, 136
Zhu, Q. & Li, C. 2006, Chinese Journal of Chemical Engineering, 14, 597