Spectral Reconstruction with Deep Neural Networks
Lukas Kades, Jan M. Pawlowski, Alexander Rothkopf, Manuel Scherzer, Julian M. Urban, Sebastian J. Wetzel, Nicolas Wink, Felix P. G. Ziegler
Institut für Theoretische Physik, Universität Heidelberg, Philosophenweg 16, D-69120 Heidelberg, Germany
ExtreMe Matter Institute EMMI, GSI, Planckstr. 1, D-64291 Darmstadt, Germany
Faculty of Science and Technology, University of Stavanger, NO-4036 Stavanger, Norway
We explore artificial neural networks as a tool for the reconstruction of spectral functions from imaginary time Green's functions, a classic ill-conditioned inverse problem. Our ansatz is based on a supervised learning framework in which prior knowledge is encoded in the training data and the inverse transformation manifold is explicitly parametrised through a neural network. We systematically investigate this novel reconstruction approach, providing a detailed analysis of its performance on physically motivated mock data, and compare it to established methods of Bayesian inference. The reconstruction accuracy is found to be at least comparable, and potentially superior in particular at larger noise levels. We argue that the use of labelled training data in a supervised setting and the freedom in defining an optimisation objective are inherent advantages of the present approach and may lead to significant improvements over state-of-the-art methods in the future. Potential directions for further research are discussed in detail.
I. INTRODUCTION
Machine Learning has been applied to a variety of problems in the natural sciences. For example, it is regularly deployed in the interpretation of data from high-energy physics detectors [1, 2]. Algorithms based on learning have proven to be highly versatile, with their use extending far beyond the original design purpose. In particular, deep neural networks have demonstrated unprecedented levels of prediction and generalisation performance, for reviews see e.g. [3, 4]. Machine Learning architectures are also increasingly deployed for a variety of problems in the theoretical physics community, ranging from the identification of phases and order parameters to the acceleration of lattice simulations [5-15].

Ill-conditioned inverse problems lie at the heart of some of the most challenging tasks in modern theoretical physics. One pertinent example is the computation of real-time properties of strongly correlated quantum systems. Take e.g. the phenomenon of energy and charge transport, which so far has defied a quantitative understanding from first principles. This universal phenomenon is relevant to systems at vastly different energy scales, ranging from ultracold quantum gases created with optical traps to the quark-gluon plasma born out of relativistic heavy-ion collisions.

While static properties of strongly correlated quantum systems are by now well understood and routinely computed from first principles, a similar understanding of real-time properties is still subject to ongoing research. The thermodynamics of strongly coupled systems, such as the quark-gluon plasma, has been explored using the combined strength of different non-perturbative approaches, such as functional renormalisation group methods and lattice field theory calculations. Two limitations affect most of these approaches: Firstly, in order to carry out quantitative computations, time has to be analytically continued into the complex plane, to so-called Euclidean time. Secondly, explicit computations are either fully numerical or at least involve intermediate numerical steps.

This leaves us with the need to eventually undo the analytic continuation of Euclidean correlation functions, which are known only approximately. The most relevant examples are two-point functions, the so-called Euclidean propagators. The spectral representation of quantum field theory relates the propagators, be they in Minkowski or Euclidean time, to a single function encoding their physics, the so-called spectral function. The number of different structures contributing to a spectral function is in general quite limited and consists of poles and cuts. If we can extract from the Euclidean two-point correlator its spectral function, we may immediately compute the corresponding real-time propagator.

If we know the Euclidean propagator analytically, this information allows us in principle to recover the corresponding Minkowski time information. In practice, however, the limitation of having to approximate correlator data (e.g. through simulations) turns the computation of spectral functions into an ill-conditioned problem. The most common approach to give meaning to such inverse problems is Bayesian inference. It incorporates additional prior domain knowledge we possess on the shape of the spectral function to regularise the inversion task. The positivity of hadronic spectral functions is one prominent example. The Bayesian approach has seen continuous improvement over the past two decades in the context of spectral function reconstructions.
While originally it was restricted to maximum a posteriori estimates for the most probable spectral function given Euclidean data and prior information [16-18], in its most modern form it amounts to exploring the full posterior distribution [19]. An important aspect of the work is to develop appropriate mock data tests to benchmark the reconstruction performance before applying it to actual data. Generally, the success of a reconstruction method stands or falls with its performance on physical data. While this seems evident, it was in fact a hard lesson learned in the history of Bayesian reconstruction methods, a lesson which we want to heed.

Inverse problems of this type have also drawn quite some attention in the machine learning (ML) community [20-23]. In the present work we build upon both the recent progress in the field of ML, particularly deep learning, as well as results and structural information gathered in the past decades from Bayesian reconstruction methods. We set out to isolate a property of neural networks that holds the potential to improve upon the standard Bayesian methods, while retaining their advantages, utilising the insight already gathered in their study.

Consider a feed-forward deep neural network that takes Euclidean propagator data as input and outputs a prediction of the associated spectral function. Although the reasoning behind this ansatz is rather different, one can draw parallels to more traditional methods. In the Bayesian approach, prior information is explicitly encoded in a prior probability functional and the optimisation objective is the precise recovery of the given propagator data from the predicted spectral function. In contrast, the neural network based reconstruction is conditioned through supervised learning with appropriate training data. This corresponds to implicitly imposing a prior distribution on the set of possible predictions, which, as in the Bayesian case, regularises the reconstruction problem. Optimisation objectives are now expressed in terms of loss functions, allowing for greater flexibility. In fact, we can explicitly provide pairs of correlator and spectral function data during the training. Hence, not only can we aim for the recovery of the input data from the predictions as in the Bayesian approach, but we are now also able to formulate a loss directly on the spectral functions themselves. This constitutes a much stronger restriction on potential solutions for individual propagators, which could provide a significant advantage over other methods. The possibility to access all information of a given sample with respect to its different representations also allows the exploration of a much broader set of loss functions, which could benefit not only the neural network based reconstruction, but also lead to a better understanding and circumvention of obstacles related to the inverse problem itself. Such an obstacle is given, for example, by the varying severity of the problem within the space of considered spectral functions. By employing adaptive losses, inhomogeneities of this type could be neutralised.

Similar approaches concerning spectral functions that consist of normalised sums of Gaussian peaks have already been discussed in [24, 25]. In this work, we investigate the performance of such an approach using mock data of physical resonances motivated by quantum field theory, and compare it to state-of-the-art Bayesian methods.
The data are given in the form of linear combinations of unnormalised Breit-Wigner peaks, whose distinctive tail structures introduce additional difficulties. Using only a rather naive implementation, the performance of our ansatz is demonstrated to be at least comparable and potentially superior, particularly for large noise levels. We then discuss potential improvements of the architecture, which in the future could establish neural networks as a state-of-the-art approach for accurate reconstructions with a reliable estimation of errors.

The paper is organised as follows. The spectral reconstruction problem is defined in Section II A. State-of-the-art Bayesian reconstruction methods are summarised in Section II B. In Section II C we discuss the application of neural networks and potential advantages. Section III contains details on the design of the networks and defines the optimisation procedure. Numerical results are presented and compared to Bayesian methods in Section IV. We summarise our findings and discuss future work in Section V.

II. SPECTRAL RECONSTRUCTION AND POTENTIAL ADVANTAGES

A. Defining the problem
Typically, correlation functions in equilibrium quantum field theories are computed in imaginary time after a Wick rotation t → it ≡ τ, which facilitates both analytical and numerical computations. In strongly correlated systems, a numerical treatment is in most cases inevitable. Such a setup leaves us with the task to reconstruct relevant information, such as the spectrum of the theory, or genuine real-time quantities such as transport coefficients, from the Euclidean data.

The information we want to access is encoded in the associated spectral function ρ. For this purpose it is most convenient to work in momentum space both for ρ and the corresponding propagator G. The relation between the Euclidean propagator and the spectral function is given by the well-known Källén-Lehmann spectral representation,

G(p) = \int_0^\infty \frac{d\omega}{\pi} \frac{\omega \, \rho(\omega)}{\omega^2 + p^2} \equiv \int_0^\infty d\omega \, K(p, \omega) \, \rho(\omega),   (1)

which defines the corresponding Källén-Lehmann kernel. The propagator is usually only available in the form of numerical data, with finite statistical and systematic uncertainties, on a discrete set of N_p points, which we abbreviate as G_i = G(p_i). The most commonly used approach is to work directly with a discretised version of (1). We utilise the same abbreviation for the spectral function, i.e. ρ_i = ρ(ω_i), discretised on N_ω points. This lets us state the discrete form of (1) as

G_i = \sum_{j=1}^{N_\omega} K_{ij} \, \rho_j,   (2)

where K_{ij} = K(p_i, ω_j) Δω_j. This amounts to a classic ill-conditioned inverse problem, similar in nature to those encountered in many other fields, such as medical imaging or the calibration of option pricing methods.
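To make the severity of this inversion concrete, the following minimal sketch builds the discretised kernel K_ij of (2) and evaluates its condition number, which quantifies how strongly noise on G_i is amplified by any naive inversion. The momentum and frequency grids here are illustrative assumptions; the grids actually used in this work are specified in Appendix C.

```python
import numpy as np

# Hypothetical discretisation grids; the actual choices are given in Appendix C.
N_p, N_omega = 100, 500
p = np.linspace(0.0, 10.0, N_p)
omega, d_omega = np.linspace(1e-3, 10.0, N_omega, retstep=True)

# K_ij = K(p_i, omega_j) * Delta(omega_j) with K(p, w) = (1/pi) * w / (w^2 + p^2),
# cf. the Kallen-Lehmann kernel defined in (1).
K = (omega[None, :] / np.pi) / (omega[None, :]**2 + p[:, None]**2) * d_omega

def propagator(rho):
    """Forward map (2): G_i = sum_j K_ij rho_j for a discretised rho."""
    return K @ rho

# The huge condition number makes the problem ill-conditioned: even tiny
# noise on G is amplified beyond control by a direct (pseudo-)inverse.
print(f"condition number of K: {np.linalg.cond(K):.2e}")
```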
FIG. 1. Examples of mock spectral functions reconstructed via our neural network approach for the cases of one, two and three Breit-Wigner peaks. The chosen functions mirror the desired locality of suggested reconstructions around the original function (red line). Additive Gaussian noise of width 10^− is added multiple times to the discretised analytic form of the propagator associated with the same original spectral function. The shaded area depicts for each frequency ω the distribution of resulting outcomes, while the dashed green line corresponds to the mean. The results are obtained from the FC parameter network optimised with the parameter loss. The network is trained on the largest defined parameter space, which corresponds to the volume Vol O. The uncertainty of the reconstructions decreases for smaller volumes, as illustrated in Figure 5. A detailed discussion of the properties and problems of a neural network based reconstruction is given in Section IV A.

Typical errors on the input data G(p_i) are of the order of 10^− to 10^− when the propagator at zero momentum is of the order of unity.

To appreciate the problems arising in such a reconstruction more clearly, let us assume we have a suggestion for the spectral function ρ_sug and its corresponding propagator G_sug. The difference to the measured data is encoded in

\| G(p) - G_{sug}(p) \| = \left\| \int_0^\infty \frac{d\omega}{\pi} \frac{\omega}{\omega^2 + p^2} \left[ \rho(\omega) - \rho_{sug}(\omega) \right] \right\|,   (3)

with a suitable norm ‖·‖. Evidently, even if this expression vanishes point-wise, i.e. ‖G(p_i) − G_sug(p_i)‖ = 0 for all p_i, the spectral function is not uniquely fixed. Experience has shown that with typical numerical errors on the input data, qualitatively very different spectral functions can lead to propagators that are equivalent in this sense. This situation can often be improved on by taking more prior knowledge into account, c.f. the discussion in [26]. This includes properties such as:

1. Normalisation and positivity of spectral functions of asymptotic states. For gauge theories, this may reduce to just the normalisation to zero, expressed in terms of the Oehme-Zimmermann superconvergence [27, 28].

2. Asymptotic behaviour of the spectral function at low and high frequencies.

3. The absence of clearly unphysical features, such as drastic oscillations in the spectral function and the propagator.

Additionally, the parametrisation of the spectral function in terms of frequency bins is just one particular basis. In order to make reconstructions more feasible, other choices, and in particular physically motivated ones, may be beneficial, c.f. again the discussion in [26]. In this work, we consider a basis formulated in terms of physical resonances, i.e. Breit-Wigner peaks.

B. Existing methods
The inverse problem as defined in (1) has an exact solution in the case of exactly known, discrete correlator data [29]. However, as soon as noisy inputs are considered, this approach turns out to be impractical [30]. Therefore, the most common strategy to treat this problem is via Bayesian inference. This approach is based on Bayes' theorem, which states that the posterior probability is essentially given by two terms, the likelihood function and the prior probability:

P(\rho | D, I) \propto P(D | \rho, I) \, P(\rho | I).   (4)

It explicitly includes additionally available prior information on the spectral function in order to regularise the inversion task. The likelihood P(D|ρ) encodes the probability for the input data D to have arisen from the test spectral function ρ, while P(ρ) quantifies how well this test function agrees with the available prior information. The two probabilities fully specify the posterior distribution in principle; however, they may be known only implicitly. In order to gain access to the full distribution, one may sample from the posterior, e.g. through a Markov Chain Monte Carlo process in the parameter space of the spectral function. However, in practice one is often content with the maximum a posteriori (MAP) solution. Given a uniform prior, the MAP estimate coincides with the Maximum Likelihood Estimate (MLE).
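For illustration, the following sketch computes a MAP estimate by minimising the negative log-posterior, a χ² likelihood term plus a regulator. The quadratic default-model regulator, the grids, the mock spectrum and the weight alpha are hypothetical stand-ins; the entropy-based regulators actually used by the MEM and BR methods are given in Appendix A.

```python
import numpy as np
from scipy.optimize import minimize

# Toy setup (assumptions): kernel, true spectrum, noisy data, default model.
omega, dw = np.linspace(1e-3, 10.0, 200, retstep=True)
p = np.linspace(0.0, 10.0, 50)
K = (omega / np.pi) / (omega**2 + p[:, None]**2) * dw
rho_true = np.exp(-0.5 * (omega - 3.0)**2)        # hypothetical spectrum
sigma = 1e-3                                      # noise width on G
G_noisy = K @ rho_true + sigma * np.random.randn(len(p))
m = np.full_like(omega, rho_true.mean())          # flat default model m(omega)
alpha = 1e-4                                      # hypothetical regulator weight

def neg_log_posterior(rho):
    chi2 = np.sum((K @ rho - G_noisy)**2) / sigma**2  # likelihood P(D|rho)
    reg = alpha * np.sum((rho - m)**2)                # prior P(rho) as regulator
    return 0.5 * chi2 + reg

# The MAP estimate is the most probable spectral function given data and prior.
rho_map = minimize(neg_log_posterior, x0=m, method="L-BFGS-B").x
```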
C. Advantages of neural networks

In order to make genuine progress, we set out in this study to explore methods in which our prior knowledge of the analytic structure can be encoded in different ways. To this end, our focus lies on the use of Machine Learning in the form of artificial neural networks. These feature a high flexibility in the encoding of information by learning abstract internal representations. They possess the advantageous properties that prior information can be explicitly provided through the training data, and that the solution space can be regularised by choosing appropriate loss functions.

Minimising (3), while respecting the constraints discussed in Section II A, can be formulated as minimising a loss function

L_G(\rho_{sug}) = \| G[\rho_{sug}] - G[\rho] \|.   (5)

This corresponds to indirectly working on a norm or loss function for ρ,

L_\rho(\rho_{sug}) = \| \rho_{sug} - \rho \|.   (6)

Of course, the optimisation problem as given by (6) is intractable, since it requires the knowledge of the true spectral function ρ. Minimising L_ρ(ρ_sug) for a given set of {ρ_sug} also minimises L_G, since the Källén-Lehmann representation (1) is a linear map. In turn, however, minimising L_G does not uniquely determine the spectral function, as has already been mentioned. Accordingly, the key to optimising the spectral reconstruction is the ideal use of all known constraints on ρ, in order to better condition the problem. The latter aspect has fueled many developments in the area of spectral reconstructions in the past decades.

Given the complexity of the problem, as well as the interrelation of the constraints, this calls, in our opinion, for an application of supervised machine learning algorithms for an optimal utilisation of constraints. To demonstrate our reasoning, we generate a training set of known pairs of spectral functions and propagators and train a neural network to reconstruct ρ from G by minimising a suitable norm, utilising both L_G and L_ρ during the training. When the network has converged, it can be applied to measured propagator data G for which the corresponding ρ is unknown.

Estimators learning from labelled data provide a potentially significant advantage due to the employed supervision, because the loss function is minimised a priori for a whole range of possible input/output pairs. Accordingly, a neural network aims to learn the entire set of inverse transformations for a given set of spectral functions. After this mapping has been approximated to a sufficient degree, the network can be used to make predictions. This is in contrast to standard Bayesian methods, where the posterior distribution is explored on a case-by-case basis. Both approaches may also be combined, e.g. by employing a neural network to suggest a solution ρ_sug, which is then further optimised using a traditional algorithm.

The given setup forces the network to regularise the ill-conditioned problem by reproducing the correct training spectrum in accord with our criteria for a successful reconstruction. It is the inclusion of the training data and the free choice of loss functions that allows the network to fully absorb all relevant prior information. This ability is an outstanding property of supervised learning methods, which could yield potentially significant improvements over existing frameworks.
Examples of such constraints are the analytic structure of the propagator, asymptotic behaviours and normalisation constraints.

The parametrisation of an infinite set or manifold of inverse transformations by the neural network also enables the discovery of new loss functions which may be more appropriate for a reliable reconstruction. This includes, for example, the exploration of correlation matrices with adapted elements, in order to define a suitable norm for the given and suggested propagators. Existing, iterative methods may also benefit from the application of such adaptive loss functions. These may include parameters, point-like representations and arbitrary other characteristics of a given training sample.

Formulated in a Bayesian language, we set out to explicitly train the neural network to predict MAP estimates for each possible input propagator, given the training data as prior information. By salting the input data with noise, the network learns to return a denoised representation of the associated spectral functions.

III. A NEURAL NETWORK BASED RECONSTRUCTION
Neural networks provide high flexibility with regard to their structure and the information they can process. They are capable of learning complex internal representations which allow them to extract the relevant features from a given input. A variety of network architectures and loss functions can be implemented in a straightforward manner using modern Machine Learning frameworks. Prior information can be explicitly provided through a systematic selection of the training data. The data itself provides, in addition to the loss function, a regularisation of possible suggestions. Accordingly, the proposed solutions have the advantage of being similar to the ones in the training data.

The section starts with notes on the design of the neural networks we employ and ends with a detailed introduction of the training procedure and the utilised loss functions.
A. Design of the neural networks
We construct two different types of deep feed-forward neural networks. The input layer is fed with the noisy propagator data G(p). The output of the first type is an estimate of the parameters of the associated ρ in the chosen basis; we denote this type as parameter net (PaNet). The second type is trained directly on the discretised representation of the spectral function and will be referred to as point net (PoNet). A consideration of a variable number of Breit-Wigners is feasible per construction by the point-like representation of the spectral function within the output layer; this kind of network will in the following be abbreviated as PoVarNet. See Figure 2 for a schematic illustration of the different network types. Note that in all cases a basis for the spectral function is provided either explicitly through the structure of the network or implicitly through the choice of the training data. If not stated otherwise, the numerical results presented in the following always correspond to results from the PaNet.

FIG. 2. Sketches of (a) the PaNet, shown here for the case of predicting the parameters A, M, Γ of a single Breit-Wigner peak, as well as (b) the PoNet (and by extension also the PoVarNet). The specific network dimensions in this figure serve a purely illustrative purpose; explicit details on the employed architectures are given in Appendix C.

We compare various types of hidden layers and the impact of depth and width within the neural networks. In general, choosing the numbers of layers and neurons is a trade-off between the expressive power of the network, the available memory and the issue of overfitting. The latter strongly depends on the number of available training samples w.r.t. the expressivity. For fully parametrised spectral functions, new samples can be generated very efficiently for each training epoch, which implies an, in principle, infinite supply of data. Therefore, in this case, the risk of overfitting is practically non-existent. The specific dimensions and hyperparameters used for this work are provided in Appendix C. Numerical results can be found in Section IV.
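A minimal PyTorch sketch of the two network types follows. The layer widths are purely illustrative, and the Softplus output activation enforcing positive parameters is our assumption; the architectures actually employed are listed in Table III of Appendix C.

```python
import torch.nn as nn

N_p, N_omega, N_BW = 100, 500, 3  # grid sizes and number of Breit-Wigners

# PaNet: noisy propagator (G_1, ..., G_Np) -> parameters (A_i, M_i, Gamma_i).
pa_net = nn.Sequential(
    nn.Linear(N_p, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 3 * N_BW), nn.Softplus(),  # keep A, M, Gamma positive
)

# PoNet: noisy propagator -> discretised spectral function (rho_1, ..., rho_Nomega).
po_net = nn.Sequential(
    nn.Linear(N_p, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, N_omega),
)
```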
B. Training strategy
The neural network is trained with appropriately labelled input data in a supervised manner. This approach allows to implicitly impose a prior distribution in the Bayesian sense. The challenge lies in constructing a training dataset that is exhaustive enough to contain the relevant structures that may constitute the actual spectral functions in practical applications.

From our past experience with hadronic spectral functions in lattice QCD and the functional renormalisation group, the most relevant structures are peaks of the Breit-Wigner type, as well as thresholds. The former present a challenge from the point of view of inverse problems, as they contain significant tail contributions, contrary e.g. to Gaussian peaks, which approach zero exponentially fast. Thresholds on the other hand set in at finite frequencies, often involving a non-analytic kink behaviour. In this work, we only consider Breit-Wigner type structures as a first step for the application of neural networks to this family of problems.

Mock spectral functions are constructed using a superposed collection of Breit-Wigner peaks based on a parametrisation obtained directly from one-loop perturbative quantum field theory. Each individual Breit-Wigner is given by

\rho^{(BW)}(\omega) = \frac{4 A \Gamma \omega}{(M^2 + \Gamma^2 - \omega^2)^2 + 4 \Gamma^2 \omega^2}.   (7)

Here, M denotes the mass of the corresponding state, Γ its width and A amounts to a positive normalisation constant.

Spectral functions for the training and test set are constructed from a combination of at most N_BW = 3 different Breit-Wigner peaks. Depending on which type of network is considered, the Euclidean propagator is obtained either by inserting the discretised spectral function into (2), or by a computation of the propagator's analytic representation from the given parameters. The propagators are salted both for the training and test set with additive Gaussian noise,

G_i^{noisy} = G_i + \epsilon.   (8)

This is a generic choice which allows to quantify the performance of our method at different noise levels.
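The mock-data pipeline of this subsection can be sketched as follows. The sampling ranges for (A, M, Γ), the grids and the seed are placeholders; the actual training volumes are listed in Table I.

```python
import numpy as np

rng = np.random.default_rng(0)
omega, dw = np.linspace(1e-3, 10.0, 500, retstep=True)   # N_omega = 500
p = np.linspace(0.0, 10.0, 100)                          # N_p = 100
K = (omega / np.pi) / (omega**2 + p[:, None]**2) * dw    # kernel of (2)

def rho_bw(omega, A, M, Gamma):
    """Single Breit-Wigner peak, eq. (7)."""
    return 4 * A * Gamma * omega / (
        (M**2 + Gamma**2 - omega**2)**2 + 4 * Gamma**2 * omega**2)

def make_sample(n_bw, sigma, lo=(0.1, 0.5, 0.1), hi=(1.0, 3.0, 0.4)):
    """One training triple: noisy propagator, spectral function, parameters.

    lo/hi are hypothetical uniform ranges for (A, M, Gamma); cf. Table I."""
    theta = rng.uniform(lo, hi, size=(n_bw, 3))
    rho = sum(rho_bw(omega, *t) for t in theta)          # superposition of peaks
    G_noisy = K @ rho + sigma * rng.normal(size=len(p))  # additive noise, eq. (8)
    return G_noisy, rho, theta

G_noisy, rho, theta = make_sample(n_bw=3, sigma=1e-3)
```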
The advantage of neural networks to have direct access to different representations of a spectral function implies a free choice of objective functions in the solution space. We consider three simple loss functions and combinations thereof. The (pure) propagator loss L_G(ρ_sug) defined in (5) represents the most straightforward approach. This objective function is accessible also in already existing frameworks such as the BR or GrHMC methods and is implemented in this work to facilitate a quantitative comparison. In contrast, the loss functions that follow are only accessible in the neural network based reconstruction framework. This unique property is owed to the possibility that a neural network can be trained in advance on a dataset of known input and output pairs. As pointed out in Section II C, a loss function can e.g. be defined directly on a discretised representation of the spectral function ρ. This approach is implemented through L_ρ(ρ_sug), see (6). The optimisation of the parameters θ = {A_i, M_i, Γ_i | 0 ≤ i < N_BW} of our chosen basis is an even more direct approach. In principle, the space of all possible choices of parameters is R_+^{3 N_BW}, assuming they are all positive definite. Of course, only finite subvolumes of this space ever need to be considered as target spaces for reconstruction methods. Therefore, we will often refer to a finite target volume simply as the parameter space for a specific setting. The respective loss function defined in this space is given by

L_\theta(\theta_{sug}) = \| \theta_{sug} - \theta \|.   (9)

All losses are evaluated using the 2-norm. In the case of the parameter net, we have ρ_sug ≡ ρ(θ_sug). Apart from the three given loss functions, we also investigate a combination of the propagator and the spectral function loss,

L_{G,\rho}(\rho_{sug}, \alpha) = L_\rho(\rho_{sug}) + \alpha L_G(\rho_{sug}).   (10)

The type of loss function that is employed, as well as the selection of the training data, has a major impact on the resulting performance of the neural network. Given this observation, it seems likely that a further optimisation regarding the choice of the loss function can significantly enhance the prediction quality. However, for the time being, we content ourselves with the types given above and postpone the exploration of more suitable training objectives to future work.
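For reference, the objectives (5), (6), (9) and (10) can be written compactly as below; a differentiable kernel matrix K of shape (N_p, N_ω) and batched tensors are assumed.

```python
import torch

def loss_G(rho_sug, rho_true, K):
    """Propagator loss (5): 2-norm difference in G-space, G = rho @ K^T."""
    return torch.norm(rho_sug @ K.T - rho_true @ K.T, dim=-1).mean()

def loss_rho(rho_sug, rho_true):
    """Spectral function loss (6): 2-norm on the discretised rho."""
    return torch.norm(rho_sug - rho_true, dim=-1).mean()

def loss_theta(theta_sug, theta_true):
    """Parameter loss (9): 2-norm on the Breit-Wigner parameters theta."""
    return torch.norm(theta_sug - theta_true, dim=-1).mean()

def loss_combined(rho_sug, rho_true, K, alpha):
    """Combined loss (10): L_rho + alpha * L_G."""
    return loss_rho(rho_sug, rho_true) + alpha * loss_G(rho_sug, rho_true, K)
```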
In the following section, we continue with a thorough assessment of the performance of the neural network based approach and compare numerical results with the existing methods introduced in Section II B.

IV. NUMERICAL RESULTS

In this section we present numerical results for the neural network based reconstruction and validate the discussed potential advantages by comparing to existing methods. Details on the training procedure as well as the training and test datasets can be found in Appendix C, together with an introduction to the used performance measures. We start with a brief summary of the main findings for our approach. Furthermore, a detailed numerical analysis and discussion of different network setups w.r.t. performance differences are provided. Subsequently, additional post-processing methods for an improvement of the neural network predictions are covered. The section ends with a discussion of results from the PoNet. Readers who are interested in a comparison of the neural network based reconstruction to existing methods may proceed directly with Section IV B.

FIG. 3. [Flowchart relating the local differences in the severity of the inverse problem, the loss function/prior information, the information loss within the forward pass and due to noise, and the complexity of the network to reliable reconstructions.] This figure serves to illustrate the impact which the details of the training procedure and of the inverse problem itself have on the quality of the reconstructions. The term reliable reconstruction refers to a homogeneous distribution of losses within the parameter space. This involves a reliable error estimation on the given reconstruction and ensures locality of proposed solutions. In essence, we want to emphasise the importance of realising that aiming for reliable reconstructions is a complicated, multifactorial problem whose facets need to be sufficiently disentangled in order to understand all contributions to systematic errors.
A. Reconstruction with neural networks
Our findings concerning the optimal setup of a feed-forward network can be summarised as follows. As pointed out in Section II C, the network aims to learn an approximate parametrisation of a manifold of (matrix) inverses of the discretised Källén-Lehmann transformation. The inverse problem grows more severe if the propagator values are afflicted with noise. In Bayesian terms, this is caused by a wider posterior distribution for larger noise. The network needs to have sufficient expressivity, i.e. an adequate number of hyperparameters, to be able to learn a large enough set of inverse transformations. We assume that for larger noise widths a smaller number of hyperparameters is necessary to learn satisfactory transformations, since the available information content about the actual inverse transformation decreases for a respective exact reconstruction. A varying severity of the inverse problem within the parameter space leads to an optimisation of the spectral reconstruction in regions where the problem is less severe. This effect occurs naturally, since there the network can minimise the loss more easily than in regions where the problem is more severe. Besides the severity of the inverse problem, the form of the loss function has a large impact on global optima within the landscape of the solution space. Based on these observations, an appropriate training of the network is non-trivial and demands a careful numerical analysis of the inverse problem itself, and of different setups of the optimisation procedure. A sensible definition of the loss function or a non-uniform selection of the training data are possible approaches to address the disparity in the severity of the inverse problem. A more straightforward approach is to iteratively reduce the covered parameter ranges within the learning process, based on previously suggested outcomes. This amounts to successively increasing the prediction accuracy by restricting the network to smaller and smaller subvolumes of the original solution space. However, one should be aware that this approach is only sensible if the reconstructions for different noise samples on the original propagator data are sufficiently close to each other in the solution space. A successive optimisation of the prediction accuracy in such a way can also be applied to existing methods. All approaches ultimately aim at a more homogeneous reconstruction loss within the solution space. This allows for a reliable control of systematic errors, as well as an accurate estimation of statistical errors. The desired outcome for a generic set of Breit-Wigner parameters is illustrated and discussed in Figure 1.

FIG. 4. The performance of different net architectures and loss functions is compared for additive Gaussian noise with widths of 10^− and 10^− on the given input propagator. Shown are the respective losses for the predicted parameters, for the discretised reconstructed spectral function and for the reconstructed propagator with respect to the true, noise-free propagator. (a) Comparison of different net architectures; all networks are trained with the parameter loss, and the associated architectures can be found in Table III. (b) Comparison of different loss functions, described at the end of Section III B; all results are based on networks with the Conv architecture. For both panels, the performance measures and the training of the neural networks are based on the training and test set of the largest volume in parameter space, Vol O. The definitions of the performance measures are given at the end of Appendix C. The results in panel (a) imply that for larger errors, the choice of a specific network architecture has negligible impact on the quality of the reconstructions. All performance measures can be lowered for the given noise widths by applying a post-processing procedure on the suggested parameters of the network; in particular, the propagator loss can be minimised. The comparison in panel (b) shows that the choice of the loss function has a major impact on the resulting performance of the network. The results underpin the importance of an appropriate loss function and support our argument of potential advantages of neural networks compared to existing approaches in Section II C. Contour plots in parameter space are illustrated for the respective measures in Figure 12 and Figure 13.
The essence of our discussion here is summarised pictorially in Figure 3.

The impact of the net architecture and the loss function on the overall performance within the parameter space is illustrated in Figure 4. Associated contour plots can be found in Figure 12 and Figure 13. These plots demonstrate that the minima in the loss landscape highly depend on the employed loss function. In turn, this leads to different performance measures. This observation confirms our previous discussion and the necessity of an appropriate definition of the loss function. It also reinforces our arguments regarding potential advantages of neural networks in comparison to other approaches for spectral reconstruction. The comparison of different feed-forward network architectures shows that the specific details of the network structure are rather irrelevant, provided that the expressivity is sufficient.

FIG. 5. The uncertainties of reconstructions of spectral functions on the same original propagator are illustrated in the same manner as described in Figure 1 for different volumes of the parameter space (Vol O, A, B, C and D), again using a noise width of 10^−. The plots demonstrate how the quality of the reconstruction improves if the parameter space which the network has to learn is decreased. The volumes of the corresponding parameter spaces are listed in Table I. The results are computed from the Conv PaNet. The systematic deviation of the distribution of reconstructions for large volumes shows that the network has not captured the manifold of inverse transformations completely for the entire parameter space. This is in concordance with the results discussed in Figure 12 and Figure 14.

FIG. 6. The plots in this figure quantify the impact of the parameter space volume used for the training on the performance of the network. The performance measures are computed based on the test set of the smallest volume, Vol D. The parameter ranges in the training set are gradually reduced to analyse different levels of complexity of the problem. A network is trained separately for each volume; the volumes are listed in Table I. The results demonstrate the potential advantage of an iterative restriction of the parameter ranges of possible solutions. The contour plots in Figure 14 depict changes of the performance measures within the parameter space. More strongly peaked prior distributions lead to better reconstructions. The comparison with results of the GrHMC approach illustrates the improvement of the performance of neural networks for larger errors and smaller volumes. These observations confirm the discussions of Figure 5 and Figure 8. Adding a post-processing step leads, in particular for the propagator loss and for smaller noise widths, to an improvement of the reconstruction, as has also been discussed in Figure 4.

FIG. 7. Comparison of reconstruction errors of the PaNet and PoNet. The performance measures are computed based on the test set of the largest parameter space volume Vol O for one, two and three Breit-Wigners. The overall smaller losses for the point nets are due to the large number of degrees of freedom of the point-like representation of the spectral function. The partly competitive performance of the PoVarNet compared to the results of the PoNet encourages the further investigation of networks that are trained using a more exhaustive set of basis functions to describe physical structures in the spectral functions.

Differences in the performance of the networks that are trained with the same loss function become less visible for larger noise. This is illustrated by a comparison of contour plots with different noise widths, see e.g. Figure 12. The severity of the inverse problem grows with the noise and the information content about the actual matrix transformation decreases. These properties lead to the observation of a generally worse performance for larger noise widths, as can be inferred from Figure 5, Figure 8 and Figure 9, for example. They also imply that for specific noise widths, the neural network possesses enough hyperparameters to learn a sufficient parametrisation of the inverse transformation manifold. Furthermore, the local optima into which the network converges are mainly determined by differences in the local severity of the inverse problem. Hence, the issue remains that generic loss functions are inappropriate to address the varying local severity of the inverse problem. This issue implies the existence of systematic errors for particular regions within the parameter space, as can be seen e.g. in the left plot of Figure 5.

The results shown in Figure 5, Figure 6 and Figure 14 confirm our discussion regarding the expressive power of the network w.r.t.
the complexity of the solution space and the decreasing information content for larger errors. The parameter space is gradually reduced, effectively increasing the expressivity of the network relative to the severity of the problem and improving the behaviour of the loss function for a given fixed parameter space. The respective volumes are listed in Table I. Shrinking the parameter space leads to a more homogeneous loss landscape due to the increased locality, thereby mitigating the issue of inappropriate loss functions. The necessary number of hyperparameters decreases for larger noise widths and smaller parameter ranges in the training and test dataset. The arguments above imply a better performance of the network for smaller parameter spaces. A reduction of the parameter space effectively corresponds to a sharpening of the prior information, which also has positive effects on the spread of the posterior distribution. More detailed discussions on the impact of different elements of the training procedure can be found in the captions of the respective figures.

Since increasing the expressivity of the network is limited by the computational demand required for the training, one can also apply post-processing methods to improve the suggested outcome w.r.t. the initially given, noisy propagator. These methods are motivated by the in some cases large observed root-mean-square deviation of the reconstructed suggested propagator from the input, see for example Figure 4. The application of standard optimisation methods to the suggested results of the network represents one possible approach to address this problem. Here, the network's prediction is interpreted as a guess of the MAP estimate, which is presumed to be close to the true solution. For the PaNet, we minimise the propagator loss a posteriori with respect to the following loss function:

\min_{\theta_{sug}} L_{PP}[\theta_{sug}] = \min_{\theta_{sug}} \| G_{noisy} - G[\rho(\theta_{sug})] \|.   (11)

This ensures that suggestions for the reconstructed spectral functions are in concordance with the given input propagator. Results obtained with an additional post-processing are marked by the attachment PP in this work. The numerical results in Figure 9 and Figure 5 show that the finite size of the neural network can be partially compensated for small errors. The resulting low propagator losses are noteworthy, and are close to state-of-the-art spectral reconstruction approaches. One reason for this similarity is the shared underlying objective function. However, the situation is different for larger noise widths. For our choice of hyperparameters, the algorithm quickly converges into a local minimum. For large noise widths, the optimisation procedure may even lead to worse results than the initially suggested reconstruction. This is due to the already mentioned systematic deviations which are caused by the inappropriate choice of the loss function for large parameter spaces. This kind of post-processing should therefore be applied with caution, since it may cancel out the potential advantages of neural networks w.r.t. the freedom in the definition of the loss function.
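A sketch of this post-processing step (11) is given below, assuming the NumPy mock-data helpers rho_bw, K and omega sketched in Section III B; the choice of optimiser is ours and not prescribed by the text.

```python
import numpy as np
from scipy.optimize import minimize

def post_process(theta_sug, G_noisy, omega, K, rho_bw):
    """Refine the network's parameter guess by minimising the propagator
    loss (11) against the given noisy input propagator."""
    n_bw = theta_sug.shape[0]

    def propagator_loss(theta_flat):
        theta = theta_flat.reshape(n_bw, 3)
        rho = sum(rho_bw(omega, *t) for t in theta)
        return np.linalg.norm(G_noisy - K @ rho)

    # The prediction is treated as a MAP guess close to the true solution;
    # for large noise this refinement may end up in a worse local minimum
    # than the initial suggestion (see the discussion above).
    res = minimize(propagator_loss, theta_sug.ravel(), method="Nelder-Mead")
    return res.x.reshape(n_bw, 3)
```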
The following alternative post-processing approach preserves the potential advantages of neural networks while nevertheless minimising the propagator loss. The idea is to include the network into the optimisation process through the following objective:

\min_{G_{input}} L_{input}[G_{input}] = \min_{G_{input}} \| G_{noisy} - G[\rho(\theta_{sug})] \|,   (12)

where G_input corresponds to the input propagator of the neural network and θ_sug to the associated outcome. This facilitates a compensation of badly distributed noise samples and allows a more accurate error estimation. The approach is only sensible if no systematic errors exist for reconstructions within the parameter space, and if the network's suggestions are already somewhat reliable. We postpone a numerical analysis of this optimisation method, together with the exploration of more appropriate loss functions and improved training strategies, to future work, due to a currently lacking setup to train such a network.

In Figure 7 and Figure 15, results of the PoNet and the PaNet are compared. We observe that spectral reconstructions based on the PoNet structure suffer from similar problems as the PaNet, cf. again Figure 3. The point-like representation of the spectral function introduces a large number of degrees of freedom for the solution space. The training procedure implicitly regularises this problem; however, a visual analysis of individual reconstructions shows that in some cases the network struggles with common issues known from existing methods, such as partly non-Breit-Wigner like structures and wiggles. An application of the proposed post-processing methods serves as a possible approach to circumvent such problems. An inclusion of further regulator terms into the loss function, concerning e.g. the smoothness of the reconstruction, is also possible.

FIG. 8. The quality of the reconstruction of two Breit-Wigner peaks is compared for different strengths of additive noise on the same propagator. The labels indicate the noise width on the original propagator. It can be seen that the reconstructed spectral function of the neural network exhibits, in particular for larger errors, a lower deviation from the original spectral function than the GrHMC method. This mirrors the generally observable better performance of the neural network for larger errors, as can be seen in Figure 6 and in Figure 9. The green and the red curve correspond to reconstructions of the Conv PaNet and the GrHMC method for the same given noisy propagator. The prior is in both cases given by the parameter range of volume Vol B. The uncertainty of the reconstructions for the neural network is depicted by the grey shaded areas as described in Figure 1. For small errors, this area is covered by the corresponding reconstructed spectral functions.

FIG. 9. The performance of the reconstruction of spectral functions is benchmarked for the neural network approach with respect to results of the GrHMC method. The neural network approach is competitive in particular for large noise widths. The worse performance for smaller noise widths is a result of an inappropriate training procedure and a too low expressive power of the neural network. The problems are caused by a varying severity of the inverse problem and by a too large parameter space that needs to be covered by the neural network, as discussed in Section IV A. The error bars of the results for the FC network are representative for typical errors within all methods and plots of this kind.
B. Benchmarking and discussion
In this section, we want to emphasise differences of our proposed neural network approach to existing methods. Our arguments are supported by an in-depth numerical comparison.

Within all approaches the aim is to map out, or at least to find the maximum of, the posterior distribution P(ρ|G) for a given noisy propagator G. The BR and GrHMC methods represent iterative approaches to accomplish this goal. The algorithms are designed to find the maximum for each propagator on a case-by-case basis. The GrHMC method additionally provides the possibility to implement constraints on the functional basis of the spectral function in a straightforward manner. In contrast, a neural network aims to learn the full manifold of inverse Källén-Lehmann transformations for any noisy propagator (at least within the chosen parameter space). In this sense, it needs to propose for each given propagator an estimate of the maximum of P(ρ|D). A complex parametrisation, as given by the network, an exhaustive training dataset and the optimisation procedure itself are essential features of this approach for tackling this tough challenge. The computational effort to find a solution in an iterative approach is therefore shifted to the training process as well as the memory demand of the network. Accordingly, the neural network based reconstruction can be performed much faster after training has been completed, which is in particular advantageous when large sets of input propagators are considered.

FIG. 10. Comparison of performance measures for the reconstruction of two Breit-Wigners with neural networks and with the GrHMC method for input propagators with noise width 10^− within the parameter space volume Vol O. The similar loss landscape emphasises the high impact of variations of the severity of the inverse problem within the parameter space on the quality of reconstructions. Contrary to expectations, the parameter network mimics, despite an optimisation based on the parameter loss L_θ, the reconstruction of the GrHMC method, which relies on an optimisation of the propagator loss L_G with respect to the parameters. A reconstruction resulting in an averaged peak with the other parameter set effectively removed, as outlined in [26], results in the spiking parameter losses for the GrHMC reconstructions with large errors.

The numerical results in Figure 6, Figure 8, Figure 9, Figure 10 and Figure 11 demonstrate that the formal arguments of Section II C apply, particularly for comparably large noise widths as well as smaller parameter ranges. For both cases, the network successfully approximates the required inverse transformation manifold. Smaller noise widths and a larger set of possible spectral functions can be addressed by increasing the number of hyperparameters and through the exploration of more appropriate loss functions, as was already discussed previously.

V. CONCLUSION
In this study we have explored artificial neural networks as a tool to deal with the ill-conditioned inverse problem of reconstructing spectral functions from noisy Euclidean propagator data. We systematically investigated the performance of this approach on physically motivated mock data and compared our results with existing methods. Our findings demonstrate the importance of understanding the implications of the inverse problem itself on the optimisation procedure as well as on the resulting predictions.

The crucial advantage of the presented ansatz is the superior flexibility in the choice of the objective function. As a result, it can outperform state-of-the-art methods if the network is trained appropriately and exhibits sufficient expressivity to approximate the inverse transformation manifold. The numerical results demonstrate that defining an appropriate loss function grows increasingly important with an increased variability of the considered spectral functions and of the severity of the inverse problem.

In future work, we aim to further exploit the advantage of neural networks that local variations in the severity of the inverse problem can be systematically compensated. The goal is to eliminate systematic errors in the predictions in order to facilitate a reliable reconstruction with an accurate error estimation. This can be realised by finding more appropriate loss functions with the help of implicit and explicit approaches [31, 32]. A utilisation of these loss functions in existing methods is also possible if they are directly accessible. Varying the prior distribution will also be investigated, by sampling non-uniformly over the parameter space during the creation of the training data. Furthermore, we aim at a better understanding of the posterior distribution through the application of invertible neural networks [23]. This novel architecture provides a reliable estimation of errors by mapping out the entire posterior distribution by construction.

In conclusion, we believe that the suggested improvements will boost the performance of the proposed method to an as of yet unprecedented level and that neural networks will eventually replace existing state-of-the-art methods for spectral reconstruction.
ACKNOWLEDGMENTS
We thank Ion-Olimpiu Stamatescu and the ITP machine learning group for discussions and work on related topics. M. Scherzer acknowledges financial support from DFG under STA 283/16-2. F.P.G. Ziegler is supported by the FAIR OCD project. The work is supported by EMMI, the BMBF grant 05P18VHFCA, and is part of and supported by the DFG Collaborative Research Centre "SFB 1225 (ISOQUANT)" as well as by Deutsche Forschungsgemeinschaft (DFG) under Germany's Excellence Strategy EXC-2181/1 - 390900948 (the Heidelberg Excellence Cluster STRUCTURES).

FIG. 11. Reconstructions of one, two and three Breit-Wigners are compared for our proposed neural network approach, the GrHMC method and the BR method. The reconstructions of the first two methods are based on a single sample with noise width 10^−, while the results of the BR method are obtained from multiple samples with larger errors, but an average noise width of 10^− as well. In contrast to the previous plots, the neural network and the GrHMC method now use different priors for each case in order to allow for a reasonable comparison with the BR method, see Table II. We observe that all approaches qualitatively capture the features in the spectral function. Due to the comparably large error on the input data, all methods are expected to face difficulties in finding an accurate solution. The reconstructions of the neural network approach and the GrHMC method are comparable, whereas the BR method struggles in particular with thin peaks and the three Breit-Wigner case. The results demonstrate that, generally, using suitable basis functions and incorporating prior information lead to a superior reconstruction performance.

Appendix A: BR method
Different Bayesian methods propose different prior probabilities, i.e. they encode different types of prior information. The well-known Maximum Entropy Method, for example, features the Shannon-Jaynes entropy

S_{SJ} = \int d\omega \left( \rho(\omega) - m(\omega) - \rho(\omega) \log\left[ \frac{\rho(\omega)}{m(\omega)} \right] \right),   (A1)

while the more recent BR method uses a variant of the gamma distribution,

S_{BR} = \int d\omega \left( 1 - \frac{\rho(\omega)}{m(\omega)} + \log\left[ \frac{\rho(\omega)}{m(\omega)} \right] \right).   (A2)

Both methods e.g. encode the fact that physical spectral functions are necessarily positive definite, but are otherwise based on different assumptions.

As Bayesian methods they have in common that the prior information has to be encoded in the functional form of the regulator and the supplied default model m(ω). Note that discretising ρ by choosing a particular functional basis also introduces a selection of possible outcomes. The dependence of the most probable spectral function, given input data and prior information, on the choice of S, m(ω) and the discretised basis comprises the systematic uncertainty of the method.

One major limitation of Bayesian approaches is the need to formulate our prior knowledge in the form of an optimisation functional. The reason is that while many of the correlation functions relevant in theoretical physics have very well defined analytic properties, it has not been possible to formulate these as a closed regulator functional S. Take the retarded propagator for example (for a more comprehensive discussion see [26]). Its analytic structure in the imaginary frequency plane splits into two parts, an analytic half-plane, where the Euclidean input data is located, and a meromorphic half-plane which contains all structures contributing to the real-time dynamics. Encoding this information in an appropriate regulator functional has not yet been achieved.

Instead, the MEM and the BR method rather use concepts unspecific to the analytic structure, such as smoothness, to derive their regulators. Among others, this manifests itself e.g. in the presence of artificial ringing, which is related to unphysical poles contributing to the real-time propagator, which however should be suppressed by a regulator functional aware of the physical analytic properties.
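Both regulators can be evaluated numerically in a few lines; the grid, the positive test spectrum and the default model below are hypothetical inputs.

```python
import numpy as np

def S_shannon_jaynes(rho, m, dw):
    """Shannon-Jaynes entropy (A1) used by the Maximum Entropy Method."""
    return dw * np.sum(rho - m - rho * np.log(rho / m))

def S_br(rho, m, dw):
    """BR regulator (A2), a variant of the gamma distribution."""
    return dw * np.sum(1.0 - rho / m + np.log(rho / m))

# Both functionals vanish for rho = m and penalise deviations from the
# default model; the logarithms enforce positivity of rho, but neither
# encodes the analytic structure of the propagator discussed above.
omega, dw = np.linspace(1e-3, 10.0, 500, retstep=True)
m = np.ones_like(omega)                      # hypothetical default model
rho = 1.0 + 0.5 * np.exp(-(omega - 3.0)**2)  # hypothetical positive spectrum
print(S_shannon_jaynes(rho, m, dw), S_br(rho, m, dw))
```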
Appendix B: GrHMC method

The main idea of the setup is already stated in the main text in Section II and was first introduced in [26]. Nevertheless, for completeness we outline the entire reconstruction process here. The approach is based on formulating the basis expansion in terms of the retarded propagator. The resulting set of basis coefficients is then determined via Bayesian inference. This leaves us with two objects to specify in the reconstruction process, the choice of a basis/ansatz for the retarded propagator and suitable priors for the inference.

Once a basis has been chosen, it is straightforward to write down the corresponding regression model. As in the reconstruction with neural nets, we use a fixed number of Breit-Wigner structures, c.f. (7), corresponding to simple poles in the analytically continued retarded propagator. The logarithm of all parameters is used in the model in order to enforce positivity of all parameters. The uniqueness of the parameters is ensured by using an ordered representation of the logarithmic mass parameters.

The other crucial point is the choice of priors, which are of great importance to tame the ill-conditioning practically and should therefore be chosen as restrictive as possible. For comparability to the neural net reconstruction, the priors are matched to the training volume in parameter space. However, it is more convenient to work with a continuous distribution. Hence, the priors of the logarithmic parameters are chosen as normal distributions, where we have fixed the parameters by the condition that the mean of the distribution is the mean of the training volume and the probability at the boundaries of the training volume is equal. Details on the training volume in parameter space can be found in Appendix C. All calculations for the GrHMC method are carried out using the python interface [33] of Stan [34].
Appendix C: Mock data, training set and training procedure
We consider three different levels of difficulty for the reconstruction of spectral functions to analyse and compare the performance of the approaches in this work. These levels differ by the number of Breit-Wigners that need to be extracted based on the given information of the propagator. We distinguish between training and test sets with one, two and three Breit-Wigners. A variable number of Breit-Wigners within a test set entails the task of determining the correct number of present structures. This can be done a priori or a posteriori, based either on the propagator or on the quality of the reconstruction. We postpone this problem to future work.

The training set is constructed by sampling parameters uniformly within a given range for each parameter; a minimal generator of this kind is sketched following Table I. The ranges for the parameters M, Γ and A of a Breit-Wigner function of (7) are given in Table I; for the reduced training volumes, the ranges of Γ and A are narrowed gradually. We proceed differently for the two masses to guarantee a certain finite distance between the two Breit-Wigner peaks. Instead of decreasing the mass range, the minimum and maximum distance of the peaks is restricted. Details on the different parameter spaces can be found in Table I. The propagator function is parametrised by N_p = 100 data points that are evaluated on a uniform grid within a fixed interval in ω starting at zero; the spectral function is discretised with N_ω = 500 data points on the same interval. Details about the training procedure can be found at the end of the section. The parameter ranges deviate for the comparison of the neural network approach with existing methods; the corresponding ranges are listed in Table II.

TABLE I. Parameter ranges of the training volumes Vol O, A, B, C and D for the amplitude A, the mass M, the width Γ and the mass splitting ΔM. ΔM = M₂ − M₁ is limited to restrict the minimum possible distance between two peaks. The volumes V_θ in Figure 6 are computed based on these parameter ranges. [The numerical entries of this table are not recoverable from this extraction.]
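As referenced above, a minimal mock-data generator of the kind just described can be sketched as follows. The Breit-Wigner spectral function follows the single-pole form of [26]; the grid endpoint, the kernel normalisation and the quoted parameter ranges are illustrative assumptions, not the values of Table I.

import numpy as np

rng = np.random.default_rng(0)

N_P, N_OMEGA = 100, 500
OMEGA_MAX = 10.0                               # interval end: an assumption
p = np.linspace(0.0, OMEGA_MAX, N_P)           # propagator grid
omega = np.linspace(0.0, OMEGA_MAX, N_OMEGA)   # spectral-function grid

def breit_wigner(omega, A, M, Gamma):
    """Single Breit-Wigner spectral function in the form of [26]."""
    return 4.0 * A * Gamma * omega / (
        (M**2 + Gamma**2 - omega**2) ** 2 + 4.0 * Gamma**2 * omega**2)

def propagator(rho, omega, p):
    """Kallen-Lehmann-type transform via trapezoidal quadrature;
    the 1/pi normalisation is an assumption."""
    kernel = omega[None, :] / (omega[None, :] ** 2 + p[:, None] ** 2 + 1e-12)
    return np.trapz(kernel * rho[None, :], omega, axis=1) / np.pi

def sample_training_pair(ranges):
    """Draw one (propagator, parameters) pair; `ranges` holds the
    uniform sampling bounds per parameter, e.g. taken from Table I."""
    A, M, Gamma = (rng.uniform(*ranges[k]) for k in ("A", "M", "Gamma"))
    rho = breit_wigner(omega, A, M, Gamma)
    return propagator(rho, omega, p), np.array([A, M, Gamma])

# Hypothetical single-Breit-Wigner ranges, for illustration only.
G, theta = sample_training_pair(
    {"A": (0.1, 1.0), "M": (0.5, 3.0), "Gamma": (0.1, 0.5)})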
TABLE II. Parameter ranges for the comparison with the BR method, covering the amplitude A, the mass M and the width Γ for the different numbers of Breit-Wigners (1BW, 2BW, 3BW). [The numerical entries of this table are not recoverable from this extraction.]

The different approaches are compared using a test set for each number of Breit-Wigners, consisting of 1000 random samples within the parameter space. Another test set is constructed for two Breit-Wigners with a fixed scaling A₁ = A₂ = 0.5, a fixed mass M₁ = 1 and equally chosen widths Γ := Γ₁ = Γ₂. The mass M₂ and the width Γ are varied according to a regular grid in parameter space. This test set allows the analysis of contour plots of different loss measures. It provides further insights into the minima of the loss functions of the trained networks and into the severity of the inverse problem. The contour plots are averaged over 10 samples at the considered noise width (except for Figure 10).

We investigate three different performance measures and different setups of the neural network for a comparison to existing methods. The root-mean-square deviations of the predicted parameters in parameter space, of the reconstructed spectral function and of the reconstructed propagator are considered. For the latter, the error is computed with respect to the original propagator without noise. These measures are denoted as parameter loss, spectral function loss and propagator loss in this work; a minimal sketch of their computation is given below. The spectral function loss and the propagator loss are computed based on the discretised representations on the uniform grid. Representative error bars for all methods are depicted in Figure 9.

The training procedure for the neural networks in this work is as follows. A neural network is trained separately for each training set, i.e., for each noise level and for each range of parameters. The learning rates are between 10^{−…} and 10^{−…}. The batch size is between 128 and 500, and the number of generated training samples per epoch is around 6 × 10^{…}. Depending on the kind of network, the nets are trained for 80 to 160 epochs. The loss functions used are described at the end of Section III B. The implemented net architectures are provided in Table III.
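To make the three performance measures concrete, the following sketch computes them on the uniform grids. The helper names are ours, and the per-point additive Gaussian noise model is an assumption for illustration.

import numpy as np

def rmse(a, b):
    """Root-mean-square deviation between two discretised quantities."""
    return np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

def evaluate(theta_pred, theta_true, rho_pred, rho_true, G_pred, G_clean):
    """Parameter, spectral function and propagator loss; note that the
    propagator loss is taken w.r.t. the original, noise-free propagator."""
    return {
        "parameter loss": rmse(theta_pred, theta_true),
        "spectral function loss": rmse(rho_pred, rho_true),
        "propagator loss": rmse(G_pred, G_clean),
    }

# Noisy network inputs are generated by adding Gaussian noise of a fixed
# width to the clean propagator (per-point noise is our assumption).
rng = np.random.default_rng(1)
G_clean = np.exp(-np.linspace(0.0, 10.0, 100))    # stand-in propagator
G_noisy = G_clean + rng.normal(0.0, 1e-3, size=G_clean.shape)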
FIG. 12. Comparison of network architectures -
Contour plots of loss measures are shown for different net architectures. The upper three rows correspond to reconstructions of propagators with a noise width of 10^{−…}, the lower ones with 10^{−…}. The plots illustrate the loss measures in a hyperplane within the parameter space whose properties are described in Appendix C. The networks are trained with the parameter loss on the training set of volume Vol O. The contour plots show that the local minima are slightly different for small noise widths, whereas the global structures remain similar for all network architectures. These differences are caused by a slightly differing utilisation of the limited number of hyperparameters. The differences between the network architectures become less visible for larger errors, due to the growing severity of the inverse problem and a decreasing knowledge about the correct inverse transformations. Interestingly, the loss landscapes of the convolutional neural network, which intrinsically operates on local structures, and of the fully connected networks are almost equal. The non-locality of the inverse integral transformation is a possible reason why the specific choice of the network structure is largely irrelevant. We conclude that the actual architecture is rather negligible in comparison to other attributes of the learning process, such as the selection of training data and the choice of the loss function.
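For orientation, the following is a minimal PyTorch sketch of the Conv architecture listed in Table III (Conv(64, 10) ⇒ ReLU ⇒ Conv(256, 10) ⇒ ReLU ⇒ FC(4096) ⇒ ReLU ⇒ FC(1024)); stride, padding, the flattening step and the placement of the surrounding ReLUs are our assumptions, since Table III does not fix them.

import torch
import torch.nn as nn

class ConvParamNet(nn.Module):
    """Sketch of the 'Conv' CenterModule from Table III; kernel sizes and
    channel counts from the table, stride/padding assumed (valid, stride 1)."""
    def __init__(self, n_out=6):   # e.g. 6 parameters for two Breit-Wigners
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=10), nn.ReLU(),
            nn.Conv1d(64, 256, kernel_size=10), nn.ReLU(),
        )
        n_flat = 256 * (100 - 9 - 9)   # two valid convolutions on 100 inputs
        self.head = nn.Sequential(
            nn.Linear(n_flat, 4096), nn.ReLU(),
            nn.Linear(4096, 1024), nn.ReLU(),
            nn.Linear(1024, n_out),
        )
    def forward(self, G):
        x = self.features(G.unsqueeze(1))   # (batch, 1, 100)
        return self.head(x.flatten(1))

net = ConvParamNet()
out = net(torch.randn(8, 100))   # a batch of 8 noisy propagators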
FIG. 13. Comparison of loss functions -
Contour plots of loss measures are illustrated in the same manner as in Figure 12, but with a comparison of different loss functions. The considered loss functions are introduced in Section III B. The results are based on the Conv PaNet that is trained on volume Vol O. The optima of the loss functions differ and, consequently, lead to different mean squared errors for the different measures. It is interesting that the network with the pure propagator loss function leads to a rather homogeneous propagator loss distribution. In contrast, the networks with the pure parameter and the pure spectral function loss do not result in homogeneous distributions for their corresponding loss measures. The large set of nearly equal propagators for different parameters explains this observation. It also confirms once more the necessity of approaches that can be trained using loss functions with access to more information than just the reconstructed propagator.
FIG. 14. Analysis of prior information (parameter space of the training data) and of local differences in the severity of the inverse problem -
The evolution of the landscape of different loss measures is shown for networks that are trained on different parameter spaces. All contour plots are based on the same section of the parameter space, namely the range that is spanned by volume D. The upper three and lower three rows correspond again to reconstructions of propagators with noise widths of 10^{−…} and 10^{−…}. The gradual reduction of the parameter space allows the analysis of different levels of complexity of the problem. A general improvement of the performance can be observed, besides a shift of the global optima. The more homogeneous loss landscape demonstrates that the problem of a varying severity of the inverse problem is still present, but damped.
FIG. 15. Comparison of the parameter net and the point net -
Root-mean-square deviations are compared between the parameter net and the point net, trained on two Breit-Wigner like structures (PoNet) and trained on a variable number of Breit-Wigners (PoNetVar), with respect to different loss functions. The two upper rows correspond to results from input propagators with a noise width of 10^{−…}, the two lower ones with a noise width of 10^{−…}. Problems concerning a varying severity of the inverse problem and an information loss caused by the additive noise remain, independently of the chosen basis for the representation of the spectral function.

TABLE III. Details on the implemented network architectures. The general setup is: Input(100) ⇒ ReLU ⇒ CenterModule ⇒ ReLU ⇒ FC(3/6/9/500) ⇒ Output, where the CenterModule is given along with the associated name in the table. The size of the output layer is determined by the use of a parameter net or a point net and the considered number of Breit-Wigners. An attached PP indicates that a post-processing procedure is applied to the suggested parameters; for more details, see Section IV A. [Repetition counts of the bracketed modules and the exponents of the parameter counts are not recoverable from this extraction.]

Name           | CenterModule                                                                                           | Number of parameters
FC             | FC(6700) ⇒ ReLU ⇒ FC(12168) ⇒ ReLU ⇒ FC(1024)                                                          | 95 × 10^…
Deep FC        | FC(512) ⇒ ReLU ⇒ FC(1024) ⇒ ReLU ⇒ (FC(4056) ⇒ ReLU) ⇒ (FC(2056) ⇒ ReLU)                               | … × 10^…
Narrow Deep FC | FC(512) ⇒ ReLU ⇒ (FC(1024) ⇒ ReLU) ⇒ (FC(2056) ⇒ ReLU) ⇒ (FC(1024) ⇒ ReLU) ⇒ FC(512) ⇒ ReLU ⇒ FC(256)  | 96 × 10^…
Straight FC    | (FC(4112) ⇒ BatchNorm1D ⇒ ReLU ⇒ Dropout(0.2))                                                         | … × 10^…
Conv           | Conv(64, 10) ⇒ ReLU ⇒ Conv(256, 10) ⇒ ReLU ⇒ (FC(4096) ⇒ ReLU) ⇒ FC(1024)                              | 41 × 10^…

[1] D. Guest, K. Cranmer, and D. Whiteson, Ann. Rev. Nucl. Part. Sci. 68, 161 (2018), arXiv:1806.11484 [hep-ex].
[2] A. Radovic, M. Williams, D. Rousseau, M. Kagan, D. Bonacorsi, A. Himmel, A. Aurisano, K. Terao, and T. Wongjirad, Nature 560, 41 (2018).
[3] Y. LeCun, Y. Bengio, and G. Hinton, Nature 521, 436 (2015).
[4] J. Schmidhuber, Neural Networks 61, 85 (2015), arXiv:1404.7828.
[5] J. Carrasquilla and R. G. Melko, Nature Physics 13, 431 (2017).
[6] P. E. Shanahan, D. Trewartha, and W. Detmold, Phys. Rev. D 97, 094506 (2018).
[7] G. Carleo and M. Troyer, Science 355, 602 (2017).
[8] S. J. Wetzel, Phys. Rev. E 96, 022140 (2017).
[9] S. J. Wetzel and M. Scherzer, Phys. Rev. B 96, 184410 (2017).
[10] L. Wang, Phys. Rev. B 94, 195105 (2016).
[11] W. Hu, R. R. P. Singh, and R. T. Scalettar, Phys. Rev. E 95, 062122 (2017).
[12] L. Huang and L. Wang, Phys. Rev. B 95, 035105 (2017).
[13] J. Liu, Y. Qi, Z. Y. Meng, and L. Fu, Phys. Rev. B 95, 041101 (2017).
[14] J. Karpie, K. Orginos, A. Rothkopf, and S. Zafeiropoulos, JHEP 04, 057 (2019), arXiv:1901.05408 [hep-lat].
[15] J. M. Urban and J. M. Pawlowski, arXiv:1811.03533.
[16] M. Jarrell and J. E. Gubernatis, Phys. Rept. 269, 133 (1996).
[17] M. Asakawa, T. Hatsuda, and Y. Nakahara, Prog. Part. Nucl. Phys. 46, 459 (2001), arXiv:hep-lat/0011040.
[18] Y. Burnier and A. Rothkopf, Phys. Rev. Lett. 111, 182003 (2013), arXiv:1307.6106 [hep-lat].
[19] A. Rothkopf, (2019), arXiv:1903.02293 [hep-ph].
[20] V. Shah and C. Hegde, arXiv:1802.08406.
[21] H. Li, J. Schwab, S. Antholzer, and M. Haltmeier, arXiv:1803.00092.
[22] R. Anirudh, J. J. Thiagarajan, B. Kailkhura, and T. Bremer, arXiv:1805.07281.
[23] L. Ardizzone, J. Kruse, S. Wirkert, D. Rahner, E. W. Pellegrini, R. S. Klessen, L. Maier-Hein, C. Rother, and U. Köthe, arXiv:1808.04730 [cs.LG].
[24] R. Fournier, L. Wang, O. V. Yazyev, and Q. Wu, arXiv:1810.00913 [physics.comp-ph].
[25] H. Yoon, J.-H. Sim, and M. J. Han, Phys. Rev. B 98, 245101 (2018), arXiv:1806.03841 [cond-mat.str-el].
[26] A. K. Cyrol, J. M. Pawlowski, A. Rothkopf, and N. Wink, SciPost Phys. 5, 065 (2018), arXiv:1804.00945 [hep-ph].
[27] R. Oehme and W. Zimmermann, Phys. Rev. D 21, 1661 (1980).
[28] R. Oehme, Phys. Lett. B 252, 641 (1990).
[29] G. Cuniberti, E. De Micheli, and G. A. Viano, Commun. Math. Phys. 216, 59 (2001), arXiv:cond-mat/0109175 [cond-mat.str-el].
[30] Y. Burnier, M. Laine, and L. Mether, Eur. Phys. J. C 71, 1619 (2011), arXiv:1101.5534 [hep-lat].
[31] C. Nogueira dos Santos, K. Wadhawan, and B. Zhou, arXiv:1707.02198 [cs.LG].
[32] L. Wu, F. Tian, Y. Xia, Y. Fan, T. Qin, J. Lai, and T.-Y. Liu, arXiv:1810.12081 [cs.LG].
[33] Stan Development Team, "PyStan: the Python interface to Stan, version 2.17.1.0," http://mc-stan.org (2018).
[34] B. Carpenter, A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, and A. Riddell, Journal of Statistical Software 76, 1 (2017).