Deep learning for Gaussian process tomography model selection using the ASDEX Upgrade SXR system
Francisco Matos, Jakob Svensson, Andrea Pavone, Tomas Odstrcil, Frank Jenko
F. Matos, J. Svensson, A. Pavone, T. Odstrčil, F. Jenko and the ASDEX Upgrade Team†
Max Planck Institute for Plasma Physics, Boltzmannstr. 2, 85748 Garching, Germany
Max Planck Institute for Plasma Physics, Wendelsteinstr. 1, 17491 Greifswald, Germany
Plasma Science and Fusion Center, Massachusetts Institute of Technology, Cambridge, MA, USA
E-mail: [email protected]
Abstract.
Gaussian process tomography (GPT) is a method used for obtaining real-time tomographic reconstructions of the plasma emissivity profile in a tokamak, given some model for the underlying physical processes involved. GPT can also be used, thanks to Bayesian formalism, to perform model selection, i.e., comparing different models and choosing the one with maximum evidence. However, the computations involved in this particular step may become slow for data with high dimensionality, especially when comparing the evidence for many different models. Using measurements collected by the ASDEX Upgrade Soft X-ray (SXR) diagnostic, we train a convolutional neural network (CNN) to map SXR tomographic projections to the corresponding GPT model whose evidence is highest. We then compare the network's results, and the time required to calculate them, with those obtained through analytical Bayesian formalism. In addition, we use the network's classifications to produce tomographic reconstructions of the plasma emissivity profile, whose quality we evaluate by comparing their projection into measurement space with the existing measurements themselves.
Keywords: Deep learning, Gaussian process, CNN, Bayesian tomography
1. Introduction
Computed tomography generally refers to the process of imaging the interior of a body through indirect measurements. In many applications, this is achieved by focusing penetrating radiation on an object of interest from several directions and measuring the resulting decrease in radiation intensity on the opposite side (due to absorption by the body itself). Use of this information, the so-called projection of the object, allows one to reconstruct its internal properties[1].
In the case of radiative bodies, an alternative way to determine their properties is to perform cross-sectional imaging by treating the emitted radiation itself as a projection[2]. In the field of nuclear fusion, this procedure is employed in many tokamaks for the reconstruction of plasma emissivity profiles[3]. More specifically, in the ASDEX Upgrade tokamak, such imaging can be done with information from the soft X-ray (SXR) diagnostic, which measures the line-integrated radiation emitted by the plasma along several lines of sight; these can be used to perform tomographic reconstruction (or inversion) of the plasma emissivity profile. Knowledge of this is useful for exploring magnetohydrodynamic phenomena, in addition to studying the accumulation of impurities inside the plasma (particularly tungsten) due to their large contribution to the total amount of radiation[4].
Several techniques exist for solving the tomography problem[5]. One approach is to use regularization-based algorithms, namely Tikhonov- and minimum Fisher-based techniques[6]. More recently, work has also been done using machine learning methods, namely deep neural networks[7, 8, 9], that are trained to create new reconstructions based on existing ones. Yet another method is Gaussian process tomography (GPT)[10].
† See author list of H. Meyer et al. 2019 Nucl. Fusion 59 112014
GPT is an established method for performing tomographic inversion on many different types of physical distributions, which are modeled as posterior Gaussian distributions in a Bayesian setting. Computing a posterior first requires specifying a prior distribution, which encodes one's assumptions about the underlying physical process before any measurements of it are taken. The posterior can then be computed based on that prior, and on an observation (measurement) of the data generated by the physical process. The prior itself can either be a fixed distribution, or be drawn from a family of different models.
Knowing the posterior, GPT guarantees that one can obtain the most likely (maximum a posteriori, MAP) estimate for the tomographic reconstruction as well as its associated error values. More interestingly, however, through Bayesian inference, GPT prescribes a way to estimate the evidence for different models, through a process known as Bayesian model selection. This procedure can be of particular importance in cases where the choice of prior might have a strong effect on the results of the tomographic inversion.
Unfortunately, in a neural network, there are no guarantees[11] about whether the reconstructions obtained correspond to the MAP estimate of the underlying distribution, and there is no direct way, in standard deep neural networks such as convolutional neural networks, to obtain uncertainty estimates on the outputs. Bayesian neural networks[12, 13] and generative adversarial networks (GANs)[14] can generate probability distributions for their outputs; however, they can be computationally expensive and, in the case of GANs, difficult to train[15].
On the other hand, neural networks essentially store whatever function they have learned (through their training process) in their weights, making the inference process for new data very fast. With GPT, computing the MAP estimate based on a fixed model is also sufficiently fast for real-time purposes.
This does not necessarily hold true, however, when performing Bayesian model selection, since the process requires a series of additional computational steps, namely matrix inversions or the use of non-linear optimizers, which can be time-consuming, especially for data with a high dimensionality.
Thus, we propose an approach where we train a convolutional neural network (CNN) to learn the GPT model selection procedure. To do this, we take SXR measurement samples from several ASDEX Upgrade shots and, through Bayesian model selection formalism, compute for each data point the corresponding model (out of a set of possible, pre-defined ones) with the highest evidence. We then train the CNN to reproduce this step, i.e., to map measurements to their highest-evidence model. Finally, through the GPT framework, we compute the tomographic reconstruction of the plasma emissivity profile for each measurement, given the most likely models predicted by the CNN.
This paper is organized as follows. Section 2 gives an overview of the problem of tomography, in particular soft X-ray tomography in the ASDEX Upgrade tokamak, and the existing techniques to solve it, including a review of GPT with Bayesian model selection. Section 3 details the data we collected, the formulation of our problem, and the model proposed to solve it. Section 4 details the direct results of the neural network classification, and the tomographic reconstructions obtained based on them; section 5 describes and discusses our conclusions.
2. Background
The purpose of tomography is to reconstruct the internal (either two- or three-dimensional) properties of a given body from non-local measurements. Traditional tomographic algorithms achieve this by acquiring several projections of the object of interest from different (potentially all) directions[1]. Mathematically, a projection is a function that computes the line-integrated absorbency (or, in the case of fusion plasmas, emissivity) of a body along several paths or lines of sight (LOSs) as

P_θ(t) = ∫_{L(θ,t)} G(x, y) dL    (1)

where t is a point in the projection domain, L(θ, t) is the LOS crossing the body mapping to t (along a direction given by an angle θ), and G(x, y) is the two-dimensional physical distribution of interest (see Figure 1).
By computing several tomographic projections with different directions (i.e. different values of θ), it is possible to reconstruct G(x, y). For an exact reconstruction based only on the projections, an infinite number of them would need to be obtained. In practice, in many settings such as nuclear fusion experiments, it is difficult, or impossible, to obtain more than a handful of such projections, and some additional information, in the form of assumptions about the function G(x, y), must be introduced in order to obtain a tomographic reconstruction.

Figure 1: An illustrated projection, P_θ, measured along an angle θ. The blue area G(x, y) is the cross section of interest, and is being traversed by radiation. Because different rays traverse different areas of the object, the value at each point t in the projection space will be different.

In the ASDEX Upgrade tokamak, the soft X-ray (SXR) diagnostic[16] consists of eight pinhole cameras that measure the total radiation emitted by the plasma along 208 different volumes of sight (VOSs)[4]. We considered the extent of the VOSs in the toroidal and poloidal directions of the tokamak to be minimal, and treated them instead as lines of sight (LOSs).
In addition, we also ignored the fact that the LOSs in the same camera array partially overlap. Based on this, the measurements collected by the individual cameras correspond to a single projection of the underlying plasma emissivity distribution, which is computed at 208 discrete positions in a poloidal plane. In terms of the poloidal coordinates (R, z) of the 2D tokamak cross-section, the total brightness, b_i, incident on a single detector, i, is given by

b_i = ∫∫ M_i(R, z) G(R, z) dR dz    (2)

where G(R, z) is the plasma emissivity distribution (in W/m³) and M_i(R, z) is a function that yields the relative contribution of G(R, z) to b_i. This corresponds to evaluating the line integral of the plasma emissivity along the LOS corresponding to sensor b_i.
By discretizing equation 2, one obtains the plasma emissivity distribution at a finite number of positions (or pixels) along a tomographic reconstruction grid. In this case, the incident radiation on a single detector, assuming an associated noise, ξ, is[4]

b_i = Σ_{j=1}^{n} M_{i,j} g_j + ξ_i,    i ∈ {1, .., 208}.    (3)
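The discretized forward model of equation 3 can be sketched numerically. The array sizes below match the real system (208 channels, a 2400-pixel grid), but the geometric matrix, emissivity vector, and noise level are random stand-ins, not the actual ASDEX Upgrade geometry:

```python
import numpy as np

rng = np.random.default_rng(0)

n_pixels = 2400      # 60 x 40 reconstruction grid
n_channels = 208     # SXR lines of sight

# Hypothetical geometric matrix: entry (i, j) is the relative
# contribution of pixel j to detector i (random stand-in here;
# in the real system it encodes the LOS geometry).
M = rng.random((n_channels, n_pixels)) * 1e-4

g = rng.random(n_pixels)                  # toy emissivity vector
xi = rng.normal(0.0, 1e-3, n_channels)    # Gaussian measurement noise

# Equation (3): b_i = sum_j M_ij g_j + xi_i
b = M @ g + xi
```

The tomography problem is then the reverse direction: recovering g from the 208-dimensional b, which is heavily under-determined.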
Figure 2: ASDEX Upgrade cross-section, and partial schema of the SXR measurement system, with three cameras (F, L and M) shown (plot obtained with diaggeom).

From now on, we will denote the set of values b_i, that is, a set of 208 line-integrated SXR measurements of the plasma emissivity taken at a certain point in time, as the plasma's tomographic projection in that instant. We denote equation 3 as the forward model of the problem. Here, n corresponds to the total number of pixels on a tomographic reconstruction grid, whereas M_{i,j} is the discretization of the function M_i(R, z) in equation 2, mapping the relative contribution of pixel j of that grid to measurement i of the projection. The actual values of M were pre-defined and contingent on the geometry and configuration of the sensors inside the machine, which can vary between different shot campaigns. Consequently, we denote M as the geometric matrix. The goal of a tomographic reconstruction algorithm is then to use the tomographic projection (i.e., the 208 measurements b_i) to find a suitable tomographic reconstruction g that satisfies equation 3. However, this problem is highly ill-posed[17], since small changes in projection space can translate into large changes in the tomographic reconstructions. It is also under-determined, meaning the dimensionality of the reconstruction grid is much larger than that of the projection, ultimately resulting in an infinite number of solutions (reconstructions) that can fit the data.
To solve the ill-posed problem, traditional tomographic algorithms use regularization techniques, usually based on assumptions regarding the smoothness of the plasma emissivity profile, that constrain the space of possible solutions. Such algorithms, however, are often computationally expensive and typically can only be used for post-experimental tomographic reconstruction, due to computational time constraints.
In addition, the quality of the reconstructions is highly dependent on assumptions made about the data[4]. Generally, those assumptions are encoded into the reconstructions through the use of Tikhonov regularization. In this case, computing the tomographic reconstruction of the plasma emissivity profile becomes a matter of finding a reconstruction ĝ such that

ĝ = arg min_g ( ||M g − b||² + Λ O(g) )    (4)

where O(g) is a penalty term that encodes information about expected properties of the target plasma distribution, multiplied by a regularizing parameter Λ that controls the regularization strength[18]. There are several options for the choice of the regularization term O; typical choices are the Laplace operator, which favors smooth solutions, and minimum Fisher information[19], which favors solutions that are mostly flat in low-intensity regions and peaked in high-intensity ones.
Recent work has applied deep learning algorithms to the tomographic problem, namely by using de-convolutional neural networks to produce tomographic reconstructions taking measurement data as input[8]. This is achieved by training the networks on reconstructions that have been previously computed using standard tomographic algorithms. Generically, in a deep learning setting, a deep neural network is trained to learn a function mapping an input x into its target output y[20]; that is,

y = y(x, θ)    (5)

where θ denotes the neural network's parameters, i.e. its weights and biases. The training process consists in finding an optimal value for θ that minimizes the mismatch between the network's outputs and their corresponding labels.
In our setup, training a deep neural network to produce tomographic reconstructions would have required training it with measurements from the SXR diagnostic and pre-computed reconstructions, produced by other algorithms (namely, regularization-based ones).
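As an illustration of equation 4, the sketch below solves a small zeroth-order Tikhonov problem, i.e. with the simplest penalty O(g) = ||g||², which admits a closed form; the toy dimensions are arbitrary, and actual SXR reconstructions use Laplacian or minimum-Fisher penalties rather than this choice:

```python
import numpy as np

rng = np.random.default_rng(1)

# Under-determined toy problem: 8 measurements, 20 unknowns
# (mirroring the real 208-measurement / 2400-pixel imbalance).
M = rng.random((8, 20))
g_true = rng.random(20)
b = M @ g_true

# Zeroth-order Tikhonov: the minimizer of ||M g - b||^2 + lam * ||g||^2
# satisfies the regularized normal equations solved below.
lam = 1e-3
g_hat = np.linalg.solve(M.T @ M + lam * np.eye(20), M.T @ b)
```

For small Λ the regularized solution fits the data closely while remaining unique, which an unregularized least-squares problem of this shape would not be.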
The expectation would then have been that the parameters θ computed during training would have converged to values such that, if new, unseen data were fed into the network, it would be capable of generalizing to outside of its training set. However, even assuming good generalization capacity of a neural network, it is at most as good as whatever data it has been trained on. In other words, should existing tomograms have had errors, a neural network would have learned to reproduce them.
Another alternative is to use Bayesian probability theory to produce tomographic reconstructions, by treating the underlying unknown plasma emissivity distribution as a Gaussian process. Evaluating that process along a discrete set of points (the tomographic reconstruction grid) yields a multi-dimensional Gaussian distribution. By definition, in the Gaussian process framework, one assumes that multiple solutions for the tomographic reconstruction exist, in a Gaussian distribution of possible solutions. Treating the tomography problem with this framework allows using Bayesian formalism, which guarantees that the most likely solution for the tomographic reconstruction (i.e. the maximum a posteriori, or MAP, estimate), subject to some assumptions about the underlying physical and data distributions, can be computed through Bayes's formula,

P(A|B) = P(B|A) P(A) / P(B).    (6)

In the GPT setting, the terms in the formula are multivariate probability distributions, which are assumed to be Gaussian and are therefore specified by vectors of means (the individual means of each random variable) and a covariance matrix where each entry denotes the pair-wise covariance between those same variables.
In Bayes's theorem, the term P(A) is called the prior. In GPT, by denoting the underlying plasma emissivity as e, the prior distribution P(e) ∼ N(µ_pr, Σ_pr) encodes existing assumptions about the physical emission process, without observing any data (SXR measurements). Each random variable in the prior distribution is also Gaussian, and corresponds to the prior plasma emissivity e at each point x in the tomographic reconstruction grid. Here, the prior mean has a size equal to that of the tomographic reconstruction grid, n. Intuitively, the prior covariance matrix Σ_pr encodes information about the expected smoothness of the plasma emissivity. The entries in the covariance matrix are computed for all pairs of points in the reconstruction grid through a prior covariance function.
One covariance function generally used in Gaussian process regression is the squared exponential[21]; using this function, the prior covariance between a pair of points x₁ and x₂ in the tomographic reconstruction grid becomes

cov(x₁, x₂) = θ_f² exp( −(x₁ − x₂)² / (2 θ_x²) ).    (7)

The prior covariance depends only on the distance between points x₁ and x₂ and on θ = {θ_f, θ_x}, which are the model's hyperparameters. The parameter θ_f controls the prior variance of the plasma emissivity at a given location in the reconstruction grid, whereas the parameter θ_x, usually referred to as the length scale, controls the extent to which different points in the reconstruction grid are correlated. Models where the length scale is large yield high correlations even between grid points which are far apart, while smaller length scales yield covariance matrices where only points which are closer to each other are significantly correlated.
With these definitions, the prior becomes a probability distribution for the plasma emissivity, e, subject to the model's hyperparameters, i.e., P(e|θ), before any data, that is, a tomographic projection, has been observed. The prior can then be updated by multiplying it with the likelihood of the data d (as per Bayes's theorem), yielding the posterior distribution, P(e|d, θ):

P(e|d, θ) = P(d|e, θ) P(e|θ) / P(d|θ)    (8)

The denominator in Bayes's theorem is known as the model evidence or marginal likelihood; if one is merely computing the posterior P(e|d, θ), it can be ignored, as it is just a normalizing constant. Interestingly, however, one can use this term to compare several different models (each with their own prior), and choose the one which best fits the data[10]. In this case, one assumes a hyper-prior, from which different possible priors (individually specified by different hyperparameters) are sampled.
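A minimal sketch of the squared-exponential prior covariance of equation 7, evaluated on a toy grid of (R, z) coordinates (the grid extent and hyperparameter values here are illustrative assumptions, not the ones used in this work):

```python
import numpy as np

def squared_exponential(X, theta_f, theta_x):
    """Prior covariance (equation 7) between all pairs of grid points.

    X is an (n, 2) array of (R, z) coordinates; theta_f sets the prior
    standard deviation, theta_x the length scale."""
    # Pairwise squared Euclidean distances between grid points.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return theta_f ** 2 * np.exp(-d2 / (2.0 * theta_x ** 2))

# Toy 3 x 3 grid of poloidal coordinates.
R, z = np.meshgrid(np.linspace(1.2, 2.0, 3), np.linspace(-0.5, 0.5, 3))
X = np.column_stack([R.ravel(), z.ravel()])

K = squared_exponential(X, theta_f=1.0, theta_x=0.3)
```

By construction the matrix is symmetric, has θ_f² on the diagonal, and decays with distance at a rate set by θ_x.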
The evidence can then be computed for different models, a process referred to as marginalization, and the model with the highest evidence can be selected[22]. Calculating this requires an evaluation of the integral[23]

P(d|θ) = ∫ P(d|e, θ) P(e|θ) de.    (9)

In practice, by assuming that the absolute value of the noise ξ associated with the data (tomographic projection) also comes from a Gaussian distribution, the logarithm of the previous integral can be analytically calculated as[23]

log P(d|θ) = −(1/2) [ m log(2π) + log |K + Σ_d| + (d − f_L)ᵀ (K + Σ_d)⁻¹ (d − f_L) ]    (10)

where m is the number of SXR measurements in a tomographic projection, and Σ_d is a diagonal matrix whose non-zero entries are the individual variances, σ_i², of each measurement in the projection. Here, matrix K is a linear transformation of the prior covariance Σ_pr (imposed on the plasma emissivity) into measurement space, and is given by K = M Σ_pr Mᵀ, where M is the geometric matrix defined in equation 3. Similarly, f_L is the mapping of the prior mean, µ_pr (imposed in reconstruction space), onto measurement space, given by f_L = M µ_pr.
The marginalization procedure is particularly useful because the trade-off between model complexity and data fit is automatic: the model for which the evidence score is highest is always the simplest model that can explain the data, an embodiment of the Occam's razor principle[24]. In addition, the model evidence is also a function of the variances σ_i² of the data (through matrix Σ_d), which means that it is possible to treat the expected projection error as an additional hyperparameter of the model to be tuned; this can be done, for example, by treating the data variance as a fraction of the measured value of SXR radiation in the tomographic projection.
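The log evidence of equation 10 reduces to standard linear algebra once K and f_L are formed. The sketch below uses random stand-ins for the geometric matrix, prior, and data, and is not the implementation used in this work:

```python
import numpy as np

def log_evidence(d, M, mu_pr, Sigma_pr, sigma2):
    """Log marginal likelihood, equation (10).

    d: projection (m,); M: geometric matrix (m, n); mu_pr, Sigma_pr:
    prior mean and covariance in reconstruction space; sigma2:
    per-channel measurement variances (m,)."""
    m = d.size
    K = M @ Sigma_pr @ M.T          # prior covariance mapped to data space
    f_L = M @ mu_pr                 # prior mean mapped to data space
    C = K + np.diag(sigma2)
    _, logdet = np.linalg.slogdet(C)  # stable log-determinant
    r = d - f_L
    return -0.5 * (m * np.log(2 * np.pi) + logdet + r @ np.linalg.solve(C, r))

# Toy evaluation with a zero prior mean and random stand-in data.
rng = np.random.default_rng(2)
M_geo = rng.random((5, 12))
d = rng.normal(size=5)
le = log_evidence(d, M_geo, np.zeros(12), np.eye(12), np.full(5, 0.1))
```

Using `slogdet` and `solve` avoids forming the explicit inverse of (K + Σ_d), which matters for numerical stability with the large matrices this problem generates.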
This means that, through the GPT framework, one can estimate not only the most likely model for the underlying plasma emissivity distribution, but also the most likely error values for the data itself.
Once the most likely model is selected, and applying Bayes's formula, the posterior mean, µ_post, and posterior covariance, Σ_post, as a function of the mean µ_pr and covariance Σ_pr for that model, are respectively given by[25]

µ_post = µ_pr + Σ_pr Mᵀ (K + Σ_d)⁻¹ (d − f_L)    (11)

and

Σ_post = Σ_pr − Σ_pr Mᵀ (K + Σ_d)⁻¹ M Σ_pr.    (12)

By computing the posterior distribution P(e|d, θ) ∼ N(µ_post, Σ_post), one can then produce tomographic reconstructions either by sampling from P(e|d, θ), or simply by taking the mean of that distribution as the tomographic reconstruction (because the distribution is Gaussian, the mean corresponds to the MAP estimate). In addition, one can directly obtain uncertainties for the tomographic reconstruction from the diagonal values of the posterior covariance matrix, which correspond to the individual posterior variances of each pixel in the reconstruction grid.
The drawback of the marginalization procedure, however, is its potential computational complexity. First, the calculation of the evidence term involves a series of matrix multiplications and an inversion, which can be cumbersome, particularly in our setting, because of the dimensionality of the data, which generates very large matrices. Furthermore, the evidence must be computed for all models that are taken into consideration. When each model has several hyperparameters, the number of possible models to evaluate can become very large, which means that finding the optimal one can be time-consuming.
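Once the model is fixed, the posterior update of equations 11 and 12 is a single linear-algebra step. Below is a sketch on a toy problem (random geometric matrix, zero prior mean, identity prior covariance), assumptions chosen for illustration only:

```python
import numpy as np

def gp_posterior(d, M, mu_pr, Sigma_pr, sigma2):
    """Posterior mean and covariance, equations (11) and (12)."""
    K = M @ Sigma_pr @ M.T
    C = K + np.diag(sigma2)
    A = Sigma_pr @ M.T @ np.linalg.inv(C)   # gain applied to the residual
    mu_post = mu_pr + A @ (d - M @ mu_pr)   # equation (11)
    Sigma_post = Sigma_pr - A @ M @ Sigma_pr  # equation (12)
    return mu_post, Sigma_post

rng = np.random.default_rng(3)
M_geo = rng.random((5, 12))
mu_pr, Sigma_pr = np.zeros(12), np.eye(12)
d = rng.normal(size=5)
mu_post, Sigma_post = gp_posterior(d, M_geo, mu_pr, Sigma_pr, np.full(5, 0.1))
```

The diagonal of Σ_post gives the per-pixel posterior variances mentioned in the text; conditioning on data can only shrink these relative to the prior.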
For practical purposes, this limits the number of models which can be evaluated and thus potentially limits the quality of the results.
We therefore propose to bypass the need for analytical marginalization, by training a classifier (in this case, a CNN) to automatically choose the most likely model (out of several pre-defined ones) for the tomographic projection data collected by the ASDEX Upgrade SXR system.
This has several potential advantages. On one hand, a GP model, while potentially having priors and posteriors with many dimensions, can be fully specified by its much smaller set of hyperparameters. In practice, this allows for parameterizing a distribution of high dimensionality with only a few variables. In the case of this work, this means that neural networks will learn to map tomographic projections to a lower-dimensional space (of dimensionality equal to the number of models under consideration). This should facilitate the network's learning process, allowing for easier generalization when compared with DL methods that attempt to map projections directly into a reconstruction space of larger dimensions. On the other hand, for potential real-time applications, this method potentially speeds up GPT, since it bypasses the marginalization procedure.
3. Methods
For this work, we had at our disposal a collection of 112 ASDEX Upgrade shots, totalling 127528 data points (208-dimensional tomographic projections), with each dimension corresponding to a specific detector in the SXR system. The projections come from the down-sampled signal of the SXR diagnostic, at a sampling rate of 250 Hz. The dataset also contains an error model, which assigns every measurement in every projection an estimated error value. In many cases the SXR detectors can be damaged and yield completely erroneous measurements, such as, for example, negative brightness; in these cases, the corresponding error values in the existing error model are infinite. A sample projection can be seen in Figure 3.

Figure 3: Sample tomographic projection (SXR measurements) from an ASDEX Upgrade shot. Faulty measurements have been removed.

We also possessed a geometric matrix M that maps the relative contribution of each pixel in a 60 × 40 grid of poloidal coordinates (R, z), based on the poloidal dimensions of ASDEX Upgrade, to each SXR measurement; the tomographic reconstruction is computed on this grid. The geometric matrix itself was computed based on the physical layout of the SXR sensors in the ASDEX Upgrade vessel, and holds for all shots in our dataset.
Before training the neural network classifier, we generated its training and validation dataset by individually computing, for all the measured tomographic projections, the most likely model for the plasma emissivity distribution that generated those same projections. To that end, the first task was to define a hyper-prior, i.e., a prior for the models' hyperparameters; then, different models (individually specified by their specific set of hyperparameters) were compared based on their evidence.
As is typical in Gaussian process regression tasks[23], we defined all prior means, µ_pr, as vectors of zeros, of size 2400 (the size of the reconstruction grid). We computed the prior covariance matrices Σ_pr using a squared exponential function as defined in equation 7; in essence, this covariance function encodes our belief that correlations between pixels on the tomographic reconstruction grid will decay exponentially as the distance between those points increases. The covariance function has only two parameters: θ_f, the individual variance of single pixels, and θ_x, the length scale, which controls the extent of the correlation between pixels in the reconstruction grid. Given these parameters, the covariance function computed the covariance between two points x₁ and x₂ as a function of the distance between them. We computed the distance between two points in the reconstruction grid (expressed in terms of their poloidal coordinates) using the Euclidean definition, i.e., d(x₁, x₂) = √((R₁ − R₂)² + (z₁ − z₂)²).
We use the information in the data's existing error model to discard faulty measurements; that is, for a given projection, we discard measurements whose associated error is infinite. In practice, this means that, when carrying out the marginalization process and computing the MAP estimate for the plasma emissivity, some of the 208 measurements of each projection were not used. In addition, different projections have different damaged sensors.
We treat the remaining measurements in a projection as the mean values of a multivariate Gaussian distribution, where the variables are independent from each other. The variances, σ_i², of that distribution correspond to the entries in the diagonal of matrix Σ_d of equation 10, and represent the uncertainties in the measurements, derived from their noise. In a Gaussian distribution, approximately 99.73% of the data falls within 3 standard deviations (σ) of the mean (the 3-σ rule); we use this rule to assume that, if ξ_i is the absolute value of the error of measurement i, then 3σ_i = ξ_i.
One possibility would then have been to compute the variance of each measurement in a projection based on the values of ξ in the existing error model. In this work, however, we instead computed the value of ξ as a fraction of the noisy measurements themselves, by treating that fraction as an additional hyperparameter, θ_err, to be optimized through Bayesian marginalization. We note, however, that we assumed this value to be global, i.e., the same for all measurements in a projection; therefore, for a single SXR measurement m_i, ξ_i = θ_err m_i. Given our earlier assumption using the 3-σ rule, the standard deviation of each measurement i could then be calculated as σ_i = θ_err m_i / 3; treating θ_err as a scaling factor is the simplest assumption that can be made about the data errors, apart from using the existing error model.
Formalizing, we iteratively computed, for each individual data point d (i.e. projection), the highest-evidence model for the plasma emissivity and data distributions that generated that projection; that is, through equation 10, we looked for θ̂ = (θ̂_f, θ̂_x, θ̂_err) such that

θ̂ = arg max_θ log p(d|θ).

We searched for the ideal hyperparameters in a grid by assuming a uniform hyper-prior, and computed the model evidences at several discrete positions in the hyper-prior space. The question was then: what positions in the hyper-prior space should one evaluate the models' evidences on? This required taking several factors into account.
The first requirement was the expected nature of the plasma emission process itself.
A previous analysis of the measurement data, and of existing tomographic reconstructions from ASDEX Upgrade[4], showed that the plasma emissivity has a wide dynamic range for different regions of the plasma, with emissivity in the plasma core being up to 3 orders of magnitude higher than in the pedestal. Likewise, the maximum radiation value in the reconstruction grid could differ by orders of magnitude between different phases of different shots. Thus, we considered this range in emissivities a good region in which to explore possible values for the hyperparameter θ_f. In addition, ASDEX Upgrade has a minor radius a = 0.5 m (horizontally) and b = 0.8 m (vertically)[26]; given this and the size of our reconstruction grid, we assumed that a good region of the hyper-prior in which to evaluate the evidence for values of θ_x ranged, in the limit, from 0 (no correlation at all between pixels) to 1.6. For the hyperparameter θ_err we assumed that, in the limit, it could range from 0 (no noise in the tomographic projections) to 1 (all of the measured brightness corresponded to noise).
The second requirement related to the training process for neural networks. For this work, we wanted to train a neural network to perform a classification task: to learn to map measurements to the most likely model. Typically, in a machine learning classification setting, care should be taken such that training samples fed to a network are reasonably balanced with respect to their different classes; that is, a good training practice is that one class not be too over-represented in the data when compared to others.

Figure 4: Number of data samples (from all available shots) mapping to each class (set of hyperparameter values), after the marginalization procedure.

In practice, considering these requirements, we performed several evaluations of the hyper-prior at different positions through trial and error; we settled on a 3 × 3 × 3 grid of values for (θ_f, θ_x, θ_err), which corresponds to 27 GP models. Performing the Bayesian model selection procedure on all projections in our dataset using the models parameterized by these values of (θ_f, θ_x, θ_err) yielded a relative balance in terms of the amount of data samples mapping to each of the 27 possible classes (points on the hyperparameter grid); this can be seen in Figure 4, which shows the number of points mapping to each class. Computing the evidence for different models for all tomographic projections took a total of 48 h. This dataset, i.e., the mappings between tomographic projections and the class to which their highest-evidence model belongs (out of 27 possible ones), was then used to train and test the neural network classifier.
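The grid-search marginalization described above can be sketched as follows. The hyperparameter values, the 16-pixel toy grid, and the 1D stand-in covariance below are illustrative assumptions, not the actual 3 × 3 × 3 grid or 2400-pixel geometry used in this work; the noise variances follow the 3-σ rule, σ_i = θ_err m_i / 3:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(4)
n_pix, n_chan = 16, 6
M_geo = rng.random((n_chan, n_pix))
d = M_geo @ rng.random(n_pix)      # toy noise-free projection (all positive)

# 1D stand-in for the reconstruction grid, used to build a squared
# exponential prior covariance (equation 7) per candidate length scale.
idx = np.linspace(0.0, 1.0, n_pix)
D2 = (idx[:, None] - idx[None, :]) ** 2

def log_evidence(theta_f, theta_x, theta_err):
    # Equation (10) with a zero prior mean, so f_L = 0.
    Sigma_pr = theta_f ** 2 * np.exp(-D2 / (2.0 * theta_x ** 2))
    sigma = theta_err * d / 3.0    # 3-sigma rule: sigma_i = theta_err * m_i / 3
    C = M_geo @ Sigma_pr @ M_geo.T + np.diag(sigma ** 2)
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (d.size * np.log(2 * np.pi) + logdet
                   + d @ np.linalg.solve(C, d))

# Illustrative 3 x 3 x 3 hyper-prior grid: evaluate all 27 candidate
# models and keep the one with the highest evidence.
grid_f, grid_x, grid_err = [0.5, 1.0, 2.0], [0.1, 0.3, 0.6], [0.05, 0.1, 0.2]
best = max(product(grid_f, grid_x, grid_err), key=lambda th: log_evidence(*th))
```

The index of the winning grid point is the class label the CNN is trained to predict directly from the projection, bypassing the 27 evidence evaluations at inference time.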
Several possibilities exist when it comes to modelling deep neural network architectures. For our purposes (the learning of the Bayesian model selection procedure) we opted to use a CNN. CNNs are widely used for signal processing tasks, due to their ability to efficiently detect spatial correlations in data, which is what we expected to find in our SXR measurements. The model we used is, with regards to its architecture, inspired by the VGG network for classification of images[27]. We designed the model using the Keras framework for deep learning[28].

Figure 5: Schematic of the deep learning model used for this work.

The network itself receives two inputs: a tomographic projection (208 SXR measurements), and a corresponding mask of ones and zeros (taken from the existing error model in our dataset), that gives information regarding which measurements in the projection are assumed to be faulty. The network uses a series of convolutional layers followed by max pooling layers to process high-level features in the measurement data. The output of those layers is then combined with the information in the error mask and processed in the last layers of the network, which are standard fully-connected layers.
We also used batch normalization[29] layers to speed up training, and dropout in the final layer[30] to improve the network's ability to generalize beyond its training set. We used the rectified linear unit (ReLU) activation function throughout the entire network apart from the last layer, which uses a softmax function, because we modelled the network output as probabilities over 27 possible classes, which must sum to 1. For the same reason, we used categorical cross-entropy as the loss function. We used the Adam optimizer[31] and left all optimizer hyperparameters at their default values. Figure 5 shows a schematic of the neural network architecture.
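The two-input design and training choices described above can be sketched with the Keras functional API. The layer counts and widths below are illustrative assumptions (the exact sizes from Figure 5 are not reproduced here); only the overall structure – conv/pool stages on the projection, concatenation with the validity mask, dense layers, 27-way softmax, Adam with categorical cross-entropy – follows the text:

```python
from tensorflow.keras import layers, Model

# Layer widths are illustrative; the exact sizes in the paper's Figure 5
# are not reproduced here.
proj_in = layers.Input(shape=(208, 1), name="projection")  # 208 SXR channels
mask_in = layers.Input(shape=(208,), name="valid_mask")    # 1 = valid, 0 = faulty

x = proj_in
for filters in (32, 64):  # VGG-style stages: conv, conv, batch norm, max pool
    x = layers.Conv1D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv1D(filters, 3, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling1D(2)(x)
x = layers.Flatten()(x)

# Merge the convolutional features with the error mask, then classify.
x = layers.Concatenate()([x, mask_in])
x = layers.Dense(128, activation="relu")(x)
x = layers.BatchNormalization()(x)
x = layers.Dropout(0.5)(x)                        # dropout in the final stage
out = layers.Dense(27, activation="softmax")(x)   # one probability per GP model

model = Model(inputs=[proj_in, mask_in], outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```

The softmax output guarantees the 27 class probabilities sum to 1, which is what makes categorical cross-entropy the natural loss for this task.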
4. Results
Here, we performed two separate assessments. First, we evaluated the accuracy of the neural network's fit of the individual projections to their highest-evidence models. Then, based on the highest-probability class determined (by the network) for each data point, we computed the corresponding MAP estimate of the tomographic profile, and measured the quality of those reconstructions by projecting them back into measurement space (through the forward model in equation 3), obtaining their back-projections. We then measured the percentage error of those back-projections when compared to the original tomographic projections.
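The second assessment can be sketched in NumPy. The geometry matrix `T` and the MAP estimate `g_map` below are random stand-ins with the shapes stated in the text (208 measurements, 2400 grid points); only the back-projection and percentage-error computation are the point of the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: T is the geometry (forward) matrix mapping the 2400-pixel
# emissivity grid to 208 line-integrated measurements; values are synthetic.
T = rng.random((208, 2400)) / 2400.0
g_map = rng.random(2400)  # MAP estimate of the emissivity profile

# Synthetic "measured" brightness: forward projection plus 5% noise.
d_measured = T @ g_map * (1 + 0.05 * rng.standard_normal(208))

# Back-projection: push the reconstruction through the forward model.
d_back = T @ g_map

# Percentage error per line of sight, relative to the measured brightness.
pct_err = 100.0 * np.abs(d_back - d_measured) / np.abs(d_measured)
```

In the actual evaluation this error is computed for every valid measurement of every data point, yielding the histograms discussed later.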
To increase the robustness of our methods, we opted to train an ensemble of neural networks (with identical architectures) using the k-fold cross-validation strategy[32]. K-fold cross-validation is useful for determining whether the choice of train/test split has biased the results obtained, or whether the results can be assumed to hold independently of the data split. We divided our data into k = 10 folds; that is, we trained 10 networks on different, overlapping splits of training data and tested them on non-overlapping validation splits. We trained the networks for 50 epochs and ran them on an NVIDIA Quadro RTX 5000 Graphics Processing Unit. The total training time for the whole ensemble was 1 h, while the total prediction time for the validation data was approximately 41 s.

Each data point x (corresponding to a tomographic projection) in our dataset was assigned a label, y^label, denoting for which of the 27 model classes the evidence was highest. A classifier learns, through the training process, to compute the probability of that point belonging to a certain class, P(C(x) = c), where c can take one out of 27 possible values; we denote the vector containing the probabilities of belonging to each of those classes by y^pred. We further define y^pred_k as the k-th most likely class given by a classifier for x; for example, for k = 1 one gets

y^pred_1 = arg max_c P(C(x) = c),

whereas for k = 27 one has

y^pred_27 = arg min_c P(C(x) = c).

Based on this, the top-k accuracy metric then calculates, for each data point,

acc_k(x) = 1 if y^label ∈ {y^pred_1, ..., y^pred_k}, and 0 otherwise.

Figure 6: Top-k accuracy (up to k = 27) for validation data. The blue bars indicate the mean accuracy across the ensemble of 10 networks, while the smaller black bars indicate the accuracy's standard deviation across the ensemble (for each k).
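The top-k accuracy metric above takes only a few lines to implement. A minimal sketch with a 3-class toy example (the actual problem has 27 classes):

```python
import numpy as np

def top_k_accuracy(y_label, y_pred_probs, k):
    """Fraction of samples whose true class is among the k most probable ones.

    y_label: (n,) integer class labels; y_pred_probs: (n, n_classes) probabilities.
    """
    # Indices of the k highest-probability classes for each sample.
    top_k = np.argsort(y_pred_probs, axis=1)[:, -k:]
    hits = [label in row for label, row in zip(y_label, top_k)]
    return float(np.mean(hits))

probs = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3]])
labels = np.array([1, 2])
# In both samples the true class is only the 2nd most likely prediction,
# so top-1 accuracy is 0 while top-2 accuracy is 1.
assert top_k_accuracy(labels, probs, k=1) == 0.0
assert top_k_accuracy(labels, probs, k=2) == 1.0
```

By construction, the metric is non-decreasing in k and reaches 1 at k equal to the number of classes, which is why Figure 6 saturates at k = 27.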
In addition to evaluating the neural network's capacity for classification purposes, we also produced and evaluated tomographic reconstructions. To that end, we took, for each data point in the validation dataset, the most likely class prediction given by the neural network; this class prediction maps to one of the models we previously defined. We then computed, based on the class prediction and on equations 11 and 12, the posterior mean and covariance for each data point. We took the posterior means (i.e. the MAP estimates) as the tomographic reconstructions of the plasma emissivity: each mean is a 2400-dimensional vector, where each entry µ_j denotes the most likely value of the plasma emissivity at point j of the reconstruction grid. The posterior covariances allowed us to determine the error of the tomographic reconstruction: the diagonal of the covariance matrix corresponds to the individual variance σ_post,j of each pixel in the reconstruction grid. We converted that variance into a percentage error by once again taking advantage of the 3σ rule, computing said percentage, for pixel j, as

%err_j = 3 √(σ_post,j) / µ_j × 100.

Figure 7: Sample tomographic reconstruction and error, and comparison between the SXR measurement and the back-projected reconstruction, for an ASDEX Upgrade shot.
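The per-pixel error formula above is a one-liner once the posterior mean and covariance diagonal are available. The values below are synthetic stand-ins with the 2400-pixel grid size from the text:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins: MAP emissivity per grid point and the diagonal of the
# posterior covariance (the per-pixel variances); values are synthetic.
mu = rng.uniform(0.5, 2.0, size=2400)
sigma2_post = rng.uniform(1e-4, 1e-2, size=2400)

# 3-sigma percentage error per pixel: %err_j = 3 * sqrt(sigma2_j) / mu_j * 100
pct_err = 3.0 * np.sqrt(sigma2_post) / mu * 100.0
```

For instance, a pixel with µ_j = 2.0 and variance 0.01 gets a percentage error of 3 × 0.1 / 2.0 × 100 = 15%.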
The model hyperparameters determined by the classifier were θ_f = 500, with θ_err and θ_x both below 1.

Figure 8: Sample tomographic reconstruction and error, and comparison between the SXR measurement and the back-projected reconstruction, from a different ASDEX Upgrade shot; the classifier again selected a model with θ_f = 500.

To evaluate the quality of the tomographic reconstructions, we performed, for each reconstruction, a pass through the forward model defined in equation 3 to obtain the corresponding back-projection, i.e. the projection of the reconstruction back into measurement space. Performing this check allowed us to see how well the obtained MAP estimate actually fits the original SXR data; to do so, we computed the percentage error between the back-projection and the original tomographic projection, for every measurement in every data point. The histograms in Figures 9a and 9b show the results of this evaluation up to an error of 100%, a threshold which covers 99.38% of the validation data. Note that the error was computed by comparing the back-projection only with valid measurements (i.e. those not labeled by the existing data model as having infinite error). On average, 91.25% of the 208 measurements in each data point were used to compute the tomographic reconstructions.

Figure 9: Distribution of the error values of the tomographic reconstructions' back-projections into measurement space: (a) probability density, (b) cumulative distribution. 54.4% of the individual back-projected measurements have a relative error lower than 10%; 93.83% have an error lower than 50%; and 99.38% have an error lower than 100%.
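The threshold fractions reported for Figure 9 are simply values of the empirical cumulative distribution of the per-measurement errors. A minimal sketch with synthetic errors (the real distribution comes from the back-projection check, not from the exponential used here):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in for the per-measurement back-projection errors (percent).
errors = rng.exponential(scale=20.0, size=10_000)

def fraction_below(err, threshold):
    """Empirical CDF value: fraction of errors strictly below a threshold."""
    return float(np.mean(err < threshold))

# Evaluate the empirical CDF at the thresholds quoted in the text.
for t in (10, 50, 100):
    print(f"error < {t}%: {100 * fraction_below(errors, t):.1f}%")
```

Evaluating `fraction_below` at 10%, 50% and 100% on the real error array yields the 54.4%, 93.83% and 99.38% figures quoted above.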
An analysis of Table 4.1 and Figure 6 shows that the ensemble of 10 neural networks achieves very good results on the classification task, with a mean top-5 accuracy score of 0.976 (out of a maximum score of 1) for validation data. Furthermore, the standard deviation of the accuracy score is consistently low, indicating that the choice of train/test split for our data did not significantly bias the achieved results; all neural networks in the ensemble behave similarly, even when tested on different data. Taking the highest-probability model hyperparameters computed by the classifiers for the entire validation set, and computing the corresponding MAP estimates of the tomographic reconstructions, we similarly see good results; the reconstructions' back-projections mostly show very good agreement with the original tomographic projections, with more than 90% of the back-projected data having an error lower than 50%.
5. Conclusions
Gaussian process tomography makes it possible to obtain the most likely estimate foran unknown, potentially infinite-dimensional, quantity, given some assumptions aboutthe underlying physical distribution and about the data generated by that distribution.The tomography problem, based on SXR measurement data from the ASDEX Upgradetokamak, lends itself to investigation under this framework. If one assumes a fixed modelfor the behavior of the underlying physical distribution (i.e. the plasma emissivity) andfor the data, for example by specifying the length-scales involved in the emission processand the expected fraction of noise in the measurements, GPT inversion techniquesreadily yield the corresponding maximum a posteriori estimate of the plasma SXRemissivity in the two-dimensional tokamak cross-section.Nevertheless, this raises the issue of what models one would like to assume in thefirst place. Through the Bayesian Occam’s Razor principle, GPT answers this questionby computing the evidence for different possible models, out of which the one with thehighest score can then be selected. This can be useful if one wishes to test differentassumptions regarding the data distribution: for example, what fraction of noise canbe expected in the observations (measurements)? However, in a setting such as SXRtomography with ASDEX Upgrade data, this task can become cumbersome due to thedimensionality of the tomographic projections. This is further compounded when thenumber of models under evaluation is large.For that reason, we developed a novel method for automatic selection of the bestmodel (out of 27 pre-defined ones) for the plasma SXR emissivity distribution andthe corresponding data, for measurements from the ASDEX Upgrade tokamak. Theindividual models had different assumptions regarding the noise level in the collecteddata, the correlations between variables in the tomographic reconstruction grid, and theindividual variances of those same variables. 
The method consisted of training a convolutional neural network to perform the Bayesian model selection (marginalization) procedure, bypassing the need to perform that task analytically. Our results show that the neural network achieved good classification results when compared with the analytical Bayesian marginalization step, with top-5 accuracy (out of 27 possible classes) reaching a value of 0.976 (out of a maximum of 1). Furthermore, while the marginalization procedure across the entire dataset (of 127528 tomographic projections) took approximately 48 h using analytical methods, the same computation, performed by the neural network, took only 43 s. Thus, the neural network approach can be particularly useful in high-dimensional data settings such as ours, as well as in problems where the number of models under consideration is large, which would otherwise render the model comparison problem too slow for analytical methods. It can also be valuable in settings that require not only real-time inversion of tomographic profiles, but also real-time comparison of different models for certain physical distributions.
Acknowledgements
This work has been carried out within the framework of the EUROfusion Consortium and has received funding from the Euratom research and training programme 2014-2018 under grant agreement No 633053. The views and opinions expressed herein do not necessarily reflect those of the European Commission.

References

[1] Kak A C and Slaney M 1988 Principles of Computerized Tomographic Imaging (IEEE Press)
[2] Ingesson L, Alper B, Peterson B and Vallet J C 2008 Fusion Science and Technology
[3] Fusion Science and Technology
[4] Review of Scientific Instruments
[5] Journal of Fusion Energy
[6] Fusion Science and Technology
[7] IEEE Transactions on Plasma Science
[8] Matos F A, Ferreira D R, Carvalho P J and JET Contributors 2017 Fusion Engineering and Design
[9] Laser and Particle Beams
[11] Proceedings of the 32nd International Conference on Machine Learning (Proceedings of Machine Learning Research vol 37) ed Bach F and Blei D (Lille, France: PMLR) pp 1613-1622 URL https://icml.cc/2015/
[12] Gal Y and Ghahramani Z 2016 Dropout as a Bayesian approximation: representing model uncertainty in deep learning International Conference on Machine Learning pp 1050-1059 URL https://icml.cc/2016/index.html
[13] Pavone A, Svensson J, Langenberg A, Pablant N, Hoefel U, Kwak S, Wolf R and the W7-X Team 2018 Review of Scientific Instruments
[14] Radford A, Metz L and Chintala S arXiv preprint arXiv:1511.06434
[15] Salimans T, Goodfellow I, Zaremba W, Cheung V, Radford A and Chen X 2016 Improved techniques for training GANs Advances in Neural Information Processing Systems pp 2234-2242 URL https://nips.cc/Conferences/2016
[16] Igochine V, Gude A et al. Plasma Physics and Controlled Fusion
[17] Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment
[18] Plasma Physics and Controlled Fusion
[20] Goodfellow I, Bengio Y and Courville A Deep Learning (MIT Press)
[21] Li D, Svensson J, Thomsen H, Medina F, Werner A and Wolf R 2013 Review of Scientific Instruments
[22] Neural Computation
[23] Rasmussen C E and Williams C K I Gaussian Processes for Machine Learning (MIT Press)
[24] MacKay D 1991 Bayesian Methods for Adaptive Models Ph.D. thesis, Caltech
[25] Svensson J, Werner A, JET-EFDA Contributors et al. Plasma Physics and Controlled Fusion
[26] Plasma Physics and Controlled Fusion
[27] Simonyan K and Zisserman A 2015 Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations
[28] Chollet F et al. Keras URL https://github.com/fchollet/keras
[29] Ioffe S and Szegedy C 2015 Batch normalization: accelerating deep network training by reducing internal covariate shift Proceedings of the 32nd International Conference on Machine Learning (Proceedings of Machine Learning Research vol 37) ed Bach F and Blei D (Lille, France: PMLR) pp 448-456 URL http://proceedings.mlr.press/v37/ioffe15.html
[30] Li X, Chen S, Hu X and Yang J 2019 Understanding the disharmony between dropout and batch normalization by variance shift pp 2677-2685 ISSN 1063-6919 URL http://cvpr2019.thecvf.com/
[31] Kingma D P and Ba J 2014 arXiv preprint arXiv:1412.6980
[32] James G 2013 An Introduction to Statistical Learning (Springer)