Comparison of classical and Bayesian imaging in radio interferometry

Cygnus A with CLEAN and resolve

Philipp Arras, Hertzog L. Bester, Richard A. Perley, Reimar Leike, Oleg Smirnov, Rüdiger Westermann, and Torsten A. Enßlin

Max-Planck-Institut für Astrophysik, Karl-Schwarzschild-Str. 1, 85748 Garching, Germany
Ludwig-Maximilians-Universität München (LMU), Geschwister-Scholl-Platz 1, 80539 München, Germany
Technische Universität München (TUM), Boltzmannstr. 3, 85748 Garching, Germany
National Radio Astronomy Observatory, P.O. Box O, Socorro, NM 87801, USA
South African Radio Astronomy Observatory, 2 Fir Street, Black River Park, Observatory (North Gate entrance), 7925, South Africa
Department of Physics and Electronics, Rhodes University, Grahamstown, 6140, South Africa

Received <date> / Accepted <date>

ABSTRACT
CLEAN, the commonly employed imaging algorithm in radio interferometry, suffers from the following shortcomings: in its basic version it does not have the concept of diffuse flux, and the common practice of convolving the CLEAN components with the CLEAN beam erases the potential for super-resolution; it does not output uncertainty information; it produces images with unphysical negative flux regions; and its results are highly dependent on the so-called weighting scheme as well as on any human choice of CLEAN masks used to guide the imaging. Here, we present the Bayesian imaging algorithm resolve, which solves the above problems and naturally leads to super-resolution. In this publication we take a VLA observation of Cygnus A at four different frequencies and image it with single-scale CLEAN, multi-scale CLEAN, and resolve. Alongside the sky brightness distribution, resolve estimates a baseline-dependent correction function for the noise budget, the Bayesian equivalent of weighting schemes. We report noise correction factors of up to 340. These advantages of resolve are paid for by higher computational effort.

Key words. techniques: interferometric – methods: statistical – methods: data analysis – instrumentation: interferometers
1. Introduction
Radio interferometers provide insights into a variety of astrophysical processes which deepen our knowledge of astrophysics and cosmology in general. A common strategy to improve radio observations is to upgrade the hardware: increase the number of antennas or their sensitivity. In this publication we take the orthogonal approach and improve one part of radio pipelines, the imaging and deconvolution step. Interferometers do not directly measure the sky brightness distribution but rather measure modified Fourier components of it. Therefore, the step from the data to the image is non-trivial.

One of the first deconvolution algorithms, single-scale CLEAN (Högbom 1974), is still in use today. It was developed for the computational resources of the 1970s and assumes that the sky brightness distribution consists of point sources. The basic idea behind single-scale CLEAN is to transform the Fourier data into image space, find the brightest point sources in descending order, simulate a measurement of those point sources, subtract them from the data, and iterate. Finally, the collection of point sources, called CLEAN components, is usually convolved with the so-called CLEAN beam, which is supposed to represent the intrinsic resolution of the radio interferometer. In practice, this algorithm converges to some approximation of the actual sky brightness distribution.

The assumption that the sky consists of point sources is problematic because typical radio interferometers are capable of capturing faint diffuse emission as well. Therefore, Cornwell (2008); Rau & Cornwell (2011); Offringa & Smirnov (2017) extended CLEAN to using Gaussian-shaped structures as basis functions. The resulting algorithm is called multi-scale CLEAN and is the de-facto standard for deconvolving extended structures.

There are several major reasons to rethink the CLEAN approach to imaging and deconvolution, now that more computational resources are available and significant progress in Bayesian inference has been made compared to the 1970s. First, in order to allow CLEAN to undo initial and too greedy flux assignments, the above-mentioned CLEAN components are usually not required to be positive. Therefore, the final sky brightness distribution is not necessarily positive, and almost all maps produced from radio interferometric data contain unphysical negative-flux regions. Second, the convolution with the CLEAN beam fundamentally limits the resolution of the image, although it is known that super-resolution is possible (Honma et al. 2014; Dabbech et al. 2018). In particular, the location of strong sources can be determined with much higher accuracy than suggested by the CLEAN beam. Third, the weighting scheme, i.e. a function which rescales the influence of each data point on the final image depending on the baseline length or the proximity of other measurements, crucially influences the output image. A prescription for setting the weighting scheme such that the resulting image resembles the actual sky brightness distribution in the best possible way does not exist. Finally, CLEAN does not output reliable uncertainty information.

We intend to solve the above issues by updating the Bayesian imaging algorithm resolve developed in Arras et al. (2019b, 2018) and originally pioneered by Junklewitz et al. (2015); Greiner et al. (2016). Bayesian inference is the framework of choice for this as it is the only consistent extension of Boolean logic to uncertainties via real-valued probabilities (Cox 1946). resolve is formulated in the language of information field theory (Enßlin et al. 2009) in symbiosis with the inference algorithm Metric Gaussian Variational Inference (MGVI, Knollmüller & Enßlin 2019). It combines the imaging and deconvolution steps of the CLEAN approach. Indeed, resolve significantly improves the resolution of the image; super-resolution is built in.

Bayesian imaging in radio astronomy is not new. Most prominently, maximum entropy imaging was one of the first such algorithms, based on the minimalistic prior assumption that photons could arrive from all directions and that no intrinsic emission structures shall be assumed a priori (Cornwell & Evans 1985; Gull & Skilling 1984). While this has been proven to be particularly successful for imaging diffuse emission, Junklewitz et al. (2016, Section 3.2.2) demonstrate that resolve can outperform maximum entropy imaging. The reasons include that the latter does not assume any correlations between pixels a priori and assumes a brightness distribution for each pixel with an exponential cut-off for high values.

Related approaches include Sutton & Wandelt (2006) and Sutter et al. (2014), who use Bayesian inference as well. Those are, however, limited to Gaussian priors and relatively few pixels because Gibbs sampling is used. Furthermore, Cai et al. (2018) combine the compressive sensing approach with Markov-Chain Monte Carlo sampling.

The paper is structured as follows: section 2 describes the underlying data model common to the compared imaging algorithms. Section 3 defines the novel resolve algorithm and specifies the prior assumptions, and section 4 recapitulates the single-scale CLEAN and multi-scale CLEAN algorithms. All three algorithms are compared in section 5 by applying them to the same four data sets.
2. Measurement model
Astrophysical signals undergo a variety of transformations as they travel from their source to where they are observed on Earth. We restrict ourselves to an ideal, unpolarised phase tracking interferometer, in which case the measurement process obeys (Hamaker et al. 1996; Smirnov 2011):

    d_uvw = n_uvw + ∬_{(l,m)∈ℝ²: l²+m²<1} [A(l,m) I(l,m) / √(1−l²−m²)] e^{2πi[ul+vm+w(1−√(1−l²−m²))]} d(l,m),    (1)

where d_uvw represents the data taken by the interferometer (commonly referred to as visibilities), n_uvw represents an additive noise realization, A(l,m) is the antenna sensitivity pattern, and I(l,m) is the true sky brightness distribution. The data space coordinates (u, v, w) record the relative positions of antenna pairs as the Earth rotates under the frame of the sky. The coordinates (l, m, √(1−l²−m²)) denote the positions of points on the celestial sphere. The integral goes over the half of the sphere which is above the horizon. An important special case is when the array is coplanar (w → 0) or when looking at a very small patch of sky (l² + m² → 0). Then (1) reduces to a two-dimensional Fourier transform of the apparent sky A(l,m) I(l,m). This assumption, referred to as the coplanar array approximation, is discussed further in section 4.

In practice the integral in (1) is discretized to allow numerical evaluation. Then, the measurement model simplifies to:

    d = RI + n,    (2)

where R ∈ Lin(ℝ^N, ℂ^M) is a discretization of (1), which maps a discretized image I ∈ ℝ^N into visibilities s ∈ ℂ^M, and n ∈ ℂ^M is the noise present in the observation. Both resolve and wsclean use the software library ducc (Distinctly Useful Code Collection, https://gitlab.mpcdf.mpg.de/mtr/ducc) for evaluating the integral.

Since visibilities consist of an average of a large number of products of voltages, it can be assumed, by the central limit theorem, that the noise is Gaussian with diagonal covariance N: n ∼ G(n, N). Thus, the likelihood probability density is given by:

    P(d | I, N) = G(d − RI, N) := (1/√|2πN|) e^{−½ (d−RI)† N⁻¹ (d−RI)},    (3)

where † denotes the complex conjugate transpose. For better readability, but also because it is the quantity which needs to be implemented for resolve, we define the information Hamiltonian H(d | I, N) := −log P(d | I, N) (Enßlin et al. 2009). Then,

    H(d | I, N) = ½ (d−RI)† N⁻¹ (d−RI) + h(N),    (4)

where h(N) is a normalization term constant in I. Many traditional imaging algorithms employ this expression without h(N) as the data fidelity term which ought to be minimised.

We conclude the section with two comments. First, note that (4) stores all information about the measurement device and the data at hand. No specific assumptions about the data processing have been made yet. Therefore, (4) is the starting point of both resolve and CLEAN. We call the process of turning (4) into an image "imaging" and do not differentiate between "imaging" and "deconvolution". Second, the process of recovering the true sky brightness distribution from the measured visibilities is an inverse problem. In (2), the sky I cannot be computed uniquely from d and N alone because the Fourier space coverage (commonly called uv-coverage) is not complete and because of the presence of noise. We may know the noise level N but we never know the noise realization n. This is why turning data into the quantity of interest, in our case I, is a non-trivial task. The appearance of uncertainties is a direct consequence of the non-invertibility of R and the presence of n.
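To make the measurement model concrete, the following Python sketch implements a toy version of (2)-(4) for a small number of pixels and visibilities, assuming the coplanar array approximation (w = 0) and A = 1. The direct Fourier sum stands in for the gridding-based implementation in ducc, and all names are illustrative, not part of any resolve or ducc API.

    import numpy as np

    def response(image, lm, uv):
        """Toy measurement operator R of eq. (2): direct Fourier sum over pixels.

        image: (npix,) discretized sky, lm: (npix, 2) direction cosines,
        uv: (nvis, 2) baseline coordinates in wavelengths.
        """
        phase = 2j * np.pi * (uv @ lm.T)      # (nvis, npix) phase matrix
        return np.exp(phase) @ image          # visibilities in C^M

    def information_hamiltonian(d, model_vis, var):
        """H(d|I,N) of eq. (4) for diagonal noise covariance N, up to h(N)."""
        r = d - model_vis
        return 0.5 * np.real(np.sum(np.conj(r) * r / var))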
3. Resolve

resolve is a Bayesian imaging algorithm for radio interferometers. It is formulated in the language of information field theory (Enßlin et al. 2009; Enßlin & Frommert 2011; Enßlin 2018) and was first presented in Junklewitz et al. (2015) and then upgraded in Junklewitz et al. (2016); Greiner et al. (2016) and Arras et al. (2018). Arras et al. (2019b) added antenna-based direction-independent calibration to resolve such that calibration and imaging can be performed simultaneously. In this publication another resolve feature is presented for the first time: automatic data weighting. Additionally, the diffuse sky model is updated to a special case of the model presented in Arras et al. (2020a). The implementation is free software (https://gitlab.mpcdf.mpg.de/ift/resolve).

3.1. Bayesian inference

resolve views radio interferometric imaging as a Bayesian inference problem: it combines a likelihood and a prior probability density to a posterior probability density. We generalize the likelihood to depend on general model parameters ξ (previously I and N). The likelihood contains all information about the measurement process and the noise. In contrast, the prior P(ξ) is a probability density which assigns to every possible value of the model parameters ξ a probability which represents the knowledge on the model parameters before having looked at the data. These two quantities are combined with a normalization factor P(d) in Bayes' theorem:

    P(ξ | d) = P(d | ξ) P(ξ) / P(d).    (5)

P(ξ | d) gives the probability for all configurations of the model parameters after having looked at the data.

resolve uses Bayes' theorem together with the reparameterization trick (Kingma et al. 2015): it is always possible to transform the inference problem such that the prior density is a standard normal distribution: P(ξ) = G(ξ, 𝟙). In this approach, all prior knowledge is formally encoded in the likelihood. Put differently, the task of defining the inference problem is to write down a function which takes standard normal samples as input, transforms them into sensible samples of the quantity of interest with their assumed prior statistics, and finally computes the actual likelihood.

For our imaging purposes ξ is a roughly 10 million-dimensional vector. Exactly representing non-trivial high-dimensional probability densities on computers is virtually impossible. Therefore, approximation schemes need to be employed. For the application at hand, we choose the Metric Gaussian Variational Inference (MGVI, Knollmüller & Enßlin 2019) implementation in NIFTy (Arras et al. 2019a) because it strikes a balance between computational affordability and expressiveness in the sense that it is able to capture off-diagonal elements of the posterior uncertainty covariance matrix.

3.2. Weighting schemes

CLEAN assumes a certain weighting scheme, which is a multiplicative correction of the noise level specified in the data set. A weighting scheme is necessary for two reasons. It can be used to reweight by the density of the uv-coverage to make it effectively uniform, which CLEAN needs to perform best (see section 4). resolve does not need this kind of correction because it is based on forward modelling and Bayesian statistics: a more densely sampled region in uv-space leads to more information in this region and not to inconsistencies in the inference.

Additionally, there exist weighting schemes which further reweight the visibilities based on the baseline length. This weighting represents the tradeoff between sensitivity (up-weight short baselines) and resolution (uniform weighting). Depending on the application, CLEAN users need to choose between those extremes themselves.

Moreover, we find that short baselines are subject to higher systematic noise. For the data sets at hand, this systematic noise is up to a factor of 340 higher than the thermal noise level (see fig. 8). If the noise variance of the visibilities were correct, that value would be 1. To CLEAN, higher systematic noise is indistinguishable from non-uniform sampling; to a Bayesian algorithm, which takes the uncertainty information of the input data seriously, it makes a crucial difference. Therefore, the advanced version of resolve presented here assumes that the thermal measurement uncertainties need to be rescaled by a factor which depends only on the baseline length and which is correlated with respect to that coordinate. This correction function (or Bayesian weighting scheme) is learned from the data alongside the actual image. The details of this approach are described in the next section.

3.3. Likelihood

To specify resolve, the standardized likelihood P(d | ξ) in (5) needs to be defined. In addition to the thermal noise level σ_th, which is generated by the antenna receivers, calibrated visibilities may be subject to systematic effects. In order to account for these, the thermal variance is multiplied by a correction factor α which is unknown and assumed to depend on the baseline length:

    σ(ξ^(σ)) = σ_th · α(ξ^(σ)),    (6)

where ξ^(σ) refers to the part of ξ which parameterizes σ. Consequently, the noise standard deviation σ itself becomes a variable part of the inference. The sky brightness distribution I is variable as well (i.e. it depends on ξ) and the simulated data s is given by:

    s(ξ^(I)) = ∬ [A · I(ξ^(I)) / √(1−l²−m²)] e^{2πi[ul+vm+w(1−√(1−l²−m²))]} d(l,m),    (7)

where ξ^(I) refers to the part of ξ which parameterizes I, and I(ξ^(I)) is the discretized sky brightness distribution in units of Jy/sr.

The remaining task is to specify I(ξ^(I)) and α(ξ^(σ)). For the sky brightness distribution we assume two additive components: a point source component modelled with a pixel-wise inverse gamma prior (Selig et al. 2015) and a component for diffuse emission. A priori we assume the diffuse emission to be log-normal distributed with unknown homogeneous and isotropic correlation structure. This is motivated by the fact that emission is expected to vary over several magnitudes. Furthermore, we assume that the noise correction function α is log-normal distributed since it needs to be strictly positive and also may vary strongly.

Let F^(n)(ξ) be a function which maps standard normal distributed parameters ξ on an n-dimensional Gaussian random field with periodic boundary conditions and homogeneous and isotropic correlation structure (Enßlin 2018). The specific form of F^(n)(ξ) is explained in section 3.4. Then:

    I(ξ^(I)) = exp F^(2)(ξ^(I)) + (CDF⁻¹_InverseGamma ∘ CDF_Normal)(ξ^(I)),    (8)
    α(ξ^(σ)) = (C ∘ exp)[F^(1)(ξ^(σ))],    (9)

where ∘ denotes function composition, CDF_Normal and CDF⁻¹_InverseGamma refer to the cumulative density function of the standard normal distribution and the inverse cumulative density function of the Inverse Gamma distribution, respectively, and C is a cropping operator which returns only the first half of the (one-dimensional) log-normal field. This is necessary because α is not a periodic quantity and we use Fast Fourier Transforms which assume periodicity. While the diffuse component of the sky brightness distribution is not periodic either, it is not necessary to apply zero-padding there since the flux is expected to vanish at the image boundaries. The point sources are restricted to the locations a priori known to contain point sources.

All in all, the likelihood density is given by:

    P(d | σ(ξ^(σ)), s(ξ^(I))) = |2π σ̂²|⁻¹ e^{−½ (s−d)† σ̂⁻² (s−d)},    (10)
    H(d | σ(ξ^(σ)), s(ξ^(I))) = ½ (s−d)† σ̂⁻² (s−d) + Σ_i log σ_i² + c,    (11)

where x̂ denotes a diagonal matrix with x on its diagonal and c is a normalization constant. The sum goes over all data points and the dependency of σ and s on ξ is left implicit. The normalization factor in (10) is chosen such that (10) is normalized if d is viewed as a combination of two sets of real random variables:

    d = ℜ(d) + i ℑ(d),    ∫ P(d | ξ) dℜ(d) dℑ(d) = 1.    (12)

The following two subsections (sections 3.4 and 3.5) describe the technical details of the resolve sky model and the sampling procedure. Non-technical readers may safely skip directly to section 4 or even section 5.
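As an illustration of the reparameterization trick and of the prior transforms in (8) and (9), the following sketch maps standard normal latent variables to a log-normal diffuse component and inverse-gamma distributed point source fluxes. The correlated field is replaced by an uncorrelated stand-in and the inverse-gamma parameters are placeholders, not the values used by resolve.

    import numpy as np
    from scipy.stats import invgamma, norm

    rng = np.random.default_rng(0)

    # Diffuse component: exponentiated Gaussian field (eq. 8, first term);
    # here a white stand-in replaces the correlated field F^(2)(xi).
    xi_diffuse = rng.standard_normal((128, 128))
    diffuse = np.exp(21.0 + 0.5 * xi_diffuse)        # log-normal, units Jy/sr

    # Point sources: CDF^-1_InverseGamma(CDF_Normal(xi)) is inverse-gamma
    # distributed although xi itself is standard normal (eq. 8, second term).
    xi_points = rng.standard_normal(2)
    point_fluxes = invgamma.ppf(norm.cdf(xi_points), a=1.5, scale=1e6)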
3.4. Correlated field model

The following section closely follows Arras et al. (2020a, Methods section), which derives the correlated field model in a more general context. For reasons of clarity and comprehensibility, we repeat the derivation here for the specific case at hand, adapted to the notation used here. The main reason for the complexity of the model below is that for modelling diffuse emission neither a specific correlation kernel nor a parametric form for the kernel shall be assumed. Rather, our goal is to make the correlation kernel part of the inference as well. This reduces the risk of biasing the end result by choosing a specific kernel as prior.

In order to simplify the notation we drop the indices (I) and (σ) for this section and write F^(n) = F^(n)(ξ). Still, the model F^(n) is used for both the correction function α and the diffuse component of the sky brightness distribution, while we note that the domains are one-dimensional and two-dimensional, respectively. In the following, standard normal variables will appear in various places. Therefore, we write ξ = (ξ₀, ξ₁, …) and ξ_{>n} = (ξ_{n+1}, ξ_{n+2}, …), where each ξ_i is a collection of standard normal variables.

The task is to write down a function that takes a standard normal random variable ξ as input and returns a realization of a correlated field with unknown homogeneous and isotropic correlation structure. This means that the two-point correlation function depends on the distance between the sampling points only:

    S = ⟨F^(n)(ξ)(x) F^(n)(ξ)(y)⟩_{G(ξ,𝟙)} = f(|x − y|),    (13)

where ⟨x⟩_P denotes the expectation value of x over the distribution P. For homogeneous and isotropic processes the Wiener-Khintchin theorem (Wiener 1949; Khintchin 1934) states that the two-point correlation function of the process is diagonal in Fourier space. Let the n-dimensional discrete Fourier transform be the map 𝔉^(n): X_h → X, where X is a regular grid space with shape (N₁, …, N_n) and pixel sizes (Δx₁, …, Δx_n), and X_h is its harmonic counterpart: it has the same shape and pixel sizes ((N₁Δx₁)⁻¹, …, (N_nΔx_n)⁻¹). Define:

    F^(n)(ξ) = offset + 𝔉^(n)(vol · A(ξ_{>0}) · ξ₀),    (14)

where offset is the (known) mean of the Gaussian random field, Â Â† = S in the Fourier basis, vol = Π_i N_i Δx_i is the total volume of the space, ξ₀ is a standard normal random field defined on X_h, and "·" denotes pixel-wise multiplication. The volume factors in the Fourier transform are defined such that the zero mode in Fourier space is the integral over position space:

    (𝔉^(n)⁻¹ x)_{0⋯0} = Σ_{i₁=1}^{N₁} ⋯ Σ_{i_n=1}^{N_n} Δx₁ ⋯ Δx_n x_{i₁⋯i_n}    (15)

for all n-dimensional fields x. Then the set {F^(n)(ξ) | ξ ∼ G(ξ, 𝟙)} is a collection of correlated fields with unknown correlation structure, i.e. A still depends on ξ.

If we could derive a sensible form of the correlation structure A for both the diffuse emission and the correction function a priori, we could insert it here and infer only ξ₀. However, we are not aware of a method to set the correlation structure by hand without introducing any biases for a given data set. Therefore, we let the data inform the correlation structure A as well and set a prior on A. This approach may be viewed as a hyper parameter search integrated into the inference itself. In the following we will see that even the parameters needed to model A are inferred from the data.
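A minimal one-dimensional version of (14) can be written as follows. The amplitude A is fixed to a power law here, whereas in resolve it is itself a function of latent parameters as derived below, and the volume factors of (15) are only handled schematically.

    import numpy as np

    N, dx = 1024, 0.01
    rng = np.random.default_rng(0)

    k = np.fft.fftfreq(N, d=dx)                 # harmonic grid of X_h
    A = np.zeros(N)
    A[0] = 2.0                                  # zero mode amplitude A_0
    A[1:] = np.abs(k[1:]) ** -2.0               # placeholder power law amplitude

    xi0 = rng.standard_normal(N)                # standard normal excitations
    offset = 0.0
    # F(xi) = offset + Fourier transform of the coloured excitations (eq. 14);
    # the volume factors of eq. (15) are omitted for brevity.
    field = offset + np.real(np.fft.ifft(A * np.fft.fft(xi0)))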
So it is really a nested hyper parameter search.

The presented model has five hyper parameters. In order to emulate a hyper parameter search, we do not set those directly but rather make them part of the inference and let the algorithm tune them itself. The hyper parameters which are necessarily positive are modelled with a log-normal prior as generated from standard normal variables ξ_i via:

    LogNormal(ξ_i; m, s) := exp(log m + s̃ ξ_i − s̃²/2),    (16)
    s̃ := √(log(1 + (s/m)²)),    (17)

where m and s refer to the mean and standard deviation of the log-normal distribution; the hyper parameters which can be positive or negative have a Gaussian prior and are denoted by Normal(ξ_i; m, s) := m + s ξ_i. The values for m and s as well as for the other hyper parameters are summarized in table 1.

The zero mode controls the overall diffuse flux scale. Its standard deviation A₀ is a positive quantity and we choose it to be log-normal distributed a priori:

    A₀(ξ₁) = LogNormal(ξ₁; m, s).    (18)

The non-zero modes k ≠ 0 are given by:

    A_k(ξ_{>1}) = √( p_k(ξ_{>2}) / Σ_{k′≠0} p_{k′}(ξ_{>2}) ) · fluc(ξ₂),    for k ≠ 0,    (19)

where p_k is the model for the power spectrum of F^(n) up to the multiplicative term fluc. By this definition we ensure that fluc is the point-wise standard deviation of the final process: ⟨s_x s_x⟩ = fluc² for all x, after having subtracted the contribution from A₀. fluc is strictly positive and we model it with a log-normal prior: fluc = LogNormal(ξ₂; m, s).

The remaining piece is the actual form of p_k for k ≠ 0. The prior knowledge we want to encode into this model is the following:
1. Diffuse emission is correlated, i.e. falling power spectra, and specifically power laws p_{|k|} ∼ |k|^{−s} with s > 0, shall be preferred a priori.

In order to infer p_k in a non-parametric fashion and to represent the above power law property, we choose to transform p_k into double-logarithmic space, in which power laws become affine linear functions:

    p_k = e^{a_t},    with t = log|k|,  k ≠ 0.    (20)

We choose to model a_t as an integrated Wiener process, i.e. a general continuous random process:

    ∂_t² a_t = η_t,    (21)

where η_t is Gaussian distributed. In this form the process is not Markovian and is not suited to be evaluated as a forward model. Therefore, we track the derivatives b_t of a_t as degrees of freedom themselves:

    ∂_t (a_t, b_t)ᵀ + [0 −1; 0 0] (a_t, b_t)ᵀ = (flex √asp ξ₃(t), flex ξ₄(t))ᵀ,    (22)

where the specific form of the variances on the right-hand side of the equation will be interpreted below. The solution to (22) for b_t is a Wiener process; therefore, a_t is an integrated Wiener process for asp = 0, while asp > 0 adds an additional non-smooth component to a_t. The discretized solution to (22) is:

    b_{t_n} = b_{t_{n−1}} + flex √(Δt_n) ξ₄⁽ⁿ⁾,    (23)
    a_{t_n} = a_{t_{n−1}} + (Δt_n/2)(b_{t_n} + b_{t_{n−1}}) + flex √(Δt_n³/12 + asp Δt_n) ξ₃⁽ⁿ⁾,    (24)

where t_n is the n-th (discretized) value of t and Δt_n = t_n − t_{n−1}. This formulation allows us to compute samples of the process a_t from standard normal inputs ξ₃ and ξ₄.
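The discretized process (23)-(24) is straightforward to implement. The sketch below draws one realization of a_t on a logarithmic grid and exponentiates it according to (20); grid and parameter values are illustrative.

    import numpy as np

    rng = np.random.default_rng(1)
    t = np.log(np.arange(1, 513))            # t = log|k| for the non-zero modes
    flex, asp = 1.2, 0.2                     # illustrative hyper parameter values

    a = np.zeros(t.size)                     # integrated Wiener process a_t
    b = np.zeros(t.size)                     # its tracked derivative b_t
    for n in range(1, t.size):
        dt = t[n] - t[n - 1]
        xi3, xi4 = rng.standard_normal(2)
        b[n] = b[n - 1] + flex * np.sqrt(dt) * xi4                     # eq. (23)
        a[n] = (a[n - 1] + 0.5 * dt * (b[n] + b[n - 1])
                + flex * np.sqrt(dt**3 / 12.0 + asp * dt) * xi3)       # eq. (24)

    p_k = np.exp(a)                          # power spectrum, eq. (20)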
flex and asp are both positive quantities and are modelled with log-normal priors: flex = LogNormal(ξ₅; m, s) and asp = LogNormal(ξ₆; m, s). As can be seen from (22), flex controls the overall variance of the integrated Wiener process, while asp determines the relative strength between the un-integrated and the integrated Wiener process. In the limit asp → 0, a_t is a pure integrated Wiener process. Finally, the difference between the first and the last pixel of the integrated Wiener process is replaced by a linear component whose slope is avgsl:

    ã_{t_i} = a_{t_i} − a_{t_n} (t_i − t₀)/(t_n − t₀) + (t_i − t₀) · avgsl,    ∀ i ∈ {0, …, n}.    (25)

The slope is modelled with a Gaussian prior: avgsl = Normal(ξ₇; m, s).

All in all, this defines a model which is able to generate Gaussian random fields of arbitrary dimension with unknown correlation structure. The random field is assumed to have homogeneous and isotropic correlation structure. The power spectrum itself is modelled in double-logarithmic space as a mixture of a Wiener process and an integrated Wiener process with the possibility to specify the overall slope of the process. This model is used in its one-dimensional version for the weighting scheme field α and in its two-dimensional version for the diffuse component of the sky brightness distribution I.

3.5. Sampling procedure

To find approximate posterior samples, resolve employs the MGVI algorithm (Knollmüller & Enßlin 2019). This algorithm performs a natural gradient descent to find the minimum of:

    E(ξ̄) = (1/N) Σ_{i=1}^{N} H(d | ξ = ξ̄ + ξ_i) + ½ ξ̄† ξ̄,    (26)

where ξ̄ is the latent posterior mean and the ξ_i are samples which represent the uncertainty of the posterior. They are drawn as zero-centered Gaussian random samples with the inverse Bayesian Fisher metric as covariance:

    ξ_i ∼ G( ξ | 0, [𝟙 + ∇_ξ(σ, s)†|_ξ̄ F_{σ,s} ∇_ξ(σ, s)|_ξ̄]⁻¹ ),    (27)

where ∇_ξ(σ, s)|_ξ̄ is the Jacobian of s and σ as functions of ξ, evaluated at the latent mean ξ̄, and F is the Fisher information metric of the likelihood in terms of the visibility s and the noise standard deviation σ. Samples from this inverse metric can be drawn without the need to invert explicit matrices by using the conjugate gradient algorithm. We refer to Knollmüller & Enßlin (2019, discussion around equation (58)) for a detailed description.

For the computation of the Fisher metric of a complex Gaussian distribution, the real and imaginary parts of the visibility s are treated individually in order to avoid ambiguities related to complex vs. real random variables. Using (11) we arrive at:

    F_{σ,s} = ⟨ (∇_σ H, ∇_ℜ(s) H, ∇_ℑ(s) H)ᵀ (∇_σ H, ∇_ℜ(s) H, ∇_ℑ(s) H) ⟩_{P(d|σ,ξ)} = diag(4 σ̂⁻², σ̂⁻², σ̂⁻²).    (28)

To draw random variates with this covariance we use normal random variates and multiply them with the square root of the diagonal of the matrix in (28). In the NIFTy package used here, which implements these operations, this Fisher metric is given as a function of σ⁻² instead, which can be obtained from (28) by applying the Jacobian ∂σ/∂σ⁻²:

    F_{σ⁻²,s} = diag( (∂σ/∂σ⁻²) 4σ̂⁻² (∂σ/∂σ⁻²), σ̂⁻², σ̂⁻² ) = diag(σ̂⁴, σ̂⁻², σ̂⁻²).    (29)

For computational speed, the real and imaginary parts of the visibilities are combined into complex floating point numbers where possible.
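The structure of the MGVI iteration can be summarized in a few lines. Here `hamiltonian` stands for the information Hamiltonian H(d|ξ), `draw_sample` for the Fisher-metric sampling of (27) via conjugate gradients, and `minimize` for a natural gradient descent; all three are assumed given. This is a schematic of (26), not the NIFTy implementation.

    import numpy as np

    def mgvi_energy(xi_bar, samples, hamiltonian):
        """Sampled KL energy of eq. (26), using mirrored samples."""
        terms = [hamiltonian(xi_bar + s) for s in samples]
        terms += [hamiltonian(xi_bar - s) for s in samples]  # mirrored samples
        return np.mean(terms) + 0.5 * np.dot(xi_bar, xi_bar)

    def mgvi(xi_bar, hamiltonian, draw_sample, minimize, n_iter=5, n_samples=5):
        """Alternate between drawing samples at xi_bar and minimizing eq. (26)."""
        for _ in range(n_iter):
            samples = [draw_sample(xi_bar) for _ in range(n_samples)]
            xi_bar = minimize(lambda x: mgvi_energy(x, samples, hamiltonian),
                              xi_bar)
        return xi_bar, samples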
4. Traditional CLEAN imaging algorithms
This section outlines the main ideas behind the CLEAN algorithm. First, the most basic variant of CLEAN (Högbom 1974) is described, followed by a discussion of additional approximations that make it more efficient (Clark 1980) and of a more sophisticated version of the algorithm which overcomes the coplanar array approximation (Schwab & Cotton 1983).
4.1. Single-scale CLEAN

At its heart, CLEAN is an optimization algorithm which seeks to minimise (4). But since this problem is ill-posed (the operator R†N⁻¹R occurring in (4) is not invertible), a unique minimum does not exist. For a patch of sky consisting purely of point sources, one could seek the smallest number of points which would result in the dirty image when convolved with the PSF.

A practical solution, as formalised by Högbom (1974), involves starting from an empty sky model and then iteratively adding components to it until the residual image appears noise-like. More precisely, noting that the residual image equates to the dirty image at the outset, we proceed by finding the brightest pixel in the residual image. Then, using the intuition that the dirty image is the image convolved by the PSF, we center the PSF at the current brightest pixel, multiply it by the flux value in the pixel, and subtract some fraction of it from the residual image. At the same time, the model image is updated by adding in the same fraction of the pixel value at the location of the pixel. This procedure is iterated until a satisfactory solution is found, e.g. when the residual appears noise-like or its brightest pixel is less than some predetermined value. This solution loosely corresponds to the smallest number of point sources necessary to explain the data. The one tunable parameter in the algorithm is the fraction of the flux of the point source which is added to the model at a time. This parameter is called the loop gain.

This surprisingly simple procedure is so effective that it is still the most commonly used deconvolution algorithm in radio astronomy. However, it relies on the approximation

    R†N⁻¹R ≈ I_PSF ∗,    (30)

where ∗ denotes convolution and I_PSF is an image of the point spread function (PSF), i.e. the result of applying R†N⁻¹R to an image which has only a unit pixel at its center. In (30), equality only holds when the coplanar array approximation is valid (the PSF is direction dependent when the array is non-coplanar). This leads to two alternate forms of the derivative of the likelihood Hamiltonian:

    ∇_I H(d | I, N) = −R†N⁻¹(d − RI) ≈ −(I_D − I_PSF ∗ I),    (31)

where the latter approximation is exact if the coplanar array approximation is valid and the primary beam structure is negligible or ignored. For the maximum likelihood solution, the right-hand side of (31) is set to zero. This leads to the classic notion that the dirty image is the image convolved by the PSF:

    I_D = I_PSF ∗ I.    (32)

Especially if the number of image pixels is much smaller than the number of data points, this allows computation of the gradients in (31) very efficiently. The reason for this is that the operator I_PSF ∗ can be implemented efficiently using the fast Fourier transform (FFT), whereas R†N⁻¹R requires a combination of convolutional gridding (including possible w-term corrections) and the FFT.

The key to the speed of the CLEAN algorithm comes from the intuition provided by (32). During model building the convolution is not performed explicitly; rather, the PSF is centered on the location of the current pixel and subtracted from the residual pixel-wise. Since point sources can be located right at the edge of the image, the PSF image needs to be twice the size in both dimensions of the residual image.
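A compact implementation of the Högbom minor cycle described above might look as follows. The PSF image is assumed to be twice the image size in both dimensions, and the gain and threshold values are illustrative.

    import numpy as np

    def hogbom(dirty, psf, gain=0.1, threshold=1e-3, maxiter=10000):
        """Basic Hogbom CLEAN: peak finding and pixel-wise PSF subtraction."""
        residual = dirty.copy()
        model = np.zeros_like(dirty)
        ny, nx = dirty.shape
        cy, cx = ny, nx                  # PSF peak of the (2ny, 2nx) PSF image
        for _ in range(maxiter):
            y, x = np.unravel_index(np.argmax(np.abs(residual)), residual.shape)
            if np.abs(residual[y, x]) < threshold:  # residual looks noise-like
                break
            flux = gain * residual[y, x]
            model[y, x] += flux          # add the CLEAN component to the model
            # subtract the scaled PSF centered on (y, x) from the residual
            residual -= flux * psf[cy - y:cy - y + ny, cx - x:cx - x + nx]
        return model, residual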
To save memory and computational The PSF is direction dependent when the array is non-coplanar. time, Clark (1980) approximated the PSF by a smaller versionand restricts the regions in which PSF sidelobes are subtracted.This is possible since the PSF sidelobes typically fall o ff fairlyrapidly, especially for arrays with good uv-overage. However, itis paid for by artifacts being added to the model if the approxima-tion is not done carefully. For this reason the Clark approxima-tion is often used in combination with a CLEAN mask , the regionin which real emission is expected. Outside the mask boundariesthe algorithm is not allowed to allocate components. However,even with a mask, such aggressive image space approximationsinevitably lead to artifacts. This led Schwab & Cotton (1983) tointroduce the notion of minor and major cycles.A major cycle corresponds to an exact evaluation of the gra-dient using the first of the two expressions for the gradient in(31). It removes artifacts stemming from incomplete subtractionof PSF sidelobes by subtracting the model correctly in visibilityspace. In addition, by incorporating w-projection Cornwell et al.(2008) or w-stacking O ff ringa et al. (2014) techniques into theimplementation of the measurement operator, it possible to com-pute the gradient without utilising the coplanar array approxi-mation. Since computing the gradient exactly is an expensiveoperation, it should preferably be done as few times as possi-ble. Hogbom CLEAN can be used in combination with the Clarkapproximation to add multiple components to the model whilekeeping track of the approximate gradient. This is called the mi-nor cycle. Eventually, the current model is confronted with thefull data using the exact expression for the gradient and the pro-cedure is repeated until some convergence criteria are met. Sincenew regions of emission are uncovered as the corrupting e ff ectsof the brightest sources are removed, dynamic masking strate-gies, in which the mask is adapted from one major cycle to thenext, can be employed.The criterion at which to stop the minor cycle and performanother exact evaluation of the gradient a ff ects both the com-putational cost and the quality of the final result. Careful userinput is often required to balance the tradeo ff between these twofactors. Because of the convolutional nature of the problem, thelevel of artifacts introduced by exploiting image space approxi-mations is proportional to the brightest pixel in the residual im-age. Thus, running the minor cycle for too long adds artifactsto the model. In principle it is possible to correct for these arti-facts in subsequent iterations, but in practice this is potentiallyunstable. As convergence criterion for the minor loop, a param-eter called major loop gain or peak factor is defined: iterate mi-nor loops until the residual has decreased by the peak factor. Asensible choice depends on the field of view and the degree ofnon-coplanarity of the array. Typical values are around 0.15.In AIPS, the software we use for our single-scale CLEAN maps, a new major cycle i + m i (1 + a i ), a current map specific ref-erence flux m i times a cycle dependent factor 1 + a i , which isstirred according to the following heuristic. The starting valuefor this factor, a , depends on the ratio ρ = r − m m where r i and m i are the peak and lowest flux of the absolute residual image inthe i th major cycle, respectively, and is defined as: a = . · ρ : ρ ≥ . · ρ : 1 ≤ ρ < . · ρ : ρ < Note that
CLEAN masks are not only used to limit deconvolution ar-tifacts but also to preclude possible calibration artifacts, a topic that isbeyond the scope of the current discussion.Article number, page 6 of 22hilipp Arras et al.: Comparison of classical and Bayesian imaging in radio interferometry
Then, a increases at each iteration: a i + = a i + n − i (cid:16) m i r i (cid:17) f where n i is the current number of CLEAN components and f is a freeparameter. Larger f s let a i decrease more slowly.Especially if extended emission is present, model imagesproduced by CLEAN are so far from realistic representatives of thetrue sky that astronomers can’t work with them directly. They arethe best fit to the data under the implicit prior imposed by
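The interplay of major and minor cycles described above can be sketched as follows, with `R` and `R_adjoint` standing for an assumed gridding-based measurement operator and its adjoint, and `hogbom` being the minor cycle sketch from section 4.1; the peak factor of 0.15 follows the typical value quoted above.

    import numpy as np

    def cotton_schwab(d, R, R_adjoint, psf, n_major=10, peak_factor=0.15):
        """Skeleton of Cotton-Schwab CLEAN: exact gradients in the major cycle."""
        model = np.zeros_like(R_adjoint(np.zeros_like(d)))
        for _ in range(n_major):
            # major cycle: exact residual image via the measurement operator
            residual_image = R_adjoint(d - R(model))
            peak = np.abs(residual_image).max()
            # minor cycles: approximate PSF subtraction down to peak_factor * peak
            delta, _ = hogbom(residual_image, psf, threshold=peak_factor * peak)
            model = model + delta
        return model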
CLEAN but fail miserably at capturing source morphology or frequencyspectra. Therefore, the results produced by
CLEAN are interpretedwith the help of the so-called restored image. The first step increating the restored image is to convolve the model image withthe
CLEAN beam, a Gaussian that approximates the primary lobeof the PSF. This represents the intrinsic resolution of the instru-ment which is assumed to be constant across the image. Next,in an attempt to account for any undeconvolved flux and set thenoise floor for the observation, the residual image is added to themodel convolved with the PSF. The noise floor, which is takento be the RMS of the resulting image in regions devoid of struc-ture, is then supposed to give an estimate of the uncertainty ineach pixel.All in all, careful user input is required to successfully use
CLEAN for imaging. Fortunately the tunable parameters are ac-tually quite easy to set once the user has developed some intu-ition for them. However, the model images produced by single-scale
CLEAN are completely unphysical when there are extendedsources in the field. In extreme cases (for example images of theGalactic Centre (Heywood et al. 2019)), single-scale
CLEAN failsto fully deconvolve the faint di ff use emission in the field and canlead to imaging artifacts. A possible explanation for this is that,at each iteration, single-scale CLEAN tries to minimise the objec-tive function by interpolating residual visibility amplitudes witha constant function. This limitation has been partially addressedby the multi-scale variants of the
CLEAN algorithm.
4.2. Multi-scale CLEAN

Multi-scale CLEAN (Cornwell 2008; Rau & Cornwell 2011; Offringa & Smirnov 2017) is an extension of single-scale CLEAN which imposes sparsity in a dictionary of functions, as opposed to just the delta function. Most implementations use a pre-determined number of either circular Gaussian components or the tapered quadratic function (Cornwell 2008) in addition to the delta function. While this model is still not a physical representation of the sky, diffuse structures within the field of view are more faithfully represented. Most multi-scale CLEAN implementations share the major and minor cycle structure of Cotton-Schwab CLEAN, with the major cycle implemented in exactly the same way. However, the minor cycle differs between the many variants of multi-scale CLEAN. The implementation used for the current comparison is described in detail in Offringa & Smirnov (2017) and implemented in the wsclean software package (Offringa et al. 2014).

The starting point for wsclean's multi-scale algorithm is to select the size of the scale kernels. While this can be specified manually, wsclean also provides a feature to determine them automatically from the uv-coverage of the observation. In this case, the first scale always corresponds to the delta function kernel scale. The second scale is then selected as the full width window of the tapered quadratic function, which is four times larger than the smallest theoretical scale in the image (determined from the maximum baseline). The size of the corresponding Gaussian scale kernels is set to approximately match the extent of the tapered quadratic function. As noted in Offringa & Smirnov (2017), the factor of four was empirically determined to work well in practice. If smaller scales are used, point sources are sometimes represented with this scale instead of the delta scale. Each subsequent scale then has double the width of the previous one, and scales are added until they no longer fit into the image or until some predetermined maximum size is reached.

Once the scales have been selected, the algorithm identifies the dominant scale at each iteration. This is achieved by convolving the residual image with each Gaussian scale kernel and comparing the peaks in the resulting convolved images subject to a scale bias function (conceptually similar to matched filtering). The scale bias function (see Offringa & Smirnov (2017) for full details) can be used to balance the selection of large and small scales. It introduces a tunable parameter to the algorithm, viz. the scale bias β. With the dominant scale identified, the model is updated with a component corresponding to this scale at the location of the maximum in the convolved residual image. As with single-scale CLEAN, the model is not updated with the full flux in the pixel but only some fraction thereof. The exact fraction is scale-dependent (see again Offringa & Smirnov (2017) for details). To keep track of the approximate residual, the PSF convolved with the scale kernel, multiplied by this same fraction, is subtracted from the residual image.

The additional convolutions required to determine the dominant scale at each iteration introduce an additional computational cost compared to single-scale CLEAN. For this reason, wsclean provides the option of running an additional sub-minor loop which fixes the dominant scale until the peak in the scale convolved image decreases by some pre-specified fraction (or for a fixed number of iterations). This significantly decreases the computational cost of the algorithm, but it is still more expensive than single-scale CLEAN.
While we will not delve into the exact details of how the sub-minor loop is implemented, we note that it introduces yet another tunable parameter to the algorithm, which is similar to the peak factor of Cotton-Schwab CLEAN. This parameter, called multiscale-gain in wsclean, determines how long a specific scale should be CLEANed before re-determining the dominant scale in the approximate residual. Importantly, the sub-minor loop also makes use of a Clark-like approximation to restrict the regions in which peak finding and PSF subtraction should be performed. This improves both the speed and the quality of the reconstructed images.

While we have not discussed all the details behind the multi-scale CLEAN implementation in wsclean, our discussion should make it clear that it introduces additional tunable parameters to the algorithm. Most of the time the algorithm performs reasonably well with these parameters left at their defaults. However, some degree of tuning and manual inspection is sometimes required, especially for fields with complicated morphologies.
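The dominant-scale search at the heart of the minor cycle can be sketched as follows; the scale bias is reduced to a placeholder function, whereas wsclean uses the specific form given in Offringa & Smirnov (2017).

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def dominant_scale(residual, scales, bias=lambda s: 1.0):
        """Convolve the residual with each scale kernel and compare the peaks."""
        best_scale, best_val, best_pos = None, -np.inf, None
        for s in scales:                    # e.g. scales = [0, 4, 8, 16] pixels
            conv = residual if s == 0 else gaussian_filter(residual, sigma=s)
            pos = np.unravel_index(np.argmax(np.abs(conv)), conv.shape)
            val = bias(s) * np.abs(conv[pos])
            if val > best_val:
                best_scale, best_val, best_pos = s, val, pos
        return best_scale, best_pos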
4.3. Limitations of CLEAN

Classical radio interferometric imaging suffers from a variety of problems. Two of these problems stand out in particular: the lack of reliable uncertainty estimates and the unphysical nature of model images produced by CLEAN. As we discuss below, CLEAN forces astronomers to conflate these two issues in a way that makes it very difficult to derive scientific conclusions that are robust in the sense that it is guaranteed that two observers would convert the same data set into the same sky image and that meaningful statistical uncertainty information would be provided by the algorithm.

Astronomers need to account for uncertainties in both flux and position, and these two notions of uncertainty are correlated in a non-trivial way that is determined by both the uv-coverage and the signal-to-noise ratio of the observation. However, model images produced by CLEAN are not representative of the true flux distribution of the sky and come without any uncertainty estimates. This can be attributed to the fact that CLEAN is not based on statistical theory but rather is a heuristic that tries to represent flux in the form of pre-determined basis functions (delta peaks, Gaussians) via flux-greedy algorithms. As a result, astronomers turn to the restored image (see section 4.1) instead of relying directly on the model produced by CLEAN. Compared to the model image, the restored image has two favourable qualities, viz. it accounts for the (assumed constant) intrinsic instrumental resolution and it displays structures in the image relative to the noise floor of the observation. These two aspects are supposed to roughly account for uncertainties in position and flux, respectively. However, besides the fact that adding the residuals back in introduces structures in the image which are not real, and that the restored image has inconsistent units (the residual has different units from the model convolved by the CLEAN beam), this is completely unsatisfactory from a statistical point of view. Firstly, the restored image completely neglects the correlation between uncertainties in flux and position, information which is crucial to determine whether a discovery is real or not. In fact, since the act of convolving the model image by the CLEAN beam assumes that the resolution is constant across the image, whereas it is known that super-resolution of high signal-to-noise structures is possible, the restored image paints a rather pessimistic picture of the capabilities of radio interferometers. Secondly, both the "noise in the image" and the size of the CLEAN beam depend on the weighting scheme which has been used. It is difficult to attach any degree of confidence to the results since the weighting scheme is a free parameter of CLEAN. Dabbech et al. (2018, Figures 1 and 2) show the impact of different weighting schemes on the final image. This limitation is borne out quite explicitly in the data set chosen for the current comparison in section 5. Furthermore, since CLEAN outputs images which contain regions with unphysical negative flux (note that negative flux is also an artifact of discretising the measurement operator (2), since the response of a point source situated exactly in between two pixels is a sinc function), astronomers need to assess for themselves which parts of the image to trust in the first place. The above limitations provide opportunities for speculative scientific conclusions which cannot be backed up by statistically rigorous arguments. They also make it impossible to quantitatively compare images from radio interferometers processed by CLEAN to, e.g., astrophysical simulations.

In addition to the above, CLEAN relies on user input which involves the careful construction of masks, selecting an appropriate weighting scheme and setting hyper-parameters such as loop gains and stopping criteria. This results in an effective prior: it is known that CLEAN imposes some measure of sparsity in the chosen dictionary of functions, but it is unclear how to write down the explicit form of this effective prior. The problem is exacerbated by the fact that CLEAN uses a form of backward modelling which does not perform well when there are very little data available or when the uv-coverage is highly non-uniform, as is the case for typical VLBI observations. Thus, the way that CLEAN is implemented is fundamentally incompatible with Bayesian inference, making it impossible to infer, or indeed marginalise over, optimal values for the parameters it requires. This is clearly problematic as far as scientific rigour is concerned.

This illustrates that the notions of uncertainty, resolution and sensitivity are tightly coupled concepts when interpreting images produced by radio interferometers. As such, it is not sufficient to apply a post-processing step such as making the restored image to derive scientific conclusions from radio maps. In fact, doing so potentially limits the usefulness of interferometric data because it eliminates the possibility of super-resolution at the outset. This is a result of incorrect prior specification and of not properly accounting for the interaction between the data fidelity and the prior term during imaging. Obtaining sensible posterior estimates requires combining the linear Fourier measurement taken by the interferometer with a prior which respects the physics of the underlying problem, such as enforcing positivity in the spatial domain for example. To this end, resolve approximates the posterior with MGVI, an algorithm that can track non-trivial cross-correlations. Instead of providing a point estimate with associated error bars, MGVI provides samples from the approximate posterior, which can then be used to compute expectation values of any derived quantities while accounting for cross-correlations between parameters.

All in all, the absence of proper uncertainty information, the potential negativity of flux, the arbitrariness of the weighting scheme, problems with little data and non-uniform uv-coverage, and the loss of resolution by convolving with the CLEAN beam illustrate the necessity to improve beyond CLEAN-based algorithms.
5. Comparison of results from resolve and CLEAN
Here we compare the performance of the three imaging approaches presented in sections 3 and 4. To this end we use VLA observations of Cygnus A which have been flagged and calibrated with standard pipelines. For more details on the data reduction process refer to Sebokolodi et al. (2020). We use single-channel data sets at the frequencies 2052, 4811, 8427 and 13360 MHz. The CLEAN maps have been converted from the unit Jy/beam to Jy/arcsec² by division by the half-width-half-maximum area of the CLEAN beam. All data and the results of the three different methods are archived (Arras et al. 2020b, https://doi.org/10.5281/zenodo.3999840).

All values for the hyper parameters of resolve are summarized in table 1.

Table 1. Hyper parameters for the resolve runs. The numbers in brackets refer to the index of the excitation vector ξ to which the specified mean m and standard deviation s belong, see e.g. (18).

                              α mean   α sd    I mean   I sd
    Offset                    0        —       21       —
    [1] Zero mode variance    2        2       1        0.1
    [2] Fluctuations          2        2       5        1
    [5] Flexibility           1.2      0.4     1.2      0.4
    [6] Asperity              0.2      0.2     0.2      0.2
    [7] Average slope         -2       0.5     -2       0.5

The resolve parameters separate into those for the sky brightness distribution and those for the Bayesian weighting scheme. For the latter, they are chosen such that the model has much flexibility to adapt to the exact situation. Because α provides a multiplicative correction to the noise levels, the offset is set to zero (which becomes one, i.e. no correction, after exponentiation). The zero mode standard deviation is set to a high value because the overall noise level might be completely different. Also the fluctuations have a large standard deviation such that the algorithm can easily tune that parameter. A value of 2 means that we expect the correction function α to vary within one standard deviation two e-folds up and down. The flexibility and asperity parameters of the power spectrum, flex and asp, are set such that the algorithm can pick up non-trivial values but not too extreme ones. The average slope of the power spectrum is chosen to vary around −2. In other words, the Bayesian weighting scheme α depends in a differentiable fashion on the baseline length a priori. A relatively high a priori standard deviation of 0.4 enables the algorithm to tune the slope to the appropriate value. The most important aspect of the hyper parameter setting is that the resulting prior has enough variance to capture the actual Bayesian weighting scheme and sky brightness distribution. As discussed above, the model is set up in such a way that it can adjust its hyper parameters on its own. All parameters discussed in this section are really hyper parameters of that hyper parameter search.

For the sky brightness distribution we know a priori that typical flux values in regions with emission vary on scales of 10^… and 10^… Jy/sr. Therefore a sensible offset for the Gaussian field is log(10^…) ≈ 20. A priori we let that value vary two e-folds up and down within one standard deviation, which means that within three standard deviations typical flux values between ≈ 10^… and ≈ 10^… Jy/sr can be reached. However, as always, we make the standard deviations themselves a parameter and choose 2 for the standard deviation of the standard deviation of the zero mode, which makes virtually all offsets possible. As positions for the point sources modelled with an inverse-gamma prior (see (8)), we assume a point source at the phase center and a second one located at (0.…, −0.44) arcsec relative to the phase center (Perley et al. 2017, Cygnus A-2).

Apart from the hyper parameters we need to specify the minimization procedure for resolve (Knollmüller & Enßlin 2019). In order to arrive at a sensible starting position for the actual inference we proceed in the following steps:

1. Compute the maximum-a-posteriori solution assuming the error bars provided by the telescope. This means that we set α = 1.
2. Use mirrored samples of ξ, as generated by MGVI, to approximate the Metric Gaussian Kullback-Leibler divergence and solve the inference problem with respect to ξ^(σ) only. In other words, we find a good weighting scheme α conditional on the sky brightness distribution found before.
3. Solve the MGVI inference problem for the sky brightness distribution conditional on the found weighting scheme, using five mirrored samples.
4. Solve the full inference problem for the sky brightness distribution and the Bayesian weighting scheme simultaneously.
5. Terminate after the second iteration.
6. Flag all data points which are more than 6σ away from the model data, taking the Bayesian weighting scheme into account, and restart from step 1.

In all cases, we approximate the Metric Gaussian Kullback-Leibler divergence using five mirrored samples. These samples are drawn with the help of conjugate gradient runs (see section 3.5). These conjugate gradients are declared converged when the conjugate gradient energy does not change by more than 0.1 three times in a row. As an upper limit for the maximum number of conjugate gradient steps we choose 2000. Not iterating the conjugate gradient algorithm until convergence (which is not computationally feasible) does not introduce biases in the inference but rather increases the posterior variance, as discussed in Knollmüller & Enßlin (2019).
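The stopping rule for these conjugate gradient runs can be expressed compactly; `cg_step` performs a single conjugate gradient iteration and returns the updated state together with its energy, and is assumed given.

    def run_cg(state, cg_step, tol=0.1, patience=3, maxiter=2000):
        """Stop once the energy changes by less than tol three times in a row."""
        energy = float("inf")
        unchanged = 0
        for _ in range(maxiter):
            state, new_energy = cg_step(state)
            unchanged = unchanged + 1 if abs(energy - new_energy) < tol else 0
            energy = new_energy
            if unchanged >= patience:
                break
        return state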
Common hyper parameters for multi-scale
CLEAN runs. Theparameters which di ff er for the four runs are described in the maintext. Additionally, the options multiscale , no-small-inversion , use-wgridder , local-rms have been used. computationally feasible) does not introduce biases in the infer-ence but rather increases the posterior variance as discussed inKnollmüller & Enßlin (2019).The multi-scale CLEAN results produced for the currentcomparison were obtained by first doing an imaging run withuniform weighting down to a fairly low threshold and using wsclean ’s auto-masking feature. The resulting images wereused to define an external mask containing the most prominentfeatures. A second imaging run down to a deeper threshold wasthen performed using Briggs weighting with a robustness factorof -1. These images were then used to refine the mask and to flagobvious outliers in the data. The outliers were identified by com-puting whitened residual visibilities and flagging all data pointswith whitened residual visibility amplitudes larger than five timethe global average. On average this resulted in about 1% of thedata being flagged which is more than expected from the noisestatistics. This could indicate that a small amount of bad dataslipped through the initial pre-processing steps (e.g. flagging andcalibration). The final imaging run was then performed using therefined mask and Briggs weighting with a robustness factor ofzero. While the procedure could be refined further, we found thatdoing so results in diminishing returns in terms of improving thefinal result.The wsclean settings reported in table 2 are common toall the data sets for the final multi-scale
The image size was set so that the PSF for the 13 GHz data set has just more than five pixels across the FWHM of the primary lobe, a rule of thumb that is commonly employed to set the required pixel size for an observation. Twenty threads are employed to approximately match the computational resources given to resolve. In addition to auto-masking, which is set to kick in when the peak of the residual is approximately twice the value of the rms in the image, a manual FITS mask was supplied using the fits-mask option. The masks for the different data sets are shown in fig. A.1. In all cases the scales were automatically selected. The only parameter that differs between data sets is the threshold at which to stop CLEANing, specified through the threshold parameter in wsclean. These were set to 0.002, 0.0007, 0.0003 and 0.0002 for the 2, 4, 8 and 13 GHz data sets, respectively, which approximately matches the noise floor in the final restored images. A value of zero for the Briggs robustness factor was chosen as it usually gives a fairly good tradeoff between sensitivity and resolution. However, as discussed in section 4.3, the need to specify the weighting scheme manually is one of the main limitations of CLEAN. This is especially evident in the 8 GHz observation, where Cygnus A-2 is just visible using a robustness factor of zero whereas it is clearly visible in the images with a robustness factor of minus one. Cygnus A-2 is completely lost when using natural weighting, which is where the interferometer is most sensitive to faint diffuse structures. For single-scale CLEAN, the default settings as implemented in AIPS are used.
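Put together, the final multi-scale CLEAN run corresponds to a wsclean call combining the table 2 settings with the per-data-set threshold. The following Python wrapper is a sketch for concreteness; the wsclean options are real, but the measurement-set and mask file names are hypothetical.

```python
import subprocess

def final_msclean(ms_path, fits_mask, threshold_jy, name):
    """Assemble and run the final multi-scale CLEAN call (table 2 settings)."""
    cmd = ["wsclean",
           "-j", "20",                                 # twenty threads
           "-size", "4096", "3072",
           "-padding", "2.0",
           "-scale", "0.04asec",
           "-weight", "briggs", "0",
           "-gain", "0.1", "-mgain", "0.8",
           "-niter", "1000000", "-nmiter", "10",
           "-multiscale", "-multiscale-gain", "0.1",   # scales auto-selected
           "-auto-mask", "2.0",
           "-fits-mask", fits_mask,                    # manually refined mask
           "-threshold", str(threshold_jy),            # 0.002 ... 0.0002 Jy
           "-no-small-inversion", "-use-wgridder", "-local-rms",
           "-name", name, ms_path]
    subprocess.run(cmd, check=True)

# e.g. for the 13 GHz data set (file names hypothetical):
# final_msclean("cygA_13360MHz.ms", "mask_13GHz.fits", 0.0002, "cygA_13GHz")
```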
Table 3. resolve point source fluxes. Source 0 refers to the central source Cygnus A and Source 1 to the fainter secondary source Cygnus A-2. The standard deviation is computed from the resolve posterior samples and does not account for calibration uncertainties and other effects; see main text.

Frequency [GHz]   Source 0 [mJy]   Source 1 [mJy]
2.052             582 ± …          … ± …
4.811             … ± …            … ± …
8.427             … ± …            … ± …
13.36             … ± …            4.… ± …
Figure 1 shows a summary of the results of the twelve runs: four frequencies imaged with three different algorithms. The units of the CLEAN images have been converted to Jy/arcsec² (by dividing the CLEAN output in Jy/beam by the beam area π · BMAJ · BMIN / (4 ln 2)). Then the pixel values of all images can be directly compared to each other. As discussed above, the output of resolve is not a single image but rather a collection of posterior samples. For the purpose of comparison we display the pixel-wise posterior mean.
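A minimal sketch of this unit conversion, assuming BMAJ and BMIN are the FWHM axes of the elliptical Gaussian CLEAN beam in arcsec (the standard Gaussian beam area π · BMAJ · BMIN / (4 ln 2) is used here):

```python
import numpy as np

def jy_per_beam_to_jy_per_arcsec2(img_jy_per_beam, bmaj_asec, bmin_asec):
    # Solid angle of an elliptical Gaussian beam with the given FWHM axes.
    beam_area_asec2 = np.pi * bmaj_asec * bmin_asec / (4.0 * np.log(2.0))
    return img_jy_per_beam / beam_area_asec2
```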
Figure 1 shows that the resolve maps do not feature any negative flux regions. Since this was a strict prior assumption of the algorithm, this is the expected result. The single-scale CLEAN and multi-scale CLEAN maps have many negative flux regions where no (bright) sources are located. Otherwise, the results of these two algorithms are similar. Additionally, figs. 2 and A.2 show the pixel-wise posterior uncertainty of the resolve runs. Note that these figures do not contain the whole uncertainty information, which is stored in the posterior samples: the posterior distribution for each pixel is not Gaussian and therefore the higher moments are non-trivial, and the cross-correlation between the pixels cannot be recovered from the pixel-wise posterior uncertainty.
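The displayed maps derive from the posterior samples as follows; `samples` is a hypothetical array of posterior sky samples of shape (n_samples, ny, nx) in Jy/arcsec². Note that these two summaries discard exactly the information discussed above, namely higher moments and cross-pixel correlations.

```python
import numpy as np

def pixelwise_summary(samples):
    mean = samples.mean(axis=0)          # shown in fig. 1
    std = samples.std(axis=0)
    relative_std = std / mean            # sky is strictly positive, mean > 0;
    return mean, relative_std            # figs. 2 and A.2 clip this at 0.7
```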
In order to investigate the results further, figs. 3 to 5 show the Western lobe of the 13.36 GHz observation only, and fig. 6 shows the bottom left hot spot of all observations. In the CLEAN results it can be seen that the resolution improves significantly when going to higher frequencies. This is the natural behaviour of an interferometer: the higher the observation frequency, the higher the intrinsic resolution. The same is true for the resolve maps. However, resolve also achieves higher resolution than CLEAN at lower frequencies. By eye, the resolution of the resolve 4.8 GHz map is comparable to that of the 13.4 GHz CLEAN maps. This is possible because the synchrotron radiation has a very broad frequency spectrum: unless there are internal or external absorption effects, which is not the case here, there cannot be major differences in the brightness over frequency ratios of a few. Additionally, it can be observed that the ripples in the fainter regions next to the hotspot which are present in both CLEAN reconstructions are not present in the resolve one. This is rooted in the fact that resolve can take the noise level properly into account and let the prior smooth within the regions which are less informed by the data because the flux level is lower.
Figure 7 shows a direct comparison of the multi-scale CLEAN result and posterior samples of resolve. It can be observed that the resolve samples significantly deviate from the multi-scale CLEAN map. Besides, it becomes apparent that resolve assigns significant flux in regions which have negative flux in the single-scale CLEAN result.
Figure 8 displays posterior samples of the Bayesian weighting scheme. It can be observed that the correction factor generally decreases with baseline length. Its minimum and maximum values are 0.… and …. A constant factor of 0.5 or 2 could point to different conventions regarding the covariance of a complex Gaussian probability density. However, since both factors appear, this is unlikely. For the 2 GHz data set the correction factor remains at values ≈ …. This effect may be explained by inconsistencies in the data due to pointing errors. Especially at high frequencies, Cygnus A has an angular size comparable to the primary beam. Particularly near the zenith (Cygnus A transits 8 degrees from the zenith), the VLA antennas do not point accurately. The errors induced by this cannot be modeled by antenna-based calibration solutions. Therefore, pointing errors introduce inconsistencies in the data. An additional source of inconsistencies might be inconsistent calibration solutions which have been introduced during the self-calibration procedure, in which negative components in the sky brightness distribution have been used. An approach similar to Arras et al. (2019b) may be able to compute consistent calibration solutions in the first place.

In the following, we briefly discuss some of the materials that can be found in appendix A. Figure A.3 shows histograms of the posterior residuals weighted with σ(ξ(σ)) and table A.1 reports the respective χ² values. All χ² values are close to 1, which means that the Bayesian weighting scheme was successful in correcting the noise variances of the data points. For inspecting low-flux areas, fig. A.4 displays a saturated version of fig. 1 and fig. A.5 compares the multi-scale CLEAN result with the resolve posterior mean for the 2.052 GHz data set. For resolve, the three higher frequency reconstructions exhibit regions next to the main lobes which are very faint. It looks like resolve tries to make these regions negative, which is not possible due to the prior. For the 13.36 GHz data set, …
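The consistency check behind fig. A.3 and table A.1 can be written down compactly: the residual visibilities are whitened with the full noise model, i.e. the thermal noise level times the Bayesian weighting scheme, and the resulting reduced χ² should be close to one. A minimal sketch, with all array names hypothetical:

```python
import numpy as np

def reduced_chi2(vis_data, vis_model, sigma_thermal, alpha):
    """Reduced chi^2 of whitened residuals, separately for real/imag parts."""
    whitened = (vis_data - vis_model) / (sigma_thermal * alpha)
    chi2_real = np.mean(whitened.real ** 2)   # ~1 if noise model is correct
    chi2_imag = np.mean(whitened.imag ** 2)
    return chi2_real, chi2_imag
```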
Fig. 1. Overview of imaging results. The first column shows the resolve posterior mean, the middle and last columns show the single-scale CLEAN and multi-scale CLEAN results, respectively. The colorbar has units Jy/arcsec². Negative flux regions are displayed in white. See also the differently scaled version in fig. A.4.
Fig. 2. Relative pixel-wise posterior uncertainty of the resolve runs. All plots are clipped to 0.7 from above and the two pixels with point sources are ignored in determining the colorbar. Their uncertainty is reported in table 3.

The standard deviations reported in table 3 come with three caveats. First, the uncertainty on the position of the point sources is not included in the error bars. Second, inconsistencies in the data induced by the calibration can lead to underestimated posterior variance because contradictory data points pull with strong force in opposite directions in the likelihood during the inference; this results in too little posterior variance. Third, MGVI only provides a lower bound on the true uncertainty, but its estimates are still found to be largely sensible, as shown in Knollmüller & Enßlin (2019). Generally, it can be observed that the posterior standard deviation decreases with increasing frequency. This is expected since interferometers with effectively longer baselines are more sensitive to point sources.
Fig. 3. Zoomed-in version of the single-scale CLEAN reconstruction of the 13.36 GHz data set, focusing on the Western lobe and rotated counter-clockwise by 90 degrees. The colorbar is the same as in fig. 1. Negative flux regions have been set to the lower limit of the colorbar.
Fig. 4. Same as fig. 3, just with the multi-scale CLEAN reconstruction.
Fig. 5. Same as fig. 3, just with the resolve posterior mean.

Fig. 6. Overview of imaging results. Zoomed-in version of fig. 1 focusing on the Eastern hot spot.
Fig. 7. Comparison of the multi-scale CLEAN result (blue contour lines; grey regions: negative flux regions) and four resolve posterior samples (red) at 13.4 GHz.

Our results from table 3 can be compared to Perley et al. (2017, Table 1). At 8 GHz they report … mJy and (… ± 0.35) mJy for Cygnus A-2. At 13 GHz they report 1440 mJy and (4.… ± 0.17) mJy. These measurements have been taken in July 2015 whereas our measurements are from Nov 30 and Dec 5, 2015. The comparison is still valid since Perley et al. (2017) showed that the sources are not variable on the scale of one year. We can observe that all flux values are in the right ballpark and that the fluxes of Cygnus A-2 agree within 2σ. The fluxes for the central source cannot be compared well because Perley et al. (2017) do not provide uncertainties on them. However, taking only the resolve uncertainties into account, the flux values differ significantly. For the lower two frequencies no data are available in Perley et al. (2017) because the sources are not resolved by CLEAN. The resolve results give the posterior knowledge on the secondary source given its position. In this way, statements about the flux of Cygnus A-2 at low frequencies can be made even though it is not resolved. Thus, we can claim the discovery of Cygnus A-2, given its position, at the 3σ and 7σ level for the 2.052 GHz and 4.811 GHz data sets, respectively.
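One way to read such a detection level off the posterior samples, as a hedged sketch: since the sampled flux is strictly positive by construction, the significance quoted above corresponds to the ratio of posterior mean to posterior standard deviation of the flux in the point-source pixel. `flux_samples` is a hypothetical 1-d array of posterior flux values.

```python
import numpy as np

def detection_level(flux_samples):
    # Posterior mean over posterior standard deviation, in units of sigma.
    return flux_samples.mean() / flux_samples.std()
```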
Fig. 8. Posterior samples of the Bayesian weighting scheme α and prior samples for the 13.36 GHz data set. The dashed lines are located at values 0.5 and 1. The latter corresponds to no correction at all.

Each resolve run needs ≈
500 000 evaluations of the response and ≈ 400 000 evaluations of its adjoint. That makes the response part of the imaging algorithm a factor of ≈ 50 000 more expensive compared to CLEAN approaches. The good news is that the implementation of the radio response (7) in the package ducc scales well with the number of data points and that the response calls can be parallelized over the sum in (26). The resolve runs have been performed on a single node with five MPI tasks, each of which needs ≈ …. The wall time of a resolve run is between 80 and 90 h.

Single-scale CLEAN takes below 30 minutes for imaging each channel on a modern laptop. Thus, resolve is approximately 180 times slower than single-scale CLEAN here. This comparison does not account for the fact that resolve had five times the number of CPUs available.

Multi-scale CLEAN takes about 2 hours for the final round of imaging on the 13 GHz data set. This number does not account for the time taken during the initial rounds of imaging used to tune the hyper parameters and construct the mask, which can be a time-consuming process. However, it should be kept in mind that CLEAN scales much better when the dimensionality of the image is much smaller than that of the data, which is not the case here. This is because CLEAN only requires about 10–30 applications of the full measurement operator and its adjoint, even including all preprocessing steps. Taking 90 min for the average multi-scale CLEAN run, resolve is 60 times slower than multi-scale CLEAN.
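The quoted factors follow directly from the numbers in the text; making the arithmetic explicit (wall-time comparison only, ignoring the differing CPU budgets mentioned above):

```python
resolve_wall_h = 90.0    # upper end of the 80-90 h range
ssclean_wall_h = 0.5     # "below 30 minutes" per channel
msclean_wall_h = 1.5     # taking 90 min as the average final run

print(resolve_wall_h / ssclean_wall_h)   # 180.0 -> ~180x slower than ssclean
print(resolve_wall_h / msclean_wall_h)   # 60.0  -> 60x slower than msclean
print(500_000 / 10)                      # 50000 -> ~50 000x more response calls
```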
6. Conclusions
This paper compares the output of two algorithms traditionally applied in the radio interferometry community (single-scale CLEAN and multi-scale CLEAN) with a Bayesian approach to imaging called resolve. We demonstrate that resolve overcomes a variety of problems present in traditional imaging: the sky brightness distribution is a strictly positive quantity, the algorithm quantifies the uncertainty on the sky brightness distribution, and the weighting scheme is determined non-parametrically. Additionally, resolve takes varying resolution depending on the position on the sky into account, which enables super-resolution. We find that single-scale CLEAN and multi-scale CLEAN give similar results. In contrast, resolve produces images with higher resolution: the 4.8 GHz map has comparable resolution to the 13.4 GHz CLEAN maps. These advantages come at the cost of additional computational time, in our case ≈ 90 h wall time on a single node.

Future work may extend resolve to multi-frequency reconstructions where the correlation structure along the frequency axis is taken into account as well in order to increase the resolution further. Also, direction-independent and antenna-based calibration may be integrated into resolve. Finally, the prior on the sky brightness distribution may be extended to deal with polarization data as well.
Acknowledgements.
We thank Philipp Frank for his work on the correlated field model in NIFTy, Eric Greisen for explanations regarding AIPS, George Heald, Jakob Knollmüller, Wasim Raja, and Shane O'Sullivan for discussions, explanations, and feedback, and Martin Reinecke for his work on the software packages NIFTy and ducc and comments on an early draft of the manuscript. P. Arras acknowledges financial support by the German Federal Ministry of Education and Research (BMBF) under grant 05A17PB1 (Verbundprojekt D-MeerKAT). The research of O. Smirnov is supported by the South African Research Chairs Initiative of the Department of Science and Technology and National Research Foundation. The National Radio Astronomy Observatory is a facility of the National Science Foundation operated under cooperative agreement by Associated Universities, Inc.
References
Arras, P., Baltac, M., Ensslin, T. A., et al. 2019a, Astrophysics Source Code Library
Arras, P., Frank, P., Haim, P., et al. 2020a, The variable shadow of M87*
Arras, P., Frank, P., Leike, R., Westermann, R., & Enßlin, T. A. 2019b, A&A, 627, A134
Arras, P., Knollmüller, J., Junklewitz, H., & Enßlin, T. A. 2018, in 2018 26th European Signal Processing Conference (EUSIPCO), IEEE, 2683–2687
Arras, P., Perley, R. A., Bester, H. L., et al. 2020b, Zenodo [zenodo:3999841]
Cai, X., Pereyra, M., & McEwen, J. D. 2018, Monthly Notices of the Royal Astronomical Society, 480, 4154
Clark, B. G. 1980, A&A, 89, 377
Cornwell, T. J. 2008, IEEE Journal of Selected Topics in Signal Processing, 2, 793
Cornwell, T. J. & Evans, K. 1985, Astronomy and Astrophysics, 143, 77
Cornwell, T. J., Golap, K., & Bhatnagar, S. 2008, IEEE Journal of Selected Topics in Signal Processing, 2, 647
Cox, R. T. 1946, American Journal of Physics, 14, 1
Dabbech, A., Onose, A., Abdulaziz, A., et al. 2018, Monthly Notices of the Royal Astronomical Society, 476, 2853
Enßlin, T. A. 2018, Annalen der Physik, 1800127
Enßlin, T. A. & Frommert, M. 2011, Physical Review D, 83, 105014
Enßlin, T. A., Frommert, M., & Kitaura, F. S. 2009, Physical Review D, 80, 105005
Greiner, M., Vacca, V., Junklewitz, H., & Enßlin, T. A. 2016, ArXiv e-prints [arXiv:1605.04317]
Gull, S. F. & Skilling, J. 1984, in Indirect Imaging, ed. J. A. Roberts
Hamaker, J. P., Bregman, J. D., & Sault, R. J. 1996, A&AS, 117, 137
Heywood, I., Camilo, F., Cotton, W. D., et al. 2019, Nature, 573, 235–237
Högbom, J. 1974, Astronomy and Astrophysics Supplement Series, 15, 417
Honma, M., Akiyama, K., Uemura, M., & Ikeda, S. 2014, Publications of the Astronomical Society of Japan, 66, 95
Junklewitz, H., Bell, M., & Enßlin, T. 2015, Astronomy & Astrophysics, 581, A59
Junklewitz, H., Bell, M., Selig, M., & Enßlin, T. A. 2016, Astronomy & Astrophysics, 586, A76
Khintchin, A. 1934, Mathematische Annalen, 109, 604
Kingma, D. P., Salimans, T., & Welling, M. 2015, in Advances in Neural Information Processing Systems, 2575–2583
Knollmüller, J. & Enßlin, T. A. 2019, ArXiv e-prints [arXiv:1901.11033v3]
Offringa, A., McKinley, B., Hurley-Walker, N., et al. 2014, Monthly Notices of the Royal Astronomical Society, 444, 606
Offringa, A. R. & Smirnov, O. 2017, MNRAS, 471, 301
Perley, D. A., Perley, R. A., Dhawan, V., & Carilli, C. L. 2017, The Astrophysical Journal, 841, 117
Rau, U. & Cornwell, T. J. 2011, A&A, 532, A71
Schwab, F. R. & Cotton, W. D. 1983, AJ, 88, 688
Sebokolodi, L. et al. 2020, forthcoming
Selig, M., Vacca, V., Oppermann, N., & Enßlin, T. A. 2015, Astronomy & Astrophysics, 581, A126
Smirnov, O. M. 2011, Astronomy & Astrophysics, 527, A106
Sutter, P. M., Wandelt, B. D., McEwen, J. D., et al. 2014, Monthly Notices of the Royal Astronomical Society, 438, 768
Sutton, E. C. & Wandelt, B. D. 2006, The Astrophysical Journal Supplement Series, 162, 401
Wiener, N. 1949, Extrapolation, Interpolation, and Smoothing of Stationary Time Series, Vol. 2 (MIT Press, Cambridge)
Appendix A: Supplementary material

Table A.1. χ² values of the posterior of the resolve reconstructions weighted with σ(ξ(σ)).

Data set      Real part   Imaginary part
2052-2MHz     1.06        0.87
4811-8MHz     1.07        0.94
8427-8MHz     1.04        0.98
13360-8MHz    0.99        0.9
Fig. A.1. Masks used for the multi-scale CLEAN runs.
Fig. A.2. Relative pixel-wise posterior uncertainty of the resolve runs on a linear scale. The two pixels with point sources are ignored in determining the colorbar.
Fig. A.3. Histogram of posterior residuals of the resolve reconstructions weighted with σ(ξ(σ)), i.e. both the thermal noise and the Bayesian weighting scheme. Blue and orange bars denote real and imaginary parts, respectively. The black dotted line displays a standard normal Gaussian distribution scaled to the number of data points. All bins are 0.5 wide and range from −15 to 15, and the y-axes display the number of data points.
Fig. A.4. As fig. 1, just with a saturated colorbar. The colorbar has units Jy/arcsec².
Fig. A.5. Comparison of the multi-scale CLEAN result (blue, negative regions grey) and the resolve posterior mean (orange) at 2052 MHz; contour lines have multiplicative distances of √2.

Fig. A.6.
Overview of imaging results zoomed in on the central source. The top row shows the resolve posterior mean, the middle and last rows show the single-scale CLEAN and multi-scale CLEAN results, respectively. The colorbar has units Jy/arcsec².