Interpretable Faraday Complexity Classification
M. J. Alger, J. D. Livingston, N. M. McClure-Griffiths, J. L. Nabaglo, O. I. Wong, C. S. Ong
Publications of the Astronomical Society of Australia (PASA), doi: 10.1017/pas.2021.xxx
Research School of Astronomy and Astrophysics, The Australian National University, Canberra, ACT 2611, Australia
Data61, CSIRO, Canberra, ACT 2601, Australia
CSIRO Astronomy & Space Science, PO Box 1130, Bentley, WA 6102, Australia
ICRAR-M468, University of Western Australia, Crawley, WA 6009, Australia
ARC Centre of Excellence for All Sky Astrophysics in 3 Dimensions (ASTRO 3D), Australia
Research School of Computer Science, The Australian National University, Canberra, ACT 2601, Australia
Abstract
Faraday complexity describes whether a spectropolarimetric observation has simple or complex magnetic structure. Quickly determining the Faraday complexity of a spectropolarimetric observation is important for processing large, polarised radio surveys. Finding simple sources lets us build rotation measure grids, and finding complex sources lets us follow these sources up with slower analysis techniques or further observations. We introduce five features that can be used to train simple, interpretable machine learning classifiers for estimating Faraday complexity. We train logistic regression and extreme gradient boosted tree classifiers on simulated polarised spectra using our features, analyse their behaviour, and demonstrate our features are effective for both simulated and real data. This is the first application of machine learning methods to real spectropolarimetry data. With 95 per cent accuracy on simulated ASKAP data and 90 per cent accuracy on simulated ATCA data, our method performs comparably to state-of-the-art convolutional neural networks while being simpler and easier to interpret. Logistic regression trained with our features behaves sensibly on real data and its outputs are useful for sorting polarised sources by apparent Faraday complexity.
Keywords:
Radio astronomy – Radio spectroscopy – Spectropolarimetry – Astrostatistics – Classification
As polarised radiation from distant galaxies makes its way to us, magnetised plasma along the way can cause the polarisation angle to change due to the Faraday effect. The amount of rotation depends on the squared wavelength of the radiation, and the rotation per squared wavelength is called the Faraday depth. Multiple Faraday depths may exist along one line-of-sight, and if a polarised source is observed at multiple wavelengths then these multiple depths can be disentangled. This can provide insight into the polarised structure of the source or the intervening medium.

Faraday rotation measure synthesis (RM synthesis) is a technique for decomposing a spectropolarimetric observation into flux at its Faraday depths φ, the resulting distribution of depths being called a ‘Faraday dispersion function’ (FDF) or a ‘Faraday spectrum’. It was introduced by Brentjens & de Bruyn (2005) as a way to rapidly and reliably analyse the polarisation structure of complex and high-Faraday depth polarised observations.

A ‘Faraday simple’ observation is one for which there is only one Faraday depth, and in this simple case the Faraday depth is also known as a ‘rotation measure’ (RM). All Faraday simple observations can be modelled as a polarised source with a thermal plasma of constant electron density and magnetic field (a ‘Faraday screen’; Brentjens & de Bruyn, 2005; Anderson et al., 2015) between the observer and the source. A ‘Faraday complex’ observation is one which is not Faraday simple, and may differ from a Faraday simple source due to plasma emission or composition of multiple screens (Brentjens & de Bruyn, 2005). The complexity of a source tells us important details about the polarised structure of the source and along the line-of-sight, such as whether the intervening medium emits polarised radiation, or whether there are turbulent magnetic fields or different electron densities in the neighbourhood.
The complexity of nearby sources taken together can tell us about the magneto-ionic structure of the galactic and intergalactic medium between the sources and us as observers. O’Sullivan et al. (2017) show examples of simple and complex sources, and Figure 1 and Figure 2 show examples of a simulated simple and complex FDF respectively.

Identifying when an observation is Faraday complex is an important problem in polarised surveys (Sun et al., 2015), and with current surveys such as the Polarised Sky Survey of the Universe’s Magnetism (POSSUM) larger than ever before, methods that can quickly characterise Faraday complexity en masse are increasingly useful. Being able to identify which sources are simple lets us produce a reliable rotation measure grid from background sources, and being able to identify which sources might be complex allows us to find sources to follow up with slower polarisation analysis methods that may require manual oversight, such as QU fitting (as seen in e.g. Miyashita et al., 2019; O’Sullivan et al., 2017).

In this paper, we introduce five simple, interpretable features representing polarised spectra, use these features to train machine learning classifiers to identify Faraday complexity, and demonstrate their effectiveness on real and simulated data. We construct our features by comparing observed polarised sources to idealised polarised sources. The features are intuitive and can be estimated from real FDFs.

Section 2 provides a background to our work, including a summary of prior work and our assumptions on FDFs. Section 3 describes our approach to the Faraday complexity problem. Section 4 explains how we trained and evaluated our method. Finally, Section 5 discusses these results.
Faraday complexity is an observational property of a source: if multiple Faraday depths are observed within the same apparent source (e.g. due to multiple lines-of-sight being combined within a beam), then the source is complex. A source composed of multiple Faraday screens may produce observations consistent with many models (Sun et al., 2015), including simple sources, so there is some overlap between simple and complex sources. Faraday thickness is also a source of Faraday complexity: when the intervening medium between a polarised source and the observer also emits polarised light, the FDF cannot be characterised by a simple Faraday screen. As discussed in Section 2.2 we defer Faraday thick sources to future work. In this section we summarise existing methods of Faraday complexity estimation and explain our assumptions and model of simple and complex polarised FDFs.
There are multiple ways to estimate Faraday complexity, including detecting non-linearity in χ(λ²) (Goldstein & Reed, 1984), change in fractional polarisation as a function of frequency (Farnes et al., 2014), non-sinusoidal variation in fractional polarisation in Stokes Q and U (O’Sullivan et al., 2012), counting components in the FDF (Law et al., 2011), minimising the Bayesian information criterion (BIC) over a range of simple and complex models (called ‘QU fitting’; O’Sullivan et al., 2017), the method of Faraday moments (Anderson et al., 2015; Brown, 2011), and deep convolutional neural network classifiers (CNNs; Brown et al., 2018). See Sun et al. (2015) for a comparison of these methods.

Figure 1. A simple FDF and its corresponding polarised spectra: (a) groundtruth FDF F, (b) noise-free polarised spectrum P, (c) noisy observed FDF F̂, (d) noisy polarised spectrum P̂. Blue and orange mark real and imaginary components respectively.

The most common approaches to estimating complexity are QU fitting (e.g. O’Sullivan et al., 2017) and Faraday moments (e.g. Anderson et al., 2015). To our knowledge there is currently no literature examining the accuracy of QU fitting when applied to complexity classification specifically, though Miyashita et al. (2019) analyse its effectiveness on identifying the structure of two-component sources. Brown (2011) suggested Faraday moments as a method to identify complexity, a method later used by Farnes et al. (2014) and Anderson et al. (2015), but again no literature examines the accuracy. CNNs are the current state-of-the-art with an accuracy of 94.9 per cent (Brown et al., 2018) on simulated ASKAP Band 1 and 3 data, and we will compare our results to this method.

Before we can classify FDFs as Faraday complex or Faraday simple, we need to define FDFs and any assumptions we make about them. An FDF is a function that maps Faraday depth φ to complex polarisation. It is the distribution of Faraday depths in an observed polarisation spectrum. For a given observation, we assume there is a true, noise-free FDF F composed of at most two Faraday screens.
This accounts for most actual sources (Anderson et al., 2015) and extension to three screens would cover most of the remainder: O’Sullivan et al. (2017) found that 89 per cent of their sources were best explained by two or fewer screens, while the remainder were best explained by three screens.

Figure 2. A complex FDF and its corresponding polarised spectra: (a) groundtruth FDF F, (b) noise-free polarised spectrum P, (c) noisy observed FDF F̂, (d) noisy polarised spectrum P̂. Blue and orange mark real and imaginary components respectively.

We model the screens by Dirac delta distributions:

    F(φ) = A₁ δ(φ − φ₁) + A₂ δ(φ − φ₂). (1)

A₁ and A₂ are the polarised fluxes of each Faraday screen, and φ₁ and φ₂ are the Faraday depths of the respective screens. With this model, a Faraday simple source is one which has A₁ = 0, A₂ = 0, or φ₁ = φ₂. By using delta distributions to model each screen, we are assuming that there is no internal Faraday dispersion (which is typically associated with diffuse emission rather than the mostly-compact sources we expect to find in wide-area polarised surveys). F generates a polarised spectrum of the form shown in Equation 2:

    P(λ²) = A₁ e^{2iφ₁λ²} + A₂ e^{2iφ₂λ²}. (2)

Such a spectrum would be observed as noisy samples at a number of squared wavelengths λ²ⱼ, j ∈ [1, …, D]. We model this noise as a complex Gaussian with standard deviation σ and call the noisy observed spectrum P̂:

    P̂(λ²ⱼ) ∼ N(P(λ²ⱼ), σ²). (3)

The constant variance of the noise is a simplifying assumption which may not hold for real data, and exploring this is a topic for future work. By performing RM synthesis (Brentjens & de Bruyn, 2005) on P̂ with uniform weighting we arrive at an observed FDF:

    F̂(φ) = (1/D) Σ_{j=1}^{D} P̂(λ²ⱼ) e^{−2iφλ²ⱼ}. (4)

Examples of F, F̂, P, and P̂ for simple and complex observations are shown in Figure 1 and Figure 2 respectively. Note that there are two reasons that the observed FDF F̂ does not match the groundtruth FDF F. The first is the noise in P̂. The second arises from the incomplete sampling of P̂(λ²).

We do not consider external or internal Faraday dispersion in this work.
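Equations 1–4 can be illustrated in a few lines of numpy. This is a minimal sketch with hypothetical amplitudes, depths, and noise level, not the paper’s simulation code (which follows Brown et al. 2018); the ATCA-like frequency coverage is taken from Section 4.1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-screen source (Equation 1); amplitudes and depths
# are illustrative values, not from the paper.
A1, A2 = 1.0, 0.6
phi1, phi2 = -150.0, 150.0               # rad m^-2

# ATCA-like coverage: 394 channels, 1.29-3.02 GHz -> squared wavelengths
freq = np.linspace(1.29e9, 3.02e9, 394)
lam2 = (2.998e8 / freq) ** 2             # lambda^2 in m^2

# Noise-free polarised spectrum (Equation 2)
P = A1 * np.exp(2j * phi1 * lam2) + A2 * np.exp(2j * phi2 * lam2)

# Noisy observation (Equation 3): complex Gaussian noise, illustrative sigma
sigma = 0.05
P_hat = P + sigma * (rng.standard_normal(lam2.size)
                     + 1j * rng.standard_normal(lam2.size))

# RM synthesis with uniform weighting (Equation 4)
phi = np.arange(-250.0, 251.0, 1.0)
F_hat = np.exp(-2j * np.outer(phi, lam2)) @ P_hat / lam2.size

# The FDF peak should sit near the brighter screen at phi1
peak_depth = phi[np.argmax(np.abs(F_hat))]
```

The recovered |F̂| shows two peaks near φ₁ and φ₂ convolved with the RMSF, with the global peak near the brighter screen.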
External Faraday dispersion would broaden the delta functions of Equation 1 into peaks, and internal Faraday dispersion would broaden them into top-hat functions. All sources have at least a small amount of dispersion as the Faraday depth is a bulk property of the intervening medium and is subject to noise, but the assumption we make is that this dispersion is sufficiently small that the groundtruth FDFs are well-modelled with delta functions. Faraday thick sources would also invalidate our assumptions, and we assume that there are none in our data, as Faraday thickness can be consistent with a two-component model depending on the wavelength sampling (e.g. Ma et al., 2019; Brentjens & de Bruyn, 2005). Nevertheless some external Faraday dispersion would be covered by our model, as depending on observing parameters Faraday thick sources may appear as two screens (Van Eck et al., 2017).

To simulate observed FDFs we follow the method of Brown et al. (2018), which we describe in Appendix E.

The Faraday complexity classification problem is as follows: given an FDF F̂, is it Faraday complex or Faraday simple? In this section we describe the features that we have developed to address this problem, which can be used in any standard machine learning classifier. We trained two classifiers on these features, which we describe here also.

Our features are based on a simple idea: all simple FDFs look essentially the same, up to scaling and translation, while complex FDFs may deviate. A noise-free peak-normalised simple FDF F̂_simple has the form

    F̂_simple(φ; φ_s) = R(φ − φ_s), (5)

where R is the rotation measure spread function (RMSF), the Fourier transform of the wavelength sampling function which is 1 at all observed wavelengths and 0 otherwise. φ_s traces out a curve in the space of all possible FDFs. In other words, F̂_simple is a manifold parametrised by φ_s. Our features are derived from relating an observed FDF to the manifold of simple FDFs (the ‘simple manifold’).
We measure the distance of an observed FDF to the simple manifold using a distance measure D_f that takes all values of the FDF into account:

    ς_f(F̂) = min_{φ_s ∈ ℝ} D_f(F̂(φ) ‖ F̂_simple(φ; φ_s)). (6)

We propose two distances that have nice properties:

• invariant over changes in complex phase,
• translationally invariant in Faraday depth,
• zero for Faraday simple sources (i.e. when A₁ = 0, A₂ = 0, or φ₁ = φ₂) when there is no noise,
• symmetric in components (i.e. swapping A₁ ↔ A₂ and φ₁ ↔ φ₂ should not change the distance),
• increasing as A₁ and A₂ become closer to each other, and
• increasing as the screen separation |φ₁ − φ₂| increases over a large range.

Our features are constructed from this distance and its minimiser. In other words we look for the simple FDF F̂_simple that is “closest” to the observed FDF F̂. The minimiser φ_s is the Faraday depth of the simple FDF. While we could choose any distance that operates on functions, we used the 2-Wasserstein (W₂) distance (7) and the Euclidean distance (9). The W₂ distance operates on probability distributions and can be thought of as the minimum cost to ‘move’ one probability distribution to the other, where the cost of moving one unit of probability mass is the squared distance it is moved. Under the W₂ distance, the minimiser φ_w in Equation 6 can be interpreted as the Faraday depth that the FDF F̂ would be observed to have if its complexity was unresolved (i.e. the weighted mean of its components). The Euclidean distance is the square root of the least-squares loss which is often used for fitting F̂_simple to the FDF F̂. Under the Euclidean distance, the minimiser φ_e is equivalent to the depth of the best-fitting single component under the assumption of Gaussian noise in F̂. We calculated the W₂ distance using Python Optimal Transport (Flamary & Courty, 2017), and we calculated the Euclidean distance using scipy.spatial.distance.euclidean (Virtanen et al., 2020).
Further intuition about the two distances is provided in Section 3.2.

We denote by φ_w and φ_e the Faraday depths of the simple FDFs that minimise the respective distances (2-Wasserstein and Euclidean):

    φ_w = argmin_{φ_s} D_W(F̂(φ) ‖ F̂_simple(φ; φ_s)),
    φ_e = argmin_{φ_s} D_E(F̂(φ) ‖ F̂_simple(φ; φ_s)).

These features are depicted on an example FDF in Figure 3. For simple observed FDFs, the fitted Faraday depths φ_w and φ_e both tend to be close to the peak of the observed FDF. However for complex observed FDFs, φ_w tends to be at the average depth between the two major peaks of the observed FDF, being closer to the higher peak. For notational convenience, we denote the Faraday depth of the observed FDF that has largest magnitude as φ_a, i.e.

    φ_a = argmax_φ |F̂(φ)|.

Note that in practice φ_a ≈ φ_e.

Figure 3. An example of how an observed FDF F̂ relates to our features. φ_w is the W₂-minimising Faraday depth, and φ_a is the |F̂|-maximising Faraday depth (approximately equal to the Euclidean-minimising Faraday depth). The remaining two features are the W₂ and Euclidean distances between the depicted FDFs.

For complex observed FDFs, the values of the Faraday depths φ_w and φ_a tend to differ (essentially by a proportion of the location of the second screen). The difference between φ_w and φ_a therefore provides useful information to identify complex FDFs. When the observed FDF is simple, the 2-Wasserstein fit will overlap it significantly, hence the observed magnitudes F̂(φ_w) and F̂(φ_a) will be similar. However, for complex FDFs φ_w and φ_a are at different depths, leading to different values of F̂(φ_w) and F̂(φ_a). Therefore the magnitudes of the observed FDF at the depths φ_w and φ_a indicate how different the observed FDF is from a simple FDF.

In summary, we provide the following features to the classifier:

• log |φ_w − φ_a|,
• log F̂(φ_w),
• log F̂(φ_a),
• log D_W(F̂(φ) ‖ F̂_simple(φ; φ_w)),
• log D_E(F̂(φ) ‖ F̂_simple(φ; φ_e)),

where D_E is the Euclidean distance, D_W is the W₂ distance, φ_a is the Faraday depth of the FDF peak, φ_w is the minimiser for the W₂ distance, and φ_e is the minimiser for the Euclidean distance.

Interestingly, in the case where there is no RMSF, Equation 6 with the W₂ distance reduces to the Faraday moment already in common use:

    D_W(F) = min_{φ_w ∈ ℝ} D_W(F(φ) ‖ F_simple(φ; φ_w)) (7)
           = ( (A₁A₂ / (A₁ + A₂)²) (φ₁ − φ₂)² )^{1/2}. (8)

See Appendix A for the corresponding calculation. In this sense, the W₂ distance can be thought of as a generalised Faraday moment, and conversely Faraday moments can be interpreted as a distance from the simple manifold in the case where there is no RMSF.
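The no-RMSF case of Equations 7–8 can be checked numerically: against a single-delta target, all probability mass moves to one depth, so the W₂ minimiser is the weighted mean depth and the distance is the weighted standard deviation. The following numpy sketch (with hypothetical amplitudes and depths) confirms this; it does not reproduce the full features, which use the observed RMSF and Python Optimal Transport:

```python
import numpy as np

def w2_to_simple(phi, amp):
    """W2 distance from a normalised FDF amplitude distribution to the
    closest single Faraday screen (no-RMSF limit of Equation 6).
    The minimiser is the weighted mean depth; the distance is the
    weighted standard deviation of the depth distribution."""
    p = amp / amp.sum()
    phi_w = np.sum(p * phi)                       # minimising depth
    return np.sqrt(np.sum(p * (phi - phi_w) ** 2)), phi_w

# Two screens with hypothetical amplitudes A1=3, A2=1 at depths 0 and 100
phi = np.linspace(-250.0, 250.0, 1001)            # 0.5 rad m^-2 spacing
amp = np.zeros_like(phi)
amp[np.argmin(np.abs(phi - 0.0))] = 3.0
amp[np.argmin(np.abs(phi - 100.0))] = 1.0

dist, phi_w = w2_to_simple(phi, amp)

# Equation 8: sqrt(A1*A2 / (A1 + A2)^2 * (phi1 - phi2)^2)
moment = np.sqrt(3.0 * 1.0 / (3.0 + 1.0) ** 2 * 100.0 ** 2)
```

Here `dist` agrees with `moment` exactly, and `phi_w` lands at the amplitude-weighted mean depth of 25 rad m⁻², matching the interpretation of φ_w as the unresolved Faraday depth.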
The Euclidean distance behaves quite differently in this case, and the resulting distance measure is totally independent of Faraday depth:

    D_E(F) = min_{φ_e ∈ ℝ} D_E(F(φ) ‖ F_simple(φ; φ_e)) (9)
           = √2 min(A₁, A₂) / (A₁ + A₂). (10)

See Appendix B for the corresponding calculation.

We trained two classifiers on simulated observations using these features: logistic regression (LR) and extreme gradient boosted trees (XGB). These classifiers are useful together for understanding Faraday complexity classification. LR is a linear classifier that is readily interpretable by examining the weights it applies to each feature, and is one of the simplest possible classifiers. XGB is a powerful off-the-shelf non-linear ensemble classifier, and is an example of the decision tree ensembles which are widely used in astronomy (e.g. Machado Poletti Valle et al., 2020; Hložek et al., 2020). We used the scikit-learn implementation of LR and the XGBoost library for XGB. We optimised hyperparameters for XGB using a fork of xgboost-tuner (https://github.com/chengsoonong/xgboost-tuner) as utilised by Zhu et al. (2020). We used 1 000 iterations of randomised parameter tuning, and the hyperparameters we found are tabulated in Table 2. We optimised hyperparameters for LR using a 5-fold cross-validation grid search implemented in sklearn.model_selection.GridSearchCV. The resulting hyperparameters are tabulated in Table 3 in the Appendix.

We applied our classifiers to classify simulated (Sections 4.2 and 4.3) and real (Section 4.4) FDFs. We replicated the experimental setup of Brown et al. (2018) for comparison with the state-of-the-art CNN classification method, and we also applied our method to 142 real FDFs observed with the Australia Telescope Compact Array (ATCA) from Livingston et al. (2020, submitted) and O’Sullivan et al. (2017).
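To make the LR half of this concrete, the sketch below fits a logistic regression by batch gradient descent on synthetic stand-ins for the five features (Gaussian clusters, not the paper’s simulated FDF features); the paper itself uses the scikit-learn implementation with tuned hyperparameters, so this is only to show the mechanics and the weight-based interpretability:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for the five features of Section 3.1: simple sources
# cluster at low values, complex sources at high values (hypothetical data).
n = 2000
X = np.vstack([rng.normal(-1.0, 1.0, (n, 5)),    # label 0: simple
               rng.normal(+1.0, 1.0, (n, 5))])   # label 1: complex
y = np.concatenate([np.zeros(n), np.ones(n)])

# Logistic regression fitted by batch gradient descent on the log loss
w, b = np.zeros(5), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))       # predicted P(complex)
    w -= 0.5 * (X.T @ (p - y)) / y.size          # log-loss gradient step
    b -= 0.5 * np.mean(p - y)

accuracy = np.mean(((X @ w + b) > 0.0) == (y == 1.0))
```

After training, each entry of `w` can be read directly as the influence of one feature on the predicted probability of complexity, which is the interpretability advantage of LR over the CNN.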
Our classifiers were trained and validated on simulated FDFs. We produced two sets of simulated FDFs, one for comparison with the state-of-the-art method in the literature and one for application to our observed FDFs (described in Section 4.1.2). We refer to the former as the ‘ASKAP’ dataset as it uses frequencies from the Australian Square Kilometre Array Pathfinder 12-antenna early science configuration. These frequencies included 900 channels from 700–1300 and 1500–1800 MHz and were used to generate simulated training and validation data by Brown et al. (2018). We refer to the latter as the ‘ATCA’ dataset as it uses frequencies from the 1–3 GHz configuration of the ATCA. These frequencies included 394 channels from 1.29–3.02 GHz and match our real data. We simulated Faraday depths from −50 to 50 rad m⁻² for the ‘ASKAP’ dataset (matching Brown) and −500 to 500 rad m⁻² for the ‘ATCA’ dataset.

For each dataset, we simulated 100 000 FDFs, approximately half simple and half complex. We randomly allocated half of these FDFs to a training set and reserved the remaining half for validation. Each FDF had complex Gaussian noise added to the corresponding polarisation spectrum. For the ‘ASKAP’ dataset, we sampled the standard deviation of the noise uniformly between 0 and σ_max; for the ‘ATCA’ dataset, we drew it from a log-normal distribution:

    p(σ) = (1 / (√(2π) s σ)) exp( −(log(50σ) − μ)² / (2s²) ). (11)

We used two real datasets containing a total of 142 sources: 42 polarised spectra from Livingston et al. (2020, submitted) and 100 polarised spectra from O’Sullivan et al. (2017). These datasets were observed in similar frequency ranges on the same telescope (with different binning), but are in different parts of the sky. The Livingston data were taken near the Galactic Centre, and the O’Sullivan data were taken away from the plane of the Galaxy. There are more Faraday complex sources near the Galactic Centre compared to more Faraday simple sources away from the plane of the Galaxy (Livingston et al.). The similar frequency channels used in the two datasets result in almost identical RMSFs over the Faraday depth range we considered (−500 to 500 rad m⁻²), so we expected that the classifiers would work equally well on both datasets with no need to re-train. We discarded the 26 Livingston sources with modelled Faraday depths outside of this Faraday depth range, which we do not expect to affect the applicability of our methods to wide-area surveys because these fairly high depths are not common.

Livingston et al. (2021) used RM-CLEAN (Heald, 2008) to identify significant components in their FDFs. Some of these components had very high Faraday depths, up to 2000 rad m⁻², but we chose to ignore these components in this paper as they are much larger than might be expected in a wide-area survey like POSSUM. They used the second Faraday moment (Brown, 2011) to estimate Faraday complexity, with Faraday depths determined using scipy.signal.find_peaks on the cleaned FDFs, with a cutoff of 7 times the noise of the polarised spectrum. Using this method, they estimated that 89 per cent of their sources were Faraday complex, i.e. had a Faraday moment greater than 0.

O’Sullivan et al. (2017) used the QU-fitting and model selection technique described in O’Sullivan et al. (2012). The QU-fitting models contained up to three Faraday screen components as well as a term for internal and external Faraday dispersion. We ignore the Faraday thickness and dispersion for the purposes of this paper, as most sources were not found to have Faraday thickness, and dispersion is beyond the scope of our current work. 37 sources had just one component, 52 had two, and the remaining 11 had three.

Table 1. Confusion matrix entries for LR and XGB on ‘ASKAP’ and ‘ATCA’ simulated datasets, and the CNN confusion matrix entries adapted from Brown et al. (2018).

                           ‘ASKAP’              ‘ATCA’
                       LR     XGB    CNN     LR     XGB
  True negative rate   0.99   0.99   0.97    0.92   0.91
  False positive rate  0.01   0.01   0.03    0.08   0.09
  False negative rate  0.10   0.09   0.07    0.16   0.10
  True positive rate   0.90   0.91   0.93    0.84   0.90
The accuracy of the LR and XGB classifiers on the ‘ASKAP’ testing set was 94.4 and 95.1 per cent respectively. The rates of true and false identifications are summarised in Table 1. These results are very close to the CNN presented by Brown et al. (2018), with a slightly higher true negative rate and a slightly lower true positive rate (recalling that positive sources are complex, and negative sources are simple). The accuracy of the CNN was 94.9 per cent, slightly lower than our XGB classifier and slightly higher than our LR classifier. Both of our classifiers therefore produce similar classification performance to the CNN, with faster training time and easier interpretation.
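Because the simulated datasets are approximately half simple and half complex (Section 4.1), the overall accuracies can be roughly recovered from Table 1 as the mean of the true negative and true positive rates; the small discrepancies against the reported values (94.4, 95.1, 94.9, 89.2, and 90.5 per cent) reflect rounding in the table and the classes being only approximately balanced. A quick check:

```python
# Approximate overall accuracy from Table 1, assuming exactly balanced
# classes (an approximation; see Section 4.1).
rates = {                      # (true negative rate, true positive rate)
    ("ASKAP", "LR"):  (0.99, 0.90),
    ("ASKAP", "XGB"): (0.99, 0.91),
    ("ASKAP", "CNN"): (0.97, 0.93),
    ("ATCA", "LR"):   (0.92, 0.84),
    ("ATCA", "XGB"):  (0.91, 0.90),
}
approx_accuracy = {k: 0.5 * (tnr + tpr) for k, (tnr, tpr) in rates.items()}
```

This gives, for example, 0.945 for LR and 0.95 for XGB on ‘ASKAP’, consistent with the reported 94.4 and 95.1 per cent.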
The accuracy of the LR and XGB classifiers on the ‘ATCA’ dataset was 89.2 and 90.5 per cent respectively. The major differences between the ‘ATCA’ and the ‘ASKAP’ experiments are the range of the simulated Faraday depths and the distribution of noise levels. The ‘ASKAP’ dataset, to match past CNN work, only included depths from −50 to 50 rad m⁻², while the ‘ATCA’ dataset includes depths from −500 to 500 rad m⁻². The rates of true and false identifications are again shown in Table 1.

Figure 4. Mean prediction as a function of component depth separation and minimum component amplitude for (a) XGB and (b) LR.

As we know the true Faraday depths of the components in our simulation, we can investigate the behaviour of these classifiers as a function of physical properties. Figure 4 shows the mean classifier prediction as a function of component depth separation and minimum component amplitude. This is tightly related to the mean accuracy, as the entire plot domain contains complex spectra apart from the left and bottom edges: by thresholding the classifier prediction at a certain value, the accuracy away from the edges will be one hundred per cent for all sources with higher prediction values.

We used the LR and XGB classifiers which were trained on the ‘ATCA’ dataset to estimate the probability that our 142 observed FDFs (Section 4.1.2) were Faraday complex. As these classifiers were trained on simulated data, they face the issue of the ‘domain gap’: the distribution of samples from a simulation differs from the distribution of real sources, and this affects performance on real data. Solving this issue is called ‘domain adaptation’ and how to do this is an open research question in machine learning (Zhang, 2019; Pan & Yang, 2010). Nevertheless, the features of our observations mostly fall in the same region of feature space as the simulations (Figure 5) and so we expect reasonably good domain transfer.

Figure 5. Principal component analysis for simulated data (coloured dots) with observations overlaid (black-edged circles). Observations are coloured by their XGB or LR estimated probability of being complex, with blue indicating ‘most simple’ and pink indicating ‘most complex’.

Two apparently complex sources in the Livingston sample are classified as simple with high probability by XGB. These outliers are on the very edge of the training sample (Figure 5) and the underdensity of training data here is likely the cause of this issue. LR does not suffer the same issue, producing plausible predictions for the entire dataset, and these sources are instead classified as complex with high probability.

Figure 6. Estimated rates of Faraday complexity for the Livingston and O’Sullivan datasets as functions of threshold. The horizontal lines indicate the rates of Faraday complexity estimated by Livingston and O’Sullivan respectively.

With a threshold of 0.5, LR predicted that 96 and 83 per cent of the Livingston and O’Sullivan sources were complex respectively. This is in line with expectations that the Livingston data should have more Faraday complex sources than the O’Sullivan data due to their location near the Galactic Centre. XGB predicted that 93 and 100 per cent of the Livingston and O’Sullivan sources were complex respectively. Livingston et al. (2021) found that 90 per cent of their sources were complex, and O’Sullivan et al. (2017) found that 64 per cent of their sources were complex. This suggests that our classifiers are overestimating complexity, though it could also be the case that the methods used by Livingston and O’Sullivan underestimate complexity. Modifying the prediction threshold from 0.5 changes the estimated rate of Faraday complexity, and we show the estimated rates against threshold for both classifiers in Figure 6. We suggest that this result is indicative of our probabilities being uncalibrated, and a higher threshold should be chosen in practice. We chose to keep the threshold at 0.5 as this had the highest accuracy on the simulated validation data. The very high complexity rates of XGB and its two outlying classifications indicate that the XGB classifier may be overfitting to the simulation and that it is unable to generalise across the domain gap.

Figure 7 and Figure 8 show every observed FDF ordered by estimated Faraday complexity, alongside the models predicted by Livingston and O’Sullivan et al. (2017), for LR and XGB respectively. There is a clear visual trend of increasingly complex sources with increasing predicted probability of being complex.
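The threshold dependence behind Figure 6 amounts to counting how many predicted probabilities exceed each cut; a tiny sketch with hypothetical probabilities (not the paper’s predictions) shows how raising the threshold lowers the estimated complexity rate:

```python
import numpy as np

# Hypothetical predicted probabilities of being complex for a handful
# of sources (illustrative values only).
probs = np.array([0.10, 0.40, 0.55, 0.70, 0.80, 0.95])

def complexity_rate(probs, threshold):
    """Fraction of sources called complex at the given threshold."""
    return float(np.mean(probs > threshold))

# Sweeping the threshold traces out a curve like those in Figure 6.
rates = {t: complexity_rate(probs, t) for t in (0.5, 0.7, 0.9)}
```

With uncalibrated probabilities, choosing a higher threshold is equivalent to demanding stronger evidence before calling a source complex.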
On simulated data (Section 4.3) we achieve state-of-the-art accuracy. Our results on observed FDFs show that our classifiers produce plausible results, with Figure 7 and Figure 8 showing a clear trend of apparent complexity. Some issues remain: we discuss the intrinsic overlap between simple and complex FDFs in Section 5.1 and the limitations of our method in Section 5.2.
Through this work we found our methods limited by the significant overlap between complex and simple FDFs. Complex FDFs can be consistent with simple FDFs due to close Faraday components or very small amplitudes on the secondary component, and vice versa due to noise.

The main failure mode of our classifiers is misclassifying a complex source as simple (Table 1). Whether sources with close components or small amplitudes should be considered complex is not clear, since for practical purposes they can be treated as simple: assuming the source is simple yields a very similar RM to the RM of the primary component, and thus would not negatively impact further data products such as an RM grid. The scenarios where we would want a Faraday complexity classifier rather than a polarisation structure model (large-scale analysis and wide-area surveys) do not seem to be disadvantaged by considering such sources simple. Additional sources similar to these are likely hidden in presumably ‘simple’ FDFs by the frequency range and spacing of the observations, just as these complex sources would be hidden in lower-resolution observations. Note also that misidentification of complex sources as simple is intrinsically a problem with complexity estimation even for models not well-represented by a simple FDF, as complex sources may conspire to appear as a wide range of viable models, including simple ones (Sun et al., 2015).

Conversely, high-noise simple FDFs may be consistent with complex FDFs. One key question is how Faraday complexity estimators should behave as the noise increases: should high noise result in a complex prediction or a simple prediction, given that a complex or simple FDF would both be consistent with a noisy FDF? Occam’s razor suggests that we should choose the simplest suitable model, and so increasing noise should lead to predictions of less complexity.
This is not how our classifiers operate, however: high-noise FDFs are different to the model simple FDFs and so are predicted to be ‘not simple’. In some sense our classifiers are not looking for complex sources, but are rather looking for ‘not simple’ sources.
Our main limitations are our simplifying assumptions on FDFs and the domain gap between simulated and real observations. However, our proposed features (Section 3.1) can be applied to future improved simulations.

It is unclear what the effect of our simplifying assumptions is on the effectiveness of our simulation. The three main simplifications that may negatively affect our simulations are 1) limiting to two components, 2) assuming no external Faraday dispersion, and 3) assuming no internal Faraday dispersion (Faraday thickness). Future work will explore removing these simplifying assumptions, but will need to account for the increased difficulty in characterising the simulation with more components and no longer having Faraday screens as components. Additionally, more work will be required to make sure that the rates of internal and external Faraday dispersion match what might be expected from real sources, or risk making a simulation that has too large a range of consistent models for a given source: for example, a two-component source could also be explained as a sufficiently wide or resolved-out Faraday thick source or a three-component source with a small third component. This greatly complicates the classification task.

Previous machine learning work (e.g. Brown et al., 2018) has not been run before on real FDF data, so this paper is the first example of the domain gap arising in Faraday complexity classification. This is a problem that requires further research to solve. We have no good way to ensure that our simulation matches reality, so some amount of domain adaptation will always be necessary to train classifiers on simulated data and then apply these classifiers to real data. But with the low source counts in polarisation science (high-resolution spectropolarimetric data currently number in the few hundreds) any machine learning method will need to be trained on simulations.
This is not just a problem in Faraday complexity estimation; domain adaptation is also an issue faced in the wider astroinformatics community: large quantities of labelled data are hard to come by, and some sources are very rare (e.g. gravitational wave detections or fast radio bursts; Zevin et al., 2017; Gebhard et al., 2019; Agarwal et al., 2020). LR seems to handle the domain adaptation better than XGB, with only a slightly lower accuracy on simulated data. Our results are plausible, and the distribution of our simulation overlaps well with the distribution of our real data (Figure 5).
We developed a simple, interpretable machine learning method for estimating Faraday complexity. Our interpretable features were derived by comparing observed FDFs to idealised simple FDFs, which we could determine for both simulated and real observations. We demonstrated the effectiveness of our method on both simulated and real data. Using simulated data, we found that our classifiers were 95 per cent accurate, with near-perfect recall (specificity) of Faraday simple sources. On simulated data that matched existing observations, our classifiers obtained an accuracy of 90 per cent. Evaluating our classifiers on real data gave the plausible results shown in Figure 7, and marks the first application of machine learning to observed FDFs. Future work will need to narrow the domain gap to improve transfer of classifiers trained on simulations to real, observed data.
This research was conducted in Canberra, on land for which the Ngunnawal and Ngambri people are the traditional and ongoing custodians. M.J.A. and J.D.L. were supported by the Australian Government Research Training Program. M.J.A. was supported by the Astronomical Society of Australia. The Australia Telescope Compact Array is part of the Australia Telescope National Facility, which is funded by the Australian Government for operation as a National Facility managed by CSIRO. We acknowledge the Gomeroi people as the traditional owners of the Observatory site. We thank the anonymous referee for their comments on this work.
A 2-WASSERSTEIN BEGETS FARADAY MOMENTS
Minimising the 2-Wasserstein distance between a model FDF and the simple manifold gives the second Faraday moment of that FDF. Let \tilde{F} be the sum-normalised model FDF and let \tilde{S} be the sum-normalised simple model FDF:

\tilde{F}(\phi) = \frac{A_1 \delta(\phi - \phi_1) + A_2 \delta(\phi - \phi_2)}{A_1 + A_2}    (12)

\tilde{S}(\phi; \phi_w) = \delta(\phi - \phi_w).    (13)

The W_2 distance, usually defined on probability distributions, can be extended to one-dimensional complex functions A and B by normalising them:

D_W^2(A \| B) = \inf_{\gamma \in \Gamma(\tilde{A}, \tilde{B})} \int\!\!\int_{\phi_\mathrm{min}}^{\phi_\mathrm{max}} |x - y|^2 \, \mathrm{d}\gamma(x, y)    (14)

\tilde{A}(\phi) = \frac{|A(\phi)|}{\int_{\phi_\mathrm{min}}^{\phi_\mathrm{max}} |A(\theta)| \, \mathrm{d}\theta}    (15)

\tilde{B}(\phi) = \frac{|B(\phi)|}{\int_{\phi_\mathrm{min}}^{\phi_\mathrm{max}} |B(\theta)| \, \mathrm{d}\theta}    (16)

where \Gamma(\tilde{A}, \tilde{B}) is the set of couplings of \tilde{A} and \tilde{B}, i.e. the set of joint probability distributions that marginalise to \tilde{A} and \tilde{B}, and \inf_{\gamma \in \Gamma(\tilde{A}, \tilde{B})} is the infimum over \Gamma(\tilde{A}, \tilde{B}). This can be interpreted as the minimum cost to ‘move’ one probability distribution to the other, where the cost of moving one unit of probability mass is the squared distance it is moved.

The set of couplings \Gamma(\tilde{F}, \tilde{S}) is the set of all joint probability distributions \gamma such that

\int_{\phi_\mathrm{min}}^{\phi_\mathrm{max}} \gamma(\phi, \varphi) \, \mathrm{d}\phi = \tilde{S}(\varphi; \phi_w),    (17)

\int_{\phi_\mathrm{min}}^{\phi_\mathrm{max}} \gamma(\phi, \varphi) \, \mathrm{d}\varphi = \tilde{F}(\phi).    (18)

The coupling that minimises the integral in Equation 14 will be the optimal transport plan between \tilde{F} and \tilde{S}. Since \tilde{F} and \tilde{S} are defined in terms of delta functions, the optimal transport problem reduces to a discrete optimal transport problem, and the optimal transport plan is:

\gamma(\phi, \varphi) = \frac{A_1 \delta(\phi - \phi_1) + A_2 \delta(\phi - \phi_2)}{A_1 + A_2} \, \delta(\varphi - \phi_w).    (19)

In other words, to move the probability mass of \tilde{S} to \tilde{F}, a fraction A_1/(A_1 + A_2) is moved from \phi_w to \phi_1 and the complementary fraction A_2/(A_1 + A_2) is moved from \phi_w to \phi_2. Then:

D_W^2(\tilde{F} \| \tilde{S}) = \int\!\!\int_{\phi_\mathrm{min}}^{\phi_\mathrm{max}} |\phi - \varphi|^2 \, \mathrm{d}\gamma(\phi, \varphi)    (20)

= \frac{A_1 (\phi_1 - \phi_w)^2 + A_2 (\phi_2 - \phi_w)^2}{A_1 + A_2}.    (21)
To obtain the W_2 distance to the simple manifold, we need to minimise this over \phi_w. Differentiating with respect to \phi_w and setting the result equal to zero gives

\phi_w = \frac{A_1 \phi_1 + A_2 \phi_2}{A_1 + A_2}.    (22)

Substituting this back in, we find

\varsigma_W^2(F) = \frac{A_1 A_2}{(A_1 + A_2)^2} (\phi_1 - \phi_2)^2,    (23)

which is the second Faraday moment.
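This closed form is straightforward to verify numerically. The sketch below is our illustration, not code from the paper: for a point-mass target, the transport cost reduces to the amplitude-weighted squared distance to the screen depth, so the minimum over the screen depth is the weighted variance of the component depths (the helper name `min_w2_squared` is hypothetical).

```python
import numpy as np

def min_w2_squared(amps, phis):
    """Minimised squared 2-Wasserstein distance from a delta-function FDF
    to the nearest single Faraday screen. For a point-mass target this is
    the amplitude-weighted variance of the depths, i.e. the second
    Faraday moment."""
    w = np.asarray(amps, dtype=float)
    w = w / w.sum()                      # normalised transport masses
    phis = np.asarray(phis, dtype=float)
    phi_w = np.sum(w * phis)             # optimal screen depth (Equation 22)
    return float(np.sum(w * (phis - phi_w) ** 2))

# Two-component check against the closed form (Equation 23).
A1, A2, phi1, phi2 = 0.7, 0.3, -15.0, 40.0
closed_form = A1 * A2 * (phi1 - phi2) ** 2 / (A1 + A2) ** 2
assert abs(min_w2_squared([A1, A2], [phi1, phi2]) - closed_form) < 1e-9
```

A brute-force grid search over the screen depth gives the same minimum, confirming that the weighted mean is the optimal \phi_w.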
B EUCLIDEAN DISTANCE IN THE NO-RMSF CASE

In this section we calculate the minimised Euclidean distance evaluated on a model FDF (Equation 1). Let \tilde{F} be the sum-normalised model FDF and let \tilde{S} be the normalised simple model FDF:

\tilde{F}(\phi) = \frac{A_1 \delta(\phi - \phi_1) + A_2 \delta(\phi - \phi_2)}{A_1 + A_2}    (24)

\tilde{S}(\phi; \phi_e) = \delta(\phi - \phi_e).    (25)

The Euclidean distance between \tilde{F} and \tilde{S} is then

D_E(\tilde{F}(\phi) \| \tilde{S}(\phi; \phi_e))    (26)

= \int_{\phi_\mathrm{min}}^{\phi_\mathrm{max}} \left| \tilde{F}(\phi) - \delta(\phi - \phi_e) \right| \mathrm{d}\phi.    (27)

Assume \phi_1 \neq \phi_2 (otherwise, D_E will always be either 0 or 2). If \phi_e = \phi_1, then

D_E(\tilde{F}(\phi) \| \tilde{S}(\phi; \phi_e))    (28)

= \frac{1}{A_1 + A_2} \int_{\phi_\mathrm{min}}^{\phi_\mathrm{max}} A_2 \left| \delta(\phi - \phi_2) - \delta(\phi - \phi_1) \right| \mathrm{d}\phi    (29)

= \frac{2 A_2}{A_1 + A_2}    (30)

and similarly for \phi_e = \phi_2. If \phi_e \neq \phi_1 and \phi_e \neq \phi_2, then

D_E(\tilde{F}(\phi) \| \tilde{S}(\phi; \phi_e)) = \frac{A_1 + A_2}{A_1 + A_2} + 1 = 2.    (31)

The minimised Euclidean distance when \phi_1 \neq \phi_2 is therefore

D_E(F) = \min_{\phi_e \in \mathbb{R}} D_E(F(\phi) \| F_\mathrm{simple}(\phi; \phi_e))    (32)

= \frac{2 \min(A_1, A_2)}{A_1 + A_2}.    (33)

If \phi_1 = \phi_2, then the minimised Euclidean distance is 0.
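The case analysis above can be checked directly. The sketch below is our illustration (the helper names are hypothetical): because the FDF and the screen are both point masses, the integrated absolute difference is a finite sum over the distinct depths, and the minimum over the screen depth lands on the larger component.

```python
def min_l1_distance(A1, A2):
    # Closed form from Equation 33: the minimised integrated absolute
    # difference between the normalised two-component FDF and a single
    # delta function.
    return 2.0 * min(A1, A2) / (A1 + A2)

def l1_distance(amps, phis, phi_e):
    # Discrete evaluation: accumulate signed point masses at each depth,
    # then sum absolute values (the integral over delta functions).
    total = sum(amps)
    masses = {}
    for A, p in zip(amps, phis):
        masses[p] = masses.get(p, 0.0) + A / total
    masses[phi_e] = masses.get(phi_e, 0.0) - 1.0
    return sum(abs(m) for m in masses.values())

A1, A2, phi1, phi2 = 0.6, 0.4, -10.0, 25.0
# Place the screen on either component, or somewhere else entirely:
d_options = [l1_distance([A1, A2], [phi1, phi2], pe) for pe in (phi1, phi2, 0.0)]
assert abs(min(d_options) - min_l1_distance(A1, A2)) < 1e-12
```

The screen placed anywhere off both components always costs 2, so the minimiser is the depth of the larger component.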
C HYPERPARAMETERS FOR LR AND XGB

This section contains tables of the hyperparameters that we used for our classifiers. Table 2 and Table 3 tabulate the hyperparameters for XGB and LR respectively for the ‘ATCA’ dataset. Table 4 and Table 5 tabulate the hyperparameters for XGB and LR respectively for the ‘ASKAP’ dataset.
Table 2
XGB hyperparameters for the ‘ATCA’ dataset.
Parameter          Value
colsample_bytree   0.912
gamma              0.532
learning_rate      0.1
max_depth          7
min_child_weight   2
scale_pos_weight   1
subsample          0.557
n_estimators       135
reg_alpha          0.968
reg_lambda         1.420
Table 3
LR hyperparameters for the ‘ATCA’ dataset.
Parameter   Value
penalty     L1
C           1.668
Table 4
XGB hyperparameters for the ‘ASKAP’ dataset.
Parameter          Value
colsample_bytree   0.865
gamma              0.256
learning_rate      0.1
max_depth          6
min_child_weight   1
scale_pos_weight   1
subsample          0.819
n_estimators       108
reg_alpha          0.049
reg_lambda         0.454
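These tabulated values map directly onto constructor arguments in scikit-learn and xgboost. The sketch below is illustrative, not code from the paper: the solver choice (liblinear, which supports L1 penalties) and the toy fitting data are our assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# LR, 'ATCA' dataset (Table 3): L1 penalty with C = 1.668.
lr = LogisticRegression(penalty="l1", C=1.668, solver="liblinear")

# XGB, 'ATCA' dataset (Table 2), as keyword arguments that could be
# passed to xgboost.XGBClassifier(**xgb_params) if xgboost is installed.
xgb_params = dict(
    colsample_bytree=0.912, gamma=0.532, learning_rate=0.1,
    max_depth=7, min_child_weight=2, scale_pos_weight=1,
    subsample=0.557, n_estimators=135,
    reg_alpha=0.968, reg_lambda=1.420,
)

# Tiny linearly separable toy problem to show the LR object is usable.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
lr.fit(X, y)
assert lr.score(X, y) > 0.9
```

The L1 penalty drives uninformative feature weights to exactly zero, which is what makes the fitted LR coefficients directly interpretable.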
D PREDICTIONS ON REAL DATA
This section contains Figure 7 and Figure 8, which show the predicted probability of being Faraday complex for all real data used in this paper, drawn from Livingston et al. (2021) and O’Sullivan et al. (2017).
Figure 7.
The 142 observed FDFs ordered by LR-estimated probability of being Faraday complex. Livingston-identified components are shown in orange while O’Sullivan-identified components are shown in magenta. Simpler FDFs (as deemed by the classifier) are shown in purple while more complex FDFs are shown in green, and the numbers overlaid indicate the LR estimate. A lower number indicates a lower probability that the corresponding source is complex, i.e. lower numbers correspond to simpler spectra.
Figure 8.
The 142 observed FDFs ordered by XGB-estimated probability of being Faraday complex. Livingston-identified components are shown in orange while O’Sullivan-identified components are shown in magenta. Simpler FDFs (as deemed by the classifier) are shown in purple while more complex FDFs are shown in green, and the numbers overlaid indicate the XGB estimate. A lower number indicates a lower probability that the corresponding source is complex, i.e. lower numbers correspond to simpler spectra.

Table 5
LR hyperparameters for the ‘ASKAP’ dataset.
Parameter   Value
penalty     L2
C           0.464
E SIMULATING OBSERVED FDFS
We simulated FDFs by approximating them as arrays of complex numbers. An FDF F is approximated on the domain [-\phi_\mathrm{max}, \phi_\mathrm{max}] by a vector \mathbf{F} \in \mathbb{C}^d:

F_j = \sum_k A_k \, \delta(-\phi_\mathrm{max} + j \, \delta\phi - \phi_k),    (34)

where \delta\phi = (\phi_\mathrm{max} - \phi_\mathrm{min})/d and d is the number of Faraday depth samples in the FDF. \mathbf{F} is sampled by uniformly sampling its parameters:

\phi_k \in \{\phi_\mathrm{min}, \phi_\mathrm{min} + \delta\phi, \ldots, \phi_\mathrm{max}\}    (35)

A_k \sim U(0, 1).    (36)

We then generate a vector polarisation spectrum \mathbf{P} \in \mathbb{C}^m from \mathbf{F} using Equation 37:

P_\ell = \sum_{j=0}^{d-1} F_j \, e^{2i(\phi_\mathrm{min} + j \, \delta\phi)\lambda_\ell^2} \, \delta\phi.    (37)

\lambda_\ell^2 is the discretised value of \lambda^2 at the \ell th index of \mathbf{P}. This requires a set of \lambda^2 values, which depends on the dataset being simulated. These values can be treated as the channel wavelengths at which the polarisation spectrum was observed. We then add Gaussian noise with variance \sigma^2 to each element of \mathbf{P} to obtain a discretised noisy observation \hat{\mathbf{P}}. Finally, we perform RM synthesis using the Canadian Initiative for Radio Astronomy Data Analysis RM package (https://github.com/CIRADA-Tools/RM), which is a Python module that implements a discrete version of RM synthesis:

\hat{F}_j = \frac{1}{m} \sum_{\ell=1}^{m} \hat{P}_\ell \, e^{-2i(\phi_\mathrm{min} + j \, \delta\phi)\lambda_\ell^2}.    (38)
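The pipeline above can be sketched end-to-end with numpy alone. This is an illustration under simplified assumptions, not the paper's implementation (which uses the CIRADA RM-Tools package): the grid size, \lambda^2 coverage, component parameters and noise level below are our arbitrary choices.

```python
import numpy as np

# Faraday depth grid: phi_j = phi_min + j * dphi (Equation 34).
phi_max, d = 100.0, 201
phi = np.linspace(-phi_max, phi_max, d)
dphi = phi[1] - phi[0]

# Two-component model FDF: amplitudes A_k placed at depths phi_k.
F = np.zeros(d, dtype=complex)
for A_k, phi_k in [(1.0, 20.0), (0.5, -35.0)]:
    F[np.argmin(np.abs(phi - phi_k))] += A_k

# Forward transform to a polarisation spectrum (Equation 37);
# the broadband lambda^2 coverage here is an arbitrary choice.
lam2 = np.linspace(0.001, 0.5, 300)
P = (F[None, :] * np.exp(2j * phi[None, :] * lam2[:, None])).sum(axis=1) * dphi

# Add complex Gaussian channel noise to mimic an observation.
rng = np.random.default_rng(0)
P_hat = P + 0.01 * (rng.normal(size=P.size) + 1j * rng.normal(size=P.size))

# Discrete RM synthesis (Equation 38) recovers the dirty FDF.
F_hat = (P_hat[None, :] * np.exp(-2j * phi[:, None] * lam2[None, :])).sum(axis=1) / lam2.size

# The brightest recovered peak sits at the stronger component.
assert phi[np.argmax(np.abs(F_hat))] == 20.0
```

With a narrow \lambda^2 range the RMSF broadens and sidelobes from one component can shift or mask the other — the regime in which Faraday complexity becomes hard to detect.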
REFERENCES

Agarwal D., Aggarwal K., Burke-Spolaor S., Lorimer D. R., Garver-Daniels N., 2020, MNRAS
Anderson C. S., Gaensler B. M., Feain I. J., Franzen T. M. O., 2015, ApJ, 815, 49
Brentjens M. A., de Bruyn A. G., 2005, A&A, 441, 1217
Brown S., 2011, Assess the Complexity of an RM Synthesis Spectrum. No. 9 in POSSUM REPORT
Brown S., et al., 2018, MNRAS
Farnes J. S., Gaensler B. M., Carretti E., 2014, ApJS, 212, 15
Flamary R., Courty N., 2017, POT Python Optimal Transport library, https://github.com/rflamary/POT