[PDF] A robust principal component analysis for outlier identification in messy microcalorimeter data

Abstract

A principal component analysis (PCA) of clean microcalorimeter pulse records can be a first step beyond statistically optimal linear filtering of pulses towards a fully non-linear analysis. For PCA to be practical on spectrometers with hundreds of sensors, an automated identification of clean pulses is required. Robust forms of PCA are the subject of active research in machine learning. We examine a version known as coherence pursuit that is simple, fast, and well matched to the automatic identification of outlier records, as needed for microcalorimeter pulse analysis.

Full PDF

JJournal of Low Temperature Physics manuscript No. (will be inserted by the editor)

A robust principal component analysis for outlier identiﬁcationin messy microcalorimeter data

J.W. Fowler · B. K. Alpert · Y.-I. Joe · G. C. O’Neil · D. S. Swetz · J. N. Ullom

Draft of: November 4, 2019

Abstract

A principal component analysis (PCA) of clean microcalorimeter pulse recordscan be a ﬁrst step beyond statistically optimal linear ﬁltering of pulses towards a fully non-linear analysis. For PCA to be practical on spectrometers with hundreds of sensors, an au-tomated identiﬁcation of clean pulses is required. Robust forms of PCA are the subject ofactive research in machine learning. We examine a version known as coherence pursuit thatis simple, fast, and well matched to the automatic identiﬁcation of outlier records, as needed for microcalorimeter pulse analysis.

Keywords

Microcalorimeters, X-ray pulses, Pulse analysis

The analysis of microcalorimeter pulses generally employs statistically optimal linear ﬁl-tering.

This approach starts from the assumption that all pulses have the same shape,so an optimal estimation of their amplitude serves as a monotonically increasing indicationof the photon energy deposited in the absorber. As transition-edge sensors (TES) are some-times operated near to their saturation energy, this assumption can fail badly enough thatso-called “optimal ﬁltering” becomes clearly sub-optimal. Also, nonlinear analysis of TESpulse data has been proposed as a way to improve energy linearity.

There is no clearconsensus among low-temperature microcalorimeter researchers about how we should an-alyze data beyond the linear, single-shape assumption. Published proposals include: linearinterpolation among a set of energy-speciﬁc pulse templates, Taylor expansion of a contin-uous pulse shape model to leading order in energy, and local linearization of a nonlinearmanifold. A logical ﬁrst step into nonlinear analysis is the projection of pulse records into a low-dimensional subspace. Such a projection is fully linear and can be made noise-optimal in thesame sense as traditional optimal ﬁltering. If the subspace is properly selected, projectionwould preserve most of the signal effects—linear or not—yet eliminate much of the noise.

Quantum Sensors Group, National Institute of Standards and Technology, 325 Broadway, Boulder, Colorado80305, USA.Fowler, Joe, and Ullom are also with University of Colorado, Department of Physics, Boulder, Colorado80309, USA.E-mail: [email protected] a r X i v : . [ phy s i c s . d a t a - a n ] N ov J.W. Fowler et al.

A nonlinear analysis could then begin from the projection, which would naturally be ofmuch lower dimension than a complete pulse record: we ﬁnd that the subspace typicallyrequires no more than six dimensions, far fewer than the hundreds or thousands of samplesthat characterize a typical microcalorimeter pulse in its raw form.Another advantage of subspace projection lies in the separation of good pulses fromstatistical outliers. This advantage is relevant even to highly linear microcalorimeters suchas metallic magnetic calorimeters. The residual pulse after projection should be quitesensitive to whether a pulse does or does not conform to the model implied by the sub-space. Failure to conform produces large residuals. The advantages of projection onto alow-dimensional subspace are thus at least threefold: dimensionality reduction; removal ofa large share of the noise; and identiﬁcation of outlier records.A principal component analysis (PCA) of representative, clean pulses is one way to iden-tify a productive subspace.

A normal data stream, however, is a mixture of clean pulsesand unwanted records. Unwanted records include those containing two or more piled-uppulses or the decaying tails of earlier pulses; they may also include pathologies of the read-out system that are difﬁcult to model a priori. While the analysis of certain pulse-summaryquantities can undoubtedly allow one to identify and eliminate these unwanted records, suchanalysis can be cumbersome and require unreasonable amounts of attention from human ex-perts if applied to arrays of hundreds of sensors. An identical problem plagues machinelearning, where there is often great value in a low-dimensional model that can describe most but not all data examples. We have examined several robust variations on PCA, in searchof one that suits the needs of microcalorimeter pulse analysis.

Principal component analysis ﬁnds the best rank- r approximation to a matrix M ∈ R m × n ofdimensions much larger than r . Here, “best” means the smallest sum of square residuals,among other properties. That is, it solves:minimize || M − UU T M || F , with respect to U , subject to U T U = I , (1)where U ∈ R m × r , r < m ≤ n , and || · || F is the Frobenius norm of a matrix. The matrix U is anorthonormal basis for a subspace of dimension r . PCA guarantees that the subspace that U spans best represents the columns of M . The principal components can be found by singular-value decomposition (SVD) of M or by eigenvalue decomposition of MM T . Speciﬁcally, wecan compute the PCA via the SVD as M = USV T , where the squares of the elements of thediagonal matrix S (the squares of the singular values) are the amount of the variance in M accounted for by each principal component, and the columns of U corresponding to thelargest singular values are the leading principal components.For purposes of pulse analysis, let the matrix M ∈ R m × n contain thousands of pulserecords all from the same sensor, with each of the n columns representing one pulse recordof length m . These records of “training data” should span the range of photon energies ofinterest, so that the principal components of M will represent all future clean, single-pulserecords over the broadest possible energy range. PCA analysis generally starts with recen-tering and rescaling each column of M , subtracting its mean and dividing by its standarddeviation to achieve zero mean and unit variance. In this paper, we make a different, relatedadjustment. A global constant (the median value of all pretrigger samples) is subtracted from M , because DC offsets are both meaningless and large in typical microcalorimeter data. robust principal component analysis for outlier identiﬁcation in messy microcalorimeter data 3 Suppose that the matrix M is dominated by signal, as it is with the very high-resolutionpulses of an x-ray microcalorimeter. Then the leading principal components of M will de-ﬁne, among all possible data vectors, a low-dimensional subspace that approximately con-tains all actual pulses. Projection of raw pulse records into this subspace can serve the dualpurposes of compressing the pulse records into a few, most informative numbers apiece andeliminating a large fraction of the noise. A subspace of higher dimension will capture bothmore signal and more noise than will one of lower dimension.If the goal is identiﬁcation of the appropriate subspace for clean pulses, then we actu-ally require not the full PCA or SVD, but only the column space of the matrix, and in fact,only those columns corresponding to the few largest singular values. Algorithms faster thanthe full SVD can be used to determine leading singular vectors. The truncated SVD isan iterative algorithm to compute singular vectors in order, implemented in the PROPACKlibrary. Also fast and very simple to implement are randomized SVD algorithms. Theycan compute an excellent approximation to the column space of a matrix and the leading sin-gular values, and they are well-suited to the problem of ﬁnding a low-dimensional subspacethat contains most of the signal in pulse records.

Unfortunately, the presence of outliers in the training data M can be disastrous. Even a singlecolumn out of thousands that contains, for example, two piled-up pulses can easily affectthe ﬁrst several principal components. This distorts our estimate of the space spanned by allnormal, single pulses with two deleterious effects. First, we are effectively wasting one of thefew numbers intended as a summary of clean pulses, using it to carry unwanted informationinstead. Second, we would like to use the residual after projection into the subspace as atest for outlier pulses. The residual should be large if and only if a record looks nothing likea pulse, whereas piled-up double pulses with timing and relative amplitude similar to anoutlier will have reduced residuals if that outlier is included in the PCA, as Figure 1 shows.What is needed is a variation on PCA that is immune to outliers in the training data.The search for robust methods of PCA is a very active area of research in statistics andmachine learning. This search is sometimes divided into robust PCA , where individualmatrix elements of M are outliers, and robust subspace recovery , where entire columnsmay be outliers. We have considered

L1-norm PCA , in which principal compo-nents minimize not the variance of the data residuals (the L2 norm) but the absolute devia-tion of the residuals (the L1 norm). While this PCA is more tolerant of a few pathologicalpulse records, we ﬁnd it not useful for the current application: it tends to reduce rather thaneliminate the inﬂuence of outliers. The same is true of L2,1-norm PCA, which addressesa certain practical objection to L1-norm PCA.Two other algorithms for robust subspace recovery operate by cleaning the data matrix M through the identiﬁcation and removal of outlier columns, after which a standard PCA canbe performed. The outlier pursuit method is more productive than the L1- and L2,1-normversions of PCA for the data-cleaning problem, but it is relatively slow and depends on theuser setting two separate free parameters that might be difﬁcult to determine automatically.The method best suited to microcalorimeter analysis is known as coherence pursuit . It relies on the idea that good columns in M (here, clean pulses) will tend to lie in nearlythe same direction in R m as many other columns, while outlier columns will tend to liefar from all or most other columns, even in the extreme case that outliers outnumber the J.W. Fowler et al. A Clean pulse examples B Outlier examples C PC 1PC 2PC 3PC 4 D Fig. 1

Example pulses.

Top left: several clean pulses of various energies.

Top right: two examples of pileup,which we wish to ﬂag as outliers.

Bottom left: four principal components determined by SVD of a set of1500 clean pulses. Crudely, the ﬁrst represents a pulse; the second, the difference between pulses of high andlower energy; and the third, the difference between pulses that arrive early or late with respect to the samplingclock. Further singular components lack simple explanations.

Bottom right: ﬁve principal components froma “contaminated” set: the same 1500 clean pulses plus the single outlier marked in the top right examples.Erroneous inclusion of this single outlier increases the dimension of the needed subspace by one. Also, itundermines the rejection of future outliers based on root-mean-square (rms) difference between pulse recordsand their projection into the low-dimensional subspace. For example, the typical rms residual of normalizedclean pulses is 0.006 after projection into either basis. The outliers 1 and 2, however, have rms residual of0.66 after projection into the clean basis but only 0.29 and 0.001 after projection into the contaminated basis.(Color ﬁgure online.) clean pulses. (cid:63) The underlying idea that random high-dimensional points are nearly alwaysorthogonal to one another can be made precise. The mutual coherence of two pulses isdeﬁned as the absolute value of the cosine of the angle between the column vectors. Whenwe compute the mutual coherence for all pairs of pulse records, outliers tend to have amuch lower sum of mutual coherence with all other records than non-outliers do. Coherencepursuit has a major advantage over most other robust PCA methods in that it is non-iterative;most competing methods require numerous iterations with expensive steps in each, such asan SVD.Speciﬁcally, the coherence pursuit algorithm applied to pulse records is: (cid:63)

Outlier pursuit thus relies on the

Anna Karenina principle: clean pulse records are all alike; every uncleanrecord is unclean in its own way. robust principal component analysis for outlier identiﬁcation in messy microcalorimeter data 5

1. Subtract the median pretrigger value d from all samples: M ≡ M − d . Thus further stepsare insensitive to a meaningless and arbitrarily large overall offset, but the variations inbaseline level from one record to another are preserved.2. Compute the L2 norm r of columns of M , the pulse rms amplitude : r j = [ ∑ i ( M ) i j ] / .3. Normalize the columns of M by creating the diagonal matrix R from the pulse rmsvalues and deﬁning X ≡ M R − .4. Compute the pairwise mutual coherence matrix G = X T X .5. Compute the coherence vector g as the L1-norm of each column of G (omitting the g ii = g j = ∑ ni = , i (cid:54) = j | G i j | , as there are n columns (pulses) in M .6. Deﬁne the mean coherence as c ≡ g / ( n − ) . Given this scaling, values c = j with large values of c j are clean pulses; low values of c j indicate that record j is an outlier. But where shall we draw the line? For nonlinear detectors, the answer to thisquestion turns out to depend on photon energy. We return to this matter in the next section.Certain choices in the algorithm deserve further study on a wide range of data sets. In step5, for example, we ﬁnd no clear preference for the L1-norm over the L2-norm. Similarly,the median coherence appears to be an acceptable alternative to the mean in step 6.A related, non-iterative method for outlier removal also starts from the mutual coher-ence matrix G and then rejects unstructured and structured (clustered) outliers in successivesteps. It requires no free parameters. This attractive feature appears to work well only when outliers are uniformly random in R m , however. Outlier pulse records are often similar to clean pulses for much of their length, on the other hand, so they violate the core assumptionthat would allow the rejection criterion to be computed from ﬁrst principles.If the number of pulses to be analyzed, n , is large and it is difﬁcult to compute the full G ∈ R n × n matrix, then we can subdivide the data into smaller batches and perform the outlierrejection step on each batch separately. After outliers have been identiﬁed and removed from M , we can complete the robust PCA by computing the SVD of the cleaned matrix. We demonstrate coherence pursuit on approximately 6000 pulse records from one sensor ina TES array in Figure 2. The x-rays in the range 4 keV to 10 keV are the K-shell ﬂuorescencelines of various transition metals and L-shell lines of certain lanthanide metals, similar to thedata previously used for metrology studies. Records are selected arbitrarily but from allportions of a measurement lasting ﬁfteen hours. Speciﬁc pulses are shown, including threeclean pulses and eight outliers, some of which differ only subtly from clean records.The lower left panel of Figure 2 shows that good pulses have a coherence that dependson pulse amplitude. Clean pulses at the energy extremes, either high or low, show slightlylower coherence than clean pulses at the middle of the energy range. TES nonlinearity causesthis effect: because pulse shapes change slightly with energy, pulses at the lowest or highestenergies are less coherent with the ensemble of good pulses overall. While the effect is small,a PCA that fully accounts for pulse shapes at the energy extremes requires a threshold onthe coherence c that depends on pulse size or energy.The entire robust PCA procedure can be no more automatic than our ability to selectthis coherence threshold curve. The following is a ﬁrst attempt at such selection. We startfrom the notion that outliers are less coherent than good pulses, but no pulse can be more coherent. Therefore, we can approximate the shape of the c vs size curve by studying itsmaximum value vs size. A simple approach is to build a linear interpolation between some J.W. Fowler et al.

RMS pulse size0.40.60.81.0 C o h e r e n c e DEFG HRMS pulse size0.970.980.99 C o h e r e n c e ABC JKL

ABCDEFGHJKL

Fig. 2

Left:

Two views (different y ranges) of the mean absolute coherence c as a function of rms pulse size,which in the absence of pileup serves as a rough estimator of photon energy (approximately 0 to 10 keV isshown across the full x range). The coherence of clean pulses is slightly less at the energy extremes than inthe center of the spectrum. Right:

Eleven speciﬁc pulses, whose coherence and size are indicated by circleson the left-hand panels. A, B, C are clean pulses of large, medium, and small size. The others are identiﬁedas outliers by the coherence metric. D–H are clearly outliers, with C < .

95. Even J, K, and L contain smallamounts of pileup, though C > .

98 for these examples. (Color ﬁgure online.) ten to twenty anchoring points. Starting at the point with the highest value of c , we requirethat the slope of the interpolation is a monotone decreasing function of rms pulse size. Inthis data set, any pulse with coherence c differing from this model of the upper limit by atleast 5 × − can be considered an outlier. A key question about this method is whetherthis threshold is generally valid, and if not, whether it can be selected automatically. Ourinvestigations so far suggest that one threshold should sufﬁce for many measurements, solong as the experimental conditions (including spectra) are similar. We have so far considered the standard version of PCA, which is statistically optimal onlywhen signals contain exactly white noise. With microcalorimeter data, we generally havehigher noise at low frequencies, below the inverse of the thermal and electrical time con-stants. If we can estimate R , the noise covariance matrix in some way (perhaps by record-ing pulse-free data), then any matrix W satisfying WRW T = I is a linear noise-whiteningtransformation. That is, if raw data d has covariance E ( dd T ) − E ( d ) E ( d T ) = R , then whiteneddata Wd will have a covariance of I , or white noise (of unit variance). One possible con-struction of W is the inverse of the lower Cholesky factor of R . In cases where the whitenoise assumption of robust PCA is inadequate, then it can performed on a noise-whitened robust principal component analysis for outlier identiﬁcation in messy microcalorimeter data 7 version of the training data, on WM instead of on M . Coherence pursuit can used to removeoutlier columns of either M or WM before performing this noise-weighted PCA .We do not have a clear sense of when or whether it is important to perform a noise-weighted PCA in microcalorimeter data analysis. With enough clean pulses in the trainingdata set, one might always ﬁnd principal components and noise-weighted ones spanning sonearly the same space as to be indistinguishable. We intend to compare the two versions ina range of future analyses, hoping to determine the value added by the noise-whitening step.Whether PCA is noise-weighted or not, it is without a doubt important to perform pro-jections into the low-dimensional subspace with noise weighting. This step makes subspaceprojection optimal in the same sense as traditional optimal ﬁltering is. Let U be the approx-imate column space found by robust PCA, where U has orthonormal columns ( U T U = I ).Then the optimal projection d u of a pulse record d into the subspace spanned by U is: d u = U ( U T R − U ) − U T R − d (2)This is a projection in the sense that it is linear in d and that d u = UU T d u . (3)It is optimal with respect to noise in the sense that it is the projection that maximizes thelikelihood of the observed data vector d under the multivariate Gaussian noise model R . We distinguish the projected pulse record d u from the vector of principal component amplitudes p , which one would use to estimate ﬁltered pulse heights. The ﬁrst is an approximation tothe measured data; the second is a summary of it. They are related by p = U T d u or d u = Up .A geometric view of this optimality is that projection Eq. 2 minimizes the Mahalanobisdistance between d and its projection, || d − d u || M , instead of the usual Euclidean distance || d − d u || . The Mahalanobis, or signal-to-noise, norm is deﬁned as || v || M ≡ ( v T R − v ) / . Analysis of data from large arrays of microcalorimeters will require ever-increasing levelsof automation in our pulse processing steps. If we are to entertain nonlinear analysis ofpulses, we are likely to start with the data compression step of noise-optimal projection ofpulse records into small linear subspaces revealed through PCA. We have described a robust,outlier-resistant form of PCA that can be applied to microcalorimeter pulses with minimalexpert human intervention. The method is based on coherence pursuit to reject outliers fromtraining data; randomized PCA of the cleaned data to ﬁnd an appropriate subspace; andnoise-optimal projection of pulse records into this subspace.The problem we have addressed here is only one of many hard and unsolved problems inhigh-resolution pulse analysis. Other examples include learning detector parameters and thegoverning electro-thermal differential equations from data, modeling detector noise fromnoisy data, reducing the effects of crosstalk, and zero-prior-knowledge alignment of uncali-brated spectra. One feature that many of these hard problems share is that they are unlikelyto have simple, analytic, and clean solutions. Another is that solutions must be automatedfor practical use with large calorimeter arrays. In both senses, they have much in commonwith the problems faced in machine learning. We believe that the enormous attention nowbeing turned upon machine learning problems, and the vast number of creative ideas beinggenerated will continue to beneﬁt our work on microcalorimeter data in the future.

J.W. Fowler et al.

Acknowledgements

This work was supported by NIST’s Innovations in Measurement Science program andby NASA SAT NNG16PT18I, “Enabling & enhancing technologies for a demonstration model of the AthenaX-IFU.” We thank Dan Becker and Malcolm Durkin for numerous discussions and earlier work on pulseoutlier identiﬁcation.

References

1. S. H. Moseley, J. C. Mather, D. J. McCammon,

J. Appl. Phys. , , 1257 (1984).doi:10.1063/1.3341292. A. E. Szymkowiak, R. L. Kelley, S. H. Moseley, C. K. Stahle, J. Low Temp. Phys. ,

281 (1993). doi:10.1007/BF006934333. B. K. Alpert, R. D. Horansky, D. A. Bennett, et al.

Rev. Sci. Instr. , , 6107 (2013).doi:10.1063/1.48068024. S. R. Bandler, E. Figueroa-Feliciano, N. Iyomoto, et al. Nucl. Instr. Meth. Phys. Re-search A , , 817–819 (2006). doi:10.1016/j.nima.2005.12.1495. P. Peille, M. T. Ceballos, B. Cobo, et al. SPIE Astronom. Telescopes + Instr. , ,99055W–1 (2016). doi:10.1117/12.22320116. C. Pappas, J. W. Fowler, D. A. Bennet, et al., J. Low Temp. Phys , J. Low Temp. Phys. , , 249 (2018). doi:10.1007/s10909-018-1999-87. B. Shank, J. J. Yen, B. Cabrera, et al. AIP Advances , , 117106 (2014). doi:10.1063/1.49012918. J. W. Fowler, , B. K. Alpert, W. B. Doriese, et al. IEEE Trans. Applied Superconductiv-ity , J. Low Temp. Phys. , , 16–26 (2014). doi:10.1007/s10909-014-1149-x10. J. W. Fowler, C. G. Pappas, B. K. Alpert, et al., J. Low Temp. Phys. , , 539 (2018).doi:10.1007/s10909-018-1892-511. S. Kempf, A. Fleischmann, L. Gastaldo, C. Enss J. Low Temp. Phys. , , 365 (2018).doi:10.1007/s10909-018-1891-612. S. E. Busch, J. S. Adams, S. R. Bandler, et al. J. Low Temp. Phys. , , 382–388 (2016).doi:10.1007/s10909-015-1357-z13. D. Yan, T. Cecil, L. Gades, et al. J. Low Temp. Phys. , , 397–404 (2016).doi:10.1007/s10909-016-1480-514. E. J. Cand`es, X. Li, Y. Ma, J. Wright, J. ACM , , 11 (2011).doi:10.1145/1970392.197039515. G. H. Golub, C. F. Van Loan Matrix Computations, 4th Ed. , Johns Hopkins (2013).16. R. M. Larsen,“Lanczos Bidiagonalization With Partial Reorthogonalization,” DAIMIReport Series,

537 (1998). doi:10.7146/dpb.v27i537.707017. R. M. Larsen, http://sun.stanford.edu/~ rmunk/PROPACK/

18. N. Halko, P. G. Martinsson, J. A. Tropp,

SIAM Review ,

217 (2011).doi:10.1137/09077180619. T. Bouwmans, E. H. Zahzah,

Computer Vision and Image Understanding , 22 (2014)doi:10.1016/j.cviu.2013.11.00920. N. Vaswani, P. Narayanamurthy,

Proc. IEEE , 1359 (2018).doi:10.1109/JPROC.2018.284412621. G. Lerman, T. Maunu,

Proc. IEEE , 1380 (2018).doi:10.1109/JPROC.2018.285314122. N. Kwak,

IEEE Trans. Pattern Analysis Machine Intelligence , robust principal component analysis for outlier identiﬁcation in messy microcalorimeter data 9

23. F. Nie, H. Huang, C. Ding, D. Luo, H. Wang

Proc. 22nd International Joint Conf. onArtiﬁcial Intelligence (2011). Available at https://bit.ly/2EVPGWu

24. Y. W. Park, D. Klabjan

IEEE 16th Int. Conf. on Data Mining (2016).doi:10.1109/ICDM.2016.005425. P. P. Markopoulos, S. Kundu, S. Chamadia, D. A. Pados,

IEEE Trans. Signal Processing , IEEE Trans. Information Theory , IEEE Trans. Signal Processing , J. Machine Learning Research , IEEE Trans. Signal Processing , Metrologia , , 494 (2017).doi:10.1088/1681-7575/aa722f32. K. D. Irwin and G.C. Hilton, Cryogenic Particle Detection , pp. 63–150, Springer, Hei-delberg (2005). doi:10.1007/b1216933. J. W. Fowler, B. K. Alpert, W. B. Doriese, et al.,

Astrophys. J. Suppl. , , 35 (2015). doi:10.1088/0067-0049/219/2/3534. P. C. Mahalanobis, Proc. Nat. Inst. Sciences India ,2