[PDF] An algorithm for the automatic deglitching of x-ray absorption spectroscopy data

Abstract

Analysis of x-ray absorption spectroscopy (XAS) data often involves the removal of artifacts or glitches from the acquired signal, a process commonly known as deglitching. Glitches result either from specific orientations of monochromator crystals or from scattering by crystallites in the sample itself. Since the precise energy or wavelength location and the intensity of glitches in a spectrum cannot always be predicted, deglitching is often performed on a per spectrum basis by the analyst. Some routines have been proposed, but they are prone to arbitrary selection of spectral artifacts and are often inadequate for processing large data sets. Here we present a statistically robust algorithm, implemented as a Python program, for the automatic detection and removal of glitches that can be applied to a large number of spectra. It uses a Savitzky-Golay filter to smooth spectra and the generalized extreme Studentized deviate test to identify outliers. We achieve robust, repeatable, and selective removal of glitches using this algorithm.

Full PDF

11 An algorithm for the automatic deglitching of x-rayabsorption spectroscopy data

Samuel M. Wallace, a Marco A. Alsina b and Jean-Franc¸ois Gaillard a * a Department of Civil and Environmental Engineering, Northwestern University,2145 Sheridan Road, Evanston, USA, and b Department of Construction Engineeringand Management, University of Talca, Camino Los Niches Km. 1, Curic´o, Chile.E-mail: [email protected]

X-ray absorption spectroscopy ; glitches ; deglitching Abstract

Analysis of x-ray absorption spectroscopy (XAS) data often involves the removal ofartifacts or glitches from the acquired signal, a process commonly known as deglitching .Glitches result either from speciﬁc orientations of monochromator crystals or fromscattering by crystallites in the sample itself. Since the precise energy — or wavelength— location and the intensity of glitches in a spectrum cannot always be predicted,deglitching is often performed on a per spectrum basis by the analyst. Some routineshave been proposed, but they are prone to arbitrary selection of spectral artifacts andare often inadequate for processing large data sets.Here we present a statistically robust algorithm, implemented as a Python program,for the automatic detection and removal of glitches that can be applied to a large num-ber of spectra. It uses a Savitzky-Golay ﬁlter to smooth spectra and the generalized

PREPRINT:

Journal of Synchrotron Radiation A Journal of the International Union of Crystallography a r X i v : . [ phy s i c s . d a t a - a n ] N ov extreme Studentized deviate test to identify outliers. We achieve robust, repeatable,and selective removal of glitches using this algorithm.

1. Introduction

In x-ray absorption spectroscopy (XAS), glitches correspond to artifacts in a conﬁnedenergy range, usually a few data points, manifesting as a sharp variation in measuredabsorption (Stern & Lu, 1982; Bunker, 2010; Calvin, 2013). Glitches often arise frommultiple-diﬀraction events within the crystal monochromator used to set the energyof the x-rays, resulting in a signiﬁcant decrease or rise in the intensity of the beamdelivered to the sample (Bauchspiess & Crozier, 1984). Alternatively, diﬀraction fromother crystalline phases, for instance in diamond anvil cells, may also create glitches,impacting the measurement of the absorption coeﬃcient ( µ ( E )) (Sapelkin & Bayliss,2002).Analysts can minimize the impact of glitches by using best practices at the time ofdata collection, which includes ensuring uniform sample preparation, conﬁrming thelinearity of equipment at the beamline, and detuning the monochromator to reduce theintensity of the multiple diﬀraction events; however, these practices do not completelyeliminate glitches (Abe et al. , 2018). Collecting data free from signiﬁcant artifactsis particularly challenging in some energy ranges, such as at the Fe K-edge usinga Si(111) monochromator (Pickering, 1999). Leaving spurious points in place mayinterfere with the analysis of XAS data; for instance, glitches may introduce erroreither in the normalization of the spectrum or the transformation of the extendedx-ray absorption ﬁne structure (EXAFS) data into R-space (Abe et al. , 2018), andspectra may be improperly clustered. As such, the processing of spectra may requirea deglitching step, where data points corresponding to glitches are removed.Given the erratic nature of glitches, their removal is commonly based on the judge- IUCr macros version 2.1.11: 2020/04/29 ment of the analyst through visual inspection of the data. For example, the programAthena oﬀers a graphical user interface for glitch removal, where one may either con-duct point-and-click identiﬁcation of spurious data points or set a threshold valuebased on the post-edge normalization curve outside of which data points are removed(Ravel, 2016). Other programs include the capability to remove data points at spec-iﬁed indices (Wellenreuther & Meyer-Klaucke, 2009) or the option to compare µ ( E )with respect to the incoming incident beam intensity I et al. , 2018) and contin-uous scanning ”quick EXAFS” modes (Prestipino et al. , 2011) along with increasingaccessibility to laboratory-source XAS instruments (Anklamm et al. , 2014; Jahrman et al. , 2019) promises large datasets where manual deglitching of spectra becomesimpractical. A rapid and robust method for automated deglitching of XAS spectrawould directly address this shortcoming. Previous methods with greater potential forautomation used the ﬁrst derivative of the absorption coeﬃcient to identify glitches(Zhuchkov et al. , 2001). While useful, such strategies can result in points adjacent toglitches being erroneously identiﬁed, and applicability to the x-ray absorption nearedge structure (XANES) region is limited. Moreover, a manual threshold must still beset by the analyst to identify glitches based on visual inspection of the data.Here, we propose a method for the high-throughput deglitching of XAS data.Outliers are identiﬁed in the normalized residuals between original and low-pass-ﬁltered data. A second iteration of this process minimizes the occurrence of type Ierrors. Through this, we achieve accurate and repeatable removal of glitches fromfull-spectrum XAS data. We present deglitching examples on XANES and EXAFS IUCr macros version 2.1.11: 2020/04/29 spectra, as well as a negative control. Finally, we discuss potential limitations of ourmethod as well as strategies for improvement.

2. Methods

The deglitching algorithm was implemented as a computer program written in thePython 3.7 programming language. The packages Scipy (Virtanen et al. , 2020) andNumpy (Oliphant, 2006) were used for all calculations, and Larch (Newville, 2013)was used for XAS data processing in a Jupyter Notebook environment (Kluyver et al. ,2016).The deglitching program was designed to be compatible with the data structureused by Larch for XAS analysis; the only required input for the deglitching programare data channels corresponding to the energy ( E ) and the absorption coeﬃcient( µ ( E )) within a Larch group. Absorption data may or may not be normalized priorto deglitching. Additional parameters may be speciﬁed here to tune the deglitchingprogram, though default parameters should work in most circumstances.All spectra presented here were collected at the bending magnet beamline of theDow-Northwestern-Dupont Collaborative Access Team (DND-CAT), Sector 5 at theAdvanced Photon Source at Argonne National Laboratory. Fe K-edge data is pre-sented for drinking water treatment residual samples collected from drinking watertreatment plants in the United States. The other samples include a denture adhesivecream and cobalt (II, III) oxide (Aldrich). Energy was set using a Si(111) doublecrystal monochromator. Intensity values for the incident beam ( I ), the transmittedbeam ( I T ), and the secondary transmitted beam ( I T ) were collected using Oxfordionization chambers with 29.6 cm path lengths. Fluorescence data were captured usingVortex ME4 silicon drift detectors. IUCr macros version 2.1.11: 2020/04/29

3. Algorithm

What follows is a description of the deglitching process, also depicted graphically inFigure 1 and outlined in Figure 2. Data channels for energy and µ ( E ) are providedto the algorithm (Figure 1.1). First, the µ ( E ) channel is ﬁt with a Savitzky-Golayﬁlter to provide a smoothed representation of the data, µ SG ( E ) (Figure 1.2A). TheSavitzky-Golay ﬁlter uses a least-squares ﬁtting procedure to ﬁt each point over arolling window of odd length ( w SG ), eﬀectively acting as a lowpass ﬁlter (Savitzky &Golay, 1964) (Figure 1.2B). Savitzky-Golay window length and the polynomial orderof the rolling ﬁt are adjustable parameters set at 9 and 3 respectively by default. Theresults of the ﬁlter are then subtracted from the normalized absorption to computethe residuals between the original and ﬁltered data, δµ ( E ) A (Figure 1.3). δµ ( E ) A = µ ( E ) − µ SG ( E ) A (1)Given that the Savitzky-Golay ﬁlter acts as a low-pass ﬁlter, δµ ( E ) A will reﬂectthe contribution of high frequencies to µ ( E ). One expects a normal distribution ofresiduals should Gaussian noise be the only source of misﬁt between the data and a ﬁtaiming to represent the true values of the data (Trutna et al. , 2003). Glitches, beingaberrant high-frequency spectral features, will result in residuals that are outlierscompared to the rest of the residuals. In simple cases, only glitches will be outliers inthese residuals. However, glitches may impact δµ ( E ) A at any point within ( w SG / − | δµ ( E ) A | near the absorption threshold, owing both to thewider range of absorbances contained in ﬁtting windows in this region and, in caseswhere the low-pass ﬁlter is overly restrictive, the attenuation of some high-frequency IUCr macros version 2.1.11: 2020/04/29 features. Identifying outliers on untreated residuals may lead to the false positiveidentiﬁcation of glitches in the XANES region or failure to identify subtle glitches inthe EXAFS region as a result.Residuals must be normalized to identify glitches across the full spectrum. Average-based values, like standard deviation, are strongly impacted by outliers, while medianvalues are comparatively more resistant to outliers. For this reason, a rolling medianis found for | δµ ( E ) A | to acquire a rolling median absolute deviation. Much like theSavitzky-Golay ﬁlter, these median values are calculated from a rolling window, w m .The window size for the median calculation is selected such that, should a glitch becontained within w m , more than half the points in w m will not have included the glitchin their Savitzky-Golay ﬁlter ﬁtting window w SG . This may be deﬁned as: w m = 2 ∗ ( w SG + L g −

1) + 1 (2)Where w m is the window length for calculating the median absolute deviation, w SG is the Savitzky-Golay ﬁlter window length, and L g is the maximum number of pointscorresponding to a single glitch. Normalized residuals r are computed by dividing δµ ( E ) A by the local median absolute deviation of residuals to account for regionalvariability in the ﬁt (Figure 1.4): r i = δµ ( E ) A, i median | δµ ( E ) A, w | , w ∈ (cid:26) i − w m − , · · · , i + w m − (cid:27) (3)Where r i corresponds to the normalized residual value at E i . At the beginning and endof the data, calculations are performed using truncated windows. Provided Savitzky-Golay parameters that do not ﬁlter true signal, these normalized residuals will benormally distributed with a standard deviation of approximately 1.48, correspondingto the conversion from median absolute deviation to standard deviation based onthe cumulative distribution function of the standard normal distribution (Filliben &Heckert, 2003) (Figure 1.4/1.5). IUCr macros version 2.1.11: 2020/04/29

From the normalized residuals, outliers are mathematically identiﬁed using a gen-eralized extreme Studentized deviate test (generalized ESD) (Figure 1.5). The gener-alized ESD identiﬁes outliers in normally-distributed data when provided a maximumnumber of outliers and a signiﬁcance value for the outlier identiﬁcation (Rosner, 1983);these values are set at 10% of the data points contained in the spectrum and 0.025by default respectively. The ﬁrst step of the generalized ESD is to calculate the meanof the dataset. Next, the data point furthest from the mean value is identiﬁed. The test statistic is calculated by ﬁnding the distance of this point from the mean valuein terms of number of standard deviations. Using a t distribution at the providedsigniﬁcance level, the maximum acceptable distance from the mean is calculated fora dataset of a given length. This provides a critical value which is, again, in unitsof number of standard deviations from the mean. The most distant point is removedand the calculation is repeated until the maximum number of outliers are reached. Ifthe critical value of a given point is greater than the test statistic, that point and allpreviously analyzed points are identiﬁed as outliers.Outliers identiﬁed in this ﬁrst stage are taken as candidate points for glitches. Asuﬃciently large glitch may result in a poor ﬁt from the Savitzky-Golay ﬁlter on pointswithin the same ﬁlter window as the glitch (Figure 1.6), so outliers in the ﬁrst pass( candidate points ) may not all be glitches. Therefore, candidate points are removedfrom a copy of the data, and µ ( E ) at these points is interpolated using a cubic spline(Figure 1.6). The resulting µ ( E ) B only diﬀers from µ ( E ) at the candidate points . Asecond set of Savitzky-Golay ﬁltered data, µ SG ( E ) B , is generated based on this copyof the data using the same parameters as before (Figure 1.7A). From here, the processis repeated, ﬁnding the residuals ( δµ ( E ) B ) between the original data ( µ ( E )) and thesecond Savitzky-Golay ﬁlter ( µ SG ( E ) B ), deﬁned as: δµ ( E ) B = µ ( E ) − µ SG ( E ) B (4) IUCr macros version 2.1.11: 2020/04/29

As before, these residuals are normalized by the regional median absolute deviation,this time calculated on δµ ( E ) B . The window for this second normalization shrinksto (2 ∗ L g ) + 1 points, eﬀectively limiting the maximum number of points in a glitchto L g . This adjustment is made possible by the interpolation of candidate points ,which minimizes the impact of glitches on the ﬁlter’s ﬁt for other points within w SG .Outliers are identiﬁed in δµ ( E ) B using the generalized ESD. The outliers identiﬁed inthis second pass that are within one half the Savitzky-Golay window length, w SG , ofa candidate point are taken to be glitches and removed.

4. Results and Discussion

The deglitching algorithm was tested on a range of elements and data collection modes,as shown in Figure 3. For all data, the deglitching procedure was applied across the fullspectrum and used a Savitzky-Golay window length of 9, a signiﬁcance level of 0.025,and a maximum glitch length of 4, the default values for the program. In each of thesecircumstances, the algorithm successfully identiﬁed and removed glitches across thefull spectrum of data without removing non-glitch points. Given the diverse datasetsincluded in these examples - which include XANES glitches, minor EXAFS glitches,and data for various elements collected in both absorbance and ﬂuorescence mode,the default parameters should be adequate for most applications.Figure 3.1 presents a negative control for analysis: a cobalt (II-III) oxide samplewith no apparent glitches. The deglitching algorithm identiﬁed no points as glitches.Figure 3.2(A-B) shows data for a sample where the algorithm removed a subtle EXAFSglitch at a high K value. Finally, Figure 3.3(A-C) shows an example of the algorithmsimultaneously removing glitches in the XANES and EXAFS regions.This algorithm provides repeatable and robust removal of glitches. As such, forglitches that are subtle or in relatively noisy data, parameters may need to be adjusted

IUCr macros version 2.1.11: 2020/04/29 to identify and remove these points. Several strategies may be used in these situations.For EXAFS glitches, the low-frequency, low-intensity features of the region may allowstricter low-pass ﬁlters to be implemented. A longer Savitzky-Golay ﬁlter window willhave the eﬀect of ﬁltering more high frequencies from the data. For glitches anywherein the spectrum, the signiﬁcance level may be increased for more aggressive outlieridentiﬁcation. For both of these changes, it is recommended to limit the deglitchingalgorithm to a region of interest that includes the glitch; otherwise, non-glitch pointsmay be removed.We thank Dr. Qing Ma for his technical assistance while performing XAS experi-ments at the APS. Portions of this work were performed at the DuPont-Northwestern-Dow Collaborative Access Team (DND-CAT) located at Sector 5 of the AdvancedPhoton Source (APS). DND-CAT is supported by Northwestern University, The DowChemical Company, and DuPont de Nemours, Inc. This research used resources ofthe Advanced Photon Source, a U.S. Department of Energy (DOE) Oﬃce of ScienceUser Facility operated for the DOE Oﬃce of Science by Argonne National Laboratoryunder Contract No. DE-AC02-06CH11357.

References

Abe, H., Aquilanti, G., Boada, R., Bunker, B., Glatzel, P., Nachtegaal, M. & Pascarelli, S.(2018).

Journal of Synchrotron Radiation , , 972–980.Aberdam, D. (1998). Journal of Synchrotron Radiation , , 1287–1297.Anklamm, L., Schlesiger, C., Malzer, W., Grotzsch, D., Neitzel, M. & Kanngiesser, B. (2014). Review of Scientiﬁc Instruments , (5).Bak, S.-M., Shadike, Z., Lin, R., Yu, X. & Yang, X.-Q. (2018). NPG Asia Materials , (7),563–580.Bauchspiess, K. R. & Crozier, E. D. (1984). In EXAFS and Near Edge Structure III , pp.514–516. Springer Berlin Heidelberg.Bunker, G. (2010).

Introduction to XAFS: a practical guide to X-ray absorption ﬁne structurespectroscopy . Cambridge University Press.Calvin, S. (2013).

XAFS for Everyone . CRC press.Filliben, J. J. & Heckert, A. (2003).

NIST/SEMATECH e-handbook of statistical methods ,book section Exploratory Data Analysis.Jahrman, E. P., Holden, W. M., Ditter, A. S., Mortensen, D. R., Seidler, G. T., Fister, T. T.,Kozimor, S. A., Piper, L. F. J., Rana, J., Hyatt, N. C. & Stennett, M. C. (2019).

Reviewof Scientiﬁc Instruments , (2). IUCr macros version 2.1.11: 2020/04/29 Kluyver, T., Ragan-Kelley, B., P´erez, F., Granger, B., Bussonnier, M., Frederic, J., Kelley, K.,Hamrick, J., Grout, J., Corlay, S., Ivanov, P., Avila, D., Abdalla, S. & Willing, C. (2016).In

Positioning and Power in Academic Publishing: Players, Agents and Agendas , editedby F. Loizides & B. Schmidt, pp. 87 – 90. IOS Press.Newville, M. (2013). In , vol. 430 of

Journal of Physics Conference Series .Oliphant, T. E. (2006).

A guide to NumPy , vol. 1. Trelgol Publishing USA.Pickering, I., (1999). Monochromator crystal glitch library. Online.

URL:

Prestipino, C., Mathon, O., Hino, R., Beteva, A. & Pascarelli, S. (2011).

Journal of synchrotronradiation , (2), 176–182.Ravel, B. (2016). In Athena: XAS Data Processing , section 9.5.Rosner, B. (1983).

Technometrics , (2), 165–172.Sapelkin, A. V. & Bayliss, S. C. (2002). High Pressure Research , (6), 315–329.Savitzky, A. & Golay, M. J. E. (1964). Analytical Chemistry , (8), 1627–1639.Stern, E. A. & Lu, K. (1982). Nuclear Instruments and Methods in Physics Research , (1-2),415–417.Trutna, L., Spagon, P., del Castillo, E., Moore, T., Hartley, S. & Hurwitz, A. (2003). NIST/SEMATECH e-handbook of statistical methods , book section Process Improvement.Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D.,Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M.,Wilson, J., Jarrod Millman, K., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R.,Larson, E., Carey, C., Polat, ˙I., Feng, Y., Moore, E. W., Vand erPlas, J., Laxalde, D.,Perktold, J., Cimrman, R., Henriksen, I., Quintero, E. A., Harris, C. R., Archibald, A. M.,Ribeiro, A. H., Pedregosa, F., van Mulbregt, P. & Contributors to SciPy 1. 0 (2020).

Nature Methods , , 261–272.Wellenreuther, G. & Meyer-Klaucke, W. (2009). In , vol. 190 of Journal of Physics Conference Series .Zhuchkov, K. N., Shuvaeva, V. A., Yagi, K. & Terauchi, H. (2001).

Journal of SynchrotronRadiation , , 302–304. IUCr macros version 2.1.11: 2020/04/29

IUCr macros version 2.1.11: 2020/04/29 k -weighted EXAFS for each spectrum.Row A shows Fe K-edge data collected in ﬂuorescence mode on a drinking watertreatment residual sample. Signiﬁcant monochromator glitches are present in bothXANES and EXAFS regions (A1). One point corresponding to a monochromatorglitch in the XANES region is removed by the algorithm (A1 inset), and four pointscorresponding to two separate glitches are removed in the EXAFS region (A2). RowB shows Zn K-edge transmission data collected on a denture adhesive cream (B1).Only one point, corresponding to a monochromator glitch, is removed at approxi-mately 13 ˚A in the EXAFS region (B2). Finally, Row C shows transmission datacollected on a cobalt (II-III) oxide sample. Though there is some variation in I0(C1), there are no obvious eﬀects on either the XANES or the EXAFS (C2), andthe algorithm removes no data points. IUCr macros version 2.1.11: 2020/04/29 Synopsis