An algorithm for the automatic deglitching of x-ray absorption spectroscopy data
11 An algorithm for the automatic deglitching of x-rayabsorption spectroscopy data
Samuel M. Wallace, a Marco A. Alsina b and Jean-Franc¸ois Gaillard a * a Department of Civil and Environmental Engineering, Northwestern University,2145 Sheridan Road, Evanston, USA, and b Department of Construction Engineeringand Management, University of Talca, Camino Los Niches Km. 1, Curic´o, Chile.E-mail: [email protected]
X-ray absorption spectroscopy ; glitches ; deglitching Abstract
Analysis of x-ray absorption spectroscopy (XAS) data often involves the removal ofartifacts or glitches from the acquired signal, a process commonly known as deglitching .Glitches result either from specific orientations of monochromator crystals or fromscattering by crystallites in the sample itself. Since the precise energy — or wavelength— location and the intensity of glitches in a spectrum cannot always be predicted,deglitching is often performed on a per spectrum basis by the analyst. Some routineshave been proposed, but they are prone to arbitrary selection of spectral artifacts andare often inadequate for processing large data sets.Here we present a statistically robust algorithm, implemented as a Python program,for the automatic detection and removal of glitches that can be applied to a large num-ber of spectra. It uses a Savitzky-Golay filter to smooth spectra and the generalized
PREPRINT:
Journal of Synchrotron Radiation A Journal of the International Union of Crystallography a r X i v : . [ phy s i c s . d a t a - a n ] N ov extreme Studentized deviate test to identify outliers. We achieve robust, repeatable,and selective removal of glitches using this algorithm.
1. Introduction
In x-ray absorption spectroscopy (XAS), glitches correspond to artifacts in a confinedenergy range, usually a few data points, manifesting as a sharp variation in measuredabsorption (Stern & Lu, 1982; Bunker, 2010; Calvin, 2013). Glitches often arise frommultiple-diffraction events within the crystal monochromator used to set the energyof the x-rays, resulting in a significant decrease or rise in the intensity of the beamdelivered to the sample (Bauchspiess & Crozier, 1984). Alternatively, diffraction fromother crystalline phases, for instance in diamond anvil cells, may also create glitches,impacting the measurement of the absorption coefficient ( µ ( E )) (Sapelkin & Bayliss,2002).Analysts can minimize the impact of glitches by using best practices at the time ofdata collection, which includes ensuring uniform sample preparation, confirming thelinearity of equipment at the beamline, and detuning the monochromator to reduce theintensity of the multiple diffraction events; however, these practices do not completelyeliminate glitches (Abe et al. , 2018). Collecting data free from significant artifactsis particularly challenging in some energy ranges, such as at the Fe K-edge usinga Si(111) monochromator (Pickering, 1999). Leaving spurious points in place mayinterfere with the analysis of XAS data; for instance, glitches may introduce erroreither in the normalization of the spectrum or the transformation of the extendedx-ray absorption fine structure (EXAFS) data into R-space (Abe et al. , 2018), andspectra may be improperly clustered. As such, the processing of spectra may requirea deglitching step, where data points corresponding to glitches are removed.Given the erratic nature of glitches, their removal is commonly based on the judge- IUCr macros version 2.1.11: 2020/04/29 ment of the analyst through visual inspection of the data. For example, the programAthena offers a graphical user interface for glitch removal, where one may either con-duct point-and-click identification of spurious data points or set a threshold valuebased on the post-edge normalization curve outside of which data points are removed(Ravel, 2016). Other programs include the capability to remove data points at spec-ified indices (Wellenreuther & Meyer-Klaucke, 2009) or the option to compare µ ( E )with respect to the incoming incident beam intensity I et al. , 2018) and contin-uous scanning ”quick EXAFS” modes (Prestipino et al. , 2011) along with increasingaccessibility to laboratory-source XAS instruments (Anklamm et al. , 2014; Jahrman et al. , 2019) promises large datasets where manual deglitching of spectra becomesimpractical. A rapid and robust method for automated deglitching of XAS spectrawould directly address this shortcoming. Previous methods with greater potential forautomation used the first derivative of the absorption coefficient to identify glitches(Zhuchkov et al. , 2001). While useful, such strategies can result in points adjacent toglitches being erroneously identified, and applicability to the x-ray absorption nearedge structure (XANES) region is limited. Moreover, a manual threshold must still beset by the analyst to identify glitches based on visual inspection of the data.Here, we propose a method for the high-throughput deglitching of XAS data.Outliers are identified in the normalized residuals between original and low-pass-filtered data. A second iteration of this process minimizes the occurrence of type Ierrors. Through this, we achieve accurate and repeatable removal of glitches fromfull-spectrum XAS data. We present deglitching examples on XANES and EXAFS IUCr macros version 2.1.11: 2020/04/29 spectra, as well as a negative control. Finally, we discuss potential limitations of ourmethod as well as strategies for improvement.
2. Methods
The deglitching algorithm was implemented as a computer program written in thePython 3.7 programming language. The packages Scipy (Virtanen et al. , 2020) andNumpy (Oliphant, 2006) were used for all calculations, and Larch (Newville, 2013)was used for XAS data processing in a Jupyter Notebook environment (Kluyver et al. ,2016).The deglitching program was designed to be compatible with the data structureused by Larch for XAS analysis; the only required input for the deglitching programare data channels corresponding to the energy ( E ) and the absorption coefficient( µ ( E )) within a Larch group. Absorption data may or may not be normalized priorto deglitching. Additional parameters may be specified here to tune the deglitchingprogram, though default parameters should work in most circumstances.All spectra presented here were collected at the bending magnet beamline of theDow-Northwestern-Dupont Collaborative Access Team (DND-CAT), Sector 5 at theAdvanced Photon Source at Argonne National Laboratory. Fe K-edge data is pre-sented for drinking water treatment residual samples collected from drinking watertreatment plants in the United States. The other samples include a denture adhesivecream and cobalt (II, III) oxide (Aldrich). Energy was set using a Si(111) doublecrystal monochromator. Intensity values for the incident beam ( I ), the transmittedbeam ( I T ), and the secondary transmitted beam ( I T ) were collected using Oxfordionization chambers with 29.6 cm path lengths. Fluorescence data were captured usingVortex ME4 silicon drift detectors. IUCr macros version 2.1.11: 2020/04/29
3. Algorithm
What follows is a description of the deglitching process, also depicted graphically inFigure 1 and outlined in Figure 2. Data channels for energy and µ ( E ) are providedto the algorithm (Figure 1.1). First, the µ ( E ) channel is fit with a Savitzky-Golayfilter to provide a smoothed representation of the data, µ SG ( E ) (Figure 1.2A). TheSavitzky-Golay filter uses a least-squares fitting procedure to fit each point over arolling window of odd length ( w SG ), effectively acting as a lowpass filter (Savitzky &Golay, 1964) (Figure 1.2B). Savitzky-Golay window length and the polynomial orderof the rolling fit are adjustable parameters set at 9 and 3 respectively by default. Theresults of the filter are then subtracted from the normalized absorption to computethe residuals between the original and filtered data, δµ ( E ) A (Figure 1.3). δµ ( E ) A = µ ( E ) − µ SG ( E ) A (1)Given that the Savitzky-Golay filter acts as a low-pass filter, δµ ( E ) A will reflectthe contribution of high frequencies to µ ( E ). One expects a normal distribution ofresiduals should Gaussian noise be the only source of misfit between the data and a fitaiming to represent the true values of the data (Trutna et al. , 2003). Glitches, beingaberrant high-frequency spectral features, will result in residuals that are outlierscompared to the rest of the residuals. In simple cases, only glitches will be outliers inthese residuals. However, glitches may impact δµ ( E ) A at any point within ( w SG / − | δµ ( E ) A | near the absorption threshold, owing both to thewider range of absorbances contained in fitting windows in this region and, in caseswhere the low-pass filter is overly restrictive, the attenuation of some high-frequency IUCr macros version 2.1.11: 2020/04/29 features. Identifying outliers on untreated residuals may lead to the false positiveidentification of glitches in the XANES region or failure to identify subtle glitches inthe EXAFS region as a result.Residuals must be normalized to identify glitches across the full spectrum. Average-based values, like standard deviation, are strongly impacted by outliers, while medianvalues are comparatively more resistant to outliers. For this reason, a rolling medianis found for | δµ ( E ) A | to acquire a rolling median absolute deviation. Much like theSavitzky-Golay filter, these median values are calculated from a rolling window, w m .The window size for the median calculation is selected such that, should a glitch becontained within w m , more than half the points in w m will not have included the glitchin their Savitzky-Golay filter fitting window w SG . This may be defined as: w m = 2 ∗ ( w SG + L g −
1) + 1 (2)Where w m is the window length for calculating the median absolute deviation, w SG is the Savitzky-Golay filter window length, and L g is the maximum number of pointscorresponding to a single glitch. Normalized residuals r are computed by dividing δµ ( E ) A by the local median absolute deviation of residuals to account for regionalvariability in the fit (Figure 1.4): r i = δµ ( E ) A, i median | δµ ( E ) A, w | , w ∈ (cid:26) i − w m − , · · · , i + w m − (cid:27) (3)Where r i corresponds to the normalized residual value at E i . At the beginning and endof the data, calculations are performed using truncated windows. Provided Savitzky-Golay parameters that do not filter true signal, these normalized residuals will benormally distributed with a standard deviation of approximately 1.48, correspondingto the conversion from median absolute deviation to standard deviation based onthe cumulative distribution function of the standard normal distribution (Filliben &Heckert, 2003) (Figure 1.4/1.5). IUCr macros version 2.1.11: 2020/04/29
From the normalized residuals, outliers are mathematically identified using a gen-eralized extreme Studentized deviate test (generalized ESD) (Figure 1.5). The gener-alized ESD identifies outliers in normally-distributed data when provided a maximumnumber of outliers and a significance value for the outlier identification (Rosner, 1983);these values are set at 10% of the data points contained in the spectrum and 0.025by default respectively. The first step of the generalized ESD is to calculate the meanof the dataset. Next, the data point furthest from the mean value is identified. The test statistic is calculated by finding the distance of this point from the mean valuein terms of number of standard deviations. Using a t distribution at the providedsignificance level, the maximum acceptable distance from the mean is calculated fora dataset of a given length. This provides a critical value which is, again, in unitsof number of standard deviations from the mean. The most distant point is removedand the calculation is repeated until the maximum number of outliers are reached. Ifthe critical value of a given point is greater than the test statistic, that point and allpreviously analyzed points are identified as outliers.Outliers identified in this first stage are taken as candidate points for glitches. Asufficiently large glitch may result in a poor fit from the Savitzky-Golay filter on pointswithin the same filter window as the glitch (Figure 1.6), so outliers in the first pass( candidate points ) may not all be glitches. Therefore, candidate points are removedfrom a copy of the data, and µ ( E ) at these points is interpolated using a cubic spline(Figure 1.6). The resulting µ ( E ) B only differs from µ ( E ) at the candidate points . Asecond set of Savitzky-Golay filtered data, µ SG ( E ) B , is generated based on this copyof the data using the same parameters as before (Figure 1.7A). From here, the processis repeated, finding the residuals ( δµ ( E ) B ) between the original data ( µ ( E )) and thesecond Savitzky-Golay filter ( µ SG ( E ) B ), defined as: δµ ( E ) B = µ ( E ) − µ SG ( E ) B (4) IUCr macros version 2.1.11: 2020/04/29
As before, these residuals are normalized by the regional median absolute deviation,this time calculated on δµ ( E ) B . The window for this second normalization shrinksto (2 ∗ L g ) + 1 points, effectively limiting the maximum number of points in a glitchto L g . This adjustment is made possible by the interpolation of candidate points ,which minimizes the impact of glitches on the filter’s fit for other points within w SG .Outliers are identified in δµ ( E ) B using the generalized ESD. The outliers identified inthis second pass that are within one half the Savitzky-Golay window length, w SG , ofa candidate point are taken to be glitches and removed.
4. Results and Discussion
The deglitching algorithm was tested on a range of elements and data collection modes,as shown in Figure 3. For all data, the deglitching procedure was applied across the fullspectrum and used a Savitzky-Golay window length of 9, a significance level of 0.025,and a maximum glitch length of 4, the default values for the program. In each of thesecircumstances, the algorithm successfully identified and removed glitches across thefull spectrum of data without removing non-glitch points. Given the diverse datasetsincluded in these examples - which include XANES glitches, minor EXAFS glitches,and data for various elements collected in both absorbance and fluorescence mode,the default parameters should be adequate for most applications.Figure 3.1 presents a negative control for analysis: a cobalt (II-III) oxide samplewith no apparent glitches. The deglitching algorithm identified no points as glitches.Figure 3.2(A-B) shows data for a sample where the algorithm removed a subtle EXAFSglitch at a high K value. Finally, Figure 3.3(A-C) shows an example of the algorithmsimultaneously removing glitches in the XANES and EXAFS regions.This algorithm provides repeatable and robust removal of glitches. As such, forglitches that are subtle or in relatively noisy data, parameters may need to be adjusted
IUCr macros version 2.1.11: 2020/04/29 to identify and remove these points. Several strategies may be used in these situations.For EXAFS glitches, the low-frequency, low-intensity features of the region may allowstricter low-pass filters to be implemented. A longer Savitzky-Golay filter window willhave the effect of filtering more high frequencies from the data. For glitches anywherein the spectrum, the significance level may be increased for more aggressive outlieridentification. For both of these changes, it is recommended to limit the deglitchingalgorithm to a region of interest that includes the glitch; otherwise, non-glitch pointsmay be removed.We thank Dr. Qing Ma for his technical assistance while performing XAS experi-ments at the APS. Portions of this work were performed at the DuPont-Northwestern-Dow Collaborative Access Team (DND-CAT) located at Sector 5 of the AdvancedPhoton Source (APS). DND-CAT is supported by Northwestern University, The DowChemical Company, and DuPont de Nemours, Inc. This research used resources ofthe Advanced Photon Source, a U.S. Department of Energy (DOE) Office of ScienceUser Facility operated for the DOE Office of Science by Argonne National Laboratoryunder Contract No. DE-AC02-06CH11357.
References
Abe, H., Aquilanti, G., Boada, R., Bunker, B., Glatzel, P., Nachtegaal, M. & Pascarelli, S.(2018).
Journal of Synchrotron Radiation , , 972–980.Aberdam, D. (1998). Journal of Synchrotron Radiation , , 1287–1297.Anklamm, L., Schlesiger, C., Malzer, W., Grotzsch, D., Neitzel, M. & Kanngiesser, B. (2014). Review of Scientific Instruments , (5).Bak, S.-M., Shadike, Z., Lin, R., Yu, X. & Yang, X.-Q. (2018). NPG Asia Materials , (7),563–580.Bauchspiess, K. R. & Crozier, E. D. (1984). In EXAFS and Near Edge Structure III , pp.514–516. Springer Berlin Heidelberg.Bunker, G. (2010).
Introduction to XAFS: a practical guide to X-ray absorption fine structurespectroscopy . Cambridge University Press.Calvin, S. (2013).
XAFS for Everyone . CRC press.Filliben, J. J. & Heckert, A. (2003).
NIST/SEMATECH e-handbook of statistical methods ,book section Exploratory Data Analysis.Jahrman, E. P., Holden, W. M., Ditter, A. S., Mortensen, D. R., Seidler, G. T., Fister, T. T.,Kozimor, S. A., Piper, L. F. J., Rana, J., Hyatt, N. C. & Stennett, M. C. (2019).
Reviewof Scientific Instruments , (2). IUCr macros version 2.1.11: 2020/04/29 Kluyver, T., Ragan-Kelley, B., P´erez, F., Granger, B., Bussonnier, M., Frederic, J., Kelley, K.,Hamrick, J., Grout, J., Corlay, S., Ivanov, P., Avila, D., Abdalla, S. & Willing, C. (2016).In
Positioning and Power in Academic Publishing: Players, Agents and Agendas , editedby F. Loizides & B. Schmidt, pp. 87 – 90. IOS Press.Newville, M. (2013). In , vol. 430 of
Journal of Physics Conference Series .Oliphant, T. E. (2006).
A guide to NumPy , vol. 1. Trelgol Publishing USA.Pickering, I., (1999). Monochromator crystal glitch library. Online.
URL:
Prestipino, C., Mathon, O., Hino, R., Beteva, A. & Pascarelli, S. (2011).
Journal of synchrotronradiation , (2), 176–182.Ravel, B. (2016). In Athena: XAS Data Processing , section 9.5.Rosner, B. (1983).
Technometrics , (2), 165–172.Sapelkin, A. V. & Bayliss, S. C. (2002). High Pressure Research , (6), 315–329.Savitzky, A. & Golay, M. J. E. (1964). Analytical Chemistry , (8), 1627–1639.Stern, E. A. & Lu, K. (1982). Nuclear Instruments and Methods in Physics Research , (1-2),415–417.Trutna, L., Spagon, P., del Castillo, E., Moore, T., Hartley, S. & Hurwitz, A. (2003). NIST/SEMATECH e-handbook of statistical methods , book section Process Improvement.Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D.,Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M.,Wilson, J., Jarrod Millman, K., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R.,Larson, E., Carey, C., Polat, ˙I., Feng, Y., Moore, E. W., Vand erPlas, J., Laxalde, D.,Perktold, J., Cimrman, R., Henriksen, I., Quintero, E. A., Harris, C. R., Archibald, A. M.,Ribeiro, A. H., Pedregosa, F., van Mulbregt, P. & Contributors to SciPy 1. 0 (2020).
Nature Methods , , 261–272.Wellenreuther, G. & Meyer-Klaucke, W. (2009). In , vol. 190 of Journal of Physics Conference Series .Zhuchkov, K. N., Shuvaeva, V. A., Yagi, K. & Terauchi, H. (2001).
Journal of SynchrotronRadiation , , 302–304. IUCr macros version 2.1.11: 2020/04/29
IUCr macros version 2.1.11: 2020/04/29 k -weighted EXAFS for each spectrum.Row A shows Fe K-edge data collected in fluorescence mode on a drinking watertreatment residual sample. Significant monochromator glitches are present in bothXANES and EXAFS regions (A1). One point corresponding to a monochromatorglitch in the XANES region is removed by the algorithm (A1 inset), and four pointscorresponding to two separate glitches are removed in the EXAFS region (A2). RowB shows Zn K-edge transmission data collected on a denture adhesive cream (B1).Only one point, corresponding to a monochromator glitch, is removed at approxi-mately 13 ˚A in the EXAFS region (B2). Finally, Row C shows transmission datacollected on a cobalt (II-III) oxide sample. Though there is some variation in I0(C1), there are no obvious effects on either the XANES or the EXAFS (C2), andthe algorithm removes no data points. IUCr macros version 2.1.11: 2020/04/29 Synopsis