Applications of Machine Learning Algorithms In Processing Terahertz Spectroscopic Data

Abstract

We present the data reduction software and the distribution of Level 1 and Level 2 products of the Stratospheric Terahertz Observatory 2 (STO2). STO2, a balloon-borne terahertz telescope, surveyed star-forming regions and the Galactic plane and produced approximately 300,000 spectra. The data are largely similar to spectra typically produced by single-dish radio telescopes. However, a fraction of the data contained rapidly varying fringe/baseline features and drift noise, which could not be adequately corrected using conventional data reduction software. To process the entire science data of the STO2 mission, we have adopted a new method to find proper off-source spectra to reduce large-amplitude fringes, together with new algorithms including Asymmetric Least Squares (ALS), Independent Component Analysis (ICA), and Density-Based Spatial Clustering of Applications with Noise (DBSCAN). The STO2 data reduction software efficiently reduced the amplitude of fringes from a few hundred K to 10 K and resulted in baselines of amplitude down to a few K. The Level 1 products typically have noise of a few K in [CII] spectra and ~1 K in [NII] spectra. We made spectral maps of star-forming regions and the Galactic plane survey using a regridding algorithm that employs a Bessel-Gaussian kernel. Level 1 and 2 products are available to the astronomical community through the STO2 data server and the DataVerse. The software is also accessible to the public through GitHub. The detailed addresses are given in Section 4 of the paper on data distribution.


September 3, 2020 0:30 main

Journal of Astronomical Instrumentation

© World Scientific Publishing Company

Applications of Machine Learning Algorithms In Processing Terahertz Spectroscopic Data

Young Min Seo, Paul F. Goldsmith, Volker Tolls, Russell Shipman, Craig Kulesa, William Peters, Christopher Walker, Gary Melnick

Jet Propulsion Laboratory, California Institute of Technology, 4800 Oak Grove Drive, Pasadena, CA, 91109, USA
Harvard-Smithsonian Center for Astrophysics, 60 Garden Street, Cambridge, MA 02138, USA
SRON Netherlands Institute for Space Research, Landleven 12, 9747 AD Groningen, The Netherlands
Department of Astronomy & Steward Observatory, University of Arizona, 933 N. Cherry Ave., Tucson, AZ 85721, USA

Received (to be inserted by publisher); Revised (to be inserted by publisher); Accepted (to be inserted by publisher)

Keywords: Terahertz; Balloon-borne telescope; Machine-learning

1. INTRODUCTION

Spectroscopic observations at far-infrared (FIR) and submillimeter wavelengths have been critical to astronomy; for example, probing kinematics, tracing chemistry, and characterizing physical conditions in different phases of the interstellar medium. With the development of highly sensitive receivers and spectrometers at far-infrared and submillimeter wavelengths during the last decades, the amount of data has exploded and has been bringing a wealth of new knowledge about our universe. On the other hand, processing and publishing these extensive data sets has become a significant challenge. Robust, automated pipeline software that can handle a large amount of data and a wide range of spectral features is a critical element for a successful future astronomical facility or mission.

The Stratospheric Terahertz Observatory 2 (STO2) is one of the missions that has produced large data sets that challenge data processing, as a result of the volume and characteristics of the data. STO2 is a balloon-borne survey telescope, which observed the Galactic plane and several star-forming regions and produced over 300,000 spectra. While a large fraction of the data had no significant problems, a fraction of the data included rapidly varying features and noise (e.g., fringes with the pattern changing significantly every few tens of seconds, with amplitude > 50 K). The cause of those features is still under investigation, but the most probable causes are electrical standing waves and LO power fluctuations. We found that the fast-varying features could not be corrected efficiently using conventional data reduction algorithms (Kutner & Ulich, 1981; Shipman et al., 2017). Notably, algorithms that require prior information about source velocity to process the data are not suitable for the Galactic plane survey. Thus, processing STO2 data required adopting more flexible and robust algorithms, including automatic identification of line emission during the reduction.

Fig. 1. On-the-fly (OTF) observation sequence of STO2. Every OTF observation of STO2 starts with a reference scan, which consists of taking hot (red disk) and off-source (cross) spectra, and then moving to the OTF scan. A single OTF scan leg consists of taking on-source (OTF spectra, white crosses in blue rectangle) and hot (red disks) spectra.

To facilitate the data processing, considering the fraction of the data with low quality, we have employed several machine learning algorithms in the STO2 data processing pipeline. The machine learning algorithms have shown promising results in removing unwanted features in the STO2 data, including the high-amplitude fringes and mid-frequency drift noise. This paper elaborates on the algorithms in the STO2 data pipeline and their advantages and limitations.

In §2, we describe the STO2 observations and the characteristics of the STO2 data. In §3, we elaborate on the algorithms for the STO2 data and the results. We briefly describe the Level 1 and 2 data products and their distribution in §4. Finally, we discuss and summarize the limitations of the algorithms in §5.

2. STO2 Observation and Characteristics of STO2 Data

The STO2 data are velocity-resolved spectral observations of the [CII] and [NII] lines recorded in 1024 channels with 1 MHz resolution, which is equivalent to a velocity resolution of 0.16 km/s at 1.9 THz. The emission lines of [CII] and [NII] may have intrinsic widths of a few to 20 km/s, with complex line shapes, in the STO2 survey. Each output spectrum is saved as a separate file in the Single-Dish-FITS format (Garwood, 2000), which is a binary table. We incorporate the scan number and the OTF dump number in the file names. We also include a summary of the observation information, for example, telescope telemetry data, observation position, integration time, and target names in the FITS headers.

Scientific observations using STO2 were carried out in the On-The-Fly (OTF) mode. The typical OTF sequence of STO2 is described in Figure 1. An OTF observation starts with a reference scan, which takes hot-load and off-source spectra at a reference position prior to starting the OTF scan and is used to estimate the gain response at every channel. Hot-load spectra are taken by observing an internal hot load on the gondola, which is at the ambient temperature within the payload. Reference positions are carefully selected not to have any significant [CII] and [NII] emission based on previous observations using ISO, Herschel, Spitzer, and ground-based radio telescopes. We also observed the hot load in the middle of the OTF scans if a single OTF scan lasted longer than 30 seconds to track noise and gain variations per channel over time. The longest spectroscopic Allan variance minimum time for STO2 is close to 30 seconds. A single raster observation is typically in the range of 30 to 220 seconds, and consists of single or multiple sets of hot-load scans and a single OTF scan leg (typical single OTF scan leg duration < 30 seconds). In terms of number of files, it is equivalent to 45 – 270 OTF file outputs. Typical integration times per OTF spectrum and hot-load spectrum are 0.65 and 11 seconds, respectively.

Fig. 2. […] vs. spectrum processed using the interpolated off-source spectra.

Using a conventional data reduction method (e.g., Kutner & Ulich, 1981), we first assessed the characteristics of the STO2 data and found two challenging features in a fraction of STO2 data. These are the following:

1. The hot-load spectra in the reference scans are significantly different from the hot-load spectra taken during OTF scans (Figure 2), which might be due to a temperature gradient or thermal instability of the telescope. Having different hot-load spectra at different observation positions suggests that the gain of the receiver system varies with time and as a function of the observation pointing angle. It also indicates that we cannot directly use the spectra from the reference scans to calibrate the OTF observations. Due to this problem, significantly large fringes (amplitude of 20 – 250 K) appear in the processed spectra if we use the hot-load and off-source spectra in the reference scans to calibrate OTF spectra.

2. Baselines of the spectra fluctuate over a short period (< 30 seconds) for a significant fraction of data, which is likely due to electrical standing waves and LO power fluctuation. The baseline varies appreciably within one OTF scan leg. There can also be a sudden increase of the fringe amplitude followed by decay over time.

We could not achieve the desired science quality spectra using the conventional data-reduction method due to the above two features. Thus, we have developed an optimized data-processing algorithm for the STO2 data-reduction pipeline to obtain the best quality spectra. The flow of the STO2 data-reduction pipeline is shown in Figure 3. The pipeline performs the conversion of the raw data into antenna temperature, de-fringing, baseline correction, and regridding. In the following subsections, we describe details of the STO2 data-reduction pipeline with elaborations of the methods to suppress the two features discussed above.

3. STO2 Data Reduction Software

3.1. Conversion to the Antenna Temperature

The first step of the STO2 data reduction is to convert the raw data in units of counts/s to antenna temperature. A conventional way for the conversion is to use the following equation for each spectral channel:

$$T_A = T_{\rm sys}\,\frac{{\rm OTF} - {\rm Off}_{\rm Ref}}{{\rm Off}_{\rm Ref}}, \tag{1}$$

where $T_A$, $T_{\rm sys}$, OTF, and ${\rm Off}_{\rm Ref}$ denote antenna temperature, system temperature, spectrum from OTF scanning, and off-source spectrum at a reference position (reference scan), respectively. $T_{\rm sys}$ is estimated at the reference position. The equation can be applied as long as the gain of the spectrometer channels does not vary between the reference position and the on-source position. Unfortunately, STO2 data exhibited gains that vary significantly with the telescope pointing position between a reference observation and the OTF leg, and the baselines of off-source spectra are vastly different from those of the OTF spectra. The results using Equation (1) were not adequate for scientific use, as shown in the left panel of Figure 2.

To rectify the problem, we must find appropriate off-source spectra to correct the OTF scans. Using OTF spectra at the edge of a map where no emission is detected as an off-source spectrum is a frequently used method when there are no reference spectra. However, for the STO2 mission, the target lines are [CII] and [NII], which typically extend across large areas and often stretch beyond the areas covered. Also, in a Galactic plane survey, the location and spatial extent of the emission are rarely known. Using spectra at the edges of maps is therefore not suitable for STO2 data reduction.

Instead, we focus our effort on utilizing the hot-load spectra in reference scans and those obtained during OTF scans to predict off-source spectra for the OTF scans. With a detailed analysis of the hot-load and off-source spectra in reference and OTF scans, we found that the ratio of the hot-load spectrum to the off-source spectrum, ${\rm Hot}_{\rm Ref}/{\rm Off}_{\rm Ref}$, does not vary with the telescope pointing position and evolves relatively slowly over time. We also found that the hot-load spectra taken during OTF scans, ${\rm Hot}_{\rm OTF}$, vary slowly over time and show the same gain as that of the OTF spectra. Using these two results, we obtained off-source spectra for the OTF scans and made a modified version of Equation (1) for the STO2 data reduction as

$$T_A = T_{\rm sys}(t_{\rm OTF})\,\frac{{\rm OTF} - {\rm Off}_{\rm Interp}(t_{\rm OTF})}{{\rm Off}_{\rm Interp}(t_{\rm OTF})}, \tag{2}$$

where $t_{\rm OTF}$ denotes the output time when an OTF spectrum is dumped and recorded to storage. ${\rm Off}_{\rm Interp}(t_{\rm OTF})$ is an off-source spectrum for the OTF spectrum at $t_{\rm OTF}$, which is evaluated as

$${\rm Off}_{\rm Interp}(t_{\rm OTF}) = \frac{{\rm Hot}_{\rm OTF}(t_{\rm OTF})}{{\rm Hot}_{\rm Ref}(t_{\rm OTF})/{\rm Off}_{\rm Ref}(t_{\rm OTF})}, \tag{3}$$

where ${\rm Hot}(t_{\rm OTF})$ denotes a hot-load spectrum linearly interpolated to $t_{\rm OTF}$. The subscripts OTF and Ref on Hot denote the hot-load spectrum taken during OTF scans and during reference scans, respectively. ${\rm Off}_{\rm Ref}(t_{\rm OTF})$ is obtained by linearly interpolating the off-source spectra at a reference position to $t_{\rm OTF}$. $T_{\rm sys}(t_{\rm OTF})$ is the system temperature and is defined as

$$T_{\rm sys}(t_{\rm OTF}) = \frac{T_{\rm Hot} - y(t_{\rm OTF})\,T_{\rm Off}}{y(t_{\rm OTF}) - 1}, \tag{4}$$

where $y(t_{\rm OTF}) \equiv {\rm Hot}_{\rm Ref}(t_{\rm OTF})/{\rm Off}_{\rm Ref}(t_{\rm OTF})$, $T_{\rm Hot}$ is the hot-load temperature (ambient temperature of the payload), which is measured during the reference scan and recorded in the FITS header, and $T_{\rm Off}$ is the noise temperature of an off-source (empty) sky, which is roughly 45 K at the altitude of STO2. This method resulted in a significant improvement, as shown in the right panel of Figure 2. The typical single-sideband $T_{\rm sys}$ ranges from 3000 K to 3800 K at 1.9 THz.

There are some OTF observations that do not have any hot-load observations, which makes it inappropriate to use Equation (3). For the observations that do not have a hot-load spectrum within an OTF scan, we use a library of the ratios of hot-load spectra, which is a collection of all observed ${\rm Hot}_{\rm OTF}/{\rm Hot}_{\rm Ref}$ at every observation time-step of ${\rm Hot}_{\rm OTF}$. We interpolated ${\rm Hot}_{\rm Ref}$ to obtain its value at the time-step of ${\rm Hot}_{\rm OTF}$. Using the library, we create a series of the predicted off-source spectra at time $t$ as

$${\rm Off}_{\rm lib}(t; i) \equiv {\rm Off}_{\rm Ref}(t)\,\left(\frac{{\rm Hot}_{\rm OTF}}{{\rm Hot}_{\rm Ref}}\right)_i, \tag{5}$$

where the subscript $i$ denotes the $i$-th element of the library and ${\rm Off}_{\rm lib}(t; i)$ is the off-source spectrum evaluated at time $t$ using the $i$-th element of the library. With the series of predicted off-source spectra, we reduce the first and the last OTF spectra within a single OTF scan and choose the off-source spectrum that results in the flattest baseline. We found that the method of using the library gives better results than using the conventional method (e.g., Kutner & Ulich, 1981), but not as good as those obtained using the interpolation method of Equation (2). This method is similar to the data reduction method of the Herschel HIFI instrument, which extracts the families of baselines from off-position spectra based on the spectral features and selects the optimal baseline for spectra using a Bayesian approach. The same analysis of the STO2 data reduction revealed that there are more than 1,000 families of baselines in the STO2 spectra, while the spectra of the HIFI instrument have a much smaller number of families (< […]). We found that the method was not efficient for STO2 data due to the high diversity of features in the STO2 data.

3.2. Spectrum Quality Assessment & De-fringing
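As an illustration of the calibration in Section 3.1, Equations (2) to (4) can be sketched with NumPy as follows. This is a minimal sketch and not the STO2 pipeline code: the function names, the argument layout (stacks of spectra with shape `(n_time, n_chan)` and increasing time stamps), and the default `t_off=45.0` K (taken from the text) are assumptions made for the example.

```python
import numpy as np


def interp_spectra(t, times, spectra):
    """Linearly interpolate a stack of spectra (n_time, n_chan) to time t.

    `times` must be increasing, as required by np.interp.
    """
    times = np.asarray(times, dtype=float)
    spectra = np.asarray(spectra, dtype=float)
    return np.array([np.interp(t, times, spectra[:, k])
                     for k in range(spectra.shape[1])])


def calibrate_otf(otf, t_otf, hot_otf, t_hot_otf, hot_ref, off_ref, t_ref,
                  t_hot_load, t_off=45.0):
    """Convert one raw OTF spectrum (counts/s) to antenna temperature (K).

    hot_otf  : hot-load spectra taken during the OTF scan, at times t_hot_otf
    hot_ref  : hot-load spectra from reference scans, at times t_ref
    off_ref  : off-source spectra from reference scans, at times t_ref
    t_hot_load : hot-load temperature in K (hypothetical argument name)
    t_off    : off-source sky noise temperature in K (assumed default)
    """
    hot_o = interp_spectra(t_otf, t_hot_otf, hot_otf)
    hot_r = interp_spectra(t_otf, t_ref, hot_ref)
    off_r = interp_spectra(t_otf, t_ref, off_ref)

    y = hot_r / off_r                              # y(t_OTF) in Eq. (4)
    off_interp = hot_o / y                         # Eq. (3)
    t_sys = (t_hot_load - y * t_off) / (y - 1.0)   # Eq. (4)
    return t_sys * (np.asarray(otf, dtype=float) - off_interp) / off_interp  # Eq. (2)
```

Because the predicted off-source spectrum is built from the hot-load spectrum taken during the OTF scan, a gain change between the reference position and the OTF leg cancels in the ratio of Equation (2), which is the property the method relies on.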

In the second step of the STO2 data reduction, we group spectra based on the observation time, assess spectrum quality, and remove fringes. We found that fringe patterns are similar to each other when the spectra are taken within a single OTF scan (typically < 30 seconds or < 40 outputs). Based on this characteristic, we first cluster the spectra using observation time as the main parameter of the clustering procedure. Among many available clustering algorithms, we adopt the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) in Scikit-Learn (Pedregosa et al., 2011), which is a machine learning package in Python. We use DBSCAN because DBSCAN automatically clusters a large number of OTF spectra (> […] < […]). […]

$$P_i \equiv \sum_{v_k = v_1}^{v_2} T_i^2(v_k), \tag{6}$$

where $P_i$ is the squared deviation of the $i$-th spectrum in a group of the OTF spectra, $v$ is velocity, and $T_i$ is the antenna temperature of the $i$-th spectrum. The subscripts 1 and 2 denote the minimum and the maximum of the velocity range of the spectrum, which is set to cover half of the spectral range. We make the phase portrait of spectra, which is a plot of $T(v)$ vs. $\partial T/\partial v$, after we substantially reduce the white noise using a wavelet de-noising method within each spectrum. For a spectrum with large fringes, the squared deviation is significantly larger than the squared deviation of a spectrum with no fringes. Also, the maximum density point in the phase portrait deviates considerably from (0,0) (Figure 4). The standard deviation of the squared deviation indicates whether the fringes of spectra within a group are similar or vary considerably within a group; for example, a group with similar baselines is found to have values less than $10^{[\ldots]}$, while a group with rapidly varying baselines is typically found to have values over $10^{[\ldots]}$. For a group of spectra with strong lines but without large fringes, the value of squared deviations is typically small and the maximum density point is close to (0,0) in the phase portrait, but there is a large standard deviation in the squared deviations of the spectra.

Fig. 4. Phase portrait of de-noised spectra in Scan 3619 (left) and Scan 3843 (right). Each dot corresponds to the values of one channel. The spectra in Scan 3619 have a large fringe amplitude, while the spectra in Scan 3843 have small fringes as well as clear [CII] lines, which are shown as large fan-shaped features in the positive Ta quadrants.
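The grouping by observation time and the squared deviation of Equation (6) can be sketched as below. This is only an illustration under stated assumptions: `DBSCAN` is the Scikit-Learn class named in the text, but the parameter values (`max_gap`, `min_spectra`), the helper names, and the use of channel indices `lo:hi` to select the velocity range are hypothetical choices, not values from the STO2 pipeline.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def group_by_time(obs_times, max_gap=5.0, min_spectra=3):
    """Cluster spectra on observation time only (seconds).

    max_gap and min_spectra map to DBSCAN's eps and min_samples; both values
    are assumptions for this sketch. Label -1 marks isolated spectra.
    """
    t = np.asarray(obs_times, dtype=float).reshape(-1, 1)
    return DBSCAN(eps=max_gap, min_samples=min_spectra).fit_predict(t)


def squared_deviation(spectra, lo, hi):
    """Eq. (6): sum of T_i^2 over channels lo..hi-1 for each spectrum."""
    s = np.asarray(spectra, dtype=float)[:, lo:hi]
    return np.sum(s ** 2, axis=1)


def group_statistics(spectra, labels, lo, hi):
    """Mean and standard deviation of P_i within each time-clustered group."""
    p = squared_deviation(spectra, lo, hi)
    stats = {}
    for lab in np.unique(labels):
        if lab == -1:          # skip spectra DBSCAN flagged as noise
            continue
        sel = labels == lab
        stats[int(lab)] = (float(p[sel].mean()), float(p[sel].std()))
    return stats
```

A group whose mean squared deviation is large would be a candidate for de-fringing, and a large standard deviation within a group would indicate either rapidly varying fringes or varying line emission, following the classification described in the text.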
Thus, using these values, we classify the groups into four different categories (Figure 3) and determine which algorithm to use to process each group of spectra.

The next step is the de-fringing for a group of spectra with significantly large fringes (fringes with amplitudes > 30 K, reduction flags 2, 3, and -2). We skip this process for a group without large fringes (reduction flag 1). We use independent component analysis (ICA) to do the de-fringing. The ICA algorithm is often used to decompose different sources within a signal, assuming that the observed signal is linearly mixed from multiple sources. We assume that fringes are combinations of different contributions to the frequency-dependent output (e.g., optical and electrical standing waves) in the STO2 system. Among many modes of ICA, we use the deflation mode of ICA (deflation ICA). The deflation ICA delivers ICA-components in order of amplitude, and the first ICA-component of the deflation ICA is the feature with the largest amplitude. Thus, we use only the first ICA-component to correct the fringe of the largest amplitude throughout a group. We found that residual fringes are typically ±10 K after correction. This method is efficient when the amplitude of fringes is larger than that of line emission in all spectra within a single group.

We found that the first ICA-component can be contaminated by the line emission in the spectra when the amplitude of line emission is comparable to that of the strongest fringe. To check whether or not the first ICA-component is contaminated by strong emission, we examine the rms of each channel across spectra within a group after de-fringing. We found that the channels involving strong emission typically show significantly larger rms values compared to those of the other channels when the emission varies within a group (reduction flag 3). For such a group, we mask the channels that have large rms values and re-do the de-fringing step. The masked channels are replaced by a second-order polynomial curve. We found that this additional procedure prevents removing the line emission during the de-fringing step, but this method only works when the intensity or shape of the emission varies significantly within a group. If the line shape and intensity are the same throughout the spectra within a group, this method will fail to isolate the line emission from fringes. We also found that there are groups in which all channels have large rms after de-fringing (reduction flag of -2). In such groups, fringes vary rapidly, typically in less than 1 second, and the deflation ICA failed to find a common fringe pattern for the group. We stopped further processing of these groups since we could not process them using either conventional methods (e.g., high-order polynomial fitting) or the deflation ICA.
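The de-fringing with the first ICA-component can be illustrated with Scikit-Learn's `FastICA` in deflation mode. The sketch below is an assumption about how such a step could be written, not the pipeline's actual code: the function name is hypothetical, and requesting a single component means that Scikit-Learn's whitening step reduces the data to its leading principal component, so the example effectively removes the largest-variance pattern shared by the group.

```python
import numpy as np
from sklearn.decomposition import FastICA


def defringe_first_component(spectra, random_state=0):
    """Remove the dominant common pattern from a group of spectra.

    spectra : array of shape (n_spectra, n_chan), one time-clustered group.
    Returns (cleaned, fringe_model), both of shape (n_spectra, n_chan).
    """
    x = np.asarray(spectra, dtype=float)
    ica = FastICA(n_components=1, algorithm="deflation",
                  random_state=random_state)
    # Channels act as samples and spectra as mixed signals, so the single
    # source is a pattern over channels and the mixing gives its amplitude
    # in each spectrum.
    source = ica.fit_transform(x.T)            # shape (n_chan, 1)
    # inverse_transform rebuilds the part of each spectrum explained by the
    # component; it also adds back each spectrum's mean level.
    fringe = ica.inverse_transform(source).T   # shape (n_spectra, n_chan)
    return x - fringe, fringe
```

Since the fringe model includes the mean level of each spectrum, the cleaned spectra are also offset-corrected. The channel masking for reduction flag 3 described above would be applied before calling such a function and is not shown here.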

3.3. Baseline Correction using the ALS and the Parallel ICA and Regridding

In the third step of the STO2 data reduction, we correct the baselines of the de-fringed spectra and the spectra with small fringes. We correct the baselines using either the asymmetric least squares method (ALS, Eilers & Boelens 2005) or the ICA in the parallel mode (parallel ICA) or both. The ALS baseline correction efficiently removes any low-frequency fringe. The parallel ICA can remove both low- and high-frequency fringes simultaneously but tends to be slower compared to the ALS baseline correction. The parallel ICA is different from deflation ICA in that it evaluates features without ordering them by amplitude. In signal processing, the parallel ICA is often used for blind source separation when an output signal is linearly composed of many input sources (Jain & Rai, 2012). We assume that the spectra of STO2 are a mixture of standing waves from the receiver electronics and the emission from astronomical sources, and we separate the fringes and the emission using the parallel ICA. In the pipeline, we use the ALS followed by the parallel ICA as a default setting for the baseline correction. After baseline correction, we found that the residual fringes are typically ∼ […]

[…] N ICA-components as a default setting if a group of spectra has N spectra. It is possible to generate an arbitrary number of ICA-components, but if the number of the ICA-components is too small (< […] N components, the parallel ICA provides a better separation of fringes from the line emission in exchange for computational cost.

3. The third step is to remove any contribution originating from the line emission in all ICA-components. We found that the parallel ICA decomposes the fringes into N components, but it also decomposes line emission into a few ICA-components rather than concentrating it in a single ICA-component. We determine the contribution of line emission within ICA-components using a combination of sigma-clipping, a peak detection algorithm, and a Gaussian-line-fitting algorithm. After line detection, we replace the contribution of the line emission in each ICA-component with a low-degree polynomial (the default is a linear function).

4. We construct the baselines by combining the modified ICA-components using the original mixing matrix obtained by the parallel ICA and correct the baseline of the spectra.

Fig. 5. Baseline correction using a wavelet de-noising method and the asymmetric least squares (left) and the independent component analysis (right).

Fringes with either a broader or narrower width than the FWHM of line emission are efficiently removed using this method (Figure 5). But if the fringe width is similar to the FWHM of the line emission and the amplitude of fringes is higher than the line amplitude, this method may not isolate the baseline and fringes from the line emission.

To make maps from the STO2 spectra, we need to re-grid spectra. We use a regridding algorithm from Mangum et al. (2007). We used a Gaussian-Bessel kernel for the regridding, as suggested in that paper.
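For the ALS baseline correction used at the start of this step, a common formulation of asymmetric least squares smoothing is sketched below. It is a generic implementation of the technique attributed to Eilers & Boelens, not the STO2 pipeline code; the function name and the default values of `lam`, `p`, and `n_iter` are assumptions and would need tuning to the fringe scale of real data.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve


def als_baseline(y, lam=1e5, p=0.01, n_iter=10):
    """Estimate a smooth baseline under emission lines.

    lam    : smoothness penalty on second differences (assumed default)
    p      : weight given to points above the baseline (assumed default);
             points below the baseline get weight 1 - p
    n_iter : number of reweighting iterations (assumed default)
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    # Second-difference operator of shape (n - 2, n).
    d = sparse.diags([1.0, -2.0, 1.0], [0, 1, 2], shape=(n - 2, n))
    penalty = lam * (d.T @ d)
    w = np.ones(n)
    for _ in range(n_iter):
        wmat = sparse.diags(w, 0)
        # Weighted, penalized least squares: (W + lam * D'D) z = W y
        z = spsolve((wmat + penalty).tocsc(), w * y)
        # Down-weight channels above the current baseline (likely emission).
        w = np.where(y > z, p, 1.0 - p)
    return z
```

Subtracting `als_baseline(spectrum)` from a spectrum removes slowly varying structure while largely preserving positive emission lines, which matches the role of removing low-frequency fringes described above; narrower fringes are left for the parallel ICA.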

3.4. Signal Fitting Using a Convolutional Neural Network

The last step in the STO2 data reduction is signal fitting using a convolutional neural network (CNN), more specifically, a CNN autoencoder. The STO2 data contain surveys of multiple high-mass star-forming regions, which have complex velocity structures. Thus, the line profiles in STO2 data are very complex and require multiple Gaussian functions to fit a line profile. A conventional method to fit a line profile employs least-χ² fitting and is useful if a line profile can be fitted using fewer than five Gaussian functions. If more than five are required, the number of fitting parameters becomes too large, and the least-χ² method takes an excessively long time to obtain a proper solution. On the other hand, a CNN is excellent at recognizing and processing highly complicated shapes or patterns. Line profiles in astronomy are often a series of superposed Gaussian functions. We generated a training set containing noise, fringes, and signals, and we trained the CNN to eliminate noise and fringes and to return only the signal. The training set includes 500,000 samples with various combinations of simulated baselines (sine waves and polynomials) and signals (up to 20 skewed Gaussian functions). In testing the CNN signal-fitting through a simple Monte-Carlo simulation, we found that roughly 90% of 2σ emission is detected and fitted correctly, whereas the CNN is not able to recover any emission weaker than 1.5σ. We applied the CNN to the spectral maps of the STO2 observations (Figure 6) and found that fitting signals is significantly faster than a conventional Gaussian fitting algorithm, and we could process 10,000 spectra, including complex line profiles with more than several intensity peaks, in less than one second with highly accurate results.

Fig. 6. […] et al. (2019)) and signal-fitted map using a convolutional neural network (right).

Fig. 7. Reduction flags of the STO2 Level 1 data. Each point denotes a single spectrum. Each flag is marked by a different color with short descriptors (see text for details).

We have also written a CNN for baseline correction, which is not included in the STO2 pipeline. We found that it delivers results similar to or better than the ICA de-fringing and baseline correction in many cases and is at least 100 times faster in computation speed. However, we found that it is difficult to generate a training set covering all features in the STO2 data since the STO2 features are extremely diverse, having more than 1,000 families of features in 300,000 spectra. The algorithm will be further tested using SOFIA GREAT data.
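The simulated training set described in this section (sine-wave and polynomial baselines, up to 20 skewed Gaussian lines, plus noise) could be generated along the following lines. This is a hedged sketch: the skew-normal line shape, the normalized velocity axis, and every numerical range below are assumptions chosen for illustration, since the text does not specify them, and the network itself is not shown.

```python
import numpy as np
from scipy.special import erf


def skewed_gaussian(v, amp, center, width, skew):
    """Skew-normal line shape rescaled so that its peak equals amp."""
    z = (v - center) / width
    profile = np.exp(-0.5 * z ** 2) * (1.0 + erf(skew * z / np.sqrt(2.0)))
    return amp * profile / profile.max()


def make_training_sample(rng, n_chan=1024, max_lines=20, noise_rms=1.0):
    """Return (input, target): noisy spectrum with fringes, and signal only.

    All parameter ranges are illustrative assumptions.
    """
    v = np.linspace(-1.0, 1.0, n_chan)          # normalized velocity axis
    signal = np.zeros(n_chan)
    for _ in range(rng.integers(0, max_lines + 1)):
        signal += skewed_gaussian(v,
                                  amp=rng.uniform(1.0, 10.0),
                                  center=rng.uniform(-0.8, 0.8),
                                  width=rng.uniform(0.01, 0.1),
                                  skew=rng.uniform(-3.0, 3.0))
    # Baseline: second-order polynomial plus one to three sine-wave fringes.
    baseline = np.polyval(rng.normal(0.0, 2.0, size=3), v)
    for _ in range(rng.integers(1, 4)):
        baseline += rng.uniform(0.0, 5.0) * np.sin(
            2.0 * np.pi * rng.uniform(0.5, 10.0) * v
            + rng.uniform(0.0, 2.0 * np.pi))
    noisy = signal + baseline + rng.normal(0.0, noise_rms, n_chan)
    return noisy, signal
```

An autoencoder trained on such pairs would take `noisy` as input and `signal` as the target, so that noise and fringes are suppressed in the output, as described in the text.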

4. Data Products and Distribution

We reduced all [CII] spectra of the STO2 science observations using the software described above. STO2 also observed [NII], but we found that the [NII] data do not reveal any appreciable (> […]

[…] products are all the individual spectra processed through the software. Figure 7 shows the distribution of the reduction flags in the Level 1 product. We found that roughly 40% of the spectra show relatively weak fringes and are designated as reduction flag 1. These are relatively good spectra, which required only baseline corrections. The final rms values for these spectra are typically ≤ […], while the expected rms noise is roughly 2 K if the instrument follows the radiometer equation. Another 40% of the spectra have large-amplitude fringes, which were corrected through the de-fringing step using the deflation ICA method. These are flagged as 2 and 3. The spectra flagged as 2 and 3 showed rms values ∼ […] l = 286°, 310°, and 328°, with |b| ≤ […]°. The STO2 observations of the Trumpler 14 region have been analyzed in detail and published in Seo et al. (2019).

The validation of the final reduction is carried out for the Trumpler 14 map using an integrated intensity map of the same region from the Infrared Space Observatory (ISO). We found that only ISO observed the same region, providing data that can be used for comparison to the STO2 observations. We found that the two maps are in good agreement (uncertainty of ±38% in the integrated intensity) assuming the beam efficiency of STO2 to be 0.7 (Appendix 1 in Seo et al. […] (footnote b) and DataVerse. The entire STO2 data processing software is written in Python 3 and the scripts are publicly accessible from Youngmin Seo's GitHub (footnote c).

5. Summary

The STO2 mission, the first successful terahertz heterodyne balloon-borne astronomical observatory, was launched from McMurdo Station in December 2016. The mission was successful in surveying several star-forming regions and the Galactic plane, but the data required an enhanced effort to process due to unexpected features in spectra, including fringes with various frequencies and amplitudes, the dependence of fringe patterns on telescope position, and fast-varying fringe patterns. We found that it is inefficient to process ∼ […]

[…] spectra is typically worse than that of the spectra flagged as 1. For the last 20% of the spectra, we found that the STO2 software could not efficiently process the spectra to a scientifically usable level. Thus, only the spectra flagged 1, 2, and 3 have been used for creating maps.

3. We found that the signal-fitting algorithm using a Convolutional Neural Network (CNN) autoencoder is extremely fast in fitting line profiles compared to the method minimizing χ². We have also written a CNN autoencoder for correcting the baseline. We found that it is very effective and rapid in processing a portion of the STO2 data but was not capable of processing the entire data set since the fringe patterns vary excessively in parts of the STO2 data, which makes it hard to build an effective set of training models for the neural network. The algorithm may still be very useful for other data and will be tested further.

4. Level 1 and 2 data products of the STO2 surveys are open to the public and are accessible from the PI team server (link in §4) and DataVerse. The entire STO2 data processing software is written in Python 3 and the scripts are publicly accessible from Youngmin Seo's GitHub (link in footnote c of §4).

b http://soral.as.arizona.edu/STO2/
c https://github.com/seoym3919/STO2_PIPELINE

6. Acknowledgement

We thank the anonymous referee for helping improve the paper in a variety of ways. We acknowledge that this work is supported by NASA Astrophysics and Data Analysis Program (17-ADAP17-0048). STO2 is a multi-institutional effort funded by the National Aeronautics and Space Administration (NASA) through the ROSES-2012 program under grant NNX14AD58G. This work was carried out in part at the Jet Propulsion Laboratory, which is operated for NASA by the California Institute of Technology.

References

Eilers, P. & Boelens, H. [2005] Unpubl. Manuscr.

Garwood, R. W. [2000] "SDFITS: A Standard for Storage and Interchange of Single Dish Data," Astronomical Data Analysis Software and Systems IX, eds. Manset, N., Veillet, C. & Crabtree, D., p. 243.

Jain, S. & Rai, D. [2012] IJEST.

Kutner, M. L. & Ulich, B. L. [1981] Astrophysical Journal, 341, doi:10.1086/159380.

Mangum, J. G., Emerson, D. T. & Greisen, E. W. [2007] Astronomy & Astrophysics, 679, doi:10.1051/0004-6361:20077811.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. & Duchesnay, E. [2011] Journal of Machine Learning Research, 2825.

Seo, Y. M., Goldsmith, P. F., Walker, C. K., Hollenbach, D. J., Wolfire, M. G., Kulesa, C. A., Tolls, V., Bernasconi, P. N., Kavak, Ü., van der Tak, F. F. S., Shipman, R., Gao, J. R., Tielens, A., Burton, M. G., Yorke, H., Young, E., Peters, W. L., Young, A., Groppi, C., Davis, K., Pineda, J. L., Langer, W. D., Kawamura, J. H., Stark, A., Melnick, G., Rebolledo, D., Wong, G. F., Horiuchi, S. & Kuiper, T. B. [2019] Astrophysical Journal, 120, doi:10.3847/1538-4357/ab2043.

Shipman, R. F., Beaulieu, S. F., Teyssier, D., Morris, P., Rengel, M., McCoey, C., Edwards, K., Kester, D., Lorenzani, A., Coeur-Joly, O., Melchior, M., Xie, J., Sanchez, E., Zaal, P., Avruch, I., Borys, C., Braine, J., Comito, C., Delforge, B., Herpin, F., Hoac, A., Kwon, W., Lord, S. D., Marston, A., Mueller, M., Olberg, M., Ossenkopf, V., Puga, E. & Akyilmaz-Yabaci, M. [2017] Astronomy & Astrophysics 608.