Active deep learning method for the discovery of objects of interest in large spectroscopic surveys
Astronomy & Astrophysics manuscript no. ms © ESO 2020
September 8, 2020
P. Škoda¹,², O. Podsztavek², and P. Tvrdík²

¹ Astronomical Institute of the Czech Academy of Sciences, Fričova 298, 251 65 Ondřejov, Czech Republic
e-mail: [email protected]
² Faculty of Information Technology, Czech Technical University in Prague, Thákurova 9, 160 00 Prague 6, Czech Republic
e-mail: {skodape4,podszond,tvrdik}@fit.cvut.cz
ABSTRACT
Context.
Current archives of the LAMOST telescope contain millions of pipeline-processed spectra that have probably never been seen by human eyes. Most of the rare objects with interesting physical properties, however, can only be identified by visual analysis of their characteristic spectral features. A proper combination of interactive visualisation with modern machine learning techniques opens new ways to discover such objects.
Aims.
We apply active learning classification methods supported by deep convolutional neural networks to automatically identify complex emission-line shapes in multi-million spectra archives.
Methods.
We used the pool-based uncertainty sampling active learning method driven by a custom-designed deep convolutional neural network with 12 layers. The architecture of the network was inspired by VGGNet, AlexNet, and ZFNet, but it was adapted for operating on one-dimensional feature vectors. The unlabelled pool set is represented by 4.1 million spectra from the LAMOST data release 2 survey. The initial training of the network was performed on a labelled set of about 13 000 spectra obtained in the 400 Å wide region around Hα by the 2 m Perek telescope of the Ondřejov observatory, which mostly contains spectra of Be and related early-type stars. The differences between the Ondřejov intermediate-resolution and the LAMOST low-resolution spectrographs were compensated for by Gaussian blurring and wavelength conversion.
Results.
After several iterations, the network was able to successfully identify emission-line stars with an error smaller than 6.5%. Using the technology of the Virtual Observatory to visualise the results, we discovered 1 013 spectra of 948 new candidates of emission-line objects in addition to 664 spectra of 549 objects that are listed in SIMBAD and 2 644 spectra of 2 291 objects identified in an earlier paper of a Chinese group led by Wen Hou. The most interesting objects with unusual spectral properties are discussed in detail.
Key words. methods: statistical – techniques: spectroscopic – stars: emission-line, Be – line: profiles – virtual observatory tools
1. Introduction
The stellar spectral classification, as explained in Gray & Corbally (2009), is an important astrophysical task of assigning a particular label (a mixture of letters and Arabic and Roman numbers), called the spectral class, to each spectrum based on visual similarities (e.g. presence, strength, and width of the spectral lines of a given element, or a combination of multiple lines). A common automatic procedure (see e.g. Gray & Corbally 2009, Chap. 13.5) uses statistical matching (mainly using χ² fitting) of a given spectrum with an extensive set of template spectra that may be either synthetic or come from a library of carefully selected stars (called spectral standards). This method is also used in various modifications for the automatic spectral classification of large spectroscopic surveys, such as the Sloan Digital Sky Survey (SDSS; Lee et al. 2008) or the Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST; Wu et al. 2011; Lee et al. 2015).

⋆ Based on spectra obtained with the 2 m Perek Telescope of the Ondřejov observatory, Czech Republic, and archival LAMOST DR2 spectra.
⋆⋆
Catalogues of our emission-line candidates are available only in electronic form at the CDS via anonymous ftp to cdsarc.u-strasbg.fr (130.79.128.5) or via http://cdsweb.u-strasbg.fr/cgi-bin/qcat?J/A+A/.
A problem arises in many cases when an appropriate model of the spectrum is not known and the library used for matching is not rich enough to contain unusual or new types. In addition to this, many types of celestial objects may show complex shapes of only several prominent spectral lines (mainly Hα or other Balmer and Paschen lines) that cover only small parts of the whole spectrum. The integral statistics then fails, and target-tailored methods must be applied to discover such usually rare objects. This is the case of various objects with emission lines that allow us to study a wide range of interesting physical processes.

Pre-main-sequence stars such as young stellar objects and T Tau stars (Reipurth et al. 1996; Kurosawa et al. 2006), or hot stars with expanding envelopes or strong winds show prominent emission lines, as do cataclysmic variables, novae, and even late-type stars with chromospheric activity. See Kogure & Leung (2007) or Traven et al. (2015) for a comprehensive overview of these cases.

The classical Be stars (Porter & Rivinius 2003) and the rare class of B[e] stars (Zickgraf 2003) are other cases of well-studied objects with complicated emission-line profiles that often look like symmetric or slightly asymmetric double peaks, sometimes superimposed on absorption lines, depending on their disk geometry (Silaj et al. 2010). The visual classification of their profiles (Hanuschik et al. 1988) is a challenging task even on small samples, but it becomes impossible in surveys with millions of spectra.

The classical approach to finding emission lines is to compute integral statistics around their expected positions. It is similar to the standard method of measuring the line equivalent width (Kang & Lee 2012; Waters & Hollek 2013). Such an integral measure based on three-pixel statistics was taken by Lin et al.
(2015) on the LAMOST data release 1 (DR1) in order to find strong uprising peaks. This resulted in a catalogue of 203 emission-line stars, 23 of which were identified as classical Be stars and 180 are claimed to be discovered candidates. In order to find double-peak profiles hidden in deep absorption, Hou et al. (2016) (hereafter H16) used a more advanced method based on the difference of several statistics with different kernel widths. The authors made an extensive analysis of the LAMOST data release 2 (DR2) survey and published a catalogue of 11 204 spectra of emission-line stars.

We propose an alternative approach for the discovery of emission-line spectra here based on machine learning of individual shapes of prominent spectral lines. For the sake of simplicity, we limit ourselves to the vicinity of the Hα line. The early attempts on a small sample of good spectra (Škoda & Vážný 2012; Bromová et al. 2014) have already justified this method, and its application to the LAMOST DR1 (Škoda et al. 2015, 2016) has resulted in the discovery of unknown emission-line candidates. This article describes the first systematic investigation of the LAMOST DR2 using a deep convolutional neural network (CNN) in combination with active learning.

We organised this article as follows. Section 2 describes our active learning method based on CNNs in detail. Section 3 shows the application of the developed method to the discovery of emission-line spectra in the LAMOST DR2. Section 4 discusses the outcomes of the experiment and lists examples of discovered objects of interest. Finally, we conclude in Sect. 5. Furthermore, we compare our method to the non-active learning scenario in Appendix A, and we provide a detailed analysis of the results in Appendix B.
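As a toy illustration of the integral-statistic idea (not the exact statistic of Lin et al. 2015 or H16; the window widths and the score definition here are our own illustrative choices), one can compare the mean flux in a narrow window around Hα with the mean flux of the surrounding pseudo-continuum:

```python
import numpy as np

def emission_score(wave, flux, line=6562.8, line_hw=3.0, cont_hw=25.0):
    """Mean flux in a narrow window around the line relative to the mean
    flux of the surrounding pseudo-continuum (> 1 suggests emission)."""
    dist = np.abs(wave - line)
    in_line = dist <= line_hw                       # line window
    in_cont = (dist > line_hw) & (dist <= cont_hw)  # pseudo-continuum window
    return flux[in_line].mean() / flux[in_cont].mean()

wave = np.linspace(6500.0, 6620.0, 500)
continuum = np.ones_like(wave)
# synthetic emission line: Gaussian bump on a flat continuum
emission = continuum + 2.0 * np.exp(-0.5 * ((wave - 6562.8) / 1.5) ** 2)

print(emission_score(wave, continuum))  # ~1.0: no emission
print(emission_score(wave, emission))   # clearly > 1: emission detected
```

Such a score fails exactly in the cases discussed above, e.g. for double peaks buried in a deep absorption profile, which motivates the shape-based approach of this paper.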
2. Active deep learning method
The discovery of objects of interest in large archives of astronomical spectra would be a standard machine learning task if a large and representative labelled data sample of a given archive were available. With such a training set, it would be straightforward to train a supervised learning model and classify the whole archive with high accuracy. However, our experiment in Sect. 3 has shown that if there is no proper training dataset, standard machine learning methods provide poor results with a high rate of both false and missed candidates.

This means that if the labelled training data are not a sufficiently large representation of a spectral archive, for example, when the training set is biased or comes from another, but similar, archive, other machine learning approaches need to be developed to obtain reasonable discovery results. We propose and evaluate here an extension of a deep CNN classification method with class balancing and active learning.

The following subsections explain in detail why and how we combined a CNN with a class balancing algorithm and an active learning method. This unified active deep learning workflow allowed us to discover objects of interest (objects with emission-line spectra) in the LAMOST DR2, although only a small number of training data were available from a different spectral archive.

The H16 catalogue is available at http://paperdata.china-vo.org/vac/dr2/HouEmission2016.tar.gz

2.1. Deep convolutional neural networks

Deep learning is a type of machine learning that solves the problem of representational learning by learning a hierarchy of concepts. In representational learning, we try to learn a representation of the data that would facilitate the subsequent learning task. Deep learning allows computers to learn a good data representation by building complicated representations out of more simple ones (LeCun et al. 2015; Goodfellow et al. 2016).

Today, CNNs (LeCun et al. 1989) are the most advanced deep learning method. CNNs started to be recognized when Krizhevsky et al.
(2012) achieved a winning top-5 test error rate of 15.3% on the ImageNet Large Scale Visual Recognition Challenge 2012 (Russakovsky et al. 2015). We wish to take advantage of CNNs because spectra of stellar objects can be viewed as one-dimensional arrays with a single channel, whereas a typical image is a two-dimensional array with usually three RGB channels.

The CNNs are specialised neural networks that use convolution to process data with a grid-like topology. A convolution leverages three essential properties of these biologically inspired networks: sparse interactions (kernels used for convolution with an image have fewer parameters than a fully connected layer), parameter sharing (rather than learning a separate set of parameters, CNNs learn one set for all locations), and equivariance to translation (if an object shifts in the input, its corresponding output shifts by the same distance). Furthermore, a typical CNN has pooling layers that follow the convolution and activation layers. The pooling layers make the representation invariant to small translations and rotations of the input. This invariance is a useful property for application to spectra because spectral lines might be blue- or red-shifted due to a high radial velocity (Goodfellow et al. 2016).

Deep CNNs have already been successfully applied in astronomy and astrophysics. For example, Aniyan & Thorat (2017), Domínguez Sánchez et al. (2018), and Alhassan et al. (2018) used CNNs to automate the morphological classification of radio sources. Alger et al. (2018) localised host galaxies for a given radio component with a CNN using data from experts and crowdsourced training data. Furthermore, George & Huerta (2018) applied two CNNs to time-series data for the detection and parameter estimation of gravitational waves from binary black hole mergers. The two CNNs achieved a similar performance as previous advanced methods but were much faster, thus allowing real-time processing. For all these reasons, we decided to
For all these reasons, we decided todevelop an active deep learning method with CNNs for the dis-covery of objects of interest.
2.2. Class balancing

When discovering rare objects of interest in large spectroscopic surveys, we face the class imbalance problem (Prati et al. 2009). Labelled spectra of rare objects of interest (hereafter target spectra) will usually be in the minority, in contrast to the labelled spectra of abundant objects (hereafter non-target spectra). Therefore, the labelled training data will tend to be imbalanced. Moreover, target spectra will be in a significant minority in general massive spectral archives (e.g. LAMOST or SDSS).

Our application of the active deep learning, see Sect. 3 for details, revealed exactly the class imbalance problem. The archive
Fig. 1.
Flowchart of our active learning method. First, the algorithm is initialised only with labelled training data, the CNN is trained, and unlabelled spectra are classified. Then, uncertainty sampling selects the spectra that the network is least certain for. Finally, these spectra are labelled by an oracle, are added to the training set, and a new training iteration starts. When the performance is satisfactory, samples classified into target classes are taken as candidates and are extended with samples classified into target classes by the oracle.

of the Ondřejov 2 m Perek telescope is focused on the observation of emission-line stars. Although there is almost the same percentage of single peaks as absorptions, double peaks are still in the minority (see Sect. 3.3). Moreover, there are (at least by an order of magnitude) fewer emission-line spectra than standard ones in the LAMOST survey because emission-line objects are rare in the Universe. In these cases, class balancing has been shown to be an essential part of workflows and leads to successful performance (e.g. for the necessity of class balancing in astronomy, see de la Calleja et al. (2011) or Lyon et al. (2016), and in medicine, see Rastgoo et al. (2016)).

To overcome the fact that CNNs will tend to discriminate against the minority classes, we incorporated in our experiments the synthetic minority over-sampling technique (SMOTE) proposed by Chawla et al. (2002). This technique allows enlarging the number of labelled target spectra to the same size as the more abundant non-target spectra.
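The core of SMOTE is to synthesise new minority samples by interpolating between a minority sample and one of its nearest minority-class neighbours. A simplified NumPy sketch of this idea follows (the real experiments can use a library implementation such as the one in imbalanced-learn; the toy data and k = 5 here are illustrative):

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples (simplified SMOTE):
    synthetic = x + u * (neighbour - x), with u drawn uniformly from [0, 1)."""
    rng = rng or np.random.default_rng(0)
    n = len(X_min)
    # pairwise distances among minority samples
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]                 # k nearest neighbours
    base = rng.integers(0, n, size=n_new)             # pick a base sample
    nbr = nn[base, rng.integers(0, k, size=n_new)]    # and one neighbour
    u = rng.random((n_new, 1))
    return X_min[base] + u * (X_min[nbr] - X_min[base])

rng = np.random.default_rng(0)
X_min = rng.normal(size=(100, 140))        # toy minority class: 100 "spectra"
X_syn = smote(X_min, n_new=500, rng=rng)   # enlarge the minority class
print(X_syn.shape)                          # (500, 140)
```

Because each synthetic sample lies on a segment between two real minority samples, SMOTE enlarges the minority class without simply duplicating existing spectra.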
2.3. Active learning

Our experiments have shown that the combination of a CNN and class balancing is still not sufficient for the discovery of objects of interest because the first prediction of candidates delivered a considerable amount of false candidates and featureless noisy spectra. The reason for this failure was an imperfect training dataset. Therefore, we decided to explore active learning (Settles 2009) to circumvent the requirement of good representativeness of labelled samples and to exploit the full potential of deep neural networks to discover objects of interest.

Active learning has already been shown to be successful in astronomy, for example, in estimating parameters of stellar population synthesis models by Solorio et al. (2005) or for the classification of light curves of variable stars by Richards et al. (2012). Gupta et al. (2016) used active learning to learn a model for photometric data classification from spectroscopic data (the work was extended by Vilalta et al. (2019)), and recently, active learning was used to minimise the number of required spectroscopically confirmed labels in preparing training sets for the photometric classification of supernova light curves by Ishida et al. (2019a) and for active anomaly detection in light curves of supernovae by Ishida et al. (2019b). Moreover, active deep learning has been successfully tested in remote sensing by Liu et al. (2017), with further examples reviewed in Yang et al. (2018). To the best of our knowledge, our method represents the first astronomical application of active deep learning.

Active learning is a machine learning technique based on the idea that an algorithm will perform better with fewer training data if it is allowed to choose the data for its training. A machine learning algorithm combined with active learning (an active learner) queries unlabelled data examples to be labelled by an oracle (e.g.
a human expert).

In the case of large spectra archives, there are huge pools of unlabelled data that can be processed and gathered at once (a so-called pool-based setting in the context of active learning). Spectra are queried selectively from the pool according to an informativeness measure that evaluates all spectra in the pool. Concerning CNNs, the most straightforward approach is to use uncertainty sampling as the query strategy. This strategy selects the spectra for which the CNN provided the least certain labelling. Because the last layer of the CNN is usually a softmax layer, it produces probabilities of classes for each spectrum. Therefore, to query spectra for labelling, we compute the information entropy

H = −∑_i p_i ln p_i, (1)

where p_i is the probability of class i, for all the spectra in the pool. Then, the method selects the spectra with the highest information entropy.

Because the training of a CNN can be time-consuming, our method uses so-called batch-mode active learning, which iterates in cycles: an oracle labels a batch of queried samples in each iteration in order to save time and computational resources (training of a CNN). More specifically, the method selects a batch of a previously specified size (e.g. one hundred as in our experiments in Sect. 3.4) from all spectra in the pool, and the oracle visually classifies them. Then, we add all the visually labelled spectra to the training set, so that it contains training data from the previous iterations and the newly classified spectra.

Lastly, to decide when to stop the active learning iterative procedure, we need to track the performance of the CNN. The obvious possibility is to estimate a performance measure and stop learning when a plateau is reached (e.g.
when adding newly labelled spectra would not increase the performance measure of the CNN).

When a large pool of unlabelled samples contains a negligible number of target spectra, it is reasonable to estimate precision, defined as

precision = TP / (TP + FP), (2)

where TP (true positive) is the number of correctly predicted target spectra, and FP (false positive) is the number of incorrectly predicted target spectra. In the case of precision, we can expect that a random sample of spectra classified into target classes will contain the true target spectra. On the other hand, a random sample of all spectra or of non-target spectra will probably contain only non-target spectra. Therefore, an estimation of any performance metric based on such random samples will not yield a useful result. For example, an estimate of accuracy, which has to be based on a random sample of all spectra, will almost certainly be 1 or very close to it. Moreover, when discovering rare objects, we are not interested in accuracy, but rather in precision and recall. However, the estimation of recall faces the same problem as the estimation of accuracy.

For this reason, we cannot have any randomly sampled performance estimation set fixed for all iterations. Instead, we have to draw a new random sample in every iteration as the set of predicted target spectra is changing.

In summary, our active deep learning method takes the labelled data as the initial training set and balances it. Having a balanced training set, we train the CNN and use the trained CNN to classify the whole unlabelled pool of spectra. Then, we use the uncertainty sampling query strategy to obtain a batch of samples for labelling by an oracle that labels all the samples in the batch. The labelled samples are taken out of the unlabelled pool and placed into the labelled training set. Now, we repeat these steps until the performance of our CNN is satisfactory.
When we are satisfied with the CNN performance, the unlabelled samples that were lastly predicted as target ones become the new candidates of emission-line stars. Finally, we move the samples labelled by the oracle as target from the training set to the candidate set. The flowchart in Fig. 1 illustrates the whole algorithm of our active deep learning method.

Recall is the ratio of correctly predicted target spectra and all target spectra.
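The uncertainty sampling step, i.e. selecting the batch of spectra with the highest information entropy of Eq. (1), can be sketched as follows (the softmax outputs here are synthetic stand-ins for real CNN predictions):

```python
import numpy as np

def uncertainty_batch(probs, batch_size=100):
    """Select indices of the batch_size samples whose class-probability
    vectors have the highest information entropy (Eq. 1)."""
    eps = 1e-12                                   # guard against log(0)
    H = -np.sum(probs * np.log(probs + eps), axis=1)
    return np.argsort(H)[::-1][:batch_size]       # descending entropy

# toy softmax outputs for a pool of 1 000 spectra and 3 classes
rng = np.random.default_rng(1)
logits = rng.normal(size=(1000, 3))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

idx = uncertainty_batch(probs, batch_size=100)    # spectra to show the oracle
```

A uniform probability vector (the network "has no idea") has the maximum possible entropy, so such spectra are queried first.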
3. Experiments
To illustrate the application of our active deep learning method, we have performed experiments with the discovery of objects with signatures of Hα emission in the LAMOST DR2 survey using labelled data from the Ondřejov 2 m Perek telescope. The following sections describe the data, the data preparation, the classes of interest, and our method application.

3.1. Data

The archive of spectra obtained with the 700 mm camera in the Coudé spectrograph of the 2 m Perek telescope at the Ondřejov observatory of the Astronomical Institute of the Czech Academy of Sciences is a unique source of spectra of emission-line stars (mostly Be and B[e] stars, stars with strong winds, and several novae). This continuously growing archive (hereafter CCD700) currently contains about 17 000 spectra, the majority of which (more than 13 000) are exposed in the spectral range 6 250–6 700 Å with a spectral resolving power of about 13 000. The standard IRAF procedure (Škoda & Šlechta 2002) reduces the spectra, including the calibration in air wavelengths and heliocentric correction.

The LAMOST telescope has delivered one of the currently largest collections of optical spectra. Four thousand fibres positioned by micro-motors feed 16 LAMOST spectrographs. Its publicly available DR2 contains over four million spectra with a spectral resolving power of about 1 800, covering the range 3 690–9 100 Å (Luo et al. 2016). The LAMOST pipeline (Wu et al. 2011) automatically assigns an estimated spectral class to the spectra. However, the pipeline uses classification mostly based on the global shape and integral properties of a spectrum in given band-passes using a set of predefined templates. The local features (e.g. detailed line profiles) are ignored. Strong narrow emissions can even be rejected by the pipeline as possibly spoiled pixels. Therefore, we did not use the assigned spectral classes. Hereafter we call the set of all unlabelled LAMOST DR2 spectra the LAMOST pool. The spectral axis of the FITS
The spectral axis of the FITSfiles in the LAMOST archive are expressed in the logarithm ofthe vacuum wavelength.
3.2. Data preparation

A common assumption in machine learning is that the training data (in our work, the CCD700 data) and the data of interest (the LAMOST pool) are from the same probability distribution (Pan & Yang 2010). However, in this work, we are interested in the classification of the LAMOST pool using the training set from the Ondřejov spectrograph, which contains mostly emission spectra. This means that the training set is highly biased. The distribution mismatch between the training data and the data of interest is a well-known problem in machine learning and is called domain adaptation (Glorot et al. 2011).

Using the technology of the Virtual Observatory (see Appendix B.1) for cross-matching, we have identified only 22 spectra that were observed both by the Ondřejov 2 m Perek Telescope and by LAMOST. Only a few (e.g. BT CMi, HD 53 416, or V395 Aur) of them show emission lines. The lack of labelled training spectra in the LAMOST pool prevents the usage of supervised training. To use the CCD700 spectra as our training set, we therefore applied a domain transfer to the CCD700 spectra (based on optical engineering procedures), so that they look as if they were exposed with the LAMOST spectrograph. Taigman et al. (2017) claimed that domain transfer is useful when solving the domain adaptation problem.

Fig. 2.
Comparison of a LAMOST spectrum with an Ondřejov CCD700 spectrum converted into the LAMOST lower resolution and vacuum wavelengths.

Firstly, we applied the air-to-vacuum wavelength conversion to the CCD700 spectra using the formulas provided in Heiter (2014) because spectra from the CCD700 archive are in air wavelengths, but the LAMOST spectra use vacuum wavelengths. Additionally, we converted the vacuum wavelengths of spectra from the LAMOST pool from the logarithmic into the linear scale.

Secondly, because the CCD700 spectra have a higher spectral resolution than the LAMOST spectra, we applied the spectral resolving power degradation to the CCD700 spectra, roughly approximated by the convolution with a Gaussian kernel of a given pixel width to reduce the high-resolution details. Comparison figures of simulated spectra from CCD700 and the LAMOST pool of all 22 objects mentioned above showed that a standard deviation of seven pixels works best. Figure 2 shows the comparison of an Ondřejov spectrum, a cross-matched LAMOST spectrum, and the preprocessed spectrum.

Next, the CNN requires a vector of features as an input. To have the same features for all spectra, they need to be resampled to obtain the measurements at the same wavelengths across all spectra. We decided to use a linear interpolation (using the linear interpolation function of the NumPy library) to 140 uniformly distributed wavelength points in the spectral range between 6 519 and 6 732 Å. We used this number of points because the LAMOST spectra mostly have this number of measurements in the given range. We derived the range from the fact that our classification is based on the Hα line and most of the CCD700 spectra are exposed between these wavelengths. This range also contains the He i 6 678 Å line.

Finally, we scaled each spectrum into the interval [−1, 1] using the equation

x′ = 2 (x − min(x)) / (max(x) − min(x)) − 1, (3)

where x is an input not-scaled spectrum, and x′ is a scaled spectrum. Thus, each spectrum has a maximum flux of value 1 and a minimum of value −1. We applied this preprocessing procedure for two reasons: we would like to classify the spectra according to their shapes (this procedure effectively suppresses the differences in intensities), and it obtains values in a comfortable small-valued range that is suitable for neural network training (this is not a feature scaling, but a scaling across each spectrum).

3.3. Classes of interest

In the next step, the preprocessed CCD700 spectra were classified by Podsztavek (2017) according to the visual shape of the Hα line into three classes: single peak, double peak, and absorption. The labelled spectra resulted in a dataset of 12 936 labelled spectra (hereafter the Ondřejov dataset, available at https://doi.org/10.5281/zenodo.2640970) suitable for machine learning. The counts of spectra in the classes are the following:

– single peak: 5 301 spectra (40.98%),
– double peak: 1 533 spectra (11.85%), and
– absorption: 6 102 spectra (47.17%).

Figure 3 displays representatives of each class. In both single-peak and double-peak spectra the Hα line is in emission, and the difference between the two classes is in the number of peaks, which are clearly visible in the spectrum. Spectra in the single-peak and double-peak target classes are the target emission spectra of our interest, and as expected, their number is smaller than the number of non-target absorption spectra, which are not interesting for us.

The Ondřejov dataset contains only well-exposed spectra, while the LAMOST pool contains many noisy spectra with instrumental and reduction artefacts, spectra without peaks or absorption, and spectra with a low signal-to-noise ratio. During our experiment, we placed all these spectra into the non-target uninteresting class. Therefore, the non-target uninteresting class contains bad and absorption spectra, which are both uninteresting for us.

3.4. Method application

When the data were ready, we applied our method. We chose the architecture of a CNN as developed in previous work that proved to work well (see Podsztavek 2017). This CNN architecture was inspired primarily by VGGNet (Simonyan & Zisserman 2015), AlexNet (Krizhevsky et al. 2012), and ZFNet (Zeiler & Fergus 2014). However, these CNNs were designed to process multi-channel two-dimensional images. We therefore adapted the architecture for our one-dimensional data (replacing two-dimensional convolutions with one-dimensional convolutions). After several experiments, we converged to the architecture shown in Fig. 4. This CNN was implemented using TensorFlow (Abadi et al. 2015) through the Keras (Chollet et al. 2015) high-level interface and was run on an NVIDIA GTX980 GPU (4 GB memory, 2 048 CUDA cores). The network was trained with the Adam optimiser (Kingma & Ba 2015) in the default setting of Keras. The best-found weights were restored at the end of each training. We stopped the training when the categorical cross-entropy loss function was not improved by at least 10⁻⁴ during the last ten iterations.

After we trained the CNN with the Ondřejov dataset (the initial training set) balanced with SMOTE, we used the model to predict classes and probabilities of classes for all spectra in the

Fig. 3.
Examples of spectra from all three classes in the Ondřejov dataset.
LAMOST pool. From all the classified spectra, a batch of 100 spectra with the highest information entropy computed from the class probabilities was selected (the uncertainty sampling strategy), visually reviewed by us (in the role of the oracle), and classified. Then all 100 visually labelled spectra were moved to the training set and removed from the LAMOST pool. Hence, after the first iteration, the training set contained the spectra from the Ondřejov dataset and 100 new spectra from the LAMOST pool.

To track the performance of our CNN, we decided to estimate the precision (the ratio of correctly predicted single-peak and double-peak spectra among all predicted target spectra) in each iteration for the reasons stated in Sect. 2.3. We therefore randomly selected 30 spectra classified into the single-peak and double-peak (target) classes from the LAMOST pool (hereafter the performance estimation sample). The size of 30 was chosen as a good trade-off between confidence and the demands of visual verification. We then manually labelled the performance estimation sample and compared our labels with the labels predicted by our CNN, and thus estimated the precision after each iteration. The performance estimation sample of 30 spectra functions as a test set. In a standard machine learning scenario, a test set is a random sample of all unseen data that could be put into the CNN. In our case, all possible data for our CNN are in the so far unlabelled LAMOST pool. Therefore, the performance estimation sample provides an unbiased estimate of precision. We would like to point out that the manual labelling of the performance estimation sample is different from the labelling by the oracle. The labels of the performance estimation sample are forgotten after the precision estimation, and the spectra are left in the LAMOST pool.

Fig. 4. Architecture of our CNN: input (140-pixel spectrum), conv3-64, conv3-64, maxpool2, conv3-128, conv3-128, maxpool2, conv3-256, conv3-256, maxpool2, fc-512, dropout, fc-512, dropout, softmax. Convolutional layers are marked as conv3, where the number 3 is the filter size in pixels; the mark is followed by a dash and the number of filters. maxpool2 denotes a max-pooling layer with pool size 2, stride 2, and no padding. fc-512 denotes a fully connected layer with 512 units, and softmax is the output layer. Dropout layers (Srivastava et al. 2014) with the rate hyperparameter set to 0.5 are used as regularisers.

Fig. 5. Estimated precision from a sample of 30 single-peak and double-peak spectra for each iteration (the zeroth iteration is estimated when the CNN is trained only with the initial Ondřejov dataset).

Finally, we stopped our experiment in the 17th iteration, when the estimated precision reached more than the predefined threshold (in our case 80%) for the third time. We chose the values of these parameters as a trade-off between time and performance requirements; they can be chosen differently for different datasets. Figure 5 displays the precision of our CNN over the 17 active learning iterations.
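The batch-selection step described in this section can be sketched as follows. This is an illustrative reimplementation in plain Python, not the authors' code; the toy softmax outputs below stand in for the CNN's class-probability predictions.

```python
import math

def entropy(probs):
    """Shannon entropy of one softmax output (class probability vector)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_batch(pool_probs, batch_size=100):
    """Return indices of the pool spectra with the highest predictive
    entropy (the uncertainty sampling strategy); these are the spectra
    sent to the oracle for visual labelling."""
    scored = sorted(range(len(pool_probs)),
                    key=lambda i: entropy(pool_probs[i]),
                    reverse=True)
    return scored[:batch_size]

# Toy pool of softmax outputs for the three classes
# (single peak, double peak, uninteresting):
pool = [
    [0.98, 0.01, 0.01],   # confident prediction -> low entropy
    [0.34, 0.33, 0.33],   # maximally uncertain -> high entropy
    [0.60, 0.30, 0.10],
]
print(select_batch(pool, batch_size=2))  # → [1, 2]
```

With a real pool, `pool_probs` would be the softmax outputs of the trained CNN over all 4.1 million unlabelled spectra, and `batch_size` would be 100, as in the paper.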
Table 1.
Partial confusion matrix of the final classification of our experiment (excluding candidates found by the oracle). The numbers show the percentage and counts (in brackets) of correctly predicted spectra among all spectra predicted for a given class. The 4 161 spectra in this table are all the candidates predicted as single or double peaks after the long training. After we visually reviewed all of them, we found that 58 of the candidates are uninteresting spectra (37 predicted as single peaks and 21 predicted as double peaks). The target classes also include some misclassifications: 53 double peaks are classified as single peaks, and 18 single peaks are classified as double peaks. We note that we were unable to compute the last row of the uninteresting class because it would mean visually classifying all the four million spectra that are predicted as uninteresting.
Predicted        Actual class
class            single peak      double peak     uninteresting
single peak      97.5% (3 484)    1.5% (53)       1.0% (37)
double peak      3.1% (18)        93.4% (548)     3.6% (21)

Because the training of our CNN was time-consuming, we sped up the method by training the CNN during the active learning phase for a smaller number of epochs. Then, after the active learning phase, we ran the Adam optimisation algorithm of the CNN for a longer time (the training was stopped when the loss function did not improve by 10⁻… during 100 training iterations) to ensure that good convergence was achieved, and thus fewer false candidates would be produced. In the following text, we refer to this step as long training.
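The stopping rule of the long training can be sketched as a patience-based early-stopping loop. This is a minimal sketch, not the authors' code; the tolerance of 1e-4 and the synthetic loss curve are assumptions for illustration only.

```python
def train_long(loss_steps, min_delta=1e-4, patience=100):
    """Consume training-step losses until the loss fails to improve by at
    least min_delta for `patience` consecutive iterations; return the
    number of iterations actually run. `loss_steps` is any iterable
    yielding the loss value after each training step."""
    best = float("inf")
    stale = 0          # consecutive iterations without sufficient improvement
    n = 0
    for loss in loss_steps:
        n += 1
        if best - loss >= min_delta:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break   # converged: no improvement for `patience` iterations
    return n

# Synthetic loss curve: improves for 50 steps, then plateaus.
losses = [1.0 / (i + 1) for i in range(50)] + [0.02] * 500
print(train_long(losses, min_delta=1e-4, patience=100))  # → 150
```

In the paper's setting, the loop body would correspond to Adam optimisation steps of the CNN; deep learning frameworks provide equivalent built-in early-stopping callbacks.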
4. Results
Our method identified 4 379 candidate spectra with signatures of emission-line profiles, including candidates found by the oracle, in all the 4 136 482 LAMOST DR2 spectra. The last CNN predicted 3 574 spectra as single-peak and 587 as double-peak profiles, while the oracle found 157 single-peak candidates and 61 double-peak candidates during the active learning phase. As explained earlier, this also includes absorption profiles with small visible disturbances that may be caused by additional circumstellar emission. After visual inspection (see Appendix B) of the predicted candidates, we rejected 58 as bad (partly destroyed, noisy, or with pure absorption profiles) and computed the partial confusion matrix in Table 1. Finally, we had a set of 4 321 spectra of individual objects whose exact number is difficult to estimate because of cross-matching problems, as explained in Appendix B.2. Moreover, through the visual preview of candidates, several normal and Seyfert galaxies and a high-velocity star, LAMOST-HVS1, were also identified. Lastly, we compared our active deep learning method to the dual non-active learning scenario in Appendix A. The comparison shows the significant gain of our active deep learning method.
The final catalogues of spectra of all our emission-line candidates obtained by active deep learning are available at the CDS (see the footnote on the title page), and they are also stored in the science cloud in the Zenodo repository (Škoda et al. 2019) for further investigation. HTML tables list the spectra that were known to H16, spectra of new objects that we were able to cross-match with SIMBAD, and all our new, so far unknown emission-line spectra. For the sake of completeness, we also show the table of spectra that were visually proved to be in some way broken, extremely noisy, or lacking emission signatures. All of these tables also contain direct links to the CDS VizieR repository of LAMOST DR2, where the particular spectrum may be interactively plotted. Furthermore, there are CSV versions (suitable for spreadsheets) and the VOTable format for further analysis in VO tools, such as Aladin, Topcat, or SPLAT-VO. The electronic PDF version of this paper includes links to various public resources, such as the LAMOST or SDSS archives of spectra, DSS images of the sky, or detailed descriptions of objects in SIMBAD, including links to our previews stored at Zenodo as well.

An important research result is also the list of objects with unusual spectral properties. Some objects may have been caught by a LAMOST observation just in a particular evolutionary phase, or we might have witnessed the outburst of an unknown nova or supernova. Many such spectra correspond to known SIMBAD objects, but the SIMBAD class of such objects may be different from what we see. Very interesting objects were also found by checking the bad (noisy or artefact-contaminated) spectra in detail. For example, a part of the spectral range could be destroyed, but the remaining part with a dominant line profile may still be intact.
Our candidate list also includes 1 013 spectra of 948 objects that are neither cross-matched with SIMBAD nor listed by H16. They cover a wide range of line profile shapes belonging to new, so far unknown Be stars, T Tau stars, cataclysmic variables, close binaries, symbiotic stars, etc. Figures 6 and 7 show randomly selected examples of interesting spectra of each target class.

Some stars have quite peculiar spectra that exhibit complicated profiles in multiple lines, which promises interesting physical conditions. Spectrum spec-56661-GAC061N34B1_sp07-028 of object LAMOST J040901.83+… is a candidate Wolf-Rayet WN star (Fig. 8), and spectrum spec-56699-GAC085N52V3_sp15-178 belongs to object LAMOST J053944.81+… (Fig. 9).

(Zenodo records: https://doi.org/10.5281/zenodo.3241520, https://doi.org/10.5281/zenodo.3236165)

Fig. 6. Examples of spectra classified as single peaks (panels: spec-56604-EG041109N021906M01_sp14-049, spec-56650-GAC113N27B1_sp08-235, spec-56266-GAC080N32V2_sp16-192; flux in counts).
Fig. 7. Examples of spectra classified as double peaks (panels: spec-55967-GAC073N44V4_sp15-074, spec-56706-GAC097N20V2_sp15-157, spec-56350-GAC089N28V1_sp08-022, spec-56627-GAC056N46V1_sp11-237; flux in counts).

Fig. 8. Candidate Wolf-Rayet WN star LAMOST J040901.83+… (spectrum spec-56661-GAC061N34B1_sp07-028; flux in counts).

Fig. 9. Object LAMOST J053944.81+… (spectrum spec-56699-GAC085N52V3_sp15-178; flux in counts).

The spectrum of object LAMOST J053944.81+… shows an incorrect joining of the LAMOST red and blue spectrographs. This was done by the reduction pipeline (therefore it is listed in the table of bad candidates), but the red part shows P Cyg and inverse P Cyg profiles in several lines and emission combined with absorption at 7 610 Å (see Fig. 9). In the DSS2 survey, a strange object with the signature of an edge-on ring (resembling Saturn) lies at the given position. However, the object is resolved into three in-line stars in the PanSTARRS-1 survey. Both satellite stars are perfectly aligned in a straight line with the bright central object, and they are separated by almost exactly 6.7″ from its centre. The configuration has not changed from the PanSTARRS exposures (secured multiple times in the years 2011 to 2014) up to now, which we confirmed on 15 April 2020 by an exposure with the newly commissioned photometric camera of the 2 m Perek telescope at the Ondřejov observatory. The symmetrical configuration deserves further investigation, as it may be an effect of gravitational lensing.

Spectrum spec-56202-EG042015S023742V01_sp15-180 in Fig. 10 is very similar, including the same bug in the joining of the red and blue spectrographs. It shows the object LAMOST J041919.80-020211.6, which is neither known to SIMBAD nor exposed by the SDSS. It has two emissions at 5 790 Å and 5 820 Å, an Hα absorption with an emission peak in the red wing, and a strong emission combined with absorption at 7 600 Å. A P Cyg profile is visible at 6 870 Å.

Fig. 10. Object LAMOST J041919.80-020211.6 with complex profiles (spectrum spec-56202-EG042015S023742V01_sp15-180; flux in counts versus vacuum wavelength in Å).

Figure 11 gives examples of other newly discovered objects with complex spectra. Three spectra with a very bright and wide single-peak emission line with an FWHM of about 200 Å are very unusual. H16 identified none of these objects, and they cannot be cross-matched with SIMBAD. We have found all of them in SDSS DR15, but no estimate of the spectral class is given there; they are claimed to be stars, however. Because of their extremely wide red-shifted Hα line (the other lines are hidden in the noise) and because several galaxies are seen around them, we speculate that they may be distant supernovae. However, they might be mere reduction artefacts as well, therefore further investigation is desirable.

Spectrum spec-56012-F5601204_sp10-030 belongs to object LAMOST J114232.73-011535.9. In the SDSS DR15 colour image at this position, a faint white star is visible that may be identified with SDSS J114232.73-011535.9, surrounded by galaxies. Spectrum spec-56012-F5601204_sp02-158 is the LAMOST J114009.42-012454.3 equivalent of SDSS J114009.42-012454.3; a faint orange object is surrounded by a number of galaxies seen within 10″ in the SDSS. The last spectrum of a similar shape is spec-56396-HD165712N321400M01_sp06-166 of object LAMOST J170758.50+…, …″ apart. The object is also present in the 2MASS and GALEX surveys.

Visual inspection of the suggested candidates also resulted in the identification of several extragalactic objects, both Seyfert and normal galaxies (confirmed by SIMBAD): spectrum spec-56716-HD114322N280318M_sp08-042 belongs to object LAMOST J114631.67+…, spec-56657-M31020N36M1_sp08-216 to object LAMOST J012555.94+…, spec-56798-HD141746N331518M01_sp16-150 to object LAMOST J141403.15+…, spec-56752-HD150254N020528B01_sp08-150 to object LAMOST J150711.68+…, spec-56304-HD083110N401329F01_sp09-064 to object LAMOST J083426.80+…, and spec-56633-HD095000N333605M01_sp10-109 to object LAMOST J093915.39+….

[Fig. 11 panels: spec-56096-kepler08F56096_sp05-155, spec-56609-GAC093N22B1_sp05-096, spec-56609-GAC093N22M1_sp08-237, spec-56679-GAC078N31B1_sp09-075; flux in counts]
Fig. 11. Random discovered objects with complex profiles.

Fig. 12. Supernova candidates identified in the LAMOST DR2 (panels: spec-56012-F5601204_sp10-030, spec-56012-F5601204_sp02-158, spec-56396-HD165712N321400M01_sp06-166; flux in counts).
Spectrum spec-56739-HD114047N212109B_sp10-131 of object LAMOST J113114.00+… …

A group of objects with a considerably red- or blue-shifted Hα line was identified as target-class objects by our CNN, despite the relatively minor deformation of the absorption line profile, which is comparable with the noise level. It appears that most of them are stars labelled in the LAMOST observing program as M31 targets, coming from the observing plan that covered the central regions of the M31 and M33 galaxies (Luo et al. 2015). The spectra clearly show that the centre of the Hα emission line is heavily blue-shifted, which corresponds to the line-of-sight velocities of −301 km s⁻¹ and −180 km s⁻¹ of M31 and M33, respectively (van der Marel & Guhathakurta 2008).

However, other objects are not associated with galaxies by the LAMOST observing plan. The most red-shifted is the spectrum (available in our previews) spec-56357-HD090901N073047B01_sp12-133 of object LAMOST J091206.52+… (Zheng et al. 2014). This spectrum is classified in the LAMOST DR2 archive as a galaxy, while its FITS header, from which our preview above comes, gives an A0III star. Another spectrum (spec-56285-HD090744N104005B03_sp07-117) of the same object, classified in the DR2 archive as an A0III star, is available.

During the visual preview, we found another two objects with a highly red-shifted radial velocity. Spectrum spec-55997-B5599703_sp14-056 of object LAMOST J105350.26+… gives radial velocities of … in Hα and 293 km s⁻¹ in Hβ. Four other spectra of the same star are listed in DR2, namely spec-55910-B91005_sp14-056, spec-55998-B5599803_sp14-056, spec-56638-HD104953N275826B01_sp08-015, and spec-56685-HD104953N275826M01_sp08-015. Spectrum spec-56617-VB081S05V1_sp02-071 of object LAMOST J052354.52-070508.3 gives radial velocities of 358 km s⁻¹ in Hα and 351 km s⁻¹ in Hβ; the same holds for another spectrum, spec-56617-VB081S05V2_sp02-071.

In spectrum spec-56393-HD172143N395828M01_sp16-162 of object LAMOST J171623.21+…, Hα seems to be blue-shifted by about 710 km s⁻¹ and the Paschen P13 and P15 lines by 660 km s⁻¹ and 630 km s⁻¹, respectively. Because of the high noise and the apparent asymmetry of the lines, it is impossible to obtain a precision better than about 10 km s⁻¹, but it is evident that this star is one of the fastest stars in our Galaxy and approaches us, unlike HVS1. However, it may be of extragalactic origin as well, as the image of the object in SDSS DR15 is partly saturated and lies less than 5″ from a large elliptical galaxy. It is known in the GALEX survey as GALEXASC J171623.29+….
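The line-of-sight velocities quoted above follow from the classical Doppler relation v = c Δλ/λ₀ applied to the shifted Hα centre. A minimal sketch, where the observed wavelength is a made-up illustration, not a measurement from this work:

```python
C_KM_S = 299_792.458   # speed of light in km/s
H_ALPHA = 6562.8       # rest (air) wavelength of the H-alpha line in Angstrom

def radial_velocity(observed, rest=H_ALPHA):
    """Classical (non-relativistic) Doppler radial velocity in km/s;
    negative values mean the object approaches us (blue shift)."""
    return C_KM_S * (observed - rest) / rest

# Hypothetical blue-shifted H-alpha line centre:
print(round(radial_velocity(6556.2)))  # → -301
```

A blue shift of only about 6.6 Å in Hα thus already corresponds to roughly −300 km s⁻¹, which is why even a small displacement of the emission-line centre is significant.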
5. Conclusions
We have introduced a promising method for the discovery of objects of interest in large archives based on active deep learning. This technique, supported by interactive visual classification of a small sample of suggested target classes, is very efficient and has led to the discovery of many new, previously unknown emission-line stars.

To the best of our knowledge, this is the first application of active deep learning techniques to spectral classification in astronomy. Many details still need to be elaborated, and more experiments must be run on different samples of various types of spectra. The main advantage of the method is that target classes with characteristic spectral features can be identified in cases where classical deep learning fails because not enough labelled examples are available.

Our experiments identified many emission-line candidates that deserve more detailed examination because they may hide rare astronomical objects with interesting physical properties. All results are publicly available at Zenodo and will also be uploaded to the main astronomical catalogue repository, the CDS VizieR.

Acknowledgements.
The early stages of this research were supported by grant LD-15113 of the Ministry of Education, Youth, and Sports of the Czech Republic and by the COST Action TD1403 Big Sky Earth. The final parts of the research and all the experiments described here were supported by the same Ministry in the project Research Center for Informatics, CZ.02.1.01/…/OHK3/…/18. Furthermore, this work is based on spectra from the Ondřejov 2 m Perek telescope and the public LAMOST DR2 survey. Therefore, we would like to thank the Guoshoujing Telescope. Guoshoujing Telescope (the Large Sky Area Multi-Object Fiber Spectroscopic Telescope, LAMOST) is a National Major Scientific Project built by the Chinese Academy of Sciences. Funding for the project has been provided by the National Development and Reform Commission. LAMOST is operated and managed by the National Astronomical Observatories, Chinese Academy of Sciences. We are namely indebted to Dr. Chenzhou Cui for long-term support of our activities in the framework of the collaboration with the Chinese VO. Finally, we also express our thanks to Miroslav Šlechta for the reduction of all CCD700 spectra used in our research.
References
Abadi, M., Agarwal, A., Barham, P., et al. 2015, TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, software available from tensorflow.org
Alger, M. J., Banfield, J. K., Ong, C. S., et al. 2018, MNRAS, 478, 5547
Alhassan, W., Taylor, A. R., & Vaccari, M. 2018, MNRAS, 480, 2085
Aniyan, A. K. & Thorat, K. 2017, ApJS, 230, 20
Bonnarel, F., Fernique, P., Bienaymé, O., et al. 2000, A&AS, 143, 33
Bromová, P., Škoda, P., & Vážný, J. 2014, International Journal of Automation and Computing, 11, 265
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. 2002, Journal of Artificial Intelligence Research, 16, 321
Chollet, F. et al. 2015, Keras, https://keras.io, accessed: 2 July 2020
de la Calleja, J., Benitez, A., Medina, M. A., & Fuentes, O. 2011, in 2011 International Conference of Soft Computing and Pattern Recognition, 435–439
Demleitner, M., Neves, M. C., Rothmaier, F., & Wambsganss, J. 2014, Astronomy and Computing, 7–8, 27
Domínguez Sánchez, H., Huertas-Company, M., Bernardi, M., Tuccillo, D., & Fischer, J. L. 2018, MNRAS, 476, 3661
Fernique, P., Allen, M. G., Boch, T., et al. 2015, A&A, 578, A114
George, D. & Huerta, E. A. 2018, Phys. Rev. D, 97, 044039
Glorot, X., Bordes, A., & Bengio, Y. 2011, in 28th International Conference on Machine Learning, ed. L. Getoor & T. Scheffer (New York, USA: ACM), 513–520
Goodfellow, I., Bengio, Y., & Courville, A. 2016, Deep Learning (MIT Press), accessed: 2 July 2020
Gray, R. O. & Corbally, Christopher, J. 2009, Stellar Spectral Classification (Princeton University Press)
Gupta, K. D., Pampana, R., Vilalta, R., Ishida, E. E. O., & de Souza, R. S. 2016, in 2016 IEEE Symposium Series on Computational Intelligence (SSCI), Athens, 1–8
Hanuschik, R. W., Kozok, J. R., & Kaiser, D. 1988, A&A, 189, 147
Heiter, U. 2014, Air-to-vacuum conversion, accessed: 2 July 2020
Hou, W., Luo, A.-L., Hu, J.-Y., et al. 2016, Research in Astronomy and Astrophysics, 16, 138
Ishida, E. E. O., Beck, R., González-Gaitán, S., et al. 2019a, MNRAS, 483, 2
Ishida, E. E. O., Kornilov, M. V., Malanchev, K. L., et al. 2019b, Active Anomaly Detection for time-domain discoveries
Kang, W. & Lee, S.-G. 2012, MNRAS, 425, 3162
Kingma, D. P. & Ba, J. 2015, in 3rd International Conference on Learning Representations, ed. Y. Bengio & Y. LeCun
Kogure, T. & Leung, K.-C., eds. 2007, Astrophysics and Space Science Library, Vol. 342, The Astrophysics of Emission-Line Stars (Springer)
Krizhevsky, A., Sutskever, I., & Hinton, G. E. 2012, in Advances in Neural Information Processing Systems 25, ed. F. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Curran Associates, Inc.), 1097–1105
Kurosawa, R., Harries, T. J., & Symington, N. H. 2006, MNRAS, 370, 580
LeCun, Y., Bengio, Y., & Hinton, G. 2015, Nature, 521, 436
LeCun, Y., Boser, B., Denker, J. S., et al. 1989, Neural Computation, 1, 541
Lee, Y. S., Beers, T. C., Carlin, J. L., et al. 2015, AJ, 150, 187
Lee, Y. S., Beers, T. C., Sivarani, T., et al. 2008, AJ, 136, 2022
Lin, C.-C., Hou, J.-L., Chen, L., et al. 2015, Research in Astronomy and Astrophysics, 15, 1325
Liu, P., Zhang, H., & Eom, K. B. 2017, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 10, 712
Luo, A.-L., Zhao, Y.-H., Zhao, G., et al. 2016, VizieR Online Data Catalog, V/…
https://dspace.cvut.cz/handle/10467/69666?locale-attribute=en, accessed: 2 July 2020
Porter, J. M. & Rivinius, T. 2003, PASP, 115, 1153
Prati, R. C., Batista, G. E., & Monard, M. C. 2009, in 4th Indian International Conference on Artificial Intelligence, 359–376
Rastgoo, M., Lemaitre, G., Massich, J., et al. 2016, in 3rd International Conference on Bioimaging
Reipurth, B., Pedrosa, A., & Lago, M. T. V. T. 1996, A&AS, 120, 229
Richards, J. W., Starr, D. L., Brink, H., et al. 2012, ApJ, 744, 192
Russakovsky, O., Deng, J., Su, H., et al. 2015, International Journal of Computer Vision, 115, 211
Settles, B. 2009, Active Learning Literature Survey, Computer Sciences Technical Report 1648, University of Wisconsin–Madison
Silaj, J., Jones, C. E., Tycner, C., Sigut, T. A. A., & Smith, A. D. 2010, ApJS, 187, 228
Simonyan, K. & Zisserman, A. 2015, in 3rd International Conference on Learning Representations, ed. Y. Bengio & Y. LeCun
Škoda, P. 2019, A complex orchestration of VO tools in discovery of unknown sky objects, doi:10.5281/zenodo.3242658
Škoda, P., Bromová, P., Lopatovský, L., Palička, A., & Vážný, J. 2015, in ASP Conf. Ser., Vol. 495, Astronomical Data Analysis Software and Systems XXIV, ed. A. R. Taylor & E. Rosolowsky, 87
Škoda, P., Draper, P. W., Neves, M. C., Andrešič, D., & Jenness, T. 2014, Astronomy and Computing, 7–8, 108
Škoda, P., Palička, A., Koza, J., & Shakurova, K. 2016, in Astroinformatics, Proceedings of the IAU Symposium, ed. G. Longo & S. Cavuoti, No. 325 (Cambridge University Press), 180–185
Škoda, P., Podsztavek, O., & Tvrdík, P. 2019, Tables of Emission Line Spectra discovered by Active Deep Learning in LAMOST DR2, doi:10.5281/zenodo.3241520
Škoda, P. & Šlechta, M. 2002, Publications of the Astronomical Institute of the Czechoslovak Academy of Sciences, 90, 22
Škoda, P. & Vážný, J. 2012, in ASP Conf. Ser., Vol. 461, Astronomical Data Analysis Software and Systems XXI, ed. P. Ballester, D. Egret, & N. P. F. Lorente, 573
Solorio, T., Fuentes, O., Terlevich, R., & Terlevich, E. 2005, MNRAS, 363, 543
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. 2014, Journal of Machine Learning Research, 15, 1929
Taigman, Y., Polyak, A., & Wolf, L. 2017, in 5th International Conference on Learning Representations (OpenReview.net)
Taylor, M. B. 2005, in ASP Conf. Ser., Vol. 347, Astronomical Data Analysis Software and Systems XIV, ed. P. Shopbell, M. Britton, & R. Ebert, 29
Taylor, M. B., Boch, T., & Taylor, J. 2015, Astronomy and Computing, 11, 81
Tody, D., Dolensky, M., McDowell, J., et al. 2012, Simple Spectral Access Protocol Version 1.1, IVOA Recommendation 10 February 2012
Traven, G., Zwitter, T., Van Eck, S., et al. 2015, A&A, 581, A52
van der Marel, R. P. & Guhathakurta, P. 2008, ApJ, 678, 187
Vilalta, R., Gupta, K. D., Boumber, D., & Meskhi, M. M. 2019, PASP, 131
Walborn, N. R. & Fitzpatrick, E. L. 2000, PASP, 112, 50
Waters, C. Z. & Hollek, J. K. 2013, PASP, 125, 1164
Wu, Y., Luo, A.-L., Li, H.-N., et al. 2011, Research in Astronomy and Astrophysics, 11, 924
Yang, L., MacEachren, A., Mitra, P., & Onorati, T. 2018, ISPRS International Journal of Geo-Information, 7
Zeiler, M. D. & Fergus, R. 2014, in Computer Vision – ECCV 2014, ed. D. Fleet, T. Pajdla, B. Schiele, & T. Tuytelaars (Cham: Springer International Publishing), 818–833
Zheng, Z., Carlin, J. L., Beers, T. C., et al. 2014, ApJ, 785, L23
Zickgraf, F.-J. 2003, A&A, 408, 257
Appendix A: Comparison with non-active learning
To clarify the real gain of active learning, we compared our active deep learning method to a non-active learning dual scenario. The non-active learning can be considered the zeroth iteration of the application of our active deep learning in Sect. 3. However, the zeroth iteration in our application is carried out in the sped-up regime (see the last paragraph of Sect. 3.4).

We carried out an independent experiment to prove the benefits of active learning. We trained our CNN using the setting of the long training with the initial training set of our active deep learning method (the preprocessed Ondřejov dataset), and we used the trained CNN to classify all the spectra in the LAMOST pool. Then we estimated the precision of the CNN from random samples of 100 spectra from each target class. In order to reach a more reliable conclusion, we ran the experiment three times. The results are shown in Table A.1.
Table A.1.
Results of three runs of non-active learning. The table shows the precision estimated from a random sample of 100 spectra from each target class; the numbers in brackets are the counts of spectra classified into each target class.
Run      Estimated precision (predicted spectrum count)
         single peak        double peak
No. 1    4.1% (343 988)     2.0% (248 336)
No. 2    3.0% (301 396)     2.0% (409 908)
No. 3    4.0% (167 545)     0.0% (342 230)

The comparison of Table A.1 and Table 1 shows that the three CNNs were unable to learn without the support of the spectra from LAMOST added to the training set by active learning. Therefore we conclude that the gain of our active deep learning method is significant.
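As an aside on the sampling uncertainty inherent in such estimates (an illustration, not a computation from the paper): the precision estimated from a random sample of checked predictions carries a binomial standard error, which can be sketched as:

```python
import math

def precision_estimate(n_correct, n_sampled):
    """Estimated precision and its normal-approximation (binomial)
    standard error when n_correct of n_sampled randomly drawn
    predictions turn out to be correct on visual inspection."""
    p = n_correct / n_sampled
    se = math.sqrt(p * (1.0 - p) / n_sampled)
    return p, se

# E.g. 4 correct target spectra in a random sample of 100 predictions:
p, se = precision_estimate(4, 100)
print(f"{p:.2f} +/- {se:.3f}")  # → 0.04 +/- 0.020
```

This illustrates why a sample of 100 is enough to distinguish the few-percent precision of the non-active runs from the >90% precision in Table 1, even though the absolute estimates carry a percent-level uncertainty.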
Appendix B: Detailed analysis of emission-line candidates
All spectra in the candidate list predicted by our CNN were visually inspected to confirm the correct prediction of their classes. We reviewed not only the limited spectral range around the Hα line given to the CNN (which was the only range available in the Ondřejov dataset), but also the whole LAMOST spectrum displayed in linear wavelengths (instead of the original logarithmic scale of the input FITS files).

We removed from the candidate spectra those that were clearly bad (reduction artefacts, missing data in the studied spectral range) or were so noisy that the line profile drawn as a continuous line was broken in a sawtooth-like manner. Other bad spectra belonged to cold stars, where the molecular bands, smeared because of the low resolution of LAMOST, mimicked the searched emission profile in a small zoomed range, but the rest of the continuum had a similar variability amplitude. The rest were spectra in the non-target class showing only absorption lines without signatures of emission. We removed 58 spectra in total. The list is available in the on-line tables published on Zenodo and at the CDS. Despite being dropped into the uninteresting (non-target) class, the visual inspection of the apparently bad spectra also yielded several interesting objects, such as high-velocity stars or objects with complicated physics, as described in Sects. 4.2.2 and 4.2.4.

In general, it was challenging to decide about the quality of the data and the correct classification in these boundary cases. We took the size of the line profile disturbance, the fuzziness of the spectra (and namely of their continua), and other metadata (obtained from the Virtual Observatory as explained below) into account to classify an object as one with expected emission (e.g. a Be star). Our overall experience with this very subjective classification of the boundary cases was, however, very surprising: the deep CNN was able to see even very tiny signatures of the expected shape structures that a human could barely see.
Appendix B.1: Technology of the Virtual Observatory
The verification of the performance of our active deep learning method required many visualisations of spectra with the possibility of previewing entire spectra as well as their zoomed parts. In many cases, we used positional cross-matching followed by visual inspection of the appearance of a candidate object in common all-sky imaging and photometric surveys. This task would be extremely tedious without the Virtual Observatory technology based on the IVOA standards, namely the combination of the Table Access (Nandrekar-Heinis et al. 2014) and Simple Spectral Access (Tody et al. 2012) protocols and VO client applications such as TOPCAT (Taylor 2005), Aladin (Bonnarel et al. 2000), or SPLAT-VO (Škoda et al. 2014). All LAMOST DR2 FITS files, converted into linear wavelength in Ångströms, were ingested into a VO server based on the DaCHS system (Demleitner et al. 2014) that runs locally in Ondřejov, and the links (called accref or access_url) to individual spectra on that server were joined with the spectrum names in the candidate table. This allowed an immediate visualisation and interactive zooming of every spectrum in SPLAT-VO. We were unable to use the original China-VO services because DR2 is not available via the SSAP protocol (only DR1 is).

The combination of the TOPCAT, Aladin, and SPLAT-VO tools interlinked using the SAMP protocol (Taylor et al. 2015) allowed us to set up a powerful workflow. In addition to basic operations on tables (e.g. sorting, counting, ordering, and searching in rows) and cross-matching using internal TOPCAT capabilities, we extensively used the CDS cross-match service with SIMBAD running on CDS computers. The resulting tables were then sent to Aladin and SPLAT-VO, and TOPCAT activation actions were set on them, so that the selection of every row in the TOPCAT table triggered the sending of the object coordinates to Aladin.
Here the detailed image of a star or galaxy from the all-sky surveys DSS2, 2MASS, SDSS, and GALEX quickly appeared thanks to the hierarchical progressive survey (HiPS) technology (Fernique et al. 2015). In addition, the TOPCAT activation action also triggered the sending of the accref content to SPLAT-VO, where the corresponding spectrum was shown. We verified the correct cross-matching with SIMBAD by over-plotting all SIMBAD objects that are visible in the field in Aladin (also based on the HiPS catalogues), and placing the pointer at a particular target resulted in showing the SIMBAD web page about the object in a browser. More details are shown in our presentation from the Paris IVOA interoperability meeting (Škoda 2019), which is available on Zenodo (https://doi.org/10.5281/zenodo.3242658).

The productivity of candidate verification was enormously increased by the VO technology in comparison to a manual search for information in multiple sources. It also allowed us to discover the effects described below (e.g. the misplacement of an optical fibre). In addition to the VO exploitation, SPLAT-VO was also used for the direct measurement of radial velocities through its built-in method of line profile mirroring (Parimucha & Škoda …).
Appendix B.2: Multiplicity of exposures
Some objects in the LAMOST DR2 were observed several times, usually twice in different epochs, but a few spectra in our candidate list were also exposed five times. This fact may be used for analysing the evolution of the line profiles, which is typical for some Be stars. For example, the comparison of spectrum spec-56625-GAC113N37V1_sp07-098 of object LAMOST J074244.51+… with spec-56344-GAC113N37V1_sp07-098 shows a double-peak emission that diminishes into a deep absorption line, while spectrum spec-56350-GAC089N28V1_sp04-121 of object LAMOST J055821.00+… compared with spec-55876-GAC_089N28_B2_sp04-121 shows the blue-shifted peak of a double-peak profile that decreases, while the red-shifted peak is stable and the absorption in the line core is much deeper.

We grouped 855 of the 4 379 candidate spectra into 398 groups with the same designation as stated in the LAMOST header. The remaining 3 524 objects seemed to have only one exposure. However, internal cross-matching using the coordinates within a radius of 5″ and visual verification of the spectral shapes identified several objects with different designations that lie close together. In this way, we were able to identify 436 groups of multiple exposures within a radius of 5″. This, however, complicates the cross-matching because we cannot consider the object designation a unique identifier. For example, objects LAMOST J053611.80+…, … are different stars with almost identical spectra: spectra spec-56204-GAC080N33B101_sp08-234 of star LAMOST J052402.81+… and spec-56306-GAC080N33B2_sp08-234 of star LAMOST J052401.53+… both show Hα emission, although the stars are 24″ apart and have different SIMBAD names, 2MASS J05240280+… and ….

Appendix B.3: LAMOST classification pipeline flaws
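The positional grouping of multiple exposures can be sketched with a plain spherical-trigonometry separation test. This is a simplified stand-in for the TOPCAT/CDS cross-match tools actually used; the coordinates below are arbitrary examples.

```python
import math

def ang_sep_arcsec(ra1, dec1, ra2, dec2):
    """Angular separation of two sky positions (input in degrees,
    output in arcseconds), using the spherical law of cosines."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    cos_sep = (math.sin(dec1) * math.sin(dec2)
               + math.cos(dec1) * math.cos(dec2) * math.cos(ra1 - ra2))
    return math.degrees(math.acos(min(1.0, cos_sep))) * 3600.0

def same_object(pos_a, pos_b, radius_arcsec=5.0):
    """Group two exposures as one object if their positions fall
    within the given matching radius (5 arcsec, as in the paper)."""
    return ang_sep_arcsec(*pos_a, *pos_b) <= radius_arcsec

# Two exposures ~3.6 arcsec apart in declination (arbitrary values):
print(same_object((81.05, 33.5), (81.05, 33.501)))  # True
# ~30 arcsec apart in right ascension at this declination:
print(same_object((81.05, 33.5), (81.06, 33.5)))    # False
```

For separations of a few arcseconds the law-of-cosines form is numerically adequate in double precision; production cross-match tools use haversine-like formulations and indexed sky tessellations for robustness and speed.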
The LAMOST FITS headers contain an estimate of the spectral type of almost all stars as well as labels produced by the automatic pipeline (Wu et al. 2011) that mark the objects as non-stellar (galaxy, quasar, or unknown). However, we noted various inconsistencies in these labels. For example, object LAMOST J053040.90+… is classified in spec-56218-GAC083N27B2_sp02-234 as a B6 star, while in spec-56271-GAC084N26B1_sp10-057 it is A2V, although the visually examined line profiles look very similar. Although some stars are typical Be stars (identified by H16 as classical Be stars), the LAMOST classification mostly assigns class A, but also class F or G. Objects marked as 'Non' are often just bad data with reduction artefacts, but there are also some very interesting cases, as was shown in the previous sections.

Appendix B.4: Comparison with H16
The main reason why we used the LAMOST DR2 survey for our analysis instead of publicly available later versions (e.g. DR4, with 7.6 million spectra, has been available since July 2018) is not the lower data volume (which facilitates the whole analysis), but the excellent opportunity of comparing our active deep learning methods with the more straightforward pattern-recognition algorithm of H16 on the same data set. Their catalogue gives a more detailed line profile classification into six classes (obtained by cross-correlation with 27 templates) and attempts to cross-match the objects with other known emission-star catalogues, namely with SIMBAD.

The detailed analysis of the cross-matching with our candidate list has, however, identified a number of problems that prevented us from using their catalogue for a direct comparison of the performance of our method with theirs, despite the same data set and the same target class (emission-line objects).

The first discrepancy is the non-unique identification of objects. As they do not give spectrum identifications but only an object designation, we joined our candidate list with their catalogue using the LAMOST designation (which is in fact an encoding of coordinates in J2000). However, coordinate cross-matching using circles with a radius of 3″ identified objects whose coordinates stated by H16 were different from those in the LAMOST header. For example, for object LAMOST J015611.38+…, the coordinates given by H16 are offset by several arcseconds from those in the FITS header. As the object J034031.33+… showed a difference of several arcseconds between the coordinates in the H16 catalogue and the LAMOST header, we cross-matched our candidates with H16 with a radius of 5″ instead of using just the verbatim match of designations.
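Since the LAMOST designation encodes the J2000 position, the designation-based join can also be backed by decoding it directly. A minimal sketch, assuming the common IAU-style 'Jhhmmss.ss±ddmmss.s' layout; the helper name and the exact format handling are our assumptions, not taken from the paper:

```python
import re

def designation_to_deg(designation):
    """Decode a 'Jhhmmss.ss±ddmmss.s' designation into (ra, dec) in degrees.

    Hypothetical helper: the layout is the usual IAU J2000 convention;
    coordinates from the FITS header should be preferred when available.
    """
    m = re.fullmatch(
        r"J(\d{2})(\d{2})(\d{2}(?:\.\d+)?)([+-])(\d{2})(\d{2})(\d{2}(?:\.\d+)?)",
        designation.replace("LAMOST ", "").strip())
    if m is None:
        raise ValueError(f"not a Jhhmmss.ss+/-ddmmss.s designation: {designation!r}")
    h, mi, s, sign, d, am, asec = m.groups()
    ra = (int(h) + int(mi) / 60.0 + float(s) / 3600.0) * 15.0  # hours -> degrees
    dec = int(d) + int(am) / 60.0 + float(asec) / 3600.0
    return ra, -dec if sign == "-" else dec
```

A designation decoded this way can then be compared against the header coordinates to flag the arcsecond-level discrepancies discussed above.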
We assume that the differences between the coordinates stated in H16 and those we obtained from the FITS header may be caused by some post-processing of the LAMOST DR2 archive before it was publicly released (H16 probably used the DR2 before its public release).

We also visually verified randomly selected spectra of objects from their catalogue (using the multiple SSA query in TOPCAT at the given coordinates) and realised that many objects in their catalogue could not be confirmed to have emission signatures in the Hα line, as the apparent emission bumps on the line profiles were just a coincidence of noise fluctuations, similar to what we discuss at the beginning of Appendix B. This concerns most of the objects that they classified as unknown. For example, the spectra spec-55976-GAC_084N40_V1_sp07-095 of LAMOST J010450.45+… and spec-56257-GAC089N38V1_sp14-033 of LAMOST J054458.54+… show no convincing emission; the latter object is projected on an H ii region. Similarly, all four available spectra spec-56618-GAC057N34B1_sp06-069, spec-56661-GAC061N34B1_sp03-203, spec-56680-GAC057N34B2_sp06-069, and spec-56685-GAC061N34B2_sp03-203 of object LAMOST J040124.35+… show that the apparent emission in the Hα core is just a coincidence of noise (compared with the noise amplitude in the continuum).

The situation is no better for the 5 580 stars marked by H16 as CBe type, the classical Be stars. Here, a random visual inspection identified a number of spectra that are either too noisy to allow the profile to be assessed or show no emission signatures at all.
For example, the single-spectrum objects J000414.96+… and several others show only a noisy Hα type II profile, as does J052630.97+…; further such cases are the spectra spec-55880-B8004_3_sp03-189 and spec-55930-B5593002_sp03-192 of object J024156.88+…, spec-56350-GAC089N28V2_sp04-081 and spec-56684-GAC089N28V2_sp04-081 of object J055836.34+…, spec-55875-B7505_sp14-198 and spec-55910-B91003_sp14-198 of object J015840.60+…, spec-55879-B7905_1_sp01-140 and spec-55910-B91003_sp01-140 of object J021505.52+…, and spec-56627-HD095359N274143M01_sp06-068 and spec-56687-HD101242N281431M_sp10-078 of object J100140.14+….

Appendix B.5: Cross-matching with SIMBAD
As many objects in our candidate list look interesting because they resemble Be stars, cataclysmic variables, or young stellar objects, it is important to find more information about them. We therefore tried to cross-match them with the SIMBAD database using the CDS cross-match service built into TOPCAT. It is based on finding the minimum angular distance between the object coordinates and the SIMBAD catalogue in a small circle of a given radius on the sky. As object coordinates, we took the coordinates given in the LAMOST FITS headers, which represent the position on which the LAMOST optical fibre was placed.

The simple idea of cross-matching with a tight tolerance (of the order of 1–2″) turned out to be incorrect when we started to verify the given positions in the all-sky surveys DSS2, 2MASS, and SDSS (using the Aladin sky atlas). We noted spectra of faint objects with a prominent emission profile at coordinates where no object was visible in the sky surveys, but bright stars were nearby for which SIMBAD stated a young star or directly an emission-line object. The spectra in some fibres probably came from a bright object at a distance of even tens of arcseconds from the fibre position. We therefore finally cross-matched our list of candidates with SIMBAD in circles of different sizes, from 5″ up to 300″, and inspected the SIMBAD type of objects with larger distances until we confirmed that the match was
Fig. B.1.
Spectrum of the star that is cross-matched as HD 65 666. The upper two panels are from LAMOST, and the lower panel is from the CCD700 archive of the Perek 2 m telescope. It is a bright star (V = …); the fibre was offset by 6.5″, so it is difficult to cross-match it with SIMBAD.

incorrect (the nearest SIMBAD object was not of an emission nature). The number of cross-matched objects also started to rise steeply after a certain tolerance was exceeded, indicating that the nearest object was no longer the correct one. After some iterations, we set the acceptable radius for a SIMBAD cross-match to 20″, and we visually verified that the objects we cross-matched with SIMBAD were reasonably bright targets of emission type. In several cases, we found a galaxy or a nebula (and the corresponding cross-matched object was confirmed by viewing its spectrum).

Here are a few examples of misplaced light entering the fibre: spectrum spec-56618-GAC105N47V2_sp10-151 of object LAMOST J065632.72+…, where the fibre was offset by 13″, or spec-56295-VB056N24V1_sp08-031 of object LAMOST J034912.80+…, whose true source lies some distance from the fibre position. The spectra with a prominent emission, spec-55960-GAC_101N09_V1_sp10-152, spec-55960-GAC_101N09_V2_sp10-152, and spec-55968-GAC_101N09_V3_sp10-152 of object LAMOST J063910.49+…, in fact belong to the Herbig Ae/Be star R Mon, which lies 26″ away.

A very interesting case was also found for spectrum spec-56283-GAC120N18V2_sp15-115 of object LAMOST J080032.50+…, with an r magnitude of 11.04 (not identified in H16). It is the bright seventh-magnitude star HD 65 666, with a fibre offset of 6.5″. Figure B.1 also shows its spectrum obtained by the Ondřejov 2 m Perek Telescope with a spectral resolution of 13 000, which unveils a more complex structure of the line profiles that is typical for Be stars.
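The heuristic of growing the search radius and watching for a steep rise in the match count can be sketched as follows. This is a stdlib-only Python illustration; `nearest_match` and `match_counts` are our own hypothetical helpers, not the CDS cross-match service itself.

```python
import math

def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Haversine angular separation; inputs in degrees, result in arcsec."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(h))) * 3600.0

def nearest_match(candidate, catalogue, radius_arcsec):
    """Nearest catalogue entry (name, ra, dec) within the radius, or None."""
    best, best_sep = None, radius_arcsec
    for entry in catalogue:
        sep = angular_sep_arcsec(candidate[0], candidate[1], entry[1], entry[2])
        if sep <= best_sep:
            best, best_sep = entry, sep
    return best

def match_counts(candidates, catalogue, radii_arcsec):
    """Matched-candidate count per trial radius.

    A steep rise of the count with the radius suggests that the radius
    has grown past real physical associations and the nearest neighbours
    are chance alignments.
    """
    return {r: sum(nearest_match(c, catalogue, r) is not None
                   for c in candidates)
            for r in radii_arcsec}
```

Plotting the counts against the trial radii makes the knee of the curve visible, which is how a final tolerance such as the 20″ adopted above can be selected before the visual verification step.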