Machine Learning for Scientific Discovery

Abstract

Machine Learning algorithms are good tools for both classification and prediction purposes. These algorithms can further be used for scientific discoveries from the enormous data being collected in our era. We present ways of discovering and understanding astronomical phenomena by applying machine learning algorithms to data collected with radio telescopes. We discuss the use of supervised machine learning algorithms to predict the free parameters of star formation histories and also better understand the relations between the different input and output parameters. We made use of Deep Learning to capture the non-linearity in the parameters. Our models are able to predict with low error rates and give the advantage of predicting in real time once the model has been trained. The other class of machine learning algorithms viz. unsupervised learning can prove to be very useful in finding patterns in the data. We explore how we use such unsupervised techniques on solar radio data to identify patterns and variations, and also link such findings to theories, which help to better understand the nature of the system being studied. We highlight the challenges faced in terms of data size, availability, features, processing ability and importantly, the interpretability of results. As our ability to capture and store data increases, increased use of machine learning to understand the underlying physics in the information captured seems inevitable.


arXiv [astro-ph.IM]

(O4.4) Machine Learning for Scientific Discovery

Shraddha Surana, Yogesh Wadadekar and Divya Oberoi

ThoughtWorks Pvt. Ltd., Binarius Building, Shastrinagar, Yerawada, Pune 411006, Maharashtra, India.

National Centre for Radio Astrophysics, Tata Institute of Fundamental Research, Post Bag 3, Ganeshkhind, Pune 411007, India.


1. Introduction

Machine learning (ML) algorithms learn the information contained in the data and use it for the purpose of prediction, classification and clustering. As astronomical datasets are growing exponentially in size, ML techniques are becoming increasingly useful for creating models which will enable astronomers to expedite the process of astronomical discovery. For example, ML algorithms can be used to improve performance (in terms of time and processing capacity), as we show in prediction of star formation properties in Section 2. They can be used to generate inferences by analysing large amounts of data to uncover their patterns. The outcome of the ML algorithm can be compared with the current model and be used to improve the understanding of the current models, i.e. if the ML predictions for a small number of data points form outliers from the trend then they may point to the need of further investigation, as typically ML algorithms are designed to generalise on all data points.

Section 2 briefly describes the outcome of the supervised deep learning approach to predict three star formation properties of galaxies, viz. stellar mass, star formation rate (SFR) and dust luminosity, using broadband flux measurements. Section 3 presents our exploration of the unsupervised technique to discover information present in solar radio images.

2. Discovering physical models based on supervised machine learning

In this work, our goal is to model and replicate the behavior of a specific stellar population synthesis code, viz. MAGPHYS (da Cunha et al. 2008). The best-fit parameters that characterise the star formation histories of galaxies are determined using these models. We implement a supervised machine learning technique, deep learning, to mimic the behavior of the MAGPHYS model and predict three important star formation properties: stellar mass, star formation rate and dust luminosity. The data used are from the GAMA (Galaxy And Mass Assembly) survey (Driver et al. 2011). We selected all galaxies with 0 < z ≤ [...] and [...] ≥ [...] M⊙ from the GAMA catalog. After applying filters to remove noise from the data, we had 76,455 galaxies, which form our final sample set to train the deep learning model.

Estimating the three star-formation parameters using the MAGPHYS code for 10,000 galaxies would take ∼100,000 minutes (about 10 minutes per galaxy). As telescope technology improves, the size of galaxy samples with multi-wavelength data is ever increasing. The run time of a traditional stellar population synthesis code scales linearly with sample size. The deep learning model, on the other hand, takes 3 to 30 minutes depending on the free parameter being modelled, with most of that time taken up by the training phase, which is a one-time effort. Once the model is trained, the time taken to predict the free parameters for new galaxies is negligible. This represents a huge saving in time, with potentially larger savings for samples from future large-area imaging surveys.

This model can further be modified to also give the confidence level of each prediction. Galaxies for which the confidence level of the prediction is very low can be investigated further and run with the standard stellar population technique. The outcome can then be incorporated in the deep learning model, enriching the model as it encounters more and more data.

The error of the deep learning model is 0.0577, 0.1643 and 0.1143 (in their respective units) for the logarithm of stellar mass, star formation rate and dust luminosity respectively, calculated using

$\mathrm{error} = \sigma(y_{\mathrm{actual}} - y_{\mathrm{predicted}})$

The comparison of predicted values against values from MAGPHYS is shown in Fig. 1. More details on the prediction of star-formation properties are available in Surana et al. (2020).
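As an illustration of the error measure quoted in this section, here is a minimal Python sketch. It assumes that σ denotes the population standard deviation of the residuals, since the text does not say which estimator is used. The function names and sample values are hypothetical and are not taken from the GAMA sample. The k-sigma flag is an added illustration of how mismatched galaxies could be shortlisted for follow-up; it is not a step described in this work.

```python
from statistics import pstdev


def prediction_error(y_actual, y_predicted):
    """Standard deviation of the residuals y_actual - y_predicted.

    Assumption: sigma is taken as the population standard deviation.
    """
    if len(y_actual) != len(y_predicted):
        raise ValueError("inputs must have the same length")
    residuals = [a - p for a, p in zip(y_actual, y_predicted)]
    return pstdev(residuals)


def flag_mismatches(y_actual, y_predicted, k=3.0):
    """Indices whose residual lies more than k sigma from the mean residual.

    Hypothetical helper: the threshold k is an illustrative choice.
    """
    residuals = [a - p for a, p in zip(y_actual, y_predicted)]
    mean = sum(residuals) / len(residuals)
    sigma = pstdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r - mean) > k * sigma]


# Made-up log10(stellar mass) values standing in for reference-code output
# and for the corresponding deep learning predictions.
reference = [9.8, 10.1, 10.4, 10.9, 11.2]
predicted = [9.9, 10.0, 10.5, 10.8, 11.3]
error = prediction_error(reference, predicted)
print(f"error = {error:.4f}")
```

Galaxies returned by `flag_mismatches` would be natural candidates for a rerun with the standard stellar population synthesis code.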

3. Discovering patterns and anomalies based on unsupervised machine learning

Unsupervised ML algorithms are of particular importance in research, as they are used to uncover patterns, complex relationships and groupings that exist in the data. Quite often, research explorations and approaches are biased by the knowledge, experience, expectations and intuitions of the researchers. Much of the analysis is also driven by visual inspection, which has severe limitations in terms of the number of dimensions that can be explored simultaneously, and the capacity of the human brain to assimilate information from large data volumes. We use unsupervised ML algorithms to understand data demography and observe groupings, patterns or anomalies in the data.

Figure 1. The scatter plots in the upper panels show the values predicted by the deep learning model compared to the MAGPHYS values, for stellar mass, star formation rate and dust luminosity. The dashed line shows the best linear fit through the scatter plot and the solid line represents the points where the predicted values equal the MAGPHYS model values. The lower panel shows the error in the prediction as a function of the respective parameter.

We use solar radio images from the Murchison Widefield Array (MWA) to explore ML approaches. This opportunity has been enabled by the recent availability of an interferometric imaging pipeline (Mondal et al. 2019), which provides solar radio images with unprecedented fidelity and dynamic range. The output of this pipeline is a hypercube, $I(\theta, \phi, \nu, t)$, where $I$ is the intensity of emission, $\theta$ and $\phi$ are the image coordinates, $\nu$ is the frequency, and $t$ is the time corresponding to the image. The MWA can produce up to ∼[...] images every minute. It quickly becomes infeasible to explore such large datasets using conventional approaches that rely on intensive human effort. Although we are currently working with a much lower image rate of ∼[...] images per minute, these data are already challenging enough to require significant parallelization and processing power. Parallelization can be applied both in the model implementation and in the pre-processing of data required by the ML algorithm.

In order to utilize this extensive amount of data and synthesise the information from them, we use an unsupervised ML technique called Self-Organising Maps (SOMs) (Kohonen 1982). SOMs are a form of unsupervised neural networks that produce a lower dimensional (typically two-dimensional) representation of the data, while preserving the topology. They can be used to explore similarities and/or dissimilarities in the solar images, pointing to directions for further investigation. Visualization also forms a large part of the exploration, both before and after application of the unsupervised ML algorithms: before to understand the data, and after to understand the patterns identified by the ML algorithm.
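To make the SOM idea concrete, the following is a minimal pure-Python sketch of a rectangular Self-Organising Map. It is not the implementation used in this work. The function names, the linear decay of the learning rate and neighbourhood width, the floor on the width, and the toy input vectors (standing in for flattened image features) are all assumptions made for illustration.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the (row, col) of the node whose weight vector is closest to x."""
    best, best_d2 = (0, 0), float("inf")
    for i, row in enumerate(weights):
        for j, w in enumerate(row):
            d2 = sum((a - b) ** 2 for a, b in zip(x, w))
            if d2 < best_d2:
                best, best_d2 = (i, j), d2
    return best


def train_som(data, rows, cols, epochs=50, lr0=0.5, sigma0=None, seed=0):
    """Train a rows x cols SOM on data and return the grid of weight vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    if sigma0 is None:
        sigma0 = max(rows, cols) / 2.0
    # Initialise every node with a copy of a randomly chosen sample.
    weights = [[list(rng.choice(data)) for _ in range(cols)] for _ in range(rows)]
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)
        for x in order:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                    # learning rate decays linearly
            sigma = max(sigma0 * (1.0 - frac), 0.5)    # neighbourhood width, floored
            bi, bj = best_matching_unit(weights, x)
            for i in range(rows):
                for j in range(cols):
                    # Gaussian neighbourhood measured on the grid, not in data space.
                    grid_d2 = (i - bi) ** 2 + (j - bj) ** 2
                    h = math.exp(-grid_d2 / (2.0 * sigma * sigma))
                    w = weights[i][j]
                    for d in range(dim):
                        w[d] += lr * h * (x[d] - w[d])
            step += 1
    return weights


# Hypothetical two-component feature vectors, not real MWA data.
samples = [(0.0, 0.1), (0.1, 0.0), (0.9, 1.0), (1.0, 0.9)]
som = train_som(samples, rows=2, cols=2, epochs=30)
cells = [best_matching_unit(som, s) for s in samples]
print(cells)
```

Because neighbouring nodes on the grid are updated together, inputs that are similar tend to land on the same or adjacent cells. This is the topology-preserving property that makes the map useful for spotting groupings and unusual images.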

4. Summary

We have successfully applied machine learning to predict star formation properties of galaxies, which has brought down the time needed to predict the specified parameters while keeping the error very low. Since ML models are designed to be generalised models, their results can be compared to existing models (simulations etc.) and the data points whose results do not match can be analysed further. This may help to better understand the underlying physical model, or help make the existing simulated models better. On the other hand, ML can be applied to a vast set of unstructured data to discover phenomena which would otherwise take a lot of time and insight to uncover. Our work in using SOMs on solar radio images is a promising example of using ML techniques to uncover patterns, clusters and anomalies in the data. As our ability to capture and store data increases, increased use of machine learning to understand the underlying information captured is inevitable.

Acknowledgments.

SS thanks National Centre for Radio Astrophysics for hosting her for a part of this work. SS also thanks her colleagues from ThoughtWorks for the numerous interesting and useful discussions, and stimulating ideas.

References

da Cunha, E., Charlot, S., & Elbaz, D. 2008, MNRAS, 388, 1595.

Driver, S. P., Hill, D. T., Kelvin, L. S., Robotham, A. S. G., Liske, J., et al. 2011, MNRAS, 413, 971.

Kohonen, T. 1982, Biological Cybernetics, 43, 59. URL http://dx.doi.org/10.1007/BF00337288

Mondal, S., Mohan, A., Oberoi, D., Morgan, J. S., Benkevitch, L., Lonsdale, C. J., Crowley, M., & Cairns, I. H. 2019, ApJ, 875, 97.