Learning the Fundamental MIR Spectral Components of Galaxies with Non-Negative Matrix Factorisation

Abstract

The mid-infrared (MIR) spectra observed with the \textit{Spitzer} Infrared Spectrograph (IRS) provide a valuable dataset for untangling the physical processes and conditions within galaxies. This paper presents the first attempt to blindly learn fundamental spectral components of MIR galaxy spectra, using non-negative matrix factorisation (NMF). NMF is a recently developed multivariate technique shown to be successful in blind source separation problems. Unlike the more popular multivariate analysis technique, principal component analysis, NMF imposes the condition that weights and spectral components are non-negative. This more closely resembles the physical process of emission in the mid-infrared, resulting in physically intuitive components. By applying NMF to galaxy spectra in the Cornell Atlas of Spitzer/IRS sources (CASSIS), we find similar components amongst different NMF sets. These similar components include two for AGN emission and one for star formation. [... ABBREVIATED...] We show an NMF set with seven components can reconstruct the general spectral shape of a wide variety of objects, though it struggles to fit the varying strength of emission lines. We also show that the seven components can be used to separate out different types of objects. We model this separation with Gaussian Mixtures modelling and use the result to provide a classification tool. We also show the NMF components can be used to separate out the emission from AGN and star formation regions and define a new star formation/AGN diagnostic which is consistent with all mid-infrared diagnostics already in use but has the advantage that it can be applied to mid-infrared spectra with low signal to noise or with limited spectral range. The 7 NMF components and code for classification are made public on arXiv and are available at: \url{this https URL}

Full PDF

Mon. Not. R. Astron. Soc., 1–?? (2002) Printed 14 October 2018 (MN LaTeX style file v2.2)

Learning the Fundamental MIR Spectral Components of Galaxies with Non-Negative Matrix Factorisation

P.D. Hurley, S. Oliver, D. Farrah, V. Lebouteiller, H. W. W. Spoon

Astronomy Centre, Department of Physics and Astronomy, University of Sussex, Falmer, Brighton BN1 9QH, UK
Virginia Polytechnic Institute & State University, Department of Physics, MC 0435, 910 Drillfield Drive, Blacksburg, VA 24061
Laboratoire AIM, CEA/DSM-CNRS-Université Paris Diderot DAPNIA/Service d'Astrophysique Bat. 709, CEA-Saclay, F-91191 Gif-sur-Yvette Cedex, France
Department of Astronomy and Center for Radiophysics and Space Research, Cornell University, Space Sciences Building, Ithaca, NY 14853-6801, USA


ABSTRACT

The mid-infrared (MIR) spectra observed with the Spitzer Infrared Spectrograph (IRS) provide a valuable dataset for untangling the physical processes and conditions within galaxies.

This paper presents the first attempt to blindly learn fundamental spectral components of MIR galaxy spectra, using non-negative matrix factorisation (NMF). NMF is a recently developed multivariate technique shown to be successful in blind source separation problems. Unlike the more popular multivariate analysis technique, principal component analysis, NMF imposes the condition that weights and spectral components are non-negative. This more closely resembles the physical process of emission in the mid-infrared, resulting in physically intuitive components. By applying NMF to galaxy spectra in the Cornell Atlas of Spitzer/IRS sources (CASSIS), we find similar components amongst different NMF sets. These similar components include two for AGN emission and one for star formation. The first AGN component is dominated by fine structure emission lines and hot dust, the second by broad silicate emission at 10 and 18 µm. The star formation component contains all the PAH features and molecular hydrogen lines. Other components include rising continuums at longer wavelengths, indicative of colder grey-body dust emission. We show an NMF set with seven components can reconstruct the general spectral shape of a wide variety of objects, though it struggles to fit the varying strength of emission lines. We also show that the seven components can be used to separate out different types of objects. We model this separation with Gaussian Mixtures modelling and use the result to provide a classification tool.

We also show the NMF components can be used to separate out the emission from AGN and star formation regions and define a new star formation/AGN diagnostic which is consistent with all mid-infrared diagnostics already in use but has the advantage that it can be applied to mid-infrared spectra with low signal to noise or with limited spectral range.

The 7 NMF components and code for classification are made public on arXiv and are available at: https://github.com/pdh21/NMF_software/ .

Key words: galaxies: statistics – infrared: galaxies

Spectra of the integrated mid-infrared (MIR) emission from galaxies contain a wealth of diagnostics that probe the origin of their MIR luminosity. For example, the main polycyclic aromatic hydrocarbons (PAHs) emission features found at […] µm are strong in objects where star formation activity contributes significantly to the mid-IR luminosity (Genzel et al. 1998; Laurent et al. 2000). The PAH features are either weak or absent for objects dominated by an active galactic nucleus (AGN) while emission lines with a high ionisation potential, for example the Neon fine structure line [Ne V] 14.3 µm, tend to be strong in the presence of an AGN (Genzel et al. 1998; Sturm et al. 2000).

Ratios of other fine structure lines such as [Ne III] 15.56 µm/[Ne II] 12.81 µm versus [S III] 33.48 µm/[Si II] 34.82 µm have been shown to diagnose power source (Dale et al. 2006), as has the shape of the underlying mid-infrared dust continuum (Brandl et al. 2006).

Observations from the Infrared Space Observatory (ISO; Kessler et al. 1996) and the Infrared Spectrograph (IRS; Houck et al. 2004) on the Spitzer Space Telescope (Werner et al. 2004) allowed the MIR spectral features to be used as diagnostics of star formation and AGN activity. Combinations of the PAH emission lines, high to low excitation mid-infrared emission lines, silicate features and continuum measurements have been used as diagnostics for characterising the power source behind Ultraluminous Infrared Galaxies (ULIRGs) (Genzel et al. 1998; Rigopoulou et al. 1999; Spoon et al. 2007; Farrah et al. 2007, 2008, 2009; Petric et al. 2010).

However, diagnostics based on specific emission and absorption lines only focus on small parts of the spectrum, disregarding the information contained in the rest of the mid-infrared region. They can also be ambiguous. Dust and gas require ionising radiation to emit in the mid-IR; the source of the radiation is not important. For example, hot OB stars or an accretion disk around a supermassive black hole can both produce the [OIV] 25.9 µm emission line, as can shocks (e.g. Lutz et al. 1998). The line ratios of fine structure lines can also be affected by the geometry of the emitting region and the age of a starburst, while the metallicity can affect PAH emission strength (e.g. Thornley et al. 2000; Engelbracht et al. 2005; Madden et al. 2006; Wu et al. 2006; Farrah et al. 2007). As a result, different diagnostics can give conflicting estimates for the contribution from star formation and/or AGN (e.g. Armus et al. 2007; Veilleux et al. 2009).

Separation of spectral features from continuum and the mixing of neighbouring spectral features can also be problematic.
For example, measurement of the 9.7 µm silicate feature requires different methods depending on the strength of the 8.6 and 11.2 µm PAH emission lines (Spoon et al. 2007).

An alternative method for identifying the power source is to decompose the spectra with AGN and starburst spectral templates. These templates tend to be a spectrum from a specific object (e.g. M82) or a mean spectrum of a number of similar object types. Pope et al. (2008) use a combination of the M82 spectrum, the average spectral template of starburst galaxies (Brandl et al. 2006) and a power law to decompose the IRS spectra of 13 high redshift submillimeter galaxies. Valiante et al. (2009) fit IRS spectra across the range 5.5-6.85 µm with a combination of the M82 spectrum and a linear approximation for the AGN continuum. Alonso-Herrero et al. (2011) use the Brandl et al. (2006) starburst template and CLUMPY radiative transfer models for AGN to decompose the IRS spectra of 53 LIRGs into starburst and AGN components. Using average starburst templates is both simplistic and problematic. Prior theoretical prejudices drive the choice for what objects are used for the average templates, and they may be contaminated by AGN emission. The same is true for AGN average spectral templates.

With the public release of all low resolution Spitzer/IRS spectra by the Cornell Atlas of Spitzer/IRS sources (CASSIS) (Lebouteiller et al. 2011), we are now in a better position to investigate the role played by star formation and AGN with more sophisticated techniques. In this paper we use a multivariate analysis technique to blindly learn the fundamental MIR spectral components, which we interpret as different physical environments within galaxies. Learning the MIR spectral shape of physical environments allows the whole MIR wavelength range to be used as a diagnostic. The spectral components also provide an alternative to average spectral templates.

A subclass of multivariate analysis techniques includes matrix factorisation algorithms.
The techniques are often associated with pattern recognition and blind source separation (Lee & Seung 2001). Algebraically, the algorithms approximate a data matrix by two simpler matrices: a weight matrix and a component matrix. Common factorisation techniques include Singular Value Decomposition, Principal Component Analysis and Independent Component Analysis. The different techniques use different assumptions to carry out the factorisation, resulting in different weights and components. As multivariate datasets of spectra have become more prevalent, techniques such as Principal Component Analysis (PCA) have been applied to astronomical problems. PCA has already been used for spectral classification of optical galaxies (e.g. Connolly et al. 1995; Bromley et al. 1998; Taghizadeh-Popp et al. 2012). PCA has also been successfully applied to the IRS spectra of local ULIRGs (Wang et al. 2011; Hurley et al. 2012).

The weights and spectral templates derived with PCA can be both positive and negative. Spectral reconstruction involves both addition and cancellation of spectral features. As a result, the PCA templates are inherently difficult to interpret physically.

A relatively new matrix factorisation technique, non-negative matrix factorisation (NMF; Lee & Seung 1999), can be thought of as PCA but with non-negative constraints on weights and templates. The constraints make reconstruction a purely additive process which more closely resembles emission in the mid-infrared. The first application of NMF to astronomy was carried out by Blanton & Roweis (2007), who adopted the Lee & Seung (2001) NMF algorithms and applied them to optical spectra and photometry. It has also been used as a blind source separation algorithm on the IRS spectra of galactic photo-dissociation regions (Berné et al. 2007; Rosenberg et al. 2011).

This paper presents the first NMF analysis on mid-infrared galaxy spectra. We use spectra from the recently released Cornell Atlas of Spitzer/IRS sources (CASSIS) (Lebouteiller et al. 2011).
Our paper provides the first large scale statistical analysis of the IRS spectra to date using the NMF algorithm. Section 2 describes the CASSIS database and data reduction. In Section 3, we describe the suitability of matrix factorisation to IRS spectra, and give details on the NMF algorithm. In Section 4 we present our results and in Section 5 our conclusions. We assume a spatially flat cosmology with $H_0 = 70$ km s$^{-1}$ Mpc$^{-1}$, $\Omega = 1$, and $\Omega_m = 0.$[…].

The Cornell Atlas of Spitzer/IRS Sources (CASSIS) is a product of the Infrared Science Center at Cornell University, supported by NASA and JPL.

We use spectra from the Cornell Atlas of Spitzer/IRS sources (CASSIS) (Lebouteiller et al. 2011). The atlas contains sources observed in low resolution mode with the Infrared Spectrograph (IRS; Houck et al. 2004) on board the Spitzer Space Telescope (Werner et al. 2004). IRS low resolution mode observations were made using two low-resolution modules, ShortLow and LongLow (hereafter SL and LL), covering 5.2-14.5 and 14.0-38.0 µm respectively. The modules also had a resolving power of R ≈ […]−120 (≈ 75% of the observations) and an aperture size of 3.[…] × […]″ for SL and 10.[…] × […]″ for LL. The observations in the CASSIS database are first processed with the Basic Calibrated Data (BCD) pipeline from the Spitzer Science pipeline (release S18.7.0), which produces BCD frames. This removes electronic and optical artefacts. The BCD images are then processed using the CASSIS pipeline, which carries out image cleaning, background subtraction, and spectral extraction. The pipeline algorithm is both automatic and flexible enough to handle different observations, from barely detected sources to bright sources and from point-like to somewhat extended sources.

The current version of CASSIS (version 4) contains 11304 distinct sources. 2118 of those distinct sources have known spectroscopic redshifts taken from the NASA/IPAC Extragalactic Database (NED). We make the additional redshift cut (0.[…] < z < […]).

Observations using data from both SL and LL spectral modules can suffer from mismatching due to telescope pointing inaccuracy or if a source is extended in SL and not in LL. The mismatching causes the spectra from one of the modules (normally the SL) to have lower flux calibration than the other. Correcting the mismatch is inherently difficult as the data from the overlap between the two modules can suffer from the '14 micron teardrop' (see IRS instrument handbook, http://irsa.ipac.caltech.edu/data/SPITZER/docs/irs/), leaving a small gap at around 13-14 µm.

Figure 1. The redshift distribution for the sample selection we apply the NMF algorithm to. (Vertical axis: No. of objects.)

We correct for the mismatch using a simplified version of our NMF technique. For the first step, we generated two sets of templates, one using SL data and the other using LL data. The distribution in redshift causes the mismatch region to occur at different rest frame wavelengths for different objects. This ensures at least one template set covered the mismatch region for each object. We then fitted the template set to a region of width 7 µm, centred on the mismatch area. Wavelength points associated with PAH and Neon emission lines were removed to prevent strong line strengths from distorting the fits. We carry out the fit for different scalings applied to the SL data. The scaling factor value that gives the lowest $\chi^2$ is chosen as the scaling correction. Having stitched the spectra using both SL and LL template sets, we then generated our initial NMF sets for the entire spectral range. We then re-stitch the spectra with the new NMF set. Additional spectra used for analysis in this paper are also stitched with our final NMF set, introduced in Section 4.

The NMF analysis requires all spectra to be normalised to a standard value to prevent sources with higher flux from biasing the algorithm. We normalise all the spectra by the average flux across the rest frame wavelength range of 7−[…] µm. We choose this range as it is common to all sources with both SL and LL data.

Analysis of spectra from Spitzer's IRS has tended to be done using diagnostics based on only a few of the specific features (e.g. Sajina et al. 2007; Pope et al. 2008; Alonso-Herrero et al. 2012). For example, Spoon et al. (2007) introduced a classification scheme based on the 6.2 µm PAH line and 10 µm silicate feature. Quantifying the contribution from star formation and AGN has also been carried out using fine structure lines, for example the [OIV]/[NeII] and [NeV]/[NeII] line ratios versus the 6.2 µm PAH equivalent width (e.g. Armus et al. 2007; Petric et al. 2010).

In essence, line diagnostic analyses are carrying out a crude compression by using only small parts of the spectrum to describe each object (e.g. the 6.2 µm feature). Matrix factorisation techniques provide an alternative approach to compression by transforming data from wavelength space to one that better captures the variance in the dataset. As a result, classification or quantification of properties such as star formation is carried out considering a greater wavelength range.

Algebraically, matrix factorisations find a linear approximation to a data matrix X such that X ≈ WH, or:

$$X_{i\mu} \approx (WH)_{i\mu} = \sum_{a=1}^{r} W_{ia} H_{a\mu} \tag{1}$$

where i is the object index, µ is wavelength and a is the component index. The matrix H can be thought of as a set of r components that represent latent structure explicit in the dataset, and W are a set of weighting coefficients. Each object in the dataset can now be approximated by a linear combination of the derived components, H.

Different matrix factorisation techniques use different assumptions to carry out the approximation. Independent component analysis (ICA) assumes the derived components (H) are independent. Principal component analysis (PCA) models the dataset as a multivariate Gaussian distribution in wavelength space and finds the orthogonal components of the Gaussian. Non-Negative Matrix Factorisation (NMF) assumes the data, weights and components are all non-negative, but makes no assumption on the distribution of the data or correlation between derived components.

By applying linear matrix factorisation techniques to the mid-infrared spectra of galaxies, we are assuming mid-infrared spectra of galaxies, $F(\lambda)$, can be modelled as a linear combination of components. Ideally the components would relate to physical regions, for example a star forming region ($T_{SF}$), an active galactic nuclei torus ($T_{AGN}$), a molecular cloud ($T_{MC}$) or diffuse dust component ($T_{C}$). A
Aspectrum for a galaxy would then simply be: F ( λ ) = a · T SF ( λ ) + b · T AGN ( λ ) + c · T MC ( λ ) + d · T C ( λ ) (2)Where, a, b, c and d are the relative weights for each compo-nent.For the above model, ICA is not suitable as the com-ponents are unlikely to be independent, for example AGNand star formation are believed to be triggered by similarmechanisms such as mergers (e.g. Sanders et al. 1988), andare likely to be connected through feedback processes (e.g.Farrah et al. 2012; Rovilos et al. 2012).PCA has already been applied to the mid-infrared spec-tra of ULIRGs (e.g. Wang et al. 2011; Hurley et al. 2012).Algebraically, PCA calculates the eigenvectors of the covari-ance matrix. For spectra, the principal components repre-sent the principal variations from a mean spectral template.The components are therefore allowed to have features whichare positive and negative, and are also allowed to have a neg-ative weighting when ﬁtting objects. The freedom to be bothpositive and negative does not mimic the process of emissionin the MIR, resulting in components that are inherently dif-ﬁcult to interpret. By their nature, the principal components have a statistical rather than physical interpretation. There-fore, although PCA can successfully reduce dimensionalityof spectra for classiﬁcation from known objects, it is notsuitable for our model.The non-negative constraint of NMF more closely re-ﬂects the physical process of emission in the mid-infrared,which does not suﬀer from the same problems of absorptionas other spectral ranges. As a result the NMF generatedtemplates are more physically intuitive.NMF is therefore the most applicable matrix factorisa-tion routine for our linear interpretation of galaxy emission.However, the situation is complicated by dust extinction.This introduces a non-linearity to the problem since extinc-tion is multiplicative and exponential. 
$$F(\lambda) = \left( a \cdot T_{SF}(\lambda) + b \cdot T_{AGN}(\lambda) + c \cdot T_{MC}(\lambda) + d \cdot T_{C}(\lambda) \right) e^{-f \cdot \tau(\lambda)} \tag{3}$$

where f is the weight associated with extinction and $\tau(\lambda)$ can either be known or unknown.

We can take the model one step further by allowing extinction to vary across all four components:

$$F(\lambda) = a \cdot T_{SF}(\lambda)\, e^{-f \cdot \tau(\lambda)} + b \cdot T_{AGN}(\lambda)\, e^{-g \cdot \tau(\lambda)} + c \cdot T_{MC}(\lambda)\, e^{-h \cdot \tau(\lambda)} + d \cdot T_{C}(\lambda)\, e^{-i \cdot \tau(\lambda)} \tag{4}$$

The weights for the extinction are f, g, h and i.

We have explored the suitability of non-linear kernel based matrix factorisation algorithms (e.g. Zafeiriou & Petrou 2010; Pan et al. 2011) and found they are not suited for the non-linear behaviour described in equations 3 and 4. We discuss why in Appendix A. Current algorithms therefore restrict us to describe mid-infrared galaxy spectra as a set of linear components (e.g. equation 2) and NMF is the most appropriate matrix factorisation technique.

The first application of NMF in astronomy was carried out by Blanton & Roweis (2007), who updated the popular NMF multiplicative algorithm from Lee & Seung (2001) to include uncertainties and to handle heterogeneous datasets (e.g. optical spectra and photometric observations of galaxies at different redshifts). They also restricted the space of possible spectra to those predicted from high resolution stellar population synthesis models. We use the NMF algorithm from Blanton & Roweis (2007) to identify and learn the mid-infrared sources that are common to galaxies in the CASSIS database. Unlike Blanton & Roweis (2007), we do not use any models as a prior for the shape of the components; we use the algorithm to blindly learn the shape of our components.

As with PCA, the goal of NMF is to minimise a cost function. The most widely used is the squared approximation error described in Lee & Seung (2001):

$$\chi^2 = \sum_{i\mu} \left( X_{i\mu} - \sum_{a} W_{ia} H_{a\mu} \right)^2 \tag{5}$$

Minimising equation 5 requires some sort of numerical technique to find local minima.
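One numerical scheme of this kind is the multiplicative update. The sketch below is a minimal, dependency-free illustration and not the implementation used in this paper: it iterates multiplicative updates for a weighted squared error of the form given in equations 6 to 8, which reduces to the cost in equation 5 when all uncertainties are equal. The function names (`matmul`, `weighted_cost`, `nmf_multiplicative`), the random starting point, the small `eps` guard against division by zero and the iteration count are all assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def weighted_cost(X, W, H, inv_var):
    """Weighted squared error: sum over i, mu of (X - WH)^2 / sigma^2."""
    model = matmul(W, H)
    return sum(
        inv_var[i][m] * (X[i][m] - model[i][m]) ** 2
        for i in range(len(X))
        for m in range(len(X[0]))
    )


def nmf_multiplicative(X, inv_var, n_components, n_iter=200, seed=0, eps=1e-12):
    """Illustrative weighted NMF; returns W, H and the cost after each iteration."""
    rng = random.Random(seed)
    n_obj, n_wave = len(X), len(X[0])
    # Strictly positive random start (an assumption of this sketch).
    W = [[rng.random() + 0.1 for _ in range(n_components)] for _ in range(n_obj)]
    H = [[rng.random() + 0.1 for _ in range(n_wave)] for _ in range(n_components)]
    costs = [weighted_cost(X, W, H, inv_var)]
    for _ in range(n_iter):
        # Update the weights W with the components H held fixed.
        model = matmul(W, H)
        for i in range(n_obj):
            for a in range(n_components):
                num = sum(inv_var[i][m] * X[i][m] * H[a][m] for m in range(n_wave))
                den = sum(inv_var[i][m] * model[i][m] * H[a][m] for m in range(n_wave))
                W[i][a] *= num / (den + eps)
        # Update the components H with the new weights W held fixed.
        model = matmul(W, H)
        for a in range(n_components):
            for m in range(n_wave):
                num = sum(inv_var[i][m] * W[i][a] * X[i][m] for i in range(n_obj))
                den = sum(inv_var[i][m] * W[i][a] * model[i][m] for i in range(n_obj))
                H[a][m] *= num / (den + eps)
        costs.append(weighted_cost(X, W, H, inv_var))
    return W, H, costs


# Toy data (hypothetical): 4 "spectra" built from 2 non-negative components,
# with unit uncertainties on every point.
true_W = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0], [1.5, 0.2]]
true_H = [[1.0, 2.0, 0.5, 0.0, 1.0], [0.0, 1.0, 3.0, 1.0, 0.5]]
X = matmul(true_W, true_H)
inv_var = [[1.0] * len(X[0]) for _ in X]
W, H, costs = nmf_multiplicative(X, inv_var, n_components=2)
```

Because every factor in the update is non-negative, W and H stay non-negative without any explicit clipping, and the recorded `costs` give a simple way to inspect convergence. Plain lists are used only to keep the sketch self-contained; an array library would be the natural choice for real spectra.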
Lee & Seung (2001) presented 'multiplicative update rules' for H and W. Upon each iteration, the rules are used to update H and W by a multiplicative factor whilst minimising equation 5. The algorithm implemented in Blanton & Roweis (2007) altered the original multiplicative update algorithm of Lee & Seung (2001) for nonuniform uncertainties ($\sigma$). The cost function then becomes the weighted squared approximation error:

$$\chi^2 = \sum_{i\mu} \left( \frac{X_{i\mu} - \sum_{a} W_{ia} H_{a\mu}}{\sigma_{i\mu}} \right)^2 \tag{6}$$

Blanton & Roweis (2007) showed the multiplicative update rules for H and W are as follows:

$$W_{ia} \leftarrow W_{ia} \left( \sum_{\mu} \frac{X_{i\mu} H_{a\mu}}{\sigma_{i\mu}^2} \right) \left( \sum_{m\mu} \frac{W_{im} H_{m\mu} H_{a\mu}}{\sigma_{i\mu}^2} \right)^{-1} \tag{7}$$

$$H_{a\mu} \leftarrow H_{a\mu} \left( \sum_{i} \frac{W_{ia} X_{i\mu}}{\sigma_{i\mu}^2} \right) \left( \sum_{mi} \frac{W_{ia} W_{im} H_{m\mu}}{\sigma_{i\mu}^2} \right)^{-1} \tag{8}$$

The update rules in equations 7 and 8 are guaranteed to reduce the error; however, the cost function in equation 6 is not necessarily convex, so the algorithm may get stuck in a local minimum. We run the algorithm five times with different initial starting positions to check the solution is consistent.

Convergence can be evaluated by looking at the decrease in cost function across iterations and checking the solution has reached a minimum. In practice, we find 3000 iterations are enough for H and W to converge.

The number of components generated by NMF is a user input. Unlike PCA, where the shape of the original components remains unchanged as more are added, the NMF components will not remain the same. We investigate the number of components required to constrain the data by generating 11 different NMF sets, containing from 3 up to 14 components. We define the following notation, NMF xy, to describe the x th component from an NMF set containing y components.

To determine the minimum number of components that are justified by the data, one should calculate the Bayesian evidence (E).
$$E \equiv \int L(\theta)\, \pi(\theta)\, d\theta \tag{9}$$

The evidence can be thought of as the average likelihood, $L(\theta)$, over all of the prior, $\pi(\theta)$, parameter space, $d\theta$, of a given model and automatically implements Occam's razor, i.e. simpler models are preferred unless simplicity can be traded for greater explanatory power.

There are two ways in which one could calculate the Bayesian evidence for our setup. The first would be to calculate the evidence for the NMF algorithm, where the number of parameters is equal to the number of elements in both H and W. This approach would be the most appropriate if comparing the suitability of NMF to other matrix factorisation techniques; the integral, however, becomes highly multi-dimensional, making the calculation numerically challenging. Alternatively, if NMF is the most appropriate algorithm to our problem, then we can assume that the components are correct. The number of parameters is then equal to the number of elements in W, i.e. the number of components.

Figure 2. The Bayesian evidence as a function of number of components (horizontal axis: No. of components; vertical axis: ln(Bayesian Evidence)). For each NMF set, we run the algorithm 5 times and calculate the median evidence value of the entire galaxy sample. We plot the mean and standard deviation of the 5 repeats.
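To make equation 9 concrete, the following sketch estimates the evidence for a toy one-parameter model by literally averaging the likelihood over draws from the prior, and compares the result with direct numerical integration. It is neither nested sampling nor MULTINEST. The Gaussian likelihood, the uniform prior on a single component weight, the template and spectrum values and all function names are assumptions chosen purely for illustration.

```python
import math
import random


def log_likelihood(weight, spectrum, template, sigma):
    """Gaussian log-likelihood of a spectrum modelled as weight * template."""
    chi2 = sum(((f - weight * t) / sigma) ** 2 for f, t in zip(spectrum, template))
    norm = len(spectrum) * math.log(2.0 * math.pi * sigma ** 2)
    return -0.5 * (chi2 + norm)


def evidence_prior_average(spectrum, template, sigma, w_max, n_samples=20000, seed=1):
    """Estimate E as the mean likelihood over draws from a uniform prior on [0, w_max]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        weight = rng.uniform(0.0, w_max)
        total += math.exp(log_likelihood(weight, spectrum, template, sigma))
    return total / n_samples


def evidence_grid(spectrum, template, sigma, w_max, n_grid=20000):
    """Midpoint-rule integral of likelihood times the uniform prior density 1 / w_max."""
    step = w_max / n_grid
    total = sum(
        math.exp(log_likelihood((k + 0.5) * step, spectrum, template, sigma))
        for k in range(n_grid)
    )
    return total * step / w_max


# Hypothetical three-point "spectrum" and single fixed component.
template = [1.0, 2.0, 1.5]
spectrum = [1.1, 1.9, 1.6]
e_mc = evidence_prior_average(spectrum, template, sigma=0.5, w_max=3.0)
e_grid = evidence_grid(spectrum, template, sigma=0.5, w_max=3.0)
```

With the components held fixed, the only free parameters are the weights, so each additional component adds one dimension to this integral. Simple prior averaging becomes inefficient as the dimension grows, which is one reason a dedicated sampler is used in practice.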

We choose the latter approach as we have already chosen NMF as the most appropriate algorithm to our problem and are not comparing alternative procedures.

We calculate the evidence by using the nested sampling routine MULTINEST (Feroz et al. 2008) to re-fit the CASSIS sample with different NMF sets. MULTINEST is a Bayesian inference tool which calculates the evidence and produces posterior samples from distributions with (often an unknown number of) multiple modes and/or degeneracies between parameters. Nested Sampling (Skilling 2004) is a Monte Carlo technique that randomly samples from the prior space, and zooms in on areas of higher likelihood during successive iterations.

We fit every galaxy with each of the NMF component sets and their respective repeats. For every repeat, we calculate the median evidence of the sample. The main uncertainty on our evidence values comes from the difference in NMF sets across repeats (i.e. the convergence on slightly different local minima by the NMF algorithm). To quantify the uncertainty on our evidence values, we calculate the mean and standard deviation of the evidence values from the 5 repeats, as a function of number of components.

As discussed in the previous section, we would like to quantify how many components are required by the data. Figure 2 shows the mean and standard deviation for the Bayesian evidence values from 5 repeats, as a function of number of components. The Bayesian evidence should start decreasing as the number of components exceeds the optimum number needed to constrain the data. We see no turnover, indicating there is not an obvious, optimum NMF set below 14 components. We note however a slight levelling off at 7 components before increasing again beyond 8.

We have also looked at the ratio of evidence values between consecutive NMF sets. The ratio, referred to as the

Bayes factor (K), is used as a measure for a Bayesian version of classical hypothesis testing. We use the Jeffreys scale to interpret K. A value of K < […], K = 1−[…], K = 3−10 indicates substantial support for the simpler model, while K = 10−30 is strong, K = 30−100 is very strong, and K > 100 is considered decisive. Using the Jeffreys scale, we find more than 14 components are needed to reconstruct spectra within the uncertainties. However, we note that K begins to level off after 6/7, indicating that although more complicated component sets are preferred, the gain in increasing the number of components is beginning to decrease.

Ideally, we would calculate the Bayesian evidence and Bayes factor beyond the 14-component NMF set. However, calculating evidence for highly multidimensional parameter spaces becomes computationally challenging. We have qualitatively examined NMF sets where the number of components > 14. As an example, in Figure B1 we show the NMF components for one such set. Interpreting a many-component NMF set of this kind becomes challenging as signatures begin to separate out into several components, whose physical interpretation is not clear.

We also note that the Bayesian evidence calculation could be influenced by two fundamental factors. The first is the use of uncertainties associated with IRS spectra, which have often been underestimated below the observed variation between individual nod positions on the IRS, as described in Chapter 7 of the IRS Instrument Handbook (http://irsa.ipac.caltech.edu/data/SPITZER/docs/irs/). As a result, our model selection may be too conservative. The other problem comes from the suitability of the NMF algorithm to the non-linear behaviour associated with extinction. We have carried out a simple simulation to show how extinction could be a factor in driving our linear methods to more templates than might be required by underlying physical conditions. Details can be found in Appendix B.

We have investigated how many components are needed in a quantitative manner. For the rest of this paper we investigate how many components are needed qualitatively, by examining some of the simpler NMF component sets.

Figure 3 shows each spectral component for the NMF sets examined. We have ordered the components so that similar components appear in the same order. We note the ordering of components given by NMF is unimportant.

The NMF sets in Figure 3 show that many of the components remain similar, despite an increase in the allowed number of components.

The first component contains a dust continuum which peaks at around 24 µm and contains emission from the Sulphur line [SIV] at 10.51 µm, the 12.8, 15.6 and 24.3 µm Neon lines and Oxygen line [OIV] at 25.89 µm, all of which are associated with a hot ionised gas source. The continuum in this component from the sets containing nine and ten components varies from the others in that the continuum does not start until 13 µm. This coincides with the appearance of the ninth and tenth components which show similar features. The hot dust continuum peaks at a wavelength similar to that of AGN tori, while the hot ionised gas emission lines have also typically been associated with AGN. The appearance of both in one component is consistent with the idea they are correlated.

The second component shows silicate emission features at 10 and 18 µm due to stretching and bending of the Si-O and O-Si-O bonds respectively. Silicate emission is typically associated with emission from very hot dust, found on the inner surface of AGN tori or narrow line regions (Mason et al. 2009).

The third component captures the 6.2, 7.7, 8.6, 11.3, 12.7, 16.4 and 17.0 µm PAH features, and a cold dust slope at longer wavelength. There is also emission from the Argon line [ArII] at 6.89 µm and Sulphur line [SIII] at 18.71 µm. Its shape is similar to the Brandl et al. (2006) average starburst template, based on 13 starburst galaxies. The ratios of the PAH features are very similar amongst component sets, but the dust slope decreases with number of components. The reduction in dust slope for more complex NMF sets coincides with rising continuums seen in the fourth, sixth and seventh components.

The fifth component shows continuum emission up to 7 µm before dropping off at 10 µm. It also shows strong emission from the Sulphur line [SIV] at 10.51 µm. The remainder of the spectrum is noisy and featureless.

The eighth, ninth and tenth components show similarities to the first component. They show varying amounts of emission from the Neon lines, while the merged Oxygen and Iron lines appear as emission in the ninth component and absorption in the tenth. The variation of the first component in the nine- and ten-component sets compared to the other NMF sets is a result of the introduction of the ninth and tenth components and occurs because the NMF algorithm is using the freedom of extra components to break down the first into sub components.

The first two components both show features associated with hot dust and gas emission and are likely to be related to AGN emission. The unified model of AGN predicts silicate emission from type 1 AGN and silicate absorption in type 2 AGN. More recently, the IRS spectra of type 2 quasi-stellar objects (QSOs) have shown silicate emission (Sturm et al. 2006). Schweitzer et al. (2008) have shown that the IRS spectra of 23 QSOs can be modelled with dusty narrow line region models, while Mason et al. (2009) and Mor et al. (2009) showed that clumpy torus models could also provide silicate emission for both type 1 and type 2 AGN. The fact we see a relatively stable silicate emission component amongst different NMF sets would suggest that silicate emission is occurring in more than just type 1 AGN and is a fundamental spectral component.

The third component is the main star formation component. It is dominated by PAH emission, often used as an indicator of star formation (e.g. Roussel et al. 2001; Peeters et al. 2004; Calzetti et al. 2005; Kennicutt et al.
2009), and pre-dominantly comes from photo-dissociation regions (PDRs)(Roussel et al. 2007; Peeters 2011). For simpler NMF sets,the component also contains a rising continuum at longer © , 1– ?? earning the Fundamental Components of Galaxies with NMF Figure 3.

The derived NMF spectral components for sets

NMF - NMF . Each NMF set is colour coded, with components ordered bysimilarity. For example, the ﬁve components of NMF are the ﬁve brown spectra. Prominent spectral features in each component arealso labelled and regions aﬀected by broad silicate absorption and emission are highlighted in light blue. © , 1–, 1–

Figure 4.

The derived NMF spectral components for

NMF - NMF , using only objects dominated by the third PAH component seenin Figure 3. Each NMF set is colour coded, with components ordered by similarity. wavelengths due to colder dust emission (T ≈ K ), alsoassociated with star formation (e.g. Calzetti et al. 2007).For the more complex NMF sets, the rising dust continuumis given its own component (e.g. the sixth and seventh). Thisindicates that although the colder dust and PAH emissionboth trace star formation, they come from diﬀerent regionsand the NMF algorithm uses the additional freedom of ex-tra components to separate the two. We note that the PAHemission is extremely stable amongst all NMF sets and wedo not see signiﬁcant PAH emission in any other component.Previous studies show the ratio of PAH features vary withmetallically and radiation hardness (e.g. Smith et al. 2007),yet we have one component with PAH emission.To investigate the stability and lack of variation in thePAH emission features, we have re-run the NMF algorithmon objects from our original sample which are dominatedby the third component. Figure 4 shows the componentsfrom NMF to NMF for our reduced sample. The NMFalgorithm now ﬁnds two components with PAH emission.The ﬁrst shows emission at 6.2, 7.7, 8.6, 11.3, 12.7, 16.4 and 17.0 µ m, the second shows reduced emission for the 8.6, 11.3and 12.7 µ m PAH features and no emission at 16.4 and 17.0 µ m, while at longer wavelengths there is a rising continuum.The two new PAH components show a resemblance to thosefound in an NMF analysis of IRS spectro-imagery data forgalactic PDRs (Bern´e et al. 2007). Their ﬁrst component,interpreted as emission from deep within the PDR, showedbroad emission at 6.2, 7.8, and 11.4 µ m and a rising contin-uum. 
The second component contained emission from the 6.2, 7.6, 8.6, 11.3, 12.7 and 17.4 µm PAH features, and was shown to be more dominant in regions closer to the star.

By restricting the sample to objects dominated by star formation, the NMF algorithm does not need to use components to separate out hotter dust from AGN, and uses the additional freedom to separate out the PAH emission. The PAH emission in our original third component is therefore capturing the average PAH emission from galaxies.

Components four, six and seven from Figure 3 all contain rising continua, though with varying slopes, and capture dust emission at different temperatures. The fact we see numerous components with varying slopes suggests that the colder greybody emission of dust varies considerably amongst galaxies. The seventh component also contains bumps at around 8 and 12 µm. The bumps help build up a silicate absorption feature at 10 µm; this component is therefore important for dusty galaxies.

To further investigate the components, we can begin to look at how they contribute to different types of spectra. In order to simplify the analysis and to provide a simple set of components, we restrict our components to those from NMF_7. Our choice of seven is more qualitative than quantitative, as we have already shown that a quantitative analysis requires more than 14 components. To validate our choice, we have studied the median absolute residuals of NMF fits to the CASSIS sample with NMF_5 to NMF_9, shown in Figure 5. The residuals are high for some of the emission lines, particularly the PAH features, because our components capture the average line emission. However, we note that by seven components the residuals for the underlying continuum are down to 1σ and there is little advantage in using more complicated sets. By choosing seven, we believe we strike the balance between having enough simplicity to have a useful and physically intuitive NMF set of components, whilst being able to reconstruct the general spectral shape. The seven components are re-plotted in Figure 6.

Figure 5 (axes: Wavelength (microns) against Median(|x_i − NMF| / x_i); legend: NMF 5, NMF 6, NMF 7, NMF 8, NMF 9). The median absolute residuals, normalised by σ, for NMF sets NMF_5 - NMF_9. The residuals show all NMF sets fail to capture the variance in many of the emission lines. However, for NMF sets NMF_7 and above, the residuals for the underlying continuum are down to 1σ.

NMF fits to example galaxy spectra

We now examine the NMF fits to spectra of different types of galaxies in order to show how contributions from components vary and that our NMF_7 set can capture the general shape of different types of spectra. Our example fits, along with the corresponding residuals (i.e. data minus fit), can be found in Figures C1 and C2. The first plot in Figure C1 shows the NMF fit to the Blue Compact Dwarf (BCD) KUG 1013+381, observed as part of the IRS Guaranteed Time Observation (GTO) program. BCDs tend to be small galaxies with low metallicity that have undergone a recent burst of star formation but have suppressed star formation compared to typical starburst galaxies (Wu et al. 2006).

Our NMF fit shows component one makes a significant contribution, suggesting there is some hot dust. Component four also makes a large contribution, indicating emission from colder dust. Components six and seven, both containing dust slopes at longer wavelengths, also contribute. There is very little emission from component two, which we believe is associated with the inner surface of an AGN torus, and there is very little emission from the third 'PAH' component. The residual plot shows the NMF set can construct the underlying continuum; however, the [SIV], [NeIII] and [SIII] emission lines are underestimated.

Our second NMF fit is to the ULIRG and type 1 Seyfert galaxy, Markarian 231. Unlike KUG 1013+381, the second 'silicate emission' component makes a contribution, and the other, warmer dust components such as six and seven contribute as much power to the longer wavelengths as the fourth component. There is very little contribution from the third component. Residuals show the fit is reasonable except beyond 25 µm, where there appears to be some instrumental artefact in the spectra.

The third fit is to PG 1211+143, also a type 1 Seyfert galaxy. The second component dominates the emission of this object. The first, fifth and sixth components make comparable contributions. The residual plot shows that our NMF set slightly overestimates emission from the [NeIII] and [OIV] lines.
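The fitting step described above, where a fixed set of non-negative components is scaled to match an individual spectrum, can be sketched as follows. This is a minimal illustration only: the function names `fit_weights` and `median_normalised_residual`, the Gaussian toy components and the multiplicative update rule are assumptions made for the example, and are not taken from the released NMF software. The sketch assumes the spectrum and components are non-negative.

```python
import numpy as np


def fit_weights(spectrum, components, n_iter=500, eps=1e-12):
    """Non-negative weights w that minimise ||spectrum - w @ components||^2.

    spectrum   : (n_wavelengths,) non-negative flux densities
    components : (n_components, n_wavelengths) non-negative templates
    The templates stay fixed; only w is updated, using the multiplicative
    rule w <- w * (H x) / (H H^T w), which keeps every weight non-negative.
    """
    x = np.asarray(spectrum, dtype=float)
    H = np.asarray(components, dtype=float)
    w = np.full(H.shape[0], x.mean() / (H.mean() + eps))  # positive start
    Hx = H @ x
    HHt = H @ H.T
    for _ in range(n_iter):
        w = w * Hx / (HHt @ w + eps)
    return w


def median_normalised_residual(spectra, fits, sigma):
    """Median over objects of |data - fit| / sigma, per wavelength bin."""
    return np.median(np.abs(spectra - fits) / sigma, axis=0)


# Toy example: three Gaussian "components" on a 5-35 micron grid
# (hypothetical shapes, not the published NMF components).
wavelength = np.linspace(5.0, 35.0, 301)
centres = np.array([8.0, 15.0, 25.0])
components = np.exp(-0.5 * ((wavelength - centres[:, None]) / 2.0) ** 2)

true_weights = np.array([2.0, 0.5, 1.0])
spectrum = true_weights @ components  # noise-free synthetic spectrum
weights = fit_weights(spectrum, components)
residual = median_normalised_residual(
    spectrum[None, :], (weights @ components)[None, :], sigma=0.05
)
```

The fitted `weights` are the co-ordinates of the object in the space spanned by the components, and `residual` is the per-wavelength quantity of the kind plotted in Figure 5, here for a single synthetic object.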

Figure 6. The 7 components from NMF_7, corresponding to the yellow components in Figure 3. The new colour coding is used to identify the different components in subsequent figures.

The fit to the ULIRG and type 2 Seyfert galaxy, Markarian 273, is dominated by emission from the fourth 'cold dust' component. Residuals show the NMF components underestimate some of the emission lines, particularly the [NeIII] line. The continuum appears to be well reconstructed by the NMF components.

Our final two fits in Figure C1 are to the starburst galaxies NGC 3301 and NGC 3256. The third component contributes at the shorter wavelengths, while the colder dust components, four and six, contribute at longer wavelengths. The residuals show the components are capable of reconstructing the continuum, but fail to capture the emission lines accurately.

Four additional example fits are shown in Figure C2. The first is to the LINER 3C270. The first, second and fifth components are the main contributors, while the residuals show the fit can reconstruct the continuum but underestimates the 12.8 µm Neon line. The spectrum of the submillimeter galaxy GN26 covers a short wavelength region and is quite noisy. Our final two fits are to the quasar PG0804+761 and the ULIRG IRAS 10378+1108. As with other type 1 AGN, the second component dominates the emission of the quasar. Our NMF set fails to model the full width of the very broad silicate emission feature at 9.7 µm; however, the rest of the continuum is well reconstructed. In our NMF fit to the ULIRG IRAS 10378+1108, one component dominates the emission, while the residuals show the NMF set slightly overestimates the greybody emission longwards of 27 µm.

In addition to galaxy spectra, we also fit our NMF set to the average spectral templates from the IRS spectral ATLAS of galaxies (Hernán-Caballero & Hatziminaoglou 2011). Table C2 in Appendix C gives more details on the sources used for the ATLAS average templates. As can be seen in Figure C3, the change in contributions for different types of object is consistent with those in Figure C1.
The continuum is well reconstructed for all average templates; however, the residuals show the emission lines are not accurately reconstructed, especially for the average LINER template.

Overall, our fits show that for Seyfert galaxies the first and second components, along with the warmer dust components five, six and seven, are all important, though their contributions vary. For the starburst galaxies, the third and fourth components play a more important role. The residual plots show that our NMF set is capable of reconstructing the continuum to a reasonable accuracy; however, some of the emission lines are not always fitted well. This is to be expected since, as we have previously shown, the components capture the 'average emission' of spectral lines. To accurately fit continuum and emission lines, our Bayesian evidence calculation has shown we would need an NMF set with more than 14 components. The goal of this paper is to find a physically intuitive component set, which requires a balance between the number of components and the ability to reconstruct spectra. We believe Figures 5 and C1 show our NMF_7 set fits this requirement.

To illustrate how the components contribute to a number of objects, we can use the weightings provided by the NMF fits as multidimensional co-ordinates. Each galaxy is now a point in a seven dimensional space we call NMF space. We use classifications from the IRS spectral ATLAS of galaxies (Hernán-Caballero & Hatziminaoglou 2011) to investigate what regions of NMF space are associated with different types of galaxies. The ATLAS collection contains spectra from a number of observing programs. They provide optical classifications from the literature and three additional MIR classifications, MIR SB, MIR AGN1 and MIR AGN2, based on the fractional contribution from a PDR component used during spectral decomposition.
The AGN subgroups MIR AGN1 and MIR AGN2 are subsets of AGN, classified by whether spectra show silicate emission or silicate absorption. Figure 7 shows how objects from the ATLAS groups MIR AGN1, MIR AGN2, MIR SB, Sbrst, Sy1 and Sy2 are distributed in the seven dimensional NMF space.

As can be seen in Figure 7, the Seyfert 1 and MIR AGN1 objects all lie in a region with low contribution from NMF , high contribution from NMF and very little contribution from NMF . The Seyfert 2 and MIR AGN2 objects are found in a region with a higher contribution in NMF , less or very little contribution from NMF and very little contribution from NMF . Starburst like objects, on the other hand, require little contribution from either NMF or NMF , and a high contribution from NMF .

We note that the components most influential in separating out the different objects are components one, two and three. Less influential but still significant are the colder dust components NMF and NMF . They contribute very little to objects classified as AGN, while the contribution for starbursts shows a large variation. This fits in with our earlier interpretation that these two components represent obscured star formation components which vary more than the PAH features seen in NMF . The remaining two components are the least significant. There is a slight difference in contribution between AGN 1 objects and the other two classes, while NMF separates out type 1 and type 2 objects to a certain extent.

We have shown NMF space is capable of separating out different types of objects. We now model how objects separate out in this multidimensional space by applying the parametric technique Gaussian mixtures modelling (GMM). GMM has already been successfully applied to the colour and redshift space of galaxies (Davoodi et al. 2006).
GMM assumes the distribution of objects can be modelled by a series of clusters, each described by a multidimensional Gaussian. We use the GMM software from the Auton Lab (Moore 1999) to model the distribution of the CASSIS sample in our 7 dimensional NMF space. The software uses the Expectation Maximisation algorithm to learn the position and size of the clusters and uses the Akaike Information Criterion (AIC; Akaike 1974) to select how many are needed to describe the distribution of objects.

We find that 8 clusters are required to adequately model the distribution. Each cluster describes a probability density function (PDF) for any position in NMF space. By using an object's position in NMF space, we can assign it to one of the 8 clusters. Table 1 shows how some of the ATLAS classified sources are distributed across the 8 clusters, with clusters ordered by their normalisation (i.e. how many objects are in that cluster). As can be seen in Table 1, the majority of objects are contained within the first five clusters. The normalisations associated with the remaining clusters (i.e. how many objects they capture) are also very small. We therefore use the first five clusters to define a classification scheme.

Every position in NMF space has eight PDF values associated with it (one for each cluster). Using the highest probability density provides the optimal (maximum likelihood) classification. However, since the PDFs overlap, this will not provide the best classification for the population statistics. We therefore take the same approach as Davoodi et al. (2006) and randomly assign each galaxy to a cluster, with probability proportional to the PDF values at the galaxy's position in NMF space.

The location in NMF space of the first five clusters can be seen in Figure 8. Each cluster is represented by its 1 sigma contour. The CASSIS sample used for training the Gaussian Mixtures modelling is also plotted.

As can be seen in Figure 8 and the classifications in Table 1, cluster one captures nearly all the Seyfert one galaxies, and some Seyfert two galaxies. Cluster two contains a significant number of objects previously classified as starbursts, while cluster three contains a large proportion of the remaining Seyfert two objects. The position of cluster four indicates this could be an intermediary group between typical Type one and Type two galaxies. The fifth cluster contains just over a fifth of those objects classified as starbursts in the MIR and no optically classified starbursts. Its position in NMF space also suggests it captures those objects which are dusty starbursts.

We conclude that cluster one is related to Seyfert 1 galaxies, cluster two to starbursts, cluster three to Seyfert two galaxies and cluster four to galaxies showing signs of both Seyfert one and Seyfert two (e.g. Type 1.5). The fifth cluster captures those galaxies which are dusty and obscured. The clusters can be used as a classification scheme by taking any IRS galaxy spectrum, fitting it with the NMF_7 set and using the corresponding weights to identify which cluster the object is associated with.

We compare our classification scheme to the Spoon et al. (2007) diagram, which classified ULIRGs via the strength of their 9.7 µm silicate feature and 6.2 µm equivalent width. Figure 9 shows 89 ULIRGs in the Spoon et al. (2007) diagram, colour coded by our classification. Seyfert one classified galaxies lie on the far left of the bottom horizontal branch, corresponding to a 1A and 1B Spoon classification; Seyfert two classified galaxies span the horizontal branch and 2B Spoon classification. The starburst classified objects are located in the far bottom right of the Spoon diagram, while dusty objects are spread out across the diagonal branch. Only three objects are classified as Type 1.5 and they lie on the horizontal branch, in between the Seyfert one and Seyfert two classified galaxies.

Comparing the success rates of different classification schemes without knowing the 'true' classification is always problematic; however, our classification scheme is consistent with the Spoon et al. (2007) interpretation of Figure 9 in terms of the location of starbursts, AGN dominated objects and dusty objects. Unlike the Spoon diagram, our classification scheme can also distinguish between Seyfert one and Seyfert two galaxies.

We have shown our classification scheme is just as successful as the Spoon classification. However, our classification has three distinct advantages over Spoon et al. (2007). First, Spoon et al. (2007) only use the 9.7 µm silicate feature and 6.2 µm PAH equivalent width to separate out classes. By using the NMF components as a basis for our GMM based classification scheme, we make use of the whole MIR region to classify objects. This also enables us to classify objects where the 9.7 µm silicate feature and 6.2 µm PAH equivalent width are not available or difficult to measure. Secondly, our classification scheme is modelled on the number density of our CASSIS sample in NMF space. Since our sample contains a large variety of objects, any sample biases will have a small effect on the outcome of our classification scheme. The Spoon classes, on the other hand, are chosen based on arbitrary cuts in the 9.7 µm silicate feature and 6.2 µm PAH equivalent width. Thirdly, because our clusters describe a probability density function, we can give an indication of how likely a galaxy is to be found in any one of the five clusters. For example, in Table 2 we show the probability of being in any of the five clusters for some famous objects.

Figure 7. The distribution of objects/spectra from the ATLAS groups MIR AGN1, MIR AGN2, MIR SB, Sbrst, Sy1 and Sy2 in our 7D space defined by the NMF_7 set. Symbols and colours for the different groups are described in the legend. The position of the average template for each group is marked by a larger symbol.

GMM and ATLAS classification for 7 templates

Cluster  prob.  Sy1   Sy2   MIR AGN1  MIR AGN2  MIR SB  Sbrst
1        0.301  90.9  37.7  97.5      52.3      1.6     6.2
2        0.287  0.0   17.0  0.0       1.7       43.6    68.8
3        0.156  0.0   28.3  0.8       24.1      11.7    12.5
4        0.147  9.1   17.0  0.8       8.6       21.4    6.2
5        0.080  0.0   0.0   0.0       8.0       20.2    0.0
6        0.022  0.0   0.0   0.8       2.3       0.8     6.2
7        0.004  0.0   0.0   0.0       1.1       0.8     0.0
8        0.003  0.0   0.0   0.0       1.7       0.0     0.0

Table 1.

The percentage of ATLAS classified objects in each cluster for 7 NMF templates. The first column indicates the cluster number. The second column shows the probability that a CASSIS object is in that cluster (i.e. how many objects can be found in it). The remaining columns contain the percentage of each ATLAS classification in each cluster.

                 Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 5
Object           Sy1        Sbrst      Sy2        Sy1.5      Dusty SB
Arp220           0.00       0.23       0.41       0.02       0.34
Mrk231           0.32       0.00       0.34       0.32       0.02
PG1211+143       0.92       0.00       0.00       0.08       0.00
IRAS10565+2448   0.00       0.71       0.25       0.00       0.04
IRAS10378+1109   0.00       0.01       0.06       0.00       0.93

Table 2.

The approximate probability of being in one of the five clusters in our GMM based classification scheme.

Figure 9.

The Spoon et al. (2007) diagram showing silicate strength versus the 6.2 µm PAH equivalent width. The plot is separated into the different Spoon classes and objects are colour coded by our GMM classification.

We make our classification tool publicly available on the arxiv and at https://github.com/pdh21/NMF_software/.

We have shown that the NMF components are capable of distinguishing between the objects showing extreme star formation or AGN activity. We now use them to introduce a diagnostic to quantify the contribution from star formation and AGN. Unlike other diagnostics, ours employs the whole MIR spectrum to disentangle the SF versus AGN contributions, and it is not based on specific features for which we need to know information on their origin.

For AGN, components one and two are the most important and bear the physical features we know to originate from AGN tori. We therefore adopt components one and two as the contribution from AGN. For star formation, the third component is the most important; however, we argue that the fourth and fifth components are also required as they contain the colder dust associated with obscured star formation. This is especially important for objects like Arp 220, which are known to be predominantly powered by star formation but have less than average PAH emission compared to other submillimeter galaxies (Pope et al. 2008). We do not include components six and seven in our diagnostic. These components contribute to both AGN and starbursts and we have interpreted them as arbitrary dust components that are not specifically associated with star formation or AGN activity. Our diagnostic is taken as the ratio of MIR luminosity from the following components, where $L_{\mathrm{NMF}_i}$ is the MIR luminosity of the $i$-th component:

\begin{equation}
\frac{\mathrm{star\ formation}}{\mathrm{AGN}} = \frac{L_{\mathrm{NMF}_3} + L_{\mathrm{NMF}_4} + L_{\mathrm{NMF}_5}}{L_{\mathrm{NMF}_1} + L_{\mathrm{NMF}_2}} \tag{10}
\end{equation}

We now show this diagnostic compared to other MIR diagnostic plots quantifying star formation and AGN contribution.

Farrah et al. (2009) applied Bayesian inferencing and graph theory to a data set of 102 mid-infrared spectra. By examining how position in the network was related to other parameters (e.g. infrared luminosity, optical spectral type and black hole mass), they concluded that the network depicted the evolutionary scheme of ULIRGs, with different branches relating to Starburst+AGN and luminous AGN.
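As an illustration of the diagnostic ratio in equation (10), the sketch below integrates each weighted component over wavelength and takes the ratio of the star formation sum to the AGN sum. Several things here are assumptions rather than the authors' implementation: the function names are hypothetical, the integrated flux of a weighted component is used as a stand-in for its MIR luminosity (the distance factor cancels in the ratio), and the index assignment follows the text (third, fourth and fifth components for star formation, with the first and second taken as the AGN pair).

```python
import numpy as np

# 0-based indices into the seven components.
SF_INDICES = [2, 3, 4]   # third, fourth and fifth components
AGN_INDICES = [0, 1]     # assumed: first and second components


def component_luminosities(weights, components, wavelength):
    """Trapezoidal integral over wavelength of each weighted component."""
    contrib = np.asarray(weights, dtype=float)[:, None] * components
    dlam = np.diff(wavelength)
    return np.sum(0.5 * (contrib[:, 1:] + contrib[:, :-1]) * dlam, axis=1)


def sf_agn_ratio(weights, components, wavelength):
    """Star formation to AGN ratio in the spirit of equation (10)."""
    lum = component_luminosities(weights, components, wavelength)
    return lum[SF_INDICES].sum() / lum[AGN_INDICES].sum()


# Toy check with flat, identical components so the answer is easy to verify.
wavelength = np.linspace(5.0, 35.0, 61)
flat_components = np.ones((7, wavelength.size))
toy_weights = np.array([1.0, 1.0, 2.0, 2.0, 2.0, 5.0, 5.0])
ratio = sf_agn_ratio(toy_weights, flat_components, wavelength)
```

With flat components each luminosity is proportional to its weight, so the toy ratio is (2 + 2 + 2) / (1 + 1) = 3, and the weights of the sixth and seventh components have no influence on it.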

Figure 8.

NMF space for 7 templates. CASSIS objects used for NMF and GMM are also plotted. The ellipses represent the different clusters found through Gaussian Mixtures Modelling.
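The classification step built on these clusters can be sketched as follows: evaluate each cluster's Gaussian density at an object's position, normalise the values to probabilities, and then either take the most likely cluster or draw a cluster at random with those probabilities, as in the Davoodi et al. (2006) approach adopted in the text. The function names, the two-dimensional toy clusters and the choice to multiply each density by its cluster normalisation are all assumptions for the purpose of the example; the fitted cluster parameters are not reproduced here.

```python
import numpy as np


def gaussian_pdf(x, mean, cov):
    """Density of a multivariate Gaussian at position x."""
    diff = np.asarray(x, dtype=float) - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)  # squared Mahalanobis distance
    return np.exp(-0.5 * (len(mean) * np.log(2.0 * np.pi) + logdet + maha))


def cluster_probabilities(x, norms, means, covs):
    """Probability of each cluster at x, proportional to norm * density."""
    dens = np.array(
        [n * gaussian_pdf(x, m, c) for n, m, c in zip(norms, means, covs)]
    )
    return dens / dens.sum()


def assign_cluster(probs, rng):
    """Randomly draw a cluster index with the given probabilities."""
    return int(rng.choice(len(probs), p=probs))


# Hypothetical two-cluster model in a 2-D space, for illustration only.
norms = np.array([0.5, 0.5])
means = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
covs = [np.eye(2), np.eye(2)]

p_near_first = cluster_probabilities([0.0, 0.0], norms, means, covs)
p_midway = cluster_probabilities([2.0, 2.0], norms, means, covs)
best_cluster = int(np.argmax(p_near_first))  # maximum likelihood choice
drawn = assign_cluster(p_midway, np.random.default_rng(0))
```

The normalised values play the role of the per-object cluster probabilities listed in Table 2. The random draw is what preserves population statistics when the cluster densities overlap, whereas `argmax` gives the single most likely class for one object.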

We now investigate how our NMF_7 set relates to the same network by decomposing the Farrah et al. (2009) sample with our NMF components and colour-coding the network by our NMF diagnostic. The connections are taken from Farrah et al. (2009) and we use the same Cytoscape software (available from http://cytoscape.org/) to produce the network. We note that our network is not identical to that in Farrah et al. (2009) due to the random seed starting position used by the spring-embedded algorithm in Cytoscape. The two main branches seen in Farrah et al. (2009) are still seen in Figure 10, with the lower and right hand branches corresponding to the Starburst+AGN and Luminous AGN branches respectively. Each galaxy is colour coded by our new NMF diagnostic.

As can be seen in Figure 10, our NMF diagnostic is consistent with the interpretation that star formation occurs on the left hand side of the network, with AGN activity increasing as we move to the right. The right hand branch appears to be AGN dominated, as was concluded in Farrah et al. (2009).

Figure 10.

The network diagram along with interpretation from Farrah et al. (2009). Starbursts dominate the left hand side of the network. As the AGN becomes more dominant, galaxies move to the right and finally on to one of the two branches. The nodes are colour coded by our NMF diagnostic. Nodes in black are where spectra are not available.

Figure 11.

The ratio of 15 to 5.5 µm continuum flux, against the ratio of the 6.2 µm PAH flux to 5.5 µm continuum flux, as seen in Armus et al. (2007). Points are colour coded by our NMF diagnostic.

Our second comparison is with the diagnostic diagram introduced by Laurent et al. (2000) and modified for Spitzer by Armus et al. (2007). The diagrams use the integrated continuum flux at 15 µm, the integrated continuum flux at 5.5 µm and the 6.2 µm PAH flux to indicate fractional contributions from AGN and starbursts. Figure 11 shows the same diagnostic plot for objects from the CASSIS database, with measurements of the continuum and 6.2 µm flux taken from the CASSIS database. The points are colour coded by our NMF diagnostic.

Objects with a high NMF SF-AGN ratio are located in the top right, while objects with a low NMF SF-AGN ratio lie in the bottom left. This is consistent with the simple linear mixing lines indicating AGN and star formation fraction seen in Armus et al. (2007) and Petric et al. (2010).

Our third and fourth comparisons are with diagnostic diagrams using emission lines. We plot all spectra in the CASSIS database that have a known redshift and measurable emission line. Line measurements are made with the PAHFIT software (Smith et al. 2007). Figure 12 shows the ratio of the Neon forbidden lines [NeV] and [NeII] against the PAH 6.2 µm equivalent width, colour coded by the NMF diagnostic. We indicate the fractional AGN and starburst contribution to the MIR luminosity from the [NeV]/[NeII] (vertical) and 6.2 µm PAH EQW (horizontal), assuming a simple linear mixing model. In each case, the 100%, 50%, 25% and 10% levels are marked. The 100% level is set by the average detected values for the [NeV]/[NeII] and PAH 6.2 µm equivalent width among AGN and starbursts respectively, as discussed in Armus et al. (2007).

We see that our diagnostic is consistent with star formation dominated objects being located in the bottom right of the plot, while objects with higher AGN contribution are located in the top left.

Figure 12. The [NeV]/[NeII] ratio vs the PAH 6.2 µm equivalent width. The points are those objects in the CASSIS database that have a redshift and an estimate for the three lines. The points are colour coded by our NMF diagnostic. We also show the 100%, 50%, 25% and 10% AGN and starburst linear mixing contributions taken from Armus et al. (2007).

The third diagnostic diagram uses the [OIV]/[NeII] ratio vs the PAH 6.2 µm equivalent width. As in Figure 12, we colour code the points by NMF diagnostic and indicate the fractional AGN and starburst contributions as discussed in Armus et al. (2007). Our plot can be seen in Figure 13. AGN dominated objects lie in the top left and star formation dominated objects in the bottom right, which is consistent with the interpretation of Armus et al. (2007). Our final comparison is with the Spoon et al. (2007) diagram, classifying ULIRGs via the strength of their 9.7 µm silicate feature and 6.2 µm equivalent width. Figure 14 shows 89 ULIRGs in the Spoon et al. (2007) diagram, colour coded by our NMF diagnostic.
Our NMF diagnostic suggests AGN dominated objects are on the horizontal branch, while objects on the diagonal branch appear to have significant activity from star formation and AGN. Objects dominated by star formation lie at the extreme right of the two branches. Our diagnostic is consistent with the interpretation of Spoon et al. (2007).

We have shown that our diagnostic for determining the AGN/star formation ratio is consistent with MIR diagnostic diagrams already in use. Our diagnostic, however, has the advantage that it uses a far greater wavelength range than current diagnostics and does not rely on specific line measurements. By using 5 of the 7 components in NMF_7, our diagnostic is also flexible enough to account for the difference in spectra amongst star formation or AGN dominated objects.

Figure 13. The [OIV]/[NeII] ratio vs the PAH 6.2 µm equivalent width. The points are those objects in the CASSIS database that have a redshift and an estimate for the three lines. The points are colour coded by our NMF diagnostic. We also show the 100%, 50%, 25% and 10% AGN and starburst linear mixing contributions taken from Armus et al. (2007).

Figure 14. The Spoon et al. (2007) diagram showing silicate strength versus the 6.2 µm PAH equivalent width. The plot is separated into the different Spoon classes and objects are colour coded by the NMF diagnostic.

We have carried out the first empirical attempt at learning the fundamental MIR spectral components of galaxies via the multivariate analysis technique, NMF. We have chosen NMF as the most appropriate matrix factorisation technique for our problem, as the non-negative constraints required by the algorithm more closely resemble the physical process of emission in the MIR than techniques used in previous studies (Wang et al. 2011; Hurley et al. 2012). The NMF algorithm has been applied to 729 galaxy spectra, taken from the CASSIS database (Lebouteiller et al. 2011) with spectral redshifts ranging from (0 . < z < . NMF - NMF . We find that despite an increase in the allowed number of components, many of the components remain similar. For example, similar counterparts to components in NMF can be found in NMF and above, the sixth component in NMF can be found in NMF and above and so on. Finding similar components, despite an increase in flexibility, suggests these components are fundamental spectral components.

We find the components also have a clear, physical interpretation. The first component contains the forbidden fine structure lines associated with narrow line regions and AGN as well as a hot dust continuum also typical of AGN tori. The second common component shows silicate emission at 10 and 18 µm and is indicative of the warm dust associated with either the inner wall of the AGN torus or narrow line region clouds.
The third component is a star formation component, containing all of the PAH and molecular hydrogen emission lines, found near PDRs. As the number of components is increased, the colder dust slope is removed to the sixth and seventh components. We interpret this as the separation of an unobscured star-forming component (or PDR) from an obscured star-forming component showing colder dust.

Re-running the NMF algorithm on objects dominated by star formation, we show that the PAH emission begins to separate out into two components, which show similar features to the two different PDR components found in Berné et al. (2007).

We have shown that a simpler NMF set with seven components is capable of reproducing the general continuum shape for a variety of extragalactic spectra seen in the MIR, though the components struggle with the variation in emission lines. By examining the contributions each component makes to well known objects and previously classified samples, we find different types of objects lie in different regions of 'NMF space'.

Using Gaussian Mixtures modelling, we provide a classification scheme that uses all seven components to separate objects into five different clusters: a Seyfert one cluster, a Seyfert two cluster, a starburst cluster, a dusty and obscured cluster and a type 1.5 Seyfert cluster. Our classification outperforms the Spoon diagram in separating out Seyfert one and two like objects. Unlike the Spoon classification, ours uses the whole MIR region, allowing objects without the 9.7 µm silicate feature and 6.2 µm equivalent width to be classified. Our GMM based classification can also provide an estimate of the probability of finding a particular galaxy in one of the five clusters.

We also use five of the components to create a star formation/AGN diagnostic which performs well against current MIR diagnostic diagrams.
Our NMF based diagnostic has the advantage of considering a greater wavelength range, and can therefore be used for objects where specific emission features have not been observed, or where spectra are too noisy.

Our NMF components provide fundamental, physical components which are ideal for separating out different types of objects and investigating the power associated with AGN and star formation. They are linked to actual physical environments such as AGN and star formation, unlike templates based on specific objects (e.g. M82) or average templates based on a sample of galaxies. We believe our NMF set could be used to predict useful measures such as star formation rate and AGN luminosity and will investigate this in a future paper. We also believe our NMF set is ideal for more galaxy evolution based investigations, such as decomposing the MIR luminosity function into contributions from AGN and star formation. Our NMF components and code for classification are made available at https://github.com/pdh21/NMF_software/ and on the arxiv.

ACKNOWLEDGEMENTS

We thank the referee for the useful comments, which have improved the paper. We acknowledge support from the Science and Technology Facilities Council [grant numbers ST/F006977/1, ST/I000976/1]. This work is based on observations made with the Spitzer Space Telescope, which is operated by the Jet Propulsion Laboratory, California Institute of Technology under a contract with NASA.

REFERENCES

Akaike H., 1974, IEEE Trans. Autom. Control, 19, 716
Alonso-Herrero A., et al., 2011, arXiv, astro-ph.CO, accepted for publication in ApJ
Alonso-Herrero A., Pereira-Santaella M., Rieke G. H., Rigopoulou D., 2012, The Astrophysical Journal, 744, 2
Armus L., et al., 2007, The Astrophysical Journal, 656, 148
Berné O., et al., 2007, Astronomy and Astrophysics, 469, 575
Blanton M. R., Roweis S., 2007, Astron. J., 133, 734
Brandl B. R., et al., 2006, The Astrophysical Journal, 653, 1129
Bromley B. C., Press W. H., Lin H., Kirshner R. P., 1998, The Astrophysical Journal, 505, 25
Calzetti D., et al., 2005, The Astrophysical Journal, 633, 871
Calzetti D., et al., 2007, The Astrophysical Journal, 666, 870
Chiar J. E., Tielens A. G. G. M., 2006, The Astrophysical Journal, 637, 774
Connolly A. J., Szalay A. S., Bershady M. A., Kinney A. L., Calzetti D., 1995, Astronomical Journal, 110, 1071
Dale D. A., et al., 2006, The Astrophysical Journal, 646, 161
Davoodi P., et al., 2006, The Astronomical Journal, 132, 1818
Engelbracht C. W., Gordon K. D., Rieke G. H., Werner M. W., Dale D. A., Latter W. B., 2005, The Astrophysical Journal, 628, L29
Farrah D., et al., 2007, The Astrophysical Journal, 667, 149
Farrah D., et al., 2009, The Astrophysical Journal, 700, 395
Farrah D., et al., 2008, The Astrophysical Journal, 677, 957
Farrah D., et al., 2012, The Astrophysical Journal, 745, 178
Feroz F., Hobson M. P., Bridges M., 2008, arXiv, astro-ph
Genzel R., et al., 1998, Astrophysical Journal, 498, 579
Hernán-Caballero A., Hatziminaoglou E., 2011, arXiv, astro-ph.CO
Houck J. R., et al., 2004, The Astrophysical Journal Supplement Series, 154, 18
Hurley P. D., Oliver S., Farrah D., Wang L., Efstathiou A., 2012, arXiv, astro-ph.CO
Kennicutt R. C., et al., 2009, The Astrophysical Journal, 703, 1672
Kessler M. F., et al., 1996, Astron. Astrophys., 315, L27
Laurent O., Mirabel I. F., Charmandaris V., Gallais P., Madden S. C., Sauvage M., Vigroux L., Cesarsky C., 2000, Astronomy and Astrophysics, 359, 887
Lebouteiller V., Barry D. J., Spoon H. W. W., Bernard-Salas J., Sloan G. C., Houck J. R., Weedman D. W., 2011, The Astrophysical Journal Supplement, 196, 8
Lee D., Seung S., 1999, Nature, 401, 788
Lee D., Seung S., 2001, in NIPS, MIT Press, pp. 556–562
Lutz D., Spoon H. W. W., Rigopoulou D., Moorwood A. F. M., Genzel R., 1998, The Astrophysical Journal, 505, L103
Madden S. C., Galliano F., Jones A. P., Sauvage M., 2006, Astronomy and Astrophysics, 446, 877
Mason R. E., Levenson N. A., Shi Y., Packham C., Gorjian V., Cleary K., Rhee J., Werner M., 2009, The Astrophysical Journal Letters, 693, L136
Moore A., 1999, in Advances in Neural Information Processing Systems, Kearns M., Cohn D., eds., Morgan Kaufman, 340 Pine Street, 6th Fl., San Francisco, CA 94104, pp. 543–549
Mor R., Netzer H., Elitzur M., 2009, The Astrophysical Journal, 705, 298
Pan B., Lai J., Chen W.-S., 2011, Pattern Recognition, 44, 2800, Semi-Supervised Learning for Visual Content Analysis and Understanding
Peeters E., 2011, arXiv, astro-ph.GA
Peeters E., Spoon H. W. W., Tielens A. G. G. M., 2004, The Astrophysical Journal, 613, 986
Petric A. O., et al., 2010, arXiv, astro-ph.GA
Pope A., et al., 2008, The Astrophysical Journal, 675, 1171
Rigopoulou D., Spoon H. W. W., Genzel R., Lutz D., Moorwood A. F. M., Tran Q. D., 1999, The Astronomical Journal, 118, 2625
Rosenberg M. J. F., Berné O., Boersma C., Allamandola L. J., Tielens A. G. G. M., 2011, arXiv, astro-ph.GA
Roussel H., et al., 2007, arXiv, astro-ph
Roussel H., Sauvage M., Vigroux L., Bosma A., 2001, Astronomy and Astrophysics, 372, 427
Rovilos E., et al., 2012, Astronomy & Astrophysics, 546,

APPENDIX A: NON-LINEAR MATRIX FACTORISATION TECHNIQUES

ICA, PCA and NMF are linear models and cannot efficiently model non-linearities such as dust extinction. Over the last decade, non-linear matrix factorisation techniques have been developed to handle certain non-linear situations. All of these non-linear techniques use kernels to map data with non-linear structure into a kernel feature space, where the structure becomes linear. Techniques such as PCA or NMF can then be performed in the kernel feature space to recover the structure. These types of techniques are suited to problems where the non-linearity is of parametric form, e.g. points distributed along a circle. Dust extinction is an exponential relationship and is unsuited to this type of technique (Binbin Pan, private communication).

[Figure B2 plot. Vertical axis: BIC/AIC value. Curves: AIC with extinction, BIC with extinction, AIC without extinction, BIC without extinction.]

Figure B2. The AIC and BIC for the non-linear simulations. Both the BIC and AIC for spectra without extinction indicate 5 components, as expected. The set with extinction requires around 15-20.

APPENDIX B: NMF AND EXTINCTION SIMULATION

To explore whether extinction can cause problems with our NMF analysis, we simulate extinction via equation 3 described in section 3.

Our simulation is divided into two parts. The first part assumes galaxy spectra are a linear combination as described in equation 2, while the second assumes equation 3 is valid. To simulate the spectra, we take the components of NMF set NMF and linearly combine them with weights randomly sampled from a distribution based on those found in the real sample. We do this 500 times to create 500 unique galaxy spectra.

The second part of our simulation involves adding extinction to the simulated spectra as described in equation 3 in section 3. τ(λ) is defined by the Galactic Centre extinction law of Chiar & Tielens (2006).

We then carry out the NMF algorithm on both the unextincted and extincted spectra. We run the algorithm for NMF - NMF and use the simplified model selection measures: the Akaike Information Criterion (AIC; Akaike 1974) and the Bayesian Information Criterion (BIC; Schwarz 1978), defined as follows:

$$\mathrm{AIC} \equiv -2 \ln L_{\max} + 2k + \frac{2k(k+1)}{N-k-1} \qquad \mathrm{(B1)}$$

$$\mathrm{BIC} \equiv -2 \ln L_{\max} + k \ln N \qquad \mathrm{(B2)}$$

where $L_{\max}$ is the maximum likelihood, $N$ is the number of datapoints and $k$ is the number of parameters. The minimum value of the AIC or BIC corresponds to the optimum model. Figure B2 shows both the BIC and AIC for both sets of simulated spectra. As expected, the BIC and AIC indicate the spectra without extinction can be adequately described by the NMF set with 5 components. For spectra with extinction, the BIC and AIC do not level off until NMF - NMF . This suggests that extinction could be a factor in driving our linear methods to more templates than might be required by the underlying physical conditions.

Figure B1. The 30 components of NMF set
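
As a rough illustration of the procedure described in Appendix B, the following Python sketch simulates spectra as non-negative linear combinations of components, applies extinction, runs NMF for a range of component numbers and evaluates the AIC and BIC. Everything specific in it is an assumption made for illustration only: the three toy components, the weight distribution, the toy extinction curve (standing in for the Chiar & Tielens (2006) law), the screen geometry exp(-τ), the Gaussian likelihood with known noise σ (so that -2 ln L_max equals χ² up to a constant), and the parameter count k, taken here as the total number of entries in the weight and component matrices. The factorisation uses the multiplicative update rules of Lee & Seung (2001), which may differ from the implementation used for the results in this paper.

```python
import numpy as np

rng = np.random.default_rng(42)
wave = np.linspace(5.0, 30.0, 50)  # toy wavelength grid in micron


def bump(centre, width):
    """Gaussian-shaped feature on the wavelength grid."""
    return np.exp(-0.5 * ((wave - centre) / width) ** 2)


# Three toy non-negative components (hypothetical, NOT the published NMF set).
true_H = np.vstack([
    (wave / 30.0) ** 2 + 0.1,                # rising continuum
    bump(10.0, 1.5) + 0.1,                   # broad emission feature
    bump(7.7, 0.5) + bump(11.3, 0.5) + 0.1,  # narrow PAH-like features
])

n_spec, sigma = 200, 0.01
weights = rng.uniform(0.5, 1.5, size=(n_spec, 3))  # assumed weight distribution
noise = rng.normal(0.0, sigma, size=(n_spec, wave.size))
linear = weights @ true_H

# Toy extinction curve tau(lambda) with a random optical depth per spectrum.
tau = rng.uniform(0.0, 1.0, size=(n_spec, 1)) * (bump(9.7, 1.0) + 0.3 * bump(18.0, 2.0))
spectra = {
    # np.maximum guards against rare negative values introduced by the noise.
    "without extinction": np.maximum(linear + noise, 0.0),
    "with extinction": np.maximum(linear * np.exp(-tau) + noise, 0.0),
}


def nmf(V, n_comp, n_iter=300, seed=0):
    """Lee & Seung multiplicative updates for V ~ W @ H with W, H >= 0."""
    r = np.random.default_rng(seed)
    W = r.random((V.shape[0], n_comp)) + 1e-3
    H = r.random((n_comp, V.shape[1])) + 1e-3
    eps = 1e-12  # avoids division by zero
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def aic_bic(V, W, H, noise_sigma):
    """Corrected AIC and BIC, assuming Gaussian errors of known sigma."""
    n = V.size                # number of datapoints N
    k = W.size + H.size       # assumed parameter count
    neg2lnL = np.sum(((V - W @ H) / noise_sigma) ** 2)  # chi^2
    aic = neg2lnL + 2 * k + 2 * k * (k + 1) / (n - k - 1)
    bic = neg2lnL + k * np.log(n)
    return aic, bic


results = {}
for label, V in spectra.items():
    scores = [aic_bic(V, *nmf(V, n_comp), sigma) for n_comp in range(1, 6)]
    results[label] = np.array(scores)  # shape (5, 2): columns are AIC, BIC
    best = 1 + int(np.argmin(results[label][:, 1]))
    print(f"{label}: BIC-preferred number of components = {best}")
```

The number of components preferred for each set depends on the toy assumptions above (noise level, extinction strength, number of iterations), so the printed values should be read as qualitative only.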

Figure C1. NMF fits to the Blue Compact Dwarf: KUG 1013+381; Seyfert type 1 galaxies: Markarian 231 and PG1211+143; Seyfert type 2 galaxy: Markarian 273; and starburst galaxies: NGC3310 and NGC3256. Each spectrum is plotted as a black solid line and the NMF fit as a black dashed line. The contribution from each component is also shown, with the same colour coding as in Figure 6. The residuals (data-fit) are plotted below each fit.

APPENDIX C:

Four additional example NMF fits. LINER: 3C270; submillimetre galaxy: SMG GN26; quasar: PG0804+761; and ULIRG: IRAS 10378+1108. Each spectrum is plotted as a black solid line and the NMF fit as a black dashed line. The contribution from each component is also shown, with the same colour coding as in Figure 6.

Figure C3. NMF fits to the average templates from Hernán-Caballero & Hatziminaoglou (2011) (information on the sample can be found in Table C2). Each spectrum is plotted as a black solid line and the NMF fit as a black dashed line. The contribution from each component is also shown, with the same colour coding as in Figure 6.

Table C1.

name      Nsources  z_min  <z>    z_max  λ_min [µm]  λ_max [µm]  comments
Sy1       11        0.002  0.041  0.205  5.2         24.6        Seyfert 1 with νLν(7 µm) < erg s−
Sy1x      72        0.003  0.091  0.371  5.0         24.6        intermediate Seyfert types (1.2, 1.5, 1.8, 1.9)
Sy2       53        0.003  0.045  1.140  5.2         24.6        Seyfert 2 with νLν(7 µm) < erg s−
LINER     16        0.001  0.034  0.322  5.2         24.6        LINER with νLν(7 µm) < erg s−
QSO       125       0.020  1.092  3.355  2.5         24.6        QSO1 and Seyfert 1 with νLν(7 µm) > erg s−
QSO2      65        0.031  1.062  3.700  3.6         24.6        QSO2 and Seyfert 2 with νLν(7 µm) > erg s−
Sbrst     16        0.001  0.091  1.316  5.2         24.6        Starburst or HII with νLν(7 µm) < erg s−
ULIRG     184       0.018  0.730  2.704  4.5         24.6        ULIRG (low and high redshift sources)
SMG       51        0.557  1.869  3.350  4.8         12.0        Submillimetre Galaxies
MIR AGN1  119       0.002  0.455  2.190  4.0         24.6        MIR selected AGN with silicate emission
MIR AGN2  160       0.002  0.549  2.470  4.5         24.6        MIR selected AGN with silicate absorption
MIR SB    257       0.001  0.413  2.000  4.6         24.6        MIR selected starbursts