Unsupervised Functional Data Analysis via Nonlinear Dimension Reduction
PREPRINT
Moritz Herrmann ∗ [email protected] Department of StatisticsLudwig-Maximilians-UniversityMunich, Germany
Fabian Scheipl [email protected]
Department of Statistics, Ludwig-Maximilians-University, Munich, Germany

ABSTRACT
In recent years, manifold methods have moved into focus as tools for dimension reduction. Assuming that the high-dimensional data actually lie on or close to a low-dimensional nonlinear manifold, these methods have shown convincing results in several settings. This manifold assumption is often reasonable for functional data, i.e., data representing continuously observed functions, as well. However, the performance of manifold methods recently proposed for tabular or image data has not yet been systematically assessed in the case of functional data. Moreover, it is unclear how to evaluate the quality of learned embeddings that do not yield invertible mappings, since the reconstruction error cannot be used as a performance measure for such representations. In this work, we describe and investigate the specific challenges for nonlinear dimension reduction posed by the functional data setting. The contributions of the paper are three-fold: First, we define a theoretical framework which allows us to systematically assess specific challenges that arise in the functional data context, transfer several nonlinear dimension reduction methods for tabular and image data to functional data, and show that manifold methods can be used successfully in this setting. Second, we subject performance assessment and tuning strategies to a thorough and systematic evaluation based on several different functional data settings and point out some previously undescribed weaknesses and pitfalls which can jeopardize reliable judgment of embedding quality. Third, we propose a nuanced approach to make trustworthy decisions for or against competing nonconforming embeddings more objectively.

∗ Corresponding author
Keywords: Dimension reduction · Functional data analysis · Manifold methods · Unsupervised learning
The ever growing amount of easily available high-dimensional data has led to an increasing interest in methods for dimension reduction in several contexts, for example image processing [13] and single cell data [2, 16]. Next to standard dimension reduction methods such as Principal Component Analysis (PCA) and Multidimensional Scaling (MDS), manifold methods have moved into focus in recent years. If the assumption that high-dimensional data actually lie on or close to a lower-dimensional Riemannian manifold holds, i.e., if the data have low intrinsic dimension, nonlinear dimension reduction methods are often capable of detecting this intrinsic low-dimensional structure even if standard, in particular linear, methods fail to do so. In this paper, we describe and assess a general approach for extending the established and state-of-the-art manifold methods ISOMAP [37], DIFFMAP [9], t-SNE [28], and UMAP [29] to functional data, and use MDS as a default reference method for benchmarking.

Functional data analysis (FDA) [34, 41, e.g.], an active field of research in statistics with many close connections to time series analysis, focuses on data in which the units of observation are realizations of stochastic processes over compact domains. This is another data type for which the manifold assumption is often reasonable: On the one hand, such data are infinite-dimensional in theory and typically very high-dimensional in practice, since functional observations are usually recorded on fine and dense grids: For example, spectroscopic measurements are typically evaluated on thousands of electromagnetic wavelengths, and electrocardiograms measured at 100 Hz for 10 minutes would yield 60,000 grid points each.
On the other hand, such signals typically contain a lot of structure, and it is often reasonable to assume that only a few modes of variation suffice to describe most of the information contained in the data, i.e., such functional data often have low intrinsic dimension, at least approximately.

An important complication is that FDA often faces the challenge of two kinds of variation, both of which can be of major interest: amplitude (i.e., “vertical”) variation affecting the slope, level, and size of local extrema of a function, and phase (i.e., “horizontal”) variation affecting the location of extrema and inflection points. Phase variation, which can be conceptualized as elastic deformations of the domain of the functional observations, often results in complex nonlinear intrinsic structure [7]. As our results show, this is true even for fairly simple phase variation structures. Despite some prior work [7, 10] showing that manifold methods for functional data – specifically, functional versions of ISOMAP [37] – can successfully deal with structured phase variation and yield efficient and compact low-dimensional representations, and despite recent substantial progress in the development and application of manifold methods to tabular, image, and video data [16, 42, e.g.], manifold learning for functional data remains an underdeveloped topic. However, low-dimensional representations of functional data are highly relevant for real world problems. Finding reliable low-dimensional – especially 2- or 3-dimensional – representations of data is beneficial for visualization, description
and exploration purposes in general. In FDA settings, this is especially crucial as the visualization of large data sets of functional observations is particularly challenging and quickly overwhelms analysts with ostensible complexity even if the underlying structures are actually fairly low-dimensional and simple, cf. Figure 2. Moreover, finding informative low-dimensional representations of functional data is an essential preprocessing step, since these representations can be used as feature vectors in supervised machine learning algorithms which require tabular, not functional, data inputs [32].

In this work, we thoroughly assess whether manifold methods can be used to embed functional data, perform a careful evaluation of hyperparameter tuning approaches for functional manifold methods, and investigate the suitability of the derived embeddings in various settings. Specifically, we address the following questions:

(1) Are the manifold methods under investigation able to detect low-dimensional manifold structure of functional data? Special attention is given to assessing the effects of phase variation.

(2) To what extent can automatic tuning strategies replace laborious and subjective visual inspection in order to obtain reliable embeddings in unsupervised FDA settings?

The remainder of the paper is structured as follows: In Section 2 we specify notation and the theoretical framework, describe the embedding methods used, and give an overview of performance assessment and tuning approaches. Moreover, we motivate our study design. In Section 3 we describe the design of the synthetic data simulations and assess the tuning and embedding approaches in these settings, in which the “ground truth” is available for verification. The concepts and insights developed on synthetic data are then brought to bear on three real data sets in Section 4. The findings of the study are finally discussed in Section 5.
Nonlinear dimension reduction (NDR) [3, 23, e.g.] is based on the assumption that high-dimensional data observed in a D-dimensional space H actually lie on or close to a d-dimensional manifold M ⊂ H, with d < D [6]. One is then interested in finding an embedding e : H → Y from the high-dimensional space to a low-dimensional embedding space Y such that Y is as similar to M as possible.

In most NDR applications, one simply considers H = R^D. In a functional data setting, the situation is more involved: We define a d-dimensional parameter space Θ ⊂ R^d, while F = L²(T), the space of square integrable functions over the domain T, takes on the role of H, and φ : Θ → F is a mapping from the parameter space to the function space. We then observe functions x_i(t) ∈ F with x_i(t) = φ(θ_i), which can have a complex but intrinsically low-dimensional structure, depending on both the structure and dimensionality of Θ and the complexity of φ.

Transferring this to the NDR terminology, M = Θ, i.e., the low-dimensional manifold is the parameter space. However, the observed data are functions in the subspace M_F ⊂ F, i.e., using the terms of [7], a functional manifold. Thus, using manifold methods, an embedding e : M_F → Y can be constructed. Specifically, that means we have the mappings φ : Θ → M_F and e : M_F → Y, but only e : F → Y can be learned from the data
Figure 1: Framework for nonlinear dimension reduction of functional data.

(see Figure 1). The question we try to answer is this: how well can the underlying global structure of Θ be recovered in an embedding learned from data on M_F?

In general, however, it is not straightforward to define what “recovering well” is supposed to mean in a specific NDR setting [12], a rarely discussed but crucial aspect. In particular, the fact that manifolds which are locally homeomorphic to R^d by definition need not be homeomorphic to R^d globally has to be taken into account. For example, a 2-sphere, although locally homeomorphic to the plane R², cannot be embedded into R² globally by a single embedding e without distortions. In differential geometry terms, only specific manifolds M like the famous “Swiss roll” can be represented by a single chart. However, since learning an embedding function is roughly equivalent to estimating a chart of the data manifold, manifold learning faces an ill-posed problem if the atlas of a data manifold requires more than one chart.

Since it is not known in practice whether a single chart is sufficient or not, the assessment of the embeddings achieved by manifold methods must always consider both local and global perspectives: an embedding is (1) successful on a local level if the embedding e : H → R^d yields an embedding space Y in which local structures are similar to local structures in M ⊂ H, i.e., if e preserves neighborhoods of (small) membership size g, and (2) globally successful if Y is as close to isomorphic to M as possible, e.g., if an underlying “Swiss roll” manifold is “unrolled” into a plane. Note that the latter is not possible for every manifold.
Thus, we consider a learned embedding to be successful if the resulting configurations of data units in Y are as similar to the corresponding configurations in Θ as possible in the following sense: In the conducted simulation study, we use simple parameter manifolds Θ which are mostly homeomorphic to R^d with d ∈ {1, 2} and – in one case – homeomorphic to the circle. This allows us to evaluate embedding methods and tuning approaches for the functional data generated from Θ from a global perspective, both by visual inspection and quantitatively. Local characteristics are additionally investigated for the real data sets.

As phase variation typically transforms the domain non-linearly, phase-varying functional data are very likely to live on a non-simple functional manifold M_F that is no longer globally isomorphic to R^d even if the generating parameter manifold Θ is a simple linear subspace, an issue leading to additional complexity in the FDA setting. By restricting our simulation study in part to fairly simple, linear Θ, we are able to assess these non-obvious and previously undescribed effects of domain warping on the embeddings.
Since the methods we employ do not yield invertible embeddings e, we cannot evaluate them based on their reconstruction error E[L(x, e⁻¹(y))], where L is some loss function measuring the divergence between x and its reconstruction e⁻¹(y), which would objectively quantify the quality of an embedding in terms of the fidelity of its low-dimensional compression to the original data. Assessing embeddings qualitatively by visual inspection, as is widely done [6, 1, 29, e.g.], cannot be automated, and so it scales neither to large-scale comparison and benchmark studies nor to the important task of tuning a method’s hyperparameters.

Instead, several surrogate measures have been developed, often based on comparing the ranks of pairwise distances between the high-dimensional data space and the learned embedding space [39, 8, 24, 25, e.g.]. We employ measures based on the normalized local continuity meta-criterion (LCMC) [22], since they are parameter-free, yield a single scalar value – a property particularly desirable if the measure is supposed to be used for tuning – and allow assessing local and global performance. The LCMC is based on the measure

Q_NX(g) = (1 / (g n)) ∑_{i=1}^{n} |N_g^H(i) ∩ N_g^Y(i)|,

which quantifies the amount of overlap between the memberships of neighborhoods of a certain size in the two spaces [24, 19]. The g-neighborhood N_g(i) is defined as the set of those g objects which are closest to i in the respective space according to a suitable distance measure, and the sum over the cardinalities of the intersections, divided by n, measures the mean overlap between all such neighborhoods in the high-dimensional space H and the low-dimensional embedding space Y. The factor 1/g is a normalization factor. The normalized LCMC is then defined as

R_NX(g) = ((n − 1) Q_NX(g) − g) / (n − 1 − g),

which also accounts for random overlap [8, 22].
A value of 0 is expected for a random embedding, i.e., the agreement between g-neighborhoods in Y and in H is the same as that of a random configuration of objects in Y. A value of 1 indicates a perfect embedding with complete identity of all g-neighborhoods in the two spaces [22, 24]. The choice of neighborhood size g, however, is crucial and has a strong influence on whether an embedding is judged to be successful or not. For large g, these metrics quantify the preservation of global structure in the embedding; for small g, the preservation of local structure.

To circumvent the problems that come with the choice of g, one can compute parameter-free measures based on R_NX(g) [19]. Regarding R_NX(g) as a function of g and averaging the values of Q_NX(g) on either side of its maximum at g_max leads to both a local and a global performance measure:

Q_local = (1 / g_max) ∑_{g=1}^{g_max} Q_NX(g)   and   Q_global = (1 / (n − g_max)) ∑_{g=g_max}^{n−1} Q_NX(g).
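As an illustration, the neighborhood-overlap measures above can be computed directly from two precomputed distance matrices. The following Python sketch is ours, not the authors' R code; it implements Q_NX(g) and R_NX(g) exactly as defined above, given distance matrices DH (data space) and DY (embedding space).

```python
# Illustrative sketch (not the authors' code): overlap measure Q_NX(g)
# and normalized LCMC R_NX(g), given precomputed (n x n) distance
# matrices DH (high-dimensional space) and DY (embedding space).
import numpy as np

def g_neighborhoods(D, g):
    # indices of the g nearest neighbors of each point, self excluded
    return np.argsort(D, axis=1)[:, 1:g + 1]

def q_nx(DH, DY, g):
    n = DH.shape[0]
    NH, NY = g_neighborhoods(DH, g), g_neighborhoods(DY, g)
    overlap = sum(len(set(NH[i]) & set(NY[i])) for i in range(n))
    return overlap / (g * n)

def r_nx(DH, DY, g):
    n = DH.shape[0]
    return ((n - 1) * q_nx(DH, DY, g) - g) / (n - 1 - g)
```

For identical configurations in both spaces, Q_NX(g) and R_NX(g) both equal 1, matching the interpretation given above.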
To quantify the overall performance of an embedding, the area under the R_NX(g) curve,

AUC_{R_NX} = (∑_{g=1}^{n−2} R_NX(g) / g) / (∑_{g=1}^{n−2} 1/g),

can be computed [19]. Given such a scalar measure of performance, embedding methods can then be tuned by maximizing this measure over different hyperparameter settings. However, while it is known that the choice of the distance metric has a strong influence on embedding methods [40], the influence of the distance metric used to compute the g-neighborhoods in the surrogate performance measures remains unclear. Since these measures are based on g-neighborhoods, the proximity metric used to define the neighborhoods in the respective spaces is crucial, especially if the goal of the analysis is to “unfold” the global structure of the manifold. In order to recover the global manifold structure, neighborhoods in the high-dimensional space should be defined using geodesic distances rather than Euclidean distances, since only the former represent long-range distances on the manifold correctly, while the latter are merely distances in the ambient space. This distinction is likely to be highly relevant especially if the observed high-dimensional data manifold has a complex structure, such as M_F in our situation, and when the measures are supposed to be used for automatic parameter tuning for optimal recovery of global structure. In the following, we use AUC^m_{R_NX} and Q^m_local, where m ∈ {dir, geo} indicates the distance metric used to calculate the neighborhoods for performance assessment, i.e., in this study g-neighborhoods in R_NX(g) are computed either using L² distances (i.e., Euclidean distances in R^D) or geodesic distances. In the following, we use the term direct distance instead of L² distance to emphasize the conceptual difference between distance measures which merely quantify proximity in the ambient space (hence direct distance) and geodesic distance measures quantifying proximity on a (nonlinear) manifold.
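The AUC formula above can be transcribed into code directly. The sketch below is our own illustrative Python (function names are ours); it favors readability over efficiency and is only suitable for small n.

```python
# Transcription of the AUC_{R_NX} formula: R_NX(g) averaged over
# g = 1, ..., n-2 with 1/g weights, which emphasizes small (local)
# neighborhood sizes. Exposition only; roughly O(n^3).
import numpy as np

def auc_rnx(DH, DY):
    n = DH.shape[0]
    NH = np.argsort(DH, axis=1)[:, 1:]   # neighbor rankings, self excluded
    NY = np.argsort(DY, axis=1)[:, 1:]
    num = den = 0.0
    for g in range(1, n - 1):            # g = 1, ..., n - 2
        overlap = np.mean([len(set(NH[i, :g]) & set(NY[i, :g]))
                           for i in range(n)])
        q = overlap / g                  # Q_NX(g)
        r = ((n - 1) * q - g) / (n - 1 - g)
        num += r / g
        den += 1.0 / g
    return num / den
```

The 1/g weighting is what makes the criterion sensitive to local structure while still integrating over all scales.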
Note that this is a general conceptual difference: most standard distance metrics can be regarded as direct distance measures in the respective space, and geodesic distances as computed here can be obtained based on several of these direct distance metrics. We will show that direct distances such as the L² distance can yield very misleading results when used in the surrogate performance measures and that tuning approaches can thus lead to far from optimal configurations.

In this study, we compare the nonlinear dimension reduction methods isometric feature mapping (ISOMAP) [37], diffusion maps (DIFFMAP) [9], t-distributed stochastic neighborhood embedding (t-SNE) [28], and uniform manifold approximation and projection (UMAP) [29] for functional data. All these methods have locality parameters that control whether (rather) local structures or (rather) global structures are considered, that is, how much “context” of the respective data points is taken into account while constructing the embedding. These parameters influence the result strongly and need to be tuned. We apply MDS as a simple tuning-free benchmark reference method.

ISOMAP is based on classical MDS. In contrast to MDS, it is capable of unfolding the intrinsic structure of a data set. The algorithm consists of three steps. First, a nearest-neighbor graph is constructed based on a suitable direct distance metric, usually the L² metric. This requires defining a neighborhood size either by a distance threshold ε or by the number of neighbors k to be included. This parameter is the main tuning parameter of the algorithm. In the next step, shortest-path or geodesic distances among all points are computed based on the nearest-neighbor graph. These distances are then supplied to classical MDS, which embeds the observations accordingly.
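The three ISOMAP steps can be sketched compactly. The following is an illustrative Python re-implementation written for this exposition, not the R implementation used in the study; it assumes k is large enough that the nearest-neighbor graph is connected.

```python
# Minimal sketch of the three ISOMAP steps described above:
# k-NN graph -> shortest-path (geodesic) distances -> classical MDS.
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap(X, k=10, d=2):
    # step 1: symmetrized k-nearest-neighbor graph on direct (L2) distances
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    W = np.full((n, n), np.inf)                # inf = no edge
    nn = np.argsort(D, axis=1)[:, 1:k + 1]
    for i in range(n):
        W[i, nn[i]] = D[i, nn[i]]
    W = np.minimum(W, W.T)                     # make the graph undirected
    # step 2: geodesic distances = shortest paths in the graph (Dijkstra)
    G = shortest_path(W, method="D")
    # step 3: classical MDS on the geodesic distance matrix
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (G ** 2) @ J                # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:d]           # largest d eigenvalues
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))
```

For data sampled along a straight line, geodesic and direct distances coincide and the one-dimensional embedding reproduces the ordering along the line up to sign and scale.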
ISOMAP is supposed to be particularly suited to detect global structures [37].

DIFFMAP is another spectral embedding method, projecting on the eigenvectors of a diffusion operator on the data manifold. Proximity of data points is defined by a kernel function whose width acts as a tunable locality parameter [9, 27].

t-SNE, which has been state of the art for several years [29], builds upon stochastic neighborhood embedding (SNE). In contrast to the aforementioned methods, (t-)SNE transforms proximities between data points into conditional probabilities of them being neighbors in the respective space and then minimizes the KL divergence of the implied distribution in the original space from that in the embedding space. The perplexity of the implied distribution in the original space acts as a tunable locality parameter.

UMAP [29] is a state-of-the-art manifold learning method based on three assumptions: uniformly distributed data on a locally connected manifold equipped with a locally constant metric. It computes a fuzzy topological representation of the manifold based on a nearest-neighbor graph. The number of nearest neighbors serves as a tunable locality parameter.

In addition to investigating how successfully the intrinsic manifold structure of a functional data set can be detected and unfolded in general, we also want to investigate how reliably automatic tuning approaches identify suitable hyperparameter settings. We consider the parameters steering the degree of locality as the main tuning parameters of these embedding methods. Using the terminology of the respective R packages, these are the neighborhood size k for ISOMAP, n_neighbors for UMAP, the perplexity for t-SNE, and eps.val for DIFFMAP. Hereinafter we refer to these parameters as locality parameters. Moreover, the methods are not supplied with the raw data matrices, but with distance matrices instead. Note that this means an initialization via PCA is not performed for t-SNE using Rtsne.
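As a rough Python analogue of the locality parameters just listed (the study itself uses R packages such as Rtsne), the corresponding parameters in scikit-learn are shown below; UMAP and DIFFMAP would require separate packages and are omitted here.

```python
# Illustrative Python counterparts of the locality parameters discussed
# above; this is not the software used in the paper.
import numpy as np
from sklearn.manifold import Isomap, TSNE

X = np.random.default_rng(1).normal(size=(150, 8))

# ISOMAP: neighborhood size k  <->  n_neighbors
emb_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

# t-SNE: locality is controlled by the perplexity
emb_tsne = TSNE(n_components=2, perplexity=25.0,
                init="random", random_state=1).fit_transform(X)
```

In each case, increasing the locality parameter makes the embedding reflect more global structure, at the cost of local detail.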
To investigate these aspects, we compute both the direct and the geodesic distance matrix for each simulated data set, in the function space as well as in the parameter space. For a given parameter configuration, the direct distance matrix obtained from the function space is then input to the respective embedding method. Finally, direct distances of the learned embeddings are computed. As the performance measures are based on the comparison of g-neighborhoods in the high-dimensional and the embedding spaces, we compute the performance measures with respect to both the parameter space and the function space, based on direct and geodesic distance neighborhoods, in the simulation settings. Recall that direct distances represent distances in the ambient space rather than distances on the manifold, and the resulting neighborhoods are thus unlikely to be well suited for performance assessment and tuning if the intrinsic structure is nonlinear, especially for larger neighborhood sizes.

Parameters are tuned for optimal performance via an extensive grid search, cf. Table 1. For DIFFMAP, we compute a data set-specific starting value ε_s via epsilonCompute, which is the default value of the method, and use this to define the search grid of the locality parameter. In the synthetic data settings of Section 3, we can perform a “ground truth”-based meta-assessment, in which we evaluate the effect of direct and geodesic distances. Moreover, we can compare the results achieved by tuning based on performance measures computed in the function space with those achieved by a practically infeasible “oracle” tuning method that uses corresponding performance measures computed in the unobservable true parameter space instead.
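Such tuning amounts to a grid search that scores each candidate embedding with a scalar quality measure. A minimal self-contained sketch (our illustration, not the paper's pipeline; for brevity it scores with Q_NX at a single fixed g, whereas the study tunes on the parameter-free AUC_{R_NX}):

```python
# Sketch of locality-parameter tuning by grid search: for each candidate
# neighborhood size k, fit an embedding and keep the k maximizing a
# scalar neighborhood-preservation score.
import numpy as np
from sklearn.manifold import Isomap

def dist(X):
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

def q_nx(DH, DY, g):
    # mean overlap of g-neighborhoods between the two spaces, in [0, 1]
    n = DH.shape[0]
    NH = np.argsort(DH, axis=1)[:, 1:g + 1]
    NY = np.argsort(DY, axis=1)[:, 1:g + 1]
    return np.mean([len(set(NH[i]) & set(NY[i])) for i in range(n)]) / g

def tune_isomap(X, ks, g=10):
    DH = dist(X)
    scores = {}
    for k in ks:
        Y = Isomap(n_neighbors=k, n_components=2).fit_transform(X)
        scores[k] = q_nx(DH, dist(Y), g)
    return max(scores, key=scores.get), scores
```

The same loop applies to any embedding method and any scalar surrogate measure; only the scored quantity changes.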
Table 1: Parameters of the manifold methods which are subjected to tuning. The second column shows the total number of different parameter configurations in the tuning parameter grid, the third column the locality parameter, the fifth column the embedding dimension parameter, and the last column further parameters tuned. The “grid” columns display the search grids of the parameters in the preceding column. The embedding dimension grid for t-SNE differs from the other grids because the implementation does not allow the embedding dimension to be greater than three.
Method    configs  locality parameter  grid                           emb. dimension  grid    further parameters
ISOMAP    1300     k                   [3, 975], step size: 3         ndim            [2, 5]  –
DIFFMAP   6000     eps.val             [… ε_s, … ε_s], length: 250    neigen          [2, 5]  t
UMAP      18720    n_neighbors         [5, 975], step size: 5         n_components    [2, 5]  min_dist, n_epochs, init
t-SNE     21184    perplexity          [3, 333], step size: 1         dims            [2, 3]  theta, max_iter, eta, exaggeration

Since we are interested in whether manifold methods and tuning approaches can be used for functional data sets in general, a thorough evaluation design is essential. For supervised learning algorithms, a wide and comprehensive body of literature exists on conducting neutral and objective comparison and benchmark studies [38, 5, 11, e.g.].

How to reliably evaluate algorithms and meta-learning approaches in unsupervised settings, however, is not as clear. Due to the lack of an outcome variable, clearly defined objectives to optimize against are usually not available. This makes the comparison of unsupervised learning algorithms in general, and the assessment of meta-learning approaches such as tuning in particular, prone to overoptimistic findings. General frameworks for systematic benchmarks in this context are still in their infancy [38]. In particular, nonlinear dimension reduction and manifold learning often lack a clearly defined objective in terms of the achieved reconstruction error if the methods do not yield an invertible mapping. Since there is no generally agreed upon standard benchmark procedure for unsupervised learning, we devised the following procedure in order to avoid overoptimistic conclusions in our study as much as possible: Based on the problem specification and the theoretical framework defined in Section 2.1, we first conduct a simulation study to assess the embedding methods and the considered tuning approaches in settings where the ground truth is known. This allows us to investigate possible strengths, weaknesses, and pitfalls in settings that allow for objective evaluations based on a known “ground truth”.
We then apply the approach to real data sets where qualified assumptions about the intrinsic structure of the data can be made based on substantive considerations and the analysis of previous studies. Although knowledge of the intrinsic structure is less certain than in the ground truth simulations, it is still possible to “objectively” evaluate the embeddings in these settings, at least conditional on certain substantively justified assumptions. To some extent, this lets us examine whether the insights obtained in simulated data settings also hold for real
data. Finally, we compute embeddings for a real data set for which information about its intrinsic structure cannot be justifiably inferred from prior substantive considerations – i.e., a fully unsupervised problem. We will show that such a setting can pose severe problems due to nonconforming embeddings that are hard to tackle, but that “principled” choices between nonconforming embeddings may nevertheless be possible based on the insights gained from the simulation study and from data applications in which some prior knowledge is available.
Based on the framework described in Section 2.1, we can systematically assess the utility of the described manifold methods for functional data settings by means of a simulation study. This allows us to compare the embeddings to a ground truth, which is essential in an unsupervised learning problem in order to come to reliable conclusions. In particular, we can assess the influence of some factors likely to lead to strongly nonlinear intrinsic structure of functional data manifolds, i.e., nonlinear domain warping (phase variation) and an underlying nonlinear parameter space Θ, by systematically controlling these sources of variation.

Loosely speaking, the simulation design is based on two-peaked functions derived from Gaussian pdfs over the domain [0, 1]. Variation is achieved by randomly changing the locations, widths, and heights of the peaks, in total leading to eleven considered settings: six based on a linear parameter space, including three with nonlinear phase variation, and five based on a nonlinear parameter space and amplitude variation only. We can thus assess the effects of nonlinear phase variation and of a nonlinear parameter space separately, using the first six and the last five settings, respectively.

Specifically, we consider the following functional manifold

M_F = { x ∈ L²([0, 1]) : x(t) = φ(θ) },

with θ = (a, p) and φ(θ) = b(w(t; p); a). Amplitude variation in M_F is parameterized as

b(t; a) = a1 / √(0.02 π) · { a2 · n(t, 0.25) + a3 · n(t, 0.75) } + a4,  with n(t, µ) = exp(−(t − µ)² / 0.02).

Depending on the setting, phase variation is parameterized as the identity warping w(t; p) = t, as a linear warping

w(t; p) = p·t for t ∈ [0, 0.5] and w(t; p) = (2 − p)(t − 1) + 1 for t ∈ (0.5, 1],

as a power warping w(t; p) = t^p, or as w(t; p) = B(t; p1, p2), where B(·; a, b) is the cdf of a Beta(a, b) distribution.

The considered settings are obtained by selecting up to three of the parameters a1, a2, a3, a4, p1, p2, p3, p4, i.e., the considered settings have at most 3 degrees of freedom (df). Inactive parameters are set to constant values, e.g., a2 = 1 and a4 = 0. If
Table 2: Overview of simulation settings.

setting  df  parameter space  variation    parameters      warping
a1-l     1   linear           amplitude    a1              identity
p1-l     1   linear           phase        p1              linear
c1-l     1   linear           coupled      a1 = p1         power
a2-l     2   linear           amplitude    a1, a2          identity
p2-l     2   linear           phase        p1, p2          beta cdf
i2-l     2   linear           independent  a1, p1          power
a2-sr    2   1D Swiss roll    amplitude    a1, a2          identity
a3-hx    3   1D helix         amplitude    a1, a2, a3      identity
a3-sr    3   2D Swiss roll    amplitude    a1, a2, a3      identity
a3-sc    3   2D S-curve       amplitude    a1, a2, a3      identity
a3-tp    3   2D tp surface    amplitude    a1, a2, a3      identity

Figure 2: Example functions with 1 df coupled joint amplitude and phase variation (c1-l), 1 df phase variation (p1-l), and 2 and 3 df amplitude variation (a2-sr and a3-hx). c1-l and p1-l are based on linear, a2-sr and a3-hx on nonlinear parameter manifolds.

no warping parameter is selected, the identity warping is applied. Varying both amplitude and warping parameters in a setting induces joint amplitude and phase variation. Dependencies between amplitude and phase parameters induce dependencies between amplitude and phase variation.

The active parameters are either drawn uniformly from a linear parameter space, i.e., the manifold Θ is a linear subspace, or from a nonlinear parameter space, i.e., a (nonlinear) manifold. Note that the power function and the beta cdf warping are nonlinear transformations of the domain; thus we obtain a nonlinear functional data manifold even if the parameter space Θ is linear. In the linear case, the a_i and p_j, i ∈ {1, …, 4} and j ∈ {1, 2, 3}, are drawn from uniform distributions over fixed intervals, and p4 is drawn from a separate uniform interval. In the nonlinear case, parameter values are drawn uniformly from one of five different manifolds: the Swiss roll (1D and 2D), the 1D helix, the 2D S-curve, and the 2D two-peaked (tp) surface as provided by the R package dimRed [19]. For each setting, 1000 functional observations are generated based on 200 grid points.
A summary of the simulation settings is provided in Table 2, and Figure 2 displays samples of functions with phase variation, amplitude variation, and joint, coupled amplitude and phase variation.
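The generating process can be sketched as follows. This is our own Python transcription, not the authors' R code: the peak locations 0.25 and 0.75 follow the text, but the kernel width 0.02 (i.e., Gaussian pdfs with σ² = 0.01) and the sampled parameter range are assumptions where the source is ambiguous.

```python
# Sketch of the simulation design: two-peak functions b(w(t; p); a) on
# [0, 1] with power warping w(t; p) = t^p. Kernel width and the uniform
# parameter range below are assumed values, not taken from the paper.
import numpy as np

def n_kernel(t, mu):
    return np.exp(-(t - mu) ** 2 / 0.02)

def b(t, a1=1.0, a2=1.0, a3=1.0, a4=0.0):
    # two Gaussian-pdf-shaped peaks at 0.25 and 0.75 plus an offset
    return a1 / np.sqrt(0.02 * np.pi) * (a2 * n_kernel(t, 0.25)
                                         + a3 * n_kernel(t, 0.75)) + a4

def sample_curves(n_obs=1000, n_grid=200, p=1.0, seed=0):
    # a1-l-like setting: only a1 varies; p fixes the power warping
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, n_grid)
    a1 = rng.uniform(0.5, 1.5, size=n_obs)    # assumed parameter range
    return np.stack([b(t ** p, a1=a) for a in a1])
```

With p ≠ 1 the warping shifts the sampled peak locations nonlinearly, which is exactly the mechanism that makes the resulting functional manifold nonlinear despite a linear parameter space.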
The results show that functional data can – in principle, i.e., given the ground truth to compare against – be embedded with methods developed primarily for image or tabular data.

We first sketch the effect of nonlinear warping in the next subsection, considering the settings a1-l, p1-l, c1-l, a2-l, p2-l, and i2-l with simple parameter spaces R^d. Based on embeddings of the reference method MDS, we then analyze some specific pitfalls which result from the drawbacks of the performance measures described in Section 2.2. We finally turn to the settings with more complex parameter spaces, a2-sr, a3-hx, a3-sr, a3-sc, and a3-tp, additionally evaluating tuning approaches based on the surrogate performance measures. We show that, with the correct tuning approach, it is possible to (automatically) obtain high-quality embeddings for these settings as well.

In this section we highlight two essential aspects. First of all, the findings indicate that it is possible to successfully embed functional data using manifold methods, at least in these simple settings. In addition, we provide evidence that things can quickly get rather complicated if warping comes into play, even if the underlying parameter space is a simple linear one. That said, the settings we consider here – a1-l, p1-l, c1-l, a2-l, p2-l, i2-l – include amplitude as well as phase variation, and both coupled and independent joint phase and amplitude variation.

As can be seen in Figure 3, ISOMAP is particularly successful. Clearly, perfect linear embeddings are achieved in settings with one degree of freedom and two degrees of freedom alike (a1-l, p1-l, c1-l, a2-l). Note that phase variation is induced by a nonlinear polynomial warping function in the setting with coupled phase and amplitude variation, c1-l. Nevertheless, the functional manifold can be perfectly unfolded into its underlying linear structure. This is not the case for setting p2-l, where phase variation is induced by the Beta cdf.
Here, the resulting structure of the functional manifold becomes more challenging to unfold into R^2. This shows why embedding functional data can be especially complex and why it is important to use simple, low-dimensional parameter spaces for this study: the functional manifolds we are dealing with become nonlinear even though they are based on a deceptively simple parameter space. Moreover, consider setting i2-l, which has independent amplitude and phase variation (based on power warping). Even though the data are embedded nearly linearly, the distribution of the parameter values of the amplitude variation, indicated by the color code, no longer follows a simple linear direction in the embedding space. This also indicates the more complex structure of the functional manifold induced by the power warping.

Similar findings can be reported for the embeddings of the other methods, although overall they are not as successful as the ISOMAP embeddings. In the 1-dimensional settings most embeddings are not perfectly linear. Moreover, unlike t-SNE, UMAP fails to recover setting i2-l, and DIFFMAP fails to recover settings i2-l, p2-l, and a2-l. To sum up, manifold methods appear to be able to recover the manifold structure of functional data, but how successfully they do so strongly depends on the complexity induced by the transformation of the parameter space. Structured phase variation can quickly lead to very complex functional manifolds which may not be embeddable faithfully, in the sense that the functional data manifold is no longer isomorphic to a
low-dimensional linear subspace, even if it is based on a very simple linear parameter space.

Figure 3: Embeddings for settings a1-l, p1-l, c1-l, a2-l, p2-l, i2-l. Color scale encodes the value of the first parameter in Θ. Since units for the embeddings are arbitrary, we omit axis labels here and in the following figures to save space.

As outlined in Section 2.2, the proposed surrogate performance measures suffer from some drawbacks. Here we show that one of the most worrisome resulting pitfalls is that these measures can frequently indicate high performance values even if the embedding is not of high quality in terms of ground-truth performance in the underlying parameter space. To demonstrate the issue, we concentrate on MDS embeddings of the nonlinear settings a2-sr, a3-hx, a3-sr, a3-sc, a3-tp. As can be seen in Figure 4, simple MDS is not able to unfold the functional manifolds into embeddings on linear subspaces or the circle. However, assessing the embedding, e.g. of setting a3-hx, using
AUC^dir_RNX based on direct L2-distance neighborhoods in the function space F, i.e. the standard way of calculating distances, would indicate a perfect embedding quality of 1. However, assessing the embedding based on direct-distance-based neighborhoods in the parameter space Θ yields a much lower AUC^dir_RNX of only 0.78. Computing AUC^geo_RNX in the parameter space, i.e. computing neighborhoods using geodesic distances – recall that this is assumed to be the appropriate way to capture long-range distances on the manifold – leads to a further reduction, with an AUC^geo_RNX of only 0.553. This corresponds more closely to the visual impression, since MDS is not able to unfold the intrinsic structure correctly. So we see that naive application of standard performance measures can indicate high quality embeddings even if the manifold is not accurately recovered at all – at least if recovering is defined as also unfolding nonlinear manifolds. The example also shows
Figure 4: MDS embeddings for settings with nonlinear parameter space. Color scale encodes the value of the first parameter in Θ.

that the assessment of the quality of an embedding provided by these performance measures is highly sensitive to the choice of the distance metric used to determine the neighborhood structure in the space against which the embedding space neighborhoods are compared. These two pitfalls make the assessment of real data embeddings using surrogate performance measures particularly challenging since, in reality, the intrinsic structure is obviously unknown. In particular, automatic tuning approaches based on these measures have to be chosen and evaluated very carefully.

To assess the overall approach of tuning and embedding functional data, we concentrate on the more challenging nonlinear settings a2-sr, a3-hx, a3-sr, a3-sc, a3-tp. Since we want to assess the ability to detect and unfold intrinsic structure induced by a nonlinear parameter space here, identity warping is applied in all settings. To thoroughly evaluate the effects of the distance metric, we tuned each method on each setting based on
AUC^dir_RNX and AUC^geo_RNX, each based on both the function space and the ground truth parameter space. That is, in total, each method has been tuned four times for each setting. Computing performance based on the parameter and the function space allows us to compare what is theoretically possible – based on the ground truth parameter space – on the one hand, and what is practically feasible – based on the observable function space – on the other. Reliable tuning and evaluation methods based on the functional data should provide similar answers and results as those achieved by tuning and evaluating on the true underlying parameter space.

Figures 5 and 6 display the resulting embeddings for the considered settings. The embeddings in Figure 5 have been obtained via tuning based on the parameter space, i.e., maximizing agreement between parameter space neighborhoods and embedding space neighborhoods, and those in Figure 6 via tuning on the function space, i.e., maximizing agreement between function space neighborhoods and embedding space neighborhoods. Successful embeddings should unfold these data either to a circle (a3-hx, 2nd column) or to linear subspaces. However, regarding the parameter space-optimized embeddings in Fig. 5, it becomes obvious that – even if the true parameter space is used – tuning can lead to embeddings that do not withstand visual inspection in that sense (e.g., see t-SNE and ISOMAP embeddings of a2-sr and a3-hx based on AUC^dir_RNX; Fig. 5 A, second and third row). For setting a2-sr, AUC^dir_RNX indicates a perfect embedding for ISOMAP and a good embedding for t-SNE. The corresponding embeddings based on AUC^geo_RNX (Fig. 5 B, second and third row), however, withstand visual inspection far better. This already indicates that tuning based on geodesic distances can lead to better results than simply relying on direct distances. Turning to the function space optimized embeddings (Fig.
6), we see that things can get even more involved in reality. Consider, for example, the ISOMAP embeddings again. The embeddings based on
AUC^dir_RNX (Fig. 6 A) for a3-sc as well as for settings a2-sr, a3-hx and a3-sr are not satisfactory, even though high performance is indicated by the measure. These discrepancies between measured and actual performance are due to the fact that, in the case of direct distances, the performance measure is based on a suboptimal distance metric in the high-dimensional spaces Θ and F. For strongly nonlinear settings, direct distances seem to be insufficient for tuning methods so that they correctly reflect the intrinsic structure. Analogously, this applies to the function space as well, since the L2 metric in the function space is structurally very similar to the Euclidean distance in a Euclidean space. L2-based neighborhoods are thus likely to differ from neighborhoods based on geodesic distances in the function space whenever the functional manifold is nonlinear. Due to the more complex structure of the function space, the effect seems to be intensified (for example, see the ISOMAP embeddings for a3-sc: based on the parameter space, direct distances were sufficient to recover the manifold, whereas in the function space the intrinsic structure was only recovered if tuned based on geodesic distances). Besides the effects of nonlinear domain warping, this is another example of the specific challenges of nonlinear dimension reduction in FDA settings.

Turning to the remaining methods, the picture is a little more difficult to make sense of, because the embeddings do not show such clear differences as ISOMAP and t-SNE. In general, DIFFMAP and UMAP show arguably better embedding results based on AUC^geo_RNX (e.g. a2-sr, a3-hx, Fig. 5), but, on the other hand, they benefit less from using geodesic distances for tuning (e.g. a3-sc, a3-hx, a2-sr, Fig. 6). DIFFMAP embeddings, in particular, differ the least between AUC^dir_RNX and AUC^geo_RNX.
Moreover, in some cases the underlying manifold is hardly recognizable or not successfully unfolded; in particular, this holds for the DIFFMAP embeddings in settings a3-sr, a3-sc and a3-tp.

To quantify the differences between using geodesic rather than direct distances to optimize and compute performance measures, Figure 7 shows the optimal AUC^m_RNX values for all nonlinear settings obtained on the function space and the parameter space. The best values achieved based on the function space differ strongly from the ones based on the parameter space in several cases. In general, optimization via AUC^geo_RNX leads to smaller differences between tuning on function and parameter space distances. This is also reflected in Table 3, which shows the absolute differences Δ^m_ā := |ā^m_ps − ā^m_fs|, with ā^m_ps and ā^m_fs the mean optimal AUC^m_RNX based on parameter and function space, respectively. The mean values ā^m_ps and ā^m_fs are computed over the ISOMAP, DIFFMAP, t-SNE, and UMAP embeddings and the settings a2-sr, a3-hx, a3-sc, a3-tp. Since setting a3-sr could not be embedded successfully with any of the methods even if optimized over the parameter space, it is excluded. Clearly, the optimal AUC^geo_RNX values (based on geodesic distances) differ less between function space and parameter space than the AUC^dir_RNX values (based on direct distances) for ISOMAP and t-SNE, while there is not much of a difference for UMAP and DIFFMAP. This is in line with the visual impression that ISOMAP and t-SNE yield clearly better embeddings based on tuning via AUC^geo_RNX for these settings. This also indicates that it is frequently more appropriate to use geodesic distances – especially in function spaces – for performance assessment and tuning.

To sum up, we have seen that it is possible to obtain successful embeddings for functional data and that automatic parameter tuning can be applied in these simulation settings.
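For reference, the surrogate measure used throughout can be sketched in a few lines. The following is our own minimal implementation of the AUC of the R_NX(K) curve with 1/K weighting, computed from two precomputed distance matrices; the function name `auc_rnx` and the toy check are ours:

```python
# Minimal sketch of the AUC_RNX quality measure (Lee & Verleysen style),
# computed from pairwise distances in the high-dimensional space and the
# embedding. Variable names are illustrative, not the paper's implementation.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def auc_rnx(D_high, D_emb):
    """AUC of the R_NX(K) curve with 1/K weighting; inputs are (n, n) distance matrices."""
    n = D_high.shape[0]
    nn_high = np.argsort(D_high, axis=1)[:, 1:]   # neighbor ranks, self removed
    nn_emb = np.argsort(D_emb, axis=1)[:, 1:]
    num = den = 0.0
    for K in range(1, n - 1):
        # Q_NX(K): average relative overlap of the two K-nearest-neighbor sets
        overlap = [len(np.intersect1d(nn_high[i, :K], nn_emb[i, :K]))
                   for i in range(n)]
        q = np.mean(overlap) / K
        r = ((n - 1) * q - K) / (n - 1 - K)       # rescale against random overlap
        num += r / K
        den += 1.0 / K
    return num / den

# toy check: an "embedding" with identical distances scores exactly 1
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
D = squareform(pdist(X))
print(auc_rnx(D, D))  # 1.0
```

Swapping the first argument between direct and geodesic distance matrices yields AUC^dir_RNX and AUC^geo_RNX, respectively, which is exactly the choice the text shows to matter.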
Figure 5: Parameter-space optimal embeddings of nonlinear settings a2-sr, a3-hx, a3-sr, a3-sc, a3-tp. A: first four rows based on parameter space AUC^dir_RNX optimization. B: lower four rows based on AUC^geo_RNX optimization. Color scale encodes the value of the first parameter in Θ.
Figure 6: Function-space optimal embeddings of nonlinear settings a2-sr, a3-hx, a3-sr, a3-sc, a3-tp. A: first four rows based on function space AUC^dir_RNX optimization. B: lower four rows based on AUC^geo_RNX optimization. Color scale encodes the value of the first parameter in Θ.
Figure 7: Function (fs) and parameter (ps) space optimal AUC^m_RNX values based on geodesic and direct distances for settings a2-sr, a3-hx, a3-sr, a3-sc, a3-tp.

Table 3: Comparing performance assessment based on geodesic and direct distances using the absolute difference of mean AUC^m_RNX in parameter space and function space. Setting a3-sr is excluded because no visually appropriate embeddings could be obtained.

Method    Δ^dir_ā   Δ^geo_ā
ISOMAP    0.246     0.081
t-SNE     0.146     0.060
UMAP      0.064     0.052
DIFFMAP   0.085     0.043

Moreover, using distances in function space seems to be a reliable alternative to using distances in the ground truth parameter space. In some of the complex settings a2-sr, a3-hx, a3-sr, a3-sc, a3-tp, however, tuning is successful only if based on geodesic rather than direct distances to define the neighborhoods in the high-dimensional space. In general, these are promising results indicating that (automatically) obtaining high quality embeddings for real functional data is feasible. Yet, the approaches should not be applied lightly to real data. As we have seen, some of the methods may yield suboptimal or nonconforming embeddings even if tuned properly. Moreover, not every setting seems to be amenable to a successful embedding (e.g. see setting a3-sr) – factors which can lead to multiple nonconforming or misleading embeddings and to overoptimistic or invalid conclusions if not assessed carefully.
We now turn to real data examples in order to verify the practical utility of the insights from our simulation study. As motivated in Section 2.4, we first apply our approach to two settings where the intrinsic structure of the data can be inferred from domain knowledge to a certain extent. Subsequently, we investigate a fully unsupervised real data example with completely unknown structure. In addition to AUC^m_RNX, we also evaluate embedding performance via Q^m_local here, in order to investigate the distinctions between local and global performance measures. To begin with, we give a short description of all three data sets.

We apply the embedding methods to two functional data sets and one image data set. For two of these, the COIL data and the earthquake data, the intrinsic structure can be inferred from prior knowledge in advance, at least to a certain extent. The intrinsic structure of the third data set, a spectrography data set, is not known. Figure 8 shows example observations of the three real data sets. More details are given in the sections devoted to the specific data sets.
COIL20 [30] is an image data set consisting of 128 x 128 pixel images of 20 objects. Although it is not functional, we use this data set because it is a real data set whose intrinsic structure is nonlinear and can be inferred from substantive considerations. For each object, 72 pictures were obtained by rotating the object around itself and taking a picture every 5 degrees of rotation. The end position equals the starting position, and each picture shows the object at a different angle. Thus, for each object the COIL20 data set contains 72 observations with 16384 features containing single pixel intensities. For this study we use a subset of the COIL20 data containing only the pictures of the first object as the high-dimensional data to be embedded. That is, the data set we use contains 72 pictures of the same object depicted in Figure 8 at different angles ranging from 0 to 360 degrees. Given this setup, the 5-degree picture should – for example – have approximately the same distance to the 0-degree picture as the 355-degree picture. The intrinsic structure of the data set is thus expected to be circular and one-dimensional, i.e., a setting supposed to be comparable to setting a3-hx.

Aside from these insights – which can be inferred from the original description of the data generating design and apply to all of the 20 COIL objects – further considerations can be made for the specific object regarded here, if one closely examines Figure 8. Due to the axial symmetry of the object, it appears more similar to itself at positions 0 and 180 degrees than at 90 and 270 degrees. This can be an indication of further structure which might be present in this specific example, but whether and how this is reflected in the embeddings is difficult to assess.
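The expected circular geometry can be stated in one line: with distance measured along the rotation circle, the 5-degree image is exactly as far from the 0-degree image as the 355-degree image. A tiny sketch (our own helper, purely illustrative):

```python
# Distance along the rotation circle: the shorter of the two arcs between angles.
def rotation_distance(a_deg, b_deg):
    d = abs(a_deg - b_deg) % 360
    return min(d, 360 - d)

print(rotation_distance(0, 5), rotation_distance(0, 355))  # 5 5
```

A faithful embedding should reproduce this circular metric, not the linear ordering of the raw angle values.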
The second real data set contains functional data from a seismological in silico experiment with both phase and amplitude variation described in [14]. It contains 1558 observations observed on 61 grid points. Each observation represents 60 seconds of absolute ground movement velocities at a virtual seismometer location for a simulated earthquake. The original investigation based on multivariate functional PCA of phase and amplitude variation revealed a two-dimensional linear structure of the data [14], reflecting the spatial distances of the virtual seismometers to each other and to the simulated earthquake's hypocenter. That is, from the analyses of the previous study we can infer that this is a data set with phase and amplitude variation and – following our framework – a supposedly relatively simple (linear) underlying parameter space, so a setting assumed to be comparable to settings i2-l or a2-l, for example.

Figure 8: Upper row: Ten example observations of each of the earthquake data (Eq) and the spectrography data (Sp). Lower row: Four images of the COIL object.
Finally, we consider another functional data set with 1004 functional observations observed on a grid of length 1751. The data were originally generated to investigate how forged spirits can be detected noninvasively via vibrational spectroscopy of the ethanol level, i.e. each observation is a spectrograph based on 1751 different wavelengths (see [21] for more details on the data set). The data are usually used as a classification problem and can be obtained separated into a training set with 504 observations and a test set with 500 observations. Since we are in an unsupervised setting, we merged the training and test sets and use the combined data set as the high-dimensional data to be embedded. Here, we cannot make any justifiable assumptions about the intrinsic structure, and it is unclear what a successful embedding should look like.

Figure 9 displays the embeddings for the earthquake data and Figure 10 for the COIL data. The first two columns for each data set are obtained by tuning via the local performance measure Q^m_local, while the latter two columns are obtained by tuning via the global performance measure AUC^m_RNX.
Figure 9: Embeddings of the earthquake data. First two columns obtained by optimization via Q^m_local, latter two columns via AUC^m_RNX. Color scale encodes distance to hypocenter.

Most importantly, the results show that what has been observed for the simulated data also holds for real data: the low-dimensional intrinsic structure – both linear and nonlinear – can be successfully embedded. Several embeddings of the COIL data show a 1-dimensional, circular structure, while a 2-dimensional linear structure results for the earthquake data. Moreover, closer examination reveals further interesting insights.

First of all, considering the earthquake data, it appears that there is not much of a difference between embeddings based on AUC^geo_RNX and AUC^dir_RNX in a setting with nearly linear intrinsic structure. The t-SNE and UMAP embeddings based on AUC^dir_RNX visually appear to be inferior to the ones based on AUC^geo_RNX. Based on the performance values, all embeddings based on AUC^geo_RNX perform somewhat worse. But in general the performance differences are negligible. This is a desirable result in line with theoretical considerations: on approximately linear subspaces, geodesic distances ought to resemble direct distances.

Beyond these findings, the COIL data opens up a rich pool of interesting insights that extend the understanding of embedding methods beyond that obtained in the simulation study. First of all, as outlined in Section 4.1.1, a 1-dimensional circular structure can be assumed from the data generating procedure, and this structure is detected by four of the embeddings. However, due to the symmetries of the rotating object, additional structure could be assumed, and this additional structure seems to be recovered in several embeddings as well – in the case of some of the methods only if the global performance measure
AUC^m_RNX is used for tuning, however. Consider the t-SNE embeddings, where the effect is most prominent. In the two rightmost columns, tuned via AUC^m_RNX, the structure is ellipsoid, while it is circular in the two leftmost columns, tuned via Q^m_local. These nonconforming embeddings can be explained by the axial symmetry of the object. Due to the symmetry, the object is more similar to itself at positions 0 and 180 degrees than at 90 and 270 degrees, which is a global characteristic (locally, i.e. within a small range of rotation angles, the object looks similar to itself everywhere). The local performance measure Q^m_local is not able to reflect this global characteristic of the data sufficiently, and the aspect is lost in UMAP, t-SNE, and DIFFMAP embeddings if tuned based on Q^m_local. These examples demonstrate that global structural properties can easily be “lost in translation” if an embedding is not tuned properly. It also has to be emphasized that – in contrast to t-SNE – those UMAP embeddings which sufficiently preserve global structure do not simultaneously preserve local structure in this setting.

The situation is a little different for ISOMAP. As can be seen in Figure 10, the global structure is recovered in the embeddings, both based on Q^m_local and AUC^m_RNX. However, an additional dimension is needed to reflect this in the embedding, since all embeddings result in a twisted ring with an upward bend at two positions opposed to each other. That is, ISOMAP is not able to fully recover the structure in this example in the lowest possible number of embedding dimensions (similar to DIFFMAP). Considering the MDS embeddings of the COIL data, we see that they are almost identical to the ISOMAP embeddings (only the performance indicated by the values of Q^m_local is slightly worse for MDS).
A possible explanation is that, due to the small number of observations (only 72 for COIL), the geodesic distances do not differ sufficiently from the direct distances, and ISOMAP essentially reproduces the MDS embeddings. This is supported by a couple of insights which can be gained by contrasting these results with the results of the simulation study.

First of all, regarding the MDS embeddings of the COIL data and the a3-hx data, we see that in both situations the intrinsic structure (a twisted ring and the helix, respectively) is recovered in principle. Yet, while the intrinsic structure of a3-hx gets unfolded by ISOMAP, the same is not true for COIL. The most fundamental difference between these two examples is the number of observations (1000 for a3-hx, 72 for COIL), which points towards the conclusion that, with so few observations, a sufficient shortest-path graph cannot be constructed for the COIL data such that MDS would benefit from it compared to simply using direct distances. Moreover, additional experiments in which the number of observations was increased from 1000 to 5000 in setting a3-sr showed that increasing the number of observations can lead to embeddings which successfully unroll the manifold – the failure of the methods to do so in our original experiment in setting a3-sr is thus likely due to too few observations. Finally, the conclusion is also supported by the fact that there are in general almost no differences between the COIL embeddings tuned via AUC^dir_RNX and AUC^geo_RNX – neither visually nor quantitatively. Recall, however, that in the simulated settings there were large differences and – despite the fact that t-SNE and UMAP are able to recover the global structure here – it was frequently crucial to use geodesic distances so that the intrinsic structure could be unfolded.
Figure 10: Embeddings of the COIL data. First two columns obtained by optimization via Q^m_local, latter two columns via AUC^m_RNX. Color scale encodes rotation angle.
To sum up, the investigation of these two examples confirms that the low-dimensional intrinsic structure of real (functional) data can be automatically and successfully embedded. However, some pitfalls were clearly identified. For one, certain structural properties of the functional manifold can easily get lost in the embedding if it is not tuned with the correct strategy, which makes the assessment of fully unsupervised settings especially challenging. Secondly, we saw that using geodesic instead of direct distances is only beneficial in a nonlinear setting if the number of observations is sufficiently large. Regardless, their use does not seem to be harmful even in settings with few observations and/or settings with a linear structure of the functional manifold.
So far, we have seen that in several settings – simulated as well as real – where the intrinsic structure is known at least to a certain extent, functional data can be embedded successfully, and that automatic tuning can be used to obtain suitably faithful embeddings. On the other hand, we identified some specific pitfalls. In this subsection, we illustrate the resulting challenges of nonconforming embeddings with a fully unsupervised real data example and point out an approach that may allow gaining further insights in such fully unsupervised settings.

The embeddings of the spectrography data are depicted in Figure 11. The major problem is that – overall – two nonconforming structures are detected. On the one hand, a closed, circular 3-dimensional structure – a “donut” – is detected by MDS, all ISOMAP embeddings except the one tuned via
AUC^geo_RNX, and arguably also the t-SNE embeddings based on Q^m_local. On the other hand, there is a curved, open 3-dimensional structure detected by ISOMAP based on AUC^geo_RNX, t-SNE based on AUC^geo_RNX and, to a lesser extent, AUC^dir_RNX, and UMAP. In this example, it is neither possible to decide which of the embeddings better describes the true structure by visual inspection or reference to prior knowledge, nor is it expedient to simply maximize performance measures. E.g., AUC^m_RNX is similarly high for ISOMAP based on geodesic as well as direct distances, yet these lead to nonconforming embeddings. The drawbacks and pitfalls of the performance measures described in the simulation study are fully apparent here. Although performance measures are available, deciding between nonconforming embeddings is far from straightforward.

However, the results gathered so far allow us to introduce additional decision criteria. First of all, we saw that, in the COIL and the a3-hx examples, closed structures were detected and recovered by all methods, irrespective of the performance measures used for tuning. That is, in a setting with a not “fully” unfoldable structure which is closed in some way or other, this structure was recovered in all cases considered. Specifically, that included embeddings obtained with AUC^geo_RNX. On the other hand, if a nonlinear structure is “fully” unfoldable, we saw that, in several cases, a fully unfolded embedding was achieved only if the embeddings were based on AUC^geo_RNX (e.g. see a3-sc) – provided there was enough data. Considering the spectrography data, we see that none of the embeddings based on AUC^geo_RNX – except MDS, of course – show the closed circular structure.
It may thus not be too far-fetched to infer that, if the intrinsic structure of this data set were closed and circular in reality, this would also be reflected by at least some of the embeddings based on AUC^geo_RNX (as was observed, for example, for the COIL and the a3-hx data), and that the “donut” does not actually reflect the true intrinsic structure sufficiently. In fact, one possible explanation could be that direct distances might falsely
Figure 11: Embeddings of the spectrography data. First two columns obtained by optimization via Q^m_local, latter two via AUC^m_RNX. To improve visual differentiation, points are colored according to their observation number.

indicate that two objects at opposite ends of an open, but curved manifold are close together by taking a shortcut “through” the ambient space, and thus connect these parts of the functional manifold in the embedding space. Similarly, based on local measures – that is, from a local perspective – such points might also appear close, and global characteristics cannot be reflected sufficiently, as was the case in the COIL example. This might result in unconnected manifold regions being pulled together, which might be appropriate locally, but yields a wrong impression from a global perspective. The fact that UMAP, a method which is mainly concerned with an accurate representation of the local structure [29], does not show a closed structure even if based on Q^m_local, contributes to this conclusion. Note, however, that other explanations might be possible. For example, MDS embeddings reflected the true intrinsic structure in the simulated settings, which may also be the case here, and the other, “unclosed” structure might result from undiscovered effects of the approaches. Nevertheless, this example points in a direction to gain insights in settings where so far no judgment is possible.
Our results indicate that nonlinear dimension reduction methods can detect and “unfold” the manifold structure of functional data. For our settings, ISOMAP and t-SNE proved particularly successful. Under the same tuning regime, the other methods under comparison – DIFFMAP, UMAP, and MDS – frequently did not obtain similarly useful embeddings. However, we focused on recovery of the global data manifold structure in the embedding space. This is likely to affect UMAP adversely, since it is optimized for high-fidelity reconstruction of local structures. The conclusion should thus not be that t-SNE or ISOMAP are superior for embedding functional data in general. Similar remarks apply to DIFFMAP, which is also a local method and performed less well than UMAP in our experiments. Especially in higher-dimensional real data settings, it frequently led to degenerate embeddings.

Furthermore, tuning strategies based on surrogate performance measures such as
AUC^m_RNX should be applied with caution, since they may lead to very misleading embeddings with far from optimal configurations. In fact, we found evidence that embedding performance strongly depends on the distance metric m supplied to the performance measure used for tuning. Our results suggest that the use of geodesic distances – in particular in function space – is more likely to yield suitable embeddings if a faithful representation of global structure is of importance: tuning embeddings based on functional geodesic distances rather than direct distances yielded embeddings that were frequently much more similar to the ground truth structure in the simulation study and to the expected structure in the real data examples.

Taking all insights obtained in this study into account, we propose the following nuanced approach to achieve more reliable embeddings of functional data: (1) Embeddings should be computed automatically by the described tuning approach. (2) At least two embedding methods should be included to account for different method performances. In addition, a tuning-free reference method should be included, for which we suggest MDS, since it can recover intrinsic structures in simple cases although it does not “unfold” them. (3) Embeddings should be computed based on optimizing a local as well as a global performance measure. (4) Geodesic distances should be used. Tuning based on geodesic distances worked better for some methods, specifically in complex nonlinear settings, and did not have adverse effects on performance in any other setting. In addition, discrepancies between geodesic-based embeddings and direct-distance-based or local-performance-based embeddings can provide clues about the likely complexity of the intrinsic structure (closed vs. nonclosed, linear vs. nonlinear).

Moreover, some general questions are raised as well, as the last aspect strongly affects how to appropriately tackle specific unsupervised problems with manifold methods.
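A minimal sketch of the tuning part of this recipe, under simplifying assumptions: the helper `quality` is a simple rank-agreement stand-in for AUC^geo_RNX (our own substitution, not the measure used in the study), and the amplitude-variation family is toy data of our own, not one of the paper's settings:

```python
# Hypothetical sketch: tune an embedding method's hyperparameter by maximizing
# a global quality score computed from geodesic distances, and keep a
# tuning-free MDS reference alongside.
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr
from sklearn.manifold import MDS, Isomap
from sklearn.neighbors import kneighbors_graph

def geodesic_distances(X, k=10):
    g = kneighbors_graph(X, n_neighbors=k, mode="distance")
    return shortest_path(g, directed=False)

def quality(D_high, emb):
    # stand-in global measure: rank agreement between high-dimensional
    # (geodesic) and embedding-space pairwise distances
    return spearmanr(squareform(D_high, checks=False), pdist(emb)).correlation

rng = np.random.default_rng(0)
theta = np.sort(rng.uniform(0, 1, 150))
grid = np.linspace(0, 1, 40)
X = np.array([(1 + th) * np.sin(2 * np.pi * grid) for th in theta])  # toy amplitude variation

D_geo = geodesic_distances(X)

# automatic tuning of ISOMAP's n_neighbors against the geodesic score
scores = {k: quality(D_geo, Isomap(n_neighbors=k, n_components=1).fit_transform(X))
          for k in (5, 10, 20)}
best_k = max(scores, key=scores.get)

# tuning-free MDS reference on the same data
mds = MDS(n_components=1, random_state=0).fit_transform(X)
print(best_k, round(scores[best_k], 2))
```

In practice, the score would be AUC^geo_RNX (and, per step 3, additionally Q_local), and the loop would run over each embedding method's full hyperparameter grid.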
For problems such as clustering or outlier detection, where preserving local structure can be sufficient, relying on non-geodesic distances can be appropriate. Yet, according to our findings, this cannot simply be transferred to other tasks where global structure is more important, for example when manifold methods serve as a preprocessing or feature engineering step. In this setting, unfolding the manifold, i.e., detecting and simplifying the global structure, is important: reliable low-dimensional representations not only improve visualization and exploration of functional data, their embedding coordinates can also be exploited as features that preserve the essential information contained in the functional data for supervised learning (i.e., modeling and inference) tasks. Using direct distances to assess embeddings quantitatively or to optimize learned embeddings in an automated fashion is more likely to lead to misleading results in this context.
In summary, our results show the potential of extending manifold methods for NDR to function-valued data, but also reveal challenges that are likely to come up in applications. In order to achieve reliable low-dimensional representations of functional data for visualization and exploration or to serve as feature inputs in supervised learning tasks, these issues will require additional attention from the research community.
In future work, we will investigate the effect of noise-corrupted observations on the estimation of geodesic distances for functional data, since errors that shift observed functions off the functional manifold are likely to affect the recovery of geodesic distances adversely. In addition, the effects of grid resolution and data set size, as well as the definition of alternative distance metrics that specifically account for certain characteristics of functions (for example, separate amplitude and phase distances [36]), are further important aspects.
More generally, this study should be considered in the light of a growing debate on replicability in methodological research. As has been outlined by many [e.g., 4, 15, 17, 26], methodological research claiming to show superior performance of its proposed methods and approaches can frequently not be confirmed in independent replications. Our aim in this work was to provide a fairly neutral evaluation by design, pointing out specific pitfalls and drawbacks of several widely used manifold learning algorithms and possible meta-learning methods on a wide range of functional data settings. However, other evaluation frameworks are certainly possible and may yield additional insights and qualifications of our conclusions. As long as there is no generally agreed-upon benchmarking and evaluation regime, the results of all such studies will depend on the choice of this framework to some extent. In that regard, our study is also intended to serve as a starting point, and we hope that it may help initiate a discussion, similar to the ones in supervised learning and cluster analysis, on how to conduct neutral evaluations and foster replicable results in the important field of NDR and manifold learning.
Technical details
All code necessary to reproduce the experiments can be found on GitHub: https://github.com/HerrMo/fda-ndr. To conduct the experiments we used R 3.6.3 [33] on a system running Linux Mint Cinnamon 19.2, with the R packages vegan [31] for ISOMAP, diffusionMap [35] for DIFFMAP, Rtsne [20] for t-SNE, and umap [18] for UMAP. For the performance measures we used the functions auc_rnx and q_local from the dimRed package [19]. To compute MDS we used cmdscale from package stats, and isomapdist from package vegan to compute geodesic distances.
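As a minimal sketch of how these building blocks fit together (not the exact experimental pipeline; the toy curves and the neighborhood size k = 10 are our own choices), geodesic distances between functions observed on a common grid can be obtained with vegan's isomapdist and passed to cmdscale for MDS:

```r
# Sketch: classical MDS on direct vs. geodesic distances for toy
# functional data. Curves are evaluated on a shared grid, so row-wise
# Euclidean distances approximate functional L2 distances.
library(vegan)  # provides isomapdist()

set.seed(1)
grid <- seq(0, 1, length.out = 50)
# 100 phase-shifted sinusoids: a simple nonclosed 1-D manifold of curves
X <- t(sapply(runif(100, 0, 0.4), function(s) sin(2 * pi * (grid + s))))

d_direct <- dist(X)                       # direct (L2) distances
d_geo    <- isomapdist(d_direct, k = 10)  # geodesic distances via k-NN graph

emb_direct <- cmdscale(d_direct, k = 2)   # MDS on direct distances
emb_geo    <- cmdscale(d_geo,    k = 2)   # MDS on geodesic distances
```

Comparing `emb_direct` and `emb_geo` illustrates the point made above: MDS on direct distances can recover intrinsic structure in simple cases without unfolding it, whereas feeding it geodesic distances (which is exactly the ISOMAP construction) can unfold nonlinear structure.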
Acknowledgments
This work has been funded by the German Federal Ministry of Education and Research (BMBF) under Grant No. 01IS18036A. The authors of this work take full responsibility for its content.
Declarations of interest
None.
References
[1] Alaíz, C.M.: Diffusion Maps Parameters Selection Based on Neighbourhood Preservation. Computational Intelligence (2015)
[2] Becht, E., McInnes, L., Healy, J., Dutertre, C.A., Kwok, I.W.H., Ng, L.G., Ginhoux, F., Newell, E.W.: Dimensionality reduction for visualizing single-cell data using UMAP. Nature Biotechnology (1), 38–44 (2019)
[3] Bengio, Y., Courville, A., Vincent, P.: Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence (8), 1798–1828 (2013)
[4] Boulesteix, A.L., Hoffmann, S., Charlton, A., Seibold, H.: A replication crisis in methodological research? Significance (5), 18–21 (2020)
[5] Boulesteix, A.L., Lauer, S., Eugster, M.J.: A plea for neutral comparison studies in computational sciences. PLoS ONE (4), e61562 (2013)
[6] Cayton, L.: Algorithms for manifold learning. Tech. Rep. 1-17, Univ. of California at San Diego (2005)
[7] Chen, D., Müller, H.G.: Nonlinear manifold representations for functional data. The Annals of Statistics (1), 1–29 (2012)
[8] Chen, L., Buja, A.: Local Multidimensional Scaling for Nonlinear Dimension Reduction, Graph Drawing, and Proximity Analysis. Journal of the American Statistical Association (485), 209–219 (2009)
[9] Coifman, R.R., Lafon, S.: Diffusion maps. Applied and Computational Harmonic Analysis (1), 5–30 (2006)
[10] Dimeglio, C., Gallón, S., Loubes, J.M., Maza, E.: A robust algorithm for template curve estimation based on manifold embedding. Computational Statistics & Data Analysis, 373–386 (2014)
[11] Eugster, M.J.: Benchmark Experiments: A Tool for Analyzing Statistical Learning Algorithms. Ph.D. thesis, LMU München (2011), https://edoc.ub.uni-muenchen.de/12990/1/EugsterManuelJA.pdf
[12] Goldberg, Y., Zakai, A., Kushnir, D., Ritov, Y.: Manifold learning: The price of normalization. Journal of Machine Learning Research (Aug), 1909–1939 (2008)
[13] Gong, S., Boddeti, V.N., Jain, A.K.: On the Intrinsic Dimensionality of Image Representations. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3987–3996 (2019)
[14] Happ, C., Scheipl, F., Gabriel, A.A., Greven, S.: A general framework for multivariate functional principal component analysis of amplitude and phase variation. Stat (1), e220 (2019)
[15] Hutson, M.: Artificial intelligence faces reproducibility crisis. Science (6377), 725–726 (2018)
[16] Kobak, D., Berens, P.: The art of using t-SNE for single-cell transcriptomics. Nature Communications (1), 5416 (2019)
[17] Kobak, D., Linderman, G.C.: UMAP does not preserve global structure any better than t-SNE when using the same initialization. Preprint, bioRxiv (2019)
[18] Konopka, T.: umap: Uniform Manifold Approximation and Projection (2020), https://CRAN.R-project.org/package=umap, R package version 0.2.4.1
[19] Kraemer, G., Reichstein, M., Mahecha, M.D.: dimRed and coRanking - Unifying Dimensionality Reduction in R. The R Journal (1), 342 (2018)
[20] Krijthe, J.H.: Rtsne: T-Distributed Stochastic Neighbor Embedding using Barnes-Hut Implementation (2015), https://github.com/jkrijthe/Rtsne, R package version 0.15
[21] Large, J., Kemsley, E.K., Wellner, N., Goodall, I., Bagnall, A.: Detecting forged alcohol non-invasively through vibrational spectroscopy and machine learning. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining. pp. 298–309. Springer (2018)
[22] Lee, J.A., Renard, E., Bernard, G., Dupont, P., Verleysen, M.: Type 1 and 2 mixtures of Kullback–Leibler divergences as cost functions in dimensionality reduction based on similarity preservation. Neurocomputing, 92–108 (2013)
[23] Lee, J.A., Verleysen, M.: Nonlinear Dimensionality Reduction. Springer Science & Business Media (2007)
[24] Lee, J.A., Verleysen, M.: Quality assessment of dimensionality reduction: Rank-based criteria. Neurocomputing (7-9), 1431–1443 (2009)
[25] Liang, J., Chenouri, S., Small, C.G.: A new method for performance analysis in nonlinear dimensionality reduction. Statistical Analysis and Data Mining: The ASA Data Science Journal (1), 98–108 (2020)
[26] Lucic, M., Kurach, K., Michalski, M., Gelly, S., Bousquet, O.: Are GANs created equal? A large-scale study. In: Advances in Neural Information Processing Systems. pp. 700–709 (2018)
[27] Ma, Y., Fu, Y.: Manifold Learning Theory and Applications. CRC Press (2011)
[28] Maaten, L.v.d., Hinton, G.: Visualizing Data using t-SNE. Journal of Machine Learning Research, 2579–2605 (2008)
[29] McInnes, L., Healy, J., Melville, J.: UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426 (2018), https://arxiv.org/abs/1802.03426
[30] Nene, S.A., Nayar, S.K., Murase, H.: Columbia Object Image Library: COIL-20. Dept. Comp. Sci., Columbia University, New York, Tech. Rep. (1996)
[31] Oksanen, J., Blanchet, F.G., Friendly, M., Kindt, R., Legendre, P., McGlinn, D., Minchin, P.R., O'Hara, R.B., Simpson, G.L., Solymos, P., Stevens, M.H.H., Szoecs, E., Wagner, H.: vegan: Community Ecology Package (2019), https://CRAN.R-project.org/package=vegan, R package version 2.5-6
[32] Pfisterer, F., Beggel, L., Sun, X., Scheipl, F., Bischl, B.: Benchmarking time series classification – functional data vs machine learning approaches. arXiv:1911.07511 (2019), https://arxiv.org/abs/1911.07511
[33] R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2020)
[34] Ramsay, J.O., Silverman, B.W.: Functional Data Analysis. Springer Series in Statistics, Springer, New York, 2nd edn. (2005)
[35] Richards, J., Cannoodt, R.: diffusionMap: Diffusion Map (2019), https://CRAN.R-project.org/package=diffusionMap, R package version 1.2.0
[36] Srivastava, A., Wu, W., Kurtek, S., Klassen, E., Marron, J.S.: Registration of Functional Data Using Fisher-Rao Metric. arXiv:1103.3817 (2011), http://arxiv.org/abs/1103.3817
[37] Tenenbaum, J.B., Silva, V.d., Langford, J.C.: A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science (5500), 2319–2323 (2000)
[38] Van Mechelen, I., Boulesteix, A.L., Dangl, R., Dean, N., Guyon, I., Hennig, C., Leisch, F., Steinley, D.: Benchmarking in cluster analysis: A white paper. arXiv:1809.10496 (2018), https://arxiv.org/abs/1809.10496
[39] Venna, J., Kaski, S.: Neighborhood Preservation in Nonlinear Projection Methods: An Experimental Study. In: Dorffner, G., Bischof, H., Hornik, K. (eds.) Artificial Neural Networks — ICANN 2001. pp. 485–491. Lecture Notes in Computer Science, Springer, Berlin, Heidelberg (2001)
[40] Venna, J., Peltonen, J., Nybo, K., Aidos, H., Kaski, S.: Information retrieval perspective to nonlinear dimensionality reduction for data visualization. Journal of Machine Learning Research (2) (2010)
[41] Wang, J.L., Chiou, J.M., Müller, H.G.: Functional data analysis. Annual Review of Statistics and Its Application (1), 257–295 (2016)
[42] Wang, W., Yan, Y., Nie, F., Yan, S., Sebe, N.: Flexible manifold learning with optimal graph for image and video representation. IEEE Transactions on Image Processing