FDApy: a Python package for functional data
Steven Golovkine ∗ January 28, 2021
Abstract
We introduce the Python package FDApy, an implementation of functional data. This package provides modules for the analysis of such data. It includes classes for functional data of different dimensions as well as for irregularly sampled functional data. A simulation toolbox is also provided. It may be used to simulate different clusters of functional data. Some methodologies to handle these data are implemented, such as dimension reduction and clustering. New methods can be easily added. The package is publicly available on the Python Package Index and Github.
With a large number of applications ranging from sports to the automotive industry and healthcare, more and more phenomena produce observation entities in the form of a sequence of possibly vector-valued measurements recorded intermittently at several discrete points in time. Functional data analysis (FDA) considers such data as being values of the realizations of a stochastic process, recorded with some error, at discrete random times. The purpose of FDA is to study such trajectories, also called curves or functions. The concept of functional data can be linked to the study of time series data, as dense, usually regular samples of potentially non-smooth functions, or longitudinal data, as sparse and irregular samples of smooth functions, or even image data, which can be represented as functions on two-dimensional domains.

In order to apply FDA to a real dataset, there is a need for appropriate software with up-to-date methodological implementations and easy addition of new theoretical developments. Currently, the most widely known software for FDA is the R package fda [17], based on work cited in [15, 16]. Usually, R packages for FDA are specific to one method. For example, one may cite FDboost [3] and refund [7] for regression and classification, funFEM [1], funHDDC [18] and funLBM [2] for clustering, or fdasrvf [20] and fdakma [14] for functional data registration, etc. Most of these packages are built upon fda. However, in most packages, the functional data are restricted to univariate ones that are well described by their coefficients in a given basis of functions. The funData package [11] has been recently released. It aims to provide a unified framework to handle univariate and multivariate functional data defined on different dimensional domains. Sparse functional data are also considered.

∗Groupe Renault & CREST - UMR 9194, Rennes, France, [email protected]
The MFPCA [12] package is currently the only one built on top of the funData package. It implements multivariate functional principal components analysis (MFPCA) for data defined on different dimensional domains [10].

Concerning the Python community, there are only a few packages related to FDA. One may cite sktime [13] and tslearn [19], which provide tools for the analysis of time series as a scikit-learn compatible API. Thus, they implement specific time series methods such as DTW-based ones or shapelet learning. The only one that develops specific methods for FDA is scikit-fda [4]. In particular, it implements diverse registration techniques as well as statistical data depths for functional data. However, most of the methods are for one-dimensional data and they only accept multivariate functional data defined on the same unidimensional domain.

The FDApy package implements methods to handle functional data in Python based on an object-oriented approach, in the spirit of funData. In particular, it provides classes to manipulate dense, irregularly sampled and multivariate functional data defined on one- or higher-dimensional domains. A large simulation toolbox, based on basis decomposition, is provided. It allows parameters to be configured in order to simulate different clusters within the data. An implementation of MFPCA for data defined on different domains, as described in [10], is provided. Moreover, the fCUBT algorithm [8], used to create a partition of the data, is also available. All methods are implemented using the defined classes. The package is publicly available on Github (https://github.com/StevenGolovkine/FDApy) and the Python Package Index (https://pypi.org/project/FDApy/).

In the general case, the data consist of independent trajectories of a vector-valued stochastic process $X = (X^{(1)}, \dots, X^{(P)})^\top$, $P \ge 1$. For each $1 \le p \le P$, let $\mathcal{T}_p \subset \mathbb{R}^{d_p}$ with $d_p \ge 1$, as for instance $\mathcal{T}_p = [0, 1]^{d_p}$. The realizations of each coordinate $X^{(p)}: \mathcal{T}_p \to \mathbb{R}$ are assumed to belong to $L^2(\mathcal{T}_p)$, the Hilbert space of square-integrable, real-valued functions defined on $\mathcal{T}_p$. Thus, $X$ is a stochastic process indexed by $t = (t_1, \dots, t_P)$ belonging to the $P$-fold Cartesian product $\mathcal{T} := \mathcal{T}_1 \times \cdots \times \mathcal{T}_P$ and taking values in the $P$-fold Cartesian product space $\mathcal{H} := L^2(\mathcal{T}_1) \times \cdots \times L^2(\mathcal{T}_P)$.

In practice, realizations of functional data are only obtained on a finite grid and possibly with noise. Let us consider $N$ curves $X_1, \dots, X_n, \dots, X_N$ generated as a random sample of the $P$-dimensional stochastic process $X$ with continuous trajectories. For each $1 \le n \le N$, and given a vector of positive integers $M_n = (M_n^{(1)}, \dots, M_n^{(P)})$, let $T_{n,m} = (T_{n,m_1}^{(1)}, \dots, T_{n,m_P}^{(P)})$, $1 \le m_p \le M_n^{(p)}$, $1 \le p \le P$, be the random observation times for the curve $X_n$. These times are obtained as independent copies of a variable $T$ taking values in $\mathcal{T}$. The vectors $M_1, \dots, M_N$ represent an independent sample of an integer-valued random vector $M$ with expectation $\mu_M$. We assume that the realizations of $X$, $M$ and $T$ are mutually independent. The observations associated with a curve, or trajectory, $X_n$ consist of the pairs $(Y_{n,m}, T_{n,m}) \in \mathbb{R}^P \times \mathcal{T}$, where $m = (m_1, \dots, m_P)$, $1 \le m_p \le M_n^{(p)}$, $1 \le p \le P$, and $Y_{n,m}$ is defined as

$$Y_{n,m} = X_n(T_{n,m}) + \varepsilon_{n,m}, \quad 1 \le n \le N, \qquad (1)$$

where the $\varepsilon_{n,m}$ are independent copies of a centered random vector $\varepsilon \in \mathbb{R}^P$ with finite variance. We use the notation $X_n(t)$ for the value at $t$ of the realization $X_n$ of $X$. Univariate functional data refers to the case where $P = 1$.

The remainder of the paper is organized as follows. In Section 2, we introduce the classes for an object-oriented implementation of functional data. Section 3 describes the data we used as examples.
In Section 4, we present the creation and manipulation of functional data objects. Sections 5, 6 and 7 then demonstrate some methods that the package implements: the estimation of components, multivariate functional principal components analysis and the fCUBT algorithm used to find a partition of the sampled data.

The representation of functional data is done using two classes that both extend an abstract class FunctionalData:

1. Class DenseFunctionalData represents dense functional data of arbitrary dimension (one for curves, two for images, etc.) on a common set of observation points $t_1, \dots, t_M$ for all observations. It may have missing values within the data.

2. Class IrregularFunctionalData represents irregularly sampled data of arbitrary dimension on different sets of observation points. The number and the location of the sampling points vary between observations. It must not have missing values within the data.

Finally, the implementation of the class MultivariateFunctionalData is different because it does not extend the class FunctionalData but the UserList one. Thus, an instance of MultivariateFunctionalData is defined as a list of $P$ elements from the DenseFunctionalData and/or IrregularFunctionalData classes that may be defined on different dimensional domains (e.g. curves and images). A diagram of the classes is given in Figure 1.
Remark 1
In practice, the difference between dense and irregularly sampled functional data can be tricky. By design, dense functional data are assumed to be sampled on the complete grid $T = \{t_1, \dots, t_M\}$, and measurement errors may exist. Taking data from sensors as an example, observations are recorded at a given sampling rate and are timestamped, but some anomalies may happen during the recording process. For irregularly sampled functional data, on the other hand, we assume that the curves are observed at different sampling points, with potentially different numbers of points. This is usually the case in medical studies such as growth curve analysis, because one cannot expect the individuals to be measured at the exact same times.
Figure 1: Representation of the main classes.

The DenseFunctionalData and IrregularFunctionalData classes represent the data in a similar way: the instance variable argvals contains the sampling points and the instance variable values represents the data. In the case of dense functional data, argvals is a dictionary where each entry contains a numpy array that represents the common sampling points for a given dimension, while values is a numpy array containing the observations. In the case of one-dimensional data sampled on a grid with $M$ points, argvals contains only one entry as an array of shape $(M,)$ and values is an array of dimension $(N, M)$ where each row is an observation. For two-dimensional observations with $M^{(1)} \times M^{(2)}$ sampling points, argvals contains two entries, the first being an array of shape $(M^{(1)},)$ and the second an array of shape $(M^{(2)},)$, and values is an array of dimension $(N, M^{(1)}, M^{(2)})$ where the first coordinate gives the observation. Higher-dimensional data are represented by adding an entry in the argvals dictionary and a dimension in the values array.

For irregularly sampled functional data, both argvals and values are dictionaries. The entries of argvals are dictionaries where each entry consists of the sampling points for a particular observation. In a similar way, each entry of the values dictionary represents an observation. For one-dimensional irregularly sampled functional data, argvals contains one entry which is a dictionary of size $N$ containing the sampling points as arrays of shape $(M_n,)$, $1 \le n \le N$, and values is a dictionary with $N$ entries containing the observations as arrays of shape $(M_n,)$, $1 \le n \le N$. For higher dimensions, each entry of the argvals dictionary represents a dimension of the process and contains another dictionary with $N$ entries for the sampling points. Likewise, the values dictionary has $N$ entries and every one of them is an array of shape $(M_n^{(1)}, M_n^{(2)}, \dots)$, $1 \le n \le N$.

Finally, the MultivariateFunctionalData class inherits from the UserList class, and thus gathers $P$ instances of DenseFunctionalData and/or IrregularFunctionalData as a list. As a result, this class has access to all the methods applicable to lists, such as append, extend, pop, etc. Given a specific dataset, instances of the different classes are called DenseFunctionalData, IrregularFunctionalData or MultivariateFunctionalData objects. In the following, the generic term functional data object will refer to instances of all three classes.
Data used in the examples
We will consider two datasets in the code examples. The first one is the Canadian weather data, which is presented in the textbook by Ramsay and Silverman [15] and available in their R package fda [17]. The second dataset is the CD4 cell count dataset, used in [6], and available in the R package refund [7]. As both examples are one-dimensional data, higher-dimensional datasets, in particular image ones, will be simulated using the simulation toolbox provided in the package.

The Canadian weather dataset contains daily recordings of the temperature (in degrees Celsius) and the precipitation (in millimeters) for $N = 35$ Canadian cities spread across the country and averaged over the years 1960 to 1994. The daily temperature data will be used as an example of DenseFunctionalData defined on a one-dimensional domain. We will add the daily precipitation records to the temperature ones in order to create a MultivariateFunctionalData object with elements defined on one-dimensional domains ($\mathcal{T}_1 = [1, 365]$ and $\mathcal{T}_2 = [1, 365]$).

The CD4 cell count dataset collects the number of CD4 cells per milliliter of blood of $N = 366$ participants. CD4 cells are a particular type of white blood cell and are key components of the immune system. HIV attacks the CD4 cells in the patient's blood, so the count of CD4 cells can be viewed as a measure of the disease progression. For this dataset, the number of CD4 cells is measured roughly twice a year and centered at the time of seroconversion, which is the time at which HIV becomes detectable. For every individual, the number of measurements varies between 1 and 11 over a period of 18 months before and 42 months after seroconversion. The sampling points differ between observations. We will use this dataset as an example of IrregularFunctionalData.

With the help of the two example datasets, this section will present how to create and manipulate a functional data object. In particular, we review the different instance variables used to extract information from the data. We also present methods to modify and plot functional data objects. General methods, such as the computation of the mean or covariance, for MultivariateFunctionalData objects usually call the corresponding methods for each univariate element and concatenate the results appropriately. For all the code examples, we assume that the correct functions from the FDApy package are loaded, as well as the packages numpy and pandas, using the following code snippet:

```python
import numpy as np
import pandas as pd
```

Creation of objects
Assuming the Canadian temperature data is stored in a temperature.csv file and the Canadian precipitation data in a precipitation.csv file, the following code loads the data into pandas dataframes and creates DenseFunctionalData instances from them. We explicitly name the dimension of the observations.

```python
temperature = pd.read_csv('temperature.csv', index_col=0)

argvals = pd.factorize(temperature.columns)[0]
values = np.array(temperature)
dailyTemp = DenseFunctionalData({'input_dim_0': argvals},
                                values)

precipitation = pd.read_csv('precipitation.csv', index_col=0)

argvals = pd.factorize(precipitation.columns)[0]
values = np.array(precipitation)
dailyPrec = DenseFunctionalData({'input_dim_0': argvals},
                                values)
```

Given multiple functional data objects, the creation of MultivariateFunctionalData instances is done by passing a list of objects to the constructor method.

```python
canadWeather = MultivariateFunctionalData([dailyTemp,
                                           dailyPrec])
```

The construction of an IrregularFunctionalData instance is similar, except that the dictionaries for argvals and values must contain an entry for each observation of the data. We consider that the CD4 cell count data are stored in a cd4.csv file containing a matrix representing the CD4 counts for each patient on the common grid of all sampling points, with missing values coded as NA. Thus, the following code extracts only the non-missing values for each patient and constructs an instance of IrregularFunctionalData.

```python
cd4 = pd.read_csv('cd4.csv', index_col=0)

all_argvals = cd4.columns.astype(np.int64)
argvals = {idx: np.array(all_argvals[~np.isnan(row)])
           for idx, row in enumerate(cd4.values)}
values = {idx: row[~np.isnan(row)]
          for idx, row in enumerate(cd4.values)}
cd4_counts = IrregularFunctionalData({'input_dim_0': argvals},
                                     values)
```

Remark 2
Two loaders are included within the package: read_csv and read_ts. These methods can be used to load already well-formatted data from csv or ts files. The ts files are, in particular, used in the UEA & UCR Time Series Classification Repository. These functions are wrappers of the above code snippets. Nonetheless, multivariate functional data cannot be imported in this way.

Basic information about the functional data object is printed on the standard output when the object is called in the command line. For example, for the temperature dataset, the output will be:

```python
dailyTemp
```
Univariate functional data object with 35 observations on a 1-dimensional support.
The outputs are similar for instances of the other types of functional data objects.

For dense and irregular functional objects, a subset of the data can be extracted using the convenient way to subset objects provided in Python. For example, in order to get the observations from 5 to 12 from the temperature data, we may write:

```python
dailyTemp[5:13]
```

Univariate functional data object with 8 observations on a 1-dimensional support.

Note that this will not work with MultivariateFunctionalData instances. This subsetting method will instead return the univariate functional data in the list. However, an iterator through the observations of the multivariate functional data is provided as the get_obs method.

In regards to Remark 1, we implement functions to convert DenseFunctionalData instances into IrregularFunctionalData instances and to do the reverse operation. The missing values are coded with np.nan. The code is thus written:

```python
dailyTemp.as_irregular()
```

Access to the instance variables

The functional data classes come with multiple instance variables. In Python, they can usually be accessed using instance_name.variable_name. We will present some of them in the following. Note that some variables cannot be accessed directly for multivariate functional data and have to be retrieved by looping through its univariate elements. Of course, argvals and values are accessible using dailyTemp.argvals and dailyTemp.values and show what the user gave to the object constructor. Furthermore, we provide a variable argvals_stand with the same shape as argvals but with normalized sampling points. The instance variables are the following:

• n_obs – number of observations in the object.

• n_points – number of sampling points in the object for each dimension as a dictionary. For multivariate functional data objects, it is a list of $P$ entries. In the case of IrregularFunctionalData, the returned number is the mean number of sampling points per observation.

• n_dim – input dimension of the functional data (one for curves, two for images, etc.). For MultivariateFunctionalData objects, it is expressed as a list.

• range_obs – minimum and maximum values of the observations as a tuple.

• range_points – minimum and maximum values of the sampling points as a tuple. The calculation is based on the variable argvals.

Basic plotting methods for functional data objects are provided in the package. They are built upon the matplotlib package. We assume the package is loaded with

```python
import matplotlib.pyplot as plt
```

The plot method returns an instance of Axes from the matplotlib library. Thus, all the plotting options relative to ticks, frames and so on are modifiable using this instance of Axes. Customization of the graph parameters, such as colors, linetypes or linewidths, can be made by passing the arguments as inputs to the function. The following snippet is used to plot the temperature curves for all the Canadian weather stations (represented as a DenseFunctionalData object):

```python
_ = plot(dailyTemp)
plt.xlabel('Days')
plt.ylabel('Temperature')
plt.title('Daily Temperature Data')
plt.show()
```
Figure 2: Results of the plot method for functional data objects: daily temperature data (left) and CD4 counts for individuals 5 to 14 on the log-scale (right).
Figure 3: Links between classes in the simulation toolbox.

A plot of the CD4 cell counts for 10 patients on the log-scale (represented as an IrregularFunctionalData object) is given by

```python
_ = plot(cd4_counts[5:15])
plt.xlabel('Month since seroconversion')
plt.ylabel('CD4 cell counts (log-scale)')
plt.title('CD4 counts for individual 5-14')
plt.show()
```

The plots are shown in Figure 2.
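Because the plotting method hands back a matplotlib Axes, the usual matplotlib customization applies. The sketch below uses a stand-in plot function (a hypothetical replacement for the package's own, which takes a functional data object) purely to show the Axes-based workflow:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so the example runs headless
import matplotlib.pyplot as plt
import numpy as np

def plot(values, argvals):
    """Stand-in for the package's plot function: one line per observation."""
    _, ax = plt.subplots()
    for row in values:
        ax.plot(argvals, row)
    return ax

argvals = np.linspace(1, 365, 365)
values = np.sin(2 * np.pi * argvals / 365)[None, :] * np.array([[10.0], [20.0]])

ax = plot(values, argvals)
ax.set_xlabel('Days')                    # further customization through the Axes
ax.set_ylabel('Temperature')
ax.set_title('Daily Temperature Data')
```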
Simulation functions are implemented in order to test new methodological developments. The data can be simulated using a truncated version of the Karhunen-Loève representation (class KarhunenLoeve) as well as diverse Brownian motions (class Brownian), both of which inherit from the Simulation class (see Figure 3). An element of the Simulation class has two principal instance variables: basis, which contains the basis used, and data, which contains the simulated observations (after running the function new()).

For Brownian motions, three types are implemented: standard, fractional and geometric. For example, we can simulate $N = 10$ realizations of a fractional Brownian motion on a one-dimensional observation grid with a given Hurst parameter.

Figure 4: Example of simulated data. (a) Brownian motion. (b) Karhunen-Loève expansion.

$X$ has a Karhunen-Loève decomposition. Each of its realizations can be represented using this decomposition, truncated at $J$ coefficients:

$$X_n(t) = \mu(t) + \sum_{j=1}^{J} \xi_{j,n} \phi_j(t), \quad t \in \mathcal{T}, \quad n = 1, \dots, N,$$

with a common mean function $\mu$ and an orthonormal basis of functions $\{\phi_j\}_{j=1,\dots,J}$. The coefficients $\xi_{j,n}$ are realizations of Gaussian random variables $\xi_j$ such that $E(\xi_j) = 0$ and $\mathrm{Var}(\xi_j) = \lambda_j$, with eigenvalues $\lambda_j \ge 0$. For example, we simulate $N = 10$ curves on $\mathcal{T} = [0, 1]$ using a B-splines basis on $\mathcal{T}$ and eigenvalues with exponential decrease:

```python
kl = KarhunenLoeve(name='bsplines', n_functions=5)
kl.new(n_obs=10, argvals=np.linspace(0, 1, 101),
       cluster_std='exponential')
```

Figure 4 presents a plot of the simulated Brownian motions and those from the Karhunen-Loève decomposition. The simulation of two-dimensional data is based on the tensor product of basis functions. Simulation for higher-dimensional data is not implemented.

We also added methods to generate noisy observations as well as sparse data. Note that these functions are only implemented on instances of Simulation. The add_noise function adds pointwise noise to the observations. Both homoscedastic and heteroscedastic noise are implemented. If a single scalar is given as a parameter to the function, homoscedastic noise will be simulated.
For the heteroscedastic case, lambda functions and vectors of size n_points can be supplied by the user. The noisy data are stored in the instance variable noisy_data. For example, to add random noise with variance $\sigma^2 = 0.05$, we run

```python
kl.add_noise(var_noise=0.05)
```

and, for heteroscedastic noise with variance defined by $x \mapsto \sqrt{1 + |x|}$,

```python
kl.add_noise(var_noise=lambda x: np.sqrt(1 + np.abs(x)))
```

The function sparsify randomly removes sampling points from the observations. Precisely, we randomly generate the number of sampling points to retain for each observation and then randomly select the sampling points to remove from each observation. The sparse data are stored in the instance variable sparse_data. For example, to randomly remove 50% of the sampling points (more or less 5%) from the simulated Brownian data, we run

```python
brownian.sparsify(percentage=0.5, epsilon=0.05)
```

Figure 5 presents a plot of the noisy and sparse versions of the Karhunen-Loève simulated data.

Figure 5: Results of the add_noise and sparsify functions on the Karhunen-Loève simulated data. (a) Noisy data. (b) Sparse data.

Let $K$ be a positive integer, and let $Z$ be a discrete random variable taking values in the range $\{1, \dots, K\}$ such that $P(Z = k) = p_k$ with $p_k > 0$ and $\sum_{k=1}^{K} p_k = 1$. The variable $Z$ represents the cluster membership of the realizations of the process. We consider that the stochastic process follows a functional mixture model with $K$ components, that is, it allows for the following decomposition:

$$X(t) = \sum_{k=1}^{K} \mu_k(t) \mathbb{1}_{\{Z = k\}} + \sum_{j \ge 1} \xi_j \phi_j(t), \quad t \in \mathcal{T},$$

where

• $\mu_1, \dots, \mu_K$ are the mean curves per cluster;

• $\{\phi_j\}_{j \ge 1}$ is an orthonormal basis of functions;

• $\xi_j$, $j \ge 1$, are Gaussian random variables whose distribution depends on $Z$: for each $1 \le k \le K$, $\xi_j \mid Z = k \sim \mathcal{N}(0, \sigma_{kj}^2)$.

For example, we can generate $N = 10$ realizations of two clusters using 3 eigenfunctions with given coefficients:

```python
N = 10
n_features = 3
n_clusters = 2
centers = np.array([[2, -1], ...])  # coefficients of the cluster means
```

Figure 6: Simulation of data with two clusters. Each color represents a cluster.
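The mixture model above can also be simulated directly with numpy. This is a sketch under arbitrary choices of cluster means, basis functions and variances (none of them taken from the package's KarhunenLoeve class):

```python
import numpy as np

rng = np.random.default_rng(0)

N, K, J = 10, 2, 3
t = np.linspace(0, 1, 101)

# Cluster means mu_1, mu_2 and an orthonormal sine basis phi_j on [0, 1].
means = np.vstack([2 * t, -2 * t])
basis = np.vstack([np.sqrt(2) * np.sin(2 * np.pi * (j + 1) * t)
                   for j in range(J)])

p = np.array([0.5, 0.5])               # mixing proportions, sum to 1
Z = rng.choice(K, size=N, p=p)         # cluster memberships Z_n
sigma = np.array([[1.0, 0.5, 0.25],    # sigma_kj: std of xi_j given Z = k
                  [0.5, 0.25, 0.125]])

# X_n(t) = mu_{Z_n}(t) + sum_j xi_{j,n} phi_j(t), with xi_j | Z=k ~ N(0, sigma_kj^2)
xi = rng.normal(0.0, sigma[Z])         # shape (N, J)
X = means[Z] + xi @ basis              # shape (N, len(t))
```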
Considering the model defined in (1), we assume that $P = 1$ and, for the sake of readability, we omit the superscript. The objective is to estimate the function $X_n(\cdot)$ using the available sample points. Thus, we consider local polynomial smoothers [5]. This type of estimator crucially depends on a tuning parameter, the bandwidth.

Let $d \ge 0$ be the degree of the polynomial and $t \in \mathcal{T}$ the evaluation point for the estimation of $X_n$. For any $u \in \mathbb{R}$, we consider the vector $U(u) = (1, u, \dots, u^d/d!)^\top$ and note $U_h(\cdot) = U(\cdot/h)$. Let $K: \mathbb{R} \to \mathbb{R}$ be a positive kernel and define $K_h(\cdot) = h^{-1} K(\cdot/h)$. Moreover, we define

$$\vartheta_{M_n,h} := \arg\min_{\vartheta \in \mathbb{R}^{d+1}} \sum_{m=1}^{M_n} \left\{ Y_{n,m} - \vartheta^\top U_h(T_{n,m} - t) \right\}^2 K_h(T_{n,m} - t),$$

where $h$ is the bandwidth. The vector $\vartheta_{M_n,h}$ satisfies the normal equations $A \vartheta_{M_n,h} = a$ with

$$A = A_{M_n,h} = \frac{1}{M_n} \sum_{m=1}^{M_n} U_h(T_{n,m} - t)\, U_h^\top(T_{n,m} - t)\, K_h(T_{n,m} - t),$$

$$a = a_{M_n,h} = \frac{1}{M_n} \sum_{m=1}^{M_n} Y_{n,m}\, U_h(T_{n,m} - t)\, K_h(T_{n,m} - t).$$

Let $\lambda$ be the smallest eigenvalue of the matrix $A$ and note that, whenever $\lambda > 0$, we have $\vartheta_{M_n,h} = A^{-1} a$. With at hand an estimation $\widehat{h}$ of the bandwidth, the local polynomial estimator $\widehat{X}_n(t)$ of order $d$ is given by

$$\widehat{X}_n(t) = U^\top(0)\, \widehat{\vartheta}, \quad \text{where } \widehat{\vartheta} = \vartheta_{M_n,\widehat{h}}.$$

If $d = 0$, we are in the particular case of the Nadaraya-Watson estimator. The Gaussian, Epanechnikov, tri-cube and bi-square kernels are implemented and others can
Figure 7: (a) Curve and (b) smoothed estimation for the Canadian Temperature data.

be added in a modular way. We propose an estimate of the bandwidth $h$ that is based on the regularity of the underlying function [9]. For example, we can smooth the daily temperature curves using a local polynomial smoother with such an estimated bandwidth; the result for one curve is shown in Figure 7.

In this section, we develop estimators for the mean and the covariance functions of a component $X^{(p)}$, $1 \le p \le P$, of the process $X$. These estimators might be used to compute estimators of the eigenvalues and eigenfunctions of $X^{(p)}$ for the Karhunen-Loève expansion.

Let $\widehat{X}^{(p)}_n$ be a suitable nonparametric estimator of the curve $X^{(p)}_n$ built with the $M^{(p)}_n$ pairs $(Y^{(p)}_{n,m_p}, T^{(p)}_{n,m_p})$, $n = 1, \dots, N$, as for instance the local polynomial estimator presented in the previous subsection. With at hand the $\widehat{X}^{(p)}_n$'s tuned for the mean function estimation, we define

$$\widehat{\mu}^{(p)}_N(t_p) = \frac{1}{N} \sum_{n=1}^{N} \widehat{X}^{(p)}_n(t_p), \quad t_p \in \mathcal{T}_p.$$

For example, the code snippet for the estimation of the mean curve of the daily temperature curves, using a local linear smoother with bandwidth equal to 0.05, is

```python
mean_temp = dailyTemp.mean(smooth='LocalLinear',
                           bandwidth=0.05)
```

Figure 8: (a) Mean and (b) covariance estimation for the Canadian Temperature data.

For the covariance function, following [21], we distinguish the diagonal from the non-diagonal points. With at hand the $\widehat{X}^{(p)}_n$'s tuned for the covariance function estimation,

$$\widehat{C}_{p,p}(s_p, t_p) = \frac{1}{N} \sum_{n=1}^{N} \widehat{X}^{(p)}_n(s_p)\, \widehat{X}^{(p)}_n(t_p) - \widehat{\mu}^{(p)}_N(s_p)\, \widehat{\mu}^{(p)}_N(t_p), \quad s_p, t_p \in \mathcal{T}_p,\ s_p \ne t_p.$$

The diagonal of the covariance is then estimated using two-dimensional kernel smoothing with the $\widehat{C}_{p,p}(s_p, t_p)$, $s_p \ne t_p$, as input data. See [21] for the details.

```python
cov_temp = dailyTemp.covariance(smooth='GAM')
```

The
FDApy package implements MFPCA for data defined on potentially different domains, developed by Happ and Greven [10]. The implementation of the method is built upon the functional data classes defined in the package. After giving a short review of the methodology in Section 6.1, we explain how to effectively use it in Section 6.2. For theoretical details, please refer to [10].

Following Happ and Greven [10], the multivariate components for $X$ are computed by plugging in the univariate components computed from each component $X^{(p)}$. These estimations are done as follows.

1. Perform a univariate fPCA on each of the components of $X$ separately. For a component $X^{(p)}$, the eigenfunctions and eigenvalues are computed as a matrix analysis of the estimated covariance $\widehat{C}_{p,p}$. This results in a set of eigenfunctions $(\widehat{\rho}^{(p)}_1, \dots, \widehat{\rho}^{(p)}_{J^{(p)}})$ associated with a set of eigenvalues $(\widehat{\lambda}^{(p)}_1, \dots, \widehat{\lambda}^{(p)}_{J^{(p)}})$ for a given truncation integer $J^{(p)}$. The univariate scores for a realization $X^{(p)}_n$ of $X^{(p)}$ are then given by $\widehat{c}^{(p)}_{j,n} = \langle \widehat{X}^{(p)}_n, \widehat{\rho}^{(p)}_j \rangle$, $1 \le j \le J^{(p)}$.

2. Define the matrix $\mathcal{Z} \in \mathbb{R}^{N \times J_+}$, $J_+ = \sum_{p=1}^{P} J^{(p)}$, where each row stacks the scores for each component for a unique observation, $(\widehat{c}^{(1)}_{1,n}, \dots, \widehat{c}^{(1)}_{J^{(1)},n}, \dots, \widehat{c}^{(P)}_{1,n}, \dots, \widehat{c}^{(P)}_{J^{(P)},n})$. Define $\mathbf{Z} \in \mathbb{R}^{J_+ \times J_+}$ such that $\mathbf{Z} = (N - 1)^{-1} \mathcal{Z}^\top \mathcal{Z}$.

3. An eigenanalysis of the matrix $\mathbf{Z}$ is performed and leads to the eigenvectors $\widehat{v}_j$ and eigenvalues $\widehat{\lambda}_j$.

4. Finally, the multivariate eigenfunctions are estimated with

$$\widehat{\varphi}^{(p)}_j(t_p) = \sum_{j'=1}^{J^{(p)}} [\widehat{v}_j]^{(p)}_{j'}\, \widehat{\rho}^{(p)}_{j'}(t_p), \quad t_p \in \mathcal{T}_p,\ 1 \le j \le J_+,\ 1 \le p \le P,$$

and the multivariate scores with $\widehat{c}_{j,n} = \mathcal{Z}_{n,\cdot}\, \widehat{v}_j$, $1 \le n \le N$, $1 \le j \le J_+$.

The multivariate Karhunen-Loève expansion of the process $X$ is thus

$$\widehat{X}_n(t) = \widehat{\mu}_N(t) + \sum_{j=1}^{J} \widehat{c}_{j,n}\, \widehat{\varphi}_j(t), \quad t \in \mathcal{T},$$

where $\widehat{\mu}_N(\cdot) = (\widehat{\mu}^{(1)}_N(\cdot), \dots, \widehat{\mu}^{(P)}_N(\cdot))$ is the vector of the estimated mean functions.
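Steps 2 and 3 reduce to simple matrix operations. A plain-numpy sketch with randomly generated univariate scores as placeholders (in practice these would come from step 1):

```python
import numpy as np

rng = np.random.default_rng(1)

N = 35
J1, J2 = 4, 3                                 # truncation levels J^(1), J^(2); J_+ = 7

# Univariate scores from step 1 (random placeholders here).
scores_1 = rng.normal(size=(N, J1))
scores_2 = rng.normal(size=(N, J2))

# Step 2: stack the scores row-wise and form Z = (N - 1)^{-1} Z^T Z.
Zmat = np.hstack([scores_1, scores_2])        # shape (N, J_+)
Zcov = Zmat.T @ Zmat / (N - 1)                # shape (J_+, J_+)

# Step 3: eigenanalysis (Zcov is symmetric), eigenvalues sorted decreasingly.
eigvals, eigvecs = np.linalg.eigh(Zcov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4 (scores part): multivariate scores c_{j,n} = Z_{n,.} v_j.
mult_scores = Zmat @ eigvecs                  # shape (N, J_+)
```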
The implementation of MFPCA is based on the MFPCA class. Hence, we construct an object of class MFPCA, specifying the number of eigencomponents that we want. The computation of the eigenelements is performed using the fit method, and the scores are then calculated using the transform method. Given scores, the inverse transformation to the functional space is done using the inverse_transform method. The triptych fit, transform and fit_transform follows the implementation choices of sklearn.

Figure 9: Results of the MFPCA for the Canadian Weather data. The first row represents the mean functions of the temperature (left) and precipitation (right) data. The second row corresponds to the bivariate eigenfunctions found for 99% of explained variance.

In this example, we perform a MFPCA of the bivariate Canadian Weather data. We expand each univariate element using a univariate FPCA with a number of components that explain 99% of the variance within the data. The number of components is specified in a list in the
MFPCA constructor. The method parameter in the fit method indicateshow the univariate scores are computed. Here, we use numerical integration to derivethem.1 f p c a = MFPCA( n c om p o ne n t s = [ 0 . 9 9 , 0 . 9 9 ] )2 f p c a . f i t ( canadWeather , method= ’ NumInt ’ )The scores are computed using the transform function:1 s c o r e s = f p c a . t r a n s f o r m ( d a t a=canadWeather )The eigenvalues are stored as instance variables. We remark the rapid decrease of theeigenvalues. Hence, we only need a few eigencomponents to explain most of the variancewithin the data. 17 f p c a . e i g e n v a l u e s array([4.36e+01, 4.62e+00, 1.20e+00, 5.76e-01, 1.14e-01, 2.61e-02])
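The 99% rule used above can be made concrete with a small numpy sketch. The eigenvalues are copied from the output shown above; the selection logic mirrors the proportion-of-variance criterion but is not the package's internal code.

```python
import numpy as np

# Eigenvalues as printed by fpca.eigenvalues above.
eigenvalues = np.array([4.36e+01, 4.62e+00, 1.20e+00,
                        5.76e-01, 1.14e-01, 2.61e-02])

# Cumulative proportion of variance explained by the leading components.
explained = np.cumsum(eigenvalues) / np.sum(eigenvalues)

# Smallest number of components whose cumulative share reaches 99%.
n_comp = int(np.searchsorted(explained, 0.99) + 1)  # -> 4
```

The first eigenvalue alone accounts for roughly 87% of the total variance, which illustrates the rapid decrease noted above.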
MFPCA is based on the univariate basis expansion of each of the components of the process. Currently, only two basis expansions are implemented. New bases can easily be added to the package. All univariate basis expansions should implement the methods fit, used to compute the elements of the basis, transform, to compute the scores of the observations within the basis, and inverse_transform, to return the observations in the functional space given their scores. The implemented bases are:

• UFPCA – Univariate Functional Principal Components Analysis for data on one-dimensional domains. This basis was used in the Canadian Weather example. Multiple smoothing methods are implemented for the estimation of the mean and the covariance (see Section 5), such as local polynomial estimation or GAM with penalized B-splines. The scores are computed using numerical integration. Considering sparse functional data, one may also use the PACE algorithm [21]. The main argument to build an instance of the class UFPCA is n_comp, which can be the proportion of variance explained by the principal components, if n_comp < 1, or the number of principal components to compute, if n_comp ≥ 1.

• FCP-TPA – Functional Candecomp/Parafac Tensor Power Algorithm for data on two-dimensional domains. This algorithm is used to find a basis decomposition of image data. Consider $N$ realizations of a stochastic process $X$ defined on $S_x \times S_y$; the data can be represented as a tensor $\mathcal{X} \in \mathbb{R}^{N \times S_x \times S_y}$. A Candecomp/Parafac representation of the data is assumed:
$$\mathcal{X} = \sum_{j=1}^{J} \lambda_j \, u_j \otimes v_j \otimes w_j,$$
where $\lambda_j$ is a scalar, $u_j \in \mathbb{R}^N$, $v_j \in \mathbb{R}^{S_x}$ and $w_j \in \mathbb{R}^{S_y}$ are vectors, and $\otimes$ denotes the outer product. The outer product $v_j \otimes w_j$ can be interpreted as the $j$-th eigenimage evaluated on the same grid points as the original data. Moreover, the vector $\lambda_j u_j$ is the score vector gathering the projections of the observations onto the eigenimage $v_j \otimes w_j$. Our implementation is adapted from the function FCP_TPA of the R package MFPCA [12]. The main argument to build an instance of the class FCPTPA is n_comp, which is the number of principal components to compute.

7 fCUBT
The FDApy package implements fCUBT for the clustering of functional data objects defined on potentially different domains, developed by [8]. The implementation of the method is built upon the functional data classes defined in the package. After giving a short review of the methodology in Section 7.1, we explain how to effectively use it in Section 7.2. For a detailed description, please refer to [8].
Let $S$ be a sample of realizations of the process $X$. We consider the problem of learning a partition $\mathcal{U}$ such that every element $U$ of $\mathcal{U}$ gathers similar elements of $S$. The partition $\mathcal{U}$ is built as a tree $\mathcal{T}$ defined using a top-down procedure by recursive splitting. Each node of the tree $\mathcal{T}$ is denoted by $S_{d,j}$. At each stage, a node $(d, j)$ is possibly split into two subnodes in a four-step procedure:

1. A MFPCA, with n_comp components, is conducted on the elements of $S_{d,j}$. It results in a set of eigenvalues $\Lambda_{d,j}$ associated with a set of eigenfunctions $\Phi_{d,j}$.

2. The matrix of scores $C_{d,j}$ is then defined with the columns built with the projections of the elements of $S_{d,j}$ onto the elements of $\Phi_{d,j}$.

3. For each $K = 1, \dots, K_{\max}$, we fit a GMM to the columns of the matrix $C_{d,j}$. The resulting models are denoted as $\{\mathcal{M}_1, \dots, \mathcal{M}_{K_{\max}}\}$. Considering the BIC, we determine
$$\widehat{K}_{d,j} = \arg\max_{K = 1, \dots, K_{\max}} \mathrm{BIC}(\mathcal{M}_K).$$

4. If $\widehat{K}_{d,j} > 1$, we split $S_{d,j}$ using the model $\mathcal{M}_2$, which is a mixture of two Gaussian vectors. Otherwise, the node is considered to be a terminal node and the construction of the tree is stopped for this node.

The recursive procedure continues downwards until one of the following stopping rules is satisfied: there are fewer than minsize observations in the node, or the estimation $\widehat{K}_{d,j}$ of the number of clusters in the node is equal to 1. When the algorithm ends, a label is assigned to each leaf (terminal node). The resulting tree is referred to as the maximal binary tree.

In this step, the idea is to join terminal nodes which do not necessarily share the same direct ancestor:

1. Build the graph $G = (V, E)$ where
$$V = \{ S_{d,j},\ 0 \leq j < 2^d,\ 0 \leq d < D \mid S_{d,j} \text{ is a terminal node} \},$$
and
$$E = \left\{ (S_{d,j}, S_{d',j'}) \mid S_{d,j}, S_{d',j'} \in V,\ S_{d,j} \neq S_{d',j'} \text{ and } \widehat{K}_{(d,j) \cup (d',j')} = 1 \right\}.$$

2. Let $(S_{d,j}, S_{d',j'})$ be the edge with the maximum BIC value. Remove this edge and replace the associated vertices by $S_{d,j} \cup S_{d',j'}$.

3. Continue the procedure by applying the previous step with $\{V \setminus \{S_{d,j}, S_{d',j'}\}\} \cup \{S_{d,j} \cup S_{d',j'}\}$.

The procedure continues until the set $V$ is reduced to a unique element or the set $E$ is empty.

The implementation of fCUBT is based on the fCUBT class. Hence, we construct an object of the fCUBT class specifying the root node of the tree, which contains a sample of data. The growth of the tree is performed using the grow function, with the number of eigencomponents to keep at each node as parameter. Once the tree has grown, the joining step is made using the join function. The prediction of the class of a new observation is possible through the predict function (or predict_proba for the probabilities to belong to each class).
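The BIC-based choice of $\widehat{K}_{d,j}$ in step 3 of the splitting procedure can be illustrated with scikit-learn's GaussianMixture. This is a standalone sketch on simulated two-cluster scores, not the package's code; note that sklearn's bic method is defined so that lower is better, hence the argmin below plays the role of the arg max in the convention used above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Simulated score matrix for a node: two well-separated Gaussian clusters.
scores = np.vstack([
    rng.normal(-3.0, 1.0, size=(100, 2)),
    rng.normal(3.0, 1.0, size=(100, 2)),
])

# Fit one GMM per candidate K and keep the best model according to BIC
# (sklearn's bic is lower-is-better).
K_max = 5
models = [
    GaussianMixture(n_components=k, random_state=0).fit(scores)
    for k in range(1, K_max + 1)
]
K_hat = int(np.argmin([m.bic(scores) for m in models]) + 1)

# K_hat > 1 means the node would be split; K_hat == 1 makes it terminal.
```

On this clearly bimodal sample, the selected number of mixture components is 2, so the node would be split.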
In this example, we perform a clustering of the univariate Canadian Temperature data extracted from the bivariate Canadian Weather data. We build the root node containing all the observations within the dataset. The FCUBT constructor is called.

root_node = Node(dailyTemp, is_root=True)
fcubt = FCUBT(root_node=root_node)

To grow the tree, we choose to consider a number of components that explain 95% of the variance of the remaining observations at each node of the tree. Moreover, the construction of the branch is stopped if there are fewer than 5 observations in a node.

fcubt.grow(n_components=0.95, min_size=5)

Figure 10 presents the results of the clustering. The plot function from the FCUBT class allows us to show the maximal tree once the data has been fitted (currently, only for univariate data objects). This representation is particularly useful for the understanding of the clustering results. One might also cut the tree at a given height, considering for example Figure 11.

Figure 10: Plot of the Canadian Temperature dataset. Each color represents a different cluster.

Figure 11: Grown tree $\mathcal{T}$ illustration for the Canadian Temperature dataset.

The joining step is performed using the join function. We choose to consider 95% of the explained variance of the observations to join two nodes.

fcubt.join(n_components=0.95)

The package is publicly available on Github (https://github.com/StevenGolovkine/FDApy) and the Python Package Index (https://pypi.org/project/FDApy/). A documentation, including examples, is available with the package. Some tests are also implemented using unittest, which is the unit testing framework provided with Python.

Acknowledgment
The author wishes to thank Groupe Renault and the ANRT (French National Association for Research and Technology) for their financial support via the CIFRE convention no. 2017/1116.
References

[1] C. Bouveyron. funFEM: Clustering in the Discriminative Functional Subspace, 2015. R package version 1.1.
[2] C. Bouveyron, J. Jacques, and A. Schmutz. funLBM: Model-Based Co-Clustering of Functional Data, 2020. R package version 2.1.
[3] S. Brockhaus, D. Rügamer, and S. Greven. Boosting functional regression models with FDboost. Journal of Statistical Software, 94(10):1–50, 2020.
[4] C. R. Carreno, hzzhyj, Pablo, D. G. Fernandez, P. Marcos, pedrorponga, M. C. Berrocal, amandaher, D. Tucker, and S. Johnsen. Gaa-uam/scikit-fda: Version 0.5, Dec. 2020.
[5] J. Fan and I. Gijbels. Local Polynomial Modelling and Its Applications. Number 66 in Monographs on Statistics and Applied Probability. Chapman & Hall, London, 1996.
[6] J. Goldsmith, S. Greven, and C. Crainiceanu. Corrected Confidence Bands for Functional Data Using Principal Components. Biometrics, 69(1):41–51, 2013.
[7] refund: Regression with Functional Data, 2020. R package version 0.1-22.
[8] S. Golovkine, N. Klutchnikoff, and V. Patilea. Clustering multivariate functional data using unsupervised binary trees. arXiv:2012.05973 [cs, stat], Dec. 2020.
[9] S. Golovkine, N. Klutchnikoff, and V. Patilea. Learning the smoothness of noisy curves with application to online curve estimation. arXiv:2009.03652 [math, stat], Sept. 2020.
[10] C. Happ and S. Greven. Multivariate Functional Principal Component Analysis for Data Observed on Different (Dimensional) Domains. Journal of the American Statistical Association, 113(522):649–659, Apr. 2018.
[11] C. Happ-Kurz. Object-Oriented Software for Functional Data. Journal of Statistical Software, 93(1):1–38, Apr. 2020.
[12] C. Happ-Kurz. MFPCA: Multivariate Functional Principal Component Analysis for Data Observed on Different Dimensional Domains, 2020. R package version 1.3-6.
[13] M. Löning, A. Bagnall, S. Ganesh, V. Kazakov, J. Lines, and F. J. Király. sktime: A Unified Interface for Machine Learning with Time Series. arXiv:1909.07872 [cs, stat], Sept. 2019.
[14] A. Parodi, M. Patriarca, L. Sangalli, P. Secchi, S. Vantini, and V. Vitelli. fdakma: Functional Data Analysis: K-Mean Alignment, 2015. R package version 1.2.1.
[15] J. Ramsay and B. W. Silverman. Functional Data Analysis. Springer Series in Statistics. Springer-Verlag, New York, 2nd edition, 2005.
[16] J. O. Ramsay, G. Hooker, and S. Graves. Functional Data Analysis with R and MATLAB. Use R! Springer-Verlag, New York, 2009.
[17] J. O. Ramsay, S. Graves, and G. Hooker. fda: Functional Data Analysis, 2020. R package version 5.1.5.1.
[18] A. Schmutz, J. Jacques, and C. Bouveyron. funHDDC: Univariate and Multivariate Model-Based Clustering in Group-Specific Functional Subspaces, 2019. R package version 2.3.0.
[19] R. Tavenard, J. Faouzi, G. Vandewiele, F. Divo, G. Androz, C. Holtz, M. Payne, R. Yurchak, M. Rußwurm, K. Kolar, and E. Woods. Tslearn, A Machine Learning Toolkit for Time Series Data. Journal of Machine Learning Research, 21(118):1–6, 2020.
[20] J. D. Tucker. fdasrvf: Elastic Functional Data Analysis, 2020. R package version 1.9.4.
[21] F. Yao, H.-G. Müller, and J.-L. Wang. Functional Data Analysis for Sparse Longitudinal Data. Journal of the American Statistical Association, 100(470):577–590, 2005.