FDApy: a Python package for functional data
Steven Golovkine ∗ January 28, 2021
Abstract
We introduce the Python package FDApy, an implementation of functional data. This package provides modules for the analysis of such data. It includes classes for functional data of different dimensions as well as for irregularly sampled functional data. A simulation toolbox is also provided. It may be used to simulate different clusters of functional data. Some methodologies to handle these data are implemented, such as dimension reduction and clustering. New methods can be easily added. The package is publicly available on the Python Package Index and Github.
With a large number of applications ranging from sports to the automotive industry and healthcare, more and more phenomena produce observation entities in the form of a sequence of possibly vector-valued measurements recorded intermittently at several discrete points in time. Functional data analysis (FDA) considers such data as being values of the realizations of a stochastic process, recorded with some error, at discrete random times. The purpose of FDA is to study such trajectories, also called curves or functions. The concept of functional data can be linked to the study of time series data, as dense, usually regular samples of potentially non-smooth functions, or longitudinal data, as sparse and irregular samples of smooth functions, or even image data, which can be represented as functions on two-dimensional domains.

In order to apply FDA to a real dataset, there is a need for appropriate software with up-to-date methodological implementations and easy addition of new theoretical developments. Currently, the most widely known software for FDA is the R package fda [17], based on work cited in [15, 16]. Usually, R packages for FDA are specific to one method. For example, one may cite FDboost [3] and refund [7] for regression and classification, funFEM [1], funHDDC [18] and funLBM [2] for clustering, or fdasrvf [20] and fdakma [14] for functional data registration, etc. Most of these packages are built upon fda. However, in most packages, the functional data are restricted to univariate ones that are well described by their coefficients in a given basis of functions. The funData package [11] has been recently released. It aims to provide a unified framework to handle univariate and multivariate functional data defined on different dimensional domains. Sparse functional data are also considered.

∗Groupe Renault & CREST - UMR 9194, Rennes, France, [email protected]
The MFPCA [12] package is currently the only one built on top of the funData package. It implements multivariate functional principal components analysis (MFPCA) for data defined on different dimensional domains [10].

Concerning the Python community, there are only a few packages related to FDA. One may cite sktime [13] and tslearn [19], which provide tools for the analysis of time series as a scikit-learn compatible API. Thus, they implement specific time series methods such as DTW-based ones or shapelet learning. The only one that develops specific methods for FDA is scikit-fda [4]. In particular, it implements diverse registration techniques as well as statistical data depths for functional data. However, most of the methods are for one-dimensional data and they only accept multivariate functional data defined on the same unidimensional domain.

The FDApy package implements methods to handle functional data in Python based on an object-oriented approach, in the spirit of funData. In particular, it provides classes to manipulate dense, irregularly sampled and multivariate functional data defined on one- or higher-dimensional domains. A large simulation toolbox, based on basis decomposition, is provided. It allows parameters to be configured in order to simulate different clusters within the data. An implementation of MFPCA for data defined on different domains, as described in [10], is provided. Moreover, the fCUBT algorithm [8], used to create a partition of the data, is also available. All methods are implemented using the defined classes. The package is publicly available on Github (https://github.com/StevenGolovkine/FDApy) and the Python Package Index (https://pypi.org/project/FDApy/).

In the general case, the data consist of independent trajectories of a vector-valued stochastic process $X = (X^{(1)}, \dots, X^{(P)})^\top$, $P \ge 1$. For each $1 \le p \le P$, let $\mathcal{T}_p \subset \mathbb{R}^{d_p}$ with $d_p \ge 1$, as for instance $\mathcal{T}_p = [0, 1]^{d_p}$. The realizations of each coordinate $X^{(p)}: \mathcal{T}_p \to \mathbb{R}$ are assumed to belong to $L^2(\mathcal{T}_p)$, the Hilbert space of square-integrable, real-valued functions defined on $\mathcal{T}_p$. Thus, $X$ is a stochastic process indexed by $t = (t_1, \dots, t_P)$ belonging to the $P$-fold Cartesian product $\mathcal{T} := \mathcal{T}_1 \times \cdots \times \mathcal{T}_P$ and taking values in the $P$-fold Cartesian product space $\mathcal{H} := L^2(\mathcal{T}_1) \times \cdots \times L^2(\mathcal{T}_P)$.

In practice, realizations of functional data are only obtained on a finite grid and possibly with noise. Let us consider $N$ curves $X_1, \dots, X_n, \dots, X_N$ generated as a random sample of the $P$-dimensional stochastic process $X$ with continuous trajectories. For each $1 \le n \le N$, and given a vector of positive integers $M_n = (M_n^{(1)}, \dots, M_n^{(P)})$, let $T_{n,m} = (T_{n,m_1}^{(1)}, \dots, T_{n,m_P}^{(P)})$, $1 \le m_p \le M_n^{(p)}$, $1 \le p \le P$, be the random observation times for the curve $X_n$. These times are obtained as independent copies of a variable $T$ taking values in $\mathcal{T}$. The vectors $M_1, \dots, M_N$ represent an independent sample of an integer-valued random vector $M$ with expectation $\mu_M$. We assume that the realizations of $X$, $M$ and $T$ are mutually independent. The observations associated with a curve, or trajectory, $X_n$ consist of the pairs $(Y_{n,m}, T_{n,m}) \in \mathbb{R}^P \times \mathcal{T}$, where $m = (m_1, \dots, m_P)$, $1 \le m_p \le M_n^{(p)}$, $1 \le p \le P$, and $Y_{n,m}$ is defined as

$$Y_{n,m} = X_n(T_{n,m}) + \varepsilon_{n,m}, \quad 1 \le n \le N, \qquad (1)$$

where the $\varepsilon_{n,m}$ are independent copies of a centered random vector $\varepsilon \in \mathbb{R}^P$ with finite variance. We use the notation $X_n(t)$ for the value at $t$ of the realization $X_n$ of $X$. Univariate functional data refers to the case where $P = 1$.

The remainder of the paper is organized as follows. In Section 2, we introduce the classes for an object-oriented implementation of functional data. Section 3 describes the data we used as examples.
In Section 4, we present the creation and manipulation of functional data objects. Sections 5, 6 and 7 then demonstrate some methods that the package implements: the estimation of components, multivariate functional principal components analysis and the fCUBT algorithm used to find a partition of the sampled data.

The representation of functional data is done using two classes that both extend an abstract class FunctionalData:

1. Class DenseFunctionalData represents dense functional data of arbitrary dimension (one for curves, two for images, etc.) on a common set of observation points $t_1, \dots, t_M$ for all observations. It may have missing values within the data.

2. Class IrregularFunctionalData represents irregularly sampled data of arbitrary dimension on different sets of observation points. The number and the location of the sampling points vary between observations. It must not have missing values within the data.

Finally, the implementation of the class MultivariateFunctionalData is different because it does not extend the class FunctionalData but the UserList one. Thus, an instance of MultivariateFunctionalData is defined as a list of $P$ elements from the DenseFunctionalData and/or IrregularFunctionalData classes that may be defined on different dimensional domains (e.g. curves and images). A diagram of the classes is given in Figure 1.
Remark 1
In practice, the difference between dense and irregularly sampled functional data can be tricky. By design, dense functional data are assumed to be sampled on the complete grid $T = \{t_1, \dots, t_M\}$, and measurement errors may exist. Taking data from sensors as an example, observations are recorded at a given sampling rate and are timestamped, but some anomalies may happen during the recording process. For irregularly sampled functional data, on the other hand, we assume that the curves are observed at different sampling points, with potentially different numbers of points. This is usually the case in medical studies such as growth curve analysis, because one cannot expect the individuals to be measured at the exact same times.
Figure 1: Representation of the main classes.

The DenseFunctionalData and IrregularFunctionalData classes represent the data in a similar way: the instance variable argvals contains the sampling points and the instance variable values represents the data. In the case of dense functional data, argvals is a dictionary where each entry contains a numpy array that represents the common sampling points for a given dimension, while values is a numpy array containing the observations. In the case of one-dimensional data sampled on a grid with $M$ points, argvals contains only one entry as an array of shape $(M,)$ and values is an array of dimension $(N, M)$ where each row is an observation. For two-dimensional observations with $M^{(1)} \times M^{(2)}$ sampling points, argvals contains two entries, the first being an array of shape $(M^{(1)},)$ and the second an array of shape $(M^{(2)},)$, and values is an array of dimension $(N, M^{(1)}, M^{(2)})$ where the first coordinate gives the observation. Higher-dimensional data are represented by adding an entry in the argvals dictionary and a dimension in the values array.

For irregularly sampled functional data, both argvals and values are dictionaries. The entries of argvals are dictionaries where each entry consists of the sampling points for a particular observation. In a similar way, each entry of the values dictionary represents an observation. For one-dimensional irregularly sampled functional data, argvals contains one entry which is a dictionary of size $N$ containing the sampling points as arrays of shape $(M_n,)$, $1 \le n \le N$, and values is a dictionary with $N$ entries containing the observations as arrays of shape $(M_n,)$, $1 \le n \le N$. For higher dimensions, each entry of the argvals dictionary represents a dimension of the process and contains another dictionary with $N$ entries for the sampling points. Likewise, the values dictionary has $N$ entries and every one of them is an array of shape $(M_n^{(1)}, M_n^{(2)}, \dots)$, $1 \le n \le N$.

Finally, the MultivariateFunctionalData class inherits from the UserList class, and thus gathers $P$ instances of DenseFunctionalData and/or IrregularFunctionalData as a list. As a result, this class has access to all the methods applicable to lists, such as append, extend, pop, etc. Given a specific dataset, instances of the different classes are called DenseFunctionalData, IrregularFunctionalData or MultivariateFunctionalData objects. In the following, the generic term functional data object will refer to instances of all three classes.
Data used in the examples
We will consider two datasets in the code examples. The first one is the Canadian weather data, which is presented in the textbook by Ramsay and Silverman [15] and available in their R package fda [17]. The second dataset is the CD4 cell count dataset, used in [6], and available in the R package refund [7]. As both examples are one-dimensional data, higher-dimensional datasets, in particular image ones, will be simulated using the simulation toolbox provided in the package.

The Canadian weather dataset contains daily recordings of the temperature (in degrees Celsius) and the precipitation (in millimeters) for $N = 35$ Canadian cities spread across the country and averaged over the years 1960 to 1994. The daily temperature data will be used as an example of DenseFunctionalData defined on a one-dimensional domain. We will add the daily precipitation records to the temperature ones in order to create a MultivariateFunctionalData object with elements defined on one-dimensional domains ($\mathcal{T}_1 = [1, 365]$ and $\mathcal{T}_2 = [1, 365]$).

The CD4 cell count dataset collects the number of CD4 cells per milliliter of blood of $N = 366$ participants. CD4 cells are a particular type of white blood cell and are key components of the immune system. HIV attacks the CD4 cells in the patient's blood, so the count of CD4 cells can be viewed as a measure of the disease progression. For this dataset, the number of CD4 cells is measured roughly twice a year and centered at the time of seroconversion, which is the time at which HIV becomes detectable. For every individual, the number of measurements varies between 1 and 11 over a period of 18 months before and 42 months after seroconversion. The sampling points differ between observations. We will use this dataset as an example of IrregularFunctionalData.

With the help of the two example datasets, this section will present how to create and manipulate a functional data object. In particular, we review the different instance variables used to extract information from the data. We also present methods to modify and plot functional data objects. General methods, such as the computation of the mean or covariance, for MultivariateFunctionalData objects usually call the corresponding methods for each univariate element and concatenate the results appropriately. For all the code examples, we assume that the correct functions from the FDApy package are loaded, as well as the packages numpy and pandas, using the following code snippet:

```python
import numpy as np
import pandas as pd
```

Creation of objects
Assuming the Canadian temperature data is stored in a temperature.csv file and the Canadian precipitation data in a precipitation.csv file, the following code loads the data into pandas dataframes and creates DenseFunctionalData instances from them. We explicitly name the dimension of the observations.

```python
temperature = pd.read_csv('temperature.csv', index_col=0)

argvals = pd.factorize(temperature.columns)[0]
values = np.array(temperature)
dailyTemp = DenseFunctionalData({'input_dim_0': argvals},
                                values)

precipitation = pd.read_csv('precipitation.csv', index_col=0)

argvals = pd.factorize(precipitation.columns)[0]
values = np.array(precipitation)
dailyPrec = DenseFunctionalData({'input_dim_0': argvals},
                                values)
```

Given multiple functional data objects, the creation of MultivariateFunctionalData instances is done by passing a list of objects to the constructor method.

```python
canadWeather = MultivariateFunctionalData([dailyTemp,
                                           dailyPrec])
```

The construction of an IrregularFunctionalData instance is similar, except that the dictionaries for argvals and values must contain an entry for each observation of the data. We consider that the CD4 cell count data are stored in a cd4.csv file containing a matrix representing the CD4 counts for each patient on the common grid of all sampling points, with missing values coded as NA. Thus, the following code extracts only the non-missing values for each patient and constructs an instance of IrregularFunctionalData.

```python
cd4 = pd.read_csv('cd4.csv', index_col=0)

all_argvals = cd4.columns.astype(np.int64)
argvals = {idx: np.array(all_argvals[~np.isnan(row)])
           for idx, row in enumerate(cd4.values)}
values = {idx: row[~np.isnan(row)]
          for idx, row in enumerate(cd4.values)}
cd4_counts = IrregularFunctionalData({'input_dim_0': argvals},
                                     values)
```

Remark 2
Two loaders are included within the package: read_csv and read_ts. These methods can be used to load already well-formatted data from csv or ts files. The ts files are, in particular, used in the UEA & UCR Time Series Classification Repository. These functions are wrappers of the above code snippets. Nonetheless, multivariate functional data cannot be imported in this way.

Basic information about the functional data object is printed on the standard output when the object is called in the command line. For example, for the temperature dataset, the output will be:

```python
dailyTemp
```
Univariate functional data object with 35 observations on a 1-dimensional support.
The outputs are similar for instances of the other types of functional data objects.

For dense and irregular functional objects, a subset of the data can be extracted using the convenient way to subset objects provided in Python. For example, in order to get the observations from 5 to 12 from the temperature data, we may write:

```python
dailyTemp[5:13]
```

Univariate functional data object with 8 observations on a 1-dimensional support.

Note that this will not work with MultivariateFunctionalData instances. This subsetting method will instead return the univariate functional data in the list. However, an iterator through the observations of the multivariate functional data is provided as the get_obs method.

In regards to Remark 1, we implement functions to convert DenseFunctionalData instances into IrregularFunctionalData instances and to do the reverse operation. The missing values are coded with np.nan. The code is thus written:

```python
dailyTemp.as_irregular()
```

Access to the instance variables

The functional data classes come with multiple instance variables. In Python, they can usually be accessed using instance_name.variable_name. We will present some of them in the following. Note that some variables cannot be accessed directly for multivariate functional data and have to be retrieved by looping through its univariate elements. Of course, argvals and values are accessible using dailyTemp.argvals and dailyTemp.values and show what the user gave to the object constructor. Furthermore, we provide a variable argvals_stand with the same shape as argvals but with normalized sampling points. The instance variables are the following:

• n_obs – number of observations in the object.

• n_points – number of sampling points in the object for each dimension as a dictionary. For multivariate functional data objects, it is a list of $P$ entries. In the case of IrregularFunctionalData, the returned number is the mean number of sampling points per observation.

• n_dim – input dimension of the functional data (one for curves, two for images, etc.). For MultivariateFunctionalData objects, it is expressed as a list.

• range_obs – minimum and maximum values of the observations as a tuple.

• range_points – minimum and maximum values of the sampling points as a tuple. The calculation is based on the variable argvals.

Basic plotting methods for functional data objects are provided in the package. They are built upon the matplotlib package. We assume the package is loaded with

```python
import matplotlib.pyplot as plt
```

The plot method returns an instance of Axes from the matplotlib library. Thus, all the plotting options relative to ticks, frames and so on are modifiable using this instance of Axes. Customization of the graph parameters, such as colors, linetypes or linewidths, can be made by passing the arguments as inputs to the function. The following snippet is used to plot the temperature curves for all the Canadian weather stations (represented as a DenseFunctionalData object):

```python
_ = plot(dailyTemp)
plt.xlabel('Days')
plt.ylabel('Temperature')
plt.title('Daily Temperature Data')
plt.show()
```
Figure 2: Results of the plot method for functional data objects: daily temperature data (left) and CD4 counts for individuals 5 to 14 on the log-scale (right).
Figure 3: Links between classes in the simulation toolbox.

A plot of the CD4 cell counts for 10 patients on the log-scale (represented as an IrregularFunctionalData object) is given by

```python
_ = plot(cd4_counts[5:15])
plt.xlabel('Month since seroconversion')
plt.ylabel('CD4 cell counts (log-scale)')
plt.title('CD4 counts for individual 5-14')
plt.show()
```

The plots are shown in Figure 2.
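Because the plotting method hands back a matplotlib Axes, the usual matplotlib customization applies. The sketch below uses a stand-in plot function (a hypothetical replacement for the package's own, which takes a functional data object) purely to show the Axes-based workflow:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so the example runs headless
import matplotlib.pyplot as plt
import numpy as np

def plot(values, argvals):
    """Stand-in for the package's plot function: one line per observation."""
    _, ax = plt.subplots()
    for row in values:
        ax.plot(argvals, row)
    return ax

argvals = np.linspace(1, 365, 365)
values = np.sin(2 * np.pi * argvals / 365)[None, :] * np.array([[10.0], [20.0]])

ax = plot(values, argvals)
ax.set_xlabel('Days')                    # further customization through the Axes
ax.set_ylabel('Temperature')
ax.set_title('Daily Temperature Data')
```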
Simulation functions are implemented in order to test new methodological developments. The data can be simulated using a truncated version of the Karhunen-Loève representation (class KarhunenLoeve) as well as diverse Brownian motions (class Brownian), both of which inherit from the Simulation class (see Figure 3). An element of the Simulation class has two principal instance variables: basis, which contains the basis used, and data, which contains the simulated observations (after running the function new()).

For Brownian motions, three types are implemented: standard, fractional and geometric. For example, we can simulate $N = 10$ realizations of a fractional Brownian motion on a one-dimensional observation grid with a given Hurst parameter.

Figure 4: Example of simulated data. (a) Brownian motion. (b) Karhunen-Loève expansion.

$X$ has a Karhunen-Loève decomposition. Each of its realizations can be represented using this decomposition, truncated at $J$ coefficients:

$$X_n(t) = \mu(t) + \sum_{j=1}^{J} \xi_{j,n} \phi_j(t), \quad t \in \mathcal{T}, \quad n = 1, \dots, N,$$

with a common mean function $\mu$ and an orthonormal basis of functions $\{\phi_j\}_{j=1,\dots,J}$. The coefficients $\xi_{j,n}$ are realizations of Gaussian random variables $\xi_j$ such that $E(\xi_j) = 0$ and $\mathrm{Var}(\xi_j) = \lambda_j$, with eigenvalues $\lambda_j \ge 0$. For example, we simulate $N = 10$ curves on $\mathcal{T} = [0, 1]$ using a B-splines basis on $\mathcal{T}$ and eigenvalues with exponential decrease:

```python
kl = KarhunenLoeve(name='bsplines', n_functions=5)
kl.new(n_obs=10, argvals=np.linspace(0, 1, 101),
       cluster_std='exponential')
```

Figure 4 presents a plot of the simulated Brownian motions and those from the Karhunen-Loève decomposition. The simulation of two-dimensional data is based on the tensor product of basis functions. Simulation for higher-dimensional data is not implemented.

We also added methods to generate noisy observations as well as sparse data. Note that these functions are only implemented on instances of Simulation. The add_noise function adds pointwise noise to the observations. Both homoscedastic and heteroscedastic noise are implemented. If a single scalar is given as a parameter to the function, homoscedastic noise will be simulated.
For the heteroscedastic case, lambda functions and vectors of size n_points can be supplied by the user. The noisy data are stored in the instance variable noisy_data. For example, to add random noise with variance $\sigma^2 = 0.05$, we run

```python
kl.add_noise(var_noise=0.05)
```

and, for heteroscedastic noise with variance defined by $x \mapsto \sqrt{1 + |x|}$,

```python
kl.add_noise(var_noise=lambda x: np.sqrt(1 + np.abs(x)))
```

The function sparsify randomly removes sampling points from the observations. Precisely, we randomly generate the number of sampling points to retain for each observation and then randomly select the sampling points to remove from each observation. The sparse data are stored in the instance variable sparse_data. For example, to randomly remove 50% of the sampling points (more or less 5%) from the simulated Brownian data, we run

```python
brownian.sparsify(percentage=0.5, epsilon=0.05)
```

Figure 5 presents a plot of the noisy and sparse versions of the Karhunen-Loève simulated data.

Figure 5: Results of the add_noise and sparsify functions on the Karhunen-Loève simulated data. (a) Noisy data. (b) Sparse data.

Let $K$ be a positive integer, and let $Z$ be a discrete random variable taking values in the range $\{1, \dots, K\}$ such that $P(Z = k) = p_k$ with $p_k > 0$ and $\sum_{k=1}^{K} p_k = 1$. The variable $Z$ represents the cluster membership of the realizations of the process. We consider that the stochastic process follows a functional mixture model with $K$ components, that is, it allows for the following decomposition:

$$X(t) = \sum_{k=1}^{K} \mu_k(t) \mathbb{1}_{\{Z = k\}} + \sum_{j \ge 1} \xi_j \phi_j(t), \quad t \in \mathcal{T},$$

where

• $\mu_1, \dots, \mu_K$ are the mean curves per cluster;

• $\{\phi_j\}_{j \ge 1}$ is an orthonormal basis of functions;

• $\xi_j$, $j \ge 1$, are Gaussian random variables whose distribution depends on $Z$: for each $1 \le k \le K$, $\xi_j \mid Z = k \sim \mathcal{N}(0, \sigma_{kj}^2)$.

For example, we can generate $N = 10$ realizations of two clusters using 3 eigenfunctions with given coefficients:

```python
N = 10
n_features = 3
n_clusters = 2
centers = np.array([[2, -1], ...])  # coefficients of the cluster means
```

Figure 6: Simulation of data with two clusters. Each color represents a cluster.
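The mixture model above can also be simulated directly with numpy. This is a sketch under arbitrary choices of cluster means, basis functions and variances (none of them taken from the package's KarhunenLoeve class):

```python
import numpy as np

rng = np.random.default_rng(0)

N, K, J = 10, 2, 3
t = np.linspace(0, 1, 101)

# Cluster means mu_1, mu_2 and an orthonormal sine basis phi_j on [0, 1].
means = np.vstack([2 * t, -2 * t])
basis = np.vstack([np.sqrt(2) * np.sin(2 * np.pi * (j + 1) * t)
                   for j in range(J)])

p = np.array([0.5, 0.5])               # mixing proportions, sum to 1
Z = rng.choice(K, size=N, p=p)         # cluster memberships Z_n
sigma = np.array([[1.0, 0.5, 0.25],    # sigma_kj: std of xi_j given Z = k
                  [0.5, 0.25, 0.125]])

# X_n(t) = mu_{Z_n}(t) + sum_j xi_{j,n} phi_j(t), with xi_j | Z=k ~ N(0, sigma_kj^2)
xi = rng.normal(0.0, sigma[Z])         # shape (N, J)
X = means[Z] + xi @ basis              # shape (N, len(t))
```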
Considering the model defined in (1), we assume that $P = 1$ and, for the sake of readability, we omit the superscript. The objective is to estimate the function $X_n(\cdot)$ using the available sample points. Thus, we consider local polynomial smoothers [5]. This type of estimator crucially depends on a tuning parameter, the bandwidth.

Let $d \ge 0$ be the degree of the polynomial and $t \in \mathcal{T}$ the evaluation point for the estimation of $X_n$. For any $u \in \mathbb{R}$, we consider the vector $U(u) = (1, u, \dots, u^d/d!)^\top$ and note $U_h(\cdot) = U(\cdot/h)$. Let $K: \mathbb{R} \to \mathbb{R}$ be a positive kernel and define $K_h(\cdot) = h^{-1} K(\cdot/h)$. Moreover, we define

$$\vartheta_{M_n,h} := \arg\min_{\vartheta \in \mathbb{R}^{d+1}} \sum_{m=1}^{M_n} \left\{ Y_{n,m} - \vartheta^\top U_h(T_{n,m} - t) \right\}^2 K_h(T_{n,m} - t),$$

where $h$ is the bandwidth. The vector $\vartheta_{M_n,h}$ satisfies the normal equations $A \vartheta_{M_n,h} = a$ with

$$A = A_{M_n,h} = \frac{1}{M_n} \sum_{m=1}^{M_n} U_h(T_{n,m} - t)\, U_h^\top(T_{n,m} - t)\, K_h(T_{n,m} - t),$$

$$a = a_{M_n,h} = \frac{1}{M_n} \sum_{m=1}^{M_n} Y_{n,m}\, U_h(T_{n,m} - t)\, K_h(T_{n,m} - t).$$

Let $\lambda$ be the smallest eigenvalue of the matrix $A$ and note that, whenever $\lambda > 0$, we have $\vartheta_{M_n,h} = A^{-1} a$. With at hand an estimation $\widehat{h}$ of the bandwidth, the local polynomial estimator $\widehat{X}_n(t)$ of order $d$ is given by

$$\widehat{X}_n(t) = U^\top(0)\, \widehat{\vartheta}, \quad \text{where } \widehat{\vartheta} = \vartheta_{M_n,\widehat{h}}.$$

If $d = 0$, we are in the particular case of the Nadaraya-Watson estimator. The Gaussian, Epanechnikov, tri-cube and bi-square kernels are implemented and others can
Figure 7: (a) Curve and (b) smoothed estimation for the Canadian Temperature data.

be added in a modular way. We propose an estimate of the bandwidth $h$ that is based on the regularity of the underlying function [9]. For example, we can smooth the daily temperature curves using a local polynomial smoother with such an estimated bandwidth; the result for one curve is shown in Figure 7.

In this section, we develop estimators for the mean and the covariance functions of a component $X^{(p)}$, $1 \le p \le P$, of the process $X$. These estimators might be used to compute estimators of the eigenvalues and eigenfunctions of $X^{(p)}$ for the Karhunen-Loève expansion.

Let $\widehat{X}^{(p)}_n$ be a suitable nonparametric estimator of the curve $X^{(p)}_n$ built with the $M^{(p)}_n$ pairs $(Y^{(p)}_{n,m_p}, T^{(p)}_{n,m_p})$, $n = 1, \dots, N$, as for instance the local polynomial estimator presented in the previous subsection. With at hand the $\widehat{X}^{(p)}_n$'s tuned for the mean function estimation, we define

$$\widehat{\mu}^{(p)}_N(t_p) = \frac{1}{N} \sum_{n=1}^{N} \widehat{X}^{(p)}_n(t_p), \quad t_p \in \mathcal{T}_p.$$

For example, the code snippet for the estimation of the mean curve of the daily temperature curves, using a local linear smoother with bandwidth equal to 0.05, is

```python
mean_temp = dailyTemp.mean(smooth='LocalLinear',
                           bandwidth=0.05)
```

Figure 8: (a) Mean and (b) covariance estimation for the Canadian Temperature data.

For the covariance function, following [21], we distinguish the diagonal from the non-diagonal points. With at hand the $\widehat{X}^{(p)}_n$'s tuned for the covariance function estimation,

$$\widehat{C}_{p,p}(s_p, t_p) = \frac{1}{N} \sum_{n=1}^{N} \widehat{X}^{(p)}_n(s_p)\, \widehat{X}^{(p)}_n(t_p) - \widehat{\mu}^{(p)}_N(s_p)\, \widehat{\mu}^{(p)}_N(t_p), \quad s_p, t_p \in \mathcal{T}_p,\ s_p \ne t_p.$$

The diagonal of the covariance is then estimated using two-dimensional kernel smoothing with the $\widehat{C}_{p,p}(s_p, t_p)$, $s_p \ne t_p$, as input data. See [21] for the details.

```python
cov_temp = dailyTemp.covariance(smooth='GAM')
```

The
FDApy package implements MFPCA for data defined on potentially different domains, developed by Happ and Greven [10]. The implementation of the method is built upon the functional data classes defined in the package. After giving a short review of the methodology in Section 6.1, we explain how to effectively use it in Section 6.2. For theoretical details, please refer to [10].

Following Happ and Greven [10], the multivariate components for $X$ are computed by plugging in the univariate components computed from each component $X^{(p)}$. These estimations are done as follows.

1. Perform a univariate fPCA on each of the components of $X$ separately. For a component $X^{(p)}$, the eigenfunctions and eigenvalues are computed as a matrix analysis of the estimated covariance $\widehat{C}_{p,p}$. This results in a set of eigenfunctions $(\widehat{\rho}^{(p)}_1, \dots, \widehat{\rho}^{(p)}_{J^{(p)}})$ associated with a set of eigenvalues $(\widehat{\lambda}^{(p)}_1, \dots, \widehat{\lambda}^{(p)}_{J^{(p)}})$ for a given truncation integer $J^{(p)}$. The univariate scores for a realization $X^{(p)}_n$ of $X^{(p)}$ are then given by $\widehat{c}^{(p)}_{j,n} = \langle \widehat{X}^{(p)}_n, \widehat{\rho}^{(p)}_j \rangle$, $1 \le j \le J^{(p)}$.

2. Define the matrix $\mathcal{Z} \in \mathbb{R}^{N \times J_+}$, $J_+ = \sum_{p=1}^{P} J^{(p)}$, where each row stacks the scores for each component for a unique observation, $(\widehat{c}^{(1)}_{1,n}, \dots, \widehat{c}^{(1)}_{J^{(1)},n}, \dots, \widehat{c}^{(P)}_{1,n}, \dots, \widehat{c}^{(P)}_{J^{(P)},n})$. Define $\mathbf{Z} \in \mathbb{R}^{J_+ \times J_+}$ such that $\mathbf{Z} = (N - 1)^{-1} \mathcal{Z}^\top \mathcal{Z}$.

3. An eigenanalysis of the matrix $\mathbf{Z}$ is performed and leads to the eigenvectors $\widehat{v}_j$ and eigenvalues $\widehat{\lambda}_j$.

4. Finally, the multivariate eigenfunctions are estimated with

$$\widehat{\varphi}^{(p)}_j(t_p) = \sum_{j'=1}^{J^{(p)}} [\widehat{v}_j]^{(p)}_{j'}\, \widehat{\rho}^{(p)}_{j'}(t_p), \quad t_p \in \mathcal{T}_p,\ 1 \le j \le J_+,\ 1 \le p \le P,$$

and the multivariate scores with $\widehat{c}_{j,n} = \mathcal{Z}_{n,\cdot}\, \widehat{v}_j$, $1 \le n \le N$, $1 \le j \le J_+$.

The multivariate Karhunen-Loève expansion of the process $X$ is thus

$$\widehat{X}_n(t) = \widehat{\mu}_N(t) + \sum_{j=1}^{J} \widehat{c}_{j,n}\, \widehat{\varphi}_j(t), \quad t \in \mathcal{T},$$

where $\widehat{\mu}_N(\cdot) = (\widehat{\mu}^{(1)}_N(\cdot), \dots, \widehat{\mu}^{(P)}_N(\cdot))$ is the vector of the estimated mean functions.
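Steps 2 and 3 reduce to simple matrix operations. A plain-numpy sketch with randomly generated univariate scores as placeholders (in practice these would come from step 1):

```python
import numpy as np

rng = np.random.default_rng(1)

N = 35
J1, J2 = 4, 3                                 # truncation levels J^(1), J^(2); J_+ = 7

# Univariate scores from step 1 (random placeholders here).
scores_1 = rng.normal(size=(N, J1))
scores_2 = rng.normal(size=(N, J2))

# Step 2: stack the scores row-wise and form Z = (N - 1)^{-1} Z^T Z.
Zmat = np.hstack([scores_1, scores_2])        # shape (N, J_+)
Zcov = Zmat.T @ Zmat / (N - 1)                # shape (J_+, J_+)

# Step 3: eigenanalysis (Zcov is symmetric), eigenvalues sorted decreasingly.
eigvals, eigvecs = np.linalg.eigh(Zcov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4 (scores part): multivariate scores c_{j,n} = Z_{n,.} v_j.
mult_scores = Zmat @ eigvecs                  # shape (N, J_+)
```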
The implementation of MFPCA is based on the MFPCA class. Hence, we construct an object of class MFPCA, specifying the number of eigencomponents that we want. The computation of the eigenelements is performed using the fit method, and the scores are then calculated using the transform method. Given scores, the inverse transformation to the functional space is done using the inverse_transform method. The triptych fit, transform and fit_transform follows the implementation choices of sklearn.

Figure 9: Results of the MFPCA for the Canadian Weather data. The first row represents the mean functions of the temperature (left) and precipitation (right) data. The second row corresponds to the bivariate eigenfunctions found for 99% of explained variance.

In this example, we perform a MFPCA of the bivariate Canadian Weather data. We expand each univariate element using a univariate FPCA with a number of components that explain 99% of the variance within the data. The number of components is specified in a list in the
MFPCA constructor. The method parameter in the fit method indicateshow the univariate scores are computed. Here, we use numerical integration to derivethem.1 f p c a = MFPCA( n c om p o ne n t s = [ 0 . 9 9 , 0 . 9 9 ] )2 f p c a . f i t ( canadWeather , method= ’ NumInt ’ )The scores are computed using the transform function:1 s c o r e s = f p c a . t r a n s f o r m ( d a t a=canadWeather )The eigenvalues are stored as instance variables. We remark the rapid decrease of theeigenvalues. Hence, we only need a few eigencomponents to explain most of the variancewithin the data. 17 f p c a . e i g e n v a l u e s array([4.36e+01, 4.62e+00, 1.20e+00, 5.76e-01, 1.14e-01, 2.61e-02])
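The 99% rule used above can be made concrete with a small numpy sketch. The eigenvalues are copied from the output shown above; the selection logic mirrors the proportion-of-variance criterion but is not the package's internal code.

```python
import numpy as np

# Eigenvalues as printed by fpca.eigenvalues above.
eigenvalues = np.array([4.36e+01, 4.62e+00, 1.20e+00,
                        5.76e-01, 1.14e-01, 2.61e-02])

# Cumulative proportion of variance explained by the leading components.
explained = np.cumsum(eigenvalues) / np.sum(eigenvalues)

# Smallest number of components whose cumulative share reaches 99%.
n_comp = int(np.searchsorted(explained, 0.99) + 1)  # -> 4
```

The first eigenvalue alone accounts for roughly 87% of the total variance, which illustrates the rapid decrease noted above.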
MFPCA is based on the univariate basis expansion of each of the components of the process. Currently, only two basis expansions are implemented. New bases can easily be added to the package. All univariate basis expansions should implement the methods fit, used to compute the elements of the basis, transform, to compute the scores of the observations within the basis, and inverse_transform, to return the observations in the functional space given their scores. The implemented bases are:

• UFPCA – Univariate Functional Principal Components Analysis for data on one-dimensional domains. This basis was used in the Canadian Weather example. Multiple smoothing methods are implemented for the estimation of the mean and the covariance (see Section 5), such as local polynomial estimation or GAM with penalized B-splines. The scores are computed using numerical integration. Considering sparse functional data, one may also use the PACE algorithm [21]. The main argument to build an instance of the class UFPCA is n_comp, which can be the proportion of variance explained by the principal components, if n_comp < 1, or the number of principal components to compute, if n_comp ≥ 1.

• FCP-TPA – Functional Candecomp/Parafac Tensor Power Algorithm for data on two-dimensional domains. This algorithm is used to find a basis decomposition of image data. Consider $N$ realizations of a stochastic process $X$ defined on $S_x \times S_y$; the data can be represented as a tensor $\mathcal{X} \in \mathbb{R}^{N \times S_x \times S_y}$. A Candecomp/Parafac representation of the data is assumed:
$$\mathcal{X} = \sum_{j=1}^{J} \lambda_j \, u_j \otimes v_j \otimes w_j,$$
where $\lambda_j$ is a scalar, $u_j \in \mathbb{R}^N$, $v_j \in \mathbb{R}^{S_x}$ and $w_j \in \mathbb{R}^{S_y}$ are vectors, and $\otimes$ denotes the outer product. The outer product $v_j \otimes w_j$ can be interpreted as the $j$-th eigenimage evaluated on the same grid points as the original data. Moreover, the vector $\lambda_j u_j$ is the score vector gathering the projections of the observations onto the eigenimage $v_j \otimes w_j$. Our implementation is adapted from the function FCP_TPA of the R package MFPCA [12]. The main argument to build an instance of the class FCPTPA is n_comp, which is the number of principal components to compute.

7 fCUBT
The FDApy package implements fCUBT for the clustering of functional data objects defined on potentially different domains, developed by [8]. The implementation of the method is built upon the functional data classes defined in the package. After giving a short review of the methodology in Section 7.1, we explain how to effectively use it in Section 7.2. For a detailed description, please refer to [8].
Let $S$ be a sample of realizations of the process $X$. We consider the problem of learning a partition $\mathcal{U}$ such that every element $U$ of $\mathcal{U}$ gathers similar elements of $S$. The partition $\mathcal{U}$ is built as a tree $\mathcal{T}$ defined using a top-down procedure by recursive splitting. Each node of the tree $\mathcal{T}$ is denoted by $S_{d,j}$. At each stage, a node $(d, j)$ is possibly split into two subnodes in a four-step procedure:

1. A MFPCA, with n_comp components, is conducted on the elements of $S_{d,j}$. It results in a set of eigenvalues $\Lambda_{d,j}$ associated with a set of eigenfunctions $\Phi_{d,j}$.

2. The matrix of scores $C_{d,j}$ is then defined with the columns built with the projections of the elements of $S_{d,j}$ onto the elements of $\Phi_{d,j}$.

3. For each $K = 1, \dots, K_{\max}$, we fit a GMM to the columns of the matrix $C_{d,j}$. The resulting models are denoted as $\{\mathcal{M}_1, \dots, \mathcal{M}_{K_{\max}}\}$. Considering the BIC, we determine
$$\widehat{K}_{d,j} = \arg\max_{K = 1, \dots, K_{\max}} \mathrm{BIC}(\mathcal{M}_K).$$

4. If $\widehat{K}_{d,j} > 1$, we split $S_{d,j}$ using the model $\mathcal{M}_2$, which is a mixture of two Gaussian vectors. Otherwise, the node is considered to be a terminal node and the construction of the tree is stopped for this node.

The recursive procedure continues downwards until one of the following stopping rules is satisfied: there are fewer than minsize observations in the node, or the estimation $\widehat{K}_{d,j}$ of the number of clusters in the node is equal to 1. When the algorithm ends, a label is assigned to each leaf (terminal node). The resulting tree is referred to as the maximal binary tree.

In this step, the idea is to join terminal nodes which do not necessarily share the same direct ancestor:

1. Build the graph $G = (V, E)$ where
$$V = \{ S_{d,j},\ 0 \leq j < 2^d,\ 0 \leq d < D \mid S_{d,j} \text{ is a terminal node} \},$$
and
$$E = \left\{ (S_{d,j}, S_{d',j'}) \mid S_{d,j}, S_{d',j'} \in V,\ S_{d,j} \neq S_{d',j'} \text{ and } \widehat{K}_{(d,j) \cup (d',j')} = 1 \right\}.$$

2. Let $(S_{d,j}, S_{d',j'})$ be the edge with the maximum BIC value. Remove this edge and replace the associated vertices by $S_{d,j} \cup S_{d',j'}$.

3. Continue the procedure by applying the previous step with $\{V \setminus \{S_{d,j}, S_{d',j'}\}\} \cup \{S_{d,j} \cup S_{d',j'}\}$.

The procedure continues until the set $V$ is reduced to a unique element or the set $E$ is empty.

The implementation of fCUBT is based on the fCUBT class. Hence, we construct an object of the fCUBT class specifying the root node of the tree, which contains a sample of data. The growth of the tree is performed using the grow function, with the number of eigencomponents to keep at each node as parameter. Once the tree has grown, the joining step is made using the join function. The prediction of the class of a new observation is possible through the predict function (or predict_proba for the probabilities to belong to each class).
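The BIC-based choice of $\widehat{K}_{d,j}$ in step 3 of the splitting procedure can be illustrated with scikit-learn's GaussianMixture. This is a standalone sketch on simulated two-cluster scores, not the package's code; note that sklearn's bic method is defined so that lower is better, hence the argmin below plays the role of the arg max in the convention used above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Simulated score matrix for a node: two well-separated Gaussian clusters.
scores = np.vstack([
    rng.normal(-3.0, 1.0, size=(100, 2)),
    rng.normal(3.0, 1.0, size=(100, 2)),
])

# Fit one GMM per candidate K and keep the best model according to BIC
# (sklearn's bic is lower-is-better).
K_max = 5
models = [
    GaussianMixture(n_components=k, random_state=0).fit(scores)
    for k in range(1, K_max + 1)
]
K_hat = int(np.argmin([m.bic(scores) for m in models]) + 1)

# K_hat > 1 means the node would be split; K_hat == 1 makes it terminal.
```

On this clearly bimodal sample, the selected number of mixture components is 2, so the node would be split.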
In this example, we perform a clustering of the univariate Canadian Temperature data extracted from the bivariate Canadian Weather data. We build the root node containing all the observations within the dataset. The FCUBT constructor is called.

root_node = Node(dailyTemp, is_root=True)
fcubt = FCUBT(root_node=root_node)

To grow the tree, we choose to consider a number of components that explain 95% of the variance of the remaining observations at each node of the tree. Moreover, the construction of the branch is stopped if there are fewer than 5 observations in a node.

fcubt.grow(n_components=0.95, min_size=5)

Figure 10 presents the results of the clustering. The plot function from the FCUBT class allows us to show the maximal tree once the data has been fitted (currently, only for univariate data objects). This representation is particularly useful for the understanding of the clustering results. One might also cut the tree at a given height, considering for example Figure 11.

Figure 10: Plot of the Canadian Temperature dataset. Each color represents a different cluster.

Figure 11: Grown tree $\mathcal{T}$ illustration for the Canadian Temperature dataset.

The joining step is performed using the join function. We choose to consider 95% of the explained variance of the observations to join two nodes.

fcubt.join(n_components=0.95)

The package is publicly available on Github (https://github.com/StevenGolovkine/FDApy) and the Python Package Index (https://pypi.org/project/FDApy/). A documentation, including examples, is available with the package. Some tests are also implemented using unittest, which is the unit testing framework provided with Python.

Acknowledgment
The author wishes to thank Groupe Renault and the ANRT (French National Association for Research and Technology) for their financial support via the CIFRE convention no. 2017/1116.
References

[1] C. Bouveyron. funFEM: Clustering in the Discriminative Functional Subspace, 2015. R package version 1.1.
[2] C. Bouveyron, J. Jacques, and A. Schmutz. funLBM: Model-Based Co-Clustering of Functional Data, 2020. R package version 2.1.
[3] S. Brockhaus, D. Rügamer, and S. Greven. Boosting functional regression models with FDboost. Journal of Statistical Software, 94(10):1–50, 2020.
[4] C. R. Carreno, hzzhyj, Pablo, D. G. Fernandez, P. Marcos, pedrorponga, M. C. Berrocal, amandaher, D. Tucker, and S. Johnsen. Gaa-uam/scikit-fda: Version 0.5, Dec. 2020.
[5] J. Fan and I. Gijbels. Local Polynomial Modelling and Its Applications. Number 66 in Monographs on Statistics and Applied Probability. Chapman & Hall, London, 1996.
[6] J. Goldsmith, S. Greven, and C. Crainiceanu. Corrected Confidence Bands for Functional Data Using Principal Components. Biometrics, 69(1):41–51, 2013.
[7] refund: Regression with Functional Data, 2020. R package version 0.1-22.
[8] S. Golovkine, N. Klutchnikoff, and V. Patilea. Clustering multivariate functional data using unsupervised binary trees. arXiv:2012.05973 [cs, stat], Dec. 2020.
[9] S. Golovkine, N. Klutchnikoff, and V. Patilea. Learning the smoothness of noisy curves with application to online curve estimation. arXiv:2009.03652 [math, stat], Sept. 2020.
[10] C. Happ and S. Greven. Multivariate Functional Principal Component Analysis for Data Observed on Different (Dimensional) Domains. Journal of the American Statistical Association, 113(522):649–659, Apr. 2018.
[11] C. Happ-Kurz. Object-Oriented Software for Functional Data. Journal of Statistical Software, 93(1):1–38, Apr. 2020.
[12] C. Happ-Kurz. MFPCA: Multivariate Functional Principal Component Analysis for Data Observed on Different Dimensional Domains, 2020. R package version 1.3-6.
[13] M. Löning, A. Bagnall, S. Ganesh, V. Kazakov, J. Lines, and F. J. Király. sktime: A Unified Interface for Machine Learning with Time Series. arXiv:1909.07872 [cs, stat], Sept. 2019.
[14] A. Parodi, M. Patriarca, L. Sangalli, P. Secchi, S. Vantini, and V. Vitelli. fdakma: Functional Data Analysis: K-Mean Alignment, 2015. R package version 1.2.1.
[15] J. Ramsay and B. W. Silverman. Functional Data Analysis. Springer Series in Statistics. Springer-Verlag, New York, 2nd edition, 2005.
[16] J. O. Ramsay, G. Hooker, and S. Graves. Functional Data Analysis with R and MATLAB. Use R! Springer-Verlag, New York, 2009.
[17] J. O. Ramsay, S. Graves, and G. Hooker. fda: Functional Data Analysis, 2020. R package version 5.1.5.1.
[18] A. Schmutz, J. Jacques, and C. Bouveyron. funHDDC: Univariate and Multivariate Model-Based Clustering in Group-Specific Functional Subspaces, 2019. R package version 2.3.0.
[19] R. Tavenard, J. Faouzi, G. Vandewiele, F. Divo, G. Androz, C. Holtz, M. Payne, R. Yurchak, M. Rußwurm, K. Kolar, and E. Woods. Tslearn, A Machine Learning Toolkit for Time Series Data. Journal of Machine Learning Research, 21(118):1–6, 2020.
[20] J. D. Tucker. fdasrvf: Elastic Functional Data Analysis, 2020. R package version 1.9.4.
[21] F. Yao, H.-G. Müller, and J.-L. Wang. Functional Data Analysis for Sparse Longitudinal Data. Journal of the American Statistical Association, 100(470):577–590, 2005.