A Basis Approach to Surface Clustering
Adriano Z. Zambom · Qing Wang · Ronaldo Dias
Abstract
This paper presents a novel method for clustering surfaces. The proposal involves first using basis functions in a tensor product to smooth the data and thus reduce the dimension to a finite number of coefficients, and then using these estimated coefficients to cluster the surfaces via the k-means algorithm. An extension of the algorithm to clustering tensors is also discussed. We show that the proposed algorithm exhibits the property of strong consistency, with or without measurement errors, in correctly clustering the data as the sample size increases. Simulation studies suggest that the proposed method outperforms the benchmark k-means algorithm which uses the original vectorized data. In addition, an EEG real data example is considered to illustrate the practical application of the proposal.

Keywords
B-spline · k-means · surface clustering

1 Introduction

Clustering objects in an infinite dimensional space is a challenging task given the complex nature of the data. Although most data on a continuous domain are observed at a finite set or grid of points, the computational cost may be too high, or the direct application of a clustering procedure to the raw data may fail to capture the intrinsic stochasticity of the observations.
Adriano Z. Zambom
Department of Mathematics, California State University Northridge, USA
Tel.: (818) 677-2701
E-mail: [email protected]

Qing Wang
Department of Mathematics, Wellesley College, USA

Ronaldo Dias
Department of Statistics, State University of Campinas, Brazil

Examples of such data structures with infinite dimensions include curves, surfaces, and tensors, which are in reality usually observed with errors. The main goal of this paper is to develop a novel clustering procedure for data sets whose elements are surfaces, such as bivariate densities. The idea is to first find an approximation of each surface by estimating the matrix (or tensor) of coefficients of a model in a finite dimensional space, thereby lowering the complexity of the data, and then use these coefficients as the new data for a certain clustering method.

Since the seminal paper of Ward Jr. (1963) introducing hierarchical clustering and the papers of Hartigan (1975) and Hartigan and Wong (1979) discussing k-means clustering, developments and adaptations of these classical algorithms have been seen in a wide range of applications, such as in bioinformatics (Abu-Jamous et al. (2015), Duran et al. (2019)), clinical psychiatry (Jiménez-Murcia et al. (2019)), environmental policy (Hu et al. (2019)), market segmentation (Wedel and Kamakura (1999)), medicine (Udler et al. (2018)), text mining (Abualigah et al. (2018)), supply management (Blackhurst et al. (2018)), and many other areas. The overall goal of these algorithms is to find partitions of the data based on distance metrics between elements. For instance, (agglomerative) hierarchical clustering produces a sequence of nested partitions: it starts with n singleton clusters and then merges the closest clusters together step by step until a single cluster of n units is formed. The iterative k-means clustering algorithm starts with a set of k initial cluster centers given as input, based on a starting-point algorithm or previous knowledge. Each element of the data is then assigned a cluster membership in a way that the within-cluster sum of squares is minimized.

Clustering methods for curves, i.e. functional data clustering, have been explored by several researchers in the past few years. Ieva et al. (2013), for example, perform clustering for multivariate functional data with a modification of the k-means clustering procedure. In their motivating example, the goal is to find clusters of similar reconstructed and registered ECGs based on their functional forms and first derivatives. For multivariate functional data X(t) = (X_1(t), ..., X_p(t)) (p ∈ Z+), where t is in a compact subset of R (often representing time), Martino et al. (2019) generalize the Mahalanobis distance in Hilbert spaces to create a measure of the distance between functional curves and use it to build a k-means clustering algorithm. Their setting is different from our proposed method in that the time t in Martino et al. (2019) is the same for all components of the multivariate functional data, while in this paper we allow bivariate functions, such as X(t_1, t_2), for example. A non-exhaustive list of recent literature that has studied functional data clustering includes Abraham et al. (2003), Tokushige et al. (2007), Yamamoto and Terada (2014), Wang et al. (2014), García et al.
(2015), Febrero-Bande and de la Fuente (2012), Tarpey and Kinateder (2003), Floriello (2011), Yamamoto (2012), Ferraty and Vieu (2006), and Boullé (2012).

In this paper we are interested in the generalization of clustering methods, such as the k-means algorithm, to surfaces and tensors. We propose using basis functions in a tensor product as an approximation of the observed data, and then applying the estimated coefficients of the basis functions to cluster the surfaces (or tensors) with the k-means algorithm. Simulations show that our proposed method significantly improves the accuracy of clustering compared to the baseline k-means algorithm applied directly to the raw vectorized data.

The remainder of the paper is organized as follows. In Section 2 we describe the estimation procedure of the surfaces and the algorithm for clustering surfaces. Section 3 shows some asymptotic results on the strong consistency of the algorithm in correctly clustering the data as the sample size increases. A generalization of this method to tensor products of higher dimensions is discussed in Section 4. In Section 5 we present simulations that assess the finite sample performance of the proposed method in comparison with the benchmark k-means.

2 Surface estimation and the clustering algorithm

Let S_i := S_i(x, y), (x, y) ∈ Q, be the underlying data generating process of the i-th surface (1 ≤ i ≤ n), where Q is a compact subset of R². Since data are in general discretely recorded and frequently contaminated with measurement errors, denote

$$z_{ij} = S_i(x_{ij}, y_{ij}) + \epsilon_{ij}, \qquad (1 \le j \le m_i;\ 1 \le i \le n) \tag{1}$$

as the m_i observed values of the i-th surface at coordinates (x_ij, y_ij), where ε_ij is the measurement error, assumed to be i.i.d. with mean 0 and constant finite variance σ².

Among the several possible ways of representing functions and surfaces using basis functions, such as wavelets (Mallat (2008)), spline wavelets (Unser (1997)), logsplines (Kooperberg and Stone (1992)), Fourier series, and radial bases, in this paper we focus on B-splines (de Boor (1977)). The theoretical results we establish next are also valid for the aforementioned dimension reduction methods. Assume that the surface S_i(x, y) can be well approximated by the smooth B-spline tensor product, defined as

$$s_i(x, y, \Theta) = \sum_{r=1}^{R} \sum_{l=1}^{L} B_{x,r}(x)\, B_{y,l}(y)\, \theta_{irl},$$

where θ_irl are coefficients to be estimated, and B_{x,r}(·) and B_{y,l}(·) are B-spline basis functions that generate the spline spaces S_1 = span{B_{x,1}, ..., B_{x,R}} and S_2 = span{B_{y,1}, ..., B_{y,L}} respectively. B_{x,r}(·) and B_{y,l}(·) are polynomials of degree p and q respectively, so that S_1 and S_2 are spaces of piece-wise polynomials (de Boor (1971), de Boor (1972)). In this paper we assume R and L are fixed; however, there are several methods in the literature that describe automatic procedures to obtain these values, see for example Unser (1997) and Dias and Gamerman (2002). The linear space of functions formed by the product of these two spaces is denoted by S_1 ⊗ S_2. Because the Sobolev space $\mathcal{H} := \{ f : \int f^2 + \int (f')^2 + \int (f'')^2 < \infty \}$ can be well approximated by S_1 or S_2 in their respective domains (Karlin (1973), Reif (1997), Lachout et al. (2005), Lindemann and LaValle (2009)), the space of smooth surfaces in H ⊗ H can be well approximated by S_1 ⊗ S_2.
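To make the tensor-product representation concrete, the sketch below evaluates s_i(x, y, Θ) for a given coefficient matrix using SciPy's B-spline implementation; bspline_basis and surface_eval are our illustrative names, not part of any established package or of the authors' code.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(knots, degree, x):
    """Evaluate all B-spline basis functions generated by `knots` at the
    points x; returns an array of shape (len(x), n_basis)."""
    n_basis = len(knots) - degree - 1
    eye = np.eye(n_basis)
    # column r holds B_r(x): a spline whose coefficient vector is e_r
    return np.column_stack([BSpline(knots, eye[r], degree)(x)
                            for r in range(n_basis)])

def surface_eval(x, y, theta, knots_x, knots_y, degree=3):
    """Evaluate s(x, y, Theta) = sum_r sum_l B_{x,r}(x) B_{y,l}(y) theta_rl
    at the coordinate pairs (x_j, y_j)."""
    Bx = bspline_basis(knots_x, degree, x)   # (m, R)
    By = bspline_basis(knots_y, degree, y)   # (m, L)
    return np.einsum('jr,rl,jl->j', Bx, theta, By)
```

For a cubic fit on [a, b], a clamped knot vector such as np.r_[[a]*4, interior_knots, [b]*4] yields R = len(knots) - 4 basis functions per axis.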
The R × L matrix of coefficients Θ_i = {θ_irl}_{1≤r≤R; 1≤l≤L} for each surface i (i = 1, ..., n) observed with measurement error as specified in model (1) can be estimated by minimizing the least squares errors

$$\begin{aligned}
\mathrm{vec}(\widehat{\Theta}_i) &= \arg\min_{\Theta} \sum_{j=1}^{m_i} \big[ z_{ij} - s_i(x_{ij}, y_{ij}, \Theta) \big]^2 \\
&= \arg\min_{\Theta} \sum_{j=1}^{m_i} \big[ z_{ij} - B_{ix}(x_{ij})^T \Theta B_{iy}(y_{ij}) \big]^2 \\
&= \arg\min_{\Theta} \sum_{j=1}^{m_i} \big[ z_{ij} - (B_{iy}(y_{ij}) \otimes B_{ix}(x_{ij}))^T \mathrm{vec}(\Theta) \big]^2 \\
&= (M_i^T M_i)^{-1} M_i^T z_i,
\end{aligned}$$

where vec(Θ_i) is the vectorization of the matrix Θ_i arranged by columns, z_i = (z_{i1}, ..., z_{im_i})^T, B_{ix}(x_{ij}) = (B_{i1}(x_{ij}), ..., B_{iR}(x_{ij}))^T, B_{iy}(y_{ij}) = (B_{i1}(y_{ij}), ..., B_{iL}(y_{ij}))^T, and M_i is the m_i × RL matrix with its j-th row equal to the 1 × RL vector (B_{iy}(y_{ij}) ⊗ B_{ix}(x_{ij}))^T.

Surface S_i is hence summarized by the estimated matrix of parameters Θ̂_i, which will be used as the input features in clustering. Although the vectorization of the parameter matrix Θ_i exhibits an elegant expression of the least squares solution, it may lead to loss of the information contained in the matrix structure of the estimated parameters when employing the clustering procedure. The k-means clustering (or other clustering methods) is a minimization algorithm based on distances between objects. Thus, we propose to convert vec(Θ̂_i) back to a matrix form by writing

$$\widehat{\Theta}_i = \mathrm{devec}\big\{ (M_i^T M_i)^{-1} M_i^T z_i \big\}, \tag{2}$$

where "devec" represents de-vectorization, i.e. arranging vec(Θ̂_i) into an R × L matrix whose entries correspond to the parameter matrix Θ_i. This preserves the spatial structure of the columns and rows of the coefficients that correspond to the hills and valleys of the surface. Such spatial structure of the coefficients can be informative when computing distances between objects, which are based on matrix distance metrics. Next, one aims to find a partition of the set of surfaces S = (S_1, ..., S_n) by grouping the set of estimated parameter matrices Θ̂^n = {Θ̂_1, ..., Θ̂_n} so that surfaces in the same cluster have features as similar as possible, and surfaces in different clusters have dissimilar features. That is, for a given number of clusters K, the algorithm searches for the set of cluster centers c = {c_1, ..., c_K} that minimizes

$$\frac{1}{n} \sum_{i=1}^{n} \min_{c \in \mathbf{c}} \| \widehat{\Theta}_i - c \|^2, \tag{3}$$

where each c represents an R × L matrix with real elements, and ||·|| is an appropriate matrix norm such as the Frobenius norm.

The k-means algorithm finds the partition of the surfaces and their cluster centers c in the following iterative manner.

Step 1: Initialize the partitions by setting c^(0) as (Θ̂_{ℓ_1}, ..., Θ̂_{ℓ_K}) (ℓ_1, ..., ℓ_K ∈ {1, ..., n}), which can be done, for instance, by choosing
(a) K matrices with random entries in the range of the entries of Θ̂_i (i = 1, ..., n),
(b) the K matrices among Θ̂_i (i = 1, ..., n) whose distance (norm) is the largest among themselves: first choose the 2 surfaces that are farthest apart, then sequentially choose other surfaces whose average distance to the previously selected ones is the maximum,
(c) the K matrices (Θ̂_{ℓ_1}, ..., Θ̂_{ℓ_K}) among Θ̂_i (i = 1, ..., n) so that the sum of the distances (norm) from each Θ̂_i to the closest one in (Θ̂_{ℓ_1}, ..., Θ̂_{ℓ_K}) is the minimum,
(d) K randomly chosen matrices among Θ̂_i (i = 1, ..., n),
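Before the initialization options are listed in full, here is a sketch of the estimator in equation (2); np.linalg.lstsq stands in for the explicit (M^T M)^{-1} M^T solution for numerical stability, the final reshape plays the role of devec, and bspline_basis is the illustrative helper from the previous sketch.

```python
def fit_surface(x, y, z, knots_x, knots_y, degree=3):
    """Least-squares estimate of the R x L coefficient matrix Theta from
    noisy observations z_j at scattered coordinates (x_j, y_j)."""
    Bx = bspline_basis(knots_x, degree, x)            # (m, R)
    By = bspline_basis(knots_y, degree, y)            # (m, L)
    # row j of M is the Kronecker product B_y(y_j) (x) B_x(x_j)
    M = np.einsum('jl,jr->jlr', By, Bx).reshape(len(z), -1)
    vec_theta, *_ = np.linalg.lstsq(M, z, rcond=None)
    R, L = Bx.shape[1], By.shape[1]
    # "devec": undo the column-major vectorization to recover Theta
    return vec_theta.reshape(L, R).T
```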
(e) the output of a pre-clustering procedure.

Step 2: Assign each surface, i.e. estimated parameter matrix Θ̂_i, to the closest cluster center ℓ according to the minimum distance (norm) ||Θ̂_i − Θ̂_ℓ|| (ℓ ∈ {ℓ_1, ..., ℓ_K}).

Step 3: Compute the new cluster centers c^(1) = (c_1^(1), ..., c_K^(1)), where c_ℓ^(1) is the mean of the matrices Θ̂_i over all surfaces i allocated to the ℓ-th cluster (ℓ = 1, ..., K).

Step 4: Repeat Steps 2 and 3 until there are no more changes in the cluster membership assignments.
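As a sketch (not the authors' implementation), Steps 1-4 with the Frobenius norm and random initialization, option (d) of Step 1, can be written as follows; the other initializations slot in by replacing the centers line.

```python
def kmeans_matrices(thetas, K, n_iter=100, seed=None):
    """k-means over R x L coefficient matrices with the Frobenius norm
    (Steps 1-4), using random initialization, option (d) of Step 1."""
    rng = np.random.default_rng(seed)
    thetas = np.asarray(thetas)                              # (n, R, L)
    centers = thetas[rng.choice(len(thetas), K, replace=False)].copy()
    labels = np.zeros(len(thetas), dtype=int)
    for _ in range(n_iter):
        # Step 2: assign each matrix to its closest center
        dist = np.linalg.norm(thetas[:, None] - centers[None], axis=(2, 3))
        labels = dist.argmin(axis=1)
        # Step 3: new centers are elementwise means of the member matrices
        new_centers = np.array(
            [thetas[labels == k].mean(axis=0) if np.any(labels == k)
             else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):                # Step 4
            break
        centers = new_centers
    return labels, centers
```

Since the Frobenius norm of a matrix equals the Euclidean norm of its vectorization, the assignments here coincide with ordinary k-means on vec(Θ̂_i); the matrix form becomes essential when other matrix norms, such as the spectral norm, are used in Step 2.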
3 Asymptotic results

In this section we first assume that S is observable without measurement error. Lemma 1 below shows that, given an approximation of S by a B-spline tensor product (projection onto S_1 ⊗ S_2), the cluster centers c_n obtained from minimizing equation (3) converge to a unique (optimal) cluster center set c*, as the number of surfaces n goes to infinity.

Here we use notations similar to those in Lemaire (1983) and Abraham et al. (2003). Let Π(S) be the unique matrix Θ ∈ R^{R×L} such that

$$\inf_{\Theta \in \mathbb{R}^{R \times L}} \| \mathcal{S} - s(\cdot, \Theta) \| = \| \mathcal{S} - s(\cdot, \Pi(\mathcal{S})) \|.$$

That is, Π(S) is the projection of the smooth surface space H ⊗ H onto S_1 ⊗ S_2.
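To unpack the minimizing property, note that when ‖·‖ is taken to be the L²(Q) norm (the text leaves the norm generic), Π(S) is characterized by the standard normal equations:

$$\int_Q \big[ \mathcal{S}(x, y) - s(x, y, \Pi(\mathcal{S})) \big]\, B_{x,r}(x)\, B_{y,l}(y)\, dx\, dy = 0, \qquad 1 \le r \le R,\ 1 \le l \le L.$$

In other words, the residual S − s(·, Π(S)) is orthogonal to every element of S_1 ⊗ S_2.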
Let $\mathcal{B}_{\mathbb{R}^{R \times L}}$ and µ denote the Borel σ-field of R^{R×L} and the image measure of P induced by Π. As Π is continuous, (R^{R×L}, B_{R^{R×L}}, µ) is a probability space. The surface sequence (S_1, ..., S_n) induces a sequence Θ^n = (Θ_1, ..., Θ_n) of i.i.d. random matrices Θ_i = Π(S_i) ∈ R^{R×L}. Let

$$F = \{ \mathbf{c} \subset \mathbb{R}^{R \times L} \mid \mathrm{card}(\mathbf{c}) \le K \}, \qquad u(\Theta, \mathbf{c}) = \min_{c \in \mathbf{c}} \| \Theta - c \|_F^2,$$

and denote the objective function of the k-means algorithm as

$$u_n(\Theta^n, \mathbf{c}) = \frac{1}{n} \sum_{i=1}^{n} u(\Theta_i, \mathbf{c}),$$

for all Θ ∈ R^{R×L} and c ∈ F. This differs from the classical clustering methods in that it is composed of norms of matrix differences. Using an appropriate matrix norm, we can establish a result similar to that in Abraham et al. (2003), which we state in Lemma 1.

Lemma 1

Let $u(\mathbf{c}) = \int_{\mathbb{R}^{R \times L}} u(\Theta, \mathbf{c})\, \mu(d\Theta)$ and assume that

$$\inf\{ u(\mathbf{c}) \mid \mathbf{c} \in F \} < \inf\{ u(\mathbf{c}) \mid \mathbf{c} \in F,\ \mathrm{card}(\mathbf{c}) < K \}.$$

Then the (unique) minimizer c* of u(·) exists, and there also exists a unique sequence of measurable functions c_n from (Ω, A, P) into (F, B_F) such that c_n(ω) ⊂ M_n for all ω ∈ Ω and

$$u_n(\Theta^n, \mathbf{c}_n) = \inf_{\mathbf{c} \subset M_n} u_n(\Theta^n, \mathbf{c}) \quad a.s.,$$

where {M_n}_n is an increasing sequence of convex and compact subsets of R^{R×L} such that R^{R×L} = ∪_n M_n. Furthermore, the sequence {c_n} is strongly consistent for c* with respect to the Hausdorff metric.

Given the observed data {(x_{i1}, y_{i1}, z_{i1}), ..., (x_{im_i}, y_{im_i}, z_{im_i})}, one can estimate the surface s_i(·, Θ) by the B-spline estimate ŝ_i = s(·, Θ̂), where Θ̂ is the least-squares estimated B-spline coefficient matrix given in equation (2). Assume that each surface is observed at different grid points over a compact set [a, b] × [c, d], for some real constants a, b, c, and d (a < b, c < d). Assume also that for each i, x_{i1}, ..., x_{im_i} and y_{i1}, ..., y_{im_i} are i.i.d. with probability distributions h and g respectively.
The following lemma shows that the estimator Θ̂ is strongly consistent for Θ = Π(S), the projection of the smooth surface space H ⊗ H onto the space generated by the B-spline bases.

Lemma 2
Assume the spline basis functions B_{x,1}, ..., B_{x,R} and B_{y,1}, ..., B_{y,L} are linearly independent on the support of h and g respectively. Assume also that the surfaces S_i belong to the space S defined as H ⊗ H restricted to bounded variation on [a, b] × [c, d]. Then Θ̂ converges strongly to Θ = Π(S) when m → ∞, uniformly over the space S. As a result, for almost all ω ∈ Ω and all S ∈ S, ||Θ̂ − Θ|| → 0 as m goes to infinity.

The proofs of Lemma 1 and Lemma 2 are similar to those in Abraham et al. (2003) in curve clustering. For the details of the proofs, please refer to the appendix.
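Lemma 2 lends itself to a quick numerical sanity check. Below is a minimal sketch, assuming the true surface lies exactly in the spline space (so that Π(S) equals the true coefficient matrix) and reusing the illustrative helpers from Section 2; the knot vector and error variance are arbitrary choices.

```python
rng = np.random.default_rng(0)
knots = np.r_[[0.0]*4, 0.25, 0.5, 0.75, [1.0]*4]   # clamped cubic on [0, 1]
R = len(knots) - 4                                  # here R = L = 7
theta_true = rng.normal(size=(R, R))
for m in (200, 2000, 20000):
    x, y = rng.uniform(0, 1, m), rng.uniform(0, 1, m)
    z = surface_eval(x, y, theta_true, knots, knots) + rng.normal(0, 0.1, m)
    theta_hat = fit_surface(x, y, z, knots, knots)
    print(m, np.linalg.norm(theta_hat - theta_true))   # shrinks as m grows
```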
4 Generalization to tensor clustering

The framework designed in Section 2 can be generalized to multi-dimensional clustering. Denote S_i(x), x = (x_1, ..., x_d) ∈ Q, as the data generating tensor mechanism, where Q is a subset of R^d and d is the tensor dimension (d ≥ 2). The observed data (z_i, x_i) is such that

$$z_{ij} = S_i(\mathbf{x}_{ij}) + \epsilon_{ij}, \qquad (1 \le j \le m_i;\ 1 \le i \le n). \tag{4}$$

The approximation of surface S_i(x) is then based on the smooth B-spline d-dimensional tensor product

$$s_i(\mathbf{x}, \Theta) = \sum_{r_1=1}^{R_1} \cdots \sum_{r_d=1}^{R_d} B_{x_1,r_1}(x_1) \cdots B_{x_d,r_d}(x_d)\, \theta_{i r_1, \ldots, r_d},$$

where θ_{i r_1,...,r_d} are coefficients to be estimated and B_{x_1,r_1}(·), ..., B_{x_d,r_d}(·) are B-spline basis functions that generate the spline spaces S_1 = span{B_{x_1,1}, ..., B_{x_1,R_1}}, ..., S_d = span{B_{x_d,1}, ..., B_{x_d,R_d}} respectively. The array of coefficients Θ has dimension $\prod_{k=1}^{d} R_k$, which is the number of parameters to be estimated. The multi-dimensional space of smooth surfaces in H ⊗ ... ⊗ H is then approximated by S_1 ⊗ ... ⊗ S_d. The least squares solution of this model can be written as

$$\begin{aligned}
\mathrm{vec}(\widehat{\Theta}_i) &= \arg\min_{\Theta} \sum_{j=1}^{m_i} \big[ z_{ij} - s_i(\mathbf{x}_{ij}, \Theta) \big]^2 \\
&= \arg\min_{\Theta} \sum_{j=1}^{m_i} \big[ z_{ij} - (B_{ix_d}(x_{idj}) \otimes \cdots \otimes B_{ix_1}(x_{i1j}))^T \mathrm{vec}(\Theta) \big]^2 \\
&= (M_i^T M_i)^{-1} M_i^T z_i,
\end{aligned}$$

where vec(Θ_i) is the vectorization of the array Θ_i arranged by columns, z_i = (z_{i1}, ..., z_{im_i})^T, B_{ix_k}(x_{ikj}) = (B_{i1}(x_{ikj}), ..., B_{iR_k}(x_{ikj}))^T for k = 1, ..., d, and M_i is the m_i × ∏_{k=1}^d R_k matrix with its j-th row equal to the vector (B_{ix_d}(x_{idj}) ⊗ ... ⊗ B_{ix_1}(x_{i1j}))^T. The proposed procedure can be applied to Θ̂_i in a similar fashion as in Section 2, except that one needs to employ an appropriate array norm to evaluate the distances between Θ̂_i and Θ̂_j (i ≠ j; i, j = 1, ..., n). We omit the details for the discussion of tensor clustering in this paper.
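Although the details are omitted in the text, the estimation step itself extends mechanically from the bivariate case: the rows of M_i are now d-fold Kronecker products. A minimal sketch follows, again with our illustrative helper bspline_basis; fit_tensor is our own name.

```python
from functools import reduce

def fit_tensor(coords, z, knots_list, degree=3):
    """Least-squares coefficient array for a d-dimensional B-spline tensor
    product; `coords` is a list of d arrays, each of length m."""
    bases = [bspline_basis(t, degree, x) for x, t in zip(coords, knots_list)]
    # row j of M is B_{x_d}(x_{dj}) (x) ... (x) B_{x_1}(x_{1j}), built by
    # accumulating row-wise Kronecker products (dimension 1 varies fastest)
    M = reduce(lambda acc, B: np.einsum('jp,jr->jrp', acc, B)
               .reshape(len(z), -1), bases)
    vec_theta, *_ = np.linalg.lstsq(M, z, rcond=None)
    # undo the column-major vectorization: reshape, then reverse the axes
    shape = [b.shape[1] for b in bases]
    return vec_theta.reshape(shape[::-1]).T
```

For d = 2, fit_tensor([x, y], z, [knots_x, knots_y]) reproduces fit_surface from Section 2.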
5 Simulation studies

In this section we investigate the finite sample performance of the proposed method in clustering surfaces through several simulation scenarios. For comparison purposes, we also evaluate the performance of the benchmark k-means clustering procedure, which does not allow the use of random grid coordinates. For both the proposed method and the benchmark k-means we chose the initial guess of the cluster centers in the first step as follows: consider the possible initial guesses given by initialization methods (b), (c), and 50 random initializations as described in method (d) of the k-means algorithm in Section 2. From these 53 possible initial guesses, we choose the one whose K matrices of estimated coefficients, when defined as the centers of clusters, have the minimum average distance to the objects in the data assigned to their corresponding clusters. In our numerical studies, we focus on the Frobenius norm as an example.

The first simulation setting concerns two clusters whose cluster centers are the following (probability density) surfaces, each composed of a mixture of two bivariate Normal distributions:

$$f_1(x, y) = w\,\phi\big((x, y); \mu_1, \Sigma_1\big) + (1 - w)\,\phi\big((x, y); \mu_2, \Sigma_2\big),$$
$$f_2(x, y) = w\,\phi\big((x, y); \mu_1, c\,\Sigma_1\big) + (1 - w)\,\phi\big((x, y); \mu_2, c\,\Sigma_2\big),$$

where φ((x, y); µ, Σ) is the probability density function of a bivariate Normal distribution with mean µ and covariance Σ, w is a fixed mixture weight, and c is a fixed constant. Note that the covariance matrix of each Normal component of the second cluster center is a multiple of c. For small values of c, the peaks, or modes, of the Normal mixture densities are very high. As c increases, the peaks become flatter and flatter. The simulations we present below show the performance of the proposed clustering procedure as the constant c varies, that is, with varying degrees of difficulty. Figure 1 shows the centroid surfaces from both clusters, where the top plots display the surface of the second cluster for c = 0.2, 1, and 3. The case c = 1 is challenging, due to the fact that the only difference between the two centroids is the variance of the first component of the normal mixture. The task of clustering becomes even harder with the presence of random error in the data generating process, which we describe next.

The simulated data (x, y, z) for each surface were generated on a 20 × 20 grid of equally spaced points; the data for Clusters 1 and 2 are generated as follows:

Cluster 1: {(x_j, y_j, z_{1j}) : z_{1j} = f_1(x_j, y_j) + ε_{1j}},
Cluster 2: {(x_j, y_j, z_{2j}) : z_{2j} = f_2(x_j, y_j) + ε_{2j}},

where (x_j, y_j) are the grid coordinates and ε_{1j} and ε_{2j} are i.i.d. mean-zero Normal random errors. We generated a total of n = 60 surfaces, with 30 samples belonging to each of the two clusters. In the simulations in this paper, we used cubic B-spline bases and a fixed number of 6 knots in each axis for the estimation of the surfaces. This process is repeated for B = 500 Monte Carlo simulation runs.

Fig. 1: Top: from left to right, surface of the second cluster center (density f_2) with constant c equal to 0.2, 1, and 3 respectively. Bottom: surface of the first cluster center (density f_1) for ease of visual comparison with the second cluster center.

It is well known that the initialization of the k-means procedure can have an immense influence on the clustering results (Pena et al. (1999), Franti and Sieranoja (2019)). For this reason, and for a fair comparison between the proposal and the benchmark, we initialized each procedure in the same way, as described in the first paragraph of this section. The benchmark applies the k-means algorithm to the vectorized raw data set, while the proposal applies k-means to the B-spline estimated coefficients.

In order to evaluate the results we consider the following performance measure. Let S_i^(b) denote the randomly generated surface i (i = 1, ..., n) in the b-th simulation run (b = 1, ..., B), and let L(S_i^(b)), L*(S_i^(b)) ∈ {1, ..., K} be the predicted and true cluster memberships for surface i respectively. We compute

$$\phi = \frac{\sum_{b=1}^{B} \min_{\tau \in T} \sum_{i=1}^{n} I\big( L(S_i^{(b)}) \ne \tau(L^*(S_i^{(b)})) \big)}{B\,\nu}, \qquad \nu = \frac{\sum_{b=1}^{B} I\Big( \min_{\tau \in T} \sum_{i=1}^{n} I\big( L(S_i^{(b)}) \ne \tau(L^*(S_i^{(b)})) \big) \ge 1 \Big)}{B},$$

where T is the set of permutations over {1, ..., K}. The term min_{τ∈T} Σ_{i=1}^n I(L(S_i^(b)) ≠ τ(L*(S_i^(b)))) is the number of mis-specification errors of a clustering procedure, which is based on the MCE measure in Fraiman et al. (2013). Hence, the quantity ν measures the proportion of times that the algorithm mis-specifies, that is, the proportion of simulation runs with at least one surface being assigned to an incorrect cluster, and the performance measure φ is the mean mis-specification, i.e., the average number of surfaces assigned to incorrect clusters in the simulations with at least one mis-specification.
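The inner minimum over label permutations, and then φ and ν, can be computed directly; the sketch below (with our own function names) uses a naive O(K!) enumeration, which is fine for the small K considered here.

```python
from itertools import permutations
import numpy as np

def misclassification(pred, truth, K):
    """min over relabelings tau of #{i : pred_i != tau(truth_i)},
    with cluster labels coded 0, ..., K-1."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    return min(int(np.sum(pred != np.asarray(tau)[truth]))
               for tau in permutations(range(K)))

def phi_nu(all_pred, truth, K):
    """phi (mean errors among runs with >= 1 error) and nu (proportion of
    runs with >= 1 error) over B simulation runs in `all_pred`."""
    errors = np.array([misclassification(p, truth, K) for p in all_pred])
    nu = np.count_nonzero(errors) / len(errors)
    phi = errors.sum() / max(np.count_nonzero(errors), 1)
    return phi, nu
```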
Figure 2 shows the results of the proposed clustering method and the benchmark in this first scenario. For small values of c (from 0.2 to 0.5) the variance of the densities in the first cluster is small and hence the hill is high (see top and bottom left plots of Figure 1); in this case both algorithms cluster the data without any incorrectly grouped surfaces. At slightly larger values of c, however, the benchmark k-means incorrectly mis-specifies an average of about 18 curves, while the proposed method mis-specifies far fewer (φ = 5.3).
For values of c from 1 to 1.7, the two bivariate densities that compose each cluster are very similar (see top and bottom middle plots in Figure 1), and both algorithms show similar performance, incorrectly clustering about 20 to 25 surfaces out of 60. For larger values of c (greater than 2), the variance of the densities in cluster 1 is large, so that the hills are low and the differences between the clusters are again more visible (see top and bottom right plots in Figure 1). For these values of c the proposed procedure again yields a very low number of incorrectly clustered surfaces, while the benchmark k-means still struggles to correctly cluster surfaces until c is very large.

Fig. 2: Comparison of mean number of incorrectly clustered surfaces, out of n = 60, with varying values of c in the first simulation setting. Dashed curve represents the benchmark, and the solid curve corresponds to the proposed method.

The second simulation setting is composed of 3 clusters whose centroid surfaces are defined by

$$f_1(x, y) = w\,\phi\big((x, y); \mu_1, \Sigma_1\big) + (1 - w)\,\phi\big((x, y); \mu_2, \Sigma_2\big),$$
$$f_2(x, y) = w\,\phi\big((x, y); \mu_3, c\,\Sigma_3\big) + (1 - w)\,\phi\big((x, y); \mu_4, c\,\Sigma_4\big),$$
$$f_3(x, y) = w\,\phi\big((x, y); \mu_5, c\,\Sigma_5\big) + (1 - w)\,\phi\big((x, y); \mu_6, c\,\Sigma_6\big).$$

Figure 3 shows the centroid surfaces from the 3 clusters for c = 0.2, 1, and 3. The third cluster may bring a challenge for values of c near 1, especially when there is random error. The data for each cluster are simulated as follows:

Cluster 1: {(x_j, y_j, z_{1j}) : z_{1j} = f_1(x_j, y_j) + ε_{1j}},
Cluster 2: {(x_j, y_j, z_{2j}) : z_{2j} = f_2(x_j, y_j) + ε_{2j}},
Cluster 3: {(x_j, y_j, z_{3j}) : z_{3j} = f_3(x_j, y_j) + ε_{3j}},

where ε_{1j}, ε_{2j}, ε_{3j} are i.i.d. mean-zero Normal random errors. We generated a total of n = 60 surfaces, where 20 surfaces belong to each cluster, and this process was repeated for B = 500 Monte Carlo simulation runs. Figure 4 displays the comparison in terms of the average number of incorrectly clustered surfaces for different values of c.
The results suggest that the proposed method consistently outperforms the benchmark, especially for small values of c. For large values of c the proposed algorithm shows an increase in mis-specification error, possibly due to the random nature of the data generated from the small hill in the third cluster that lies between the hills in the densities of clusters 1 and 2. The randomness of the data from the third cluster may sometimes lead to estimated surfaces with slightly higher hills in positions where the other clusters have hills, bringing a challenge to the clustering methods.

Fig. 3: Top: from left to right, surface of the second cluster center with constant c equal to 0.2, 1, and 3 respectively. Bottom: surface of the first cluster center for ease of visual comparison with the second cluster center.

Fig. 4: Comparison of mean number of incorrectly clustered surfaces, out of n = 60, with varying values of c in the second simulation setting. Dashed curve represents the benchmark, and the solid curve corresponds to the proposed method.
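For concreteness, data of the form used in these settings can be generated along the following lines; the mixture weights, means, covariances, noise level, and grid bounds below are placeholder values rather than the exact constants of the paper, with the cluster-2 density scaling each covariance by c as described above.

```python
from scipy.stats import multivariate_normal

def mixture_surface(x, y, means, covs, weights):
    """Two-component bivariate Normal mixture evaluated at points (x, y)."""
    pts = np.column_stack([x, y])
    return sum(w * multivariate_normal(m, S).pdf(pts)
               for w, m, S in zip(weights, means, covs))

rng = np.random.default_rng(1)
g = np.linspace(-3.0, 3.0, 20)                 # a 20 x 20 evaluation grid
X, Y = np.meshgrid(g, g)
x, y = X.ravel(), Y.ravel()
means = [(-1.0, -1.0), (1.0, 1.0)]             # placeholder parameters
covs, w, c = [np.eye(2), np.eye(2)], (0.5, 0.5), 1.5
z1 = mixture_surface(x, y, means, covs, w) + rng.normal(0, 0.05, x.size)
z2 = (mixture_surface(x, y, means, [c * S for S in covs], w)
      + rng.normal(0, 0.05, x.size))           # cluster-2 surface
```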
6 Real data application

In this section we illustrate the proposed surface clustering method through an analysis of the Electroencephalogram (EEG) dataset available at the University of California Irvine Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/eeg+database). The dataset is composed of 122 subjects that are divided into two groups: alcohol and control. Each subject was exposed to either a single stimulus (S1) or two stimuli (S1 and S2). The stimulus was a picture of an object chosen from the 1980 Snodgrass and Vanderwart picture set. For the case of two stimuli, they were either matched, where S1 was identical to S2, or not matched, where S1 was different from S2. On the scalp of each subject, 64 electrodes were positioned according to the Standard Electrode Position Nomenclature, American Electroencephalographic Association, and measurements were taken at 256 Hz (3.9-msec epoch) for 1 second (see Zhang et al. (1995) for details). The channels (i.e. electrodes) and time compose the surface domain (x, y), and the measurements are the response z.

We averaged over all trials for each individual and used the mean surface to represent each subject's response to the stimuli. Visualizations of the raw data of a subject in the control group and of a subject in the alcohol group are shown in Figure 5.

Fig. 5: Example of EEG measurements for a subject in the control group (left) and alcohol group (right).

The EEG surface of the control subject seems to be somewhat flat with some random noise, while the EEG surface of the subject in the alcohol group shows a few hills for some channels at later time stamps. However, this is not the case for all subjects: the EEG surfaces of some subjects in the control group have hills at later time stamps, while some subjects in the alcohol group have somewhat flat EEG surfaces throughout time. This imposes practical challenges on the clustering task, as we discuss below.
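A hypothetical sketch of this preprocessing pipeline, assuming each subject's trials are already loaded as an array of shape (n_trials, 64 channels, 256 time points); reading the UCI files is not shown, and fit_surface and kmeans_matrices are the illustrative helpers from Section 2.

```python
def subject_coefficients(trials, knots, degree=3):
    """Average a subject's trials into one channel-by-time mean surface and
    summarize it by its B-spline coefficient matrix."""
    mean_surface = trials.mean(axis=0)                      # (64, 256)
    ch, t = np.meshgrid(np.linspace(0, 1, 64),
                        np.linspace(0, 1, 256), indexing='ij')
    return fit_surface(ch.ravel(), t.ravel(), mean_surface.ravel(),
                       knots, knots, degree)

# thetas = [subject_coefficients(tr, knots) for tr in trials_by_subject]
# labels, _ = kmeans_matrices(thetas, K=2)   # alcohol vs. control
```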
We applied our proposed clustering method to the full dataset of 122 patients, where the elements clustered by the k-means algorithm were the mean surfaces of the subjects across their trials. Given that we know the actual groups of the subjects, i.e. either control or alcohol, we can compute how many mistakes were made by the proposed algorithm when applied to this dataset. The number of incorrectly clustered subjects was 51 (41.8%) when using surfaces from S1, 41 (33.6%) for S2 match, and 38 (31.1%) for S2 no match. Although an accuracy rate of about 60 to 70% correctly clustered subjects is not very high, we recall the fact that having hills at later time stamps is not a feature that belongs only to EEG surfaces of subjects in one particular group.

7 Conclusion

This article proposes a new approach for clustering surfaces, using the coefficients obtained from B-splines that approximate their data generating smooth forms as the basis for a k-means clustering algorithm. The proposed method is shown to be strongly consistent in both stochastic and non-stochastic cases. Compared to the classical k-means procedure applied directly to the vectorized version of the data, which may lose the geometric structure of the surface, our surface clustering procedure performs consistently better in correctly clustering surfaces observed with noise in simulation studies. From the wide range of applications of surface clustering, in this paper we studied the identification of the effects of alcohol on the brain by clustering Electroencephalogram data from 122 patients into alcohol and control groups, where the surfaces were defined by the stimuli response at 64 electrodes over time.

Acknowledgements

We would like to thank Fapesp for partially funding this research (Fapesp 2018/04654-9, 2017/15306-9, 2019/04535-2).
Appendix

Proof of Lemma 1

Proof
Following similar steps as in the proof of Proposition 1 in Abraham et al. (2003), we need to show that u(Θ, ·) is strictly convex, so that u_n(Θ^n, ·) and u(·) have unique minimizers if they exist.

Define the function φ from F into (R^{R×L})^K such that φ(c) is an ordered vector of the elements c_i in c. Suppose we denote the order using the notation ≤, where c_i ≤ c_j means c_j is ordered in front of c_i. Let ψ be a function from (R^{R×L})^K to F such that ψ(c_1, ..., c_K) = {c_1, ..., c_K}. Thus, ψ(φ(c)) = c.
Let λ be a scalar. For two elements c and c′ of F, define scalar multiplication and the sum of elements as

$$\lambda \mathbf{c} := \psi(\lambda \varphi(\mathbf{c})) = \{\lambda c_1, \ldots, \lambda c_K\}, \qquad \mathbf{c} + \mathbf{c}' := \psi(\varphi(\mathbf{c}) + \varphi(\mathbf{c}')),$$

so that λc ∈ F and c + c′ ∈ F. Then

$$\begin{aligned}
u(\Theta, \lambda \mathbf{c} + (1 - \lambda) \mathbf{c}') &= \min_{c \in \lambda \mathbf{c} + (1 - \lambda) \mathbf{c}'} \| \Theta - c \|^2 \\
&= \min_{1 \le i \le K} \| \Theta - (\lambda c_i + (1 - \lambda) c_i') \|^2 \\
&< \min_{1 \le i \le K} \big[ \lambda \| \Theta - c_i \|^2 + (1 - \lambda) \| \Theta - c_i' \|^2 \big] \\
&\le \lambda u(\Theta, \mathbf{c}) + (1 - \lambda) u(\Theta, \mathbf{c}').
\end{aligned}$$

The strict inequality in the third line only uses the triangle inequality of a matrix norm, so that any appropriate matrix norm can be used. Therefore, u(Θ, ·) is strictly convex, which implies that u_n(Θ^n, ·) and u are also strictly convex. Consequently, the minimizers of u_n(Θ^n, ·) and u are unique if they exist. The existence of the minimizer and the strong consistency of c_n to c* follow from Proposition 7 and Theorems 1 and 2 in Lemaire (1983).

Proof of Lemma 2
Proof
For ease of notation, we omit the superscript i in the proof of this lemma. Here we follow steps similar to those in the proof of Proposition 1 in Abraham et al. (2003).

Let P_ε denote the distribution of the error, and recall that h is the bivariate distribution of (x_j, y_j). Following the notations and set-up in Appendix 2 of Abraham et al. (2003), denote z(x, y, ε) = g(x, y) + ε. Thus,

$$\| g \|^2 = \int g^2(x, y)\, h(d(x, y)),$$

$$\| z - s(\cdot, \Theta) \|^2 = \int \big( g(x, y) + \epsilon - s(x, y, \Theta) \big)^2\, h(d(x, y))\, P_\epsilon(d\epsilon) = \| g - s(\cdot, \Theta) \|^2 + \| \epsilon \|^2,$$

$$\| g \|_m^2 = \frac{1}{m} \sum_{j=1}^{m} g^2(x_j, y_j),$$

$$\| z - s(\cdot, \Theta) \|_m^2 = \frac{1}{m} \sum_{j=1}^{m} \big[ g(x_j, y_j) + \epsilon_j - s(x_j, y_j, \Theta) \big]^2 = \| \epsilon + (g - s(\cdot, \Theta)) \|_m^2.$$

Using similar arguments, we can obtain a uniform strong law of large numbers on the space F = { z − s(·, α) : g ∈ S, s ∈ B }, where S is the space of all the surfaces and B is the matrix subspace generated by the B-spline basis. That is,

$$\sup_{g \in \mathcal{S},\, s(\cdot, \alpha) \in \mathcal{B}} \big| \| z - s(\cdot, \alpha) \|_m^2 - \| z - s(\cdot, \alpha) \|^2 \big| \to 0. \tag{5}$$
0. By equation (5) for almost every ω ∈ Ω and all g ∈ S , thereexists an integer N such that for any m > N , sup s ( , Θ ) |(cid:107) z − s ( · , Θ ) (cid:107) m − (cid:107) z − s ( · , Θ ) | < η/
2. Let (cid:98) Θ be the least-square estimate of Π ( g ) using observations z j at the design points ( x j , y j ). Then, for sufficiently large m , (cid:107) (cid:15) (cid:107) + (cid:107) g − s ( · , (cid:98) Θ ) (cid:107) = (cid:107) z − s ( · , (cid:98) Θ ) (cid:107) ≤ (cid:107) z − s ( · , (cid:98) Θ ) (cid:107) m + η ≤ (cid:107) z − s ( · , Π ( S )) (cid:107) m + η ≤ (cid:107) g + (cid:15) − s ( · , Π ( S )) (cid:107) + η = (cid:107) (cid:15) (cid:107) + (cid:107) g − s ( · , Π ( S )) (cid:107) + η, which implies (cid:107) s ( · , Π ( S )) − s ( · , (cid:98) Θ ) (cid:107) ≤ η. This concludes the proof.
References
Abraham, C., Cornillon, P. A., Matzner-Løber, E., and Molinari, N. (2003). Unsupervised curve clustering using B-splines. Scandinavian Journal of Statistics, 30(3):581–595.
Abu-Jamous, B., Fa, R., and Nandi, A. (2015). Integrative Cluster Analysis in Bioinformatics. Wiley.
Abualigah, L. M., Khader, A. T., and Hanandeh, E. S. (2018). A combination of objective functions and hybrid krill herd algorithm for text document clustering analysis. Engineering Applications of Artificial Intelligence, 73:111–125.
Blackhurst, J., Rungtusanatham, M. J., Scheibe, K., and Ambulkar, S. (2018). Supply chain vulnerability assessment: A network based visualization and clustering analysis approach. Journal of Purchasing and Supply Management, 24:21–30.
Boullé, M. (2012). Functional data clustering via piecewise constant nonparametric density estimation. Pattern Recognition, 45(12):4389–4401.
de Boor, C. (1971). Subroutine package for calculating with B-splines. Techn. Rep. LA-4728-MS, Los Alamos Sci. Lab, Los Alamos, NM, pages 109–121.
de Boor, C. (1972). On calculating with B-splines. Journal of Approximation Theory, 6(1):50–62.
de Boor, C. (1977). Package for calculating with B-splines. SIAM Journal on Numerical Analysis, 14(3):441–472.
Dias, R. and Gamerman, D. (2002). A Bayesian approach to hybrid splines nonparametric regression. Journal of Statistical Computation and Simulation, 72(4):285–297.
Duran, A. H., Greco, T. M., Vollmer, B., Cristea, I. M., Grünewald, K., and Topf, M. (2019). Protein interactions and consensus clustering analysis uncover insights into herpesvirus virion structure and function relationships. PLOS Biology.
Febrero-Bande, M. and de la Fuente, M. (2012). Statistical computing in functional data analysis: The R package fda.usc. Journal of Statistical Software, Articles, 51(4):1–28.
Ferraty, F. and Vieu, P. (2006). Nonparametric Functional Data Analysis. Springer Series in Statistics.
Floriello, D. (2011). Functional sparse k-means clustering. Thesis, Politecnico di Milano.
Franti, P. and Sieranoja, S. (2019). How much can k-means be improved by using better initialization and repeats? Pattern Recognition, 93:95–112.
García, M. L. L., García-Ródenas, R., and Gómez, A. G. (2015). K-means algorithms for functional data. Neurocomputing, 151:231–245.
Hartigan, J. A. (1975). Clustering Algorithms. Wiley.
Hartigan, J. A. and Wong, M. A. (1979). Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):100–108.
Hu, G., Kaur, M., Hewage, K., and Sadiq, R. (2019). Fuzzy clustering analysis of hydraulic fracturing additives for environmental and human health risk mitigation. Clean Technologies and Environmental Policy, 21:39–53.
Ieva, F., Paganoni, A. M., Pigoli, D., and Vitelli, V. (2013). Multivariate functional clustering for the morphological analysis of electrocardiograph curves. Journal of the Royal Statistical Society: Series C (Applied Statistics), 62(3):401–418.
Jiménez-Murcia, S., Granero, R., Fernández-Aranda, F., Stinchfield, R., Tremblay, J., Steward, T., Mestre-Bach, G., Lozano-Madrid, M., Mena-Moreno, T., Mallorquí-Bagué, N., Perales, J. C., Navas, J. F., Soriano-Mas, C., Aymamí, N., Gómez-Peña, M., Agüera, Z., del Pino-Gutiérrez, A., Martín-Romera, V., and Menchón, J. M. (2019). Phenotypes in gambling disorder using sociodemographic and clinical clustering analysis: an unidentified new subtype? Front Psychiatry, 10:173.
Karlin, S. (1973). Some variational problems on certain Sobolev spaces and perfect splines. Bull. Amer. Math. Soc., 79(1):124–128.
Kooperberg, C. and Stone, C. J. (1992). Logspline density estimation for censored data. Journal of Computational and Graphical Statistics, 1(4):301–328.
Lachout, P., Liebscher, E., and Vogel, S. (2005). Strong convergence of estimators as ε_n-minimisers of optimisation problems. Annals of the Institute of Statistical Mathematics, 57(2):291–313.
Lemaire, J. (1983). Propriétés asymptotiques en classification. Statistiques et analyse des données, 8:41–58.
Lindemann, S. R. and LaValle, S. M. (2009). Simple and efficient algorithms for computing smooth, collision-free feedback laws over given cell decompositions. The International Journal of Robotics Research, 28(5):600–621.
Mallat, S. (2008). A Wavelet Tour of Signal Processing, Third Edition: The Sparse Way. Academic Press, Inc., USA, 3rd edition.
Martino, A., Ghiglietti, A., Ieva, F., and Paganoni, A. M. (2019). A k-means procedure based on a Mahalanobis type distance for clustering multivariate functional data. Statistical Methods & Applications, 28(2):301–322.
Pena, J., Lozano, J., and Larranaga, P. (1999). An empirical comparison of four initialization methods for the k-means algorithm. Pattern Recognition Letters, 20(10):1027–1040.
Reif, U. (1997). Uniform B-spline approximation in Sobolev spaces. Numerical Algorithms, 15(1):1–14.
Tarpey, T. and Kinateder, K. K. J. (2003). Clustering functional data. Journal of Classification, 20(1):93–114.
Tokushige, S., Yadohisa, H., and Inada, K. (2007). Crisp and fuzzy k-means clustering algorithms for multivariate functional data. Computational Statistics, 22(1):1–16.
Udler, M. S., Kim, J., von Grotthuss, M., Bonàs-Guarch, S., Cole, J. B., Chiou, J., Anderson, C. D., Boehnke, M., Laakso, M., Atzmon, G., Glaser, B., Mercader, J. M., Gaulton, K., Flannick, J., Getz, G., and Florez, J. C. (2018). Type 2 diabetes genetic loci informed by multi-trait associations point to disease mechanisms and subtypes: A soft clustering analysis. PLOS Medicine.
Unser, M. A. (1997). Ten good reasons for using spline wavelets. In Aldroubi, A., Laine, A. F., and Unser, M. A., editors, Wavelet Applications in Signal and Image Processing V, volume 3169, pages 422–431. International Society for Optics and Photonics, SPIE.
Wang, G., Lin, N., and Zhang, B. (2014). Functional k-means inverse regression. Computational Statistics & Data Analysis, 70(C):172–182.
Ward Jr., J. H. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244.
Wedel, M. and Kamakura, W. (1999). Market Segmentation: Conceptual and Methodological Foundations. Springer Science & Business Media, 2nd edition.
Yamamoto, M. (2012). Clustering of functional data in a low-dimensional subspace. Advances in Data Analysis and Classification, 6(3):219–247.
Yamamoto, M. and Terada, Y. (2014). Functional factorial k-means analysis. Computational Statistics and Data Analysis, 79:133–148.
Zhang, X., Begleiter, H., Porjesz, B., Wang, W., and Litke, A. (1995). Event related potentials during object recognition tasks.