Batch Self-Organizing Maps for distributional data with automatic weighting of variables and components
Francisco de A.T. De Carvalho, Centro de Informática, Universidade Federal de Pernambuco, Av. Jornalista Anibal Fernandes s/n - Cidade Universitária, CEP 50740-560, Recife-PE, Brazil
Antonio Irpino, Rosanna Verde and Antonio Balzanella, Università degli Studi della Campania "Luigi Vanvitelli", Dept. of Mathematics and Physics, Viale Lincoln 5, 81100 Caserta, Italy
Abstract
This paper deals with Self-Organizing Maps (SOM) for data described by distributional-valued variables. This kind of variable takes as values empirical distributions on the real line or estimates of probability distributions. We propose a Batch SOM strategy (DBSOM) that optimizes an objective function, using an $L_2$ Wasserstein distance, a suitable dissimilarity measure for comparing distributional data that has already been adopted in several distributional data analysis methods. Moreover, aiming to take into consideration the different contributions of the variables, we propose an adaptive version of the DBSOM algorithm. This adaptive version has an additional step that automatically learns a relevance weight for each distributional-valued variable. Besides, since the $L_2$ Wasserstein distance allows a decomposition into two components, one related to the means and one related to the size and shape of the distributions, relevance weights are also automatically learned for each of the components, to emphasize the importance of the different estimated parameters of the distributions. Experiments with real datasets of distributional data corroborate the proposed DBSOM algorithms.
Keywords:
Distribution-valued data, Wasserstein distance, Self-Organizing Maps, Relevance weights, Adaptive distances
1. Introduction
The current big-data age requires the representation of voluminous data by summaries that lose as little information as possible. Usually, this is achieved by describing data subgroups through descriptive statistics of their distribution (e.g., the mean, the standard deviation, etc.). Alternatively, when a dataset is observed with respect to a numerical variable, it can be described either by the estimate of the theoretical distribution that best fits the data or by an empirical distribution. In these cases, each set of observations is described by distribution-valued data, and we call distributional-valued variable a more general type of variable whose values are one-dimensional empirical or estimated probability or frequency distributions on a numeric support.

Symbolic Data Analysis [6] introduced distributional-valued variables as particular set-valued variables, namely, modal variables having numeric support. A particular type of distributional-valued variable is the histogram-valued variable, whose values are histograms. Such data arise in many practical situations. Official statistical institutes collect data about territorial units or administrations and often release them as empirical distributions or histograms. Similarly, data are often available as aggregates in order to preserve individuals' privacy. For instance, bank transactions or measurements regarding the patients of a hospital are often provided as histograms or empirical distributions rather than as individual records.

So far, several methods have been proposed for estimating distributions from a set of classical data, while few methods have been developed for the analysis of objects described by distribution-valued variables. Among exploratory methods, the Kohonen Self-Organizing Map (SOM) offers both visualization and clustering capabilities [40, 41]. SOM is based on a map of nodes (neurons) organized on a regular low-dimensional grid where the neurons have a priori neighborhood relations. Each neuron is described by a prototype vector (a model) and is associated with a set of the input data.

In this sense, SOM carries out clustering while preserving the topological order of the prototype vectors on the map: the more two neurons are adjacent on the map, the more similar their prototypes, whereas different prototypes are associated with neurons that are distant on the map. Besides, during the training step, each object must be assigned to a neuron. This can be done in two ways. The classical SOM algorithm assigns an input datum to the closest BMU (Best Matching Unit), namely the neuron described by the closest prototype. Following the approach also considered by Kohonen [40], it is possible to treat the assignment as part of an optimization problem. In this case, an objective function is associated with the SOM and is minimized with respect to the prototype definitions (a representation step) and the data assignment. Thus, the assignment is not merely done according to the closest BMU but according to the BMU that minimizes the objective function.
After each object is assigned to its optimal BMU, the corresponding prototype and a subset of the prototypes of its neighbors on the grid are modified to better fit the data set. An important property of the SOM is that it preserves, as far as possible, the original topological structure of the data: objects that are similar in the original space have similar representatives located close together on the map. Finally, the training of the SOM can be incremental or batch. Kohonen [41] states that, for practical applications, the batch version of the SOM is the more suitable. However, when the data are presented sequentially, as in data streams, the training has to be incremental.

SOM can be considered a distance-based clustering method, so the definition of a distance between the data is essential, especially in our case where the data are one-dimensional distributions. The literature on clustering distributions includes several proposals. In [32], a hierarchical clustering method is proposed which uses a Wasserstein distance for comparing distributions estimated by means of histograms. Using the same distance function, [53] propose a method based on the Dynamic Clustering Algorithm (DCA) [16]. The latter is a centroid-based algorithm which generalizes the classic k-means. It optimizes an internal homogeneity criterion by performing, iteratively, a representation and an allocation step until convergence to a stable value of the optimized criterion. Another centroid-based method has been introduced in [50]: a k-means algorithm which uses empirical joint distributions. Finally, [55] propose a Dynamic Clustering Algorithm based on copulas. All the mentioned algorithms require an appropriate dissimilarity measure for comparing distributions. Among these, Wasserstein distances [49] have interesting features, as investigated in [33].

In classical clustering it is usually assumed that all variables play the same role in the grouping process. Sometimes a standardization or a rescaling step is performed before running the clustering algorithm, but this step does not ensure that each variable participates in the clustering process according to its clustering ability. Indeed, there is a wide and unresolved debate about variable transformation in clustering. The use of adaptive distances [15] in clustering is a valid approach for identifying the importance of the variables in the clustering process. In the framework of Symbolic Data Analysis, several Dynamic Clustering algorithms including adaptive distances, which provide relevance weights for the variables, have been introduced for interval-valued data [9, 10, 11], for histogram-valued data [37, 38] and for modal symbolic data [25, 43, 44]. The use of adaptive distances based on the $L_2$ Wasserstein metric has also been proposed in the framework of the Dynamic Clustering algorithm by [35], and a fuzzy version of that algorithm is available in [36].

Some extensions of the SOM have been proposed for interval-valued data [5, 17, 7, 21, 22, 23, 24]. Nonetheless, to the best of our knowledge, SOM algorithms for distributional data have not yet been proposed. Moreover, SOM algorithms usually assume that the variables have the same importance for the training of the map, i.e., that they have the same relevance. However, in real applications some variables are irrelevant and some are less relevant than others [15, 18, 19, 27].
In order to consider the role of the variables in the partitioning structure, different approaches can be adopted. A first strategy is to set the weights of the variables according to a priori knowledge about the application domain and then to perform the SOM procedure to train the map. A suitable alternative is to add a step to the algorithm which computes the weights for the variables automatically. This approach has been used by Diday and Govaert [15], who propose adaptive distances in which the automatically computed weights satisfy a product-to-one constraint. Recently, De Carvalho et al. [13] proposed batch SOM algorithms that learn a relevance weight for each interval-valued variable during the training phase of the SOM thanks to adaptive distances. In particular, a relevance weight is associated with each variable by introducing a weighting step in the algorithm and by modifying the optimized criterion function.
Main contributions
The present paper extends the Batch Self-Organizing Map (BSOM) algorithm to histogram data. We refer to it as DBSOM, where D stands for Distributional data. Since the DBSOM cost function depends on a distance between data that are distributions, among the different distances for comparing distributions we propose the use of the Wasserstein distance. It belongs to a family of distance measures defined by different authors in a variety of contexts of analysis and with different norms. In our context, we consider the squared Wasserstein distance, also known as the Mallows distance, as defined in [49], and here named the $L_2$ Wasserstein distance. The main motivation for this choice can be found in [51], where a comparison with other metrics is proposed. It is related to the possibility of defining barycenters of sets of distributions as Fréchet means, improving the ease of interpretation of the obtained results. Moreover, it allows the definition of the variance of any distributional variable [33]. Besides, for the $L_2$ Wasserstein distance an important decomposition has been proved into components related to the location (means) and to the variability and shape of the compared distributions [31].

Another contribution of the paper is the introduction of a system of weights accounting for the different importance of the variables in the clustering process. Indeed, the classical BSOM assumes that each variable plays the same role in the map learning, and a standardization of the variables is usually performed before the algorithm starts (as in Principal Component Analysis). However, the effect of each variable can be relevant in the learning process. To take this into consideration, we introduce an adaptive version of DBSOM, denoted ADBSOM. We call it adaptive since it is based on the use of the adaptive distances proposed by [15] in clustering. As we will show, the optimization process of ADBSOM allows the automatic computation of a system of weights for the variables. Since the $L_2$ Wasserstein distance can be decomposed into two components related to the location (means) and to the variability and shape, we also propose a system of weights for the two components of each distributional variable. This enriches the interpretation of the components of the variables that are relevant in the learning process and for the results. The procedure is performed through an additional step of the ADBSOM algorithm that automatically learns relevance weights for each variable and/or for each component of the distance. Some preliminary results were presented in [34]. In addition to extending the methods mentioned above, we propose new variants of the algorithm which consider new constraints in the optimization process.

The paper is organized as follows. Sec. 2 introduces the details of our proposal, focusing on the criterion function optimized by each algorithm, the computation of the relevance weights on data described by distributional-valued variables, and the adaptive distances for distributional data. Sec. 3 provides an application of the proposed algorithms to real-world datasets. Sec. 4 concludes the paper with a discussion of the results achieved on the several datasets.
2. Batch SOM algorithms for distributional-valued data
SOM is an efficient method to address the problem of clustering and visualization, especially for high-dimensional data having a topological structure as input. Since a grid of neurons is chosen to define the topology of the map, the procedure of mapping a vector from the data space onto the map consists in finding the neuron with the closest (weighted) distance to the data-space vector. A correspondence between the input space and the mapping space is built such that two close data points in the input space should activate the same neuron, or two neighboring neurons, of the SOM. Each neuron is described by a prototype. Neighboring neurons of the Best Matching Unit (BMU) of a datum update their prototypes to better represent it.

An extension of SOM to histogram data is suitable for analyzing data that are already available in aggregate form, typically generated as syntheses of a huge amount of original data. This paper proposes batch SOM algorithms for distributional-valued data that automatically provide a set of relevance weights for the different variables. We present two new batch SOM algorithms, namely DBSOM (Distributional Batch SOM) and ADBSOM (Adaptive Distributional Batch SOM). Both algorithms are based on the $L_2$ Wasserstein distance between distributional-valued data. Specifically, during the training of the map, ADBSOM computes relevance weights for each distributional-valued variable. Such relevance weights are parameters of the dissimilarity function that compares the prototypes and the data items. Therefore, the computed values of these weights quantify the importance of the distributional-valued variables in the training of the map. SOM provides both a visualization (given by the proximity between the neurons) and a partitioning of the input data (through the organization of the data into clusters).
The objective function

This section extends the classical objective function of batch SOM to the case of distributional-valued variables. Let $E = \{e_1, \ldots, e_N\}$ be a set of $N$ objects described by $P$ distributional-valued variables $Y_j$ ($j = 1, \ldots, P$). Each $i$-th object $e_i$ ($1 \le i \le N$) is represented by $P$ distributions (or distributional-valued data) $y_{ij}$ ($1 \le j \le P$), the elements of an object vector $y_i = (y_{i1}, \ldots, y_{iP})$. With each one-dimensional distributional datum $y_{ij}$ are associated: a one-dimensional estimated density function $f_{ij}$, the corresponding cumulative distribution function (cdf) $F_{ij}$ and the quantile function (qf) $Q_{ij} = F_{ij}^{-1}$ (namely, the inverse of the cdf). Therefore, the distributional-valued data set $D = \{y_1, \ldots, y_N\}$ is collected in an object table

$$\mathbf{Y} = \left(y_{ij}\right)_{\substack{1 \le i \le N \\ 1 \le j \le P}}$$

where each element is a (one-dimensional) distribution $y_{ij}$.

A SOM is a low-dimensional (mainly, two-dimensional) regular grid, named map, that has $M$ nodes named neurons. A SOM algorithm induces a partition where each cluster corresponds to a unique neuron described by a prototype vector. Thus, the neuron $m$ ($1 \le m \le M$) is associated with a cluster $C_m$ and a prototype object vector $g_m$. The assignment function $f : D \mapsto \{1, \ldots, M\}$ assigns an index $m = f(y_i) \in \{1, \ldots, M\}$ to each distributional-valued datum $y_i$ according to:

$$f(y_i) = m = \arg\min_{1 \le r \le M} d^T(y_i, g_r) \qquad (1)$$

where $d^T$ is a suitable dissimilarity function between the object vectors $y_i$ and the prototype vectors $g_r$. The partition $\mathcal{P} = \{C_1, \ldots, C_M\}$ carried out by the SOM is obtained according to an assignment function that provides the index of the cluster to which the object $y_i$ is assigned: $C_m = \{e_i \in E : f(y_i) = m\}$.

Since the variables are distributional-valued, each prototype $g_m$ ($1 \le m \le M$) is a vector of $P$ distributional data, i.e., $g_m = (g_{m1}, \ldots, g_{mP})$, where each $g_{mj}$ ($1 \le j \le P$) is a distribution. Besides, with each one-dimensional distributional datum $g_{mj}$ are associated: an estimated density function $f_{g_{mj}}$, the cdf $G_{mj}$ and the qf $Q_{g_{mj}}$. We remark that $g_{mj}$ is a distributional datum because the corresponding quantile function $Q_{g_{mj}}$ is a weighted average quantile function (for further details see [20] and [33]). Hereafter, $\mathbf{G}$ is the matrix of the descriptions of the $g_{mj}$ associated with the prototypes:

$$\mathbf{G} = \left(g_{mj}\right)_{\substack{1 \le m \le M \\ 1 \le j \le P}}$$

where each cell contains a (one-dimensional) distribution $g_{mj}$.

So that the obtained SOM represents the data set $D$ accurately, the prototype matrix $\mathbf{G}$ and the partition $\mathcal{P}$ are computed iteratively by minimizing a suitable objective function (also known as the energy function of the map), hereafter referred to as $J_{DBSOM}$, defined as the sum of the dissimilarities between the prototypes (best matching units) and the data units:

$$J_{DBSOM}(\mathbf{G}, \mathcal{P}) = \sum_{i=1}^{N} d^T(y_i, g_{f(y_i)}) \qquad (2)$$

The dissimilarities between each object and all the prototype vectors need to be computed. The best matching unit (BMU) is the winner neuron, i.e., $m = f(y_i)$, whose prototype vector $g_m$ yields the minimum error. The dissimilarity function $d^T$, used to compare each object $y_i$ with each prototype $g_m$, is computed as follows:

$$d^T(y_i, g_m) = \sum_{h=1}^{M} K^T(d(m, h)) \, d_W^2(y_i, g_h) \qquad (3)$$

In equation (3), $d$ is the distance defined on the set of neurons; usually, it is computed as the length of the shortest path on the grid between nodes (neurons) $m$ and $h$. $T$ is the neighborhood radius.
$K^T$ is the neighborhood kernel function that computes the neighborhood influence of the winner neuron $m$ on neuron $h$; this influence diminishes with $T$ [2].

Since $y_i$ is described by $P$ distributional variables as defined above, without any information about the multivariate distribution, we assume that $y_{i1}, \ldots, y_{ij}, \ldots, y_{iP}$ are marginal distributions. Thus, the (standard) $L_2$ Wasserstein distance between the $i$-th object and the prototype $g_m$ associated with neuron $m$ is defined as follows:

$$d_W^2(y_i, g_m) = \sum_{j=1}^{P} d_W^2(y_{ij}, g_{mj}). \qquad (4)$$

Following the notation of Rüschendorf [49], the squared $L_2$ Wasserstein distance between $y_{ij}$ and $g_{mj}$ is defined as:

$$d_W^2(y_{ij}, g_{mj}) = \int_0^1 \left[Q_{ij}(p) - Q_{g_{mj}}(p)\right]^2 dp. \qquad (5)$$

In [36] it is shown that the squared $L_2$ Wasserstein distance has better interpretative properties compared with other distances between distributions. In particular, it can be decomposed into two independent distance components as follows:

$$d_W^2(y_{ij}, g_{mj}) = (\bar{y}_{ij} - \bar{y}_{g_{mj}})^2 + \int_0^1 \left[Q^c_{ij}(p) - Q^c_{g_{mj}}(p)\right]^2 dp, \qquad (6)$$

where $\bar{y}_{ij}$ and $\bar{y}_{g_{mj}}$ are the means, and $Q^c_{ij}(p) = Q_{ij}(p) - \bar{y}_{ij}$ and $Q^c_{g_{mj}}(p) = Q_{g_{mj}}(p) - \bar{y}_{g_{mj}}$ are the centered quantile functions of $y_{ij}$ and $g_{mj}$, respectively. Briefly, the squared $L_2$ Wasserstein distance is the sum of the squared Euclidean distance between the means and the squared $L_2$ Wasserstein distance between the centered quantile functions of the two distributions. We rewrite the same equation as follows:

$$d_W^2(y_{ij}, g_{mj}) = d^2(\bar{y}_{ij}, \bar{y}_{g_{mj}}) + d_W^2(y^c_{ij}, g^c_{mj}). \qquad (7)$$

Finally, the standard multivariate squared $L_2$ Wasserstein distance between the $i$-th object $y_i$ and the prototype $g_m$ is:

$$d_W^2(y_i, g_m) = \sum_{j=1}^{P} (\bar{y}_{ij} - \bar{y}_{g_{mj}})^2 + \sum_{j=1}^{P} d_W^2(y^c_{ij}, g^c_{mj}). \qquad (8)$$

For the rest of the paper, we denote by $dM_{im,j} = (\bar{y}_{ij} - \bar{y}_{g_{mj}})^2$ the squared Euclidean distance between the means of the distributional data $y_{ij}$ and $g_{mj}$, and by $dV_{im,j} = d_W^2(y^c_{ij}, g^c_{mj})$ the squared $L_2$ Wasserstein distance between the centered distributional data. Equation (8) can then be written in compact form:

$$d_W^2(y_i, g_m) = \sum_{j=1}^{P} (dM_{im,j} + dV_{im,j}). \qquad (9)$$

Therefore, the generalized distance $d^T(y_i, g_{f(y_i)})$ of equation (3) is a weighted sum of the non-adaptive multivariate squared $L_2$ Wasserstein distances $d_W^2$ computed between the vector $y_i$ and the prototypes in the neighborhood of the winner neuron $f(y_i)$.

SOM models usually assume that the variables have the same importance for the clustering and visualization tasks. Nonetheless, in real applications, some variables are irrelevant and others are more or less relevant. Moreover, each cluster may have its own specific set of relevant variables [15, 18, 19, 27]. In the framework of cluster analysis, adaptive distances [15] have been proposed to address this issue. Adaptive distances are weighted distances where, generally, a positive weight is associated with each variable according to its relevance in the clustering process, such that the system of weights satisfies suitable constraints. Adaptive distances were originally proposed in a k-means-like algorithm in two different ways. In a first algorithm, a weight is associated with each variable for the whole dataset (we call it the Global approach). In a second approach, considering a partition of the dataset into $M$ clusters, a weight is associated with each variable and each cluster (we call it the Cluster-wise approach).
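To make Eqs. (5)-(9) concrete, the following minimal Python/numpy sketch computes the squared $L_2$ Wasserstein distance and its location/size-shape decomposition. It assumes, as an implementation choice not prescribed by the paper, that each distribution is represented by its quantile function sampled on a common grid of probabilities; all names are illustrative.

```python
import numpy as np

def wasserstein2_sq(q1, q2, p):
    """Squared L2 Wasserstein distance, Eq. (5): integral over p in (0, 1)
    of the squared difference of the two quantile functions."""
    return np.trapz((q1 - q2) ** 2, p)

def wasserstein2_sq_components(q1, q2, p):
    """Decomposition of Eq. (6): squared distance between the means (dM)
    plus squared L2 Wasserstein distance between the centered quantile
    functions (dV)."""
    m1, m2 = np.trapz(q1, p), np.trapz(q2, p)        # means: integral of Q(p) dp
    dM = (m1 - m2) ** 2                              # location component
    dV = np.trapz(((q1 - m1) - (q2 - m2)) ** 2, p)   # size/shape component
    return dM, dV
```

For a $P$-variate object, Eq. (9) is then simply the sum of `dM + dV` over the $P$ variables.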
In this paper, we remark that each neuron of the SOM is related to a Voronoi set that can be considered as a cluster of the input data. Similarly to [34], we extend this idea to a dataset of distributional data. Following the same approach, we go further by exploiting the decomposition of the squared $L_2$ Wasserstein distance in Eq. (9), proposing to weight the components too. Namely, we provide a method for observing the relevance of the two aspects of a distributional variable related to the two components. The current proposal differs from the one in [34] in the constraints on the weights.

In this paper, we denote by $\mathbf{\Lambda}$ the matrix of relevance weights. The dimension of $\mathbf{\Lambda}$ is $P \times 1$ or, respectively, $2P \times 1$, if relevance weights are associated with each variable or, respectively, each component for the whole dataset; it is $P \times M$ or, respectively, $2P \times M$ if a weight is associated with each variable or, respectively, each component for each neuron. We recall that, with adaptive distances, the relevance weights are not defined in advance but depend on the minimization of the objective function, here denoted $J_{ADBSOM}$, which measures the dispersion of the data around the prototypes. Obviously, the trivial solution of such a minimization is obtained when $\mathbf{\Lambda}$ is a null matrix.

To avoid the trivial solution, a constraint on the relevance weights is necessary. In the literature, a constraint on the product [15] or on the sum [26] of the weights, usually set to one, is suggested. Even if the latter approach appears more natural, it relies on the tuning of a parameter that must be fixed in advance; since a consensus on an optimal value is still missing, we do not discuss its use in this paper.

In the framework of cluster analysis for non-standard data, adaptive squared $L_2$ Wasserstein distances have been applied (see [9, 11]) for deriving relevance weights for each variable and cluster. Irpino et al. [35] provided a component-wise adaptive distance approach to clustering, but the relevance weights are related to two independent constraints for each component of the distance, forcing the assignment of high relevance weights to components whose contributions to the clustering process are low. In this paper, we adopt the approach suggested in [36], which solves this issue.

Differently from the method proposed by Kohonen [40], we consider the training of the SOM as a set of iterative steps that minimize a criterion function $J_{ADBSOM}$ [2]. The main difference is the allocation of the objects to the Voronoi set of each neuron. Indeed, in the original and most widely used formulation, the allocation is performed according to the minimum distance between the object and the prototype only. That approach does not guarantee a monotonic decrease of the criterion function along the training steps. In our case, at each step of the training of the SOM, a set $\mathbf{G}$ of $M$ prototypes (one for each neuron), the matrix $\mathbf{\Lambda}$ of relevance weights and the partition $\mathcal{P}$ of the input objects are derived by the minimization of the error function $J_{ADBSOM}$ (also known as the energy function of the map), computed as the following dispersion criterion:

$$J_{ADBSOM}(\mathbf{G}, \mathbf{\Lambda}, \mathcal{P}) = \sum_{i=1}^{N} d^T_{\Lambda}(y_i, g_{f(y_i)}) \qquad (10)$$

The dissimilarity function $d^T_{\Lambda}$, which compares each data unit $y_i$ to each prototype $g_h$ ($1 \le h \le M$), is defined as:

$$d^T_{\Lambda}(y_i, g_m) = \sum_{h=1}^{M} K^T(d(m, h)) \, d_{\Lambda}(y_i, g_h) \qquad (11)$$

where $d$, $T$ and $K^T$ are defined as in equation (3), while $d_{\Lambda}(y_i, g_h)$ is one of the four following expressions, depending on the global or cluster-wise, variable-wise or component-wise assignment of the relevance weights:

$$d_{\Lambda}(y_i, g_m) = \sum_{j=1}^{P} \lambda_j \, (dM_{im,j} + dV_{im,j}) \qquad (12)$$

$$d_{\Lambda}(y_i, g_m) = \sum_{j=1}^{P} (\lambda_{j,M} \, dM_{im,j} + \lambda_{j,V} \, dV_{im,j}) \qquad (13)$$

$$d_{\Lambda}(y_i, g_m) = \sum_{j=1}^{P} \lambda_{mj} \, (dM_{im,j} + dV_{im,j}) \qquad (14)$$

$$d_{\Lambda}(y_i, g_m) = \sum_{j=1}^{P} (\lambda_{mj,M} \, dM_{im,j} + \lambda_{mj,V} \, dV_{im,j}) \qquad (15)$$

Thus, the generalized distance $d^T_{\Lambda}(y_i, g_{f(y_i)})$ is a weighted sum of the adaptive multivariate squared $L_2$ Wasserstein distances $d_{\Lambda}(y_i, g_m)$ computed between the vector $y_i$ and the prototypes in the neighborhood of the winner neuron $f(y_i)$. To avoid the trivial solution (namely, null $\lambda$'s), we use the above-mentioned product constraint.
We suggest four different constraints on the relevance weights, that is:

(P1) A product constraint for Eq. (10) and Eq. (12) (see [15]):
$$\prod_{j=1}^{P} \lambda_j = 1, \quad \lambda_j > 0 \qquad (16)$$

(P2) A product constraint for Eq. (10) and Eq. (13) (see [35] and [12]):
$$\prod_{j=1}^{P} (\lambda_{j,M} \cdot \lambda_{j,V}) = 1, \quad \lambda_{j,M} > 0, \; \lambda_{j,V} > 0 \qquad (17)$$

(P3) $M$ product constraints for Eq. (10) and Eq. (14) (see [15]):
$$\prod_{j=1}^{P} \lambda_{mj} = 1, \quad \lambda_{mj} > 0, \quad m = 1, \ldots, M \qquad (18)$$

(P4) $M$ product constraints for Eq. (10) and Eq. (15) (see [12] and [35]):
$$\prod_{j=1}^{P} (\lambda_{mj,M} \cdot \lambda_{mj,V}) = 1, \quad \lambda_{mj,M} > 0, \; \lambda_{mj,V} > 0, \quad m = 1, \ldots, M \qquad (19)$$

Remark. Note that the weights of each variable, or of each component, in the cluster-wise schemes are locally estimated. This means that the weights change at each iteration. Moreover, each neuron (whose Voronoi set represents a cluster) has its own specific set of weights. On the other hand, weights defined according to a global scheme are the same for all the clusters. In general, note that a relevant variable (or component) has a weight greater than 1 because of the product-to-one constraint.
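A hedged sketch of the four adaptive distances of Eqs. (12)-(15), continuing with the quantile-grid representation of the previous sketch: `dM` and `dV` are the per-variable components of Eq. (9) for one object-prototype pair, and the array layouts assumed for `lam` are an illustration, not a prescribed format.

```python
import numpy as np

def adaptive_distance(dM, dV, lam, scheme, m):
    """d_Lambda(y_i, g_m) under the four weighting schemes, Eqs. (12)-(15).
    dM, dV : arrays (P,), the two components of Eq. (9) for the pair (i, m).
    lam    : weights; shape (P,) for P1, (P, 2) for P2, (M, P) for P3,
             (M, P, 2) for P4 (layouts are an assumption of this sketch)."""
    if scheme == "P1":                     # one weight per variable (Eq. 12)
        return np.sum(lam * (dM + dV))
    if scheme == "P2":                     # one weight per component (Eq. 13)
        return np.sum(lam[:, 0] * dM + lam[:, 1] * dV)
    if scheme == "P3":                     # per variable and neuron (Eq. 14)
        return np.sum(lam[m] * (dM + dV))
    if scheme == "P4":                     # per component and neuron (Eq. 15)
        return np.sum(lam[m, :, 0] * dM + lam[m, :, 1] * dV)
    raise ValueError(f"unknown scheme {scheme}")
```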
In this paper, we use batch training of the SOM, namely, all the data are presented to the map at the same time. Once $T$ (the neighborhood radius) is fixed, the training of DBSOM relies on the minimization of the criterion function $J_{DBSOM}$, which is based on the classical squared $L_2$ Wasserstein distances; thus, no relevance weights are computed. The training task alternates a representation and an assignment step iteratively. The representation step returns the optimal solution for the prototypes describing the neurons of the map. In the assignment step, the objects are optimally allocated to the Voronoi sets of the neurons of the map. Differently, the ADBSOM algorithm is trained through the minimization of the $J_{ADBSOM}$ function. In that case, three steps are iterated: a representation, a weighting and an assignment step. The representation and assignment steps are performed as in DBSOM. The new weighting step computes the optimal solutions for the relevance weights according to one of the four proposed schemes.
The representation step

This section focuses on the optimal solution, for a fixed radius $T$, for the prototype of the cluster associated with each neuron during the training of the DBSOM and ADBSOM algorithms. In the representation step of DBSOM, for a fixed partition $\mathcal{P}$, the objective function $J_{DBSOM}$ is minimized with respect to the elements of the matrix $\mathbf{G}$ of prototypes. Similarly, in the representation step of ADBSOM, for a fixed partition $\mathcal{P}$ and a fixed set of weights in the matrix $\mathbf{\Lambda}$, the objective function $J_{ADBSOM}$ is minimized with respect to the elements of $\mathbf{G}$.

DBSOM looks for the prototype $g_m$ of the cluster $C_m$ ($1 \le m \le M$) that minimizes the following expression:

$$\sum_{i=1}^{N} K^T(d(f(y_i), m)) \, d_W^2(y_i, g_m) \qquad (20)$$

ADBSOM looks for the prototype $g_m$ of the cluster $C_m$ ($1 \le m \le M$) that minimizes the following expression:

$$\sum_{i=1}^{N} K^T(d(f(y_i), m)) \, d_{\Lambda}(y_i, g_m) \qquad (21)$$

Since both problems are additive in the variables and depend on quadratic terms, the description of each prototype $g_{mj}$ as a distribution ($m = 1, \ldots, M$, $j = 1, \ldots, P$) is obtained as the solution of the following minimization problem:

$$\sum_{i=1}^{N} K^T(d(f(y_i), m)) \left[(\bar{y}_{ij} - \bar{y}_{g_{mj}})^2 + d_W^2(y^c_{ij}, g^c_{mj})\right] \longrightarrow \mathrm{Min}. \qquad (22)$$

Thus, for each cluster, setting to zero the partial derivatives with respect to $\bar{y}_{g_{mj}}$ and $Q^c_{g_{mj}}$ [33], the quantile function of the probability density function (pdf) describing $g_{mj}$ is obtained as follows:

$$Q_{g_{mj}} = Q^c_{g_{mj}} + \bar{y}_{g_{mj}} = \frac{\sum\limits_{i=1}^{N} K^T(d(f(y_i), m)) \, Q^c_{ij}}{\sum\limits_{i=1}^{N} K^T(d(f(y_i), m))} + \frac{\sum\limits_{i=1}^{N} K^T(d(f(y_i), m)) \, \bar{y}_{ij}}{\sum\limits_{i=1}^{N} K^T(d(f(y_i), m))}. \qquad (23)$$
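A minimal sketch of the representation step of Eq. (23) for one variable and one neuron: since the same kernel weights multiply both the centered quantile functions and the means, the prototype reduces to the kernel-weighted average of the quantile functions (names and data layout are assumptions of this sketch).

```python
import numpy as np

def update_prototype(Q, kern):
    """Representation step, Eq. (23), for one variable and one neuron m.
    Q    : array (N, K), quantile functions of the N objects on a common grid.
    kern : array (N,), values K^T(d(f(y_i), m)) for the N objects.
    The kernel-weighted average of the quantile functions equals the sum of
    the weighted centered part and the weighted mean part of Eq. (23)."""
    w = kern / kern.sum()      # normalized neighborhood weights
    return w @ Q               # prototype quantile function Q_{g_mj}
```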
The weighting step

The aim of this section is to provide, for a fixed radius $T$, the optimal solution for the relevance weights of the distributional-valued variables during the training of the ADBSOM algorithms. In the weighting step of ADBSOM, for a fixed partition $\mathcal{P}$ and fixed prototypes in the matrix $\mathbf{G}$, the objective function $J_{ADBSOM}$ is minimized with respect to the weights, i.e., the elements of the matrix $\mathbf{\Lambda}$.

Proposition 1. The relevance weights are calculated as follows, depending on the adaptive squared $L_2$ Wasserstein distance:

(P1) If
$$J_{ADBSOM}(\mathbf{G}, \mathbf{\Lambda}, \mathcal{P}) = \sum_{m=1}^{M} \sum_{i=1}^{N} \sum_{j=1}^{P} K^T(d(f(y_i), m)) \, \lambda_j \, d_W^2(y_{ij}, g_{mj})$$
subject to $\prod_{j=1}^{P} \lambda_j = 1$, $\lambda_j > 0$, then the $P$ relevance weights are derived as follows:

$$\lambda_j = \frac{\left\{\prod\limits_{r=1}^{P} \left[\sum\limits_{m=1}^{M} \sum\limits_{i=1}^{N} K^T(d(f(y_i), m)) \, d_W^2(y_{ir}, g_{mr})\right]\right\}^{\frac{1}{P}}}{\sum\limits_{m=1}^{M} \sum\limits_{i=1}^{N} K^T(d(f(y_i), m)) \, d_W^2(y_{ij}, g_{mj})} \qquad (24)$$

(P2) If
$$J_{ADBSOM}(\mathbf{G}, \mathbf{\Lambda}, \mathcal{P}) = \sum_{m=1}^{M} \sum_{i=1}^{N} \sum_{j=1}^{P} K^T(d(f(y_i), m)) \, (\lambda_{j,M} \, dM_{im,j} + \lambda_{j,V} \, dV_{im,j})$$
subject to $\prod_{j=1}^{P} (\lambda_{j,M} \cdot \lambda_{j,V}) = 1$, $\lambda_{j,M} > 0$, $\lambda_{j,V} > 0$, then the $2 \times P$ relevance weights are derived as follows:

$$\lambda_{j,M} = \frac{\left\{\prod\limits_{r=1}^{P} \left[\sum\limits_{m=1}^{M} \sum\limits_{i=1}^{N} K^T(d(f(y_i), m)) \, dM_{im,r}\right] \left[\sum\limits_{m=1}^{M} \sum\limits_{i=1}^{N} K^T(d(f(y_i), m)) \, dV_{im,r}\right]\right\}^{\frac{1}{2P}}}{\sum\limits_{m=1}^{M} \sum\limits_{i=1}^{N} K^T(d(f(y_i), m)) \, dM_{im,j}},$$
and
$$\lambda_{j,V} = \frac{\left\{\prod\limits_{r=1}^{P} \left[\sum\limits_{m=1}^{M} \sum\limits_{i=1}^{N} K^T(d(f(y_i), m)) \, dM_{im,r}\right] \left[\sum\limits_{m=1}^{M} \sum\limits_{i=1}^{N} K^T(d(f(y_i), m)) \, dV_{im,r}\right]\right\}^{\frac{1}{2P}}}{\sum\limits_{m=1}^{M} \sum\limits_{i=1}^{N} K^T(d(f(y_i), m)) \, dV_{im,j}}. \qquad (25)$$

(P3) If
$$J_{ADBSOM}(\mathbf{G}, \mathbf{\Lambda}, \mathcal{P}) = \sum_{m=1}^{M} \sum_{i=1}^{N} \sum_{j=1}^{P} K^T(d(f(y_i), m)) \, \lambda_{mj} \, d_W^2(y_{ij}, g_{mj})$$
subject to the $M$ constraints $\prod_{j=1}^{P} \lambda_{mj} = 1$, $\lambda_{mj} > 0$, then the $M \times P$ relevance weights are derived as follows:

$$\lambda_{mj} = \frac{\left\{\prod\limits_{r=1}^{P} \left[\sum\limits_{i=1}^{N} K^T(d(f(y_i), m)) \, d_W^2(y_{ir}, g_{mr})\right]\right\}^{\frac{1}{P}}}{\sum\limits_{i=1}^{N} K^T(d(f(y_i), m)) \, d_W^2(y_{ij}, g_{mj})} \qquad (26)$$

(P4) If
$$J_{ADBSOM}(\mathbf{G}, \mathbf{\Lambda}, \mathcal{P}) = \sum_{m=1}^{M} \sum_{i=1}^{N} \sum_{j=1}^{P} K^T(d(f(y_i), m)) \, (\lambda_{mj,M} \, dM_{im,j} + \lambda_{mj,V} \, dV_{im,j})$$
subject to the $M$ constraints $\prod_{j=1}^{P} (\lambda_{mj,M} \cdot \lambda_{mj,V}) = 1$, $\lambda_{mj,M} > 0$, $\lambda_{mj,V} > 0$, then the $2 \times M \times P$ relevance weights are derived as follows:

$$\lambda_{mj,M} = \frac{\left\{\prod\limits_{r=1}^{P} \left[\sum\limits_{i=1}^{N} K^T(d(f(y_i), m)) \, dM_{im,r}\right] \left[\sum\limits_{i=1}^{N} K^T(d(f(y_i), m)) \, dV_{im,r}\right]\right\}^{\frac{1}{2P}}}{\sum\limits_{i=1}^{N} K^T(d(f(y_i), m)) \, dM_{im,j}}$$
and
$$\lambda_{mj,V} = \frac{\left\{\prod\limits_{r=1}^{P} \left[\sum\limits_{i=1}^{N} K^T(d(f(y_i), m)) \, dM_{im,r}\right] \left[\sum\limits_{i=1}^{N} K^T(d(f(y_i), m)) \, dV_{im,r}\right]\right\}^{\frac{1}{2P}}}{\sum\limits_{i=1}^{N} K^T(d(f(y_i), m)) \, dV_{im,j}}. \qquad (27)$$

Proof. In the weighting step, we assume that the prototypes $\mathbf{G}$ and the partition $\mathcal{P}$ are fixed. The matrix of relevance weights $\mathbf{\Lambda}$ is obtained according to one of the four above-mentioned constraints. The minimization of $J_{ADBSOM}$ is done by the method of Lagrange multipliers. The four constraints lead to the following Lagrangian functions:

$$(P1): \; \mathcal{L} = J_{ADBSOM}(\mathbf{G}, \mathbf{\Lambda}, \mathcal{P}) - \theta \left[\prod_{j=1}^{P} \lambda_j - 1\right]; \qquad (28)$$

$$(P2): \; \mathcal{L} = J_{ADBSOM}(\mathbf{G}, \mathbf{\Lambda}, \mathcal{P}) - \theta \left[\prod_{j=1}^{P} (\lambda_{j,M} \cdot \lambda_{j,V}) - 1\right]; \qquad (29)$$

$$(P3): \; \mathcal{L} = J_{ADBSOM}(\mathbf{G}, \mathbf{\Lambda}, \mathcal{P}) - \sum_{m=1}^{M} \theta_m \left[\prod_{j=1}^{P} \lambda_{mj} - 1\right]; \qquad (30)$$

$$(P4): \; \mathcal{L} = J_{ADBSOM}(\mathbf{G}, \mathbf{\Lambda}, \mathcal{P}) - \sum_{m=1}^{M} \theta_m \left[\prod_{j=1}^{P} (\lambda_{mj,M} \cdot \lambda_{mj,V}) - 1\right]. \qquad (31)$$

Setting to zero the partial derivatives of $\mathcal{L}$ with respect to the $\lambda$'s and the $\theta$'s, a system of first-order conditions is obtained, whose solution corresponds to the elements of the matrix $\mathbf{\Lambda}$.

The assignment step

The aim of this section is to give the optimal partition into the clusters associated with the neurons of the SOM in the assignment step of the DBSOM and ADBSOM algorithms. With the prototypes (the elements of the matrix $\mathbf{G}$) fixed, the objective function $J_{DBSOM}$ is minimized with respect to the partition $\mathcal{P}$, and each data unit $y_i$ is assigned to its nearest prototype (BMU).

Proposition 2. The objective function $J_{DBSOM}$ is minimized with respect to the partition $\mathcal{P}$ when the clusters $C_m$ ($m = 1, \ldots, M$) are computed as:

$$C_m = \{y_i \in D : f(y_i) = m = \arg\min_{1 \le r \le M} d^T(y_i, g_r)\} \qquad (32)$$

Proof.
Since the matrix of prototypes $\mathbf{G}$ is fixed, the objective function $J_{DBSOM}$ can be rewritten as:

$$J_{DBSOM}(\mathcal{P}) = \sum_{i=1}^{N} d^T(y_i, g_{f(y_i)})$$

Note that if, for each $y_i \in D$, $d^T(y_i, g_{f(y_i)})$ is minimized, then the criterion $J_{DBSOM}(\mathcal{P})$ is also minimized. The matrix of prototypes $\mathbf{G}$ being fixed, $d^T(y_i, g_{f(y_i)})$ is minimized if $f(y_i) = m = \arg\min_{1 \le r \le M} d^T(y_i, g_r)$, i.e., if $C_m = \{y_i \in D : f(y_i) = m = \arg\min_{1 \le r \le M} d^T(y_i, g_r)\}$ ($m = 1, \ldots, M$).

In the assignment step of ADBSOM, for a fixed matrix of prototypes $\mathbf{G}$ and matrix of weights $\mathbf{\Lambda}$, the objective function $J_{ADBSOM}$ is minimized with respect to the partition $\mathcal{P}$, and each data unit $y_i$ is assigned to its nearest prototype.

Proposition 3. The objective function $J_{ADBSOM}$ is minimized with respect to the partition $\mathcal{P}$ when the clusters $C_m$ ($m = 1, \ldots, M$) are computed as:

$$C_m = \{y_i \in D : m = f(y_i) = \arg\min_{1 \le r \le M} d^T_{\Lambda}(y_i, g_r)\} \qquad (33)$$

Proof.
The proof is similar to that of Proposition 2.
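To close the derivations with something executable, here is a hedged numpy sketch of the closed-form weighting-step solutions for the two global schemes, Eq. (24) for (P1) and Eq. (25) for (P2); the cluster-wise schemes (P3)-(P4) are identical except that the dispersions are accumulated per neuron. The aggregated dispersions are assumed to be precomputed, and the function names are illustrative.

```python
import numpy as np

def weights_P1(Delta):
    """Eq. (24): Delta[j] = sum_m sum_i K^T(d(f(y_i), m)) d_W^2(y_ij, g_mj).
    Returns the P weights, whose product is 1 (constraint P1)."""
    g = np.exp(np.mean(np.log(Delta)))     # (prod_r Delta_r)^(1/P), geometric mean
    return g / Delta

def weights_P2(DeltaM, DeltaV):
    """Eq. (25): DeltaM[j], DeltaV[j] are the component-wise dispersions.
    Returns 2P weights with prod(lam_M * lam_V) = 1 (constraint P2)."""
    g = np.exp(np.mean(np.log(np.concatenate([DeltaM, DeltaV]))))
    return g / DeltaM, g / DeltaV          # lam_M, lam_V
```

The product-to-one constraints are satisfied by construction, since each weight is the geometric mean of all the dispersions divided by its own dispersion.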
The algorithms

Algorithm 1 summarizes the batch SOM algorithms DBSOM and ADBSOM for distributional-valued data. In the initialization step, the matrix of prototypes $\mathbf{G}$ is first initialized by randomly choosing $M$ objects of the distributional-valued dataset; then, the entries of the matrix of relevance weights $\mathbf{\Lambda}$ are set to 1 (each component or each variable is assumed to have the same relevance). Besides, using the initializations of $\mathbf{G}$ and $\mathbf{\Lambda}$, the remaining objects are assigned to the cluster represented by the nearest prototype (BMU) to produce the initial partition $\mathcal{P}$. Finally, from the initializations of $\mathbf{G}$, $\mathbf{\Lambda}$ and $\mathcal{P}$, and with $T \leftarrow T_{max}$, the algorithm calls ITERATIVE-FUNCTION-1, which provides the initial SOM map and the corresponding new initial values of $\mathbf{G}$, $\mathbf{\Lambda}$ and $\mathcal{P}$.

In the iterative steps, at each iteration a new radius $T$ is computed and, for this new radius, the algorithm calls ITERATIVE-FUNCTION-1, which alternates once the two (DBSOM) or three (ADBSOM) steps updating the matrix of prototypes $\mathbf{G}$, the matrix of relevance weights $\mathbf{\Lambda}$ and the partition $\mathcal{P}$.

In the final iteration, when the radius $T$ is equal to $T_{min}$, only a few neurons belong to the neighborhood of the winner neurons and the SOM algorithm behaves similarly to k-means. The algorithm calls ITERATIVE-FUNCTION-2, which alternates the two (DBSOM) or three (ADBSOM) steps until the assignments no longer change. The final iteration provides the final update of $\mathbf{G}$, $\mathbf{\Lambda}$ and $\mathcal{P}$.
3. Applications
In this section, we present an application of the proposed algorithms using two real-world datasets, which are used for assessing the algorithms' performance with labeled and with unlabeled data. All the applications have been executed using the HistDAWass package in R [30].
Before launching a SOM algorithm, some choices have to be made.

Algorithm 1: DBSOM and ADBSOM algorithms
Input: the distributional-valued data set D = {y_1, ..., y_N}; the number M of neurons (clusters); the map size; the kernel function K^T; the dissimilarity d; the initial radius T_max and the final radius T_min; the number N_iter of iterations.
Output: the SOM map; the partition P; the matrix G of prototypes; the matrix Λ of weights.

INITIALIZATION:
  Set t ← 0; T ← T_max;
  Initialize the matrix of prototypes G^(0): select randomly M distinct prototypes g_r^(0) ∈ D (r = 1, ..., M);
  Initialize the matrix of relevance weights: set Λ^(0) = 1;
  Initialize the partition P^(0): assign each object y_i to the closest prototype g_r (BMU) according to r = f^(0)(y_i) = arg min_{1≤m≤M} d^T(y_i, g_m^(0));
  ITERATIVE-FUNCTION-1(D, t = 0, G^(0), Λ^(0), P^(0), T)

ITERATIVE STEPS:
repeat
  Set t ← t + 1;
  Compute T = T_max (T_min / T_max)^(t / N_iter);
  Set P^(t) ← P^(t-1); Λ^(t) ← Λ^(t-1); G^(t) ← G^(t-1);
  ITERATIVE-FUNCTION-1(D, t, G^(t), Λ^(t), P^(t), T)
until t == N_iter − 1

FINAL ITERATION:
  Set t ← N_iter; Set T ← T_min;
  Set P^(t) ← P^(t-1); Λ^(t) ← Λ^(t-1); G^(t) ← G^(t-1);
  ITERATIVE-FUNCTION-2(D, t, G^(t), Λ^(t), P^(t), T)

function ITERATIVE-FUNCTION-1(D, t, G^(t), Λ^(t), P^(t), T)
  Step 1: Representation
    For DBSOM, P^(t) is fixed; for ADBSOM, Λ^(t) and P^(t) are both fixed;
    Compute G^(t) using equation (23);
  Step 2: Weighting (ADBSOM only)
    Compute Λ^(t) using the suitable equation among Eq. (24) to Eq. (27);
  Step 3: Assignment
    For DBSOM, G^(t) is fixed; for ADBSOM, G^(t) and Λ^(t) are both fixed;
    for 1 ≤ i ≤ N do
      Let w = f^(t)(y_i)
      DBSOM: let z = arg min_{1≤m≤M} d^T(y_i, g_m^(t))
      ADBSOM: let z = arg min_{1≤m≤M} d^T_Λ(y_i, g_m^(t))
      if w ≠ z then f^(t)(y_i) = z end if
    end for
  return G^(t), Λ^(t), P^(t)
end function

function ITERATIVE-FUNCTION-2(D, t, G^(t), Λ^(t), P^(t), T)
repeat
  test ← 0
  Step 1: Representation (as in ITERATIVE-FUNCTION-1)
  Step 2: Weighting (ADBSOM only, as in ITERATIVE-FUNCTION-1)
  Step 3: Assignment
    For DBSOM, G^(t) is fixed; for ADBSOM, G^(t) and Λ^(t) are both fixed;
    for 1 ≤ i ≤ N do
      Let w = f^(t)(y_i)
      DBSOM: let z = arg min_{1≤m≤M} d^T(y_i, g_m^(t))
      ADBSOM: let z = arg min_{1≤m≤M} d^T_Λ(y_i, g_m^(t))
      if w ≠ z then f^(t)(y_i) = z; test ← 1 end if
    end for
until test == 0
return G^(t), Λ^(t), P^(t)
end function
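For concreteness, here is one possible compact numpy rendering of the loop of Algorithm 1 in the non-adaptive (DBSOM) case, simplified by omitting the distinct final iteration and relying on the quantile-grid representation used in the earlier sketches. It is a sketch under those assumptions, not the HistDAWass implementation.

```python
import numpy as np

def dbsom_train(Q, grid_dist, n_iter, T_max, T_min, p, rng):
    """Simplified DBSOM loop (Algorithm 1, non-adaptive case).
    Q         : array (N, P, K), quantile functions on a probability grid p (K,).
    grid_dist : array (M, M), topological distances between the M neurons.
    Returns the prototypes G (M, P, K) and the assignments f (N,)."""
    N, P, K = Q.shape
    M = grid_dist.shape[0]
    G = Q[rng.choice(N, size=M, replace=False)].copy()   # random prototypes
    f = np.zeros(N, dtype=int)
    for t in range(n_iter + 1):
        T = T_max * (T_min / T_max) ** (t / n_iter)      # radius schedule, Eq. (35)
        kern = np.exp(-grid_dist ** 2 / (2 * T ** 2))    # Gaussian kernel, Eq. (34)
        # squared L2 Wasserstein distances between all objects and prototypes
        dW = np.array([[np.trapz((Q[i] - G[m]) ** 2, p).sum()
                        for m in range(M)] for i in range(N)])
        # assignment step: BMU minimizing the generalized distance d^T, Eq. (3)
        f = np.argmin(dW @ kern.T, axis=1)
        # representation step: kernel-weighted mean quantile functions, Eq. (23)
        W = kern[f]                                      # W[i, m] = K^T(d(f(y_i), m))
        G = np.einsum('im,ipk->mpk', W, Q) / W.sum(axis=0)[:, None, None]
    return G, f
```

A call such as `G, f = dbsom_train(Q, grid_dist, 50, T_max, T_min, p, np.random.default_rng(0))` would reproduce one random initialization; per the setup below, the best of several random starts would be retained.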
In this paper, we do not propose a strategy for the selection of the best parameters of the SOM. In the literature [54], some rules of thumb are proposed for the SOM initialization according to some well-known side effects (for example, the propensity of SOM to push all the objects into the corners of the map because of the kernel weighting impact). Hereafter, we list the choices for the parameters and the topologies made before launching the SOM algorithm.

We choose a 2D hexagonal map of $5\sqrt{N}$ neurons [54], where $N$ is the number of input objects. The map is rectangular (the horizontal side of the map is longer than the vertical one), in order to let the map choose its direction of variability. This, in general, mitigates a SOM side effect consisting in assigning objects to the Voronoi sets of the BMUs associated with the corners of the map. Such an effect is more evident when using a dynamic clustering approach. Indeed, the whole SOM algorithm is based on the minimization of the cost function in Eq. (2) or Eq. (10), and these functions are also used for the assignment of the data to the BMUs. To mitigate this effect, we propose the use of toroidal maps [47]. In the following, we present both planar and toroidal map results.

A second set of choices is related to the learning function and to the kernel function. First of all, we use a Gaussian kernel function:

$$K^{(t)}(r, m) = \exp\left(-\frac{d^2(r, m)}{2 \cdot T^2(t)}\right) \qquad (34)$$

where $d^2(r, m)$ is the squared Euclidean distance in the topological space of the neurons between vertices (clusters) $r$ and $m$, $t$ is the generic epoch and $T(t)$ is the kernel width (radius) at epoch $t$. Once the kernel is defined, it is important to choose the number of epochs for learning the map, the initial and final kernel widths (radii), and the rate of decrease of the width (linear or exponential). In the literature [42], for the batch SOM algorithm, it is suggested that the number of epochs for training the map be less than or equal to $N_{iter} = 50$. Fixed an initial value of the kernel width $T_{max} = \sigma(1)$ and a final value $T_{min} = \sigma(N_{iter})$, we use a power-series decreasing function for the kernel width:

$$T(t) = T_{max} \left(\frac{T_{min}}{T_{max}}\right)^{\frac{t}{N_{iter}}}. \qquad (35)$$

As for the initial and final values of the kernel width, the heuristics proposed in the literature are mainly related to the classical SOM assignment (namely, the one performed using the distance between objects and BMU without considering the kernel). Following [54], we fix $T_{max}$ such that two neurons having a distance equal to the radius of the map, namely, half of the maximum topological distance between two neurons, have a kernel value equal to 0.1, and we fix $T_{min}$ such that two neighboring neurons have a kernel value equal to 0.01. Denoting by $d_{Max}$ the diameter of the map in the topological space, namely, the largest topological distance between two neurons of the map, the initial and final values of $T$, considering Eq. (34), are determined as follows:

$$T_{max} = \sqrt{-\frac{(0.5 \cdot d_{Max})^2}{2 \cdot \log 0.1}}, \qquad T_{min} = \sqrt{-\frac{1}{2 \cdot \log 0.01}}. \qquad (36)$$
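In code, Eq. (36) and the schedule of Eq. (35) amount to the following hedged sketch, where `d_max` is the topological diameter of the map and the function names are illustrative:

```python
import numpy as np

def kernel_widths(d_max):
    """Eq. (36): T_max such that two neurons at half the map diameter have
    kernel value 0.1; T_min such that two adjacent neurons (topological
    distance 1) have kernel value 0.01, under the Gaussian kernel of Eq. (34)."""
    T_max = np.sqrt(-(0.5 * d_max) ** 2 / (2 * np.log(0.1)))
    T_min = np.sqrt(-1.0 / (2 * np.log(0.01)))
    return T_max, T_min

def radius_at(t, n_iter, T_max, T_min):
    """Eq. (35): power-series decrease of the kernel width with the epoch t."""
    return T_max * (T_min / T_max) ** (t / n_iter)
```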
The SOM is initialized randomly 20 times and the map with the lowest final cost function is considered as the best run. Considering the proposed SOM algorithms as particular clustering algorithms, we evaluate the output results using internal and, except for unlabeled data, external validity indexes.

Internal validity indexes

For validating the map results, we consider the topographic error of the map, the Silhouette Coefficient as a base validity index for comparing the different algorithms, and the Quality of Partition index developed in [35]. Further, we propose some extensions of the silhouette index for SOM.

A classical validity index for SOM is the topographic error [39], $E_T$. It is a measure of topology preservation and is computed as follows: for each input vector, the respective best and second-best matching units are determined, and an error is counted if the best and second-best matching units of an input vector are not neighbors on the map. The total error is the ratio of the sum of the error counts to the cardinality of the set of input vectors. It ranges from 0 to 1, where 0 means perfect topology preservation. It is obtained as follows:

$$E_T = \frac{1}{N} \sum_{i=1}^{N} u(i), \qquad (37)$$

where

$$u(i) = \begin{cases} 1, & \text{if the best and second-best BMUs of } i \text{ are not adjacent} \\ 0, & \text{otherwise.} \end{cases}$$

We recall that the Silhouette Coefficient was defined for clustering algorithms and corresponds to the average of the silhouette scores $s(i)$ computed for each input data vector. In particular, $s(i)$ [48] is obtained as:

$$s(i) = \frac{b(i) - a(i)}{\max[a(i), b(i)]}$$

where $a(i)$ is the average distance between the input vector $i$ and those assigned to the same cluster (say $A$, the cluster of $i$), while $b(i)$ is the minimum among the average distances computed from $i$ to the input vectors assigned to each cluster other than the one to which $i$ belongs (say $B$, the second-best cluster of $i$). The Silhouette Coefficient is:

$$S = \frac{1}{N} \sum_{i=1}^{N} s(i). \qquad (38)$$

The Silhouette Coefficient $S$ ranges from $-1$ to $1$. Its computation is $O(N^2)$ for a generic distance; in the case of (squared) Euclidean distances, it is possible to show that it is only $O(N)$ (see the appendix for the proof). To reduce the computational time of the index, a simplified version of the Silhouette Coefficient was also proposed by [8], where the distances are computed with respect to the prototypes of the clusters only. We also use this index, denoting it by $S_C$.

Considering the SOM solution and the remarks about the topographic error, we note that the obtained SOM map contains neurons whose prototypes are generally similar if the neurons are adjacent. This could reduce the Silhouette Coefficient even if the map is a good representation of the cluster structure of the input data. To adapt the Silhouette index to the SOM, we propose to combine the advantages of the Silhouette Coefficient in revealing a good cluster structure with those related to the topographic error. We thus propose to modify the computation of $b(i)$, taking as the second-best cluster of $i$ a cluster of input vectors that is not adjacent to the neuron of the BMU of $i$. The Silhouette score of each $i$ is calculated as above, but $b(i)$ is obtained without considering the neurons adjacent to $A$ (namely, the BMU of $i$). We propose the same strategy for the $S_C$ index, and we denote the two new coefficients by $S_E$ and $S_{CE}$, respectively. We remark that, in the case of ADBSOM, the distances used for the computation of $a(i)$ and $b(i)$ are the adaptive distances used in the algorithm.
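A small sketch of the topographic error of Eq. (37), assuming a precomputed matrix of input-to-prototype dissimilarities and a boolean adjacency matrix of the grid (both assumptions of this illustration):

```python
import numpy as np

def topographic_error(D, adjacency):
    """Topographic error E_T of Eq. (37).
    D         : array (N, M), dissimilarities between inputs and prototypes.
    adjacency : boolean array (M, M), True if two neurons are adjacent.
    u(i) = 1 when the best and second-best matching units are not adjacent."""
    order = np.argsort(D, axis=1)
    bmu, second = order[:, 0], order[:, 1]
    return float(np.mean(~adjacency[bmu, second]))
```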
External validity indexes

Let $N$ be the number of instances of a data table, $\mathcal{P} = \{C_1, C_2, \ldots, C_M\}$ the clusters obtained by a clustering algorithm and $\mathcal{P}' = \{C'_1, C'_2, \ldots, C'_K\}$ the classes of the labeled instances. Generally, $M$ is required to be equal to $K$, but in our case we can also consider $M \ne K$. Let $C_m$ (resp. $C'_k$) be the instances of cluster $m$ (resp. of class $k$), and $a_m = |C_m|$ (resp. $b_k = |C'_k|$) its cardinality. Let $n_{mk} = |C_m \cap C'_k|$ be the number of instances of cluster $m$ belonging to class $k$. For evaluating the results of the algorithms on datasets with labeled instances, we use three external validity indexes: the Adjusted Rand Index (ARI) [28], the purity (Pur) [45] and the Normalized Mutual Information (NMI) [46].
The ARI index [28] is widely used for assessing the concordance between an a priori partition and the partition provided by the algorithm; the index varies between $-1$ and $1$, where $1$ indicates a perfect agreement between the two partitions:

$$ARI = \frac{\sum_{mk} \binom{n_{mk}}{2} - \left[\sum_{m} \binom{a_m}{2} \sum_{k} \binom{b_k}{2}\right] / \binom{N}{2}}{\frac{1}{2}\left[\sum_{m} \binom{a_m}{2} + \sum_{k} \binom{b_k}{2}\right] - \left[\sum_{m} \binom{a_m}{2} \sum_{k} \binom{b_k}{2}\right] / \binom{N}{2}}. \qquad (39)$$
The NMI index [46] is defined as:

$$NMI = \frac{I(\mathcal{P}, \mathcal{P}')}{\left[H(\mathcal{P}) + H(\mathcal{P}')\right] / 2} \qquad (40)$$

where $I$ is the mutual information

$$I(\mathcal{P}, \mathcal{P}') = \sum_{m} \sum_{k} \frac{n_{mk}}{N} \log \frac{N \, n_{mk}}{a_m \, b_k}$$

and $H$ are the entropies:

$$H(\mathcal{P}) = -\sum_{m} \frac{a_m}{N} \log \frac{a_m}{N} \quad \text{and} \quad H(\mathcal{P}') = -\sum_{k} \frac{b_k}{N} \log \frac{b_k}{N}.$$
ARI and NMI are maximal when the number of classes equals the number of clusters. This can be problematic when evaluating SOM results: considering the SOM as a clustering algorithm, the number of clusters is frequently greater than the number of a priori classes. We propose to consider these measures together with the purity index, namely, another external validity index that also accounts for this possibility.

The purity index ($pur$) measures the homogeneity of the clusters with respect to the a priori partition. The index is calculated as follows:

$$pur = \frac{1}{N} \sum_{m} \max_{k} \, (n_{mk}) \qquad (41)$$

It evaluates, over all the clusters, the fraction of labeled instances of the majority class in each cluster. It varies in $[0, 1]$, where 1 indicates that all the clusters are pure, namely, that each contains labeled instances of one class only. However, $pur$ has a major drawback: it overestimates the quality of a clustering with a large number of clusters, which is the typical situation of a SOM output partition.

Thus, we propose to read the two sets of external indexes together, following these guidelines: if a SOM has both a high value of $pur$ and high values of NMI and ARI, the number of non-empty clusters (namely, neurons that are BMUs for at least one instance) is close to the number of a priori classes; if NMI and ARI are relatively low while the $pur$ index is high, the a priori class-labeled instances are shared among a set of neurons identifying pure clusters. Obviously, if NMI, ARI and $pur$ are all low, the resulting SOM is less able to recognize the a priori class structure.

In the remainder of the applications, in the tables we denote the algorithms respectively by St., the classic BSOM algorithm with each variable standardized using the standard deviation for distributional variables; P1, the ADBSOM algorithm with the automatic detection of relevance weights for each variable of the whole dataset; P2, the ADBSOM algorithm with the automatic detection of relevance weights for the components of each variable included in the dataset; P3, the ADBSOM algorithm with the automatic detection of relevance weights for each variable and each neuron; and P4, the ADBSOM algorithm with relevance weights automatically detected for the components of each variable and each neuron.
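A hedged sketch of the external indexes of Eqs. (40)-(41), computed from the contingency table $n_{mk}$ (the ARI of Eq. (39) is analogous and omitted for brevity); the function names are illustrative.

```python
import numpy as np

def purity(n):
    """Purity, Eq. (41); n is the (M, K) contingency table n[m, k] = |C_m ∩ C'_k|."""
    return n.max(axis=1).sum() / n.sum()

def nmi(n):
    """Normalized Mutual Information, Eq. (40)."""
    N = n.sum()
    pm = n.sum(axis=1) / N                 # cluster proportions a_m / N
    pk = n.sum(axis=0) / N                 # class proportions  b_k / N
    pmk = n / N
    mask = pmk > 0                         # skip empty cells (0 log 0 = 0)
    I = np.sum(pmk[mask] * np.log(pmk[mask] / np.outer(pm, pk)[mask]))
    Hm = -np.sum(pm[pm > 0] * np.log(pm[pm > 0]))
    Hk = -np.sum(pk[pk > 0] * np.log(pk[pk > 0]))
    return I / ((Hm + Hk) / 2)
```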
The first dataset comes from an experiment on the recognition of people performing Daily and Sports Activities, publicly available from the UCI repository [14]. In particular, the raw data consist of the triaxial gyroscope and accelerometer measurements of five sensors (two on the arms, two on the legs, and one on the thorax) of eight people performing 19 different activities for 5 minutes each [1]. Each 5-minute session of activity is represented by sixty 5-second time windows, described by the histograms of the 125 measurements recorded in each time window. In this case, considering that the records are labeled according to the person and the activity, we show how the proposed maps are able to represent the different activities or people (for a specific activity) using some external validity indexes.

The second dataset describes the population pyramids of 228 countries of the world observed in 2014. Considering that a population pyramid is the description of the age distribution of the male and the female components of a population, the dataset is described by only two distributional variables: the male age distribution and the female age distribution. We will refer to this dataset as the "AGE PYRAMIDS" dataset. The AGE PYRAMIDS dataset does not contain indications about the cluster structure. In this case, to compare the algorithms, we use some internal validity measures: the silhouette index [48], the topographic error of the map [39] and the quality of partition index proposed in [13].
The dataset considered here can be downloaded from the University of California Irvine machine learning repository (http://archive.ics.uci.edu/ml/). It collects data on 19 activities performed by 8 different people in sessions of 5 minutes. Table 1 shows the description of the people involved, and Table 2 lists the monitored activities. The research group that collected the dataset has extensively used it to compare classification algorithms [1] and classification software packages [3], and to study inter-subject and inter-activity variability [4].

The data are collected by means of five MTx 3-DOF units, manufactured by Xsens Technologies. The sensor units are calibrated to acquire data at a sampling frequency of 25 Hz. Each unit has a tri-axial accelerometer, a tri-axial gyroscope, and a tri-axial magnetometer. The sensor units are placed on the arms, the legs and the thorax of the subject's body. We do not consider the magnetometer sensors.

Table 1: Description of the 8 individuals in the experiment.

ID  gender  age  height (cm)  weight (kg)
1   F       25   170          63
2   F       20   162          54
3   M       30   185          78
4   M       25   182          78
5   M       26   183          77
6   F       23   165          50
7   F       21   167          57
8   M       24   175          75
Table 2: List of the 19 activities in the data set.

The full dataset is thus a table of 5-second windows (60 windows × 8 people × 19 activities), where each row is a window of 5 seconds of a given person performing one of the activities.
Walking on a treadmill at 4 km/h in flat position
Using the activity recognition dataset, we selected one of the activities and analyzed whether it is possible to recover a class structure in the data. In particular, we want to test the ability of our algorithms to discriminate people doing the same activity. Among the 19 activities, a preliminary exploration showed that the activity Walking on a treadmill at 4 km/h in flat position exhibits a particular differentiation among people. In this case, the subset is a table of 14,400 histograms, having 480 rows, namely, sixty 5-second windows for each of the 8 people, and 30 columns, namely, the measurements of the 5 tri-axial sensors recording accelerations (accelerometers) and angular speeds (gyroscopes).
In Fig. 1, we show the plot of the individuals, namely, the time windows recorded for all the people, on the first factorial plane obtained by a PCA performed using the method proposed in [52]. The first factorial plane explains 40.50% of the total inertia, and the 60 time windows of each person are contained in a convex polygon that is colored differently for each person. Since there are 30 variables, the percentage of inertia explained by the first factorial plane is not very high. However, we notice that the eight people appear quite separated and, in particular, we can see that the females are on the right of the plane, while the males are on the left, except for person 6 (a woman who presents a particular pattern showing two different ways of performing the activity during the five-minute session).

In this case, we run the five SOM algorithms initializing two types of maps: a planar and a toroidal one. The map is a hexagonal grid with 16 neurons on its longer side; the size of the sides of the map has been chosen such that the number of neurons is close to $5\sqrt{480} \approx 110$ and each side has an even number of neurons (this is required for a toroidal map using a hexagonal grid). The main validity indexes are reported in Tab. 3. The external validity indexes assume that the reference labels are the ones identifying the people. In Tab. 3, the results show that the toroidal SOMs have lower topographic errors and are more compact than the planar maps. Looking at the external validity indexes, there are no substantial differences.
Figure 1: Activity Recognition dataset: PCA of the people while walking on a treadmill at 4 km/h in flat position; first factorial plane (Comp. 1: 27.32% of inertia; 40.50% of total explained inertia).
The Normalized Mutual Information index appears slightly better for the toroidal maps (except for the P3 algorithm), while the obtained non-empty Voronoi sets of the neurons are very pure. The Adjusted Rand indexes are not very high, but this is due to the number of obtained non-empty Voronoi sets differing from the number of classes. The internal validity indexes also confirm the better compactness of the clusters identified by the Voronoi sets associated with the BMUs for the toroidal map with respect to the planar one.

Figs. 2 and 3 show the counts of the Voronoi sets (according to the intensity of the colors) of the neurons of the map obtained using the P4 algorithm on the toroidal map, namely, the map having the lowest topographic error and the best values of the internal validity indexes, and of the map obtained from the P3 algorithm on a planar map, namely, the one having the highest ARI and NMI indexes. Each neuron is labeled according to the labels of the objects contained in its Voronoi set. We observe that the only neuron with two labels is the second one on the bottom-left of the planar map. Looking at the maps, we observe how the pushing-toward-the-edges effect is evident for the planar map, while, for the toroidal map, the clusters of neurons are more evident and separated. This suggests that the topographic error, together with the internal validity indexes, can give good hints for deciding which map is more explicative of the class structure of this kind of dataset.

In the remainder of this paragraph, we continue to describe the results of the P4 algorithm with a toroidal map (the authors can supply the R code and detailed results as supplementary material or on request).

Figure 2: Activity Recognition dataset, walking on a treadmill at 4 km/h in flat position: P4 toroidal SOM count map (neurons labeled by the person, e.g. 4-M or 1-F, whose windows fall in their Voronoi sets; color intensity encodes counts).
Figure 3: Activity Recognition dataset, walking on a treadmill at 4 km/h in flat position: P3 planar SOM count map.

16 × 8 planar map
      Internal validity indexes              External validity indexes
Met.  S       S_E     S_C     S_CE    E_T    ARI     NMI     Pur
St.   0.3065  0.4306  0.5914  0.6888  0.1812 0.4872  0.7391  1.0000
P1    0.3588  0.4769  0.6132  0.7224  0.0750 0.5748  0.7630  1.0000
P2    0.3354  0.5181  0.5823  0.7380  0.2167 0.5646  0.7650  1.0000
P3    0.5519  0.6889  0.7300  0.8358  0.1292 0.7054  0.8305  0.9958
P4    0.2810  0.5384  0.5658  0.7546  0.0521 0.5894  0.7974  1.0000

16 × 8 toroidal map
      Internal validity indexes              External validity indexes
Met.  S       S_E     S_C     S_CE    E_T    ARI     NMI     Pur
St.   0.2888  0.5886  0.5221  0.7644  0.0229 0.4528  0.7636  1.0000
P1    0.3903  0.7308  0.6045  0.8594  0.0021 0.4979  0.7798  1.0000
P2    0.3954  0.6818  0.5981  0.8180  0.0062 0.6077  0.8194  1.0000
P3    0.3587  0.7154  0.5958  0.8538  0.0062 0.5567  0.7918  0.9958
P4    0.4043  0.7564  0.5823  0.8633  0.0021 0.5322  0.8063  1.0000
Table 3: Activity Recognition dataset, walking on a treadmill at 4 km/h in flat position: validity indexes (E_T denotes the topographic error). External validity indexes assume people as labels.

Since the P4 algorithm assigns a relevance weight to each variable for each neuron, it is interesting to observe which variables are more relevant for the SOM. In Fig. 4, we show the box-plots of the logarithms of the relevance weights. We use box-plots since each variable may have a different weight for each neuron of the map; for the sake of readability, the box-plots are ordered according to the median value of the relevance weights. Generally, it is interesting to note that the thorax gyroscope variables assume the greatest weights (the box-plots on the top-left side of the plots), while the lowest values are generally assumed by the accelerometer measures.

Figure 4: Activity Recognition dataset, walking on a treadmill at 4 km/h in flat position: box-plots of the logarithms of the weights for each component of the variables, sorted according to the median weights. M denotes the average component and D the dispersion one; TO, RA, LA, RL and LL denote the positions of the sensors (thorax, right arm, left arm, right leg and left leg).

It is also interesting to observe on the map which neurons give some components a greater relevance with respect to the others. Fig. 5 reports the maps of the relevance weights for two (of the 60) components, namely the average component of the gyroscope measurements on the y axis of the sensor positioned on the thorax (M.TO_ygyr) and the average component of the gyroscope measurements on the z axis of the sensor positioned on the left arm (M.LA_zgyr). The choice is motivated by the fact that, looking at Fig. 4, M.TO_ygyr is the component having the highest median relevance weight among the neurons, while M.LA_zgyr has the highest variability. To allow the reader a more immediate reading of the results, the count map of Fig. 2 is replicated at the top of Fig. 5.

It is worth noting that M.TO_ygyr, which is related to the torsion of the thorax from left to right, is more relevant for male than for female people; this is because the torso of a male differs from that of a female, and this impacts the rotational change on the y axis. Looking at the map for M.LA_zgyr, the relevance is higher for people 1, 2, 7 and 8, namely three females and one male. This is related to the forward and backward movement of the left arm while walking, and the variability in the importance of this component across people may be caused, for example, by their handedness. Indeed, it ranges from positive (namely, highly relevant) log values to negative (namely, lowly relevant) ones. Other interpretations could be drawn observing the other variables but, for the sake of brevity, we do not go further; still, we confirmed that the use of relevance weights may enrich the interpretation of the results.
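To give an idea of how such weights arise (a sketch based on our reading of the adaptive-distance literature cited in the paper, not the paper's full derivation): with a product-to-one constraint on the weights of the P components of a cluster, the optimal weight of a component is the geometric mean of the within-cluster dispersions divided by the dispersion of that component, so components that are compact around the prototype, hence discriminant, get weights above 1 (positive logarithms).

# Delta[j]: dispersion of component j within a cluster (e.g. the sum of
# squared L2 Wasserstein component distances to the prototype).
# The returned weights satisfy prod(lambda) == 1.
relevance_weights <- function(Delta) {
  exp(mean(log(Delta))) / Delta   # (prod_j Delta_j)^(1/P) / Delta_j
}
relevance_weights(c(0.2, 1.0, 2.5))  # compact components get larger weights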
Figure 5: Activity Recognition dataset, walking on a treadmill at 4 km/h in flat position, results of the P4 algorithm using a toroidal map. On the top, the count map. At the center, the map of the logarithms of the relevance weights for M.TO_ygyr, the component having the highest median relevance. At the bottom, the map of the logarithms of the relevance weights for M.LA_zgyr, the component having the highest variability of the relevance weights among the neurons.
The second application evaluates the effectiveness of the proposed algorithms on a dataset recording the population age-sex pyramids of 228 World countries. The dataset is provided by the U.S. Census Bureau (2014) and is included in the HistDAWass package developed in R (https://cran.r-project.org/package=HistDAWass). Demographers usually distinguish three prototypical shapes of population pyramids: expansive (a broad base, typical of young and growing populations), constrictive (a narrowing base), and stationary (roughly even proportions across the age classes); Fig. 6 sketches these three types.

Figure 6: Types of population pyramid (Constrictive, Stationary, Expansive); each panel reports the percent of population by sex (Male/Female).
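Such pyramids are distributional-valued data and, as in the rest of the paper, they are compared through the L2 Wasserstein metric between quantile functions. A minimal base-R sketch (the function name is ours) of the squared distance between two empirical samples:

# Squared L2 Wasserstein distance between two empirical distributions,
# approximated on a grid of p probability levels.
wass2_sq <- function(x, y, p = 1000) {
  t <- (seq_len(p) - 0.5) / p                  # probability levels in (0,1)
  qx <- quantile(x, probs = t, names = FALSE)  # quantile function of x
  qy <- quantile(y, probs = t, names = FALSE)  # quantile function of y
  mean((qx - qy)^2)                            # approximates int_0^1 (Qx - Qy)^2 dt
}
wass2_sq(rnorm(500, 30, 10), rnorm(500, 35, 12))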
We remark that the data are not labeled according to these three prototypical models. Using the proposed Batch SOM algorithms, we want to see whether such prototypical situations arise in the data and how well the map represents the data structure and the importance of the variables in the map generation.

Also in this case, we ran the five SOM algorithms initializing two types of maps, a planar and a toroidal one, on hexagonal grids whose side sizes were chosen such that the number of neurons is close to 5√N and each side has an even number of neurons (this is required for the toroidal map using a hexagonal grid). Two map sizes were considered: a medium-sized hexagonal map (6 × 6) and a large-sized one (8 × 10). The main validity indexes are reported in Tab. 4.

Looking at Tab. 4, the algorithms using adaptive distances generally obtain lower topographic errors than the standard one; S_E and S_CE also show an average better compactness and separation in the clustering structure induced by the SOM with adaptive distances. In particular, for this dataset, the P4 algorithm, namely the one assigning weights to the components of each variable for each cluster, returns the lowest topographic error and shows more compactness for the clusters defined by the Voronoi sets of the prototypes associated with the neurons.

Taking the best results in terms of topographic error, Fig. 7 shows the maps with the counts and the ISO 3166-1 alpha-3 country codes of the data attracted by each BMU for the P4 algorithm, both in the planar and in the toroidal case. As shown in Fig. 7, the planar map tends to push the data toward the corners and the borders of the map, leaving a central empty zone that is not well justified by the data. Indeed, this dataset describes population structures of countries that are not really clustered, but rather lie at one of the stages described in Fig. 6. Thus, we consider the toroidal description of the data more coherent than the planar one, showing a sort of transition path from countries with an Expansive type of population (namely, the ones in the top-left corner) toward Constrictive ones (namely, the ones in the middle of the path) and, finally, to Stationary ones.

Figure 7: Age Pyramid dataset, P4 SOM algorithm: maps of the counts and of the codes of the countries in each Voronoi set of the BMUs (top: planar SOM; bottom: toroidal SOM).

8 × 10 planar map
      Internal validity indexes
Met.  S       S_E     S_C     S_CE    E_T
St.   0.3432  0.6134  0.6670  0.8262  0.2368
P1    0.2085  0.5186  0.6176  0.7949  0.1886
P2    0.2688  0.5634  0.6505  0.8181  0.1842
P3    0.3027  0.6060  0.6490  0.8283  0.1360
P4    0.2884  0.6419  0.6572  0.8459  0.0965

8 × 10 toroidal map
      Internal validity indexes
Met.  S       S_E     S_C     S_CE    E_T
St.   0.2820  0.6340  0.6367  0.8371  0.1360
P1    0.3179  0.6252  0.6437  0.8257  0.1228
P2    0.3424  0.6889  0.6474  0.8580  0.1316
P3    0.2129  0.6307  0.5954  0.8312  0.0921
P4    0.3951  0.7417  0.6880  0.8841  0.0570

Table 4: Age Pyramids dataset: validity indexes.

This type of pattern is better visualized in Fig. 8, which represents the population pyramid associated with the prototype of each BMU of the toroidal SOM (namely, a particular type of codebook map for data described by two distributional variables, where each prototype is drawn as two juxtaposed smoothed frequency distributions). In Fig. 8, the three population structures are quite evident, starting from the expansive model on the top-left side of the map, passing through the constrictive model at the center, until the stationary model on the bottom-right. An interesting zone is the bottom of the map: looking at Fig. 7, we observe that it represents the Arab states of the Persian Gulf, where the population distributions present some particular patterns related to the history and the economy of the region. What is clear is that a path from the expansive to the constrictive and stationary models arises, confirming the demographic theories that consider these models as three phases of an evolutionary path of a population.

Figure 8: Age Pyramid dataset. Toroidal SOM using P4 algorithm: prototypes map.

Finally, Fig. 9 shows the maps of the logarithms of the relevance weights associated with the Voronoi sets of the neurons. It is interesting to note that, for example, for the Persian Gulf countries the variability components of the male and female population age distributions are the most relevant, while, in general, the average component of the male age distribution is the most important in the cluster definition, especially in the top-left zone of the map, where the passage from a constrictive to a stationary shaped population occurs.

Figure 9: Age Pyramid dataset. Toroidal SOM using P4 algorithm: maps of the logarithms of the relevance weights for the four components M.Male.Population, D.Male.Population, M.Female.Population and D.Female.Population.
4. Conclusions
This paper proposed two Batch Self-Organizing Map (SOM) methods for distributional-valued data: DBSOM, which extends the classical Batch SOM algorithm to distributional data, and ADBSOM, which enriches the Batch SOM algorithm with adaptive distances. The DBSOM and ADBSOM training algorithms are based on the optimization of objective functions, performed in alternating steps: two for DBSOM and three for ADBSOM.

In the representation step, the DBSOM and ADBSOM algorithms compute the prototypes (vectors of distributions) of the clusters related to the neurons. They are obtained consistently with the optimal solution of the objective functions, using the L2 Wasserstein distance between pairs of vectors of distributions.

The main contribution with respect to the classical batch SOM algorithm is overcoming the classical assumption that all the variables have the same relevance for training the SOM. Indeed, it is well known that some variables are less relevant than others in the clustering process. In particular, using the L2 Wasserstein distance between one-dimensional distributions, we have extended the use of SOM to data described by distributions, and we have introduced adaptive distances in the SOM algorithm according to four different strategies. This particular choice of adaptive distances, differently from other variable weighting schemes, does not require tuning further parameters.

The weighting step of ADBSOM calculates the relevance weights of the distributional-valued variables. This is achieved conforming to the optimal solution provided in the paper. The squared L2 Wasserstein distance between two distributions can be decomposed into two components: a squared Euclidean distance between the means and an L2 Wasserstein distance between the centered quantile functions associated with the distributions.
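In formulas, denoting by $\mu_f$ the mean of a distribution $f$ and by $Q_f$ its quantile function, the decomposition recalled above reads:
$$ d_W^2(f,g) \;=\; (\mu_f - \mu_g)^2 \;+\; \int_0^1 \big[ Q^c_f(t) - Q^c_g(t) \big]^2 \, dt, \qquad Q^c(t) = Q(t) - \mu , $$
where the first addend compares only the locations of the two distributions, while the second compares their centered quantile functions, that is, their variability and shape.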
This second component allows taking into account the variability and the shape of the distributions. Relevance weights are automatically learned for each of these components to emphasize the importance of the different characteristics (mean, variability, and shape) of the distributions. The relevance weights are calculated for each cluster or for the whole partition, such that their product is equal to one. Besides, the ADBSOM algorithm takes into consideration new sets of constraints.

Finally, the second step of DBSOM and the third step of ADBSOM compute the partition of the data on the neurons of the SOM; this is also achieved conforming to the optimal solution provided in the paper.

DBSOM and ADBSOM were evaluated experimentally on two real distributional-valued datasets. ADBSOM outperformed DBSOM on these datasets, especially using toroidal maps; this corroborates the importance of the weighting step of ADBSOM. We have observed that planar maps, together with the optimized criterion, suffer from a pushing-toward-the-border effect on the data; this is quite evident in the applications to real data. We proposed to overcome this limit of planar maps by using toroidal maps, which have shown a better topology preservation (namely, the topographic error is lower than the one of planar maps when all the other map parameters are the same).

From the applications on real data, we have also observed that the use of adaptive distances generally reduces the topographic error and overcomes the problems related to the scaling of the variables in the preprocessing step. Moreover, the ADBSOM algorithms that calculate the relevance weights of the distributional variables for each cluster had the best performance in terms of topology preservation and in terms of compactness and separation of the clusters induced by the Voronoi sets associated with the neurons.

As a supplementary contribution, we introduced two new Silhouette indexes for SOM, combining the idea behind the topographic error with the clustering structure of the SOM, and we showed how to compute a Silhouette index exactly and more efficiently when data are described in a Euclidean space (see the Appendix).

Finally, the usefulness of the DBSOM and ADBSOM algorithms has been highlighted by their applications to the Human Behavior Recognition and World countries population pyramids distributional-valued datasets.
Acknowledgments
The first author would like to thank the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq, grant 303187/2013-1) for its partial financial support.
APPENDIX: Fast computation of Silhouette index for Euclidean spaces
Let us consider a table $Y$ containing $N$ numerical data vectors $y_i = [y_{i1}, \ldots, y_{iP}]$ of $P$ components. Without loss of generality, the $N$ vectors are clustered into two groups, namely $A$ and $B$, having, respectively, sizes $n_A$ and $n_B$ such that $n_A + n_B = N$. Let $\bar{y}_{Aj} = n_A^{-1} \sum_{y_i \in A} y_{ij}$ and $\bar{y}_{Bj} = n_B^{-1} \sum_{y_i \in B} y_{ij}$, $j = 1, \ldots, P$, be the two cluster means, and $SSE_{Aj} = \sum_{y_i \in A} y_{ij}^2 - n_A (\bar{y}_{Aj})^2$ and $SSE_{Bj} = \sum_{y_i \in B} y_{ij}^2 - n_B (\bar{y}_{Bj})^2$ the two sums of squared deviations from the respective cluster means.

The average squared Euclidean distance between a point and all the other points of the set containing it.
Let us consider $y_i \in A$; the average squared distance of $y_i$ with respect to all the other vectors in $A$ is computed as follows:
$$ (n_A - 1)^{-1} \sum_{y_k \in A} \sum_{j=1}^{P} (y_{ij} - y_{kj})^2 . $$
It is easy to show, for a single variable $j$, that:
$$ \sum_{y_k \in A} (y_{ij} - y_{kj})^2 = n_A (y_{ij})^2 + \sum_{y_k \in A} (y_{kj})^2 - 2 y_{ij} \sum_{y_k \in A} y_{kj} = n_A (y_{ij})^2 + \left[ SSE_{Aj} + n_A (\bar{y}_{Aj})^2 \right] - 2 n_A y_{ij} \bar{y}_{Aj} = n_A (y_{ij} - \bar{y}_{Aj})^2 + SSE_{Aj} . $$
Consequently:
$$ (n_A - 1)^{-1} \sum_{y_k \in A} (y_{ij} - y_{kj})^2 = \frac{n_A (y_{ij} - \bar{y}_{Aj})^2}{n_A - 1} + \frac{SSE_{Aj}}{n_A - 1} . $$

The average squared Euclidean distance of a point from all the points of a set not containing it.
Let us consider $y_i \in A$ and suppose we want to compute the average distance of $y_i$ with respect to all the vectors in $B$. The average squared Euclidean distance between $y_i$ and all the vectors in $B$ is given by:
$$ n_B^{-1} \sum_{y_k \in B} \sum_{j=1}^{P} (y_{ij} - y_{kj})^2 . $$
For a single variable $j$, it is easy to show that:
$$ \sum_{y_k \in B} (y_{ij} - y_{kj})^2 = n_B (y_{ij})^2 + \left[ SSE_{Bj} + n_B (\bar{y}_{Bj})^2 \right] - 2 n_B y_{ij} \bar{y}_{Bj} = n_B (y_{ij} - \bar{y}_{Bj})^2 + SSE_{Bj} . $$
Then, the average distance is:
$$ n_B^{-1} \sum_{y_k \in B} (y_{ij} - y_{kj})^2 = (y_{ij} - \bar{y}_{Bj})^2 + \frac{SSE_{Bj}}{n_B} . $$

The Silhouette index.
As stated above, the Silhouette index for a single $y_i$ is given by:
$$ s(i) = \frac{b(i) - a(i)}{\max\left[ a(i), b(i) \right]} $$
where, in the case of two groups $A$ and $B$, if $y_i \in A$:
$$ a(i) = (n_A - 1)^{-1} \sum_{y_k \in A} \sum_{j=1}^{P} (y_{ij} - y_{kj})^2 \quad \text{and} \quad b(i) = n_B^{-1} \sum_{y_k \in B} \sum_{j=1}^{P} (y_{ij} - y_{kj})^2 . $$
If we consider the original formulation, computing the $N$ silhouette indexes has a computational complexity of the order of $O(N^2)$, if $N$ is sufficiently larger than $P$. If we use the formulas suggested here, once the means and the $SSE$s are computed (which costs $O(N)$), the computational cost is approximately $O(N)$. In fact, it is possible to compute $a(i)$ and $b(i)$ as follows:
$$ a(i) = \sum_{j=1}^{P} \left[ \frac{n_A (y_{ij} - \bar{y}_{Aj})^2}{n_A - 1} + \frac{SSE_{Aj}}{n_A - 1} \right] \quad \text{and} \quad b(i) = \sum_{j=1}^{P} \left[ (y_{ij} - \bar{y}_{Bj})^2 + \frac{SSE_{Bj}}{n_B} \right] . $$
Considering that the squared L2 Wasserstein distance is a Euclidean distance between quantile functions, that the $SSE$ computed for distribution functions is defined as a sum of squared distances, and that the average distributions are Fréchet means with respect to the squared L2 Wasserstein distance, the same simplification can be applied for computing the Silhouette coefficient when data are distributions compared through the squared L2 Wasserstein distance.
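A compact base-R sketch of the two-cluster computation (function and argument names are ours; it mirrors the formulas above and can be cross-checked against a brute-force $O(N^2)$ implementation):

# Fast silhouette for two clusters A and B using squared Euclidean distances.
# YA, YB: numeric matrices, rows = observations, columns = the P variables.
fast_silhouette <- function(YA, YB) {
  nA <- nrow(YA); nB <- nrow(YB)
  mA <- colMeans(YA); mB <- colMeans(YB)
  sseA <- sum(sweep(YA, 2, mA)^2)   # total SSE of cluster A
  sseB <- sum(sweep(YB, 2, mB)^2)   # total SSE of cluster B
  s_one <- function(Y, mOwn, sseOwn, nOwn, mOth, sseOth, nOth) {
    a <- (nOwn * rowSums(sweep(Y, 2, mOwn)^2) + sseOwn) / (nOwn - 1)
    b <- rowSums(sweep(Y, 2, mOth)^2) + sseOth / nOth
    (b - a) / pmax(a, b)
  }
  c(s_one(YA, mA, sseA, nA, mB, sseB, nB),   # s(i) for the points of A
    s_one(YB, mB, sseB, nB, mA, sseA, nA))   # s(i) for the points of B
}

For distributional data, the same code applies once each observation is replaced by its quantile function discretized on a common grid of probability levels, since the squared L2 Wasserstein distance is then (approximately) a squared Euclidean distance between these vectors.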
References

[1] K. Altun, B. Barshan, and O. Tunçel. Comparative study on classifying human activities with miniature inertial and magnetic sensors. Pattern Recognition, 43(10):3605–3620, 2010.
[2] F. Badran, M. Yacoub, and S. Thiria. Neural Networks: methodology and applications, chapter Self-organizing maps and unsupervised classification, pages 379–442. Springer, Singapore, 2005.
[3] B. Barshan and M. C. Yuksek. Recognizing Daily and Sports Activities in Two Open Source Machine Learning Environments Using Body-Worn Sensor Units. The Computer Journal, 57(11):1649–1667, 2014.
[4] B. Barshan and A. Yurtman. Investigating Inter-Subject and Inter-Activity Variations in Activity Recognition Using Wearable Motion Sensors. The Computer Journal, 59(9):1345–1362, 2016.
[5] H. H. Bock. Clustering algorithms and Kohonen maps for symbolic data. J. Jpn. Soc. Comp. Statist., 15:1–13, 2002.
[6] H. H. Bock and E. Diday. Analysis of Symbolic Data. Exploratory Methods for Extracting Statistical Information from Complex Data. Springer, Berlin, 2000.
[7] G. Cabanes, Y. Bennani, R. Destenay, and A. Hardy. A new topological clustering algorithm for interval data. Pattern Recognition, 46:3030–3039, 2013.
[8] R. J. G. B. Campello and E. R. Hruschka. A fuzzy extension of the silhouette width criterion for cluster analysis. Fuzzy Sets and Systems, 157(21):2858–2875, 2006.
[9] F. A. T. De Carvalho and R. M. C. R. De Souza. Unsupervised pattern recognition models for mixed feature-type symbolic data. Pattern Recognition Letters, 31:430–443, 2010.
[10] F. A. T. De Carvalho and Y. Lechevallier. Partitional clustering algorithms for symbolic interval data based on single adaptive distances. Pattern Recognition, 42(7):1223–1236, 2009.
[11] F. A. T. De Carvalho and Y. Lechevallier. Dynamic clustering of interval-valued data based on adaptive quadratic distances. Trans. Sys. Man Cyber. Part A, 39(6):1295–1306, 2009.
[12] F. A. T. De Carvalho, A. Irpino, and R. Verde. Fuzzy clustering of distribution-valued data using an adaptive L2 Wasserstein distance. In Fuzzy Systems (FUZZ-IEEE), 2015 IEEE International Conference on, pages 1–8, 2015. doi: 10.1109/FUZZ-IEEE.2015.7337847.
[13] F. A. T. de Carvalho, P. Bertrand, and E. C. Simões. Batch SOM algorithms for interval-valued data with automatic weighting of the variables. Neurocomputing, 182:66–81, 2016.
[14] D. Dheeru and E. Karra Taniskidou. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
[15] E. Diday and G. Govaert. Classification automatique avec distances adaptatives. R.A.I.R.O. Informatique Computer Science, 11(4):329–349, 1977.
[16] E. Diday and J. C. Simon. Clustering analysis. In K. S. Fu, editor, Digital Pattern Classification, pages 47–94. Springer, Berlin, 1976.
[17] P. D'Urso and L. De Giovanni. Midpoint radius self-organizing maps for interval-valued data with telecommunications application. Applied Soft Computing, 11:3877–3886, 2011.
[18] J. H. Friedman and J. J. Meulman. Clustering objects on subsets of attributes. Journal of the Royal Statistical Society, Serie B, 66:815–849, 2004.
[19] H. Frigui and O. Nasraoui. Unsupervised learning of prototypes and attribute weights. Pattern Recognition, 37(3):567–581, 2004.
[20] W. Gilchrist. Statistical Modelling with Quantile Functions. CRC Press, Abingdon, 2000.
[21] C. Hajjar and H. Hamdan. Self-organizing map based on city-block distance for interval-valued data. In Complex Systems Design and Management - CSDM 2011, pages 281–292, 2011.
[22] C. Hajjar and H. Hamdan. Self-organizing map based on Hausdorff distance for interval-valued data. In IEEE International Conference on Systems, Man, and Cybernetics - SMC 2011, pages 1747–1752, 2011.
[23] C. Hajjar and H. Hamdan. Self-organizing map based on L2 distance for interval-valued data. In , pages 317–322, 2011.
[24] C. Hajjar and H. Hamdan. Interval data clustering using self-organizing maps based on adaptive Mahalanobis distances. Neural Networks, 46:124–132, 2013.
[25] A. Hardy and P. Lallemand. Clustering of Symbolic Objects Described by Multi-Valued and Modal Variables, pages 325–332. Springer Berlin Heidelberg, Berlin, Heidelberg, 2004.
[26] J. Z. Huang, M. K. Ng, H. Rong, and Z. Li. Automated variable weighting in k-means type clustering. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(5):657–668, 2005.
[27] J. Z. Huang, M. K. Ng, H. Rong, and Z. Li. Automated variable weighting in k-means type clustering. IEEE Trans. Pattern Anal. Mach. Intell., 27(5):657–668, 2005.
[28] L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, 1985.
[29] D. Hulse, S. Gregory, and J. Baker. Willamette River Basin: trajectories of environmental and ecological change. Oregon State University Press, 2002.
[30] A. Irpino. HistDAWass: Histogram Data Analysis using Wasserstein distance, 2018. R package version 1.0.1.
[31] A. Irpino and E. Romano. Optimal histogram representation of large data sets: Fisher vs piecewise linear approximation. Revue des Nouvelles Technologies de l'Information, RNTI-E-9:99–110, 2007.
[32] A. Irpino and R. Verde. A new Wasserstein based distance for the hierarchical clustering of histogram symbolic data. In V. Batagelj et al., editors, Data Science and Classification, pages 185–192. Springer, Berlin, 2006.
[33] A. Irpino and R. Verde. Basic statistics for distributional symbolic variables: a new metric-based approach. Advances in Data Analysis and Classification, 9(2):143–175, 2015.
[34] A. Irpino, R. Verde, and F. A. T. de Carvalho. Batch self organizing maps for interval and histogram data. In International Conference on Computational Statistics - COMPSTAT 2012, pages 143–154, 2012.
[35] A. Irpino, R. Verde, and F. A. T. De Carvalho. Dynamic clustering of histogram data based on adaptive squared Wasserstein distances. Expert Systems with Applications, 41(7):3351–3366, 2014.
[36] A. Irpino, R. Verde, and F. A. T. De Carvalho. Fuzzy clustering of distributional data with automatic weighting of variable components. Information Sciences, 406-407:248–268, 2017.
[37] J. Kim and L. Billard. A polythetic clustering process and cluster validity indexes for histogram-valued objects. Computational Statistics and Data Analysis, 55(7):2250–2262, 2011.
[38] J. Kim and L. Billard. Dissimilarity measures for histogram-valued observations. Communications in Statistics - Theory and Methods, 42(2):283–303, 2013.
[39] K. Kiviluoto. Topology preservation in self-organizing maps. In Neural Networks, 1996, IEEE International Conference on, volume 1, pages 294–299, 1996. doi: 10.1109/ICNN.1996.548907.
[40] T. Kohonen. Self-Organizing Maps. Springer, New York, 1995.
[41] T. Kohonen. Essentials of the self-organizing map. Neural Networks, 37(1):52–65, 2013.
[42] T. Kohonen. MATLAB Implementations and Applications of the Self-Organizing Map. Unigrafia Oy, Helsinki, 2014.
[43] S. Korenjak-Černe and V. Batagelj. Clustering Large Datasets of Mixed Units, pages 43–48. Springer Berlin Heidelberg, Berlin, Heidelberg, 1998.
[44] S. Korenjak-Černe and V. Batagelj. Symbolic Data Analysis Approach to Clustering Large Datasets, pages 319–327. Springer Berlin Heidelberg, Berlin, Heidelberg, 2002.
[45] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.
[46] M. Meila. Comparing clusterings – an information based distance. Journal of Multivariate Analysis, 98(5):873–895, 2007.
[47] N. J. Mount and D. Weaver. Self-organizing maps and boundary effects: quantifying the benefits of torus wrapping for mapping SOM trajectories. Pattern Analysis and Applications, 14(2):139–148, 2011.
[48] P. J. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, 1987.
[49] L. Rüschendorf. Wasserstein metric. In Encyclopedia of Mathematics. Springer, 2001.
[50] T. Terada and H. Yadohisa. Non-hierarchical clustering for distribution-valued data. In Y. Lechevallier and G. Saporta, editors, Proceedings of COMPSTAT 2010, pages 1653–1660. Springer, Berlin, 2010.
[51] R. Verde and A. Irpino. Dynamic clustering of histogram data: using the right metric. In P. Brito et al., editors, Selected contributions in data analysis and classification, pages 123–134. Springer, Berlin, 2008.
[52] R. Verde and A. Irpino. Multiple factor analysis of distributional data. Statistica Applicata, Italian Journal of Applied Statistics, 2018. To appear.
[53] R. Verde, A. Irpino, and Y. Lechevallier. Dynamic clustering of histograms using Wasserstein metric. In A. Rizzi and M. Vichi, editors, Proceedings of COMPSTAT 2006, pages 869–876. Physica Verlag, Heidelberg, 2006.
[54] J. Vesanto, J. Himberg, E. Alhoniemi, and J. Parhankangas. Self-organizing map in Matlab: the SOM Toolbox. In Proceedings of the Matlab DSP Conference, pages 35–40, 1999.
[55] M. Vrac, L. Billard, E. Diday, and A. Chedin. Copula analysis of mixture models.