Incremental Cluster Validity Indices for Hard Partitions: Extensions and Comparative Study

Leonardo Enzo Brito da Silva, Member, IEEE, Niklas M. Melton, Member, IEEE, Donald C. Wunsch II, Fellow, IEEE

PREPRINT SUBMITTED TO ARXIV.ORG
Abstract
Validation is one of the most important aspects of clustering, but most approaches have been batch methods. Recently, interest has grown in providing incremental alternatives. This paper extends the incremental cluster validity index (iCVI) family to include incremental versions of Calinski-Harabasz (iCH), I index and Pakhira-Bandyopadhyay-Maulik (iI and iPBM), Silhouette (iSIL), Negentropy Increment (iNI), Representative Cross Information Potential (irCIP) and Representative Cross Entropy (irH), and Conn Index (iConn Index). Additionally, the effect of under- and over-partitioning on the behavior of these six iCVIs, the Partition Separation (PS) index, as well as two other recently developed iCVIs (incremental Xie-Beni (iXB) and incremental Davies-Bouldin (iDB)), was examined through a comparative study. Experimental results using fuzzy adaptive resonance theory (ART)-based clustering methods showed that while evidence of most under-partitioning cases could be inferred from the behaviors of all these iCVIs, over-partitioning was found to be a more challenging scenario indicated only by the iConn Index. The expansion of incremental validity indices provides significant novel opportunities for assessing and interpreting the results of unsupervised learning.
Index Terms
Clustering, Validation, Incremental Cluster Validity Index (iCVI), Fuzzy, Adaptive Resonance Theory (ART).
L. E. Brito da Silva is with the Applied Computational Intelligence Laboratory, Department of Electrical and Computer Engineering, Missouri University of Science and Technology, Rolla, MO 65409 USA, and also with the CAPES Foundation, Ministry of Education of Brazil, Brasília, DF 70040-020, Brazil (e-mail: [email protected]).
N. M. Melton is with the Applied Computational Intelligence Laboratory, Department of Electrical and Computer Engineering, Missouri University of Science and Technology, Rolla, MO 65409 USA (e-mail: [email protected]).
D. C. Wunsch II is with the Applied Computational Intelligence Laboratory, Department of Electrical and Computer Engineering, Missouri University of Science and Technology, Rolla, MO 65409 USA (e-mail: [email protected]).
February 19, 2019 DRAFT
I. INTRODUCTION
Cluster validation [1] is a critical topic in cluster analysis. It is crucial to assess the quality of the partitions detected by clustering algorithms when there is no class label information. Different clustering solutions may be found by distinct algorithms, or even by the same algorithm subjected to different hyper-parameters or a different input presentation order [2], [3].
Cluster validity indices (CVIs) perform the role of evaluators of such solutions. CVIs typically exhibit a trade-off between measures of compactness (within-cluster scatter) and isolation (between-cluster separation) [2]. Numerous examples of such criteria have been presented in the literature; for comprehensive reviews and experimental studies, the interested reader is referred to [4]–[11].

Recently, incremental cluster validity indices (iCVIs) have been developed to track the effectiveness of online clustering methods over data streams [12]–[15]. To enable cluster validation in such applications, a recursive formulation of compactness was introduced in [12], [13]. This strategy has been used to develop incremental versions of four CVIs so far [15]: viz., incremental Davies-Bouldin (iDB) [12], [13], incremental Xie-Beni (iXB) [12], [13], and modified Dunn's indices [16]. Particularly, the behavior of iXB and iDB is analyzed on both accurately and poorly partitioned data sets in [12], [13], whereas the studies in [14], [15] only investigate the iDB's behavior in cases where online clustering algorithms accurately detect data structures, i.e., when they yield high-performing experimental results.

Therefore, the contributions of this work are three-fold: (1) presenting incremental versions of six additional CVIs (thereby extending the family of iCVIs), (2) discussing the interpretation of these novel iCVIs in cases of accurate, under-, and over-partitioning, and (3) performing a systematic comparative study among ten iCVIs.
To explore such scenarios, fuzzy adaptive resonance theory (ART)-based clustering methods [17], [18] were chosen for their simple parameterization of cluster granularity and other appealing properties [19].

The following, Section II, provides a brief review of CVIs, iCVIs and ART; Section III presents this work's extensions of several other CVIs to the incremental family; Section IV details the set-up used in the numerical experiments; Section V describes and discusses the results; Section VI compares batch and incremental versions of CVIs; and Section VII summarizes this paper's findings.
II. BACKGROUND AND RELATED WORK
This section briefly recaps the theory regarding the CVIs, iCVIs, and ART-based clustering algorithms used in this study.
A. Cluster Validity Indices (CVIs)
Consider a data set $X = \{x_i\}_{i=1}^{N}$ and its hard partition $\Omega = \{\omega_i\}_{i=1}^{k}$ of $k$ disjoint clusters $\omega_i$, such that $\bigcup_{i=1}^{k} \omega_i = X$. In the following CVI overview, $v$ is a cluster prototype (centroid), $k$ is the number of clusters, $d$ is the dimensionality of the data ($x_i \in \mathbb{R}^d$), $\|\cdot\|$ is the Euclidean norm, and $N$ and $n_i$ are the cardinalities of the data set and of cluster $\omega_i$, respectively.
1) Calinski-Harabasz (CH) [20]: the CH index is defined as:

$$CH = \frac{BGSS/(k-1)}{WGSS/(N-k)}, \qquad (1)$$

where the between-group sum of squares (BGSS) and within-group sum of squares (WGSS) are computed as:

$$WGSS = \sum_{i=1}^{k} \sum_{\substack{j=1 \\ x_j \in \omega_i}}^{n_i} \|x_j - v_i\|^2, \qquad (2)$$

$$BGSS = \sum_{i=1}^{k} n_i \|v_i - \mu_{data}\|^2, \qquad (3)$$

$$\mu_{data} = \frac{1}{N} \sum_{i=1}^{N} x_i. \qquad (4)$$

This is an optimization-like criterion [8] such that larger values of CH indicate better clustering solutions (maximization).
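As an illustration, the batch CH of Eqs. (1)–(4) can be sketched in a few lines of Python with NumPy (the function name and array layout are this sketch's assumptions, not the paper's MATLAB implementation):

```python
import numpy as np

def calinski_harabasz(X, labels):
    """Batch CH index (Eq. (1)): [BGSS/(k-1)] / [WGSS/(N-k)].

    X: (N, d) data matrix; labels: length-N integer cluster assignments.
    """
    N, _ = X.shape
    mu = X.mean(axis=0)                              # grand mean, Eq. (4)
    clusters = np.unique(labels)
    k = len(clusters)
    wgss, bgss = 0.0, 0.0
    for c in clusters:
        Xc = X[labels == c]
        v = Xc.mean(axis=0)                          # cluster prototype (centroid)
        wgss += np.sum(np.linalg.norm(Xc - v, axis=1) ** 2)   # Eq. (2)
        bgss += len(Xc) * np.linalg.norm(v - mu) ** 2          # Eq. (3)
    return (bgss / (k - 1)) / (wgss / (N - k))
```

As expected for a maximization criterion, a well-separated labeling scores higher than a shuffled one.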
2) Davies-Bouldin (DB) [21]: the DB index averages the similarities $R_i$ of each cluster $i$ with respect to its maximally similar cluster $j \neq i$:

$$DB = \frac{1}{k} \sum_{i=1}^{k} R_i, \qquad (5)$$

where

$$R_i = \max_{j \neq i} \left( \frac{S_i + S_j}{M_{i,j}} \right), \qquad (6)$$

$$S_l = \left[ \frac{1}{n_l} \sum_{\substack{m=1 \\ x_m \in \omega_l}}^{n_l} \|x_m - v_l\|_q^q \right]^{1/q}, \quad l = \{1, ..., k\}, \qquad (7)$$

$$M_{i,j} = \left[ \sum_{t=1}^{d} |v_{it} - v_{jt}|^p \right]^{1/p}, \quad p \geq 1. \qquad (8)$$

The variables ($p$, $q$) are user-defined parameters, and $S_l$ and $M_{i,j}$ (Minkowski metric) measure compactness and separation, respectively. Smaller values of DB indicate better clustering solutions (minimization).
3) Xie-Beni (XB) [22]: the XB index was originally designed to detect compact and separated clusters in fuzzy c-partitions. A hard partition version is given by the following ratio of compactness to separation [23], [24]:

$$XB = \frac{WGSS/N}{\min\limits_{i \neq j} \left( \|v_i - v_j\|^2 \right)}. \qquad (9)$$

Smaller values of XB indicate better clustering solutions (minimization).
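A minimal hard-partition XB sketch, assuming Eq. (9) with the squared Euclidean separation (names are illustrative):

```python
import numpy as np

def xie_beni(X, labels):
    """Hard-partition XB (Eq. (9)): mean within-cluster scatter divided by the
    minimum squared distance between cluster prototypes."""
    cs = np.unique(labels)
    V = np.array([X[labels == c].mean(axis=0) for c in cs])
    wgss = sum(np.sum(np.linalg.norm(X[labels == c] - V[i], axis=1) ** 2)
               for i, c in enumerate(cs))
    sep = min(np.linalg.norm(V[i] - V[j]) ** 2
              for i in range(len(cs)) for j in range(len(cs)) if i != j)
    return (wgss / len(X)) / sep
```

Since XB is minimized, a correct labeling of well-separated data yields a smaller value than a shuffled one.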
4) Pakhira-Bandyopadhyay-Maulik (PBM) [25], [26]: consider the I index [25] defined as:

$$I = \left( \frac{1}{k} \times \frac{E_1}{E_k} \times D_k \right)^p, \quad p \geq 1, \qquad (10)$$

where

$$E_1 = \sum_{i=1}^{N} \|x_i - \mu_{data}\|, \qquad (11)$$

$$E_k = \sum_{i=1}^{k} \sum_{\substack{j=1 \\ x_j \in \omega_i}}^{n_i} \|x_j - v_i\|, \qquad (12)$$

$$D_k = \max_{i \neq j} \left( \|v_i - v_j\| \right). \qquad (13)$$

The quantities $E_k$ and $D_k$ measure compactness and separation, respectively. This CVI comprises a trade-off among the three competing factors in Eq. (10): $1/k$ decreases with $k$, whereas both $E_1/E_k$ and $D_k$ increase. By setting $p = 2$ in Eq. (10), the I index reduces to the PBM index [26]. Larger values of PBM indicate better clustering solutions (maximization).
5) Silhouette (SIL) [27]: the SIL index is computed by averaging the silhouette coefficients $sc_i$ across all data samples $x_i$:

$$SIL = \frac{1}{N} \sum_{i=1}^{N} sc_i, \qquad (14)$$

where

$$sc_i = \frac{b_i - a_i}{\max(a_i, b_i)}, \qquad (15)$$

$$a_i = \frac{1}{n_i - 1} \sum_{\substack{j=1, j \neq i \\ x_j \in \omega_i}}^{n_i} \|x_j - x_i\|, \qquad (16)$$

$$b_i = \min_{l,\, l \neq i} \left[ \frac{1}{n_l} \sum_{\substack{j=1 \\ x_j \in \omega_l}}^{n_l} \|x_j - x_i\| \right]. \qquad (17)$$

The variables $a_i$ and $b_i$ measure compactness and separation, respectively. Larger values of SIL (close to 1) indicate better clustering solutions (maximization). To reduce computational complexity, some SIL variants, such as [28]–[31], use a centroid-based approach. The simplified SIL [28], [29] has been successfully used in clustering data streams processed in chunks, in which the silhouette coefficients are also used to make decisions regarding the centroids' incremental updates [32].
6) Partition Separation (PS) [33]: the PS index was originally developed for fuzzy clustering; its hard clustering version is given by [34]:

$$PS = \sum_{i=1}^{k} PS_i, \qquad (18)$$

where

$$PS_i = \frac{n_i}{\max_j (n_j)} - \exp\left( - \frac{\min\limits_{j \neq i} \left( \|v_i - v_j\|^2 \right)}{\beta_T} \right), \qquad (19)$$

$$\beta_T = \frac{1}{k} \sum_{l=1}^{k} \|v_l - \bar{v}\|^2, \qquad (20)$$

$$\bar{v} = \frac{1}{k} \sum_{l=1}^{k} v_l. \qquad (21)$$

The PS index only comprises a measure of separation between prototypes. Therefore, this CVI can be readily used to evaluate the partitions identified by unsupervised incremental learners that model clusters using centroids (e.g., [34]). Larger values of PS indicate better clustering solutions (maximization).
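A batch sketch of Eqs. (18)–(21), assuming the squared Euclidean norms of the original PS formulation (function and variable names are ours):

```python
import numpy as np

def partition_separation(v, n):
    """PS index (Eqs. (18)-(21)) from prototypes v (k, d) and cluster sizes n.

    Separation-only, so it suits centroid-based online learners: no raw samples
    are needed beyond the per-cluster counts and prototypes.
    """
    v = np.asarray(v, dtype=float)
    n = np.asarray(n, dtype=float)
    k = len(n)
    vbar = v.mean(axis=0)                                        # Eq. (21)
    beta = np.sum(np.linalg.norm(v - vbar, axis=1) ** 2) / k     # Eq. (20)
    ps = 0.0
    for i in range(k):
        dmin = min(np.linalg.norm(v[i] - v[j]) ** 2
                   for j in range(k) if j != i)
        ps += n[i] / n.max() - np.exp(-dmin / beta)              # Eq. (19)
    return ps
```

Prototypes that crowd together drag their $PS_i$ terms toward zero, lowering the overall score.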
7) Negentropy Increment (NI) [35], [36]: the NI index measures the average normality of the clusters of a given partition $\Omega$ via negentropy [37] while avoiding the direct computation of the clusters' differential entropies. Unlike the other CVIs discussed so far, the NI is not explicitly constructed using measures of compactness and separation [9], [35], thereby being defined as:

$$NI = \frac{1}{2} \sum_{i=1}^{k} p_i \ln |\Sigma_i| - \frac{1}{2} \ln |\Sigma_{data}| - \sum_{i=1}^{k} p_i \ln p_i, \qquad (22)$$

where $|\cdot|$ denotes the determinant. The probabilities ($p$), means ($v$) and covariance matrices ($\Sigma$) are estimated as:

$$p_i = \frac{n_i}{N}, \qquad (23)$$

$$v_i = \frac{1}{n_i} \sum_{\substack{j=1 \\ x_j \in \omega_i}}^{n_i} x_j, \qquad (24)$$

$$\Sigma_i = \frac{1}{n_i - 1} \sum_{\substack{j=1 \\ x_j \in \omega_i}}^{n_i} (x_j - v_i)(x_j - v_i)^T, \qquad (25)$$

$$\Sigma_{data} = \frac{1}{N - 1} \left( X^T X - N \mu_{data} \mu_{data}^T \right), \qquad (26)$$

and $\mu_{data}$ is estimated using Eq. (4). Smaller values of NI indicate better clustering solutions (minimization).
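A batch NI sketch following Eq. (22), with unbiased covariance estimates as in Eqs. (25)–(26) (illustrative only; the determinant regularization the paper introduces later for the incremental case is omitted here):

```python
import numpy as np

def negentropy_increment(X, labels):
    """Batch NI (Eq. (22)): average cluster normality via covariance determinants.

    Smaller values indicate better partitions; mixing distinct clusters inflates
    the per-cluster determinants and therefore the index.
    """
    N = len(X)
    Sx = np.cov(X.T, ddof=1)                      # data covariance, Eq. (26)
    ni = -0.5 * np.log(np.linalg.det(Sx))
    for c in np.unique(labels):
        Xc = X[labels == c]
        p = len(Xc) / N                           # Eq. (23)
        Sc = np.cov(Xc.T, ddof=1)                 # Eq. (25)
        ni += 0.5 * p * np.log(np.linalg.det(Sc)) - p * np.log(p)
    return ni
```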
8) Representative Cross Information Potential (rCIP) [38], [39]: cluster evaluation functions (CEFs) based on the cross information potential (CIP) [40], [41] have been consistently used in the literature to evaluate partitions and drive optimization algorithms searching for data structure [38]–[41]; thus, this work includes these CEFs under the CVI category. Precisely, representative approaches [38], [39] replace the sample-by-sample estimation of Renyi's quadratic entropy [42] using the Parzen-window method [43] (original CIP [40], [41]) with prototypes and the statistics of their associated Voronoi polyhedra. The rCIP was devised for prototype-based clustering (i.e., two-step methods: vector quantization followed by clustering of the prototypes) [44]–[48]. The CEF used here is defined as [39]:

$$CEF = \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} rCIP(\omega_i, \omega_j), \qquad (27)$$

where

$$rCIP(\omega_i, \omega_j) = \frac{1}{M_i M_j} \sum_{l=1}^{M_i} \sum_{m=1}^{M_j} G(v_l - v_m, \Sigma_{l,m}), \qquad (28)$$

$$G(v_l - v_m, \Sigma_{l,m}) = \frac{\exp\left[ -\frac{1}{2} (v_l - v_m)^T \Sigma_{l,m}^{-1} (v_l - v_m) \right]}{\sqrt{(2\pi)^d |\Sigma_{l,m}|}}, \qquad (29)$$

$$\Sigma_{l,m} = \Sigma_l + \Sigma_m, \quad \{v_l, \Sigma_l\} \in \omega_i, \quad \{v_m, \Sigma_m\} \in \omega_j,$$

and $M_i$ and $M_j$ are the number of prototypes used to represent clusters $\omega_i$ and $\omega_j$, respectively. The prototypes and covariance matrices are estimated using Eqs. (24) and (25), respectively. Smaller values of CEF indicate better clustering solutions (minimization). Recently, the information potential (IP) [49] measure has been used to define a system's state when modeling and analyzing dynamic processes [50], [51].
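With one prototype per cluster ($M_i = M_j = 1$, as in this paper's experiments), Eqs. (27)–(29) reduce to a sum of Gaussian evaluations at prototype differences; a hedged sketch (names are ours):

```python
import numpy as np

def r_cip(v1, S1, v2, S2):
    """rCIP between two single-prototype clusters (Eqs. (28)-(29)): a Gaussian
    with covariance S1 + S2 evaluated at the prototype difference."""
    d = len(v1)
    S = S1 + S2                                    # Sigma_{l,m}, after Eq. (29)
    diff = np.asarray(v1, dtype=float) - np.asarray(v2, dtype=float)
    quad = diff @ np.linalg.solve(S, diff)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(S))

def cef(prototypes, covariances):
    """CEF of Eq. (27) with one prototype per cluster: sum of pairwise rCIPs."""
    k = len(prototypes)
    return sum(r_cip(prototypes[i], covariances[i],
                     prototypes[j], covariances[j])
               for i in range(k) for j in range(i + 1, k))
```

Overlapping clusters produce large pairwise rCIP terms, so the minimized CEF penalizes poorly separated prototypes.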
9) Conn Index [52], [53]: the Conn Index was also developed for prototype-based clustering. It is formulated using the connectivity strength matrix (CONN), which is a symmetric square similarity matrix that represents local data densities between neighboring prototypes [54], [55]. Its $(i, j)^{th}$ entry is formally given by:

$$CONN(i, j) = CADJ(i, j) + CADJ(j, i), \qquad (30)$$

where the $(i, j)^{th}$ entry of the non-symmetric cumulative adjacency matrix (CADJ) corresponds to the number of samples for which $v_i$ and $v_j$ are, simultaneously, the first and second closest prototypes (according to some measure), respectively. The Conn Index is defined as:

$$Conn\_Index = Intra\_Conn \times (1 - Inter\_Conn), \qquad (31)$$

where the intra-cluster ($Intra\_Conn$) and inter-cluster ($Inter\_Conn$) connectivities are:

$$Intra\_Conn = \frac{1}{k} \sum_{l=1}^{k} Intra\_Conn(\omega_l), \qquad (32)$$

$$Intra\_Conn(\omega_l) = \frac{1}{n_l} \sum_{\substack{i,j \\ v_i, v_j \in \omega_l}}^{P} CADJ(i, j), \qquad (33)$$

$$Inter\_Conn = \frac{1}{k} \sum_{l=1}^{k} \max_{m,\, m \neq l} \left[ Inter\_Conn(\omega_l, \omega_m) \right], \qquad (34)$$

$$Inter\_Conn(\omega_l, \omega_m) = \frac{\sum\limits_{\substack{i,j \\ v_i \in \omega_l, v_j \in \omega_m}}^{P} CONN(i, j)}{\sum\limits_{\substack{i,j \\ v_i \in V_{l,m}}}^{P} CONN(i, j)}, \qquad (35)$$

$$V_{l,m} = \{v_i : v_i \in \omega_l, \ \exists\, v_j \in \omega_m : CADJ(i, j) > 0\}. \qquad (36)$$

The variable $P$ is the total number of prototypes, and $Inter\_Conn(\omega_l, \omega_m) = 0$ if $V_{l,m} = \emptyset$. Naturally, the quantities $Intra\_Conn$ and $Inter\_Conn$ measure compactness and separation, respectively. Larger values of the Conn Index (close to 1) indicate better clustering solutions (maximization).
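The CADJ bookkeeping behind Eq. (30) amounts to finding the two best-matching prototypes per sample; a minimal sketch (assuming Euclidean matching, which is one possible choice of "some measure"):

```python
import numpy as np

def update_cadj(cadj, prototypes, x):
    """One CADJ step: increment the entry for the first and second closest
    prototypes to x (the counts behind Eq. (30)).

    cadj: (P, P) count matrix, modified in place; prototypes: (P, d) array.
    Returns the (first, second) winner indices.
    """
    dists = np.linalg.norm(prototypes - np.asarray(x, dtype=float), axis=1)
    i, j = np.argsort(dists)[:2]          # best- and second-best-matching units
    cadj[i, j] += 1
    return int(i), int(j)
```

The symmetric CONN of Eq. (30) is then simply `cadj + cadj.T`.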
B. Incremental Cluster Validity Indices (iCVIs)
The compactness and separation terms commonly found in CVIs are generally computed using data samples and prototypes, respectively [12], [14]. In order to handle the demands of online clustering applications (i.e., data streams), an incremental CVI (iCVI) formulation that recursively estimates the compactness term was introduced in [12], [13] in the context of fuzzy clustering. Specifically, consider the hard clustering version of cluster $i$'s compactness $CP$ (i.e., obtained by setting the fuzzy memberships in [12], [13] to binary indicator functions):

$$CP_i = \sum_{\substack{j=1 \\ x_j \in \omega_i}}^{n_i} \|x_j - v_i\|^2. \qquad (37)$$

In such a case, when a new sample $x$ is presented and encoded by cluster $i$, its new compactness becomes:

$$CP_i^{new} = \sum_{\substack{j=1 \\ x_j \in \omega_i}}^{n_i^{new}} \|x_j - v_i^{new}\|^2, \qquad (38)$$

where

$$n_i^{new} = n_i^{old} + 1, \qquad (39)$$

$$v_i^{new} = v_i^{old} + (x - v_i^{old}) / n_i^{new}, \qquad (40)$$

and

$$N^{new} = N^{old} + 1. \qquad (41)$$

The compactness in Eq. (38) can be updated incrementally as [12], [13]:

$$CP_i^{new} = CP_i^{old} + \|z_i\|^2 + n_i^{old} \|\Delta v_i\|^2 + 2 \Delta v_i^T g_i^{old}, \qquad (42)$$

where

$$g_i^{new} = g_i^{old} + z_i + n_i^{old} \Delta v_i, \qquad (43)$$

$$g_i = \sum_{\substack{j=1 \\ x_j \in \omega_i}}^{n_i} (x_j - v_i), \qquad (44)$$

$$z_i = x - v_i^{new}, \qquad (45)$$

$$\Delta v_i = v_i^{old} - v_i^{new}. \qquad (46)$$

The compactness $CP$ and vector $g$ are initialized as $0$ and $\vec{0}$ (since $v = x$), respectively. Note that, at each iteration, the variable $g$ is updated after $CP$. Using such an incremental formulation, the following iCVIs were derived in [12], [13] (their hard partition counterparts are shown here):
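Before presenting them, the recursion of Eqs. (39)–(46) can be sanity-checked against the batch definition of Eq. (38); the class below is an illustrative Python sketch (not the authors' code):

```python
import numpy as np

class IncrementalCompactness:
    """Recursive compactness CP of one cluster (hard-partition Eqs. (37)-(46)).

    Maintains n, the prototype v, CP, and the auxiliary vector g so that CP
    always equals sum_j ||x_j - v||^2 without storing past samples.
    """
    def __init__(self, x):
        x = np.asarray(x, dtype=float)
        self.n = 1
        self.v = x.copy()           # prototype starts at the first sample
        self.cp = 0.0               # compactness of a singleton cluster is 0
        self.g = np.zeros_like(x)   # g = sum_j (x_j - v), Eq. (44)

    def update(self, x):
        x = np.asarray(x, dtype=float)
        n_new = self.n + 1                           # Eq. (39)
        v_new = self.v + (x - self.v) / n_new        # Eq. (40)
        z = x - v_new                                # Eq. (45)
        dv = self.v - v_new                          # Eq. (46)
        # Eq. (42): CP must be updated before g
        self.cp += z @ z + self.n * (dv @ dv) + 2.0 * (dv @ self.g)
        self.g += z + self.n * dv                    # Eq. (43)
        self.n, self.v = n_new, v_new
```

Streaming a data set through `update` reproduces, in exact arithmetic, the batch compactness computed around the final centroid.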
1) incremental Xie-Beni (iXB):

$$XB^{new} = \frac{\frac{1}{N^{new}} \sum\limits_{i=1}^{k^{new}} CP_i^{new}}{\min\limits_{i \neq j} \left( \|v_i^{new} - v_j^{new}\|^2 \right)}, \qquad (47)$$
2) incremental Davies-Bouldin (iDB, based on [56]):

$$DB^{new} = \frac{1}{k^{new}} \sum_{i=1}^{k^{new}} \max_{j,\, j \neq i} \left( \frac{CP_i^{new}/n_i^{new} + CP_j^{new}/n_j^{new}}{\|v_i^{new} - v_j^{new}\|^2} \right). \qquad (48)$$

If a new cluster emerges, then $k^{new} = k^{old} + 1$; otherwise, its previous value is maintained. Note that only one prototype $v$ is updated after each input presentation.

C. Adaptive Resonance Theory (ART)
For this study's experiments, adaptive resonance theory (ART) [57] has been implemented. It is a fast and stable online clustering method with automatic category recognition, encompassing a rich history with many implementations well-suited to iCVI computation [17]–[19], [57]–[72]. The following ART models were used in these experiments.
1) Fuzzy ART [17]: this model implements fuzzy logic [73] to bound data within hyperboxes. For a normalized data set $X = \{x_i\}_{i=1}^{N}$ ($0 \leq x_{i,j} \leq 1$, $j = \{1, ..., d\}$), the fuzzy ART algorithm, with parameters $(\alpha, \beta, \rho)$, is defined by:

$$I = (x_i, 1 - x_i), \qquad (49)$$

$$T_j = \frac{\|\min(I, w_j)\|_1}{\alpha + \|w_j\|_1}, \qquad (50)$$

$$\|\min(I, w_j)\|_1 \geq \rho \|I\|_1, \qquad (51)$$

$$w_j^{new} = w_j^{old} (1 - \beta) + \beta \min(I, w_j^{old}). \qquad (52)$$

Equation (49) is the complement coding function, which concatenates sample $x$ and its complement to form an input vector $I$ with dimension $2d$. Equation (50) is the activation function for each category $j$, where $\|\cdot\|_1$ is the $L_1$ norm, $\min(\cdot)$ is performed component-wise, and $\alpha$ is a tie-breaking constant. Each category is checked for validity against Eq. (51)'s vigilance parameter $\rho$ in descending order of activation. If no valid category is found during training, then a new category is initialized using $I$ as the new weight vector $w$. Otherwise, the winning category is updated according to Eq. (52) using learning rate $\beta$. In this study, when fuzzy ART is set to evaluation mode (learning is disabled), if no valid category is found during search, then the winning category defaults to the highest activated one.
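A compact sketch of the fuzzy ART loop of Eqs. (49)–(52) (fast learning with $\beta = 1$; the function name and default parameter values are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def fuzzy_art(X, alpha=1e-3, beta=1.0, rho=0.75):
    """Minimal fuzzy ART sketch (Eqs. (49)-(52)).

    X: rows already scaled to [0, 1]. Returns (labels, weights).
    """
    W, labels = [], []
    for x in X:
        I = np.concatenate([x, 1.0 - x])                 # complement coding, Eq. (49)
        if not W:                                        # first sample seeds category 0
            W.append(I.copy()); labels.append(0); continue
        T = [np.minimum(I, w).sum() / (alpha + w.sum()) for w in W]   # Eq. (50)
        for j in np.argsort(T)[::-1]:                    # search in activation order
            if np.minimum(I, W[j]).sum() >= rho * I.sum():            # vigilance, Eq. (51)
                W[j] = (1 - beta) * W[j] + beta * np.minimum(I, W[j]) # Eq. (52)
                labels.append(int(j)); break
        else:
            W.append(I.copy()); labels.append(len(W) - 1)  # no resonance: new category
    return labels, W
```

With complement coding, $\|I\|_1 = d$ for every sample, so the vigilance test compares a category's match against a fixed budget $\rho d$.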
2) Fuzzy self-consistent modular ART (SMART) [18]: this model is a hierarchical clustering technique based on the ARTMAP architecture [17]. In an ARTMAP network, two ART modules, the A- and B-side, are supplied with separate but dependent data streams. Both ART modules can cluster according to local topology and parameters while an inter-ART module enforces a surjective mapping of the A-side onto the B-side, effectively learning the functional map of the A-side to the B-side categories.

To build a fuzzy SMART module, it is only necessary to stream the same sample to both the A- and B-sides of a fuzzy ARTMAP module, i.e., to use fuzzy ARTMAP in an auto-associative mode. If all else is equal in the A and B modules' parameters, fuzzy SMART will begin to form a two-level self-consistent cluster hierarchy when $\rho_A > \rho_B$. This hierarchy will be required to extend the iCVI study to prototype-based CVIs such as the Conn Index. For such CVIs, the A-side categories act as cluster prototypes while the B-side provides the actual data partition.

III. EXTENSIONS OF ICVIS

To compute the CVIs mentioned in Section II-A incrementally, employing one of the following approaches is sufficient:

1) The recursive computation of compactness developed in [12], [13] (CVIs: CH, I/PBM, and SIL).
2) The incremental computation of probabilities, means, and covariance matrices (CVIs: rCIP and NI). Naturally, if the clustering algorithm of choice already models the clusters using a priori probabilities, means, and covariance matrices (such as Gaussian ART [65] and Bayesian ART [68]), then, similarly to PS, these CVIs can be readily computed.
3) The incremental building of a multi-prototype representation of clusters in a self-consistent two-level hierarchy while tracking the density-based connections between neighboring prototypes (CVI: Conn Index). Specifically, increment and/or expand the CADJ and CONN matrices as clusters grow and/or are dynamically created.

In the following iCVI extensions (iCH, iI/iPBM, iSIL, irCIP, iNI, and iConn Index), if a new cluster is formed after sample $x$ is presented, then the number of clusters is $k^{new} = k^{old} + 1$, the number of samples encoded by this cluster is $n_{k^{new}}^{new} = 1$, the cluster's prototype is set to $v_{k^{new}}^{new} = x$, the initial compactness is $CP_{k^{new}}^{new} = 0$, and vector $g_{k^{new}}^{new} = \vec{0}$ (unless otherwise noted). Naturally, clusters that do not encode the presented sample retain constant parameter values for the duration of that input presentation. Also note that, when necessary, the Euclidean norm is replaced with the squared Euclidean norm (i.e., $\|\cdot\|^2$) to allow for the recursive computation of compactness $CP$ (as per [12], [13]). Finally, for iCVIs that require the computation of pairwise (dis)similarity between prototypes, the (dis)similarity matrix is kept in memory, where only the rows and columns corresponding to the prototype that is adapted are modified.

A. Incremental Calinski-Harabasz index (iCH)
The iCH computation is defined as:

$$CH^{new} = \frac{\sum\limits_{i=1}^{k^{new}} SEP_i^{new}}{\sum\limits_{i=1}^{k^{new}} CP_i^{new}} \times \frac{N^{new} - k^{new}}{k^{new} - 1}, \qquad (53)$$

where

$$SEP_i^{new} = n_i^{new} \|v_i^{new} - \mu_{data}^{new}\|^2. \qquad (54)$$

Note that the variables $\{n_1, ..., n_k\}$, $\{v_1, ..., v_k\}$, $\{CP_1, ..., CP_k\}$, $\{g_1, ..., g_k\}$, $\mu_{data}$, $k$, $N$, and $\{SEP_1, ..., SEP_k\}$ are all kept in memory. These are updated using Eqs. (39) to (43), except for $SEP$, which is adapted using Eq. (54). The data mean $\mu_{data}$ is updated similarly to the prototypes $v$ (i.e., Eq. (40)).

B. Incremental I index (iI)
The iI computation is defined as:

$$I^{new} = \left[ \frac{\max\limits_{i \neq j} \left( \|v_i^{new} - v_j^{new}\| \right)}{\sum\limits_{i=1}^{k^{new}} CP_i^{new}} \times \frac{CP^{new}}{k^{new}} \right]^p, \qquad (55)$$

where $CP$ and $\sum_{i=1}^{k} CP_i^{new}$ correspond to $E_1$ and $E_k$, respectively. These are updated according to Eqs. (39) to (43) along with the remaining compactness variables. Only the pairwise distances with respect to the updated prototype at any given iteration need to be recomputed.

C. Incremental Silhouette index (iSIL)
The SIL index is inherently batch (offline), since it requires the entire data set to be computed (the silhouette coefficients are averaged across all data samples in Eq. (14)). To remove such a requirement and enable incremental updates, a hard version of the centroid-based SIL variant introduced in [30] is employed here, along with the squared Euclidean norm (i.e., $\|\cdot\|^2$); this is done in order to employ the recursive formulation of the compactness in Eq. (42). Consider the matrix $S_{k \times k}$, where $k$ prototypes $v_i$ are used to compute the centroid-based SIL (instead of the $N$ samples $x_i$, which, by definition, are discarded after each presentation in online mode). Define each entry $s_{i,j} = D(v_i, \omega_j)$ (dissimilarity of $v_i$ to cluster $\omega_j$) of $S_{k \times k}$ as:

$$s_{i,j} = \frac{1}{n_j} \sum_{\substack{l=1 \\ x_l \in \omega_j}}^{n_j} \|x_l - v_i\|^2 = \frac{1}{n_j} CP(v_i, \omega_j), \qquad (56)$$

where $i = \{1, ..., k\}$ and $j = \{1, ..., k\}$. The silhouette coefficients can be obtained from the entries of $S_{k \times k}$ as:

$$sc_i = \frac{\min\limits_{l,\, l \neq J} (s_{i,l}) - s_{i,J}}{\max\left[ s_{i,J}, \min\limits_{l,\, l \neq J} (s_{i,l}) \right]}, \quad v_i \in \omega_J, \qquad (57)$$

where $a_i = s_{i,J}$ and $b_i = \min_{l,\, l \neq J} (s_{i,l})$.

At first, when examining Eq. (56), one might be tempted to store a $k \times k$ matrix of compactness entries along with accompanying vectors $g$ (one for each entry) to enable incremental updates of each element of the matrix $S_{k \times k}$; this approach, however, may lead to unnecessarily large memory requirements. A more careful examination shows that it is sufficient to simply redefine $CP$ and $g$ for each cluster $i$ ($i = \{1, ..., k\}$) as:

$$CP_i = \sum_{\substack{j=1 \\ x_j \in \omega_i}}^{n_i} \|x_j - \vec{0}\|^2 = \sum_{\substack{j=1 \\ x_j \in \omega_i}}^{n_i} \|x_j\|^2, \qquad (58)$$

$$g_i = \sum_{\substack{j=1 \\ x_j \in \omega_i}}^{n_i} (x_j - \vec{0}) = \sum_{\substack{j=1 \\ x_j \in \omega_i}}^{n_i} x_j, \qquad (59)$$

which is equivalent to fixing $v = \vec{0}$. Therefore, their incremental update equations become (as opposed to Eqs. (42) and (43)):

$$CP_i^{new} = CP_i^{old} + \|x\|^2, \qquad (60)$$

$$g_i^{new} = g_i^{old} + x. \qquad (61)$$

Using this trick, when a sample $x$ is assigned to cluster $\omega_J$, the update equations for each entry $s_{i,j}$ of $S_{k \times k}$ are given by Eq. (62).
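Equations (56)–(61) imply that every entry of $S_{k \times k}$ can be rebuilt from the per-cluster statistics $(n_j, v_i, CP_j, g_j)$ alone, via $s_{i,j} = (CP_j + n_j \|v_i\|^2 - 2 v_i^T g_j)/n_j$. The sketch below exploits this identity directly rather than reproducing the case-by-case updates (class and method names are ours):

```python
import numpy as np

class IncrementalSilhouette:
    """Centroid-based iSIL sketch: stores per-cluster n, v, CP = sum ||x||^2,
    and g = sum x (Eqs. (58)-(61)); every dissimilarity entry
    s[i, j] = mean_{x in cluster j} ||x - v_i||^2 is recovered exactly.
    """
    def __init__(self, d):
        self.d = d
        self.n, self.v, self.cp, self.g = [], [], [], []

    def add(self, x, j):
        """Assign sample x to cluster j (clusters must appear in order 0, 1, ...)."""
        x = np.asarray(x, dtype=float)
        if j == len(self.n):                        # new cluster appended
            self.n.append(0); self.v.append(np.zeros(self.d))
            self.cp.append(0.0); self.g.append(np.zeros(self.d))
        self.n[j] += 1
        self.v[j] += (x - self.v[j]) / self.n[j]    # running centroid, Eq. (40)
        self.cp[j] += x @ x                         # Eq. (60)
        self.g[j] += x                              # Eq. (61)

    def value(self):
        k = len(self.n)
        s = np.array([[(self.cp[j] + self.n[j] * (self.v[i] @ self.v[i])
                        - 2.0 * (self.v[i] @ self.g[j])) / self.n[j]
                       for j in range(k)] for i in range(k)])
        sc = []
        for i in range(k):                          # Eq. (57): a = s[i, i]
            a = s[i, i]
            b = min(s[i, j] for j in range(k) if j != i)
            sc.append((b - a) / max(a, b))
        return float(np.mean(sc))                   # Eq. (64)
```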
Note that the numerators of the expressions in Eq. (62) update the compactness "as if" the prototype had changed from $\vec{0}$ to $v^{new}$ at every iteration ($\Delta v = -v^{new}$). The remaining variables, such as $n$, $N$, and $v$, are updated as previously described. This allows $\{CP_1, ..., CP_k\}$ and $\{g_1, ..., g_k\}$ to continue being stored similarly to the previous iCVIs, instead of a $k \times k$ matrix of compactness values and the associated vectors $g$:

$$s_{i,j}^{new} = \begin{cases} \dfrac{1}{n_j^{new}} \left[ CP_j^{old} + \|z_i\|^2 + n_j^{old} \|v_i^{old}\|^2 - 2 (v_i^{old})^T g_j^{old} \right], & (i \neq J,\ j = J) \\[2mm] \dfrac{1}{n_j^{old}} \left[ CP_j^{old} + n_j^{old} \|v_i^{new}\|^2 - 2 (v_i^{new})^T g_j^{old} \right], & (i = J,\ j \neq J) \\[2mm] \dfrac{1}{n_j^{new}} \left[ CP_j^{old} + \|z_j\|^2 + n_j^{old} \|v_j^{new}\|^2 - 2 (v_j^{new})^T g_j^{old} \right], & (i = J,\ j = J) \\[2mm] s_{i,j}^{old}, & (i \neq J,\ j \neq J) \end{cases} \qquad (62)$$

In the case where a new cluster $\omega_{k+1}$ is created following the presentation of sample $x$, a new column and a new row are appended to the matrix $S_{k \times k}$. Unlike the other iCVIs, the compactness $CP_{k+1}$ and vector $g_{k+1}$ of this cluster are initialized as $\|x\|^2$ and $x$, respectively. Then, the entries of $S_{k \times k}$ are updated using Eq. (63):

$$s_{i,j}^{new} = \begin{cases} CP_{k+1} + \|v_i^{old}\|^2 - 2 (v_i^{old})^T g_{k+1}, & (i \neq k+1,\ j = k+1) \\[2mm] \dfrac{1}{n_j^{old}} \left[ CP_j^{old} + n_j^{old} \|v_i^{new}\|^2 - 2 (v_i^{new})^T g_j^{old} \right], & (i = k+1,\ j \neq k+1) \\[2mm] 0, & (i = k+1,\ j = k+1) \\[2mm] s_{i,j}^{old}, & (i \neq k+1,\ j \neq k+1) \end{cases} \qquad (63)$$

Following the incremental updates of the entries of $S_{k \times k}$ (Eq. (62) or (63)), the silhouette coefficients ($sc_i$) are computed (Eq. (57)), and the iSIL is updated as:

$$SIL^{new} = \frac{1}{k^{new}} \sum_{i=1}^{k^{new}} sc_i^{new}. \qquad (64)$$

D. Incremental Negentropy Increment (iNI)
The iNI computation is defined as:

$$NI^{new} = \sum_{i=1}^{k} p_i^{new} \ln \left( \frac{\sqrt{|\Sigma_i^{new}|}}{p_i^{new}} \right) - \frac{1}{2} \ln |\Sigma_{data}|, \qquad (65)$$

where $p_i^{new} = n_i^{new}/N^{new}$, and $\Sigma_i^{new}$ is computed using the following recursive formula [43]:

$$\Sigma^{new} = \frac{n^{new} - 2}{n^{new} - 1} \left( \Sigma^{old} - \delta I \right) + \frac{1}{n^{new}} \left( x - v^{old} \right) \left( x - v^{old} \right)^T + \delta I. \qquad (66)$$

This work's authors set $\delta = 10^{-\epsilon/d}$ to avoid numerical errors, where $\epsilon$ is a user-defined parameter. If a new cluster is created, then $\Sigma = \delta I$ and $|\Sigma| = 10^{-\epsilon}$.
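The recursion of Eqs. (40) and (66) can be checked against the batch estimates; the function below is an illustrative sketch, with the $\delta$-regularization optional (it is disabled by default here, unlike in the paper's experiments):

```python
import numpy as np

def update_gaussian(n, v, S, x, delta=0.0):
    """One step of the recursive mean/covariance update (Eqs. (40) and (66)).

    n, v, S: current count, mean, and unbiased covariance of a cluster;
    x: new sample; delta: diagonal regularizer (0 disables it).
    Returns the updated (n, v, S).
    """
    x = np.asarray(x, dtype=float)
    d = x.size
    n_new = n + 1
    diff = x - v                              # deviation from the OLD mean
    v_new = v + diff / n_new                  # Eq. (40)
    if n_new < 2:
        return n_new, v_new, S                # covariance undefined for n < 2
    S_new = ((n_new - 2) / (n_new - 1)) * (S - delta * np.eye(d)) \
            + np.outer(diff, diff) / n_new + delta * np.eye(d)   # Eq. (66)
    return n_new, v_new, S_new
```

With $\delta = 0$, streaming a data set through this update reproduces the batch unbiased covariance exactly.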
E. Incremental representative Cross Information Potential (irCIP) and cross-entropy (irH)
Section V will show that using the representative cross-entropy rH for computing the CEF makes it easier to observe the behavior of the incremental clustering process (this corroborates a previous study in which rH was deemed more informative than rCIP for multivariate data visualization [74]):

$$rH(\omega_i, \omega_j) = -\ln \left[ rCIP(\omega_i, \omega_j) \right], \qquad (67)$$

$$CEF = \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} rH(\omega_i, \omega_j). \qquad (68)$$

Note that, as opposed to the rCIP-based CEF, larger values of the rH-based CEF indicate better clustering solutions (maximization). Concretely, since the CEF only measures separation, then, like iNI, it is only necessary to update the means and the covariance matrices online in order to construct the incremental CEF (iCEF). This is also done using Eqs. (40) and (66), respectively. The iCEFs based on rCIP and rH are hereafter referred to as irCIP and irH, respectively.

F. Incremental Conn Index (iConn Index)
The Conn Index is another inherently batch CVI, as each element $(i, j)$ of the CADJ matrix requires the count of the samples in the data set with first and second closest prototypes $v_i$ and $v_j$, respectively. Naturally, when clustering data online, $v_i$ and $v_j$ may change for previously presented samples as prototypes are continuously modified or created. However, for the purpose of building and incrementing the CADJ and CONN matrices online (with only one element changing per sample presentation), it is assumed that the trends exhibited over time by the iConn Index do not differ dramatically from those of its offline counterpart. Batch calculation can be eliminated entirely by keeping the values of Eqs. (33) and (35) in memory and updating only the entries corresponding to the winning prototype $v_i$.

In this study, the self-consistent hierarchy and multi-prototype cluster representation required by the iConn Index were generated using fuzzy SMART, whose modules A and B are used for prototype and cluster definition, respectively. Fuzzy SMART's module A was modified in such a way that it forcefully creates two prototypes from the first two samples of every emerging cluster in module B. By enforcing this dynamic, each cluster always possesses at least two prototypes for the computation of the iConn Index. This strategy addresses two problems: first, it allows CADJ to be created from the second sample seen and onward; second, it prevents some cases in which well-separated clusters are strongly connected simply because one of them does not have another prototype to assume the role of the second winner. The second winning prototype for a sample, $v_j$, is the winning A-side category when the first winning prototype $v_i$ has been removed from the A-side category set.

The iConn Index demands certain boundary conditions. In the case of exactly one prototype and one category, such as the case for the very first sample presentation, the CADJ matrix cannot be incremented, and the iConn Index defaults to 0 [53]. This paper presents a remedy for this whereby a count of samples is kept separate from the CADJ matrix (instance counting [75]). Upon creation of the second prototype $v_2$ in fuzzy SMART's module A, the CADJ matrix is incremented for the first time at element $(2, 1)$. At this point, the element $(1, 2)$ is set to the number of samples seen so far belonging to $v_1$. This situation is encountered in the very first sample presentation to fuzzy SMART.

Note that, in the case of a single category, $Inter\_Conn$, given by Eq. (34), defaults to 1 [53]. In the case of a category with a single prototype, the $Intra\_Conn$ for that category, given by Eq. (33), also defaults to a value of 1 [53]. Finally, instead of the original constraint $CADJ(i, j) > 0$ imposed by Eq. (36), this paper's iConn Index implementation uses $CONN(i, j) > 0$, as this makes its behavior smoother and more consistent in this application domain.

IV. NUMERICAL EXPERIMENTS SETUP
The numerical experiments were carried out using the MATLAB software environment. The Cluster Validity Analysis Platform Toolbox [76] was used to compute the Adjusted Rand Index (ARI) [77] to evaluate the partitions detected by the fuzzy ART-based clustering algorithms. Two synthetic data sets were used: (1) R15 [78], [79], consisting of 800 samples and 15 clusters in two dimensions, and (2) D4, an in-house artificially generated data set with 2000 samples and 4 clusters, also in two dimensions. For comparison purposes, hard clustering versions of the iDB, iXB and PS CVIs were used in the experiments. Finally, it should be noted that this study does not employ multi-prototype representations for the irCIP and irH (i.e., $M_i = M_j = 1$, $\forall i, j$ in Eq. (28)) since each of the clusters from the data sets used in these experiments can be modeled using single Gaussian distributions.

All fuzzy ART and SMART dynamics were performed with normalized and complement-coded input, whereas the CVI computations were performed using the normalized data. To emulate scenarios in which there is a natural order of presentation, the samples were presented to fuzzy ART/SMART in a cluster-by-cluster fashion where samples within a given cluster were randomized. Finally, in these experiments, $\epsilon = 12$ in Eq. (66) for the incremental computation of the covariance matrices used by irCIP, irH and iNI. The source code of the CVIs/iCVIs, fuzzy ART/SMART, and experiments is provided at the Applied Computational Intelligence Laboratory public GitLab repository.

V. A COMPARATIVE STUDY
This section discusses the behavior of the iCVIs in three general cases when assessing the quality of the partitions detected by fuzzy ART-based systems in real time: (1) high-quality partitions, (2) under-partitions, and (3) over-partitions. It should be emphasized that this analysis is not focused on evaluating the performance or capabilities of the chosen clustering algorithms; instead, the purpose of this study is to observe the behavior of the iCVIs in these different scenarios to gain insight into their applicability. Moreover, in each of these scenarios, the iCVIs' dynamics are investigated in two sub-cases: (a) the creation of a new cluster and (b) the presentation of samples within a given cluster.

The following discussion is relative to the data sets used in the experiments and their respective orders of cluster and sample presentation (Fig. 1). This is not an exhaustive study of all possible permutations of clusters and samples, as each of them may trigger different global behaviors of the iCVIs. Nonetheless, it can be assumed that some behaviors are typical, which allows the inference of some particular problems that may arise during incremental unsupervised learning. Similar to [12]–[16], a natural ordering, i.e., meaningful temporal information, is assumed.

The R15 data set was used to illustrate the behavior of the iCVIs in cases (1) and (2), which are depicted in Figs. 2 and 3, respectively. Alternately, the D4 data set was used to illustrate the behavior of the iCVIs in cases (1) and (3), which are depicted in Figs. 4 and 5, respectively. For both data sets, case (1) is used as a reference to which the respective cases (2) and (3) are compared. Moreover, Figs. 2 to 5 depict the iCVIs immediately following the creation of the second cluster.
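As an illustration of the per-sample monitoring protocol above, the following sketch updates an incremental Calinski-Harabasz value after every presentation using Welford-style running statistics. It is a generic reconstruction under the usual CH definition, not the paper's equation-level formulation, and all names are hypothetical:

```python
import numpy as np

def monitor_ich(stream, labels):
    """Generic sketch of per-sample iCVI monitoring with an incremental
    Calinski-Harabasz value: per-cluster counts, means, and within-cluster
    scatter are updated by Welford-style recursions, so each step costs
    O(k*d) rather than a full batch recomputation."""
    stats = {}                       # cluster -> (count, mean, within-SS)
    n, gmean = 0, None
    history = []
    for x, c in zip(stream, labels):
        x = np.asarray(x, dtype=float)
        n += 1
        gmean = x.copy() if gmean is None else gmean + (x - gmean) / n
        cnt, mu, wgss = stats.get(c, (0, np.zeros_like(x), 0.0))
        cnt += 1
        delta = x - mu
        mu = mu + delta / cnt
        wgss += float(delta @ (x - mu))  # exact running within-cluster SS
        stats[c] = (cnt, mu, wgss)
        k = len(stats)
        if k > 1 and n > k:
            ssb = sum(ck * float((mk - gmean) @ (mk - gmean))
                      for ck, mk, _ in stats.values())
            ssw = sum(w for _, _, w in stats.values())
            ch = (ssb / (k - 1)) / (ssw / (n - k)) if ssw > 0 else np.inf
            history.append(ch)
        else:
            history.append(np.nan)   # CH is undefined for a single cluster
    return history
```

In the experiments below, the cluster labels would come from the fuzzy ART/SMART module after each sample presentation; here they are simply supplied alongside the stream.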
A. Correct estimation and underestimation of the number of clusters
Consider the high-quality partition of the R15 data set shown in Fig. 2a, which was obtained when presenting samples in the cluster-by-cluster ordering depicted in Fig. 1a. This study shows, in general (and as expected from previous studies on iDB and iXB [12]–[16]), that drastic changes in most iCVI values follow the emergence of new clusters. The exceptions are iXB and irCIP, which appear much less informative than the other iCVIs in this particular experiment, as they show no clearly defined tendencies and seem insensitive to the well-separated clusters numbered 12 to 15 in Fig. 2a.

Fig. 1. Presentation order of the classes for the experiments carried out in (a) Fig. 2, (b) Fig. 3, and (c) Figs. 4 and 5.

During the presentation of samples within a given cluster, many different behaviors can be observed. Typically, iCH either improves or has small fluctuations; iSIL and iDB either worsen or have small fluctuations; iI/iPBM and iNI either worsen or improve; iConn Index and PS improve; and irH consistently undergoes small fluctuations. Again, irCIP and iXB do not appear to be particularly useful compared to the other iCVIs, since no apparent trends were found over the iterations. If an iCVI displays more than one trend, these usually do not occur prominently and simultaneously (i.e., during the presentation of samples from the same cluster). Note that these are important characteristics, since they will help in identifying the under-partition cases.

Now consider the case of underestimating the number of clusters, as shown in Fig. 3a. The latter was obtained when presenting samples in the cluster-by-cluster ordering depicted in Fig. 1b. This research notes that most iCVIs consistently worsen while the algorithm incorrectly agglomerates samples from different clusters (clusters numbered 2 to 9 in Fig. 1b) into a single cluster (cluster numbered 2 in Fig. 3a), except for the iConn Index (which actually improves
due to the strong connectivity among prototypes) and irCIP (which remains constant). Moreover, when incorrectly merging clusters 10 and 11 in Fig. 1b into a single cluster labeled 3 in Fig. 3a, the values of all iCVIs undergo a drastic change, typically toward worse values (except for PS, which only undergoes a slight slope change), while the number of clusters remains constant.

The behavior previously described can also be observed for the clusters labeled 4 and 1 in Fig. 3a. Drastic (iSIL, irCIP, irH, iNI, iDB, iXB, and iConn Index) or more subtle (iCH, iI/iPBM) changes entailing worsening trends take place in the behavior of all CVIs in Fig. 3 when these samples are assigned to the same cluster; again, the exception is PS, which still improves, but with a different inclination. These changes clearly indicate that the clustering algorithm is mistakenly encoding the samples under the same cluster umbrella.

At this stage, it is important to be cautious because even when a high-quality partition is retrieved (Fig. 2), some iCVIs (such as iSIL, iConn Index, and iDB) can both improve and worsen while fuzzy ART is allocating samples to the same cluster (although this happens less frequently and less drastically). Therefore, it is recommended to observe more than one iCVI to determine whether under-partitioning is taking place.
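The cue described above (a drastic worsening shared by several iCVIs while the number of clusters stays constant) could be operationalized as in the following heuristic sketch; the thresholding scheme and all names are assumptions of this illustration, not part of the paper:

```python
import numpy as np

def under_partition_cues(traces, higher_is_better, n_clusters, z=3.0):
    """Heuristic sketch: flag sample indices where most iCVIs take an
    unusually large step toward worse values while the cluster count
    stays constant -- the signature of under-partitioning described above."""
    names = list(traces)
    # per-sample step sizes, oriented so that positive means "got worse"
    steps = {n: (-1 if higher_is_better[n] else 1) * np.diff(traces[n])
             for n in names}
    flags = []
    for t in range(1, len(n_clusters)):
        if n_clusters[t] != n_clusters[t - 1]:
            continue                     # drastic changes are expected here
        votes = 0
        for n in names:
            s = steps[n]
            sd = s.std() or 1e-12
            if s[t - 1] > z * sd:        # unusually large worsening step
                votes += 1
        if votes >= max(2, len(names) // 2):
            flags.append(t)
    return flags
```

Requiring agreement among several indices mirrors the recommendation above: a single iCVI can fluctuate in both directions even on a high-quality partition, so no index should be trusted in isolation.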
B. Correct estimation and overestimation of the number of clusters
For the sake of clarity, over-partitioning is illustrated using the D4 data set, which has a smaller number of clusters. First, the iCVI behaviors for the high-quality partition shown in Fig. 4a are observed as a reference; these were obtained using the cluster sequence depicted in Fig. 1c. The same iCVI trends seem to hold following the emergence of new clusters as well as during the presentation of samples belonging to a given cluster (and again, iXB and irCIP provided the least visually descriptive behavior over time). A notable exception, however, is iNI, which quickly improves immediately after the creation of a new cluster and then worsens as samples from the same cluster are presented. This supports the fact that the iCVI behaviors are not universal: naturally, they are data- and order-dependent.

Now consider the over-partition problem depicted in Fig. 5a, which was also obtained using the cluster sequence depicted in Fig. 1c. As expected, a steep descent (or ascent, depending on the iCVI) usually occurs when new clusters are created. However, since this trend appears to occur regardless of the partition quality (being inherent to all iCVIs), it is not sufficient to identify this issue. In this scenario, unless additional a priori information was available (e.g., the cardinality of clusters) to detect a premature partition, these iCVIs were unable to patently identify over-partitioning solely based on the transitions of their values versus the number of clusters.

Moreover, although there is a natural order for the presentation of clusters (i.e., as a time series), the presentation of samples within each cluster is random. Specifically, when a cluster is over-partitioned, samples are not presented in a subcluster-by-subcluster manner; instead, they are randomly sampled from the different subclusters. This adds another layer of complexity and thus makes this problem even more challenging.
Compared to the correct partition in Fig. 4a, most iCVIs do not exhibit an overall behavior that deviates significantly from the one typically expected when accurately partitioning D4, although most of them yield worse cluster quality evaluation values. In a true unsupervised learning scenario, such reference behavior is unavailable; furthermore, the values of most iCVIs are not bounded, which makes this problem even more challenging to detect.

Except for the iConn Index, none of the iCVIs provided distinctive insights into the over-partition problem: there is a noticeable decrease in iConn Index values (due to a large increase of Inter Conn and a decrease of Intra Conn), which is especially meaningful considering that this iCVI's value is bounded to the interval [0, 1]. More importantly, following the over-partition, it does not exhibit the general behavior previously observed in Figs. 2c and 4c, and it maintains its poor assessment of the clustering solution, thus indicating that there is an issue with the partition found by the clustering algorithm.

VI. INCREMENTAL VERSUS BATCH IMPLEMENTATIONS
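The agreement measures used throughout this section, Pearson correlation and mean squared error between a batch CVI trace and its incremental counterpart, can be computed as in this minimal sketch (names are hypothetical):

```python
import numpy as np

def trace_agreement(batch, incremental):
    """Sketch of the agreement measures used in this section: Pearson
    correlation coefficient and mean squared error between a batch CVI
    trace and its incremental counterpart, evaluated after every sample."""
    b = np.asarray(batch, dtype=float)
    i = np.asarray(incremental, dtype=float)
    r = float(np.corrcoef(b, i)[0, 1])   # Pearson correlation coefficient
    mse = float(np.mean((b - i) ** 2))
    return r, mse
```

A high correlation with a small MSE indicates that the incremental version tracks both the shape and the magnitude of the batch trace; a high correlation with a large MSE would indicate a systematic offset.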
When evaluated over time, most iCVIs discussed in this study yield the same values as their batch counterparts (e.g., the recursive formulation of compactness is an exact computation, not an approximation [12], [13]). The only exception is the iConn Index, which is the subject of analysis in this section. Figs. 6 to 9 illustrate the evolution of both Conn Index and iConn Index for all four experiments described in Section V. These figures also show the error (difference) between the batch and incremental implementations of the Conn Index after the presentation of each sample. To obtain the batch Conn Index values, fuzzy SMART was set to evaluation mode, and all first and second winning prototypes were recomputed after the presentation of each sample.

Notably, error spikes consistently occur upon the appearance of new clusters. In general, the error gradually diminishes over time, as samples within a given cluster are continuously presented to the system.
Fig. 2. (a) A high-quality partition of the R15 data set by fuzzy ART-based clustering algorithms. (b) Fuzzy SMART's module A categories and CONNvis [55] (thicker and darker lines indicate stronger connections). (c)–(l) Behavior of the iCVIs (blue curves) versus Sample ID for the partition in (a): (c) iConn Index (ρ_A = 0.9, 0.93, 0.96), (d) iCH, (e) iSIL, (f) iPBM (p = 1.0, 1.5, 2.0), (g) irCIP, (h) irH, (i) iNI, (j) iXB, (k) iDB, and (l) PS. The number of clusters is tracked by the step-like red curve. The dashed vertical lines represent the limits between two consecutive clusters (ground truth), i.e., samples before a line belong to one cluster whereas samples after it belong to another.
Fig. 3. (a) An under-partition of the R15 data set by fuzzy ART-based clustering algorithms. (b) Fuzzy SMART's module A categories and CONNvis [55] (thicker and darker lines indicate stronger connections). (c)–(l) Behavior of the iCVIs (blue curves) versus Sample ID for the partition in (a): (c) iConn Index (ρ_A = 0.7, 0.8, 0.9), (d) iCH, (e) iSIL, (f) iPBM (p = 1.0, 1.5, 2.0), (g) irCIP, (h) irH, (i) iNI, (j) iXB, (k) iDB, and (l) PS. The number of clusters is tracked by the step-like red curve. The dashed vertical lines represent the limits between two consecutive clusters (ground truth).

Fig. 4. (a) A high-quality partition of the D4 data set by fuzzy ART-based clustering algorithms (ARI = 1). (b) Fuzzy SMART's module A categories and CONNvis [55] (thicker and darker lines indicate stronger connections). (c)–(l) Behavior of the iCVIs (blue curves) versus Sample ID for the partition in (a): (c) iConn Index (ρ_A = 0.7, 0.8, 0.9), (d) iCH, (e) iSIL, (f) iPBM (p = 1.0, 1.5, 2.0), (g) irCIP, (h) irH, (i) iNI, (j) iXB, (k) iDB, and (l) PS. The number of clusters is tracked by the step-like red curve. The dashed vertical lines represent the limits between two consecutive clusters (ground truth).

Fig. 5. (a) An over-partition of the D4 data set by fuzzy ART-based clustering algorithms. (b) Fuzzy SMART's module A categories and CONNvis [55] (thicker and darker lines indicate stronger connections). (c)–(l) Behavior of the iCVIs (blue curves) versus Sample ID for the partition in (a): (c) iConn Index (ρ_A = 0.8, 0.9, 0.96), (d) iCH, (e) iSIL, (f) iPBM (p = 1.0, 1.5, 2.0), (g) irCIP, (h) irH, (i) iNI, (j) iXB, (k) iDB, and (l) PS. The number of clusters is tracked by the step-like red curve. The dashed vertical lines represent the limits between two consecutive clusters (ground truth).
Fig. 6. (a) Behaviors of the Conn Index (continuous blue line) and iConn Index (dashed red line) versus Sample ID for the high-quality partition of the R15 data set (Fig. 2). (b) Error (Conn Index − iConn Index) between the batch and incremental versions in (a). (c) Correlation coefficient (r) and MSE between the batch and incremental versions as a function of fuzzy SMART's module A vigilance parameter ρ_A.

Fig. 7. (a) Behaviors of the Conn Index (continuous blue line) and iConn Index (dashed red line) for the under-partition of the R15 data set (Fig. 3). (b) Error between the batch and incremental versions in (a). (c) Correlation coefficient and MSE between the batch and incremental versions as a function of ρ_A.

Fig. 8. (a) Behaviors of the Conn Index (continuous blue line) and iConn Index (dashed red line) for the high-quality partition of the D4 data set (Fig. 4). (b) Error between the batch and incremental versions in (a). (c) Correlation coefficient and MSE between the batch and incremental versions as a function of ρ_A.

Fig. 9. (a) Behaviors of the Conn Index (continuous blue line) and iConn Index (dashed red line) for the over-partition of the D4 data set (Fig. 5). (b) Error between the batch and incremental versions in (a). (c) Correlation coefficient and MSE between the batch and incremental versions as a function of ρ_A.

These trends are particularly clear when fuzzy SMART yields high-quality partitions (Figs. 6 and 8). In the cases of under- and over-partitioning (Figs. 7 and 9), the errors are more pronounced. However, the iConn Index still smoothly follows the overall trends of its batch counterpart (which has a more jagged behavior).

Finally, the effect of fuzzy SMART module A's quantization level on the similarity of the batch and incremental implementations was investigated. This was done by varying its vigilance parameter ρ_A over a closed interval whose lower bound is module B's vigilance ρ_B (larger values of ρ_A produce a finer granularity of cluster prototypes). The Pearson correlation coefficients [80] and the mean squared error (MSE) depicted in Figs. 6c, 7c, 8c, and 9c show that the behavior of the iConn Index is consistent with the Conn Index across wide ranges of fuzzy SMART module A's vigilance. Interestingly, their dissimilarity tends to increase at very large vigilance values. These results support the original assumption, stated in Section III-F, that both versions of the Conn Index would behave similarly. Therefore, the iConn Index is suitable for monitoring the performance of online clustering methods.

VII. CONCLUSION
This paper extended six cluster validity indices (CVIs) to incremental versions, namely, incremental Calinski-Harabasz (iCH), incremental I index and incremental Pakhira-Bandyopadhyay-Maulik (iI and iPBM), incremental Silhouette (iSIL), incremental Negentropy Increment (iNI), incremental Representative Cross Information Potential (irCIP) and Cross Entropy (irH), and incremental Conn Index (iConn Index). Furthermore, using fuzzy adaptive resonance theory (ART)-based clustering algorithms, three different scenarios were analyzed: detection of the correct number of clusters in high-quality partitions, under-partitioning, and over-partitioning. In these scenarios, a comparative study was performed among the presented incremental cluster validity indices (iCVIs), the Partition Separation (PS) index, the incremental Xie-Beni (iXB), and the incremental Davies-Bouldin (iDB).

As expected from previous studies, most iCVIs undergo abrupt changes following the creation of a new cluster. When samples from the same cluster are presented, however, each iCVI exhibits a particular behavior, which was taken as a reference for comparing the cases of under- and over-partitioning a data set. In these experiments, the least visually informative iCVIs (i.e., those that provided the fewest useful visual cues in their behavior) were irCIP and iXB. In particular, most iCVIs detected under-partitioning in at least one stage of the incremental clustering process, whereas only the iConn Index provided some insight indicating over-partitioning problems.

Nonetheless, the iConn Index failed to identify one of the under-partitioning cases. Therefore, the usual recommendation regarding batch CVIs also applies to iCVIs: this research highlights the importance of monitoring several iCVIs' dynamics at any given time, rather than relying on the assessment of only one. Finally, it was shown that, although not equal to its batch counterpart, the iConn Index follows the same general trends. It is expected that the observations from the study presented here will assist in incremental clustering applications such as data streams.

ACKNOWLEDGMENT

This research was sponsored by the Missouri University of Science and Technology Mary K. Finley Endowment and Intelligent Systems Center; the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brazil (CAPES) - Finance code BEX 13494/13-9; the U.S. Dept. of Education Graduate Assistance in Areas of National Need program; and the Army Research Laboratory (ARL), and it was accomplished under Cooperative Agreement Number W911NF-18-2-0260. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein. The authors would also like to thank Prof. James M. Keller and his coauthors for providing an early copy of reference [15].
EFERENCES [1] A. D. Gordon, “Cluster Validation,” in
Data Science, Classification, and Related Methods , C. Hayashi, K. Yajima, H.-H.Bock et al. , Eds. Tokyo: Springer Japan, 1998, pp. 22–39.[2] R. Xu, J. Xu, and D. C. Wunsch II, “A Comparison Study of Validity Indices on Swarm-Intelligence-Based Clustering,”
IEEE Trans. Syst., Man, Cybern. B , vol. 42, no. 4, pp. 1243–1256, Aug 2012.[3] L. E. Brito da Silva and D. C. Wunsch II, “A study on exploiting VAT to mitigate ordering effects in Fuzzy ART,” in
Proc. Int. Joint Conf. Neural Netw. (IJCNN) , 2018, pp. 2351–2358.[4] G. W. Milligan and M. C. Cooper, “An examination of procedures for determining the number of clusters in a data set,”
Psychometrika , vol. 50, no. 2, pp. 159–179, Jun 1985.[5] J. C. Bezdek, W. Q. Li, Y. Attikiouzel et al. , “A geometric approach to cluster validity for normal mixtures,”
Soft Computing ,vol. 1, no. 4, pp. 166–179, Dec 1997.[6] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, “Cluster Validity Methods: Part I,”
SIGMOD Rec. , vol. 31, no. 2, pp.40–45, Jun. 2002.[7] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, “Clustering Validity Checking Methods: Part II,”
SIGMOD Rec. , vol. 31,no. 3, pp. 19–27, Sep. 2002.
February 19, 2019 DRAFTREPRINT SUBMITTED TO ARXIV.ORG 28 [8] L. Vendramin, R. J. G. B. Campello, and E. R. Hruschka, “Relative clustering validity criteria: A comparative overview,”
Statistical Analysis and Data Mining , vol. 3, no. 4, pp. 209–235, 2010.[9] O. Arbelaitz, I. Gurrutxaga, J. Muguerza et al. , “An extensive comparative study of cluster validity indices,”
PatternRecognit. , vol. 46, no. 1, pp. 243 – 256, 2013.[10] R. Xu and D. C. Wunsch II, “Survey of clustering algorithms,”
IEEE Trans. Neural Netw. , vol. 16, no. 3, pp. 645–678,May 2005.[11] R. Xu and D. C. Wunsch II,
Clustering . Wiley-IEEE Press, 2009.[12] M. Moshtaghi, J. C. Bezdek, S. M. Erfani et al. , “Online Cluster Validity Indices for Streaming Data,”
ArXiv e-prints , Jan2018, arXiv:1801.02937v1 [stat.ML].[13] M. Moshtaghi, J. C. Bezdek, S. M. Erfani et al. , “Online cluster validity indices for performance monitoring of streamingdata clustering,”
International Journal of Intelligent Systems , vol. 0, no. 0, pp. 1–23, Nov 2018.[14] O. A. Ibrahim, J. M. Keller, and J. C. Bezdek, “Analysis of streaming clustering using an incremental validity index,” in , July 2018, pp. 1–8.[15] O. Ibrahim, Y. Wang, and J. Keller, “Analysis of incremental cluster validity for big data applications,”
InternationalJournal of Uncertainty, Fuzziness and Knowledge-Based Systems , vol. 0, no. ja, p. null, 0.[16] O. Ibrahim, J. Keller, and J. Bezdek, “Evaluating Evolving Structure in Streaming Data with Modified Dunn’s Indices,”submitted to IEEE Transaction on Evolving Topics in Computational Intelligence, 2018.[17] G. A. Carpenter, S. Grossberg, and D. B. Rosen, “Fuzzy ART: Fast stable learning and categorization of analog patternsby an adaptive resonance system,”
Neural Netw. , vol. 4, no. 6, pp. 759 – 771, 1991.[18] G. Bartfai, “Hierarchical clustering with ART neural networks,” in
Proc. Int. Conf. Neural Netw. (ICNN) , vol. 2, Jun 1994,pp. 940–944.[19] D. Wunsch II, “ART properties of interest in engineering applications,” in
Proc. Int. Joint Conf. Neural Netw. (IJCNN) ,Jun 2009, pp. 3380–3383.[20] T. Cali´nski and J. Harabasz, “A dendrite method for cluster analysis,”
Communications in Statistics , vol. 3, no. 1, pp. 1–27,1974.[21] D. L. Davies and D. W. Bouldin, “A cluster separation measure,”
IEEE Trans. Pattern Anal. Mach. Intell. , vol. PAMI-1,no. 2, pp. 224–227, Apr 1979.[22] X. L. Xie and G. Beni, “A Validity Measure for Fuzzy Clustering,”
IEEE Trans. Pattern Anal. Mach. Intell. , vol. 13, no. 8,pp. 841–847, Aug 1991.[23] J. Lamirel and P. Cuxac, “New quality indexes for optimal clustering model identification with high dimensional data,” in , Nov 2015, pp. 855–862.[24] J. Lamirel, N. Dugu, and P. Cuxac, “New efficient clustering quality indexes,” in
Proc. Int. Joint Conf. Neural Netw.(IJCNN) , July 2016, pp. 3649–3657.[25] S. Bandyopadhyay and U. Maulik, “Nonparametric genetic clustering: comparison of validity indices,”
IEEE Trans. Syst.,Man, Cybern. C , vol. 31, no. 1, pp. 120–125, Feb 2001.[26] M. K. Pakhira, S. Bandyopadhyay, and U. Maulik, “Validity index for crisp and fuzzy clusters,”
Pattern Recognit. , vol. 37,no. 3, pp. 487 – 501, 2004.[27] P. J. Rousseeuw, “Silhouettes: A graphical aid to the interpretation and validation of cluster analysis,”
Journal ofComputational and Applied Mathematics , vol. 20, pp. 53 – 65, 1987.[28] E. R. Hruschka, L. N. de Castro, and R. J. G. B. Campello, “Evolutionary algorithms for clustering gene-expression data,”in
Proc. IEEE Int. Conf. Data Mining (ICDM) , Nov 2004, pp. 403–406.
February 19, 2019 DRAFTREPRINT SUBMITTED TO ARXIV.ORG 29 [29] E. R. Hruschka, R. J. Campello, and L. N. de Castro, “Evolving clusters in gene-expression data,”
Information Sciences ,vol. 176, no. 13, pp. 1898 – 1927, 2006.[30] M. Rawashdeh and A. Ralescu, “Center-wise intra-inter silhouettes,” in
Scalable Uncertainty Management , E. H¨ullermeier,S. Link, T. Fober et al. , Eds. Berlin, Heidelberg: Springer, 2012, pp. 406–419.[31] J. M. Luna-Romera, M. del Mar Mart´ınez-Ballesteros, J. Garc´ıa-Guti´errez et al. , “An Approach to Silhouette and DunnClustering Indices Applied to Big Data in Spark,” in
Advances in Artificial Intelligence , O. Luaces, J. A. G´amez,E. Barrenechea et al. , Eds. Cham: Springer International Publishing, 2016, pp. 160–169.[32] J. d. A. Silva and E. R. Hruschka, “A support system for clustering data streams with a variable number of clusters,”
ACMTrans. Auton. Adapt. Syst. , vol. 11, no. 2, pp. 11:1–11:26, Jul. 2016.[33] M.-S. Yang and K.-L. Wu, “A new validity index for fuzzy clustering,” in , vol. 1, Dec 2001, pp. 89–92.[34] E. Lughofer, “Extensions of vector quantization for incremental clustering,”
Pattern Recognit. , vol. 41, no. 3, pp. 995 –1011, 2008, part Special issue: Feature Generation and Machine Learning for Robust Multimodal Biometrics.[35] L. F. Lago-Fern´andez and F. Corbacho, “Normality-based validation for crisp clustering,”
Pattern Recognit. , vol. 43, no. 3,pp. 782 – 795, 2010.[36] L. F. Lago-Fern´andez and F. Corbacho, “Using the negentropy increment to determine the number of clusters,” in
Bio-Inspired Systems: Computational and Ambient Intelligence: 10th International Work-Conference on Artificial NeuralNetworks , J. Cabestany, F. Sandoval, A. Prieto et al. , Eds. Berlin, Heidelberg: Springer, 2009, pp. 448–455.[37] P. Comon, “Independent component analysis, A new concept?”
Signal Processing , vol. 36, no. 3, pp. 287 – 314, 1994.[38] D. Ara´ujo, A. D. Neto, and A. Martins, “Representative cross information potential clustering,”
Pattern Recognit. Lett. ,vol. 34, no. 16, pp. 2181 – 2191, 2013.[39] D. Ara´ujo, A. D. Neto, and A. Martins, “Information-theoretic clustering: A representative and evolutionary approach,”
Expert Syst. Appl , vol. 40, no. 10, pp. 4190 – 4205, 2013.[40] E. Gokcay and J. C. Principe, “A new clustering evaluation function using Renyi’s information potential,” in
Proc. Int.Conf. Acoust., Speech, Signal Process. (ICASSP) , vol. 6, 2000, pp. 3490–3493.[41] E. Gokcay and J. C. Principe, “Information theoretic clustering,”
IEEE Trans. Pattern Anal. Mach. Intell. , vol. 24, no. 2,pp. 158–171, Feb. 2002.[42] A. R´enyi, “On Measures of Entropy and Information,” in
Proc. 4th Berkeley Symp. Math. Statist. Probab., Contrib. TheoryStatist. , vol. 1. Berkeley, CA: University of California Press, 1961, pp. 547–561.[43] R. O. Duda, P. E. Hart, and D. G. Stork,
Pattern Classification , 2nd ed. John Wiley & Sons, 2000.[44] M. Cottrell and P. Rousset, “The kohonen algorithm: A powerful tool for analysing and representing multidimensionalquantitative and qualitative data,” in
Biological and Artificial Computation: From Neuroscience to Technology , J. Mira,R. Moreno-D´ıaz, and J. Cabestany, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 1997, pp. 861–871.[45] G. Karypis, E.-H. Han, and V. Kumar, “Chameleon: hierarchical clustering using dynamic modeling,”
Computer , vol. 32,no. 8, pp. 68–75, Aug 1999.[46] E. W. Tyree and J. Long, “The use of linked line segments for cluster representation and data reduction,”
Pattern Recognit.Lett. , vol. 20, no. 1, pp. 21 – 29, 1999.[47] J. Vesanto and E. Alhoniemi, “Clustering of the self-organizing map,”
IEEE Trans. Neural Netw. , vol. 11, no. 3, pp.586–600, May 2000.[48] L. N. F. Ana and A. K. Jain, “Robust data clustering,” in , vol. 2, June 2003, pp. II–II.
[49] J. C. Principe, Information Theoretic Learning: Renyi's Entropy and Kernel Perspectives, 1st ed. Springer Publishing Company, Incorporated, 2010.
[50] A. G. Oliveira, A. D. Neto, and A. Martins, "An analysis of information dynamic behavior using autoregressive models," Entropy, vol. 19, no. 11, 2017.
[51] A. G. Oliveira, A. Martins, and A. D. Neto, "Information state: A representation for dynamic processes using information theory," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), July 2018, pp. 1–8.
[52] K. Taşdemir and E. Merényi, "A new cluster validity index for prototype based clustering algorithms based on inter- and intra-cluster density," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Aug. 2007, pp. 2205–2211.
[53] K. Taşdemir and E. Merényi, "A Validity Index for Prototype-Based Clustering of Data Sets With Complex Cluster Structures," IEEE Trans. Syst., Man, Cybern. B, vol. 41, no. 4, pp. 1039–1053, Aug. 2011.
[54] K. Taşdemir and E. Merényi, "Data topology visualization for the Self-Organizing Maps," in Proc. 14th European Symposium on Artificial Neural Networks (ESANN 2006), Apr. 2006, pp. 277–282.
[55] K. Taşdemir and E. Merényi, "Exploiting Data Topology in Visualization and Clustering of Self-Organizing Maps," IEEE Trans. Neural Netw., vol. 20, no. 4, pp. 549–562, Apr. 2009.
[56] S. Araki, H. Nomura, and N. Wakami, "Segmentation of thermal images using the fuzzy c-means algorithm," in Proc. Second IEEE International Conference on Fuzzy Systems, vol. 2, Mar. 1993, pp. 719–724.
[57] G. A. Carpenter and S. Grossberg, "A massively parallel architecture for a self-organizing neural pattern recognition machine," Computer Vision, Graphics, and Image Processing, vol. 37, no. 1, pp. 54–115, 1987.
[58] G. A. Carpenter and S. Grossberg, "ART 2: self-organization of stable category recognition codes for analog input patterns," Appl. Opt., vol. 26, no. 23, pp. 4919–4930, Dec. 1987.
[59] G. A. Carpenter and S. Grossberg, "The ART of adaptive pattern recognition by a self-organizing neural network," Computer, vol. 21, no. 3, pp. 77–88, Mar. 1988.
[60] G. A. Carpenter and S. Grossberg, "ART 3: Hierarchical search using chemical transmitters in self-organizing pattern recognition architectures," Neural Netw., vol. 3, no. 2, pp. 129–152, 1990.
[61] R. Xu and D. C. Wunsch II, "BARTMAP: A viable structure for biclustering," Neural Netw., vol. 24, no. 7, pp. 709–716, Sep. 2011.
[62] L. E. Brito da Silva, I. Elnabarawy, and D. C. Wunsch II, "Dual vigilance fuzzy adaptive resonance theory," Neural Netw., vol. 109, pp. 1–5, 2019.
[63] L. E. Brito da Silva, I. Elnabarawy, and D. C. Wunsch II, "Distributed dual vigilance fuzzy adaptive resonance theory learns online, retrieves arbitrarily-shaped clusters, and mitigates order dependence," arXiv e-prints, Nov. 2018, arXiv:1901.00794 [cs.NE].
[64] P. P. Chen, W.-C. Lin, and H.-L. Hung, "Multi-resolution fuzzy ART neural networks," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), vol. 3, 1999, pp. 1973–1978.
[65] J. R. Williamson, "Gaussian ARTMAP: A Neural Network for Fast Incremental Learning of Noisy Multidimensional Maps," Neural Netw., vol. 9, no. 5, pp. 881–897, 1996.
[66] G. Anagnostopoulos and M. Georgiopoulos, "Hypersphere ART and ARTMAP for unsupervised and supervised, incremental learning," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), vol. 6, 2000, pp. 59–64.
[67] G. Anagnostopoulos and M. Georgiopoulos, "Ellipsoid ART and ARTMAP for incremental clustering and classification," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), vol. 2, 2001, pp. 1221–1226.
[68] B. Vigdor and B. Lerner, "The Bayesian ARTMAP," IEEE Trans. Neural Netw., vol. 18, no. 6, pp. 1628–1644, 2007.
[69] N. Brannon, G. Conrad, T. Draelos et al., "Information fusion and situation awareness using ARTMAP and partially observable Markov decision processes," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), 2006, pp. 2023–2030.
[70] Y.-T. Huang, F.-T. Cheng, Y.-H. Shih et al., "Advanced ART2 scheme for enhancing metrology-data-quality evaluation," Journal of the Chinese Institute of Engineers, vol. 37, no. 8, pp. 1064–1079, 2014.
[71] H. Isawa, H. Matsushita, and Y. Nishio, "Fuzzy Adaptive Resonance Theory Combining Overlapped Category in consideration of connections," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), June 2008, pp. 3595–3600.
[72] L. E. Brito da Silva and D. C. Wunsch II, "Validity index-based vigilance test in adaptive resonance theory neural networks," in , Nov. 2017, pp. 1–8.
[73] L. A. Zadeh, "Fuzzy sets," Information and Control, vol. 8, no. 3, pp. 338–353, 1965.
[74] L. E. Brito da Silva and D. C. Wunsch II, "An Information-Theoretic-Cluster Visualization for Self-Organizing Maps," IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 6, pp. 2595–2613, June 2018.
[75] G. A. Carpenter and N. Markuzon, "ARTMAP-IC and medical diagnosis: Instance counting and inconsistent cases," Neural Netw., vol. 11, no. 2, pp. 323–336, 1998.
[76] K. Wang, B. Wang, and L. Peng, "CVAP: Validation for Cluster Analyses," Data Science Journal, vol. 8, pp. 88–93, 2009.
[77] L. Hubert and P. Arabie, "Comparing partitions," J. Classification, vol. 2, no. 1, pp. 193–218, 1985.
[78] C. J. Veenman, M. J. T. Reinders, and E. Backer, "A maximum variance cluster algorithm," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 9, pp. 1273–1280, Sept. 2002.
[79] P. Fränti et al., "Clustering datasets," 2015, accessed on May 4, 2017. [Online]. Available: http://cs.uef.fi/sipu/datasets/
[80] L. J. Bain and M. Engelhardt, Introduction to Probability and Mathematical Statistics, 2nd ed. Brooks/Cole, Cengage Learning, 1992.