Interval Type-2 Enhanced Possibilistic Fuzzy C-Means Clustering for Gene Expression Data Analysis
© 2021. This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/
Shahabeddin Sotudian (a), Mohammad Hossein Fazel Zarandi (b,c)

(a) Department of Electrical and Computer Engineering, Division of Systems Engineering, Boston University, Boston, USA
(b) Knowledge/Intelligent Systems Laboratory, Department of Mechanical and Industrial Engineering, University of Toronto, Toronto, Canada
(c) Department of Industrial Engineering, Amirkabir University of Technology, Tehran, Iran

[email protected], [email protected]
Abstract
Both FCM and PCM clustering methods have been widely applied to pattern recognition and data clustering. Nevertheless, FCM is sensitive to noise and PCM occasionally generates coincident clusters. PFCM extends PCM by combining it with FCM, but this method still suffers from the weaknesses of both PCM and FCM. In the current paper, the weaknesses of the PFCM algorithm are corrected and the enhanced possibilistic fuzzy c-means (EPFCM) clustering algorithm is presented. EPFCM can still be sensitive to noise. Therefore, we propose an interval type-2 enhanced possibilistic fuzzy c-means (IT2EPFCM) clustering method by utilizing two fuzzifiers (m₁, m₂) for fuzzy memberships and two fuzzifiers (θ₁, θ₂) for possibilistic typicalities. Our computational results show the superiority of the proposed approaches compared with several state-of-the-art techniques in the literature. Finally, the proposed methods are applied to the analysis of microarray gene expression data.

Keywords: Interval type-2 fuzzy sets; Fuzzy c-means (FCM); Possibilistic c-means (PCM); Possibilistic fuzzy c-means clustering (PFCM); Gene expression data analysis.
1. Introduction
Clustering is an unsupervised pattern classification method that forms clusters by grouping data samples into homogeneous groups. Among the various approaches to clustering, fuzzy c-means (FCM) is one of the most popular in real-world applications due to its fast convergence, simplicity, and easy implementation [1]. That is why it has been extensively applied in various fields such as engineering and medical sciences [2–…]

In the current paper, we focus on PFCM. More specifically, we modify the conventional PFCM algorithm so that the proposed clustering method is able to overcome various problems of PCM, FCM, FPCM, and PFCM. With this in mind, the main contributions of our paper are as follows:

• We enhance the conventional PFCM algorithm, giving rise to a modified algorithm called Enhanced Possibilistic Fuzzy C-Means (EPFCM) clustering. This method not only overcomes the negative impact of noisy points efficiently but also reduces closely located or coincident clusters. Moreover, EPFCM does not produce the unrealistic typicality values for large data sets that are usually seen in FPCM-based clustering methods. These preferable characteristics make EPFCM a satisfactory basic model for our study.
• Higher-order fuzzy clustering algorithms have been demonstrated to be very capable of dealing with the high levels of uncertainty that exist in most real-world applications. We incorporate interval type-2 fuzzy sets (IT2 FSs) into the EPFCM algorithm to enhance the flexibility of the model for handling uncertainty and vagueness. The uncertainty associated with the fuzzifiers m and θ is considered in the proposed algorithm. In this way, we also enhance the ability of our model to overcome the coincident-clusters problem.
• Finally, the proposed algorithms are applied to the analysis of microarray gene expression data.

Our results show that the proposed methods are more robust to outliers and initializations and can produce more accurate clustering results. Based on the above discussion, the remainder of this paper is organized as follows. In Section 2, the problem statement and motivations are presented. In Section 3, the enhanced possibilistic fuzzy c-means algorithm is formulated. In Section 4, the proposed interval type-2 enhanced possibilistic fuzzy c-means algorithm and its properties are presented. Several examples and comparisons showing the validity of our proposed methods are presented in Section 5. Section 6 is devoted to the application of the proposed methods in microarray gene expression data clustering. Finally, conclusions and future work are presented in Section 7.
2. Problem statement and motivations
As we discussed briefly in the introduction, to avoid the tendency to produce coincident clusters, possibilistic fuzzy c-means (PFCM) was proposed by Pal et al. in 2005. PFCM uses the membership and typicality aspects of FCM and PCM and minimizes the following optimization problem [9]:

\min \Big\{ J_{PFCM}(U,T,V;X) = \sum_{i=1}^{c}\sum_{k=1}^{n}\big(C_{FCM}\,u_{ik}^{m} + C_{PCM}\,t_{ik}^{\theta}\big)\,\lVert x_k - v_i\rVert_A^{2} + \sum_{i=1}^{c}\gamma_i \sum_{k=1}^{n}\big(1 - t_{ik}\big)^{\theta} \Big\}, \quad (1)

where X is the set of all data points, T is the typicality matrix, t_ik is the typicality of x_k in the i-th cluster, V is a vector of cluster centers, U is the partition matrix, and u_ik is the membership of x_k in the i-th cluster of X. m > 1 and θ > 1 are user-defined fuzzifiers and γ_i > 0 are user-defined constants, x_k represents a data point, \lVert x\rVert_A = \sqrt{x^{T}Ax} is any inner product norm, n is the number of data points, and c is the number of cluster centers. The constants C_FCM and C_PCM define the relative importance of the fuzzy membership and typicality values in the objective function, respectively. Pal et al. believed that PFCM can simultaneously exhibit the invulnerability of FCM to the coincident-cluster problem and the robustness of PCM to outliers. However, the coincident-clusters problem is still observed in the experimental results of this algorithm. To shed more light on the issue, consider the dataset D in Figure 1(a), which consists of 1550 data points and five clusters. We also added two clusters of outliers far from the other points (each outlier cluster contains 25 data points). When we use a higher value for C_FCM than for C_PCM, we force PFCM to behave more like FCM than PCM. Figure 1(b) shows the clustering result for C_FCM = 0.8 and C_PCM = 0.2 (for this example, we set m = 4 and θ = 2). Evidently, the coincident-clusters problem occurs again. Despite the fact that PFCM enhances the clustering results and alleviates the coincident-clusters problem, it is not able to solve the issue completely, as this example demonstrates. Hence, the existing PFCM method cannot provide satisfactory performance for real-world applications, and we need a better way to cope with this problem. In the next section, we present EPFCM clustering, which is a modified version of PFCM.
It will be demonstrated that EPFCM can overcome various problems of PCM, FCM, FPCM, and PFCM.
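To make the PFCM objective of Eq. (1) concrete, the following is a minimal NumPy sketch of evaluating it (an illustration of ours, not the authors' code; all variable names are our own):

```python
import numpy as np

def pfcm_objective(X, U, T, V, gamma, m=2.0, theta=2.0, c_fcm=0.8, c_pcm=0.2):
    """Evaluate the PFCM objective of Eq. (1).

    X: (n, d) data, U: (c, n) memberships, T: (c, n) typicalities,
    V: (c, d) cluster centers, gamma: (c,) penalty weights.
    """
    # Squared Euclidean distances D_ik^2, shape (c, n) (A = identity here)
    D2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
    # First term: membership- and typicality-weighted distances
    weighted = (c_fcm * U**m + c_pcm * T**theta) * D2
    # Second term: possibilistic penalty gamma_i * (1 - t_ik)^theta
    penalty = (gamma[:, None] * (1.0 - T) ** theta).sum()
    return weighted.sum() + penalty
```

A quick sanity check is that centers placed near the true clusters yield a lower objective than badly placed ones.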
3. Enhanced Possibilistic Fuzzy C-Means
As stated in the previous section, the existing PFCM algorithm still suffers from the coincident-clusters problem. This is due to a structural weakness in the PCM part of the model. Inspired by the PCM algorithm of Krishnapuram and Keller [12], we modify the PCM part of the PFCM algorithm. The objective function of the enhanced possibilistic fuzzy c-means (EPFCM) clustering is:

\min \Big\{ J_{EPFCM}(U,T,V;X) = \sum_{i=1}^{c}\sum_{k=1}^{n}\big(C_{FCM}\,u_{ik}^{m} + C_{PCM}\,t_{ik}^{\theta}\big)\,\lVert x_k - v_i\rVert_A^{2} + \sum_{i=1}^{c}\gamma_i \sum_{k=1}^{n} t_{ik}\big(\log(t_{ik}) - 1\big) \Big\}, \quad (2)

subject to the constraints

0 \le u_{ik} \le 1, \quad \sum_{i=1}^{c} u_{ik} = 1, \quad \sum_{k=1}^{n} u_{ik} > 0, \qquad 0 \le t_{ik} \le 1, \quad \sum_{k=1}^{n} t_{ik} > 0. \quad (3)

Given that the parameters of this model are the same as in PFCM, we do not redefine them. The updating formulas for the fuzzy memberships, possibilistic typicalities, and cluster centers are:

u_{ik} = \Big(\sum_{t=1}^{c}\big(D_{ik}/D_{tk}\big)^{2/(m-1)}\Big)^{-1}, \quad (4)

t_{ik} = \big(\exp(-W(\Lambda_{ik}))\big)^{1/(\theta-1)}, \qquad \Lambda_{ik} = \frac{\theta(\theta-1)\,C_{PCM}\,D_{ik}^{2}}{\gamma_i}, \quad (5)

v_i = \frac{\sum_{k=1}^{n}\big(C_{FCM}\,u_{ik}^{m} + C_{PCM}\,t_{ik}^{\theta}\big)\,x_k}{\sum_{k=1}^{n}\big(C_{FCM}\,u_{ik}^{m} + C_{PCM}\,t_{ik}^{\theta}\big)}, \quad 1 \le i \le c, \quad (6)

where D_ik = ‖x_k − v_i‖ is the distance between the i-th cluster and the k-th data point, and W(·) is the Lambert W-function. Since Λ_ik is always non-negative, W(Λ_ik) has only one real value. Moreover, the value of γ_i can be calculated as in the PCM method, using:

\gamma_i = K\, \frac{\sum_{k=1}^{n} u_{ik}^{m}\,\lVert x_k - v_i\rVert_A^{2}}{\sum_{k=1}^{n} u_{ik}^{m}}, \quad (7)

where K is a positive constant. The main steps of the EPFCM clustering method are presented in Algorithm 1. Equation (2) and the constraints in Equation (3) form the Lagrangian:

\mathcal{L} = \sum_{i=1}^{c}\sum_{k=1}^{n}\big(C_{FCM}\,u_{ik}^{m} + C_{PCM}\,t_{ik}^{\theta}\big)\,\lVert x_k - v_i\rVert_A^{2} + \sum_{i=1}^{c}\gamma_i \sum_{k=1}^{n} t_{ik}\big(\log(t_{ik}) - 1\big) - \sum_{k=1}^{n}\alpha_k\Big(\sum_{i=1}^{c} u_{ik} - 1\Big),

where α = [α₁, …, α_n] is the vector of Lagrange multipliers. Since the second part of \mathcal{L} does not depend on u_ik and v_i, the proofs of the updating formulas for u_ik and v_i are exactly as in the original PFCM, so we do not repeat them (see [9,13] for details).

Algorithm 1:
EPFCM clustering method

Initialization: Fix c, K, m, θ, ε, C_FCM, C_PCM and t_MAX (maximum number of iterations); set the iteration counter t = 0; initialize U, T, Γ and V randomly.
While ( ‖U^(t) − U^(t−1)‖ > ε or t < t_MAX )
    Update U using (4).
    Update T using (5).
    Update V using (6).
    Update Γ using (7).
    t = t + 1
End While
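As an illustration of update equations (4)-(6), the loop body of Algorithm 1 can be sketched as follows (a NumPy/SciPy sketch of ours, not the authors' implementation; `scipy.special.lambertw` supplies the Lambert W-function of Eq. (5), and γ is held fixed here for simplicity, whereas Algorithm 1 also refreshes it via Eq. (7)):

```python
import numpy as np
from scipy.special import lambertw

def epfcm_iteration(X, V, gamma, m=2.0, theta=2.0, c_fcm=0.8, c_pcm=0.2, eps=1e-12):
    """One EPFCM sweep: memberships (4), typicalities (5), centers (6)."""
    # Squared distances D_ik^2, shape (c, n); eps guards division by zero
    D2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2) + eps
    # Eq. (4): FCM-style memberships, (D_ik/D_tk)^(2/(m-1)) on squared distances
    ratio = (D2[:, None, :] / D2[None, :, :]) ** (1.0 / (m - 1.0))
    U = 1.0 / ratio.sum(axis=1)
    # Eq. (5): typicalities via the principal branch of the Lambert W-function
    lam = theta * (theta - 1.0) * c_pcm * D2 / gamma[:, None]
    T = np.exp(-lambertw(lam).real) ** (1.0 / (theta - 1.0))
    # Eq. (6): centers as weighted means of the data
    w = c_fcm * U**m + c_pcm * T**theta
    V_new = (w @ X) / w.sum(axis=1, keepdims=True)
    return U, T, V_new
```

Iterating this sweep on two well-separated blobs drives the centers toward the blob means while keeping the memberships column-stochastic and the typicalities in (0, 1].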
To find the first-order necessary conditions for optimality, the gradient of \mathcal{L} with respect to t_ik is set to zero:

\frac{\partial \mathcal{L}}{\partial t_{ik}} = \theta\,C_{PCM}\,t_{ik}^{\theta-1}\,D_{ik}^{2} + \gamma_i \log(t_{ik}) = 0,

which leads to

\frac{\theta\,C_{PCM}\,D_{ik}^{2}}{\gamma_i}\,t_{ik}^{\theta-1} + \log(t_{ik}) = 0.

Solving this equation for t_ik, one obtains

t_{ik} = \Big(\exp\Big(-W\Big(\frac{\theta(\theta-1)\,C_{PCM}\,D_{ik}^{2}}{\gamma_i}\Big)\Big)\Big)^{1/(\theta-1)}.

The penalty term (1 − t_ik)^θ in the original PFCM (Equation (1)) is a monotonically decreasing function of t_ik on [0,1]. Similarly, t_ik(log(t_ik) − 1) is also monotonically decreasing on [0,1]; therefore, it does not change the structure of the PFCM algorithm. However, since the exponential function in the updating formula of EPFCM decays faster for large values of D_ik², EPFCM manages to overcome the coincident-clusters problem of PFCM. As can be observed in Figure 1(c), when the proposed algorithm is used to cluster D, EPFCM solves the coincident-clusters problem of PFCM.

Theorem.
The required condition for the convergence of the proposed algorithm is fulfilled when

\lim_{t\to\infty}\lVert V^{(t)} - V^{(t-1)}\rVert = 0. \quad (8)

Proof: If we consider J as the error function to be minimized, the iterative formula for u_ik can be deduced from the Newton-Raphson method:

u_{ik}^{(t)} = u_{ik}^{(t-1)} - \rho^{(t)}\, J_{EPFCM}(u_{ik}, t_{ik}, v_i)^{(t-1)} \Big(\frac{\partial J_{EPFCM}(u_{ik}, t_{ik}, v_i)^{(t-1)}}{\partial u_{ik}}\Big)^{-1},

where ρ^(t) is a positive learning-rate parameter and the last factor is the gradient of J_EPFCM with respect to u_ik at iteration (t − 1). Rewriting this equation for V, one obtains:

V^{(t)} - V^{(t-1)} = -\rho^{(t)}\, J_{EPFCM}(U,T,V)^{(t-1)} \Big(\frac{\partial J_{EPFCM}(U,T,V)^{(t-1)}}{\partial V}\Big)^{-1}. \quad (9)

Substituting Equation (9) into Equation (8), we have:

\lim_{t\to\infty}\lVert V^{(t)} - V^{(t-1)}\rVert = \lim_{t\to\infty}\Big(\rho^{(t)}\,\big\lVert J_{EPFCM}(U,T,V)^{(t-1)}\big\rVert\,\Big\lVert\Big(\frac{\partial J_{EPFCM}(U,T,V)^{(t-1)}}{\partial V}\Big)^{-1}\Big\rVert\Big).

Now, if we set \rho^{(t)} = \varepsilon^{(t)} \Big/ \Big(\big\lVert J_{EPFCM}^{(t-1)}\big\rVert\,\big\lVert\big(\partial J_{EPFCM}^{(t-1)}/\partial V\big)^{-1}\big\rVert\Big), where ε^(t) = ε/t, ε is a constant, and ε^(t) → 0 as t → ∞, the previous equation simplifies to:

\lim_{t\to\infty}\lVert V^{(t)} - V^{(t-1)}\rVert = \lim_{t\to\infty}\varepsilon^{(t)} = 0.

Therefore, Equation (8) holds and EPFCM converges.

Figure 1. PFCM and the coincident-clusters problem: (a) dataset D, with red dots marking the cluster centers; (b) the clustering result for PFCM; (c) the clustering result for EPFCM.
4. Interval Type-2 Enhanced Possibilistic Fuzzy C-Means
The higher ability of IT2 FSs to model uncertainty is mostly attributed to their three-dimensional membership functions. T2 FSs have been used to handle uncertainties in domains where the performance of T1 FSs is not sufficiently good. Even though the computational complexity is increased by employing T2 FSs, this is a small price to pay for obtaining satisfactory results [10]. In this section, we improve the uncertainty-modeling capabilities of the proposed model using T2 FSs. Among the main approaches to type-2 fuzzy clustering, parameter uncertainty is widely used. This approach manages the amount of fuzziness of the final partition by incorporating two values of the fuzzifier m, forming a footprint of uncertainty (FOU) corresponding to the lower and upper interval memberships [10]. In the current study, we extend EPFCM to interval type-2 EPFCM using the approach of Zexuan et al. [14]. For more detail on the concept of IT2 FSs and the techniques directly relevant to our work, please refer to [14]. To represent the problem with IT2 FSs, we consider two fuzzifiers (m₁, m₂) for fuzzy memberships and two fuzzifiers (θ₁, θ₂) for possibilistic typicalities. These four fuzzifiers, which represent different fuzzy degrees, give four distinct objective functions to be minimized, shown in Equation (10).
J_{m_1,\theta_1}(U,T,V) = \sum_{i=1}^{c}\sum_{k=1}^{n}\big(C_{FCM}\,u_{ik}^{m_1} + C_{PCM}\,t_{ik}^{\theta_1}\big)\lVert x_k - v_i\rVert_A^2 + \sum_{i=1}^{c}\gamma_i\sum_{k=1}^{n}\big(t_{ik}\ln(t_{ik}) - t_{ik}\big)
J_{m_1,\theta_2}(U,T,V) = \sum_{i=1}^{c}\sum_{k=1}^{n}\big(C_{FCM}\,u_{ik}^{m_1} + C_{PCM}\,t_{ik}^{\theta_2}\big)\lVert x_k - v_i\rVert_A^2 + \sum_{i=1}^{c}\gamma_i\sum_{k=1}^{n}\big(t_{ik}\ln(t_{ik}) - t_{ik}\big)
J_{m_2,\theta_1}(U,T,V) = \sum_{i=1}^{c}\sum_{k=1}^{n}\big(C_{FCM}\,u_{ik}^{m_2} + C_{PCM}\,t_{ik}^{\theta_1}\big)\lVert x_k - v_i\rVert_A^2 + \sum_{i=1}^{c}\gamma_i\sum_{k=1}^{n}\big(t_{ik}\ln(t_{ik}) - t_{ik}\big)
J_{m_2,\theta_2}(U,T,V) = \sum_{i=1}^{c}\sum_{k=1}^{n}\big(C_{FCM}\,u_{ik}^{m_2} + C_{PCM}\,t_{ik}^{\theta_2}\big)\lVert x_k - v_i\rVert_A^2 + \sum_{i=1}^{c}\gamma_i\sum_{k=1}^{n}\big(t_{ik}\ln(t_{ik}) - t_{ik}\big) \quad (10)

Similar to EPFCM, differentiating these objectives and setting the derivatives to zero yields the updating formulas. The primary lower and upper fuzzy memberships [\underline{u}_{ik}, \bar{u}_{ik}] are determined as follows:

\bar{u}_{ik} = \max\Big[\Big(\sum_{t=1}^{c}\big(D_{ik}/D_{tk}\big)^{2/(m_1-1)}\Big)^{-1},\; \Big(\sum_{t=1}^{c}\big(D_{ik}/D_{tk}\big)^{2/(m_2-1)}\Big)^{-1}\Big], \quad (11)

\underline{u}_{ik} = \min\Big[\Big(\sum_{t=1}^{c}\big(D_{ik}/D_{tk}\big)^{2/(m_1-1)}\Big)^{-1},\; \Big(\sum_{t=1}^{c}\big(D_{ik}/D_{tk}\big)^{2/(m_2-1)}\Big)^{-1}\Big]. \quad (12)

Also, the primary lower and upper possibilistic typicalities [\underline{t}_{ik}, \bar{t}_{ik}] are:

\bar{t}_{ik} = \max\Big[\Big(\exp\Big(-W\Big(\tfrac{\theta_1(\theta_1-1)C_{PCM}D_{ik}^2}{\gamma_i}\Big)\Big)\Big)^{1/(\theta_1-1)},\; \Big(\exp\Big(-W\Big(\tfrac{\theta_2(\theta_2-1)C_{PCM}D_{ik}^2}{\gamma_i}\Big)\Big)\Big)^{1/(\theta_2-1)}\Big], \quad (13)

\underline{t}_{ik} = \min\Big[\Big(\exp\Big(-W\Big(\tfrac{\theta_1(\theta_1-1)C_{PCM}D_{ik}^2}{\gamma_i}\Big)\Big)\Big)^{1/(\theta_1-1)},\; \Big(\exp\Big(-W\Big(\tfrac{\theta_2(\theta_2-1)C_{PCM}D_{ik}^2}{\gamma_i}\Big)\Big)\Big)^{1/(\theta_2-1)}\Big]. \quad (14)

To decrease the free parameters of our model and to keep the final uncertainty in the range 0 to 1, the following constraints are added to the problem:

0 \le C_{FCM} \le 1, \quad 0 \le C_{PCM} \le 1, \quad C_{FCM} + C_{PCM} = 1.

Now we have two upper bounds and two lower bounds, but we need one upper bound and one lower bound to calculate the centroids of the clusters straightforwardly. For this purpose, the following equations give the final upper and lower uncertainty representations:

\varphi_{ik1} = C_{FCM}\,\bar{u}_{ik} + C_{PCM}\,\bar{t}_{ik}, \quad \varphi_{ik2} = C_{FCM}\,\bar{u}_{ik} + C_{PCM}\,\underline{t}_{ik}, \quad \varphi_{ik3} = C_{FCM}\,\underline{u}_{ik} + C_{PCM}\,\bar{t}_{ik}, \quad \varphi_{ik4} = C_{FCM}\,\underline{u}_{ik} + C_{PCM}\,\underline{t}_{ik},

\bar{\varphi}_{ik} = \max(\varphi_{ik1}, \varphi_{ik2}, \varphi_{ik3}, \varphi_{ik4}), \qquad \underline{\varphi}_{ik} = \min(\varphi_{ik1}, \varphi_{ik2}, \varphi_{ik3}, \varphi_{ik4}). \quad (15)

The centroids of clusters in IT2EPFCM are calculated using the fuzzy and possibilistic memberships, which are defined based on IT2 FSs. Consequently, each centroid is represented by an interval between v^L and v^R. To obtain the centroids of the proposed model, type reduction and defuzzification methods are required. Type reduction refers to the process of mapping a T2 FS to a T1 FS [15]. Among the various type-reduction methods, the Enhanced Karnik-Mendel (EKM) algorithm is used to estimate the cluster centers without using all of the embedded fuzzy sets; details can be found in [15]. After estimating the maximum and minimum values of the cluster centers, the estimated interval type-1 fuzzy set for the cluster centers is \tilde{v}_i = 1.0/[v_i^L, v_i^R]. Finally, a defuzzification method is used to estimate the crisp centers:

v_i = \frac{v_i^L + v_i^R}{2}.

Since the pattern set is extended to an interval type-2 fuzzy set, we estimate the cluster assignments by hard-partitioning the fuzzy memberships, and the interval type-2 memberships must be type-reduced before hard partitioning. To this end, the left and right memberships (\mu_{ik}^L, \mu_{ik}^R) of all patterns are calculated with respect to the left and right centroids (v_i^L, v_i^R). Type reduction is then performed as:

\mu_{ik} = \frac{\mu_{ik}^L + \mu_{ik}^R}{2}, \quad i = 1, \ldots, c;\; k = 1, \ldots, n.
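Equations (11)-(15) can be sketched in a few lines of NumPy/SciPy (an illustration of ours; the fuzzifier defaults are the values used later in the experimental section):

```python
import numpy as np
from scipy.special import lambertw

def interval_bounds(X, V, gamma, m1=1.1, m2=1.5, th1=1.5, th2=3.0,
                    c_fcm=0.8, c_pcm=0.2, eps=1e-12):
    """Interval memberships (11)-(12), typicalities (13)-(14),
    and the combined upper/lower bounds of Eq. (15)."""
    D2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2) + eps  # (c, n)

    def fcm_u(m):
        # FCM membership with fuzzifier m, as in Eq. (4)
        r = (D2[:, None, :] / D2[None, :, :]) ** (1.0 / (m - 1.0))
        return 1.0 / r.sum(axis=1)

    def ep_t(th):
        # EPFCM typicality with fuzzifier th, as in Eq. (5)
        lam = th * (th - 1.0) * c_pcm * D2 / gamma[:, None]
        return np.exp(-lambertw(lam).real) ** (1.0 / (th - 1.0))

    u_up = np.maximum(fcm_u(m1), fcm_u(m2))   # Eq. (11)
    u_lo = np.minimum(fcm_u(m1), fcm_u(m2))   # Eq. (12)
    t_up = np.maximum(ep_t(th1), ep_t(th2))   # Eq. (13)
    t_lo = np.minimum(ep_t(th1), ep_t(th2))   # Eq. (14)
    # Eq. (15): four combinations, then elementwise max/min
    phis = [c_fcm * u + c_pcm * t for u in (u_up, u_lo) for t in (t_up, t_lo)]
    return np.maximum.reduce(phis), np.minimum.reduce(phis)
```

Because C_FCM + C_PCM = 1 and both u and t lie in [0, 1], the resulting bounds stay in [0, 1] with the upper bound dominating the lower one.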
Eventually, hard partitioning can be obtained using the following rule:

For i ≠ j, i, j = 1, …, c and k = 1, …, n: if μ_ik > μ_jk, then assign x_k to cluster i.

Based on the above discussion, the main steps of the IT2EPFCM clustering method can be summarized as follows:

Algorithm 2:
IT2EPFCM clustering method

Initialization: Fix c, K, m₁, m₂, θ₁, θ₂, ε, C_FCM, C_PCM and t_MAX; set the iteration counter t = 0; initialize the upper and lower memberships, typicalities, Γ and V randomly.
While ( the changes in the upper and lower memberships both exceed ε, or t < t_MAX )
    Update the upper and lower memberships using (11) and (12), respectively.
    Update the upper and lower typicalities using (13) and (14), respectively.
    Update the upper and lower bounds φ using (15).
    t = t + 1
End While
Return the cluster centers and membership values after performing type reduction and defuzzification.
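The closing type-reduction, defuzzification, and hard-partitioning steps can be sketched as follows. Note that, for brevity, this sketch of ours averages the interval endpoints directly instead of running the EKM algorithm, and the per-point normalization is our own simplification:

```python
import numpy as np

def defuzzify_and_assign(phi_up, phi_lo, X):
    """Simplified type reduction and hard partitioning.

    phi_up, phi_lo: (c, n) upper/lower bounds from Eq. (15).
    Uses the interval midpoint in place of EKM (a deliberate simplification).
    """
    mu = 0.5 * (phi_lo + phi_up)                # mu_ik = (mu^L + mu^R) / 2
    w = mu / mu.sum(axis=0, keepdims=True)      # normalize per point (our choice)
    V = (w @ X) / w.sum(axis=1, keepdims=True)  # crisp centers from reduced memberships
    labels = mu.argmax(axis=0)                  # assign x_k to the cluster with largest mu
    return V, labels
```

On well-separated bounds, this reproduces the expected hard partition and ordered centers.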
A remarkable feature of the proposed method is that, by using different values of its parameters (m₁, m₂, θ₁, θ₂, C_FCM and C_PCM), it can behave like the conventional FCM, PCM, IFCM, IPCM, and PFCM methods; as a result, the proposed method can be used as a general approach to fuzzy clustering. Furthermore, the coincident-clusters problem that still exists in the Zexuan et al. model has been corrected in our algorithms. However, since the main goal of this study is to enhance PFCM, from now on we focus only on EPFCM and IT2EPFCM.
5. Experimental results
In this section, several examples are presented to demonstrate the performance of the proposed algorithms. We test and compare the results of the proposed methods with those obtained by FCM, PCM, PFCM, IT2FCM, IT2PCM, and IT2PFCM. The results of these experiments are evaluated using three performance criteria, namely Reconstruction Error (RE), Rand Index (RI) [16], and three fuzzy cluster validity indices. In possibilistic and fuzzy clustering analysis, every point has a degree of belonging to each cluster centroid, so each point can be regenerated using the cluster centers and its degrees of belonging. Consequently, the reconstruction error for a data point x_k is computed as the distance between x_k and its regenerated version x_k'. The Rand Index (RI) is a measure of the similarity between two clustering results (e.g., predicted labels ŷ and true labels y); it estimates the likelihood of an element being correctly classified. RI can be computed as follows:

RI = \frac{\alpha + \beta}{n(n-1)/2},

where α denotes the number of pairs of elements that belong to the same cluster in both ŷ and y; β is the number of pairs of elements that are in different clusters in both ŷ and y; and n is the number of elements. This index ranges from 0 to 1, with a higher value representing a better clustering result. Furthermore, cluster validity indices (CVIs) are used to evaluate the fitness of the partitions produced by the clustering algorithms. Once the partitions are obtained by a clustering method, a validity function helps validate whether they accurately represent the data structure. For this purpose, three well-known validity indices, namely Fukuyama-Sugeno (FS), Xie-Beni (XB), and Kwon, are used to compare the performance of the proposed algorithms against the baseline methods. For these CVIs, a lower index value indicates better performance.
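The Rand Index formula above can be computed by direct pair counting, as in this short sketch (pure Python; the function name is ours):

```python
from itertools import combinations

def rand_index(y_true, y_pred):
    """Rand Index RI = (alpha + beta) / (n(n-1)/2) by direct pair counting."""
    n = len(y_true)
    alpha = beta = 0
    for i, j in combinations(range(n), 2):
        same_true = y_true[i] == y_true[j]
        same_pred = y_pred[i] == y_pred[j]
        if same_true and same_pred:
            alpha += 1          # pair placed together in both labelings
        elif not same_true and not same_pred:
            beta += 1           # pair placed apart in both labelings
    return (alpha + beta) / (n * (n - 1) / 2)
```

Note that RI is invariant to a relabeling of the clusters: swapping the cluster names in the prediction leaves the score unchanged.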
For more detail on the concept of fuzzy validity indices and the techniques directly relevant to our work, please refer to [17]. In all experiments, the process of clustering and calculating the performance metrics was repeated 5 times to verify the precision of the final results, and the averages of the obtained performance metrics are reported in the tables. For all experiments we use the following computational protocol: ε = 0.0001, a maximum of 100 iterations, C_FCM = 0.8, and C_PCM = 0.2. For the type-1 fuzzy and possibilistic clustering algorithms, we set m = 1.5 and θ = 6. For the type-2 fuzzy and possibilistic clustering algorithms, we set m₁ = 1.1, m₂ = 1.5, θ₁ = 1.5, and θ₂ = 3. Moreover, all experiments were performed on a computer with an Intel Core i7 3.1 GHz CPU and 10 GB of RAM, running Microsoft Windows 10 (64-bit).

Example 1.
In the first experiment, EPFCM is compared to the other T1 fuzzy clustering methods. To that end, a data set D consisting of 520 data points and seven clusters is used; it is shown in Figure 2. RE, Kwon, and FS are computed for this dataset using the FCM, PCM, PFCM, and EPFCM algorithms and are presented in Table 1. As can be seen, the proposed EPFCM clustering outperforms the other T1 fuzzy clustering algorithms. Moreover, PFCM works well and is the second-best method.
Table 1. REs and CVIs of clustering results obtained by performing the T1 fuzzy clustering algorithms.

Performance metric | EPFCM   | PFCM    | PCM     | FCM
RE                 | 0.0080  | …       | …       | …
FS                 | −845.63 | −828.59 | −588.39 | −838.73

Figure 2. Scatter plot of the D data set containing seven clusters.

Example 2.
In the second experiment, the proposed IT2EPFCM is compared to IT2FCM, IT2PCM, and IT2PFCM. This example involves four generated corner-shaped clusters (the D data set) consisting of 1600 data points, as shown in Figure 3(a). In this experiment, noisy points are added to the data set to better evaluate the robustness of our methods in the presence of noise: 50, 100, and 200 normally distributed noisy points were added (Figure 3(b-d)). The obtained REs and XBs are reported in Table 2. According to our computations, the proposed algorithm is less sensitive to noise, and problems with a high level of uncertainty are better handled by the proposed interval type-2 clustering method. Additionally, IT2PFCM shows reasonably good overall performance and is the second-best method. However, the difference between the performance of IT2EPFCM and IT2PFCM is significantly larger for D with noisy points, which demonstrates the robustness of our method in the presence of noise.

Table 2. REs and XBs of clustering results obtained by performing the interval type-2 clustering algorithms.

Data set              | IT2EPFCM | IT2PFCM | IT2PCM | IT2FCM
D                     | 0.1138   | …       | …      | …
D + 50 noise points   | …        | …       | …      | …
D + 100 noise points  | …        | …       | …      | …
D + 200 noise points  | …        | …       | …      | …

Example 3.
For our next experiment, three high-dimensional data sets with known numbers of clusters, namely the iris plants database (Iris), Breast Cancer Coimbra, and Yeast, are used to evaluate the performance of the proposed algorithms. To that end, RI is used to compare the proposed algorithms to their T1 and T2 counterparts. All the datasets are adopted from the UCI Machine Learning Repository [18,19], and their details are given in Table 3, where r, N, and c indicate the number of features, the number of data vectors, and the number of clusters, respectively.

Figure 3. Scatter plot of the D data set containing four corner-shaped clusters: (a) without noisy points; (b) D + 50 noisy points; (c) D + 100 noisy points; (d) D + 200 noisy points.

Table 3. The details of the high dimensional data sets.

Data set              | r | N    | c
Iris                  | 4 | 150  | 3
Breast Cancer Coimbra | 9 | 116  | 2
Yeast                 | 8 | 1484 | 10

The values of the Rand Index for all algorithms are reported in Figure 4. As can be seen, the RI values for the EPFCM algorithm are higher than those of PCM, FCM, PFCM, and even the IT2PCM algorithm. Moreover, the proposed IT2EPFCM algorithm performs steadily better than the IT2PFCM clustering method; indeed, IT2EPFCM performs the best among all methods. It is worth mentioning that the only goal of this example is to compare the T1 and T2 fuzzy clustering algorithms under the same experimental settings. It is certainly possible to achieve higher RI values with these clustering algorithms, but to keep the comparison fair, we did not tune their parameters.
Figure 4. The values of the Rand Index for three high dimensional data sets. The values underlying the figure (Breast Cancer Coimbra / Yeast / Iris): EPFCM 0.4981 / 0.7537 / 0.8392; PFCM 0.4970 / 0.7392 / 0.8369; FCM 0.49805 / 0.7529 / 0.82461; IT2EPFCM 0.50525 / 0.77412 / 0.85396; IT2PFCM 0.503 / 0.77113 / 0.85226.

Example 4.
To evaluate the computational cost of the proposed algorithms, we compare their computational time to that of the baseline methods. For that purpose, we use three datasets, namely D, Iris, and Wine. D consists of 140 data points and 2 clusters and is shown in Figure 5. The Wine dataset is also adopted from the UCI Machine Learning Repository [18]. We perform each experiment 10 times with randomly initialized cluster centers. The average execution time and the number of iterations (NI) are presented in Table 4. Both the number of iterations and the runtime of EPFCM are slightly higher than, but still comparable with, those of the other algorithms (PCM and PFCM). Moreover, the runtime of IT2EPFCM is comparable with that of the other IT2 algorithms, and it can generally reach convergence with fewer iterations. In general, the proposed algorithms have slightly higher running times than the other algorithms; however, as the previous experiments showed, they outperform their counterparts. Thus, a slightly higher runtime is a small price to pay for satisfactory results and higher accuracy. Furthermore, compared to the other algorithms, they can reach convergence with fewer iterations.

Figure 5. Scatter plot of the D data set containing two clusters.

Table 4. Average runtime (seconds) and number of iterations (mean ± standard deviation).

Method    | D Runtime | D NI       | Iris Runtime | Iris NI | Wine Runtime | Wine NI
FCM       | 0.0021    | …          | …            | …       | …            | …
PCM       | 0.0051    | …          | …            | …       | …            | …
PFCM      | 0.0063    | …          | …            | …       | …            | …
EPFCM     | 0.0071    | 16.3 ± 2.7 | …            | …       | …            | …
IT2PFCM   | 0.0178    | …          | …            | …       | …            | …
IT2EPFCM  | 0.0185    | 7.2 ± …    | …            | …       | …            | …
6. Application in gene expression data analysis
Clustering algorithms have demonstrated their potential to find the underlying patterns in microarray gene expression profiles. In this section, two microarray gene expression datasets, namely Rat CNS and Arabidopsis thaliana, are used, and the capability of the proposed algorithms is analyzed from various perspectives. These datasets are adopted from [20]. The Arabidopsis thaliana dataset contains the expression levels of 138 genes of Arabidopsis thaliana over 8 time points.

The Rat CNS dataset comprises the expression levels of a set of 112 genes during rat central nervous system development; it has 9 dimensions, each representing one of 9 time points. The number of clusters for these datasets is computed using the FP validity index [21]. Based on our computations, the near-optimal number of clusters is 4 for the Arabidopsis dataset and 3 for the Rat CNS dataset. In the first step, to visually inspect the clustering results, we apply the EPFCM algorithm and present Eisen plots of the clustered datasets. We also generated a random sequence of genes for easier comparison of the data before and after clustering. These plots are shown in Figures 6 and 7, where bright colors denote higher expression levels and dark colors denote lower expression values; the red lines mark the cluster boundaries. The expression profiles of the genes in each cluster are similar to each other and have similar color patterns. In other words, EPFCM places similar genes beside each other, while genes in different clusters are properly separated. In the second step, we use the Silhouette index [22] to compare the performance of the proposed algorithms to their counterparts. This index evaluates how well an observation is clustered, and it can be defined as follows:
s = \frac{b - a}{\max(a, b)},

where a is the average distance of a point from all other points in the same cluster, and b is the average distance of the point from all points in the closest other cluster. This index ranges from −1 to 1, with a higher value representing a better clustering result. We compare the methods proposed here to eleven state-of-the-art approaches for gene expression clustering: centroid linkage (UPGMC), weighted average linkage (WPGMA), unweighted average linkage (UPGMA), Gaussian mixture model (GMM), partitioning around medoids (PAM), self-organizing map (SOM), k-means, general type-2 possibilistic fuzzy c-means (GT2PFCM), IT2PFCM, PFCM, and FCM. We chose PFCM and IT2PFCM because these clustering methods demonstrated strong performance in the previous experiments. For more information about these algorithms, please refer to [14,23,24]. The process of clustering and calculating the Silhouette index was repeated 25 times to verify the precision of our final results, and the averages of the obtained results are reported in Figure 8. As can be seen, IT2EPFCM outperforms its counterparts on both gene expression datasets. Interestingly, it is even better than GT2PFCM; this is the result of using EPFCM instead of PFCM in the objective function. Moreover, setting the T2 fuzzy algorithms aside, EPFCM achieves better or comparable performance relative to the other algorithms. Generally, the fuzzy algorithms demonstrate superior performance over the hard-clustering algorithms. It should be noted that, since we wanted to keep the main models as simple as possible, we used simple methods for the type reduction and defuzzification steps. These steps can have a huge impact on the clustering results; thus, using more sophisticated methods may further improve our results. For more information about these methods, please refer to [25].
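The Silhouette index defined above can be sketched as follows (a NumPy illustration of ours; it assumes no singleton clusters):

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette value s = (b - a) / max(a, b) over all points."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    scores = []
    for k in range(len(X)):
        same = labels == labels[k]
        # a: mean distance to the other points of k's own cluster
        a = D[k, same & (np.arange(len(X)) != k)].mean()
        # b: smallest mean distance to the points of any other cluster
        b = min(D[k, labels == c].mean() for c in set(labels.tolist())
                if c != labels[k])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

Compact, well-separated clusters score close to 1, while a scrambled labeling of the same data drops below zero.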
Eventually, to gain better insight into IT2EPFCM's clustering results, the cluster profile plots of these datasets are presented in Figures 9 and 10. The x-axis denotes the 8 or 9 time points at which the data were collected, and the y-axis denotes the expression values of the genes in each cluster. Evidently, IT2EPFCM is able to generate compact, well-separated clusters: the general forms of the profile plots within each cluster are similar to each other, while profile plots from different clusters are dissimilar.

Figure 6. Eisen plot for Rat CNS: (a) random sequence of genes; (b) clustered data.

Figure 7. Eisen plot for Arabidopsis thaliana: (a) random sequence of genes; (b) clustered data.
Figure 8. The average Silhouette index values for various clustering algorithms on Rat CNS and Arabidopsis thaliana.
Figure 10. Cluster profile plots for Arabidopsis thaliana (Clusters 1-4; x-axis: time points, y-axis: expression values).
Figure 9. Cluster profile plots for Rat CNS (Clusters 1-3; x-axis: time points, y-axis: expression values).
7. Conclusions and future work
Even though PFCM clustering performs better than conventional FCM in the presence of noise, cluster coincidence still occurs in the PFCM algorithm. We analyzed the problem of cluster coincidence in PCM-based algorithms (i.e., conventional PFCM and IT2PFCM) and solved it by presenting the EPFCM method. We then introduced the interval type-2 enhanced possibilistic fuzzy c-means clustering model, IT2EPFCM, based on parameter uncertainties. In the IT2EPFCM algorithm, we consider two fuzzifiers (m1, m2) for fuzzy memberships and two fuzzifiers (η1, η2) for possibilistic typicalities to express the uncertainty associated with the fuzzifier parameters of possibilistic fuzzy clustering. To provide a comprehensive performance analysis, the proposed approaches were evaluated using various artificial and real-world data sets. The experiments show that our methods can solve the problems of the FCM, PCM, and PFCM algorithms. Furthermore, our results demonstrated the superiority and flexibility of the proposed algorithms under highly uncertain environments compared with several existing clustering algorithms in the literature. Finally, we applied the proposed algorithms to gene expression data clustering; the comparative analysis showed that the IT2EPFCM model outperforms the best of its counterparts. Even though our model is quite promising, it also has some limitations that can be addressed in future work. As discussed earlier, one of the outstanding features of IT2 FSs is their resilience to noise compared with T1 FSs; this resilience could be further strengthened by introducing GT2 FSs. Second, the computational complexity of T2 clustering algorithms is an obstacle for some applications. As future work, we will develop more efficient algorithms using bio-inspired optimization techniques. Furthermore, an interesting future direction is to extend the proposed framework to a semi-supervised setting.
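To illustrate how two fuzzifiers induce an interval of memberships, the following sketch computes upper and lower fuzzy membership bounds in the style of the interval type-2 FCM construction of Hwang and Rhee [10]. It is a simplified illustration only, not the full IT2EPFCM update (typicalities, type reduction, and defuzzification are omitted), and the function name and default fuzzifier values are illustrative:

```python
import numpy as np

def interval_memberships(dist, m1=1.5, m2=2.5):
    """Upper/lower fuzzy membership bounds from two fuzzifiers m1 < m2.
    dist: (c, n) matrix of distances ||x_j - v_i|| between n points and
    c cluster centers (assumed strictly positive)."""
    dist = np.fmax(dist, 1e-12)  # guard against zero distances
    def fcm_u(m):
        # Standard FCM membership: u_ij = 1 / sum_k (d_ij / d_kj)^(2/(m-1))
        ratio = (dist[:, None, :] / dist[None, :, :]) ** (2.0 / (m - 1.0))
        return 1.0 / ratio.sum(axis=1)
    u1, u2 = fcm_u(m1), fcm_u(m2)
    # The interval membership is the envelope of the two type-1 memberships.
    return np.maximum(u1, u2), np.minimum(u1, u2)
```

A wider gap between m1 and m2 produces a wider footprint of uncertainty, giving the model more room to absorb uncertainty in the fuzzifier choice.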
Finally, we can extend the proposed models to deal with more complex data, e.g., data with missing entries.

References
[1] Hashemzadeh M, Golzari Oskouei A, Farajzadeh N. New fuzzy C-means clustering method based on feature-weight and cluster-weight learning. Appl Soft Comput 2019;78:324–45. https://doi.org/10.1016/j.asoc.2019.02.038.
[2] Dang TH, Mai DS, Ngo LT. Multiple kernel collaborative fuzzy clustering algorithm with weighted super-pixels for satellite image land-cover classification. Eng Appl Artif Intell 2019;85:85–98. https://doi.org/10.1016/j.engappai.2019.05.004.
[3] Tao Y, Zhang Y, Wang Q. Fuzzy c-mean clustering-based decomposition with GA optimizer for FSM synthesis targeting to low power. Eng Appl Artif Intell 2018;68:40–52. https://doi.org/10.1016/j.engappai.2017.10.022.
[4] Sotudian S, Zarandi MHF, Turksen IB. From Type-I to Type-II Fuzzy System Modeling for Diagnosis of Hepatitis. Int J Comput Inf Eng 2016;10:1280–8.
[5] Ding Y, Fu X. Kernel-based fuzzy c-means clustering algorithm based on genetic algorithm. Neurocomputing 2016;188:233–8. https://doi.org/10.1016/j.neucom.2015.01.106.
[6] Cheng H, Qian Y, Wu Y, Guo Q, Li Y. Diversity-induced fuzzy clustering. Int J Approx Reason 2019;106:89–
…38. https://doi.org/10.1142/S0218488507004650.
[9] Pal NR, Pal K, Keller JM, Bezdek JC. A possibilistic fuzzy c-means clustering algorithm. IEEE Trans Fuzzy Syst 2005;13:517–30. https://doi.org/10.1109/TFUZZ.2004.840099.
[10] Hwang C, Rhee FC-H. Uncertain Fuzzy Clustering: Interval Type-2 Fuzzy Approach to C-Means. IEEE Trans Fuzzy Syst 2007;15:107–20. https://doi.org/10.1109/TFUZZ.2006.889763.
[11] Melin P, Castillo O. A review on type-2 fuzzy logic applications in clustering, classification and pattern recognition. Appl Soft Comput 2014;21:568–77. https://doi.org/10.1016/j.asoc.2014.04.017.
[12] Krishnapuram R, Keller JM. The possibilistic C-means algorithm: insights and recommendations. IEEE Trans Fuzzy Syst 1996;4:385–93. https://doi.org/10.1109/91.531779.
[13] Krishnapuram R, Keller JM. A possibilistic approach to clustering. IEEE Trans Fuzzy Syst 1993;1:98–110.
[14] …56. https://doi.org/10.1016/j.fss.2013.12.011.
[15] Wu D, Mendel JM. Enhanced Karnik–Mendel Algorithms. IEEE Trans Fuzzy Syst 2009;17:923–34. https://doi.org/10.1109/TFUZZ.2008.924329.
[16] Rand WM. Objective Criteria for the Evaluation of Clustering Methods. J Am Stat Assoc 1971;66:846–50. https://doi.org/10.1080/01621459.1971.10482356.
[17] Wang W, Zhang Y. On fuzzy cluster validity indices. Fuzzy Sets Syst 2007;158:2095–
[18] …
[19] Patrício M, Pereira J, Crisóstomo J, Matafome P, Gomes M, Seiça R, et al. Using Resistin, glucose, age and BMI to predict the presence of breast cancer. BMC Cancer 2018;18:29. https://doi.org/10.1186/s12885-017-3877-1.
[20] Combining Pareto-Optimal Clusters using Supervised Learning for Identifying Co-expressed Genes n.d. http://anirbanmukhopadhyay.50webs.com/data.html (accessed May 20, 2020).
[21] Zarandi MHF, Sotudian S, Castillo O. A New Validity Index for Fuzzy-Possibilistic C-Means Clustering. arXiv:2005.09162 [cs, stat] n.d.
[22] Rousseeuw PJ. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 1987;20:53–65. https://doi.org/10.1016/0377-0427(87)90125-7.
[23] Kerr G, Ruskin HJ, Crane M, Doolan P. Techniques for clustering gene expression data. Comput Biol Med 2008;38:283–93. https://doi.org/10.1016/j.compbiomed.2007.11.001.
[24] Oyelade J, Isewon I, Oladipupo F, Aromolaran O, Uwoghiren E, Ameh F, et al. Clustering Algorithms: Their Application to Gene Expression Data. Bioinforma Biol Insights 2016;10:237–53. https://doi.org/10.4137/BBI.S38316.
[25] Torshizi AD, Zarandi MHF, Zakeri H. On type-reduction of type-2 fuzzy sets: A review. Appl Soft Comput 2015;27:614–27. https://doi.org/10.1016/j.asoc.2014.04.031.