Taming Wild High Dimensional Text Data with a Fuzzy Lash
Amir Karami
School of Library and Information Science, University of South Carolina
Columbia, SC, [email protected]
Abstract—The bag of words (BOW) represents a corpus in a matrix whose elements are the frequencies of words. However, each row in the matrix is a very high-dimensional sparse vector. Dimension reduction (DR) is a popular method to address sparsity and high-dimensionality issues. Among different strategies to develop DR methods, Unsupervised Feature Transformation (UFT) is a popular strategy to map all words on a new basis to represent BOW. The recent increase of text data and its challenges imply that the DR area still needs new perspectives. Although a wide range of methods based on the UFT strategy has been developed, the fuzzy approach has not been considered for DR based on this strategy. This research investigates the application of fuzzy clustering as a DR method based on the UFT strategy to collapse the BOW matrix and provide a lower-dimensional representation of documents instead of the words in a corpus. The quantitative evaluation shows that fuzzy clustering produces superior performance and features compared to Principal Components Analysis (PCA) and Singular Value Decomposition (SVD), two popular DR methods based on the UFT strategy.
Index Terms—dimension reduction, fuzzy clustering, SVD, PCA, classification
I. INTRODUCTION
Large electronic archives provide extremely useful and valuable resources to the scholarly community [1]. For example, there are more than 25 million documents in the MEDLINE/PubMed website and more than 4 million documents in the IEEE Xplore Digital Library website. This huge amount of documents has created a growing need to develop new methods for processing high dimensional data [2]. This computational area is one of the data-intensive challenges identified by the National Science Foundation (NSF) as an area for future study [3].

Bag-of-words (BOW) is a common method in text data representation. This technique represents documents based on the frequency of words with a matrix [4]. However, this high dimensional matrix is a sparse matrix for a large number of documents [5]. Sparsity means that most elements in the BOW matrix are zero because each document contains a small percentage of all words in a corpus [6].

Dimension reduction (DR) is a pre-processing step for reducing the original BOW dimension. The objective of dimension reduction strategies is to improve the speed and accuracy of data mining [2]. There are four main strategies for DR: Supervised-Feature Selection (SFS), Unsupervised-Feature Selection (UFS), Supervised-Feature Transformation (SFT), and Unsupervised-Feature Transformation (UFT) [2]. Feature selection focuses on finding a feature subset that can describe the data as well as the original dataset, for supervised or unsupervised learning tasks [7]. Unsupervised means there is no teacher, in the form of class labels [8]. Many existing databases are unlabeled because large amounts of data make it difficult for humans to manually label the categories of each document. Moreover, human labeling is expensive and subjective.
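As a small illustration of this sparsity, the following self-contained sketch builds a BOW matrix in pure Python and measures the fraction of zero entries. The three short "documents" are invented for this example:

```python
# Toy corpus: three invented one-line documents.
docs = [
    "grain prices rose sharply",
    "virus outbreak hits hospitals",
    "grain exports and virus news",
]

# Vocabulary = every distinct word in the corpus; BOW = frequency matrix.
vocab = sorted({w for d in docs for w in d.split()})
bow = [[d.split().count(w) for w in vocab] for d in docs]

# Sparsity = fraction of zero cells in the n_docs x n_words matrix.
nonzero = sum(1 for row in bow for x in row if x)
sparsity = 1 - nonzero / (len(bow) * len(vocab))
```

Even in this toy corpus with shared words, roughly 61% of the cells are zero; a real corpus with a vocabulary of tens of thousands of words is far sparser still.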
Hence, unsupervised learning is needed.

DR methods are based on approaches such as linear algebra, statistical distributions, and neural networks. The fuzzy approach has contributed to decision making [9], [10] and data mining in various ways by providing flexible tools such as fuzzy information granulation and the representation of vague patterns [11]; however, fuzzy clustering has not been considered as a DR approach.

This paper will discuss the application of fuzzy clustering for dimensionality reduction based on the UFT strategy. This research compares the DR performance of fuzzy clustering, PCA, and SVD, and shows that fuzzy clustering has better performance in document classification and has computational advantages over the current methods.

The remainder of this paper is organized as follows. In the related work section, we review the DR research. In the methodology and experiment sections, we provide more details about using fuzzy clustering as a DR method along with an evaluation study to verify the effectiveness of fuzzy clustering. Finally, we present a summary, limitations, and future directions in the last section.

II. RELATED WORK
Big text data have encouraged researchers to propose dimension reduction techniques in four categories [12]: SFS, SFT, UFS, and UFT.

The SFS strategy explores the best minimum subset of the original words (features) for labeled data. Assume that W = {w_1, w_2, ..., w_m} and L = {l_1, l_2, ..., l_p} denote the words and the class label set, where m and p are the number of words and labels, respectively. D = {d_1, d_2, ..., d_n} is the corpus, where n is the number of documents. The goal of the SFS strategy is to find F = {f_1, f_2, ..., f_k}, a subset of W with k features (k < m), with respect to L. Several methods were developed based on the SFS strategy, such as information gain [13] and the Chi-square measure [14].

The SFT strategy maps the words to a new basis for labeled data. The goal of the SFT strategy is to map the words in W onto clusters, C = {c_1, c_2, ..., c_k}, with respect to L, where k << m. For example, Linear Discriminant Analysis (LDA) is an SFT method using the Fisher criterion, based on maximizing the between-class scatter and minimizing the within-class scatter [15].

UFS explores the best minimum subset of the original words for unlabeled data. The goal of the unsupervised-feature selection strategy is to find the best minimum subset (k) of F without having L, where k < m. Different methods have been developed based on the UFS strategy, such as Non-negative Matrix Factorization (NMF) [16] and the Laplacian Score (LS) [17].

The UFT strategy maps the words to a new basis for unlabeled data. The goal of the unsupervised-feature transformation strategy is to map the words in W onto C without having L, where k << m. Several methods have been developed based on the UFT strategy, such as Principal Components Analysis (PCA), a linear unsupervised-feature transformation that maps a set of correlated features into a set of uncorrelated features using orthogonality [18].
PCA is among the most effective dimension reduction techniques and has shown better performance than other techniques [19]. While PCA uses eigen-decomposition of the covariance matrix, Latent Semantic Analysis (LSA) is a similar method using Singular Value Decomposition (SVD) for feature transformation [20]. SVD detects the maximum variance of the data in a set of orthogonal basis vectors [21].

Some studies have applied the fuzzy approach to develop dimensionality reduction methods based on supervised- and unsupervised-feature selection strategies, such as Rough Set Attribute Reduction (RSAR) [22]. The current fuzzy-based dimension reduction methods rely on retaining important features and removing irrelevant and redundant (noisy) features [23]; however, this strategy loses some information. This research investigates the potential of fuzzy clustering as a DR method and compares its performance with powerful current DR methods based on the UFT strategy.

III. METHOD
The goal of the UFT strategy is to obtain a new basis that is a combination of the original basis. Among different methods with respect to this strategy, PCA and LSA are well-known, widely used methods [24]. PCA converts the matrix X, which contains n objects or documents with m variables or words, to three matrices: linear combinations of variables for each object (t), vectors of regression coefficients (P), and residuals (E):

X = tP^T + E

LSA applies SVD on the matrix X to drop the least significant singular values and keep k singular values. SVD converts the matrix X to three matrices: the diagonalized XX^T (U), the singular values of X (S), and the diagonalized X^T X (V^T). In both PCA and SVD, the original basis is represented by a new reduced basis with k dimensions (k << m and k << n):

X = USV^T

Traditional reasoning has a precise character that is yes-or-no rather than more-or-less [25]. Fuzzy logic proposes a new approach to move from the classical logic, zero or one, to truth values between zero and one [10], [26].

Fuzzy logic assumes that if X is a collection of data points represented by x, then a fuzzy set A in X is a set of ordered pairs, A = {(x, µ_A(x)) | x ∈ X}. µ_A(x) is the membership function, which maps X to the membership space M, which is between 0 and 1 [10].

The goal of most clustering algorithms is to minimize an objective function (J) that measures the quality of clusters, where the optimum J is the sum of the squared distances between each cluster center and each data point. There are two major clustering approaches: hard and fuzzy (soft) [27].
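Before turning to the fuzzy formulation, the truncated-SVD route described above can be sketched in a few lines of NumPy. The toy document-term matrix below is invented for illustration; for large sparse matrices, the irlba R package used later in this paper computes the same truncated factorization efficiently:

```python
import numpy as np

# Toy 5-document, 10-word BOW matrix (values invented for illustration).
rng = np.random.default_rng(0)
X = rng.poisson(0.5, size=(5, 10)).astype(float)

k = 2  # number of retained singular values
U, S, Vt = np.linalg.svd(X, full_matrices=False)  # X = U S V^T
X_reduced = U[:, :k] * S[:k]      # each document now lives in k dimensions
X_approx = X_reduced @ Vt[:k, :]  # best rank-k approximation of X
```

Dropping the smallest singular values discards the directions of least variance, which is exactly the information loss the fuzzy-clustering alternative below avoids.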
The hard approach assigns exactly one cluster to a document, but the soft approach assigns a degree of membership with respect to each cluster for a document [2]. Among fuzzy clustering techniques, fuzzy C-means (FCM) is the most popular model [28]; it minimizes an objective function subject to constraints:

Min J_q = Σ_{f=1}^{k} Σ_{j=1}^{n} (µ_fj)^q ||d_j − v_f||^2    (1)

subject to:

0 ≤ µ_fj ≤ 1    (2)

Σ_{f=1}^{k} µ_fj = 1    (3)

0 < Σ_{j=1}^{n} µ_fj < n    (4)

Where:
n = number of documents
k = number of clusters
µ = membership value
q = fuzzifier, 1 < q ≤ ∞
d = document vector
v = cluster center vector

In this research, we use fuzzy clustering to find µ_fj as the membership degree for each document (d_j) with respect to each of the clusters. The value of µ_fj is between 0 and 1 and is assumed to be a new basis to represent the document-term frequency matrix. The number of documents and the number of clusters are represented by n and k. We assume that fuzzy clustering converts X with n documents and m words to a new reduced matrix (C) with k variables or dimensions (k << m) (Fig. 1). It is worth mentioning that fuzzy clustering does not lose information in X and does not need to select a subset of dimensions, as SVD does.

Fig. 1: Matrix Interpretation of FC (X_{n×m} → C_{n×k})

For example, assume that there are 10 words in a corpus with 5 documents represented by the matrix X, whose elements show the frequency of the words in each of the documents. For instance, word 1 (w_1) appears two times in document 2 (d_2). Applying fuzzy clustering on X to find two fuzzy clusters creates a matrix C in which each element is a cluster's membership degree with respect to a document. For instance, document 1 (d_1) belongs to cluster 1 (c_1) with one membership value and to cluster 2 with the complementary membership value (Fig. 2).
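The FCM iteration behind Eqs. (1)–(4) can be sketched with a minimal NumPy implementation. This is illustrative only: the toy data, random initialization, and fixed iteration count are assumptions, and the paper's experiments use the skmeans soft spherical k-means rather than this Euclidean variant:

```python
import numpy as np

def fuzzy_c_means(D, k, q=2.0, iters=100, seed=0):
    """Alternate center and membership updates to minimize J_q in Eq. (1)."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    U = rng.random((k, n))
    U /= U.sum(axis=0)  # constraint (3): memberships of each document sum to 1
    for _ in range(iters):
        W = U ** q
        V = (W @ D) / W.sum(axis=1, keepdims=True)  # cluster centers v_f
        # Distance of every document to every center, small epsilon avoids /0.
        dist = np.linalg.norm(D[None, :, :] - V[:, None, :], axis=2) + 1e-12
        U = dist ** (-2.0 / (q - 1.0))  # standard FCM membership update
        U /= U.sum(axis=0)
    return U, V

# Toy data: two well-separated groups of 5 "documents" in 3 dimensions.
rng = np.random.default_rng(1)
D = np.vstack([rng.normal(0.0, 0.1, (5, 3)), rng.normal(5.0, 0.1, (5, 3))])
U, V = fuzzy_c_means(D, k=2)
C = U.T  # the reduced n x k representation: memberships as the new basis
```

The transpose of the membership matrix, C, is exactly the lower-dimensional document representation of Fig. 1: one row per document, one column per fuzzy cluster.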
In this example, fuzzy clustering converts the X_{5×10} matrix to the C_{5×2} matrix and reduces the dimension space by 80%.

A large number of fuzzy clustering algorithms have been developed [29], [30]. To manage text data sparsity, we use a spherical fuzzy clustering method called soft spherical k-means. This method iterates between determining optimal memberships for fixed prototypes and computing optimal prototypes for fixed memberships [31].

IV. EXPERIMENTS
In this section, we evaluate the dimension reduction application of fuzzy clustering against PCA and SVD through document classification using Functional Trees (FT), Random Forest, and Adaptive Boosting (AdaBoost), which are among the high-performance classification algorithms [21], [32]–[36]. We use two datasets, the irlba R package for computing SVD and PCA [37], the skmeans R package for soft (fuzzy) spherical k-means with 100 iterations and 1e-5 as the minimum improvement in the objective function between two consecutive iterations [38], and the Weka tool with its default settings for document classification.

A. Datasets
We leverage two datasets in this research:

• The Reuters dataset (https://archive.ics.uci.edu/ml/datasets/reuters-21578+text+categorization+collection) has 21,578 documents with several news categories. Two classes were created for binary classification. The documents in the Grain class were labeled as "Grain" and the rest of the documents were labeled as "Not Grain".

• The Ohsumed dataset (http://disi.unitn.it/moschitti/corpora.htm) has 20,000 documents with different cardiovascular disease categories. Two classes were created for binary classification. The documents in the Virus Diseases class were labeled as "Virus Diseases" and 5000 documents randomly selected from the rest of the documents were labeled as "Not Virus Diseases".

B. Document Classification
The document classification problem assigns a document to a class. For this purpose, a pre-processing step is needed to extract features from text data. Using the words in a corpus as features creates a large sparse matrix. One solution to reduce the feature set is to use DR methods such as fuzzy clustering, SVD, and PCA to reduce the number of the original features.

Three classification methods, including Functional Trees (FT), Random Forest, and Adaptive Boosting (AdaBoost), were trained on 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100 reduced dimensions. To avoid any possible sampling bias, we apply the 5-fold cross validation method, in which the data is broken into 5 subsets for 5 iterations. Each of the subsets is selected once for testing and the rest of them are selected for training.

The output of a classification method is presented as a confusion matrix (Table I) with the following definitions:

TABLE I: Confusion Matrix
                      Predicted
                      Negative    Positive
Actual    Negative    TN          FP
          Positive    FN          TP

• True Negative (TN) is the number of correct predictions that an instance is negative.
• False Negative (FN) is the number of incorrect predictions that an instance is negative.
• False Positive (FP) is the number of incorrect predictions that an instance is positive.
• True Positive (TP) is the number of correct predictions that an instance is positive.

Classification accuracy is an evaluation metric that measures how well a classifier recognizes instances of the various classes. The accuracy of a classifier is the percentage of correctly classified documents in a test set [32].
Accuracy = (TP + TN) / (TP + TN + FP + FN)    (5)
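Eq. (5) translates directly into code; the counts in the usage example are hypothetical:

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified instances, as in Eq. (5)."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical confusion-matrix counts: 90 correct out of 100 instances.
acc = accuracy(tp=40, tn=50, fp=5, fn=5)
```

Because the four counts partition the test set, the denominator is simply the test-set size.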
C. Evaluation Results
Fig. 3a and Fig. 3b show the average accuracy of the three classifiers, along with two fuzzifier values, 1.5 (FC-1.5) and 2 (FC-2), for the two datasets. These two figures indicate that fuzzy clustering achieves better accuracy performance than PCA and SVD.

In addition, FC-1.5 has better performance in most of the classification experiments and shows the highest stability with the lowest standard deviation value, followed by FC-2, SVD, and PCA. Although increasing the number of dimensions mostly has a negative effect on the accuracy performance of PCA and SVD based on Fig. 3, fuzzy clustering shows stable performance with a lower standard deviation than the non-fuzzy methods. While SVD shows more stability than PCA, PCA has better accuracy than SVD across different numbers of dimensions.

Fig. 2: A Numerical Example for Dimension Reduction Application of Fuzzy Clustering

Fig. 3: Classification Evaluation — accuracy versus number of dimensions for FC-1.5, FC-2, PCA, and SVD on (a) the Reuters dataset and (b) the OHSUMED dataset

While the complexities of the PCA and SVD methods are O(mn log(k)) and O(mn log(k) + (m + n)k^2), respectively [39], the complexity of the fuzzy spherical k-means is O(n + k) [40]. Beyond the complexity advantage, the DR application of fuzzy clustering has other benefits: it does not lose information, the number of clusters or dimensions can be estimated with already developed methods such as the silhouette index [41] and the Xie-Beni index [42], and it works with both discrete and continuous data.

V. CONCLUSION
Big text databases represent extremely useful resources to the scholarly community; however, analyzing individual words in a corpus leads to a high dimensional sparse BOW matrix. DR is a pre-processing step in data mining to reduce the BOW matrix dimension for better accuracy. Although a wide range of DR methods has been developed, the exponential growth of data indicates that DR still needs new perspectives. DR methods have been developed based on different strategies. UFT is a popular and efficient strategy using different approaches such as linear algebra, statistical distributions, and neural networks. However, fuzzy clustering has not been considered as a DR approach based on the UFT strategy.

This study discusses the potential of fuzzy clustering for DR based on the UFT strategy. Fuzzy clustering processes the BOW matrix and creates a new matrix whose elements are membership degree values for each document in a corpus. This research uses the new matrix as a reduced version of the BOW matrix. The efficiency and effectiveness of fuzzy clustering are demonstrated through accuracy comparisons with PCA and SVD using two publicly available corpora.

This paper's results illustrate that fuzzy clustering is a competitor to powerful methods such as PCA and SVD in the setting of dimensionality reduction for document collections. Indeed, the principal advantages of fuzzy clustering include not losing information and having lower complexity. Fuzzy clustering also works with both discrete and continuous data, and there are developed methods to estimate the optimum number of dimensions. Although this paper has applied fuzzy clustering for the purpose of text data dimension reduction, this clustering method can be used for other data types such as image and microarray data.

This research has several limitations. First, word weighting methods such as entropy are not considered. Second, the fuzzifier is limited to two values (1.5 and 2).
Third, the accuracy improvement of fuzzy clustering over PCA and SVD is not significant. In our future work, we will apply word weighting methods with fuzzy clustering, investigate different fuzzifier values, and explore other fuzzy clustering methods.

REFERENCES

[1] A. Karami, A. Gangopadhyay, B. Zhou, and H. Kharrazi, "A fuzzy approach model for uncovering hidden latent semantic structure in medical text collections," in Proceedings of the iConference, 2015.
[2] A. Karami, Fuzzy Topic Modeling for Medical Corpora. University of Maryland, Baltimore County, 2015.
[3] N. Council, "Future directions for NSF advanced computing infrastructure to support U.S. science and engineering in 2017-2020: Interim report," The National Academies Press, Washington, DC, 2016.
[4] A. Karami and A. Gangopadhyay, "FFTM: A fuzzy feature transformation method for medical documents," in Proceedings of the Conference of the Association for Computational Linguistics (ACL), vol. 128, 2014.
[5] A. Karami, A. Gangopadhyay, B. Zhou, and H. Kharrazi, "FLATM: A fuzzy logic approach topic model for medical documents," in Proceedings of the Annual Meeting of the North American Fuzzy Information Processing Society (NAFIPS). IEEE, 2015.
[6] C. C. Aggarwal and C. Zhai, "An introduction to text mining," in Mining Text Data. Springer, 2012, pp. 1–10.
[7] S. Wu and P. A. Flach, "Feature selection with labelled and unlabelled data," in ECML/PKDD, vol. 2, 2002, pp. 156–167.
[8] H. Liu and H. Motoda, Computational Methods of Feature Selection. CRC Press, 2007.
[9] A. Karami, H. R. Yazdani, H. S. Beiryaie, and N. Hosseinzadeh, "A risk based model for IS outsourcing vendor selection," in Proceedings of the IEEE International Conference on Information and Financial Engineering (ICIFE). IEEE, 2010, pp. 250–254.
[10] A. Karami and Z. Guo, "A fuzzy logic multi-criteria decision framework for selecting IT service providers," in Proceedings of the Hawaii International Conference on System Science (HICSS). IEEE, 2012, pp. 1118–1127.
[11] E. Hüllermeier, "Fuzzy sets in machine learning and data mining," Applied Soft Computing, vol. 11, no. 2, pp. 1493–1505, 2011.
[12] P. Cunningham, "Dimension reduction," Machine Learning Techniques for Multimedia, pp. 91–112, 2008.
[13] Y. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," in ICML, vol. 97, 1997, pp. 412–420.
[14] L. Gao, J. Song, X. Liu, J. Shao, J. Liu, and J. Shao, "Learning in high-dimensional multimedia data: the state of the art," Multimedia Systems, vol. 23, no. 3, pp. 303–313, 2017.
[15] S. Mika, G. Ratsch, J. Weston, B. Scholkopf, and K.-R. Mullers, "Fisher discriminant analysis with kernels," in Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop. IEEE, 1999, pp. 41–48.
[16] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, p. 788, 1999.
[17] X. He, D. Cai, and P. Niyogi, "Laplacian score for feature selection," in Advances in Neural Information Processing Systems, 2006, pp. 507–514.
[18] H. Abdi and L. J. Williams, "Principal component analysis," Wiley Interdisciplinary Reviews: Computational Statistics, vol. 2, no. 4, pp. 433–459, 2010.
[19] L. Van Der Maaten, E. Postma, and J. Van den Herik, "Dimensionality reduction: a comparative," J Mach Learn Res, vol. 10, pp. 66–71, 2009.
[20] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by latent semantic analysis," Journal of the American Society for Information Science, vol. 41, no. 6, p. 391, 1990.
[21] E. M. Sweeney, J. T. Vogelstein, J. L. Cuzzocreo, P. A. Calabresi, D. S. Reich, C. M. Crainiceanu, and R. T. Shinohara, "A comparison of supervised machine learning algorithms and feature vectors for MS lesion segmentation using multimodal structural MRI," PLoS ONE, vol. 9, no. 4, p. e95753, 2014.
[22] R. Jensen and Q. Shen, "Semantics-preserving dimensionality reduction: rough and fuzzy-rough-based approaches," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 12, pp. 1457–1471, 2004.
[23] N. Mac Parthaláin and R. Jensen, "Unsupervised fuzzy-rough set-based dimensionality reduction," Information Sciences, vol. 229, pp. 106–121, 2013.
[24] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[25] H.-J. Zimmermann, "Fuzzy set theory," Wiley Interdisciplinary Reviews: Computational Statistics, vol. 2, no. 3, pp. 317–332, 2010.
[26] L. A. Zadeh, "Outline of a new approach to the analysis of complex systems and decision processes," IEEE Transactions on Systems, Man and Cybernetics, no. 1, pp. 28–44, 1973.
[27] A. Karami, A. Gangopadhyay, B. Zhou, and H. Kharrazi, "Fuzzy approach topic discovery in health and medical corpora," International Journal of Fuzzy Systems, pp. 1–12, 2017.
[28] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. Kluwer Academic Publishers, 1981.
[29] A. Baraldi and P. Blonda, "A survey of fuzzy clustering algorithms for pattern recognition. I," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 29, no. 6, pp. 778–785, 1999.
[30] ——, "A survey of fuzzy clustering algorithms for pattern recognition. II," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 29, no. 6, pp. 786–801, 1999.
[31] I. S. Dhillon and D. S. Modha, "Concept decompositions for large sparse text data using clustering," Machine Learning, vol. 42, no. 1, pp. 143–175, 2001.
[32] B. F. Chimieski and R. D. R. Fagundes, "Association and classification data mining algorithms comparison over medical datasets," Journal of Health Informatics, vol. 5, no. 2, 2013.
[33] D. R. Rao, V. Pellakuri, S. Tallam, and T. R. Harika, "Performance analysis of classification algorithms using healthcare dataset," International Journal of Computer Science and Information Technologies, vol. 6, no. 2, pp. 1103–1106, 2015.
[34] X. Wu, V. Kumar, J. R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, S. Y. Philip et al., "Top 10 algorithms in data mining," Knowledge and Information Systems, vol. 14, no. 1, pp. 1–37, 2008.
[35] R. Caruana and A. Niculescu-Mizil, "An empirical comparison of supervised learning algorithms," in Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006, pp. 161–168.
[36] Y. Qi, Z. Bar-Joseph, and J. Klein-Seetharaman, "Evaluation of different biological data and computational classification methods for use in protein interaction prediction," Proteins: Structure, Function, and Bioinformatics, vol. 63, no. 3, pp. 490–500, 2006.
[37] J. Baglama and L. Reichel, "irlba: Fast truncated singular value decomposition and principal components analysis for large dense and sparse matrices," R package version 2.2.1, 2017.
[38] K. Hornik, "Spherical k-means clustering," R package version 0.2-10, 2017.
[39] N. Halko, P.-G. Martinsson, and J. A. Tropp, "Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions," 2009.
[40] I. S. Dhillon, Y. Guan, and J. Kogan, "Iterative clustering of high dimensional text data augmented by local search," in IEEE International Conference on Data Mining (ICDM). IEEE, 2002, pp. 131–138.
[41] R. J. Campello and E. R. Hruschka, "A fuzzy extension of the silhouette width criterion for cluster analysis," Fuzzy Sets and Systems, vol. 157, no. 21, pp. 2858–2875, 2006.
[42] X. L. Xie and G. Beni, "A validity measure for fuzzy clustering,"