EEF: Exponentially Embedded Families with Class-Specific Features for Classification
Bo Tang, Student Member, IEEE, Steven Kay, Fellow, IEEE, Haibo He, Senior Member, IEEE, and Paul M. Baggenstoss, Senior Member, IEEE
Abstract—In this letter, we present a novel exponentially embedded families (EEF) based classification method, in which the probability density function (PDF) on raw data is estimated from the PDF on features. With this PDF construction, we show that class-specific features can be used in the proposed classification method, instead of a common feature subset for all classes as used in conventional approaches. We apply the proposed EEF classifier to text categorization as a case study and derive an optimal Bayesian classification rule with class-specific feature selection based on the Information Gain (IG) score. The promising performance on real-life data sets demonstrates the effectiveness of the proposed approach and indicates its wide potential for applications.
Index Terms—Exponentially embedded families, class-specific features, feature selection, text categorization, probability density function estimation, naive Bayes.
Copyright (c) 2015 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected]. This work was supported in part by the National Science Foundation (NSF) under grants ECCS 1053717 and CCF 1439011, and by the Army Research Office under grant W911NF-12-1-0378. Bo Tang, Steven Kay, and Haibo He are with the Department of Electrical, Computer and Biomedical Engineering, University of Rhode Island, Kingston, RI 02881, USA (e-mail: {btang, kay, he}@ele.uri.edu). Paul M. Baggenstoss is with Fraunhofer FKIE, 53343 Wachtberg, Germany (e-mail: [email protected]).

I. INTRODUCTION
Classification is one of the fundamental problems in the fields of machine learning and signal processing. The commonly used classifier assigns a sample or a signal to the class with maximum posterior probability, which usually requires probability density function (PDF) estimation in either a model-driven or a data-driven manner [1] [2] [3]. For high-dimensional data sets, it is necessary to perform feature reduction to estimate the PDFs robustly in a low-dimensional feature subspace. However, feature reduction may lose pertinent information for discrimination. For example, data samples from different classes that could be well separated in the raw data space may overlap in the feature subspace, causing classification errors.

The PDF reconstruction approach addresses this information loss by reconstructing the PDF on the raw data and performing classification in the raw data space, which can improve classification performance. Several approaches have been developed along this track. Moghaddam et al. [4] [5] use an eigenspace decomposition to approximate the high-dimensional raw data PDF, where the raw data space is divided into two complementary subspaces using Principal Components Analysis (PCA): the principal subspace (distance in feature space) and the orthogonal complement subspace (distance from feature space). While the PDF in the low-dimensional principal subspace is estimated using training data, the PDF in the complementary subspace is approximated with the PCA residual error. The estimated PDF in the raw data space is then written as the product of these two PDFs. More recently, researchers have applied Bayesian partitioning techniques to estimate the distribution in high-dimensional data space. For example, Wong and Ma [6] developed the Optional Pólya Tree (OPT) to construct a prior distribution, and Lu et al. [7] derived a closed form of the posterior probability using Bayesian sequential partitioning.

The PDF Projection Theorem (PPT) [8] offers another solution for distribution construction, which projects the PDF in the feature subspace back to the raw data space. It can be shown that all PDFs that generate the given feature PDF can be constructed with the PPT by selecting a suitable reference hypothesis. The generality of the PPT makes it well suited to classification, avoiding the "curse of dimensionality" [8] [9]. It also allows class-specific features, that is to say, each class can have its own feature transformation function. Class-specific features offer many advantages for multi-class classification. For example, class-specific features carry much more discriminative information from the original raw data, because each class can select the most discriminative features against the other classes. This characteristic distinguishes the PPT from many other classification methods, which usually need to incorporate a one-vs-all classification scheme [10] or to build hierarchical multi-classifiers [11] [12] in order to use class-specific features.

The exponentially embedded family (EEF) [13] is related to the PPT. Like the PPT, the EEF is based on the estimated feature PDF and a specified reference hypothesis. While the PPT produces a raw data PDF that reproduces the given feature PDF exactly, the EEF combines one or more PDFs constructed with the PPT in a geometric mixture with the reference hypothesis. The raw data PDF constructed using the EEF reproduces the moments of a log-likelihood ratio statistic. This statistic can be easily estimated in the feature space and is directly linked to class separability. Thus, while the PPT may be preferred for general PDF estimation, produces PDFs that are easily sampled, and offers maximum entropy optimality [14], the EEF can be preferable in classifier design since it directly targets class separability.

In this letter, we apply the EEF to the class-specific classification problem and show that the EEF can attain even higher classification performance than the PPT.
Using the constructed PDF on raw data, we derive a Bayesian classifier with class-specific features, termed the EEF classifier, and apply it to text categorization as a case study. The experimental results on real-life benchmarks show the superior classification performance of the proposed EEF classifier and further indicate many potential applications in machine learning and signal processing.

II. EEF CLASSIFIER WITH CLASS-SPECIFIC FEATURES
A. Background: Bayesian Classifier with Feature Reduction
Consider an $N$-class classification task in which a data sample $\mathbf{x} \in \mathbb{R}^D$ is to be classified into one of $N$ classes $c_i$, $i = 1, 2, \cdots, N$. The optimal Bayesian classifier with minimum probability of error for this task is the maximum a posteriori (MAP) rule, which assigns to $\mathbf{x}$ the class $c^*$ with maximum posterior probability:

$$c^* = \arg\max_{i \in \{1,2,\cdots,N\}} p(c_i|\mathbf{x}) = \arg\max_{i \in \{1,2,\cdots,N\}} p(\mathbf{x}|c_i)\,p(c_i) \quad (1)$$

where $p(\mathbf{x}|c_i)$ is the likelihood of observing $\mathbf{x}$ in class $c_i$, and $p(c_i)$ is the prior probability of class $c_i$. Usually the class-wise distribution $p(\mathbf{x}|c_i)$ is unknown and needs to be estimated from training data. For high-dimensional data, it is impractical to estimate $p(\mathbf{x}|c_i)$ accurately when the given training data is limited. In this case, one usually reduces the sample $\mathbf{x}$ via a feature transformation $\mathbf{z} = f(\mathbf{x})$, where $\mathbf{z} \in \mathbb{R}^K$ is called the feature of $\mathbf{x}$ and the dimension of $\mathbf{z}$ is far less than that of $\mathbf{x}$, i.e., $K \ll D$. By doing so, the estimation of $p(\mathbf{z}|c_i)$ in the feature subspace is simplified. Using the MAP rule in the feature subspace, we have

$$c^* = \arg\max_{i \in \{1,2,\cdots,N\}} p(\mathbf{z}|c_i)\,p(c_i) \quad (2)$$

This feature-based Bayesian classifier forces one to choose between (a) sufficient feature information but too high a dimension, or (b) a manageable feature dimension but insufficient feature information. Consequently, Eq. (2) is in general not equivalent to Eq. (1). We seek to avoid this compromise by working in the raw data space and estimating $p(\mathbf{x}|c_i)$ without incurring the dimensionality problem caused by the need for a common feature set. A minimal sketch of the rule in Eq. (2) is given below.
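As a minimal illustration of the rule in Eq. (2), the following sketch picks the class with the largest log posterior. It assumes the per-class feature log-likelihoods have already been estimated; the function and argument names are ours, not from the letter.

```python
import numpy as np

def map_classify(log_lik: np.ndarray, log_prior: np.ndarray) -> int:
    """Eq. (2): log_lik[i] = ln p(z|c_i), log_prior[i] = ln p(c_i)."""
    # arg max over classes of the (log) posterior p(z|c_i) p(c_i)
    return int(np.argmax(log_lik + log_prior))
```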
B. EEF for PDF Construction

In this subsection, we show that the raw data PDF $p(\mathbf{x}|c_i)$ can be constructed from the feature PDF $p(\mathbf{z}|c_i)$ using the EEF. First, we define a smoothing reference hypothesis $c_0$ (e.g., the union of all classes is used as $c_0$ in our case study), which is non-committal with respect to the $N$ classes. Next we define a log-likelihood ratio statistic $T(\mathbf{x}) = \ln[p(f(\mathbf{x})|c_i)/p(f(\mathbf{x})|c_0)] = \ln[p(\mathbf{z}|c_i)/p(\mathbf{z}|c_0)]$, which is a measure of the discriminative power between the given class and the reference hypothesis.

Mathematically, using the EEF [13] [15], we estimate the PDF $p(\mathbf{x}|c_i)$ for class $c_i$ in the raw data space as follows:

$$p(\mathbf{x}|c_i;\theta) = \exp\left(\theta \ln\frac{p(\mathbf{z}|c_i)}{p(\mathbf{z}|c_0)} - K(\theta) + \ln p(\mathbf{x}|c_0)\right) \quad (3)$$

where $\theta$ is the embedding parameter and $K(\theta)$ is the cumulant generating function, given by

$$K(\theta) = \ln \int \exp\left(\theta \ln\frac{p(\mathbf{z}|c_i)}{p(\mathbf{z}|c_0)}\right) p(\mathbf{x}|c_0)\,d\mathbf{x} = \ln E_{p_0}\left[\exp\left(\theta \ln\frac{p(\mathbf{z}|c_i)}{p(\mathbf{z}|c_0)}\right)\right] \quad (4)$$

where $E_{p_0}[\cdot]$ denotes the expectation with respect to the distribution $p_0 = p(\mathbf{x}|c_0)$. Note that for $\theta = 1$ we have $K(\theta) = 0$ and $p(\mathbf{x}|c_i) = \frac{p(\mathbf{x}|c_0)}{p(\mathbf{z}|c_0)}\,p(\mathbf{z}|c_i)$, which is the PPT [8].

As discussed above, the motivation of the PDF construction in Eq. (3) is to effectively smooth the constructed density by minimizing the KL-divergence from $p(\mathbf{x}|c_i;\theta)$ to the smooth and non-committal reference PDF $p(\mathbf{x}|c_0)$, subject to moment matching for the statistic $T(\mathbf{x}) = \ln[p(\mathbf{z}|c_i)/p(\mathbf{z}|c_0)]$, i.e., $E_{\hat{p}}[T(\mathbf{x})] = E_p[T(\mathbf{x})]$, where $\hat{p}$ denotes the PDF $p(\mathbf{x}|c_i;\theta)$ in Eq. (3) and $p$ denotes the true PDF $p(\mathbf{x}|c_i)$. The following theorem [16] formalizes this motivation.

Theorem 1:
Let $p_0(\mathbf{x})$ be the reference distribution and $p(\mathbf{x})$ be the true distribution to be estimated. Given a measurable statistic $T(\mathbf{x})$ such that both $\lambda = \int T(\mathbf{x})\,p(\mathbf{x})\,d\mathbf{x}$ and $M(\theta) = \int \exp(\theta T(\mathbf{x}))\,p_0(\mathbf{x})\,d\mathbf{x}$ exist, the estimate $\hat{p}(\mathbf{x})$ with minimum KL-divergence $KL(\hat{p}\,\|\,p_0)$ subject to the moment constraint $E_{\hat{p}}[T(\mathbf{x})] = \lambda$ is

$$\hat{p}(\mathbf{x};\theta) = \exp\left(\theta T(\mathbf{x}) - \ln M(\theta) + \ln p_0(\mathbf{x})\right) \quad (5)$$

Proof:
The proof of this theorem is given by Kullback [16], and its applicability has also been demonstrated in our previous work [13] [17] [18].

In the EEF, it is better to choose a reference distribution that is smooth and non-committal with respect to the $N$ classes. The reference hypothesis consisting of the union of all classes is a good one, and can be considered the geometric center of the PDFs of all classes [18]. The embedding parameter $\theta$ specifies the constructed PDF that has minimum KL-divergence to the reference distribution under the moment-matching constraint. For each class, the optimal embedding parameter $\theta_i^*$ can be estimated using the MLE criterion:

$$\theta_i^* = \arg\max_{\theta \in \Theta}\; \theta \ln\frac{p(\mathbf{z}|c_i)}{p(\mathbf{z}|c_0)} - K(\theta), \quad i = 1, 2, \cdots, N \quad (6)$$

Since the cumulant generating function $K(\theta)$ is strictly convex and differentiable, the objective in Eq. (6) is concave and the optimal embedding parameter $\theta_i^*$ can be easily found; a numerical sketch is given below.
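As an illustration of how Eq. (6) can be solved numerically, the sketch below approximates $K(\theta)$ by a sample average of $\exp(\theta T(\mathbf{x}))$ over reference-class samples, then maximizes the concave objective by a bounded scalar search. The search interval $\Theta = [0, 1]$ and all names are our assumptions, not specifications from the letter.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.optimize import minimize_scalar

def fit_embedding_parameter(T_class: float, T_ref: np.ndarray) -> float:
    """Solve Eq. (6) for one class (a sketch, not the authors' code).

    T_class: the statistic ln p(z|c_i)/p(z|c_0) evaluated on class-i data.
    T_ref:   the same statistic on reference-class samples, whose sample
             average approximates K(theta) = ln E_{p0}[exp(theta T)].
    """
    def neg_objective(theta: float) -> float:
        # K(theta) ~= logsumexp(theta * T_ref) - ln(number of samples)
        K = logsumexp(theta * T_ref) - np.log(len(T_ref))
        return -(theta * T_class - K)  # negate to maximize the concave objective

    # Theta = [0, 1] is assumed; concavity makes the bounded search global.
    return minimize_scalar(neg_objective, bounds=(0.0, 1.0), method="bounded").x
```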
C. EEF for Classification with Class-Specific Features

The PDF construction on raw data from the PDF on features allows class-specific features for classification. Let $f_i(\mathbf{x})$ be the feature transformation for class $i$, so that we have class-specific features $\mathbf{z}_i = f_i(\mathbf{x})$ for $i = 1, 2, \cdots, N$. Using Eq. (3), for each class we can always construct the PDF $p(\mathbf{x}|c_i;\theta_i)$ in the raw data space from the PDF in the class-specific feature space $p(\mathbf{z}_i|c_i)$. Applying the MAP rule, we make classification decisions as follows:

$$c^* = \arg\max_{i \in \{1,2,\cdots,N\}}\; \theta_i \ln\frac{p(\mathbf{z}_i|c_i)}{p(\mathbf{z}_i|c_0)} - K(\theta_i) + \ln p(c_i) \quad (7)$$

We note here that by using a common reference distribution in the PDF construction, the classifier given by Eq. (7) can be evaluated without actually measuring the raw data $\mathbf{x}$. Nevertheless, Eq. (7) is based on an implied raw data PDF. One could apply a different reference distribution to the PDF construction of each class, which would require measuring $\mathbf{x}$, but this is not explored in this letter. A sketch of the decision rule in Eq. (7) is given below.
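The following minimal sketch of the decision rule in Eq. (7) assumes the caller supplies, for each class, the feature map $f_i$, the feature-space log-likelihood ratio, the fitted embedding parameter, and the cumulant generating function; all names are illustrative.

```python
import numpy as np

def eef_classify(x, feature_maps, log_ratios, thetas, cgfs, log_priors) -> int:
    """Eq. (7): score each class with its own features; return the arg max.

    feature_maps[i](x) -> z_i              (class-specific transformation f_i)
    log_ratios[i](z_i) -> ln p(z_i|c_i)/p(z_i|c_0)
    cgfs[i](theta)     -> K(theta) for class i
    """
    scores = [
        thetas[i] * log_ratios[i](f(x)) - cgfs[i](thetas[i]) + log_priors[i]
        for i, f in enumerate(feature_maps)
    ]
    return int(np.argmax(scores))
```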
III. STUDY CASE: EEF CLASSIFIER FOR TEXT CATEGORIZATION
In this section, we apply the proposed EEF classifier to text categorization, in which the multinomial naive Bayes (MNB) model is used as the classifier. In Fig. 1, we illustrate the difference between our EEF classifier and the conventional classifier for text categorization. Using the "bag-of-words" model, a document is transformed into a real-valued vector through a dictionary that consists of all distinct words or phrases in a data set. Each element of this vector denotes the number of occurrences of the corresponding word in the document. Because of its high dimensionality, it is necessary to perform feature reduction to reduce the computational burden of training a classifier. Feature selection is a commonly used method for feature reduction in text categorization. In conventional approaches, a feature importance measure, such as information gain (IG) [19] or maximum discrimination (MD) [20], is first employed to calculate the feature importance for each individual class, and then a global function, such as the sum or a weighted average, is applied to rank features and select a common feature subset for all classes. In contrast, we rank features for each class and apply the class-specific features for classification. (The bag-of-words step is sketched below.)
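As a small, self-contained illustration of the bag-of-words step described above (the toy corpus and the use of scikit-learn are ours, not from the letter):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus for illustration only: each document becomes a vector of
# word counts over the dictionary of distinct terms learned from the corpus.
docs = ["grain prices rose sharply",
        "oil prices fell",
        "grain exports rose"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # shape: (n_documents, n_terms)

print(vectorizer.get_feature_names_out())   # the learned dictionary
print(X.toarray())                          # word-occurrence vectors
```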
A. PDF Construction
In MNB, the features (word occurrences) of each class follow a class-specific multinomial distribution. Let $\mathbf{x} \in \mathbb{R}^D$ be the raw feature vector transformed from the document. Then for each class $c_i$, $i = 1, 2, \cdots, N$, we have a multinomial distribution $p(\mathbf{x}|c_i)$ with $D$ parameters (cell probabilities) $[p_{i,1}, \cdots, p_{i,D}]$. The likelihood of observing a document $\mathbf{x}$ in class $c_i$, conditioned on its document length $l$, is given by

$$p(\mathbf{x}|c_i, l) = \frac{l!}{x_1!\, x_2! \cdots x_D!} \prod_{k=1}^{D} p_{i,k}^{x_k} \quad (8)$$

where $\sum_{k=1}^{D} p_{i,k} = 1$ and $\sum_{k=1}^{D} x_k = l$. (Note that the likelihood $p(\mathbf{x}|c_i, l)$ is conditioned on the document length $l$. This differs from the conventional MNB classifier for text categorization, in which the document length is usually assumed to be constant, i.e., $p(\mathbf{x}|c_i, l) = p(\mathbf{x}|c_i)$.)

Fig. 1: The flow chart of our EEF classifier with class-specific features for text categorization (right), compared with the conventional approach (left).

Suppose that feature selection selects $K$ out of $D$ features. Denote by $\mathbf{z}_i$ the feature vector in class $c_i$ and by $I_i = [n_{i1}, \cdots, n_{iK}]$ the corresponding feature indexes in $\mathbf{x}$, such that $z_{ik} = x_{n_{ik}}$. Note that the marginal distribution $p(\mathbf{z}_i|c_i)$ is still multinomial, but with $K+1$ cells. The $(K+1)$-st feature is the combination of all features in $\mathbf{x}$ other than the $K$ selected ones, so that $p(\mathbf{z}_i|c_i)$ has $K+1$ cell probabilities $[p'_{i,1}, \cdots, p'_{i,K}, p'_{i,K+1}]$, where $p'_{i,k} = p_{i,n_{ik}}$ for $k = 1, 2, \cdots, K$ and $p'_{i,K+1} = 1 - \sum_{k=1}^{K} p'_{i,k}$.

We denote by $c_0$ the reference class, which consists of all given training data, so that the reference distribution $p(\mathbf{x}|c_0)$ is also multinomial with $D$ parameters $[p_{0,1}, \cdots, p_{0,D}]$, each of which can be written as

$$p_{0,k} = \sum_{i=1}^{N} p_{i,k}\, p(c_i), \quad k = 1, 2, \cdots, D \quad (9)$$

Using the general construction in Eq. (3), we construct the PDF $p(\mathbf{x}|c_i)$ for class $c_i$, $i = 1, 2, \cdots, N$, as follows:

$$p(\mathbf{x}|c_i, l; \theta_i) = \exp\left[\theta_i \sum_{k=1}^{K} z_{ik}\beta_{ik} - K(\theta_i, l) + \ln p(\mathbf{x}|c_0)\right] \quad (10)$$

where

$$K(\theta_i, l) = l \ln\left(\sum_{k=1}^{K} p'_{0,k} \exp(\theta_i \beta_{ik}) + \left(1 - \sum_{k=1}^{K} p'_{0,k}\right)\right) \quad (11)$$

and

$$\beta_{ik} = \ln\frac{p'_{i,k}}{p'_{0,k}} - \ln\frac{p'_{i,K+1}}{p'_{0,K+1}} \quad (12)$$

Note that we obtain a closed-form solution of the PDF construction in the original high-dimensional space of $\mathbf{x}$, as shown in Eq. (10) to Eq. (12). The detailed derivation is provided in our Supplemental Material.

Given an $N$-class training data set $X = X_1 \cup X_2 \cup \cdots \cup X_N$, each class consists of $M_i$ documents $X_i = \{\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_{M_i}\}$, and each document $\mathbf{x}_m$ has a length of $l_m = \sum_{k=1}^{D} x_{mk}$, where $x_{mk}$ is the $k$-th element of $\mathbf{x}_m$. We use the MLE to estimate the optimal embedding parameter:

$$\theta_i^* = \arg\max_{\theta_i \in \Theta}\; \theta_i \sum_{k=1}^{K} \bar{z}_{ik}\beta_{ik} - K(\theta_i, \bar{l}) \quad (13)$$

where $\bar{z}_{ik}$ and $\bar{l}$ are, respectively, the average occurrence count of the $k$-th selected feature and the average document length over the training set $X_i$ of class $c_i$. Although it is difficult to find an analytic solution for $\theta_i^*$, it can easily be found using convex optimization techniques, since the objective function is concave with respect to $\theta_i$; a numerical sketch is given below.
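To make the construction concrete, the sketch below computes $\beta_{ik}$ from Eq. (12), evaluates $K(\theta_i, l)$ from Eq. (11), and solves Eq. (13) by a bounded concave maximization. The cell probabilities are assumed to be estimated elsewhere (e.g., by smoothed relative frequencies), and the interval $\Theta = [0, 1]$, like all names here, is our assumption.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_theta_multinomial(p_i, p_0, z_bar, l_bar):
    """Fit theta_i for one class via Eq. (13) (a sketch, not the authors' code).

    p_i, p_0: length-(K+1) cell probabilities for class i and the reference
              class c_0; the last cell pools all unselected terms.
    z_bar:    length-K mean counts of the selected terms over class-i docs.
    l_bar:    mean document length over class-i docs.
    """
    K = len(z_bar)
    beta = np.log(p_i[:K] / p_0[:K]) - np.log(p_i[K] / p_0[K])     # Eq. (12)

    def cgf(theta):                                                # Eq. (11)
        return l_bar * np.log(np.sum(p_0[:K] * np.exp(theta * beta))
                              + (1.0 - np.sum(p_0[:K])))

    def neg_obj(theta):                                            # Eq. (13)
        return -(theta * np.dot(z_bar, beta) - cgf(theta))

    # Theta = [0, 1] assumed; the objective is concave in theta.
    return minimize_scalar(neg_obj, bounds=(0.0, 1.0), method="bounded").x
```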
IV. EXPERIMENTAL RESULTS AND ANALYSIS

We use two real-life data sets, REUTERS-10 and REUTERS-20, to evaluate the performance of our proposed approach for text categorization. Both REUTERS-10 and REUTERS-20 are subsets of the ModApte version of the REUTERS collection, which consists of 8,293 documents with 65 classes (topics). More specifically, the REUTERS-10 and REUTERS-20 data sets consist of the documents from the first 10 and 20 classes, respectively.

In these two data sets, the original feature size (the number of distinct terms) is large. To reduce the feature size, we apply the IG metric [19] to evaluate the feature importance. For each class $c_i$, the score of the $k$-th feature is calculated as follows:

$$IG(t_k, c_i) = p(t_k, c_i)\log\frac{p(t_k, c_i)}{p(t_k)\,p(c_i)} + p(\bar{t}_k, c_i)\log\frac{p(\bar{t}_k, c_i)}{p(\bar{t}_k)\,p(c_i)} \quad (14)$$

where $t_k$ indicates that the $k$-th term appears in the document, and $\bar{t}_k$ indicates that it does not. Note that $IG(t_k, c_i)$ is a class-specific feature score. In conventional approaches, a global function, e.g., the sum or average, is used to calculate class-independent feature scores for feature ranking, as shown in Fig. 1. In contrast, the class-specific feature based classifiers rank the features of each class with the score $IG(t_k, c_i)$ in Eq. (14), and use the class-specific features for classification (a minimal sketch of this score is given below).

We compare our EEF class-specific MNB classifier with three other state-of-the-art classifiers: the MNB classifier [21], the support vector machine (SVM) [22] [23], and the PPT class-specific MNB classifier [8]. While the first two are commonly used in text categorization with class-independent features, the last one and our classifier use class-specific features for classification. In the PPT class-specific MNB classifier, we use the same reference hypothesis given by Eq. (9) and the same class-specific features given by Eq. (14) as used in the EEF, and make the classification decision with the following rule:

$$c^* = \arg\max_{i \in \{1,2,\cdots,N\}} \sum_{k=1}^{K+1} z_{ik} \ln\frac{p'_{i,k}}{p'_{0,k}} + \ln p(c_i) \quad (15)$$
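The per-class IG score of Eq. (14) can be computed from binary document-term indicators, with the probabilities estimated by relative frequencies; a minimal sketch (array names are illustrative, not from the letter) follows. Ranking the vocabulary of each class by this score yields the class-specific feature sets used above.

```python
import numpy as np

def class_specific_ig(term_present: np.ndarray, in_class: np.ndarray) -> float:
    """Eq. (14): IG(t_k, c_i) from boolean per-document indicators.

    term_present[m]: term t_k occurs in document m.
    in_class[m]:     document m belongs to class c_i.
    """
    def prob(mask):                      # relative frequency, guarded
        return max(mask.mean(), 1e-12)   # against log(0)

    p_ci = prob(in_class)
    ig = 0.0
    for t_mask in (term_present, ~term_present):   # t_k present / absent
        p_t = prob(t_mask)
        p_joint = prob(t_mask & in_class)
        ig += p_joint * np.log(p_joint / (p_t * p_ci))
    return ig
```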
We report the classification results on the REUTERS-10 and REUTERS-20 data sets in Fig. 2 and Fig. 3, respectively, where the feature size ranges from 100 to 2000. The results show that our EEF class-specific MNB classifier outperforms the other three methods. For the REUTERS-10 data set, the two class-specific feature based MNB classifiers greatly improve the accuracy when the feature size is small. As the feature size increases, our EEF class-specific MNB shows a promising performance improvement with a large margin over the others. For the REUTERS-20 data set, our EEF class-specific MNB consistently performs better than the others.
Fig. 2: Classification results on REUTERS-10 (accuracy versus number of features, for SVM, MNB, PPT class-specific MNB, and EEF class-specific MNB).
Fig. 3: Classification results on REUTERS-20 (accuracy versus number of features, for the same four classifiers).

V. CONCLUSION AND FUTURE WORK
In this letter, we introduced a new PDF construction method based on the EEF to convert the feature PDF into the raw data PDF. With the constructed PDF on raw data, a Bayesian classifier with class-specific features is derived. As a case study, we applied the proposed EEF classifier to text categorization. The superior performance demonstrates the effectiveness of our proposed approach and indicates its wide potential application to machine learning and signal processing. In our future work, we will continue to explore its potential for various practical problems, which might require different and complex reference distributions. In particular, we are interested in applying sampling-based approaches to address the issue that the constructed distribution has no closed form for a complex reference distribution.

REFERENCES

[1] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. John Wiley & Sons, 2012.
[2] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[3] B. Tang and H. He, "ENN: Extended nearest neighbor method for pattern recognition [research frontier]," IEEE Computational Intelligence Magazine, vol. 10, no. 3, pp. 52-60, 2015.
[4] B. Moghaddam and A. Pentland, "Probabilistic visual learning for object representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 696-710, 1997.
[5] B. Moghaddam, T. Jebara, and A. Pentland, "Bayesian face recognition," Pattern Recognition, vol. 33, no. 11, pp. 1771-1782, 2000.
[6] W. H. Wong and L. Ma, "Optional Pólya tree and Bayesian inference," The Annals of Statistics, vol. 38, no. 3, pp. 1433-1459, 2010.
[7] L. Lu, H. Jiang, and W. H. Wong, "Multivariate density estimation by Bayesian sequential partitioning," Journal of the American Statistical Association, vol. 108, no. 504, pp. 1402-1410, 2013.
[8] P. M. Baggenstoss, "Class-specific classifier: Avoiding the curse of dimensionality," IEEE Aerospace and Electronic Systems Magazine, vol. 19, no. 1, pp. 37-52, 2004.
[9] B. Tang, H. He, P. Baggenstoss, and S. Kay, "A Bayesian classification approach using class-specific features for text categorization," IEEE Transactions on Knowledge and Data Engineering, vol. PP, no. 99, pp. 1-1, 2016.
[10] R. Rifkin and A. Klautau, "In defense of one-vs-all classification," The Journal of Machine Learning Research, vol. 5, pp. 101-141, 2004.
[11] S. Kumar, J. Ghosh, and M. Crawford, "A hierarchical multiclassifier system for hyperspectral data analysis," in Multiple Classifier Systems, 2000, pp. 270-279.
[12] G. De Lannoy, D. François, and M. Verleysen, "Class-specific feature selection for one-against-all multiclass SVMs," in European Symposium on Artificial Neural Networks, 2011, pp. 269-274.
[13] S. Kay, "Exponentially embedded families - new approaches to model order estimation," IEEE Transactions on Aerospace and Electronic Systems, vol. 41, no. 1, pp. 333-345, 2005.
[14] P. Baggenstoss, "A maximum entropy framework for feature inversion and a new class of spectral estimators," IEEE Transactions on Signal Processing, vol. 63, no. 11, 2015.
[15] S. Kay, Q. Ding, B. Tang, and H. He, "Probability density function estimation using the EEF with application to subset/feature selection," IEEE Transactions on Signal Processing, vol. 64, no. 3, pp. 641-651, 2016.
[16] S. Kullback, Information Theory and Statistics. Courier Corporation, 1997.
[17] S. Kay, Q. Ding, B. Tang, and H. He, "Probability density function estimation using the EEF with application to subset/feature selection," IEEE Transactions on Signal Processing, vol. 64, no. 3, pp. 641-651, 2016.
[18] B. Tang, H. He, Q. Ding, and S. Kay, "A parametric classification rule based on the exponentially embedded family," IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 2, pp. 367-377, 2015.
[19] Y. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," in International Conference on Machine Learning, vol. 97, 1997, pp. 412-420.
[20] B. Tang, S. Kay, and H. He, "Toward optimal feature selection in naive Bayes for text categorization," IEEE Transactions on Knowledge and Data Engineering, vol. PP, no. 99, pp. 1-1, 2016.
[21] D. D. Lewis, "Naive (Bayes) at forty: The independence assumption in information retrieval," in European Conference on Machine Learning, 1998, pp. 4-15.
[22] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.
[23] T. Joachims, "Text categorization with support vector machines: Learning with many relevant features," in European Conference on Machine Learning, 1998, pp. 137-142.