Hierarchical Multiclass Decompositions with Application to Authorship Determination
Ran El-Yaniv [email protected]
Department of Computer Science, Technion - Israel Institute of Technology
Noam Etzion-Rosenberg [email protected]
Babylon Ltd., 10 Hataasia St., Or-Yehuda, Israel
Abstract
This paper is mainly concerned with the question of how to decompose multiclass classification problems into binary subproblems. We extend known Jensen-Shannon bounds on the Bayes risk of binary problems to hierarchical multiclass problems and use these bounds to develop a heuristic procedure for constructing hierarchical multiclass decompositions for multinomials. We test our method and compare it to the well-known "all-pairs" decomposition. Our tests are performed using a new authorship determination benchmark test of machine learning authors. The new method consistently outperforms the all-pairs decomposition when the number of classes is small and breaks even on larger multiclass problems. Using both methods, the classification accuracy we achieve, using an SVM over a feature set consisting of both high frequency single tokens and high frequency token-pairs, appears to be exceptionally high compared to known results in authorship determination.
1. Introduction
In this paper we consider the problem of decomposing multiclass classification problems into binary ones. While binary classification is quite well explored, the question of multiclass classification is still rather open and has recently attracted considerable attention of both machine learning theorists and practitioners. A number of general decomposition schemes have emerged, including 'error-correcting output coding' (Dietterich & Bakiri, 1995; Sejnowski & Rosenberg, 1987), the more general 'probabilistic embedding' (Dekel & Singer, 2002) and 'constraint classification' (Har-Peled et al., 2002). Nevertheless, practitioners are still mainly using the infamous 'one-vs-rest' decomposition, whereby an individual binary "soft" (or confidence-rated) classifier is trained to distinguish between each class and the union of the other classes; then, for classifying an unseen instance, all classifiers are applied and the winner classifier, with the largest confidence for one of the classes, determines the classification. Another less commonly known method is the so-called 'all-pairs' (or 'one-vs-one') decomposition proposed by Friedman (1996). In this method we train one binary classifier for each pair of classes. To classify a new instance we run a majority vote among all binary classifiers. The nice property of the all-pairs method is that it generates the easiest and most natural binary problems of all known methods. The weakness of this method is that there may be irrelevant binary classifiers which participate in the vote. A number of papers provide evidence that all-pairs decompositions are powerful and efficient and, in particular, that they outperform the one-vs-rest method; see e.g. (Fürnkranz, 2002).

For the most part, known decomposition methods, including all those mentioned above, are "flat". In this paper we focus on hierarchical decompositions. The incentive to decompose a multiclass problem as a hierarchy is natural and can have at the outset general advantages which are both statistical and computational. Considering a multiclass problem with k classes, the idea is to learn a full binary tree of classes (in a full binary tree each node is either a leaf or has two children), where each node is associated with a subset of the k classes as follows: each of the k leaves is associated with a distinct class, and each internal node is associated with the union of the class subsets of its right and left children. Each such tree defines a hierarchical partition of the set of classes, and the idea is to train a binary classifier for each internal node so as to discriminate between the class subset of the right child and the class subset of the left child. Note that in a full binary tree with k leaves there are k − 1 internal nodes and therefore k − 1 hard binary classifiers giving labels in {±1}. When using "soft" (confidence-rated), and in particular probabilistic, classifiers giving confidence rates in [0, 1], each of these binary problems has an inherent degree of difficulty, namely its Bayes error. We attempt to use the Bayes error of the resulting decomposition and aim to hierarchically decompose the multiclass problem so as to construct a statistically "easy" collection of binary problems.

Determining the Bayes error of a classification problem based on the data (and without knowledge of the underlying distributions) is a hard problem without any restrictions (Antos et al., 1999). In this paper we restrict ourselves to settings where the underlying distributions can be faithfully modelled as multinomials. Potential application areas are classification of natural language, biological sequences, etc. We can therefore in principle conveniently rely on studies which offer efficient and reliable density estimation for multinomials (Friedman & Singer, 1998; Griffiths & Tenenbaum, 2002; McAllester & Schapire, 2000; Ristad, 1998).
As a first approximation, throughout this paper we make the assumption that we hold "ideal" data samples and simply rely on maximum likelihood estimators that count occurrences. But even if the underlying distributions are known, a faithful estimation of the Bayes error is computationally difficult. We rely on known information-theoretic bounds on the Bayes error, which can be efficiently computed. In particular, we use Bayes error bounds in terms of the Jensen-Shannon divergence (Lin, 1991) and we derive upper and lower bounds on the inherent classification difficulty of hierarchical multiclass decompositions. Our bounds, which are tight in the worst case, can be used as optimality measures for such decompositions. Unfortunately, the translation of our bounds into provably efficient algorithms to search for high quality decompositions appears at the moment computationally difficult. Therefore, we use a simple and efficient greedy heuristic, which is able to generate reasonable decompositions.

We provide an initial empirical evaluation of our methods and test them on multiclass problems of varying sizes in the application area of 'authorship determination'. Our hierarchical decompositions consistently improve on the 'all-pairs' method when the number of classes is small but do not outperform all-pairs with larger numbers of classes. The authorship determination set of problems we consider is taken from a new benchmark collection consisting of machine learning authors. The absolute accuracy results we obtain are particularly high compared to standard results in this area.
2. Preliminaries: Bounds on the Bayes Error and the Jensen-Shannon Divergence
Consider a standard binary classification problem of classifying an observation given by the random variable $X$ into one of two classes $C_1$ and $C_2$. Let $\pi_1$ and $\pi_2$ denote the priors on these two classes, $\pi_1 + \pi_2 = 1$ with $\pi_i \ge 0$. Let $p_i(x) = p(X = x \mid C_i)$, $i = 1, 2$, be the class-conditional probabilities. If $X = x$ is observed then by Bayes rule the posterior probability of $C_i$ is $p(C_i \mid x) = \pi_i p_i(x) / (\pi_1 p_1(x) + \pi_2 p_2(x))$. If all probabilities are known we can achieve the Bayes error by choosing the class with the larger posterior probability. Thus, the smallest error probability is
\[
p(\mathrm{error} \mid x) = \frac{\min\{\pi_1 p_1(x), \pi_2 p_2(x)\}}{\pi_1 p_1(x) + \pi_2 p_2(x)},
\]
and the Bayes error is given by
\[
p_{\mathrm{Bayes}} = p(\mathrm{error}) = \int_x p(x)\, p(\mathrm{error} \mid x)\, dx = \int_x \min\{\pi_1 p_1(x), \pi_2 p_2(x)\}\, dx.
\]
The Bayes error quantifies the inherent difficulty of the classification problem at hand (given the entire probabilistic characterization of the problem) without any considerations of inductive approximation based on finite samples. In this paper we attempt to decompose multi-class problems into hierarchically ordered collections of binary problems so as to minimize the Bayes error of the entire construction.

Let $P_1$ and $P_2$ be two distributions over some finite set $\mathcal{X}$, and let $\pi = (\pi_1, \pi_2)$ be their priors. Then the Jensen-Shannon (JS) divergence (Lin, 1991) of $P_1$ and $P_2$ with respect to the prior $\pi$ is
\[
JS_\pi(P_1, P_2) = H(\pi_1 P_1 + \pi_2 P_2) - \pi_1 H(P_1) - \pi_2 H(P_2), \tag{1}
\]
where $H(\cdot)$ is the Shannon entropy. It can be shown that $JS_\pi(P_1, P_2)$ is non-negative, symmetric, bounded (by $H(\pi)$), and that it equals zero if and only if $P_1 \equiv P_2$. According to (Lin, 1991), the JS-divergence was first introduced by (Wong & You, 1985) as a dissimilarity measure for random graphs. Setting $M_\pi = \pi_1 P_1 + \pi_2 P_2$ it is easy to see (Lin, 1991) that
\[
JS_\pi(P_1, P_2) = \pi_1 D_{KL}(P_1 \,\|\, M_\pi) + \pi_2 D_{KL}(P_2 \,\|\, M_\pi), \tag{2}
\]
where $D_{KL}(\cdot\|\cdot)$ is the Kullback-Leibler divergence (Cover & Thomas, 1991). The average distribution $M_\pi$ is called the mutual source of $P_1$ and $P_2$ (El-Yaniv et al., 1997), and it can be easily shown that
\[
M_\pi = \arg\min_Q \; \pi_1 D_{KL}(P_1 \,\|\, Q) + \pi_2 D_{KL}(P_2 \,\|\, Q). \tag{3}
\]
That is, the mutual source of $P_1$ and $P_2$ is the closest to both of them simultaneously in terms of the KL-divergence. Like the KL-divergence, the JS-divergence has a number of important roles in statistics and pattern recognition. In particular, the JS-divergence, compared against a threshold, is an optimal statistical test in the Neyman-Pearson sense (Lehmann, 1959) for the two-sample problem (Gutman, 1989).

Lower and upper bounds on the binary Bayes error are given by (Lin, 1991). Again, let $\pi = (\pi_1, \pi_2)$ be the priors and $p_1, p_2$ the class conditionals, as defined above. Let $p(\mathrm{error})$ be the Bayes error. Set $J = H(\pi) - JS_\pi(p_1, p_2)$, with $H(\pi)$ denoting the binary entropy.

Theorem 1 (Lin)
\[
\tfrac{1}{4} J^2 \;\le\; p(\mathrm{error}) \;\le\; \tfrac{1}{2} J. \tag{4}
\]
These bounds are generalized to $k$ classes in a straightforward manner. Considering a multiclass problem with $k$ classes, class-conditionals $p_1, \ldots, p_k$ and priors $\pi = (\pi_1, \ldots, \pi_k)$, the Bayes error is given by
\[
p(\mathrm{error}_k) = \int_x p(x)\left(1 - \max\{p(C_1 \mid x), \ldots, p(C_k \mid x)\}\right) dx.
\]
Now, setting $J_k = H(\pi) - JS_\pi(p_1, \ldots, p_k)$ we have

Theorem 2 (Lin)
\[
\frac{1}{4(k-1)} J_k^2 \;\le\; p(\mathrm{error}_k) \;\le\; \tfrac{1}{2} J_k. \tag{5}
\]
Given the true class-conditionals, these JS bounds on the Bayes error can be efficiently computed using either (1) or (2) (or their generalized forms).
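To make these quantities concrete, here is a minimal Python sketch (our own illustration, not code from the paper) that computes the prior-weighted JS-divergence of Equation (1) and the two sides of Theorem 1 for a pair of multinomials; all function names are ours.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, ignoring zero entries."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def js_divergence(p1, p2, pi=(0.5, 0.5)):
    """Prior-weighted Jensen-Shannon divergence, Equation (1)."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    m = pi[0] * p1 + pi[1] * p2          # the mutual source M_pi
    return entropy(m) - pi[0] * entropy(p1) - pi[1] * entropy(p2)

def lin_bounds(p1, p2, pi=(0.5, 0.5)):
    """Lower and upper bounds of Theorem 1, with J = H(pi) - JS_pi(p1, p2)."""
    j = entropy(pi) - js_divergence(p1, p2, pi)
    return 0.25 * j**2, 0.5 * j

# Example: two multinomials over a 3-symbol alphabet, uniform prior.
lo, hi = lin_bounds([0.7, 0.2, 0.1], [0.1, 0.3, 0.6])
print(f"{lo:.4f} <= Bayes error <= {hi:.4f}")
```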
3. Bounds on the Bayes Error of Hierarchical Decompositions
In this section we provide bounds on the Bayes error of hierarchical decompositions. The bounds are obtained using a straightforward application of the binary bounds of Theorem 1. We begin with a more formal description of hierarchical decompositions.

Consider a multi-class problem with $k$ classes $\mathcal{C} = C_1, \ldots, C_k$, and let $T = (V, E)$ be any full binary tree with $k$ leaves, one for each class. To each node $v \in V$ we map a label set $\ell(v) \subseteq \mathcal{C}$, defined as follows. Each leaf $v$ (of the $k$ leaves) is mapped to a unique class (among the $k$ classes). If $v$ is an internal node whose left and right children are $v_L$ and $v_R$, respectively, then $\ell(v) = \ell(v_L) \cup \ell(v_R)$. Given the tree $T$ and the mapping $\ell$, we decompose the multi-class problem by constructing a binary classifier $h_v$ for each internal node $v$ of $T$, such that $h_v$ is trained to discriminate between classes in $\ell(v_L)$ and classes in $\ell(v_R)$. In the case of hard classifiers, $h_v(x) \in \{\pm 1\}$ and we identify '$-1$' with '$L$' and '$+1$' with '$R$'. In the case of soft classifiers, $h_v(x) \in [0, 1]$ and we identify 0 with '$L$' and 1 with '$R$'.
Since there are $k$ leaves there are exactly $k - 1$ internal nodes, each with its own label set $\ell$. Given a sample $x$ whose label (in $\mathcal{C}$) is unknown, one can think of a number of "decoding" schemes that combine the individual binary classifiers. When considering hard binary classifiers, a natural choice for aggregating the binary decisions is to start from the root $r$ and apply its associated classifier $h_r$: if $h_r(x) = -1$ we proceed to $r_L$, and otherwise we proceed to $r_R$, etc. This way we continue until we reach a leaf and predict for $x$ this leaf's associated (unique) class. In the case of soft binary classifiers, a natural decoding would be to consider, for each leaf $v$, the path from the root to $v$, and multiply the probability estimates along this path; the leaf with the largest probability then assigns a label to $x$.

There is a huge number of possible hierarchical decompositions already for moderate values of $k$. We note that a known decomposition scheme which is captured by such hierarchical constructions is the decision list multiclass decomposition approach (referred to as "ordered one-against-all class binarization" in (Fürnkranz, 2002)).

Consider a $k$-way multiclass problem with class conditionals $P_i(x) = P(x \mid C_i)$ and priors $\pi_1, \ldots, \pi_k$. Suppose we are given a decomposition structure $(T, \ell)$ for $k$ classes, consisting of the tree $T$ and the class mapping $\ell$. Each internal node $v$ of $T$ corresponds to one binary classification problem. The original multiclass problem naturally induces class-conditional probabilities and priors for the binary problem at $v$; we denote these conditionals by $p_v(x \mid v_L)$ and $p_v(x \mid v_R)$ and the prior by $\pi(v)$. For example, denoting the root of $T$ by $r$, we have
\[
p_r(r_L \mid x) = \sum_{C_i \in \ell(r_L)} p(C_i \mid x),
\]
with $p_r(x \mid r_L) = p_r(r_L \mid x)\, p(x) / \pi(r_L)$ by Bayes rule and $\pi(r_L) = \sum_{C_i \in \ell(r_L)} \pi_i$. Let $p_v(\mathrm{error})$ be the Bayes error of this problem and denote the Bayes error of the entire tree by $p_T(\mathrm{error})$.
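As an illustration of the soft decoding just described, the following sketch (our own; the Node container and the classifier interface are assumptions, not the paper's code) multiplies the probability estimates along each root-to-leaf path and predicts the leaf with the largest path probability.

```python
class Node:
    def __init__(self, classes, classifier=None, left=None, right=None):
        self.classes = classes        # the label set ell(v)
        self.classifier = classifier  # returns P(go right | x); None at leaves
        self.left, self.right = left, right

def soft_decode(root, x):
    """Return (predicted class, path probability) for the most probable leaf."""
    best = (None, 0.0)
    stack = [(root, 1.0)]
    while stack:
        node, prob = stack.pop()
        if node.classifier is None:               # reached a leaf
            if prob > best[1]:
                best = (node.classes[0], prob)
            continue
        p_right = node.classifier(x)              # soft output in [0, 1]
        stack.append((node.left, prob * (1.0 - p_right)))
        stack.append((node.right, prob * p_right))
    return best
```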
Proposition 3  For each internal node $v$ of $T$ let $q(v) = 1 - \tfrac{1}{2} J(v)$, where
\[
J(v) = H[\pi(v)] - JS_{\pi(v)}[p_v(x \mid v_L), p_v(x \mid v_R)].
\]
Then $p_T(\mathrm{error}) \le 1 - Q(T)$, where
\[
Q(T) = q(r)\,[Q(T_L) + Q(T_R)] \tag{6}
\]
and for a leaf $v$ corresponding to class $C_j$, $Q(v) = \pi_j$. Here $T_L$ and $T_R$ denote the left and right subtrees of $T$.

Proof
For each class $j$, $j = 1, \ldots, k$, let $v_{j1}, v_{j2}, \ldots, v_{jn_j}$ be the path from the root to the leaf corresponding to class $j$, where $v_{j1}$ is the root of $T$ and $v_{jn_j}$ is the leaf. This path consists of $n_j - 1$ internal nodes, and the probability of reaching the leaf $v_{jn_j}$ is
\[
P[\text{reaching } v_{jn_j}] = \prod_{i=1}^{n_j - 1} \left(1 - p_{v_{ji}}(\mathrm{error})\right).
\]
Thus, the overall average error probability $p_T(\mathrm{error})$ for the entire structure $(T, \ell)$ is
\[
p_T(\mathrm{error}) = \sum_{j=1}^{k} \pi_j \left(1 - P[\text{reaching } v_{jn_j}]\right) = 1 - \sum_{j=1}^{k} \pi_j \prod_{i=1}^{n_j - 1} \left(1 - p_{v_{ji}}(\mathrm{error})\right).
\]
Using the JS (upper) bound from Equation (4) on the individual binary problems in $T$ we have
\[
p_T(\mathrm{error}) \le 1 - \sum_{j=1}^{k} \pi_j \prod_{i=1}^{n_j - 1} \left(1 - \tfrac{1}{2} J(v_{ji})\right), \tag{7}
\]
where for $v = v_{ji}$, $J(v) = H(\pi(v)) - JS_{\pi(v)}(p_v(x \mid v_L), p_v(x \mid v_R))$. Rearranging terms, it is not hard to see that
\[
Q(T) = \sum_{j=1}^{k} \pi_j \prod_{i=1}^{n_j - 1} \left(1 - \tfrac{1}{2} J(v_{ji})\right).
\]
The same derivation, now using the JS lower bound of Equation (4), yields:
Proposition 4  For each internal node $v$ of $T$ let $q'(v) = 1 - \tfrac{1}{4} J'(v)$, where
\[
J'(v) = \left( H[\pi(v)] - JS_{\pi(v)}[p_v(x \mid v_L), p_v(x \mid v_R)] \right)^2.
\]
Then $p_T(\mathrm{error}) \ge 1 - Q'(T)$, where
\[
Q'(T) = q'(r)\,[Q'(T_L) + Q'(T_R)]
\]
and for a leaf $v$ corresponding to class $C_j$, $Q'(v) = \pi_j$.
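The two recurrences can be evaluated in a single pass over the tree. Below is a small sketch under our reading of Propositions 3 and 4 (the attributes J and prior are assumed precomputed on the Node objects of the previous sketch; this is an illustration, not the authors' code).

```python
def Q(node, square=False):
    """Evaluate Q(T) (Proposition 3) or Q'(T) (Proposition 4, square=True)."""
    if node.classifier is None:          # leaf: Q(v) equals its class prior
        return node.prior
    j = node.J                           # J(v) = H[pi(v)] - JS of the two branches
    q = 1.0 - (j * j / 4.0 if square else j / 2.0)
    return q * (Q(node.left, square) + Q(node.right, square))

def bayes_error_bounds(root):
    """Return (lower, upper) bounds on p_T(error)."""
    return 1.0 - Q(root, square=True), 1.0 - Q(root, square=False)
```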
4. A Heuristic Procedure for Agglomerative Tree Constructions
The recurrences of Propositions 3 and 4 provide the means for efficient calculation of upper and lower bounds on the multiclass Bayes error of any tree decomposition, given the class-conditional probabilities of the leaves. Our goal is to construct a full binary tree $T$ whose Bayes error is minimal. A natural approach would be to consider trees whose Bayes error upper bound is minimal. This corresponds to maximizing $Q(T)$ (Equation (6)) over all trees $T$. There are two obstacles to achieving this goal. The statistical obstacle is that the true class-conditional distributions of internal nodes are not available to us. The computational obstacle is that the number of possible trees is huge. (The number of unlabeled full binary trees with $k$ leaves is the Catalan number $C_{k-1} = \frac{1}{k}\binom{2(k-1)}{k-1}$; the number of labeled trees, not counting isomorphic trees, is $O(2^k k!)$.) Handling the first obstacle in the general case using density estimation techniques appears to be counterproductive, as density estimation is considered harder than classification. But we can restrict ourselves to parametric models such as multinomials, where estimation of the class-conditional probabilities can be achieved reliably and efficiently; see e.g. (Friedman & Singer, 1998; Griffiths & Tenenbaum, 2002; McAllester & Schapire, 2000; Ristad, 1998). In the present work we ignore the discrepancy that will appear in our Bayes error bounds (even in the case of multinomials) and rely on simple maximum likelihood estimates of the class-conditionals.

To handle the maximization of $Q(T)$ we use the following agglomerative randomized heuristic procedure. We start with a forest of all $k$ leaves, corresponding to the $k$ classes. Our estimates for the priors of these classes, $\pi_j$, $j = 1, \ldots, k$, are obtained from the data. We perform $k - 1$ merging steps. At step $i$, $i = 1, \ldots, k - 1$, we hold a forest $F_i$ containing $N_i = k - i + 1$ trees, $T_1, \ldots, T_{N_i}$. Each of these trees $T$ has an associated class-conditional probability $P_T(x)$ (which is again estimated from the data) and a weight $w(T)$ that equals the sum of the priors of its leaves. For each pair of trees $T_i$ and $T_j$ we compute their JS-divergence $JS(i, j) = JS_{\pi(i,j)}(P_{T_i}(x), P_{T_j}(x))$, where $\pi(i, j) = \left( w(T_i)/(w(T_i) + w(T_j)),\; w(T_j)/(w(T_i) + w(T_j)) \right)$. To each possible merger (between $i$ and $j$) we assign a probability $p(i, j)$ proportional to $2^{-JS(i,j)}$. This way large JS values are assigned to smaller probabilities and vice versa. We then randomly choose one merger according to these probabilities. The newly merged tree $T_{ij}$ is assigned the mutual source of $T_i$ and $T_j$ as its class-conditional (see Equation (3)), and its weight is $w(T_i) + w(T_j)$. (Using a Bayesian argument, it can be shown (El-Yaniv et al., 1997) that if $X$ and $Y$ are samples with types (empirical probabilities) $P_{T_i}$ and $P_{T_j}$, respectively, then $2^{-JS(i,j)}$ is proportional to the probability that $X$ and $Y$ emerged from the same distribution.)

In all the experiments described below, to obtain a multiclass decomposition we ran this randomized procedure 10 times and chose the tree $T$ that maximized $Q(T)$. The chosen tree $T$ then determines the hierarchical decomposition, as described in Section 3. Note that the above procedure does not directly maximize $Q(T)$. The routine simply attempts to find trees whose higher internal nodes are "well-separated". Such trees will have low Bayes error, and our formal indication of that will be that $Q(T)$ is large. Thus, currently we can only use our bounds as a means to verify that a hierarchical decomposition is good, or to compare between two decompositions.
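The following Python sketch is our own simplified rendering of this procedure (reusing js_divergence from the earlier sketch; the nested-tuple tree representation is an assumption). It performs the k − 1 randomized merging steps, replacing each merged pair by its mutual source.

```python
import numpy as np

def build_tree(conditionals, priors, rng=None):
    """conditionals: list of multinomials (arrays); priors: class priors.
    Returns the merge topology as nested pairs, e.g. ((0, 2), 1)."""
    rng = rng or np.random.default_rng()
    forest = [(i, np.asarray(p, float), w)               # (tree, P_T, w(T))
              for i, (p, w) in enumerate(zip(conditionals, priors))]
    while len(forest) > 1:
        # JS-divergence for every candidate merger
        pairs, scores = [], []
        for a in range(len(forest)):
            for b in range(a + 1, len(forest)):
                _, pa, wa = forest[a]
                _, pb, wb = forest[b]
                pi = (wa / (wa + wb), wb / (wa + wb))
                pairs.append((a, b))
                scores.append(2.0 ** -js_divergence(pa, pb, pi))
        probs = np.array(scores) / sum(scores)           # p(i,j) prop. 2^{-JS(i,j)}
        a, b = pairs[rng.choice(len(pairs), p=probs)]
        ta, pa, wa = forest[a]
        tb, pb, wb = forest[b]
        merged = ((ta, tb),                              # new internal node
                  (wa * pa + wb * pb) / (wa + wb),       # mutual source M_pi
                  wa + wb)                               # summed weight
        forest = [t for k, t in enumerate(forest) if k not in (a, b)]
        forest.append(merged)
    return forest[0][0]
```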
5. The Machine Learning Authors Dataset
In our experiments (Section 6) we used a new benchmark dataset for testing authorship determination algorithms. This dataset contains a collection of singly-authored scientific research papers. The scientific affiliation of all authors is machine learning, statistical pattern recognition and related application areas. After this dataset was automatically collected from the web using a focused crawler guided by a compiled list of machine learning researchers, it was manually checked to see that all papers are indeed by single authors. This Machine Learning Authors (MLA) dataset (available at ∼rani/authorship) contains articles by more than 400 authors, with each author having at least one singly-authored paper.

For the present study we extracted from the MLA collection a subset that was prepared as follows. The raw papers (given in either PS or PDF formats) were first translated to ASCII, and then each paper was parsed into tokens. A token is either a word (a sequence of alphanumeric characters ending with one of the space characters or a punctuation) or a punctuation symbol. (We considered as tokens the following punctuation symbols: . ; , : ? ! ' ( ) " - / \.) To enhance uniformity and experimental control, we segmented each paper into chunks of paragraphs, where a paragraph contains 1000 tokens. (Last paragraphs of length < 500 tokens were combined with second-last paragraphs; this way, paragraph lengths vary in [500, 1500).) To eliminate topical information we projected all documents on the most frequent 5000 tokens. Appearing among these tokens are almost all of the most frequent function words in English, which bear no topical content but are known to provide highly discriminative information for authorship determination (Mosteller & Wallace, 1964; Burrows, 1987). For example, in Figure 1 we see a projected excerpt from the paper (Mitchell, 1999) as well as its source containing all the tokens. Clearly there are non-function words (like 'data') which remained in the projected excerpt. Nevertheless, since all the authors in the dataset write about machine learning related issues, such words do not contain much topical content.

We selected from MLA only the authors who have more than 30 paragraphs in the dataset. The result is a set of exactly 100 authors; in the rest of the paper we call the resulting set the MLA-100 dataset.
500 tokens were combinedwith second-last paragraphs. This way, paragraphs lengthsvary in [500 , rojected Text Over the many have to of data their ,,their ,,and their..At the same time,,,and in many nd complex ,,suchas the of data that in .. The of data the of how bestto use this data to general and to ..Data ::using datato and ..The of in data follows from the of several :
Original Text
Over the past decade many organizations have be-gun to routinely capture huge volumes of historicaldata describing their operations, their products, andtheir customers. At the same time, scientists andengineers in many fields find themselves capturingincreasingly complex experimental datasets, such asthe gigabytes of functional MRI data that describebrain activity in humans. The field of data miningaddresses the question of how best to use this histor-ical data to discover general regularities and to im-prove future decisions. Data Mining: using historicaldata to discover regularities and improve future de-cisions. The rapid growth of interest in data miningfollows from the confluence of several recent trends:
Figure 1. An excerpt from the paper "Machine Learning and Data Mining" (Mitchell, 1999). Top: a projection of the text over the high-frequency tokens. Bottom: the original text.
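To illustrate the projection shown in Figure 1, here is a toy Python sketch (our own; the punctuation set follows the footnote above, and the corpus and vocabulary are stand-ins for the real top-5000 token list).

```python
import re
from collections import Counter

def tokenize(text):
    """Words and the punctuation symbols listed above, as separate tokens."""
    return re.findall(r"[A-Za-z0-9]+|[.;,:?!'()\"\-/\\]", text)

def project(text, vocabulary):
    """Keep only tokens belonging to the high-frequency vocabulary."""
    return " ".join(t for t in tokenize(text) if t in vocabulary)

# Build a toy vocabulary of the most frequent tokens in a corpus.
corpus = "the field of data mining addresses the question of data ..."
vocab = {t for t, _ in Counter(tokenize(corpus)).most_common(5000)}
print(project("The gigabytes of functional MRI data.", vocab))  # -> "of data ."
```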
6. Experiments
Here we describe our initial empirical studies of the proposed multiclass decomposition procedure. We compare our method with the 'all-pairs' decomposition. Taking the MLA-100 dataset (see Section 5), we generated a progressively increasing random subset as follows. From the MLA-100 we randomly chose 3 authors, then added another author, chosen randomly and uniformly from the remaining authors, etc. This way we generated increasing sets of authors in the range of 3-100. So far we have experimented with multiclass subsets with k = 3-20, 50 and 100. In all the experiments we used an SVM with an RBF kernel. The SVM parameters were chosen using cross-validation. The reported results are averages of 3-fold cross-validation.

The features generated for our authorship determination problems contained in all cases the top 5000 single tokens (see Section 5 for the token definition) as well as the following "high order pairs". After projecting the documents over the high-frequency single tokens, we took all bigrams. For instance, considering the projected text in Figure 1, the token pair 'to'+'of' appearing in the first line of the projected text (top) is one of our features. Notice that in the original text this pair of words appears 5 words apart. This way our representation captures high-order pairwise statistics of the tokens. Moreover, since we restrict ourselves to the most frequent tokens in the text, these token pairs do not suffer too much from the typical statistical sparseness which is usually experienced when considering n-grams in text categorization and language models. (A sketch of this feature construction appears at the end of this section.)

Accuracy results for both 'all-pairs' and our hierarchical decomposition procedure appear in Figure 2. The first observation is that the absolute values of these classification results are rather high compared to typical figures reported in authorship determination. For example, (Stamatatos et al., 2001) report accuracy around 70% for discriminating between 10 authors of newspaper articles. Such figures (i.e., around 10 authors and 60%-80% accuracy) appear to be common in this field. The closest results in both size and accuracy we have found are those of (Rao & Rohatgi, 2000), who distinguish between 117 newsgroup authors with accuracy 58.8% and between 84 authors with accuracy 80.9%. Still, this is far from the 91% accuracy we obtain for 50 authors and the 88% accuracy we obtain for 100 authors.

The consistent advantage of hierarchical decompositions over all-pairs is evident for small numbers of classes. However, beyond 10 classes there is no significant difference between the methods. Interestingly, the best hierarchical constructs our method generated (in terms of Q(T)) were completely skewed. It is not clear to us at this stage whether this is an artifact of our Bayes error bound or a weakness of our heuristic procedure.
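As promised above, here is a sketch of the "high order pairs" feature construction (our own illustration, reusing tokenize and a vocabulary as in the Section 5 sketch; not the paper's code). A pair is a bigram over the tokens that survive the projection, so it may span several words of the original text.

```python
from collections import Counter

def pair_features(text, vocabulary):
    """Token and token-pair counts over the projected text."""
    kept = [t for t in tokenize(text) if t in vocabulary]  # projection step
    singles = Counter(kept)
    pairs = Counter(zip(kept, kept[1:]))                   # adjacent survivors
    return singles, pairs

# E.g., with 'to' and 'of' in the vocabulary, "how best to use this kind of
# data" yields the pair ('to', 'of') although the words are far apart.
```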
7. Concluding Remarks
This paper presents a new approach for hierarchical multiclass decomposition of multinomials. A similar hierarchical approach can be attempted with nonparametric models. For instance, using any nonparametric probabilistic binary discriminator, one can attempt to heuristically estimate the hardness of the involved binary problems using empirical error rates and design reasonable hierarchical decompositions. However, a major difficulty in this approach is the computational burden.

When considering the main inherent deficiency of all-pairs decompositions, it appears that this deficiency should disappear, or at least soften, when the number of classes increases. The reason is that with a large number of classes, the noisy votes of irrelevant classifiers will tend to cancel out and the power of the relevant classifiers will then increase. We therefore speculate that it would be very hard to consistently beat all-pairs decompositions with a very large number of classes. Nevertheless, a desirable property of a decomposition scheme is scalability, which allows for efficient handling of large numbers of classes (and datasets). For example, one can hypothesize useful authorship determination applications which need to discriminate between thousands or even millions of authors. While a balanced hierarchical decomposition will be able to scale up to these dimensions, the $O(k^2)$ complexity of the all-pairs method would probably start to form a computational bottleneck.

Figure 2. The performance of hierarchical multiclass decompositions and 'all-pairs' decompositions on 20 authorship determination problems with varying numbers of classes (x-axis: number of classes, 4-20, 50 and 100; y-axis: accuracy, 89%-100%).

References
Antos, A., Devroye, L., & Gyorfi, L. (1999). Lower bounds for Bayes error estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21, 643-645.

Burrows, J. (1987). Word patterns and story shapes: The statistical analysis of narrative style. Literary and Linguistic Computing, 61-70.

Cover, T., & Thomas, J. (1991). Elements of Information Theory. John Wiley & Sons.

Dekel, O., & Singer, Y. (2002). Multiclass learning by probabilistic embedding. Neural Information Processing Systems (NIPS).

Dietterich, T., & Bakiri, G. (1995). Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2, 263-286.

El-Yaniv, R., Fine, S., & Tishby, N. (1997). Agnostic classification of Markovian sequences. Neural Information Processing Systems (NIPS).

Friedman, J. (1996). Another approach to polychotomous classification (Technical Report). Stanford University.

Friedman, N., & Singer, Y. (1998). Efficient Bayesian parameter estimation in large discrete domains. Neural Information Processing Systems (NIPS).

Fürnkranz, J. (2002). Round robin classification. Journal of Machine Learning Research, 2, 721-747.

Griffiths, T., & Tenenbaum, J. (2002). Using vocabulary knowledge in Bayesian multinomial estimation. Neural Information Processing Systems (NIPS).

Gutman, M. (1989). Asymptotically optimal classification for multiple tests with empirically observed statistics. IEEE Transactions on Information Theory, 35, 401-408.

Har-Peled, S., Roth, D., & Zimak, D. (2002). Constraint classification for multiclass classification and ranking. Neural Information Processing Systems (NIPS).

Lehmann, E. (1959). Testing Statistical Hypotheses. John Wiley & Sons.

Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37, 145-151.

McAllester, D., & Schapire, R. E. (2000). On the convergence rate of good-Turing estimators. Proc. 13th Annual Conference on Computational Learning Theory (pp. 1-6). Morgan Kaufmann, San Francisco.

Mitchell, T. (1999). Machine learning and data mining. Communications of the ACM, 42, 30-36.

Mosteller, F., & Wallace, D. (1964). Inference and Disputed Authorship: The Federalist. Addison-Wesley.

Rao, J., & Rohatgi, P. (2000). Can pseudonymity really guarantee privacy? USENIX Security Symposium.

Ristad, E. (1998). A natural law of succession. IEEE International Symposium on Information Theory (pp. 216-221).

Sejnowski, T., & Rosenberg, C. (1987). Parallel networks that learn to pronounce English text. Journal of Complex Systems, 1, 145-168.

Stamatatos, E., Fakotakis, N., & Kokkinakis, G. (2001). Automatic text categorisation in terms of genre and author. Computational Linguistics, 471-495.

Wong, A., & You, M. (1985). Entropy and distance of random graphs with application to structural pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 7.