Re-evaluating the need for Modelling Term-Dependence in Text Classification Problems
Running head: TERM-DEPENDENCE IN TEXT CLASSIFICATION

Sounak Banerjee, Prasenjit Majumder, and Mandar Mitra
{[email protected], [email protected], [email protected]}
CVPR Unit, Indian Statistical Institute, Kolkata, West Bengal, India; Ph: (+91) 33 25752858
Room No: 4209, DAIICT, Gandhinagar, Gujarat, India; Ph: (+91) 79 3051 0605

Abstract

A substantial amount of research has been devoted to developing machine learning algorithms that account for term dependence in text classification. These algorithms offer acceptable performance in most cases, but at a substantial cost: they require significantly greater resources to operate. This paper argues that the higher cost of these algorithms is not justified by their performance on text classification problems. To test this conjecture, the performance of one of the best dependence models is compared with several well-established text classification algorithms, on a collection of datasets specifically chosen to reflect the disparate nature of text data found in real-world applications. The results show that even one of the best term-dependence models performs only decently at best when compared with independence models. Coupled with their substantially greater hardware requirements, this makes them an impractical choice for real-world use.
Keywords: Text Classification, Copula, Support Vector Machine, Unigram Language Model, K Nearest Neighbours
INTRODUCTION
For quite some time, researchers have fostered the idea that algorithms which model dependence among terms in documents are needed to improve classification performance (Eickhoff, de Vries, & Hofmann, 2015; Han & Karypis, 2000; Metzler & Croft, 2005; Nallapati & Allan, 2002, 2003; Yu, Buckley, Lam, & Salton, 1983). The central idea is that one could better predict the class a document belongs to if the underlying essence of the text could be interpreted, rather than treating the text as an unordered collection of words that conveys very little logical sense. Many approaches have been proposed to materialize this concept. The copula-based language model presented by Eickhoff et al. (2015) considers sentential co-occurrence of term pairs to capture the dependence structure of the terms in a document. A centroid-based document classification algorithm represents each document as a vector in the term space, calculates a centroid vector for each class from its constituent documents, and compares any new document to the available centroids (Han & Karypis, 2000). The Markov Random Field based classifier, in turn, models dependence on a contiguous sequence of terms, representing them as a chained dependence structure (Metzler & Croft, 2005).

Standard models, on the other hand, utilize properties such as rate of occurrence, length, and distribution of features. They try to establish a relationship between a class and the properties of the documents it contains: they assign values to features that are relevant to a particular class and then estimate the membership of a new document in that class by comparing these values.

Though bolstering prediction potential by exploiting the complex dependence structures inherent to natural language seems tempting, each of these models requires significantly greater hardware resources to operate than independence models, which, as their name suggests, depend only on the properties of independent features of the text. In addition to the collection of features that independence models rely on, dependence-based models require both processing and memory for interpreting and storing relationships between the features.

Moreover, no recent literature compares the classification performance of dependence models with widely accepted independence models such as K nearest neighbours or support vector classifiers. So, to verify the validity of the argument justifying the use of complex dependence structures for text classification, we compare four classification algorithms: the Naive Bayes classifier, the copula language model, the K nearest neighbour classifier, and the support vector machine. Each classifier is used to perform classification on multiple datasets. Finally, we analyse the merits and demerits of each classifier through a close examination of the properties of the datasets and their effects on the classifiers.

We used the copula-based classifier as a benchmark for dependence models, primarily because its superior performance over other dependence models is well established, and secondly because of its recency of publication (Eickhoff et al., 2015).

The copula language model accounts for term dependence by utilizing the list of all term pairs that co-occur in sentences. Since each sentence in a document is the smallest entity that carries a sense, the co-occurrence of terms within a sentence is assumed to carry some semantic relevance to the topic. The co-occurrence measures are calculated separately for all term pairs to model a classifier for each class. The model utilizes both co-occurrence data and term probability to calculate the similarity measure of a document to a specific class.
DATASETS
Multiple datasets were used to alleviate the possibility of any bias in the evaluation. Datasets were selected for their varying document lengths, class sizes, and language (colloquial and formal). Another key aspect considered while selecting the datasets was the classification type: multi-class or multi-label. The Twitter, 20-Newsgroups, and Stack Overflow datasets were chosen for multi-class classification, while Reuters-21578 and RCV1 are multi-label datasets.

All datasets were processed in the same manner. Stop word removal was carried out based on the list of English stop words available in NLTK, and stemming was done using the Porter Stemmer.
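As a concrete illustration, a minimal sketch of this shared preprocessing step might look as follows; the paper specifies only the NLTK stop-word list and Porter stemmer, so the choice of word_tokenize as the tokenizer and the lower-casing step are our assumptions.

```python
# Sketch of the shared preprocessing pipeline: NLTK English stop-word
# removal followed by Porter stemming. Tokenizer choice is an assumption.
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(text):
    """Lower-case, tokenize, drop stop words, and stem the remaining tokens."""
    tokens = word_tokenize(text.lower())
    return [STEMMER.stem(t) for t in tokens if t.isalpha() and t not in STOP_WORDS]

print(preprocess("The markets were rising quickly."))  # ['market', 'rise', 'quickli']
```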
Reuters-21578
The corpus is a collection of 21,578 newswire articles from Reuters. It is a multi-label dataset with a total of 90 categories; David D. Lewis (n.d.) provides a detailed summary of the corpus. Class sizes of training documents range from 1 to 2,861, and the average document length in the corpus is 126 words.
RCV1-V2
The original Reuters RCV1 corpus is a collection of 800,000 documents with 103 categories. Since carrying out any operation on such a large corpus is difficult, a chronological split has been proposed (Lewis, Yang, Rose, & Li, 2004). RCV1-V2 contains one training set and four test sets: the first 23,149 documents form the training set, and the rest of the collection is split into four test sets of about 200,000 documents each. Each document belongs to at least 1 and at most 17 categories, and each topic contains at least 5 documents over the entire corpus. Some categories have no documents in the training set, so class sizes for training documents range from 0 to 10,786. Each document in the corpus is 143 words long on average.

Scikit-Learn provides a tokenized version of the corpus (scikit-learn, n.d.) that can be easily imported into Python. This version of the corpus was used as input for all the existing models except the copula language model, for which the original version, available on request (Reuters, n.d.), was used, since sentence-level co-occurrence data was needed for this algorithm. To maintain uniformity across all tests, documents omitted from version 2 of the corpus were also excluded from the original data.
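For reference, a minimal sketch of loading that tokenized version; fetch_rcv1 is the Scikit-Learn loader, and the shapes in the comments are those documented for the full corpus.

```python
# Loading the pre-tokenized RCV1 corpus through Scikit-Learn.
from sklearn.datasets import fetch_rcv1

rcv1 = fetch_rcv1()            # downloads the corpus on first call
X, y = rcv1.data, rcv1.target  # sparse term features and topic labels
print(X.shape, y.shape)        # (804414, 47236) (804414, 103)
```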
Twitter-Sample
This corpus was considered because of its short document length and use of colloquial language. It is available in the NLTK corpus library and is a collection of 10,000 tweets, separated into 5,000 positive and 5,000 negative tweets. The average tweet length in the collection is 11 words.
StackOverflow Questions
Stack Overflow is a platform where users post questions from different fields and anyone who has a solution may provide an answer. The corpus contains 20,000 such question titles from the Stack Overflow website, divided over 20 categories (J. Xu et al., 2015). This corpus was selected because its document size is similar to that of the Twitter corpus, so that any effect on classification of short text documents may be identified without bias. The average question length over the entire corpus is 8 words.
20-Newsgroups

This corpus is a collection of 18,846 news articles distributed almost evenly across 20 newsgroup categories such as comp.graphics, rec.sport.hockey, sci.electronics, and soc.religion.christian. It is available for download from its official website (Jason Rennie, n.d.), but for our purpose we used the version available in the NLTK corpus library. Class sizes range from 377 to 600 documents for training and 251 to 399 for testing. At 318 words per document, the average document size is the highest among all the datasets used in this experiment.
CLASSIFICATION
All implementations were carried out in Python. The term weights used for classification were kept consistent across classifiers, with the exception of the copula language model. Since the input scores of terms for copulas need to be normalized to [0,1], the simple probability of occurrence of a term in a class was used. The probability of occurrence of term i in a class C is given by

$P(w_i) = \frac{N_i}{\sum_{j=1}^{|C|} N_j}$

where $N_i$ is the number of occurrences of term i in class C in the training set, and the sum runs over all $|C|$ terms of the class. Every other algorithm utilized the TF-IDF scores of terms for classification; these scores were generated using the TfidfVectorizer function from Scikit-Learn.

In the case of RCV1, the tokenized data available from Scikit-Learn was used for our experiments. However, the tokens were not labelled, so co-occurrence information could not be mapped to the original data. Hence, the original RCV1 data was split into sets matching the RCV1-V2 dataset, and the copula-based classifier was run on this new data.

Finally, for performing multi-label classification on the RCV1 and Reuters-21578 datasets, the binary relevance method was employed. Binary relevance uses a collection of yes/no classifiers, one for each class in a dataset, each of which determines whether a document belongs to its class or not.
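As a minimal sketch of this setup (with toy stand-in data, since the real inputs are the corpora described above), TF-IDF weighting and binary relevance could be wired together as follows; OneVsRestClassifier implements exactly the one-yes/no-classifier-per-class scheme.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Toy stand-in data; the real experiments use the corpora described above.
train_texts = ["oil prices rise", "wheat crop fails", "oil and wheat trade"]
train_labels = np.array([[1, 0], [0, 1], [1, 1]])  # one yes/no column per class

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)    # TF-IDF term weights

# Binary relevance: one independent binary classifier per class.
clf = OneVsRestClassifier(LinearSVC())
clf.fit(X_train, train_labels)
print(clf.predict(vectorizer.transform(["oil prices fall"])))
```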
Naive Bayes Classifier
The Naive Bayes classifier is one of the most basic and simplest classification models; it uses Bayes' rule to obtain probability scores, and it is the most commonly used method for benchmarking other algorithms. The similarity score of a document to a class is expressed as

$P(t|d) = P(t) \cdot P(d|t)$

where $P(t|d)$ is the probability that document d belongs to topic t, and $P(t)$ is the prior probability of topic t, given by

$P(t) = \frac{N_t}{N_{total}}$ and $P(d|t) = \prod_{w \in d} P(w|t)$

where $N_t$ is the number of documents in the training set of topic t, $N_{total}$ is the total number of documents in the complete training set, and $P(w|t)$ is the probability that word w belongs to topic t. Though traditional Naive Bayes algorithms use simple term probabilities, we used the TF-IDF scores of the words for this experiment.

Additive smoothing was employed for smoothing the term probabilities. The general formula for additive smoothing is

$P(w_i) = \frac{n_i + \alpha}{N + \alpha|V|}$

where $P(w_i)$ is the smoothed probability of occurrence of word $w_i$ in a class, $n_i$ is the frequency of word $w_i$ in that class in the training set, $N$ is the sum of the frequencies of all words in the class, $|V|$ is the size of the vocabulary of the class, and $\alpha$ is a user-defined parameter. Laplace smoothing is the special case of additive smoothing with $\alpha = 1$; when $0 < \alpha < 1$, it is called Lidstone smoothing (Vatanen, Väyrynen, & Virpioja, 2010). For our experiments, we apply both Laplace and Lidstone smoothing (with $\alpha = 0.01$).

We used the Multinomial Naive Bayes algorithm, which accounts for the exact frequencies of terms in each class, instead of Binomial Naive Bayes. We chose this variant of the NB classifier because of its superior performance in text classification problems; multiple studies have demonstrated the efficacy of multinomial Naive Bayes in text classification (Eyheramendy, Lewis, & Madigan, 2003; McCallum, Nigam, et al., 1998; Protasiewicz, Mirończuk, & Dadas, 2017; S. Xu, Li, & Wang, 2017).
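In Scikit-Learn terms, the two variants differ only in the alpha parameter of MultinomialNB; a minimal sketch with toy feature vectors:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy non-negative feature matrix (e.g. TF-IDF scores) and class labels.
X = np.array([[2.0, 0.0, 1.0], [0.0, 3.0, 0.5]])
y = np.array([0, 1])

nb_laplace = MultinomialNB(alpha=1.0).fit(X, y)    # Laplace smoothing
nb_lidstone = MultinomialNB(alpha=0.01).fit(X, y)  # Lidstone smoothing
print(nb_lidstone.predict(np.array([[1.0, 0.0, 0.5]])))  # -> [0]
```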
K-Nearest Neighbours

As the name suggests, for an input document d, the KNN algorithm selects a user-defined number of neighbours from the set of training documents that are nearest to it. The distance between documents is calculated from their features using a similarity measure or a graph-based structure. After creating the list of the K nearest neighbours, the algorithm uses a voting scheme wherein each enlisted document places a vote for its respective class. The final decision for document d is based on the number of votes each class receives from its K nearest neighbours.

For our experiment, the choice of K was adjusted to whatever suited each corpus best. A brute-force method was used to perform classification, since the available implementation can only use brute force for sparse feature-matrix inputs. The Scikit-Learn implementation of KNN provides two options for calculating distances between documents, Euclidean and Manhattan. We used the Manhattan distance because the Euclidean implementation caused our system to run out of memory.
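A minimal sketch of that configuration on sparse toy data follows; the brute-force algorithm and Manhattan metric match the description above, while the feature values themselves are placeholders.

```python
from scipy.sparse import csr_matrix
from sklearn.neighbors import KNeighborsClassifier

# Sparse toy features; brute-force search is required for sparse inputs,
# and Manhattan distance avoids the memory blow-up seen with Euclidean.
X = csr_matrix([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = [0, 1, 1]

knn = KNeighborsClassifier(n_neighbors=2, algorithm="brute", metric="manhattan")
knn.fit(X, y)
print(knn.predict(csr_matrix([[0.2, 0.9]])))  # majority vote of 2 nearest -> [1]
```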
Support Vector Machine

A support vector machine plots the documents as points in an n-dimensional space, where each of the n features represents its own dimension. It then defines a hyperplane between these sets of points that segregates them so that the collection on either side of the hyperplane contains the maximum number of documents of the intended class. The function used to classify a document x is given by

$sign\left(\sum_i y_i \cdot w_i \cdot K(x_i', x) + b\right)$

where $y_i$ is the class value (+1 and -1 for binary classification), $w_i$ is the weight vector (the vector for the hyperplane), $K$ is the kernel function (linear in our case), $x_i'$ are the support vectors, and $b$ is the distance of the hyperplane from the origin. Since binary relevance was used in our case, each Support Vector Classifier (SVC) solved a binary classification problem, and the sign of the value determined whether the document belonged to a class.

A support vector machine can use multiple functions, called kernels, to generate the hyperplane. We used the linear kernel in our experiment, which creates a linear hyperplane. Our choice of kernel is based on the fact that, in high-dimensional vector spaces, selecting non-linear kernels runs the risk of overfitting (Ben-Hur & Weston, 2010).
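A minimal sketch of the linear SVC decision rule on toy points; the sign of decision_function is what each binary-relevance sub-problem uses to decide membership.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy binary problem with class values +1 / -1, as in the formula above.
X = np.array([[0.0, 1.0], [0.2, 0.9], [1.0, 0.0], [0.9, 0.1]])
y = np.array([1, 1, -1, -1])

svc = LinearSVC().fit(X, y)
scores = svc.decision_function(X)  # signed distance from the hyperplane
print(np.sign(scores))             # -> [ 1.  1. -1. -1.]
```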
Copula Language Model

In this language model, a classifier for any class c works with two sets of features. The first is a list of all terms present in the documents of the class, with their probabilities of occurrence. The second is a list of all term pairs that occur in the same sentence, with their respective Pointwise Mutual Information (PMI) or Jaccard coefficient values across the class, normalized to $[1, \infty)$:

$\theta_{t_1,t_2} = \frac{f(t_1, t_2)}{\mu}$ if $f(t_1, t_2) > \mu$, else $\theta_{t_1,t_2} = 1$

where $f$ is the chosen co-occurrence metric between terms $t_1$ and $t_2$, and $\mu$ is the average over all $f(t_i, t_j)$ with $i \neq j$. When $\theta$ is 1, the terms are completely independent, and an increasing coefficient value implies increasing dependence. In our version of the algorithm we used PMI, as it generated marginally superior results compared to the Jaccard coefficient in every case.

The probability that a document d belongs to a topic t is calculated as

$P(t|d) = P(t) \cdot P(d|t)$

where $P(t)$ is the prior probability of topic t and

$P(d|t) = C_t(w_1, w_2, w_3, \ldots, w_n)$, $w_i \in d$

with

$C_t(w_1, w_2, w_3, \ldots, w_n) = \psi^{-1}(\psi(w_1) + \psi(w_2) + \psi(w_3) + \ldots + \psi(w_n))$.

We used the Gumbel copula from the Archimedean family, as it was reported to produce the best results in Eickhoff and de Vries (2014). $\psi$ and $\psi^{-1}$ for Gumbel copulas are defined by

$\psi(u) = (-\log(u))^{\theta}$ and $\psi^{-1}(u) = \exp(-u^{1/\theta})$.

Thus:

$C_t(u_i, u_j) = \exp\left(-\left((-\log(u_i))^{\theta} + (-\log(u_j))^{\theta}\right)^{1/\theta}\right)$

where $u_i$, $u_j$ are the probabilities of occurrence of words $w_i$, $w_j$ in topic t, and $\theta$ is a parameter representing the strength of dependence between the individual words $w_i$, $w_j$ in topic t, expressed as PMI in our case. It is important to note that when $\theta$ is 1, i.e. when the words are completely independent,

$C_t(u_i, u_j) = u_i \cdot u_j$.

For the sake of simplicity, the value of $\theta$ between two term pairs, say $(w_i, w_j)$ and $(w_l, w_k)$, is assumed to be 1, making the pairs independent of each other. This causes the copula function to become a product of bivariate copulas:

$C_t(w_1, w_2, w_3, \ldots, w_n | c) = C_t(w_1, w_2 | c) \cdot C_t(w_3, w_4 | c) \cdot \ldots$ for all $(w_i, w_j)$ with $\theta > 1$.

We used Jelinek-Mercer smoothing for all terms in the corpus (Jelinek, 1980). Any unfamiliar words from the test set were omitted during the classification process. Also, when considering term pairs, a single word may occur in multiple pairs; in our implementation we chose to include the contribution of the probability scores of these recurring terms.

The algorithm for this language model closely resembles the Naive Bayes algorithm, in which complete independence of terms is assumed. Given a collection of terms $w_1, w_2, w_3, \ldots, w_k$ from a document d and their probabilities of occurrence $u_1, u_2, u_3, \ldots, u_k$ in a certain topic t, the copula-based similarity score of the document to the topic is given by

$C_t(u_1, u_2, u_3, \ldots, u_k) = \psi^{-1}(\psi(u_1) + \psi(u_2) + \psi(u_3) + \ldots + \psi(u_k))$.

If we assume complete independence as Naive Bayes does ($\theta = 1$),

$C_t(u_1, u_2, u_3, \ldots, u_k) = u_1 \cdot u_2 \cdot u_3 \cdot \ldots \cdot u_k$

which is equal to the Bayes formula. But since this algorithm accounts for term dependence using sentential co-occurrence, it is expected to perform better than simple Naive Bayes classification. The goal is to figure out whether this extra computation to improve performance using term dependence is beneficial to the text classification paradigm.
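A minimal sketch of the bivariate Gumbel copula at the heart of this model; theta is the normalized co-occurrence strength (PMI-based in our experiments), and theta = 1 recovers the independence product.

```python
import math

def gumbel_copula(u_i, u_j, theta):
    """Bivariate Gumbel copula C(u_i, u_j) with dependence parameter theta >= 1."""
    psi_i = (-math.log(u_i)) ** theta          # generator: psi(u) = (-log u)^theta
    psi_j = (-math.log(u_j)) ** theta
    return math.exp(-((psi_i + psi_j) ** (1.0 / theta)))  # psi^{-1} of the sum

# theta = 1 reduces to the Naive Bayes independence product u_i * u_j.
print(gumbel_copula(0.2, 0.5, 1.0))  # 0.1 == 0.2 * 0.5
print(gumbel_copula(0.2, 0.5, 2.5))  # ~0.186: dependence raises the joint score
```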
EXPERIMENTS

For every test except those with copulas, the Scikit-Learn implementations of the algorithms were used. The input fed to each classifier was the exact same pre-processed data, generated using standard NLTK library functions.

Since there are no separate training and testing sets for the Twitter and Stack Overflow datasets, 10-fold cross-validation was carried out to assess the performance of each classifier on them. For RCV1, classification was carried out on each test set separately, and the average of the F1-measures over all 4 test sets is reported for each classifier.

Parameter values for almost every classifier were left at their defaults, except for the value of K for the KNN classifier, which was optimized for the best result: K was set to 100 for both short-text datasets (Stack Overflow and twitter-samples) and to 15 for the rest. The α parameter for Jelinek-Mercer smoothing was set to 0.99 for the copula classifier.

The micro-averaged F1-scores of all the classifiers on the corresponding datasets are listed in Table 1.
Corpus           NB α=1   NB α=0.01   KNN    Copula   SVC
Reuters-21578    0.51     0.77        0.79   0.64     0.87
RCV1             0.49     0.71        0.72   0.66     0.80
20-NewsGroups    0.78     0.84        0.74   0.82     0.85
Stack Overflow   0.83     0.78        0.74   0.76     0.85
Twitter          0.75     0.72        0.71   0.70     0.75

Table 1. Micro-averaged F1 scores.

Significance Testing
A statistical significance test was carried out for each dataset using the Wilcoxon signed-rank test. The test compared the F1-scores of each classifier with those of the copula model. Two test strategies were used, based on the type of data:

1. For the datasets that required K-fold cross-validation, the micro-averaged F1-scores of each fold were compared. For both the Stack Overflow and Twitter corpora we used 10-fold cross-validation.

2. For the other datasets, which had predefined training and test sets, we performed category-wise testing: the F1-scores of each category were compared in order to measure whether the differences in scores were statistically significant (Yang & Liu, 1999).

Table 2 summarizes our observations from the significance tests. The test results are classified into 3 categories: category I for results that satisfied a confidence level of α = 0.01, category II for results whose P-value lay between 0.01 and 0.05, and category III for P-values greater than 0.05. Results marked with an asterisk signify statistical significance of hypotheses that contradict the conclusions drawn from Table 1.
Corpus           SVC    NB α=1   NB α=0.01   KNN
Reuters-21578    I      I        III         III
RCV1 Set-1       I      I        I*          II*
RCV1 Set-2       I      I        I*          III
RCV1 Set-3       I      I        I*          III
RCV1 Set-4       I      I        I*          II*
20-NewsGroups    III    III      II          I
Stack Overflow   I      I        I           III
Twitter          I      I        I           III

Table 2. P-value categories for classification performance.
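A minimal sketch of the first test strategy described above, assuming paired per-fold micro-averaged F1-scores; the score arrays below are illustrative placeholders, not results from the paper.

```python
from scipy.stats import wilcoxon

# Hypothetical paired micro-averaged F1-scores from the 10 CV folds.
copula_f1 = [0.76, 0.75, 0.77, 0.74, 0.76, 0.75, 0.77, 0.76, 0.74, 0.75]
svc_f1    = [0.85, 0.84, 0.86, 0.85, 0.84, 0.85, 0.86, 0.84, 0.85, 0.85]

stat, p = wilcoxon(copula_f1, svc_f1)  # paired, non-parametric test
print(p)  # p < 0.01 -> category I; 0.01-0.05 -> II; > 0.05 -> III
```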
RESULTS AND DISCUSSION
From a rough perusal of Table 1 we can make a few observations:

1. The linear support vector classifier (SVC) outperforms the copula language model in every case, in spite of the latter using co-occurrence data and, as a result, utilizing significantly greater hardware resources.

2. The Naive Bayes model with Laplace smoothing performs surprisingly well for short text data.

In order to properly analyse the results, we create a list of properties for each dataset that help explain the performance of the classifiers. These properties are listed in Table 3. The first column gives the average variance of the frequency of terms across all classes in the corpus. The Stack Overflow and Twitter corpora have the lowest variance in term frequencies, which could be a result of their short document lengths and the fact that both contain informal text, while the other three corpora, which contain long, formally written news articles, have a higher variance in frequency, with Reuters-21578 having the highest value. The second column contains the average document length of the corresponding corpus, and the final column lists the type of classification task the corpus requires.
Corpus           Var_tf      Doc Length   Type
Reuters-21578    5.228e-03   126          Multi-Label
RCV1             1.117e-03   143          Multi-Label
20-NewsGroups    1.474e-04   318          Multi-Class
Stack Overflow   6.845e-05   8            Multi-Class
Twitter          1.076e-06   11           Multi-Class

Table 3. Corpus properties.
We studied the relationship of each of these properties to the classification scores of the models and concluded that the models' behaviour could be attributed to a combination of all the listed properties. The most intuitive comparisons, however, could be drawn on the basis of term variance, so we plot the F1-scores of the classifiers against all five corpora in decreasing order of term variance. Figure 1 shows the performance of each of the classifiers: copula, Naive Bayes, linear SVC, and KNN.
Figure 1. Plot of F1-scores for all classifiers.

On a preliminary examination of the graph, we observe that while the F1 values of both the copula and Naive Bayes classifiers increase substantially when transitioning from multi-label to multi-class classification, this phenomenon does not seem to affect the performance of either of the discriminative models. Both discriminative models, KNN and SVC, have relatively uniform performance scores compared to the erratic scores of the generative models. Another pattern common to almost every curve is that datasets with higher term variance generally yield better classification accuracy, for both multi-class and multi-label datasets.

Finally, Figure 1 makes it quite clear that the copula language model and the Naive Bayes classifier with Lidstone smoothing follow a similar trend in classification performance, which shows that our earlier hypothesis about the copula language model sharing certain properties with the Naive Bayes algorithm was well founded.

For short texts like Twitter and Stack Overflow, both versions of the Naive Bayes classifier outperform the copula language model. This is a result of insufficient hits on the co-occurrence list generated from the training set, which is a direct outcome of short document length. The F1-score of KNN also plummeted when classification was carried out on short text documents, and the number of neighbours had to be adjusted to 100 to improve accuracy.

Moving on to Table 2, we observe that, even with a substantial difference between the micro-averaged F1-scores of copulas and Naive Bayes with Lidstone smoothing on the Reuters-21578 data, the test yields a P-value greater than 0.05. More surprisingly, the significance score of this algorithm for the RCV1 corpus indicates, at the 0.01 confidence level, that copula is the better algorithm. KNN presented similarly counter-intuitive significance scores for both multi-label corpora, despite a very large margin of difference in micro-averaged F1-scores. SVC and Naive Bayes with Laplace smoothing also have P-values higher than 0.05 for the 20-NewsGroups dataset.

Similar results were observed for the Twitter and Stack Overflow corpora in the case of KNN, but the differences in its micro-averaged F1-scores from those of copulas were very small in both cases, and no relevant conclusion could be drawn even with a more detailed analysis; these observations were attributed to data bias.

To better understand the anomalies in the confidence scores for the rest, we generated Figures 2 through 7, which show the performance scores of each classifier over all the data points used in the significance tests. The X-axis lists all classes in a corpus in decreasing order of class size, and the Y-axis plots the F1-scores.

For SVC, in Figure 2, the differences in scores are not as significant, but there are 6 cases where copula marginally outperforms this model, resulting in a marginally higher P-value. More interestingly, in each case the copula model demonstrates a significant and consistent improvement in classification accuracy over the other algorithms for classes that have a low document frequency. The superior performance of the classifier in such categories is thus the cause of the shift in the significance values. This also sheds new light on the properties of the copula classification model.
We learn that the information the model accumulates using term dependence helps classification accuracy for classes with inadequate features.

To further investigate this property, we plotted the F1-scores of all classifiers for only the classes that had the highest difference in performance. Studying Figures 2 through 7, we observe that this phenomenon is most clearly visible for class sizes 16 through 2 of the Reuters-21578 corpus. Figure 8 presents the accuracy measures of all the classifiers for this sequence of classes; the superior performance of the dependence model is clearly visible in the chart.

But even with copulas showing impressive performance, SVC still manages to do a better job of classifying documents in most cases. To eliminate the possibility of a bias, we plotted the F1-scores of the two classifiers for a similar range of class sizes from a different corpus. The RCV1 corpus was the only other corpus with comparable class sizes, so we used the results of its four test sets to compare the performance of the two classifiers. Figures 9 to 12 present the F1-scores of the smallest classes of each test set for the two classifiers. In all 4 cases, copulas clearly take the lead.

Thus it can be concluded that, for long text data, copulas evenly match SVM-based classification on classes with sparse features. The relative scores of the two classifiers will depend on the nature of the data, but in general both algorithms perform decently for such classes.
CONCLUSION
From the extensive set of experiments that were carried out, it is clear that Support Vector Machine based classifiers continue to dominate and remain the most reliable. All the other classifiers had their own limitations. Even though the copula model demonstrated impressive performance for classes with a limited number of documents, SVM achieved nearly equal performance in a fraction of the time. The copula model also performed poorly on short text datasets, where Naive Bayes demonstrated why it remains a benchmark for other classification algorithms. While the classification accuracy of the generative classifiers faltered in multi-label problems compared to multi-class ones, the discriminative methods maintained very stable curves. Most importantly, the copula language model, in spite of boasting the use of complex dependence structures, failed to impress.

It is therefore clear that dependence models like copulas are still outperformed by common independence methods like KNN and SVM, and that even small modifications to the Naive Bayes classifier, such as changing the smoothing parameter, can sometimes yield better scores. A state-of-the-art dependence model could not hold its place among existing classification algorithms that do not model term dependence and are therefore less resource-intensive. So the question remains: should researchers continue to introduce new dependence models that perform better than their predecessors, or focus on improving the performance of existing methods?

The inherent limitation of modelling term dependence on text data lies in the considerably high computation and storage costs, which in turn demand significantly higher classification accuracy from these algorithms. Term dependence obviously has its perks, but not every algorithm that models it can automatically be expected to beat existing state-of-the-art models like SVM. Proposing dependence models that hold up against the best existing models, even for specific use cases, could be a valuable contribution. But introducing new algorithms that generate marginally better results than models which are themselves not very efficient may not be in the best interest of the field's progress.
ACKNOWLEDGEMENT
We would like to thank Kaggle and Scikit-Learn for making the StackOverflow dataset and a tokenized version of the RCV1-V2 dataset available to us for our experiments.

References

Ben-Hur, A., & Weston, J. (2010). A user's guide to support vector machines. In Data mining techniques for the life sciences. Springer.

David D. Lewis. (n.d.). Reuters-21578 text categorization test collection. http://www.daviddlewis.com/resources/testcollections/reuters21578/

Eickhoff, C., & de Vries, A. P. (2014). Modelling complex relevance spaces with copulas. In Proceedings of the 23rd ACM international conference on conference on information and knowledge management (pp. 1831–1834). ACM.

Eickhoff, C., de Vries, A. P., & Hofmann, T. (2015). Modelling term dependence with copulas. In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval (pp. 783–786). ACM.

Eyheramendy, S., Lewis, D. D., & Madigan, D. (2003). On the naive Bayes model for text categorization.

Han, E.-H. S., & Karypis, G. (2000). Centroid-based document classification: Analysis and experimental results. In European conference on principles of data mining and knowledge discovery (pp. 424–431). Springer.

Jason Rennie. (n.d.). The 20 Newsgroups data set. http://qwone.com/~jason/20Newsgroups/ (Jan. 2008).

Jelinek, F. (1980). Interpolated estimation of Markov source parameters from sparse data. In Proc. workshop on pattern recognition in practice, 1980.

Lewis, D. D., Yang, Y., Rose, T. G., & Li, F. (2004). RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5(Apr), 361–397.

McCallum, A., Nigam, K., et al. (1998). A comparison of event models for naive Bayes text classification. In AAAI-98 workshop on learning for text categorization (pp. 41–48). Madison, WI.

Metzler, D., & Croft, W. B. (2005). A Markov random field model for term dependencies. In Proceedings of the 28th annual international ACM SIGIR conference on research and development in information retrieval (pp. 472–479). ACM.

Nallapati, R., & Allan, J. (2002). Capturing term dependencies using a language model based on sentence trees. In Proceedings of the eleventh international conference on information and knowledge management (pp. 383–390). ACM.

Nallapati, R., & Allan, J. (2003). An adaptive local dependency language model: Relaxing the naive Bayes' assumption.

Protasiewicz, J., Mirończuk, M., & Dadas, S. (2017). Categorization of multilingual scientific documents by a compound classification system. In International conference on artificial intelligence and soft computing (pp. 563–573). Springer.

Reuters. (n.d.). Reuters Corpora. http://trec.nist.gov/data/reuters/reuters.html/

scikit-learn. (n.d.). RCV1 dataset. http://scikit-learn.org/stable/datasets/rcv1.html/

Vatanen, T., Väyrynen, J. J., & Virpioja, S. (2010). Language identification of short text segments with n-gram models. In LREC.

Xu, J., Peng, W., Guanhua, T., Bo, X., Jun, Z., Fangyuan, W., Hongwei, H., et al. (2015). Short text clustering via convolutional neural networks.

Xu, S., Li, Y., & Wang, Z. (2017). Bayesian multinomial naive Bayes classifier to text classification. In Advanced multimedia and ubiquitous engineering (pp. 347–352). Springer.

Yang, Y., & Liu, X. (1999). A re-examination of text categorization methods. In Proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval (pp. 42–49). ACM.

Yu, C. T., Buckley, C., Lam, K., & Salton, G. (1983). A generalized term dependence model in information retrieval. Cornell University.
Figure 2. Plot of F1-scores across all classes for the 20-NewsGroups corpus.

Figure 3. Plot of F1-scores across all classes for the Reuters-21578 corpus.

Figure 4. Plot of F1-scores across all classes for test set 1 of the RCV1 corpus.

Figure 5. Plot of F1-scores across all classes for test set 2 of the RCV1 corpus.

Figure 6. Plot of F1-scores across all classes for test set 3 of the RCV1 corpus.

Figure 7. Plot of F1-scores across all classes for test set 4 of the RCV1 corpus.

Figure 8. Plot of F1-scores of all classifiers for small classes from the Reuters-21578 corpus.
Figure 9. F1-scores for small classes from RCV1, Set 1.

Figure 10. F1-scores for small classes from RCV1, Set 2.

Figure 11. F1-scores for small classes from RCV1, Set 3.

Figure 12. F1-scores for small classes from RCV1, Set 4.