Re-evaluating the need for Modelling Term-Dependence in Text Classification Problems
Running head: TERM-DEPENDENCE IN TEXT CLASSIFICATION

Sounak Banerjee, Prasenjit Majumder, and Mandar Mitra
{[email protected], [email protected], [email protected]}
CVPR Unit, Indian Statistical Institute, Kolkata, West Bengal, India; Ph: (+91) 33 25752858
Room No: 4209, DAIICT, Gandhinagar, Gujarat, India; Ph: (+91) 79 3051 0605

Abstract

A substantial amount of research has been devoted to developing machine learning algorithms that account for term dependence in text classification. These algorithms offer acceptable performance in most cases, but at a substantial cost: they require significantly greater resources to operate. This paper argues that the higher cost of these algorithms is not justified by their performance on text classification problems. To test this conjecture, the performance of one of the best dependence models is compared with several well-established text classification algorithms, on a collection of datasets specifically chosen to reflect the disparate nature of text data found in real-world applications. The results show that even one of the best term-dependence models performs only decently at best when compared with independence models. Coupled with their substantially greater hardware requirements, this makes them an impractical choice for real-world use.
Keywords: Text Classification, Copula, Support Vector Machine, Unigram Language Model, K Nearest Neighbours
INTRODUCTION
For quite some time, researchers have fostered the idea that algorithms which model dependence among terms in documents are needed to improve classification performance (Eickhoff, de Vries, & Hofmann, 2015; Han & Karypis, 2000; Metzler & Croft, 2005; Nallapati & Allan, 2002, 2003; Yu, Buckley, Lam, & Salton, 1983). The central idea is that one could better predict the class a document belongs to if the underlying essence of the text could be interpreted, rather than treating the text as an unordered collection of words that conveys very little logical sense. Many approaches have been proposed to materialize this concept. The copula-based language model presented by Eickhoff et al. (2015) considers sentential co-occurrence of term pairs to capture the dependence structure of the terms in a document. A centroid-based document classification algorithm represents each document as a vector in the term space, calculates a centroid vector for each class from its constituent documents, and compares any new document to the available centroids (Han & Karypis, 2000). The Markov Random Field based classifier, in turn, models dependence on a contiguous sequence of terms, representing them as a chained dependence structure (Metzler & Croft, 2005).

Standard models, on the other hand, utilize properties such as rate of occurrence, length, and distribution of features. They try to establish a relationship between a class and the properties of the documents it contains: they assign values to features that are relevant to a particular class and then estimate the membership of a new document in that class by comparing these values.

Though bolstering prediction potential by exploiting the complex dependence structures inherent to natural language seems tempting, each of these models requires significantly greater hardware resources to operate than independence models, which, as their name suggests, depend only on the properties of independent features of the text. In addition to the collection of features that independence models rely on, dependence-based models require both processing and memory for interpreting and storing relationships between the features.

Moreover, no recent literature compares the classification performance of dependence models with widely accepted independence models such as K nearest neighbours or support vector classifiers. So, to verify the validity of the argument justifying the use of complex dependence structures for text classification, we compare four classification algorithms: the Naive Bayes classifier, the copula language model, the K nearest neighbour classifier, and the support vector machine. Each classifier is used to perform classification on multiple datasets. Finally, we analyse the merits and demerits of each classifier through a close examination of the properties of the datasets and their effects on the classifiers.

We used the copula-based classifier as a benchmark for dependence models, primarily because its superior performance over other dependence models is well established, and secondly because of its recency of publication (Eickhoff et al., 2015).

The copula language model accounts for term dependence by utilizing the list of all term pairs that co-occur in sentences. Since each sentence in a document is the smallest entity that carries a sense, the co-occurrence of terms within a sentence is assumed to carry some semantic relevance to the topic. The co-occurrence measures are calculated separately for all term pairs to model a classifier for each class. The model utilizes both co-occurrence data and term probability to calculate the similarity measure of a document to a specific class.
DATASETS
Multiple datasets were used to alleviate the possibility of any bias in the evaluation. Datasets were selected for their varying document lengths, class sizes, and language (colloquial and formal). Another key aspect considered while selecting the datasets was the classification type: multi-class or multi-label. The Twitter, 20-Newsgroups, and Stack Overflow datasets were chosen for multi-class classification, while Reuters-21578 and RCV1 are multi-label datasets.

All datasets were processed in the same manner. Stop word removal was carried out based on the list of English stop words available in NLTK, and stemming was done using the Porter Stemmer.
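As a concrete illustration, a minimal sketch of this shared preprocessing step might look as follows; the paper specifies only the NLTK stop-word list and Porter stemmer, so the choice of word_tokenize as the tokenizer and the lower-casing step are our assumptions.

```python
# Sketch of the shared preprocessing pipeline: NLTK English stop-word
# removal followed by Porter stemming. Tokenizer choice is an assumption.
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(text):
    """Lower-case, tokenize, drop stop words, and stem the remaining tokens."""
    tokens = word_tokenize(text.lower())
    return [STEMMER.stem(t) for t in tokens if t.isalpha() and t not in STOP_WORDS]

print(preprocess("The markets were rising quickly."))  # ['market', 'rise', 'quickli']
```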
Reuters-21578
The corpus is a collection of 21,578 newswire articles from Reuters. It is a multi-label dataset with a total of 90 categories; David D. Lewis (n.d.) provides a detailed summary of the corpus. Class sizes of training documents range from 1 to 2,861, and the average document length in the corpus is 126 words.
RCV1-V2
The original Reuters RCV1 corpus is a collection of 800,000 documents with 103 categories. Since carrying out any operation on such a large corpus is difficult, a chronological split has been proposed (Lewis, Yang, Rose, & Li, 2004). RCV1-V2 contains one training set and four test sets: the first 23,149 documents form the training set, and the rest of the collection is split into four test sets of about 200,000 documents each. Each document belongs to at least 1 and at most 17 categories, and each topic contains at least 5 documents over the entire corpus. Some categories have no documents in the training set, so class sizes for training documents range from 0 to 10,786. Each document in the corpus is 143 words long on average.

Scikit-Learn provides a tokenized version of the corpus (scikit-learn, n.d.) that can be easily imported into Python. This version of the corpus was used as input for all the existing models except the copula language model, for which the original version, available on request (Reuters, n.d.), was used, since sentence-level co-occurrence data was needed for this algorithm. To maintain uniformity across all tests, documents omitted from version 2 of the corpus were also excluded from the original data.
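For reference, a minimal sketch of loading that tokenized version; fetch_rcv1 is the Scikit-Learn loader, and the shapes in the comments are those documented for the full corpus.

```python
# Loading the pre-tokenized RCV1 corpus through Scikit-Learn.
from sklearn.datasets import fetch_rcv1

rcv1 = fetch_rcv1()            # downloads the corpus on first call
X, y = rcv1.data, rcv1.target  # sparse term features and topic labels
print(X.shape, y.shape)        # (804414, 47236) (804414, 103)
```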
Twitter-Sample
This corpus was considered because of its short document length and use of colloquial language. It is available in the NLTK corpus library and is a collection of 10,000 tweets, separated into 5,000 positive and 5,000 negative tweets. The average tweet length in the collection is 11 words.
StackOverflow Questions
Stack Overflow is a platform where users post questions from different fields and anyone who has a solution may provide an answer. The corpus contains 20,000 such question titles from the Stack Overflow website, divided over 20 categories (J. Xu et al., 2015). This corpus was selected because its document size is similar to that of the Twitter corpus, so that any effect on classification of short text documents may be identified without bias. The average question length over the entire corpus is 8 words.
20-Newsgroups

This corpus is a collection of 18,846 news articles distributed almost evenly across 20 newsgroup categories such as comp.graphics, rec.sport.hockey, sci.electronics, and soc.religion.christian. It is available for download from its official website (Jason Rennie, n.d.), but for our purpose we used the version available in the NLTK corpus library. Class sizes range from 377 to 600 documents for training and 251 to 399 for testing. At 318 words per document, the average document size is the highest among all the datasets used in this experiment.
CLASSIFICATION
All implementations were carried out in Python. The term weights used for classification were kept consistent across classifiers, with the exception of the copula language model. Since the input scores of terms for copulas need to be normalized to [0,1], the simple probability of occurrence of a term in a class was used. The probability of occurrence of term i in a class C is given by

$P(w_i) = \frac{N_i}{\sum_{j=1}^{|C|} N_j}$

where $N_i$ is the number of occurrences of term i in class C in the training set, and the sum runs over all $|C|$ terms of the class. Every other algorithm utilized the TF-IDF scores of terms for classification; these scores were generated using the TfidfVectorizer function from Scikit-Learn.

In the case of RCV1, the tokenized data available from Scikit-Learn was used for our experiments. However, the tokens were not labelled, so co-occurrence information could not be mapped to the original data. Hence, the original RCV1 data was split into sets matching the RCV1-V2 dataset, and the copula-based classifier was run on this new data.

Finally, for performing multi-label classification on the RCV1 and Reuters-21578 datasets, the binary relevance method was employed. Binary relevance uses a collection of yes/no classifiers, one for each class in a dataset, each of which determines whether a document belongs to its class or not.
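As a minimal sketch of this setup (with toy stand-in data, since the real inputs are the corpora described above), TF-IDF weighting and binary relevance could be wired together as follows; OneVsRestClassifier implements exactly the one-yes/no-classifier-per-class scheme.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Toy stand-in data; the real experiments use the corpora described above.
train_texts = ["oil prices rise", "wheat crop fails", "oil and wheat trade"]
train_labels = np.array([[1, 0], [0, 1], [1, 1]])  # one yes/no column per class

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)    # TF-IDF term weights

# Binary relevance: one independent binary classifier per class.
clf = OneVsRestClassifier(LinearSVC())
clf.fit(X_train, train_labels)
print(clf.predict(vectorizer.transform(["oil prices fall"])))
```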
Naive Bayes Classifier
The Naive Bayes classifier is one of the most basic and simplest classification models; it uses Bayes' rule to obtain probability scores, and it is the most commonly used method for benchmarking other algorithms. The similarity score of a document to a class is expressed as

$P(t|d) = P(t) \cdot P(d|t)$

where $P(t|d)$ is the probability that document d belongs to topic t, and $P(t)$ is the prior probability of topic t, given by

$P(t) = \frac{N_t}{N_{total}}$ and $P(d|t) = \prod_{w \in d} P(w|t)$

where $N_t$ is the number of documents in the training set of topic t, $N_{total}$ is the total number of documents in the complete training set, and $P(w|t)$ is the probability that word w belongs to topic t. Though traditional Naive Bayes algorithms use simple term probabilities, we used the TF-IDF scores of the words for this experiment.

Additive smoothing was employed for smoothing the term probabilities. The general formula for additive smoothing is

$P(w_i) = \frac{n_i + \alpha}{N + \alpha|V|}$

where $P(w_i)$ is the smoothed probability of occurrence of word $w_i$ in a class, $n_i$ is the frequency of word $w_i$ in that class in the training set, $N$ is the sum of the frequencies of all words in the class, $|V|$ is the size of the vocabulary of the class, and $\alpha$ is a user-defined parameter. Laplace smoothing is the special case of additive smoothing with $\alpha = 1$; when $0 < \alpha < 1$, it is called Lidstone smoothing (Vatanen, Väyrynen, & Virpioja, 2010). For our experiments, we apply both Laplace and Lidstone smoothing (with $\alpha = 0.01$).

We used the Multinomial Naive Bayes algorithm, which accounts for the exact frequencies of terms in each class, instead of Binomial Naive Bayes. We chose this variant of the NB classifier because of its superior performance in text classification problems; multiple studies have demonstrated the efficacy of multinomial Naive Bayes in text classification (Eyheramendy, Lewis, & Madigan, 2003; McCallum, Nigam, et al., 1998; Protasiewicz, Mirończuk, & Dadas, 2017; S. Xu, Li, & Wang, 2017).
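In Scikit-Learn terms, the two variants differ only in the alpha parameter of MultinomialNB; a minimal sketch with toy feature vectors:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy non-negative feature matrix (e.g. TF-IDF scores) and class labels.
X = np.array([[2.0, 0.0, 1.0], [0.0, 3.0, 0.5]])
y = np.array([0, 1])

nb_laplace = MultinomialNB(alpha=1.0).fit(X, y)    # Laplace smoothing
nb_lidstone = MultinomialNB(alpha=0.01).fit(X, y)  # Lidstone smoothing
print(nb_lidstone.predict(np.array([[1.0, 0.0, 0.5]])))  # -> [0]
```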
K-Nearest Neighbours

As the name suggests, for an input document d, the KNN algorithm selects a user-defined number of neighbours from the set of training documents that are nearest to it. The distance between documents is calculated from their features using a similarity measure or a graph-based structure. After creating the list of the K nearest neighbours, the algorithm uses a voting scheme wherein each enlisted document places a vote for its respective class. The final decision for document d is based on the number of votes each class receives from its K nearest neighbours.

For our experiment, the choice of K was adjusted to whatever suited each corpus best. A brute-force method was used to perform classification, since the available implementation can only use brute force for sparse feature-matrix inputs. The Scikit-Learn implementation of KNN provides two options for calculating distances between documents, Euclidean and Manhattan. We used the Manhattan distance because the Euclidean implementation caused our system to run out of memory.
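A minimal sketch of that configuration on sparse toy data follows; the brute-force algorithm and Manhattan metric match the description above, while the feature values themselves are placeholders.

```python
from scipy.sparse import csr_matrix
from sklearn.neighbors import KNeighborsClassifier

# Sparse toy features; brute-force search is required for sparse inputs,
# and Manhattan distance avoids the memory blow-up seen with Euclidean.
X = csr_matrix([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = [0, 1, 1]

knn = KNeighborsClassifier(n_neighbors=2, algorithm="brute", metric="manhattan")
knn.fit(X, y)
print(knn.predict(csr_matrix([[0.2, 0.9]])))  # majority vote of 2 nearest -> [1]
```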
Support Vector Machine

A support vector machine plots the documents as points in an n-dimensional space, where each of the n features represents its own dimension. It then defines a hyperplane between these sets of points that segregates them so that the collection on either side of the hyperplane contains the maximum number of documents of the intended class. The function used to classify a document x is given by

$sign\left(\sum_i y_i \cdot w_i \cdot K(x_i', x) + b\right)$

where $y_i$ is the class value (+1 and -1 for binary classification), $w_i$ is the weight vector (the vector for the hyperplane), $K$ is the kernel function (linear in our case), $x_i'$ are the support vectors, and $b$ is the distance of the hyperplane from the origin. Since binary relevance was used in our case, each Support Vector Classifier (SVC) solved a binary classification problem, and the sign of the value determined whether the document belonged to a class.

A support vector machine can use multiple functions, called kernels, to generate the hyperplane. We used the linear kernel in our experiment, which creates a linear hyperplane. Our choice of kernel is based on the fact that, in high-dimensional vector spaces, selecting non-linear kernels runs the risk of overfitting (Ben-Hur & Weston, 2010).
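A minimal sketch of the linear SVC decision rule on toy points; the sign of decision_function is what each binary-relevance sub-problem uses to decide membership.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy binary problem with class values +1 / -1, as in the formula above.
X = np.array([[0.0, 1.0], [0.2, 0.9], [1.0, 0.0], [0.9, 0.1]])
y = np.array([1, 1, -1, -1])

svc = LinearSVC().fit(X, y)
scores = svc.decision_function(X)  # signed distance from the hyperplane
print(np.sign(scores))             # -> [ 1.  1. -1. -1.]
```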
Copula Language Model

In this language model, a classifier for any class c works with two sets of features. The first is a list of all terms present in the documents of the class, with their probabilities of occurrence. The second is a list of all term pairs that occur in the same sentence, with their respective Pointwise Mutual Information (PMI) or Jaccard coefficient values across the class, normalized to $[1, \infty)$:

$\theta_{t_1,t_2} = \frac{f(t_1, t_2)}{\mu}$ if $f(t_1, t_2) > \mu$, else $\theta_{t_1,t_2} = 1$

where $f$ is the chosen co-occurrence metric between terms $t_1$ and $t_2$, and $\mu$ is the average over all $f(t_i, t_j)$ with $i \neq j$. When $\theta$ is 1, the terms are completely independent, and an increasing coefficient value implies increasing dependence. In our version of the algorithm we used PMI, as it generated marginally superior results compared to the Jaccard coefficient in every case.

The probability that a document d belongs to a topic t is calculated as

$P(t|d) = P(t) \cdot P(d|t)$

where $P(t)$ is the prior probability of topic t and

$P(d|t) = C_t(w_1, w_2, w_3, \ldots, w_n)$, $w_i \in d$

with

$C_t(w_1, w_2, w_3, \ldots, w_n) = \psi^{-1}(\psi(w_1) + \psi(w_2) + \psi(w_3) + \ldots + \psi(w_n))$.

We used the Gumbel copula from the Archimedean family, as it was reported to produce the best results in Eickhoff and de Vries (2014). $\psi$ and $\psi^{-1}$ for Gumbel copulas are defined by

$\psi(u) = (-\log(u))^{\theta}$ and $\psi^{-1}(u) = \exp(-u^{1/\theta})$.

Thus:

$C_t(u_i, u_j) = \exp\left(-\left((-\log(u_i))^{\theta} + (-\log(u_j))^{\theta}\right)^{1/\theta}\right)$

where $u_i$, $u_j$ are the probabilities of occurrence of words $w_i$, $w_j$ in topic t, and $\theta$ is a parameter representing the strength of dependence between the individual words $w_i$, $w_j$ in topic t, expressed as PMI in our case. It is important to note that when $\theta$ is 1, i.e. when the words are completely independent,

$C_t(u_i, u_j) = u_i \cdot u_j$.

For the sake of simplicity, the value of $\theta$ between two term pairs, say $(w_i, w_j)$ and $(w_l, w_k)$, is assumed to be 1, making the pairs independent of each other. This causes the copula function to become a product of bivariate copulas:

$C_t(w_1, w_2, w_3, \ldots, w_n | c) = C_t(w_1, w_2 | c) \cdot C_t(w_3, w_4 | c) \cdot \ldots$ for all $(w_i, w_j)$ with $\theta > 1$.

We used Jelinek-Mercer smoothing for all terms in the corpus (Jelinek, 1980). Any unfamiliar words from the test set were omitted during the classification process. Also, when considering term pairs, a single word may occur in multiple pairs; in our implementation we chose to include the contribution of the probability scores of these recurring terms.

The algorithm for this language model closely resembles the Naive Bayes algorithm, in which complete independence of terms is assumed. Given a collection of terms $w_1, w_2, w_3, \ldots, w_k$ from a document d and their probabilities of occurrence $u_1, u_2, u_3, \ldots, u_k$ in a certain topic t, the copula-based similarity score of the document to the topic is given by

$C_t(u_1, u_2, u_3, \ldots, u_k) = \psi^{-1}(\psi(u_1) + \psi(u_2) + \psi(u_3) + \ldots + \psi(u_k))$.

If we assume complete independence as Naive Bayes does ($\theta = 1$),

$C_t(u_1, u_2, u_3, \ldots, u_k) = u_1 \cdot u_2 \cdot u_3 \cdot \ldots \cdot u_k$

which is equal to the Bayes formula. But since this algorithm accounts for term dependence using sentential co-occurrence, it is expected to perform better than simple Naive Bayes classification. The goal is to figure out whether this extra computation to improve performance using term dependence is beneficial to the text classification paradigm.
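A minimal sketch of the bivariate Gumbel copula at the heart of this model; theta is the normalized co-occurrence strength (PMI-based in our experiments), and theta = 1 recovers the independence product.

```python
import math

def gumbel_copula(u_i, u_j, theta):
    """Bivariate Gumbel copula C(u_i, u_j) with dependence parameter theta >= 1."""
    psi_i = (-math.log(u_i)) ** theta          # generator: psi(u) = (-log u)^theta
    psi_j = (-math.log(u_j)) ** theta
    return math.exp(-((psi_i + psi_j) ** (1.0 / theta)))  # psi^{-1} of the sum

# theta = 1 reduces to the Naive Bayes independence product u_i * u_j.
print(gumbel_copula(0.2, 0.5, 1.0))  # 0.1 == 0.2 * 0.5
print(gumbel_copula(0.2, 0.5, 2.5))  # ~0.186: dependence raises the joint score
```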
EXPERIMENTS

For every test except those with copulas, the Scikit-Learn implementations of the algorithms were used. The input fed to each classifier was the exact same pre-processed data, generated using standard NLTK library functions.

Since there are no separate training and testing sets for the Twitter and Stack Overflow datasets, 10-fold cross-validation was carried out to assess the performance of each classifier on them. For RCV1, classification was carried out on each test set separately, and the average of the F1-measures over all 4 test sets is reported for each classifier.

Parameter values for almost every classifier were left at their defaults, except for the value of K for the KNN classifier, which was optimized for the best result: K was set to 100 for both short-text datasets (Stack Overflow and twitter-samples) and to 15 for the rest. The α parameter for Jelinek-Mercer smoothing was set to 0.99 for the copula classifier.

The micro-averaged F1-scores of all the classifiers on the corresponding datasets are listed in Table 1.
Corpus           NB α=1   NB α=0.01   KNN    Copula   SVC
Reuters-21578    0.51     0.77        0.79   0.64     0.87
RCV1             0.49     0.71        0.72   0.66     0.80
20-NewsGroups    0.78     0.84        0.74   0.82     0.85
Stack Overflow   0.83     0.78        0.74   0.76     0.85
Twitter          0.75     0.72        0.71   0.70     0.75

Table 1. Micro-averaged F1 scores.

Significance Testing
A statistical significance test was carried out for each dataset using the Wilcoxon signed-rank test. The test compared the F1-scores of each classifier with those of the copula model. Two test strategies were used, based on the type of data:

1. For the datasets that required K-fold cross-validation, the micro-averaged F1-scores of each fold were compared. For both the Stack Overflow and Twitter corpora we used 10-fold cross-validation.

2. For the other datasets, which had predefined training and test sets, we performed category-wise testing: the F1-scores of each category were compared in order to measure whether the differences in scores were statistically significant (Yang & Liu, 1999).

Table 2 summarizes our observations from the significance tests. The test results are classified into 3 categories: category I for results that satisfied a confidence level of α = 0.01, category II for results whose P-value lay between 0.01 and 0.05, and category III for P-values greater than 0.05. Results marked with an asterisk signify statistical significance of hypotheses that contradict the conclusions drawn from Table 1.
Corpus           SVC    NB α=1   NB α=0.01   KNN
Reuters-21578    I      I        III         III
RCV1 Set-1       I      I        I*          II*
RCV1 Set-2       I      I        I*          III
RCV1 Set-3       I      I        I*          III
RCV1 Set-4       I      I        I*          II*
20-NewsGroups    III    III      II          I
Stack Overflow   I      I        I           III
Twitter          I      I        I           III

Table 2. P-value categories for classification performance.
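A minimal sketch of the first test strategy described above, assuming paired per-fold micro-averaged F1-scores; the score arrays below are illustrative placeholders, not results from the paper.

```python
from scipy.stats import wilcoxon

# Hypothetical paired micro-averaged F1-scores from the 10 CV folds.
copula_f1 = [0.76, 0.75, 0.77, 0.74, 0.76, 0.75, 0.77, 0.76, 0.74, 0.75]
svc_f1    = [0.85, 0.84, 0.86, 0.85, 0.84, 0.85, 0.86, 0.84, 0.85, 0.85]

stat, p = wilcoxon(copula_f1, svc_f1)  # paired, non-parametric test
print(p)  # p < 0.01 -> category I; 0.01-0.05 -> II; > 0.05 -> III
```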
RESULTS AND DISCUSSION
From a rough perusal of Table 1 we can make a few observations:

1. The linear support vector classifier (SVC) outperforms the copula language model in every case, in spite of the latter using co-occurrence data and, as a result, utilizing significantly greater hardware resources.

2. The Naive Bayes model with Laplace smoothing performs surprisingly well for short text data.

In order to properly analyse the results, we create a list of properties for each dataset that help explain the performance of the classifiers. These properties are listed in Table 3. The first column gives the average variance of the frequency of terms across all classes in the corpus. The Stack Overflow and Twitter corpora have the lowest variance in term frequencies, which could be a result of their short document lengths and the fact that both contain informal text, while the other three corpora, which contain long, formally written news articles, have a higher variance in frequency, with Reuters-21578 having the highest value. The second column contains the average document length of the corresponding corpus, and the final column lists the type of classification task the corpus requires.
Corpus           Var_tf      Doc Length   Type
Reuters-21578    5.228e-03   126          Multi-Label
RCV1             1.117e-03   143          Multi-Label
20-NewsGroups    1.474e-04   318          Multi-Class
Stack Overflow   6.845e-05   8            Multi-Class
Twitter          1.076e-06   11           Multi-Class

Table 3. Corpus properties.
We studied the relationship of each of these properties to the classification scores of the models and concluded that the models' behaviour could be attributed to a combination of all the listed properties. The most intuitive comparisons, however, could be drawn on the basis of term variance, so we plot the F1-scores of the classifiers against all five corpora in decreasing order of term variance. Figure 1 shows the performance of each of the classifiers: copula, Naive Bayes, linear SVC, and KNN.
Figure 1. Plot of F1-scores for all classifiers.

On a preliminary examination of the graph, we observe that while the F1 values of both the copula and Naive Bayes classifiers increase substantially when transitioning from multi-label to multi-class classification, this phenomenon does not seem to affect the performance of either of the discriminative models. Both discriminative models, KNN and SVC, have relatively uniform performance scores compared to the erratic scores of the generative models. Another pattern common to almost every curve is that datasets with higher term variance generally yield better classification accuracy, for both multi-class and multi-label datasets.

Finally, Figure 1 makes it quite clear that the copula language model and the Naive Bayes classifier with Lidstone smoothing follow a similar trend in classification performance, which shows that our earlier hypothesis about the copula language model sharing certain properties with the Naive Bayes algorithm was well founded.

For short texts like Twitter and Stack Overflow, both versions of the Naive Bayes classifier outperform the copula language model. This is a result of insufficient hits on the co-occurrence list generated from the training set, which is a direct outcome of short document length. The F1-score of KNN also plummeted when classification was carried out on short text documents, and the number of neighbours had to be adjusted to 100 to improve accuracy.

Moving on to Table 2, we observe that, even with a substantial difference between the micro-averaged F1-scores of copulas and Naive Bayes with Lidstone smoothing on the Reuters-21578 data, the test yields a P-value greater than 0.05. More surprisingly, the significance score of this algorithm for the RCV1 corpus indicates, at the 0.01 confidence level, that copula is the better algorithm. KNN presented similarly counter-intuitive significance scores for both multi-label corpora, despite a very large margin of difference in micro-averaged F1-scores. SVC and Naive Bayes with Laplace smoothing also have P-values higher than 0.05 for the 20-NewsGroups dataset.

Similar results were observed for the Twitter and Stack Overflow corpora in the case of KNN, but the differences in its micro-averaged F1-scores from those of copulas were very small in both cases, and no relevant conclusion could be drawn even with a more detailed analysis; these observations were attributed to data bias.

To better understand the anomalies in the confidence scores for the rest, we generated Figures 2 through 7, which show the performance scores of each classifier over all the data points used in the significance tests. The X-axis lists all classes in a corpus in decreasing order of class size, and the Y-axis plots the F1-scores.

For SVC, in Figure 2, the differences in scores are not as significant, but there are 6 cases where copula marginally outperforms this model, resulting in a marginally higher P-value. More interestingly, in each case the copula model demonstrates a significant and consistent improvement in classification accuracy over the other algorithms for classes that have a low document frequency. The superior performance of the classifier in such categories is thus the cause of the shift in the significance values. This also sheds new light on the properties of the copula classification model.
We learn that the information the model accumulates using term dependence helps classification accuracy for classes with inadequate features.

To further investigate this property, we plotted the F1-scores of all classifiers for only the classes that had the highest difference in performance. Studying Figures 2 through 7, we observe that this phenomenon is most clearly visible for class sizes 16 through 2 of the Reuters-21578 corpus. Figure 8 presents the accuracy measures of all the classifiers for this sequence of classes; the superior performance of the dependence model is clearly visible in the chart.

But even with copulas showing impressive performance, SVC still manages to do a better job of classifying documents in most cases. To eliminate the possibility of a bias, we plotted the F1-scores of the two classifiers for a similar range of class sizes from a different corpus. The RCV1 corpus was the only other corpus with comparable class sizes, so we used the results of its four test sets to compare the performance of the two classifiers. Figures 9 to 12 present the F1-scores of the smallest classes of each test set for the two classifiers. In all 4 cases, copulas clearly take the lead.

Thus it can be concluded that, for long text data, copulas evenly match SVM-based classification on classes with sparse features. The relative scores of the two classifiers will depend on the nature of the data, but in general both algorithms perform decently for such classes.
CONCLUSION
From the extensive set of experiments that were carried out, it is clear that Support Vector Machine based classifiers continue to dominate and remain the most reliable. All the other classifiers had their own limitations. Even though the copula model demonstrated impressive performance for classes with a limited number of documents, SVM achieved nearly equal performance in a fraction of the time. The copula model also performed poorly on short text datasets, where Naive Bayes demonstrated why it remains a benchmark for other classification algorithms. While the classification accuracy of the generative classifiers faltered in multi-label problems compared to multi-class ones, the discriminative methods maintained very stable curves. Most importantly, the copula language model, in spite of boasting the use of complex dependence structures, failed to impress.

It is therefore clear that dependence models like copulas are still outperformed by common independence methods like KNN and SVM, and that even small modifications to the Naive Bayes classifier, such as changing the smoothing parameter, can sometimes yield better scores. A state-of-the-art dependence model could not hold its place among existing classification algorithms that do not model term dependence and are therefore less resource-intensive. So the question remains: should researchers continue to introduce new dependence models that perform better than their predecessors, or focus on improving the performance of existing methods?

The inherent limitation of modelling term dependence on text data lies in the considerably high computation and storage costs, which in turn demand significantly higher classification accuracy from these algorithms. Term dependence obviously has its perks, but not every algorithm that models it can automatically be expected to beat existing state-of-the-art models like SVM. Proposing dependence models that hold up against the best existing models, even for specific use cases, could be a valuable contribution. But introducing new algorithms that generate marginally better results than models which are themselves not very efficient may not be in the best interest of the field's progress.
ACKNOWLEDGEMENT
We would like to thank Kaggle and Scikit-Learn for making the StackOverflow dataset and a tokenized version of the RCV1-V2 dataset available to us for our experiments.

References

Ben-Hur, A., & Weston, J. (2010). A user's guide to support vector machines. In Data mining techniques for the life sciences. Springer.

David D. Lewis. (n.d.). Reuters-21578 text categorization test collection. http://www.daviddlewis.com/resources/testcollections/reuters21578/

Eickhoff, C., & de Vries, A. P. (2014). Modelling complex relevance spaces with copulas. In Proceedings of the 23rd ACM international conference on conference on information and knowledge management (pp. 1831–1834). ACM.

Eickhoff, C., de Vries, A. P., & Hofmann, T. (2015). Modelling term dependence with copulas. In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval (pp. 783–786). ACM.

Eyheramendy, S., Lewis, D. D., & Madigan, D. (2003). On the naive Bayes model for text categorization.

Han, E.-H. S., & Karypis, G. (2000). Centroid-based document classification: Analysis and experimental results. In European conference on principles of data mining and knowledge discovery (pp. 424–431). Springer.

Jason Rennie. (n.d.). The 20 Newsgroups data set. http://qwone.com/~jason/20Newsgroups/ (Jan. 2008).

Jelinek, F. (1980). Interpolated estimation of Markov source parameters from sparse data. In Proc. workshop on pattern recognition in practice, 1980.

Lewis, D. D., Yang, Y., Rose, T. G., & Li, F. (2004). RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5(Apr), 361–397.

McCallum, A., Nigam, K., et al. (1998). A comparison of event models for naive Bayes text classification. In AAAI-98 workshop on learning for text categorization (pp. 41–48). Madison, WI.

Metzler, D., & Croft, W. B. (2005). A Markov random field model for term dependencies. In Proceedings of the 28th annual international ACM SIGIR conference on research and development in information retrieval (pp. 472–479). ACM.

Nallapati, R., & Allan, J. (2002). Capturing term dependencies using a language model based on sentence trees. In Proceedings of the eleventh international conference on information and knowledge management (pp. 383–390). ACM.

Nallapati, R., & Allan, J. (2003). An adaptive local dependency language model: Relaxing the naive Bayes' assumption.

Protasiewicz, J., Mirończuk, M., & Dadas, S. (2017). Categorization of multilingual scientific documents by a compound classification system. In International conference on artificial intelligence and soft computing (pp. 563–573). Springer.

Reuters. (n.d.). Reuters Corpora. http://trec.nist.gov/data/reuters/reuters.html/

scikit-learn. (n.d.). RCV1 dataset. http://scikit-learn.org/stable/datasets/rcv1.html/

Vatanen, T., Väyrynen, J. J., & Virpioja, S. (2010). Language identification of short text segments with n-gram models. In LREC.

Xu, J., Peng, W., Guanhua, T., Bo, X., Jun, Z., Fangyuan, W., Hongwei, H., et al. (2015). Short text clustering via convolutional neural networks.

Xu, S., Li, Y., & Wang, Z. (2017). Bayesian multinomial naive Bayes classifier to text classification. In Advanced multimedia and ubiquitous engineering (pp. 347–352). Springer.

Yang, Y., & Liu, X. (1999). A re-examination of text categorization methods. In Proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval (pp. 42–49). ACM.

Yu, C. T., Buckley, C., Lam, K., & Salton, G. (1983). A generalized term dependence model in information retrieval. Cornell University.
Figure 2. Plot of F1-scores across all classes for the 20-NewsGroups corpus.

Figure 3. Plot of F1-scores across all classes for the Reuters-21578 corpus.

Figure 4. Plot of F1-scores across all classes for test set 1 of the RCV1 corpus.

Figure 5. Plot of F1-scores across all classes for test set 2 of the RCV1 corpus.

Figure 6. Plot of F1-scores across all classes for test set 3 of the RCV1 corpus.

Figure 7. Plot of F1-scores across all classes for test set 4 of the RCV1 corpus.

Figure 8. Plot of F1-scores of all classifiers for small classes from the Reuters-21578 corpus.
Figure 9. F1-scores for small classes from RCV1, Set 1.

Figure 10. F1-scores for small classes from RCV1, Set 2.

Figure 11. F1-scores for small classes from RCV1, Set 3.

Figure 12. F1-scores for small classes from RCV1, Set 4.