Text Categorization via Similarity Search: An Efficient and Effective Novel Algorithm
Hubert Haoyang Duan, Vladimir Pestov, and Varun Singla

University of Ottawa, Ottawa, Ontario K1N 6N5, Canada
{hduan065,vpest283}@uottawa.ca

Indian Institute of Technology Delhi, Hauz Khas, New Delhi-110 016, India
[email protected]
Abstract.
We present a supervised learning algorithm for text categorization which brought the team of authors the 2nd place in the text categorization division of the 2012 Cybersecurity Data Mining Competition (CDMC'2012) and a 3rd prize overall. The algorithm is quite different from existing approaches in that it is based on similarity search in the metric space of measure distributions on the dictionary. At the preprocessing stage, given a labeled learning sample of texts, we associate to every class label (document category) a point in the space in question. Unlike what is usual in clustering, this point is not a centroid of the category but rather an outlier: a uniform measure distribution on a selection of domain-specific words. At the execution stage, an unlabeled text is assigned a text category as defined by the closest labeled neighbour to the point representing the frequency distribution of the words in the text. The algorithm is both effective and efficient, as further confirmed by experiments on the Reuters 21578 dataset.
1 Introduction

The amount of text readily available in the world is growing at an astonishing rate; classifying these texts through machine learning techniques, promptly and without much human intervention, has thus become an important problem in data mining. Much research in the field of supervised learning has been done to find accurate algorithms to classify documents in a dataset to their appropriate categories; e.g., see [27] or [1] for a detailed survey of text categorization.

The most widely used model for text categorization is the Vector Space Model (VSM) [24]. Under this model, a data dictionary T consisting of unique words across the documents in the dataset is constructed. The documents are represented by real-valued vectors in the space R^T with dimension equal to the size of the dictionary. Given t ∈ T, the t-th coordinate of a vector is the relative frequency of the word t in a given document. When some of the documents' actual class labels are known and used for training, many well-known classifiers in supervised machine learning, such as SVM [6], k-NN [7], and Random Forest [3], can then be applied to categorize documents.

Text categorization decidedly comes across as a problem of detecting similarities between a given text and a collection of texts of a particular type. Although distance-based learning rules for text categorization, such as the k-nearest neighbour classifier, e.g. [18], are not new, they are currently based on the entire feature space, while any dimension reduction steps are done independently beforehand [27]. We aim to fill this gap by suggesting a novel supervised learning algorithm for text categorization, called the Domain-Specific classifier.

⋆ This work has been partially supported by a 2012 NSERC Canada Graduate Scholarship and a 2013 Ontario Graduate Scholarship (Hubert Haoyang Duan), the 2012–2017 NSERC Discovery Grant "New set-theoretic tools for statistical learning" (Vladimir Pestov), and the 2012 Mitacs Globalink Program (Varun Singla).
It discovers specific words for each category, or domain, of documents in training and classifies based on similarity searches in the space of word frequency distributions supported on the respective domain-specific words.

For each class label j = 1, 2, ..., k, our algorithm extracts class-, or domain-, specific words from labeled training documents, that is, words that appear in class j more frequently than in all the other document classes combined (modulo a given threshold). A given unlabeled document is then assigned a label j if the normalized frequency of domain-specific words for j in the document is higher than for any other label.

To see that this classifier is indeed similarity-search based, let x_j ∈ R^T be a binary vector whose t-th coordinate is 1 if and only if t is domain-specific to j, and 0 otherwise. Normalize x_j according to the ℓ_p distance, and let a document be represented by a vector w ∈ R^T. Then the label assigned to w is that of the closest neighbour to w among x_1, x_2, ..., x_k with regard to the simplest similarity measure, the inner product on R^T. In other words, we seek to maximize the value of ⟨w, x_j⟩ over j = 1, 2, ..., k. Notice that the well-known cosine similarity measure, cf. e.g. [27], corresponds to the special case p = 2.

This algorithm was first used in the 3rd Cybersecurity Data Mining Competition (CDMC 2012) to notable success, as the team of authors placed second in the text categorization challenge and first in classification accuracy [19]. In addition, the classification performance of the algorithm was validated on a sub-collection of the popular Reuters 21578 dataset [16], consisting of single-category documents from the top eight most frequent categories, with the standard "modApté" training/testing split. In terms of accuracy, our classifier performs slightly better than SVM with a linear kernel, and is significantly faster.

This paper is organized as follows.
Section 2 surveys common feature selection and extraction methods and classifiers considered in the text categorization literature. Section 3 explains the new Domain-Specific classifier in detail and casts it as a similarity search problem. Section 4 discusses results from the CDMC 2012 data mining competition and experiments from the competition and on the Reuters 21578 dataset. Finally, Section 5 concludes the paper with some discussion and directions for future work.

2 Related Work
In this section, we describe the VSM model and provide a brief survey of widely known methods for text categorization.

From this section onwards, following notation similar to [27], we let D = {d_1, d_2, ..., d_n} denote the dataset of documents, with size n = |D|, and T = {t_1, t_2, ..., t_m} the data dictionary of all unique words from documents in D, with size m = |T|. Given a document d and a word t ∈ T, |d| denotes the number of words in d, and t ∈ d indicates that the word t is found in d.

The Vector Space Model (VSM) [24] is the most common model for document representation in text categorization. According to [1], there is usually a standard preprocessing step for the documents in D, where all letters are converted to lowercase and all stop words, such as articles and prepositions, are removed. Sometimes a stemming algorithm, such as the widely used Porter stemmer [20], is applied to remove suffixes of words (e.g. the word "connection" → "connect").

In VSM, the data dictionary, consisting of all unique words that appear in at least one document in D, is first constructed. Sometimes n-grams, which are phrases of words, are also included in the dictionary; however, the benefit of these additional phrases is still up for debate [27]. Given the data dictionary, each document can be represented as a vector in the real-valued vector space with dimension equal to the size of the dictionary. Two common methods for associating a document to a vector are explained below.

The simplest method assigns to a document d the vector consisting of the relative term frequencies for d, see e.g. [11]. The second, known as the tf-idf method, assigns to d the vector consisting of the products of term and inverse document frequencies [23]. Mathematically speaking, a document d is mapped to a real-valued vector of length m:
d ↦ (w_1, w_2, ..., w_m) ∈ R^m, where

    w_i = c(t_i, d) / |d|   (frequency method)   (1)

or

    w_i = c(t_i, d) · log( n / |{d ∈ D : t_i ∈ d}| )   (tf-idf method),   (2)

where c(t_i, d) denotes the number of times the word t_i appears in d. Other representations include binary and entropy weightings and the normalized tf-idf method [1].

Once the documents are represented as vectors, the dataset can be interpreted as a data matrix M of size (n × m). However, a main challenge for text categorization is that the size of the data dictionary is usually immense, so the data matrix is extremely high-dimensional. Dimension reduction techniques must often be applied before classification to reduce complexity [1].

2.2 Feature Selection and Extraction Methods

Due to the potentially large size of the data dictionary, feature selection and extraction methods are often applied to reduce the dimension of the data matrix. Feature selection methods assign to each feature, a word in the data dictionary, a statistical score based on some measure of importance. Only the highest-scored features, past some defined threshold, are kept, and a lower-dimensional data matrix is created from only these features. Some known feature selection methods in text categorization include calculating the document frequency, e.g. [31], mutual information [5], and χ² statistics, e.g. [26]. See e.g. [10] or [31] for a thorough study of text feature selection methods.

Feature extraction methods transform the original list of features into a smaller list of new features, based on some form of feature dependency. Common well-known feature extraction techniques, as surveyed in [22], are Latent Semantic Indexing (LSI) [8], Linear Discriminant Analysis (LDA) [28], Partial Least Squares (PLS) [32], and Random Projections [2].
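As a concrete illustration, the two document-to-vector mappings of Eqs. (1) and (2) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation (which was done in R); the tiny stop-word list and all function names here are our own.

```python
import math
import re

# A tiny illustrative stop-word list; real systems use much larger ones.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}

def preprocess(text):
    """Standard VSM preprocessing: lowercase, tokenize, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def term_frequencies(doc, dictionary):
    """Eq. (1): w_i = c(t_i, d) / |d|, the relative term frequencies."""
    return [doc.count(t) / len(doc) for t in dictionary]

def tf_idf(doc, dictionary, corpus):
    """Eq. (2): w_i = c(t_i, d) * log(n / |{d in corpus : t_i in d}|)."""
    n = len(corpus)
    vector = []
    for t in dictionary:
        df = sum(1 for d in corpus if t in d)  # document frequency of t
        vector.append(doc.count(t) * math.log(n / df) if df > 0 else 0.0)
    return vector

corpus = [preprocess("The cat sat in the sun."), preprocess("A dog ran to the cat.")]
dictionary = sorted(set(w for d in corpus for w in d))  # ['cat', 'dog', 'ran', 'sat', 'sun']
print(term_frequencies(corpus[0], dictionary))  # [1/3, 0.0, 0.0, 1/3, 1/3]
```

Note that "cat", which occurs in every document of this toy corpus, gets a tf-idf weight of zero, while rare words are emphasized.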
Well-known classifiers that have been applied to text categorization include the k-nearest neighbour classifier [18], Support Vector Machines [12], the Naive Bayes classifier [15], and decision trees [13]. We ask the reader to refer to the indicated references, or to survey articles such as [27], [11], and [1]. The paper [30] provides comparable empirical results on some of these classifiers.

The standard approach in the literature for text categorization is that one, or more, feature selection or extraction techniques are first applied to a data matrix, since the original data matrix is often extremely high-dimensional. A learning algorithm, independent of the dimension reduction process, is then used for classification [27]. The novel approach in this paper is that we consider a new classifier based only on extracted class-specific words, which naturally reduces time complexity and the dimension of the dataset. In other words, the Domain-Specific classifier both performs dimension reduction and classifies, in consecutive and dependent steps.

3 The Domain-Specific Classifier

Our algorithm consists of two distinct stages: extraction of domain-specific words from training samples and classification of documents based on the closest labeled point determined by these domain-specific words.
Fix an alphabet Σ and denote by Σ* the set of all possible "words" formed from Σ. A document d is then simply an ordered sequence of "words", d ∈ (Σ*)^{|d|}, and the data dictionary T is a subset of Σ*. Given a set of labeled documents, we can denote it as D_lab = {(d_1, l_1), (d_2, l_2), ..., (d_n, l_n)}, where d_i is a document and l_i ∈ {1, 2, ..., k} is its label, out of a possible k different labels. In addition, we can partition D_lab into subsets of documents according to their labels:

    D_lab = ⋃_{j=1}^{k} D_lab^j,   (3)

where D_lab^j = {(d, l) ∈ D_lab : l = j} is the set of documents of label j. Then, for a particular label j and a word t ∈ T in the data dictionary, we denote by f_j(t) the average proportion of times the word t appears in documents with label j:

    f_j(t) = (1 / |D_lab^j|) Σ_{(d,j) ∈ D_lab^j} c(t, d) / |d|.   (4)

Domain-specific words are those words which appear, on average, proportionally more often in one label type of documents in D_lab than in the other types.

Definition 1.
Let α ≥ 0. A word t ∈ T in the data dictionary is domain (or class) j specific if

    f_j(t) > α Σ_{j' ≠ j} f_{j'}(t).   (5)

This definition of domain-specific words depends on the parameter α, and hence so does the Domain-Specific classifier. As α increases from 0, the number of domain-specific words for each class label decreases; as a result, α can be thought of as a threshold parameter, and an optimal choice for α is determined through cross-validation using training data.

Let now d be an unclassified document. We associate to it a vector w = w_d ∈ R^T (a relative frequency distribution of words) as in Eq. (1), that is, for every t ∈ T,

    w(t) = c(t, d) / |d|.   (6)

Let j be a label. Denote by CS_j = CS_{j,α} the set of domain-specific words for j. Define the total relative frequency of domain-j specific words found in d:

    w[CS_j] = Σ_{t ∈ CS_j} w(t) = (1/|d|) Σ_{t ∈ CS_j} c(t, d).   (7)

The classifier assigns to d the label j for which the following ratio is the highest:

    j = argmax_i  w[CS_i] / |CS_i|^{1/p}.   (8)

Here, p ∈ (0, ∞] is a parameter which normalizes a certain measure with regard to the ℓ_p distance, cf. below in Section 3.3.

3.3 Space of Positive Measures on the Dictionary

A (positive) measure on a finite set T is simply an assignment t ↦ w(t) to every t ∈ T of a non-negative number w(t); a probability measure also satisfies Σ_{t ∈ T} w(t) = 1. Denote by M(T) the set of all positive measures on T. Fix a parameter p ∈ (0, ∞]. The following is a positive measure on T:

    x_j(t) = |CS_j|^{-1/p}, if t ∈ CS_j;  0, otherwise.   (9)

If p = 1, we obtain a probability measure uniformly supported on the set of domain-j specific words. In general, values of p ∈ (0, ∞] correspond to different normalizations of the uniform measure supported on these words, according to the ℓ_p distance.
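Pulling Eqs. (4)–(9) together, the whole classifier fits in a short sketch. This is our own minimal Python rendering under stated assumptions (documents are pre-tokenized lists of words; all function names are ours), not the authors' R implementation.

```python
from collections import defaultdict

def average_proportions(labeled_docs):
    """Eq. (4): f_j(t), the average relative frequency of word t
    over the training documents carrying label j."""
    sums = defaultdict(lambda: defaultdict(float))
    n_docs = defaultdict(int)
    for tokens, label in labeled_docs:
        n_docs[label] += 1
        for t in tokens:
            sums[label][t] += 1.0 / len(tokens)
    return {j: {t: s / n_docs[j] for t, s in words.items()}
            for j, words in sums.items()}

def domain_specific_words(f, alpha):
    """Definition 1: t is class-j specific if f_j(t) > alpha * sum over
    j' != j of f_j'(t)."""
    return {j: {t for t, v in f[j].items()
                if v > alpha * sum(f[jp].get(t, 0.0) for jp in f if jp != j)}
            for j in f}

def classify(tokens, cs, p=1.0):
    """Eq. (8): the label maximizing w[CS_j] / |CS_j|^(1/p); equivalently,
    the inner product of w with the normalized uniform measure x_j of Eq. (9)."""
    def score(j):
        if not cs[j]:
            return float("-inf")
        w_cs = sum(1 for t in tokens if t in cs[j]) / len(tokens)  # Eq. (7)
        return w_cs / len(cs[j]) ** (1.0 / p)
    return max(cs, key=score)

train = [(["stock", "market", "stock"], "business"),
         (["market", "profit"], "business"),
         (["goal", "match", "goal", "team"], "sports")]
f = average_proportions(train)
cs = domain_specific_words(f, alpha=2.0)
print(classify(["goal", "team", "profit"], cs))  # sports
```

In this toy run, "goal" and "team" are sports-specific and outweigh the single business-specific word "profit", so the document is labeled sports.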
(The case p = ∞, that is, the ℓ_∞ distance, corresponds to the non-normalized uniform measure.)

Among the similarity measures on M(T), we single out the standard inner product

    ⟨w, v⟩ = Σ_t w_t v_t.   (10)

Notice that for every w ∈ M(T) and each j,

    ⟨w, x_j⟩ = w[CS_j] / |CS_j|^{1/p},   (11)

and for this reason, the classification algorithm (8) can be rewritten as follows:

    j = argmax_i ⟨w, x_i⟩.   (12)

Our classifier is based on finding the closest point x_i to the input point w in the sense of the simplest similarity measure, the inner product.

The similarity workload is a triple (U, S, X), consisting of the domain U = M(T), the similarity measure S(w, v) = ⟨w, v⟩ equal to the standard inner product (a rather common choice in the problems of statistical learning, cf. [25]), and the dataset X = {x_1, x_2, ..., x_k} of normalized uniform measures corresponding to the text categories and domain-specific words extracted at the preprocessing stage.

Note that the well-known cosine similarity measure arises in the special case when the normalizing parameter is p = 2; hence, it is not necessary to consider this measure separately. Our experiments have shown that different datasets require different normalizing parameters for x_j, and that the optimal normalization depends on the sizes of the document categories; Section 5 includes a discussion of this topic.

4 Experiments and Results

This section details the experiments and results obtained for the Domain-Specific classifier in the 2012 Cybersecurity Data Mining Competition and on the Reuters 21578 dataset. All of the programming for this section was done with standard packages in R [21] and with the specialized packages e1071 [9] and randomForest [17], on a desktop running Windows 7 Enterprise with an Intel i5 3.10 GHz processor and 4 GB of RAM.
The 3rd Cybersecurity Data Mining Competition (CDMC 2012) [19], associated with the 19th International Conference on Neural Information Processing (ICONIP 2012) in Doha, Qatar, from November 12-15, 2012 [29], included three supervised classification tasks: electronic news (e-News) text categorization, intrusion detection, and handwriting recognition.

The Domain-Specific classifier was first developed by the team of authors for the e-News text categorization challenge, which required classifying news documents into five topics: business, entertainment, sports, technology, and travel. The documents were collected from eight online news sources. The words in these documents were obfuscated, and all punctuation and stop words removed. Here is a sample paragraph from a scrambled text document from the competition:
HUJR Xj gjXZMUXe fAJjAeK UO jwXeA URSek UYjmX xjI K SeeW eOWrjJeeR ZARWZDekUAk WDjkmzXZMe KXR UA eRReAXZUr BmeRXZjA RZAze zjOWUAZeR XjkUJ OmRXUzzjOWrZRx OjDe IZXx weIeD WejWre uxe OjRX RmzzeRRwmr RXUDXmWR OmRX
In total, 1063 e-News documents for training, each labeled as one of the k = 5 topics, were given for the goal of classifying 456 documents.

Table 1. Information on the dataset size for the e-News classification task.
Label j | Topic
After a pre-processing step where all document words of length less than or equal to 3 were removed, a data dictionary of all unique words from the training and classification documents, consisting of m = 55822 words, was constructed. The 1063 labeled documents were converted to vectors of length m = 55822, according to the Vector Space Model. Then, 5-fold cross-validation on the training dataset was performed to test the performance of the classifier.

For comparison purposes, the Support Vector Machines (SVM) classifier [6], using the Gaussian Radial Basis (GRB) and linear kernels with cost 10, and the Random Forest classifier [3], using 50 trees, were also tested. The performance measures considered were classification accuracy and the F-Measure (F1) [1]. See Table 2, where the computation times for both the training and predicting stages are also indicated.

Table 2.
Classification performance of the Domain-Specific classifier (DSC) through 5-fold cross-validation on the training set, compared to SVM with the Gaussian Radial Basis (GRB) and linear kernels and Random Forest (RF).
(Table rows: DSC at several values of α, SVM with GRB and linear kernels, and RF; the numerical entries are not preserved in this copy.)

The choice α = 2 resulted in the best accuracy of 0.915 for the Domain-Specific classifier, and the optimal normalizing parameter was p = 1, corresponding to the choice of x_1, x_2, ..., x_k normalized as probability measures uniformly supported on the domain-specific words. Consequently, these two values were used for the classification of the 456 documents in the competition. Note that the accuracy score and the F-Measure for 4 out of the 5 categories, for α = 2, were higher than the respective scores obtained with SVM with the GRB and linear kernels and with Random Forest. Experiments had also shown that the Domain-Specific classifier was extremely fast and efficient, since distance calculations are based only on domain-specific words, not the entire data dictionary. As a result, no dimension reduction technique prior to classification was required.

The submissions for the three tasks of the 2012 Cybersecurity Data Mining Competition [19] were strictly evaluated based on the F-Measure with respect to each class label to determine the overall rankings. However, the classification accuracy scores for the three tasks were also sent to the participants. The team of authors finished 1st in pure accuracy and 2nd in the e-News text categorization task with the Domain-Specific classifier; see Table 3. Overall, the team received 3rd place in the entire competition.

Table 3.
Results of the Domain-Specific classifier for the e-News task.
            Label 1 | Label 2 | Label 3 | Label 4 | Label 5 | Accuracy | Task Ranking
F-Measure    0.847  |  0.943  |  0.991  |  0.805  |  0.947  |    -     |     2nd
Accuracy       -    |    -    |    -    |    -    |    -    |  0.912   |     1st
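The 5-fold cross-validation used above to select α (and the normalization p) can be sketched generically. The sketch below is ours; `train_and_eval` is a hypothetical stand-in for fitting the Domain-Specific classifier on the training folds and returning its accuracy on the held-out fold.

```python
import random

def kfold_indices(n, k=5, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def select_parameter(data, candidates, train_and_eval, k=5):
    """Return the candidate parameter value with the best mean
    held-out score over k cross-validation folds."""
    folds = kfold_indices(len(data), k)
    def cv_score(value):
        scores = []
        for i, test_idx in enumerate(folds):
            train = [data[j] for m, fold in enumerate(folds) if m != i for j in fold]
            test = [data[j] for j in test_idx]
            scores.append(train_and_eval(train, test, value))
        return sum(scores) / k
    return max(candidates, key=cv_score)
```

For the e-News task, this loop would range over candidate α values (with α = 2 emerging as the best choice, as reported around Table 2).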
The Reuters 21578 dataset, consisting of documents from the Reuters newswire in 1987, categorized by Reuters Ltd. and Carnegie Group, Inc., is a classical benchmark for text categorization classifiers [16]. To further test the effectiveness and efficiency of the Domain-Specific classifier, we considered single-category documents from the top eight most frequent classes (known as the R8 subset) of the Reuters 21578 dataset, divided according to the standard "modApté" training/testing split. These documents were downloaded from [4]. Table 4 provides the category sizes for this dataset. Standard pre-processing of the dataset consisted of removing all stop words and words of length two or less; afterwards, the size of the dictionary of all unique words from the training and testing documents was m = 22931 words.

Table 4.
Information on the Reuters 21578 dataset considered.
Label j | Topic
In this case, evaluation of the Domain-Specific classifier, based on accuracy, F-Measure, and computational time, has shown that α = 0.45 and p = ∞ (non-normalized measures on domain-specific words) were optimal. The SVM classifier, using the linear kernel and a class weight adjustment (2840 divided by the number of documents in each category) to address the varying sizes of the categories, and the Random Forest classifier, using 50 trees, were also tested to compare against our novel algorithm. Table 5 provides the classification results obtained by the Domain-Specific classifier at those values, by SVM, and by Random Forest.

The Domain-Specific classifier performed slightly better than SVM with the linear kernel, and better than Random Forest, in terms of accuracy. With respect to the F-Measure, our classifier performed better than SVM for categories with large sizes, and better than Random Forest in 6 of the 8 categories, while SVM had a higher F-Measure on two of the smaller categories, undoubtedly due to

Table 5. Classification performance of the Domain-Specific classifier (DSC) compared to SVM with the linear kernel and Random Forest (RF) on the Reuters 21578 dataset.
(Columns: Accuracy, F1 acq, F1 crude, F1 earn, F1 grain, F1 interest, F1 money-fx, F1 ship, F1 trade; rows: DSC (α = 0.45), SVM (linear), Random Forest. Only three Random Forest entries, 0.926, 0.917, and 0.911, are preserved in this copy.)
SVM’s class weight adjustment. Computationally, our classifier ran considerablyfaster than SVM and Random Forest.
5 Conclusion

In this paper, we have introduced a novel text categorization algorithm, the Domain-Specific classifier, based on similarity searches in the space of measures on the data dictionary. The classifier finds domain-specific words for each document category, which appear in this category relatively more often than in the rest of the categories combined, and associates to it a normalized uniform measure supported on the domain-specific words. For an unlabeled document, the classifier assigns to it the category whose associated measure is most similar to the document's vector of relative word frequencies, with respect to the inner product. The cosine similarity measure arises as a special case corresponding to the ℓ_2 normalization.

Our classifier involves a similarity search problem in a suitably interpreted domain. We believe that this is the right viewpoint with the aim of further improvements. It is worth noting that our algorithm is unrelated to previously used distance-based algorithms (e.g. the k-NN classifier [18]). The dataset in the similarity workload is completely different, and as a result, unlike most algorithms in text categorization, this classifier does not require any separate dimension reduction step beforehand.

The process of selecting domain-specific words in our algorithm is in fact an implicit feature selection method which is class-dependent, something we have not seen before from a classifier in text categorization.
For each class, instead of a centroid, we are choosing an outlier, a uniform measure supported on the domain-specific words, which is representative of this class and not of any other class. Not only does each uniform measure lead to a reduction in the dimension of the feature space (as most words are not domain-specific) for similarity calculations, it does so dependent on the class labels, since domain-specific words are chosen relative to all classes.

This algorithm was first developed for the 2012 Cybersecurity Data Mining Competition and brought the team of authors 2nd place in the text categorization challenge, and 1st place in accuracy. This is evidence that our algorithm outperformed many existing text categorization algorithms, as surveyed in Section 2. In addition, our algorithm was evaluated on a sub-collection of the Reuters 21578 dataset against two state-of-the-art classifiers, and shown to have a slightly higher classification accuracy than SVM, with a higher F-Measure for the larger categories, and to perform better overall than Random Forest. Computationally, our classifier ran significantly faster than either, especially in the training stage.

The normalizing parameter p plays a significant role: it accounts for class imbalance. When there are categories with very few documents, p = ∞ should be used to avoid over-emphasizing the smaller categories, and small values of p should be used when the categories have roughly the same number of documents.

For future work, we hope to test the Domain-Specific classifier on biological sequence databases. Other definitions of domain-specific words can be investigated, for instance the one proposed in [14]. We would like to experiment with assigning non-uniform measures on the domain-specific words, for instance, by putting weights based on their relative occurrences or on α.
Finally, we would like to extend the process of selecting domain-specific words to a general classification context, by defining class-specific features relative to the classes and performing classification on only these class-dependent features.

References
1. Aas, K., Eikvil, L.: Text Categorization: A Survey. Technical Report 941, Norwegian Computing Center (1999).
2. Bingham, E., Mannila, H.: Random projection in dimensionality reduction: Applications to image and text data. In: Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), San Francisco, USA, pp. 245-250 (2001).
3. Breiman, L.: Random Forests. Machine Learning 45(1), 5-32 (2001).
4. Cardoso-Cachopo, A.: Datasets for single-label text categorization, http://web.ist.utl.pt/acardoso/datasets.
5. Church, K.W., Hanks, P.: Word association norms, mutual information and lexicography. In: Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics (ACL 27), Vancouver, Canada, pp. 76-83 (1989).
6. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20, 273-297 (1995).
7. Cover, T., Hart, P.: Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13, 21-27 (1967).
8. Deerwester, S., Dumais, S.T., Harshman, R.: Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science 41(6), 391-407 (1990).
9. Dimitriadou, E., Hornik, K., Leisch, F., Meyer, D., Weingessel, A.: e1071: Misc functions of the Department of Statistics (e1071), TU Wien. R package version 1.6, available at http://CRAN.R-project.org/package=e1071 (2011).
10. Forman, G.: An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research 3, 1289-1305 (2003).
11. Ikonomakis, M., Kotsiantis, S., Tampakas, V.: Text Classification Using Machine Learning Techniques. WSEAS Transactions on Computers 4(8), 966-974 (2005).
12. Joachims, T.: Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In: Proceedings of the 10th European Conference on Machine Learning (ECML-98), Chemnitz, Germany, pp. 137-142 (1998).
13. Johnson, D.E., Oles, F.J., Zhang, T., Goetz, T.: A decision-tree-based symbolic rule induction system for text categorization.
IBM Systems Journal 41(3), 428-437 (2002).
14. Keim, D.A., Oelke, D., Rohrdantz, C.: Analyzing document collections via context-aware term extraction. In: Proceedings of the 14th International Conference on Applications of Natural Language to Information Systems (NLDB '09), Saarbrücken, Germany, pp. 154-168 (2009).
15. Kim, S.B., Rim, H.C., Yook, D.S., Lim, H.S.: Effective Methods for Improving Naive Bayes Text Classifiers. LNAI 2417, 414-423 (2002).
16. Lewis, D.D.: Test Collections, Reuters-21578.
17. Liaw, A., Wiener, M.: Classification and Regression by randomForest. R News 2(3), 18-22 (2002).
18. Lim, H.: Improving kNN Based Text Categorization with Well Estimated Parameters. LNCS 3316, 516-523 (2004).
19. Pang, P.S., Ban, T., Kadobayashi, Y., Song, J., Huang, K.: The 3rd Cybersecurity Data Mining Competition (2012).
20. Porter, M.F.: An algorithm for suffix stripping. Program 14(3), 130-137 (1980).
21. R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0 (2008).
22. Radovanovic, M., Ivanovic, M.: Text Mining: Approaches and Applications. Novi Sad J. Math 38(3), 227-234 (2008).
23. Salton, G., McGill, M.J.: An Introduction to Modern Information Retrieval. McGraw-Hill (1983).
24. Salton, G., Wong, A., Yang, C.S.: A Vector Space Model for Automatic Indexing. Communications of the ACM 18(11), 613-620 (1975).
25. Schölkopf, B., Smola, A.: A Short Introduction to Learning with Kernels. In: Mendelson, S., Smola, A.J. (eds.), Advanced Lectures on Machine Learning, LNAI 2600, pp. 41-64 (2003).
26. Schütze, H., Hull, D.A., Pedersen, J.O.: A Comparison of Classifiers and Document Representations for the Routing Problem. In: Proceedings of the 18th ACM International Conference on Research and Development in Information Retrieval (SIGIR '95), Seattle, USA, pp. 229-237 (1995).
27. Sebastiani, F.: Machine Learning in Automated Text Categorization.
ACM Computing Surveys 34(1), 1-47 (2002).
28. Torkkola, K.: Linear Discriminant Analysis in Document Classification. In: Proceedings of the 2001 IEEE ICDM Workshop on Text Mining (ICDM '01), San Jose, USA, pp. 800-806 (2001).
29. Weichold, M., Huang, T.W., Lorentz, R., Qaraqe, K.: The 19th International Conference on Neural Information Processing (ICONIP 2012).