Text Categorization via Similarity Search: An Efficient and Effective Novel Algorithm
Hubert Haoyang Duan, Vladimir Pestov, and Varun Singla

University of Ottawa, Ottawa, Ontario K1N 6N5, Canada
{hduan065,vpest283}@uottawa.ca

Indian Institute of Technology Delhi, Hauz Khas, New Delhi-110 016, India
[email protected]
Abstract.
We present a supervised learning algorithm for text categorization which brought the team of authors the 2nd place in the text categorization division of the 2012 Cybersecurity Data Mining Competition (CDMC'2012) and a 3rd prize overall. The algorithm is quite different from existing approaches in that it is based on similarity search in the metric space of measure distributions on the dictionary. At the preprocessing stage, given a labeled learning sample of texts, we associate to every class label (document category) a point in the space in question. Unlike what is usual in clustering, this point is not a centroid of the category but rather an outlier: a uniform measure distribution on a selection of domain-specific words. At the execution stage, an unlabeled text is assigned a text category as defined by the closest labeled neighbour to the point representing the frequency distribution of the words in the text. The algorithm is both effective and efficient, as further confirmed by experiments on the Reuters 21578 dataset.
1 Introduction

The amount of text readily available in the world is growing at an astonishing rate; classifying these texts through machine learning techniques, promptly and without much human intervention, has thus become an important problem in data mining. Much research in the field of supervised learning has been done to find accurate algorithms to classify documents in a dataset to their appropriate categories; e.g., see [27] or [1] for a detailed survey of text categorization.

The most widely used model for text categorization is the Vector Space Model (VSM) [24]. Under this model, a data dictionary T consisting of unique words across the documents in the dataset is constructed. The documents are represented by real-valued vectors in the space R^T with dimension equal to the size of the dictionary. Given t ∈ T, the t-th coordinate of a vector is the relative frequency of the word t in a given document. When some of the documents' actual class labels are known and used for training, many well-known classifiers in supervised machine learning, such as SVM [6], k-NN [7], and Random Forest [3], can then be applied to categorize documents.

Text categorization decidedly comes across as a problem of detecting similarities between a given text and a collection of texts of a particular type. Although distance-based learning rules for text categorization, such as the k-nearest neighbour classifier, e.g. [18], are not new, they are currently based on the entire feature space, while any dimension reduction steps are done independently beforehand [27]. We aim to fill this gap by suggesting a novel supervised learning algorithm for text categorization, called the Domain-Specific classifier.

⋆ This work has been partially supported by a 2012 NSERC Canada Graduate Scholarship and a 2013 Ontario Graduate Scholarship (Hubert Haoyang Duan), the 2012–2017 NSERC Discovery Grant "New set-theoretic tools for statistical learning" (Vladimir Pestov), and the 2012 Mitacs Globalink Program (Varun Singla).
It discovers specific words for each category, or domain, of documents in training and classifies based on similarity searches in the space of word frequency distributions supported on the respective domain-specific words.

For each class label j = 1, 2, ..., k, our algorithm extracts class-, or domain-, specific words from labeled training documents, that is, words that appear in class j more frequently than in all the other document classes combined (modulo a given threshold). A given unlabeled document is then assigned a label j if the normalized frequency of domain-specific words for j in the document is higher than for any other label.

To see that this classifier is indeed similarity-search based, let x_j ∈ R^T be a binary vector whose t-th coordinate is 1 if and only if t is domain-specific to j, and 0 otherwise. Normalize x_j according to the ℓ_p distance, and let a document be represented by a vector w ∈ R^T. Then the label assigned to w is that of the closest neighbour to w among x_1, x_2, ..., x_k with regard to the simplest similarity measure, the inner product on R^T. In other words, we seek to maximize the value of ⟨w, x_j⟩ over j = 1, 2, ..., k. Notice that the well-known cosine similarity measure, cf. e.g. [27], corresponds to the special case p = 2.

This algorithm was first used in the 3rd Cybersecurity Data Mining Competition (CDMC 2012) to notable success, as the team of authors placed second in the text categorization challenge and first in classification accuracy [19]. In addition, the classification performance of the algorithm was validated on a sub-collection of the popular Reuters 21578 dataset [16], consisting of single-category documents from the top eight most frequent categories, with the standard "modApté" training/testing split. In terms of accuracy, our classifier performs slightly better than SVM with a linear kernel, and is significantly faster.

This paper is organized as follows.
Section 2 surveys common feature selection and extraction methods and classifiers considered in the text categorization literature. Section 3 explains the new Domain-Specific classifier in detail and casts it as a similarity search problem. Section 4 discusses results from the CDMC 2012 data mining competition and experiments from the competition and on the Reuters 21578 dataset. Finally, Section 5 concludes the paper with some discussion and directions for future work.

2 Related Work
In this section, we describe the VSM model and provide a brief survey of widely known methods for text categorization.

From this section onwards, following notation similar to [27], we let D = {d_1, d_2, ..., d_n} denote the dataset of documents, with size n = |D|, and T = {t_1, t_2, ..., t_m} the data dictionary of all unique words from documents in D, with size m = |T|. Given a document d and a word t ∈ T, |d| denotes the number of words in d, and t ∈ d indicates that the word t is found in d.

The Vector Space Model (VSM) [24] is the most common model for document representation in text categorization. According to [1], there is usually a standard preprocessing step for the documents in D, where all letters are converted to lowercase and all stop words, such as articles and prepositions, are removed. Sometimes a stemming algorithm, such as the widely used Porter stemmer [20], is applied to remove suffixes of words (e.g. the word "connection" → "connect").

In VSM, the data dictionary, consisting of all unique words that appear in at least one document in D, is first constructed. Sometimes n-grams, which are phrases of words, are also included in the dictionary; however, the benefit of these additional phrases is still up for debate [27]. Given the data dictionary, each document can be represented as a vector in the real-valued vector space with dimension equal to the size of the dictionary. Two common methods for associating a document to a vector are explained below.

The simplest method assigns to a document d the vector consisting of the relative term frequencies for d, see e.g. [11]. The second, known as the tf-idf method, assigns to d the vector consisting of the products of term and inverse document frequencies [23]. Mathematically speaking, a document d is mapped to a real-valued vector of length m:
d ↦ (w_1, w_2, ..., w_m) ∈ R^m, where

    w_i = c(t_i, d) / |d|   (frequency method)   (1)

or

    w_i = c(t_i, d) · log( n / |{d ∈ D : t_i ∈ d}| )   (tf-idf method),   (2)

where c(t_i, d) denotes the number of times the word t_i appears in d. Other representations include binary and entropy weightings and the normalized tf-idf method [1].

Once the documents are represented as vectors, the dataset can be interpreted as a data matrix M of size (n × m). However, a main challenge for text categorization is that the size of the data dictionary is usually immense, so the data matrix is extremely high-dimensional. Dimension reduction techniques must often be applied before classification to reduce complexity [1].

2.2 Feature Selection and Extraction Methods

Due to the potentially large size of the data dictionary, feature selection and extraction methods are often applied to reduce the dimension of the data matrix. Feature selection methods assign to each feature, a word in the data dictionary, a statistical score based on some measure of importance. Only the highest-scored features, past some defined threshold, are kept, and a lower-dimensional data matrix is created from only these features. Some known feature selection methods in text categorization include calculating the document frequency, e.g. [31], mutual information [5], and χ² statistics, e.g. [26]. See e.g. [10] or [31] for a thorough study of text feature selection methods.

Feature extraction methods transform the original list of features into a smaller list of new features, based on some form of feature dependency. Common well-known feature extraction techniques, as surveyed in [22], are Latent Semantic Indexing (LSI) [8], Linear Discriminant Analysis (LDA) [28], Partial Least Squares (PLS) [32], and Random Projections [2].
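As a concrete illustration, the two document-to-vector mappings of Eqs. (1) and (2) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation (which was done in R); the tiny stop-word list and all function names here are our own.

```python
import math
import re

# A tiny illustrative stop-word list; real systems use much larger ones.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}

def preprocess(text):
    """Standard VSM preprocessing: lowercase, tokenize, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def term_frequencies(doc, dictionary):
    """Eq. (1): w_i = c(t_i, d) / |d|, the relative term frequencies."""
    return [doc.count(t) / len(doc) for t in dictionary]

def tf_idf(doc, dictionary, corpus):
    """Eq. (2): w_i = c(t_i, d) * log(n / |{d in corpus : t_i in d}|)."""
    n = len(corpus)
    vector = []
    for t in dictionary:
        df = sum(1 for d in corpus if t in d)  # document frequency of t
        vector.append(doc.count(t) * math.log(n / df) if df > 0 else 0.0)
    return vector

corpus = [preprocess("The cat sat in the sun."), preprocess("A dog ran to the cat.")]
dictionary = sorted(set(w for d in corpus for w in d))  # ['cat', 'dog', 'ran', 'sat', 'sun']
print(term_frequencies(corpus[0], dictionary))  # [1/3, 0.0, 0.0, 1/3, 1/3]
```

Note that "cat", which occurs in every document of this toy corpus, gets a tf-idf weight of zero, while rare words are emphasized.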
Well-known classifiers that have been applied to text categorization include the k-nearest neighbour classifier [18], Support Vector Machines [12], the Naive Bayes classifier [15], and decision trees [13]. We ask the reader to refer to the indicated references, or to survey articles such as [27], [11], and [1]. The paper [30] provides comparable empirical results on some of these classifiers.

The standard approach in the literature for text categorization is that one, or more, feature selection or extraction techniques are first applied to a data matrix, since the original data matrix is often extremely high-dimensional. A learning algorithm, independent of the dimension reduction process, is then used for classification [27]. The novel approach in this paper is that we consider a new classifier based only on extracted class-specific words, which naturally reduces time complexity and the dimension of the dataset. In other words, the Domain-Specific classifier both performs dimension reduction and classifies, in consecutive and dependent steps.

3 The Domain-Specific Classifier

Our algorithm consists of two distinct stages: extraction of domain-specific words from training samples and classification of documents based on the closest labeled point determined by these domain-specific words.
Fix an alphabet Σ and denote by Σ* the set of all possible "words" formed from Σ. A document d is then simply an ordered sequence of "words", d ∈ (Σ*)^{|d|}, and the data dictionary T is a subset of Σ*. Given a set of labeled documents, we can denote it as D_lab = {(d_1, l_1), (d_2, l_2), ..., (d_n, l_n)}, where d_i is a document and l_i ∈ {1, 2, ..., k} is its label, out of a possible k different labels. In addition, we can partition D_lab into subsets of documents according to their labels:

    D_lab = ⋃_{j=1}^{k} D_lab^j,   (3)

where D_lab^j = {(d, l) ∈ D_lab : l = j} is the set of documents of label j. Then, for a particular label j and a word t ∈ T in the data dictionary, we denote by f_j(t) the average proportion of times the word t appears in documents with label j:

    f_j(t) = (1 / |D_lab^j|) Σ_{(d,j) ∈ D_lab^j} c(t, d) / |d|.   (4)

Domain-specific words are those words which appear, on average, proportionally more often in one label type of documents in D_lab than in the other types.

Definition 1.
Let α ≥ 0. A word t ∈ T in the data dictionary is domain (or class) j specific if

    f_j(t) > α Σ_{j' ≠ j} f_{j'}(t).   (5)

This definition of domain-specific words depends on the parameter α, and hence so does the Domain-Specific classifier. As α increases from 0, the number of domain-specific words for each class label decreases; as a result, α can be thought of as a threshold parameter, and an optimal choice for α is determined through cross-validation using training data.

Let now d be an unclassified document. We associate to it a vector w = w_d ∈ R^T (a relative frequency distribution of words) as in Eq. (1), that is, for every t ∈ T,

    w(t) = c(t, d) / |d|.   (6)

Let j be a label. Denote by CS_j = CS_{j,α} the set of domain-specific words for j. Define the total relative frequency of domain-j specific words found in d:

    w[CS_j] = Σ_{t ∈ CS_j} w(t) = (1/|d|) Σ_{t ∈ CS_j} c(t, d).   (7)

The classifier assigns to d the label j for which the following ratio is the highest:

    j = argmax_i  w[CS_i] / |CS_i|^{1/p}.   (8)

Here, p ∈ (0, ∞] is a parameter which normalizes a certain measure with regard to the ℓ_p distance, cf. below in Section 3.3.

3.3 Space of Positive Measures on the Dictionary

A (positive) measure on a finite set T is simply an assignment t ↦ w(t) to every t ∈ T of a non-negative number w(t); a probability measure also satisfies Σ_{t ∈ T} w(t) = 1. Denote by M(T) the set of all positive measures on T. Fix a parameter p ∈ (0, ∞]. The following is a positive measure on T:

    x_j(t) = |CS_j|^{-1/p}, if t ∈ CS_j;  0, otherwise.   (9)

If p = 1, we obtain a probability measure uniformly supported on the set of domain-j specific words. In general, values of p ∈ (0, ∞] correspond to different normalizations of the uniform measure supported on these words, according to the ℓ_p distance.
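Pulling Eqs. (4)–(9) together, the whole classifier fits in a short sketch. This is our own minimal Python rendering under stated assumptions (documents are pre-tokenized lists of words; all function names are ours), not the authors' R implementation.

```python
from collections import defaultdict

def average_proportions(labeled_docs):
    """Eq. (4): f_j(t), the average relative frequency of word t
    over the training documents carrying label j."""
    sums = defaultdict(lambda: defaultdict(float))
    n_docs = defaultdict(int)
    for tokens, label in labeled_docs:
        n_docs[label] += 1
        for t in tokens:
            sums[label][t] += 1.0 / len(tokens)
    return {j: {t: s / n_docs[j] for t, s in words.items()}
            for j, words in sums.items()}

def domain_specific_words(f, alpha):
    """Definition 1: t is class-j specific if f_j(t) > alpha * sum over
    j' != j of f_j'(t)."""
    return {j: {t for t, v in f[j].items()
                if v > alpha * sum(f[jp].get(t, 0.0) for jp in f if jp != j)}
            for j in f}

def classify(tokens, cs, p=1.0):
    """Eq. (8): the label maximizing w[CS_j] / |CS_j|^(1/p); equivalently,
    the inner product of w with the normalized uniform measure x_j of Eq. (9)."""
    def score(j):
        if not cs[j]:
            return float("-inf")
        w_cs = sum(1 for t in tokens if t in cs[j]) / len(tokens)  # Eq. (7)
        return w_cs / len(cs[j]) ** (1.0 / p)
    return max(cs, key=score)

train = [(["stock", "market", "stock"], "business"),
         (["market", "profit"], "business"),
         (["goal", "match", "goal", "team"], "sports")]
f = average_proportions(train)
cs = domain_specific_words(f, alpha=2.0)
print(classify(["goal", "team", "profit"], cs))  # sports
```

In this toy run, "goal" and "team" are sports-specific and outweigh the single business-specific word "profit", so the document is labeled sports.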
(The case p = ∞, that is, the ℓ_∞ distance, corresponds to the non-normalized uniform measure.)

Among the similarity measures on M(T), we single out the standard inner product

    ⟨w, v⟩ = Σ_t w_t v_t.   (10)

Notice that for every w ∈ M(T) and each j,

    ⟨w, x_j⟩ = w[CS_j] / |CS_j|^{1/p},   (11)

and for this reason, the classification algorithm (8) can be rewritten as follows:

    j = argmax_i ⟨w, x_i⟩.   (12)

Our classifier is based on finding the closest point x_i to the input point w in the sense of the simplest similarity measure, the inner product.

The similarity workload is a triple (U, S, X), consisting of the domain U = M(T), the similarity measure S(w, v) = ⟨w, v⟩ equal to the standard inner product (a rather common choice in the problems of statistical learning, cf. [25]), and the dataset X = {x_1, x_2, ..., x_k} of normalized uniform measures corresponding to the text categories and domain-specific words extracted at the preprocessing stage.

Note that the well-known cosine similarity measure arises in the special case when the normalizing parameter is p = 2; hence, it is not necessary to consider this measure separately. Our experiments have shown that different datasets require different normalizing parameters for x_j, and that the optimal normalization depends on the sizes of the document categories; Section 5 includes a discussion of this topic.

4 Experiments and Results

This section details the experiments and results obtained for the Domain-Specific classifier in the 2012 Cybersecurity Data Mining Competition and on the Reuters 21578 dataset. All of the programming for this section was done with standard packages in R [21] and with the specialized packages e1071 [9] and randomForest [17], on a desktop running Windows 7 Enterprise with an Intel i5 3.10 GHz processor and 4 GB of RAM.
The 3rd Cybersecurity Data Mining Competition (CDMC 2012) [19], associated with the 19th International Conference on Neural Information Processing (ICONIP 2012) in Doha, Qatar, from November 12-15, 2012 [29], included three supervised classification tasks: electronic news (e-News) text categorization, intrusion detection, and handwriting recognition.

The Domain-Specific classifier was first developed by the team of authors for the e-News text categorization challenge, which required classifying news documents into five topics: business, entertainment, sports, technology, and travel. The documents were collected from eight online news sources. The words in these documents were obfuscated, and all punctuation and stop words removed. Here is a sample paragraph from a scrambled text document from the competition:
HUJR Xj gjXZMUXe fAJjAeK UO jwXeA URSek UYjmX xjI K SeeW eOWrjJeeR ZARWZDekUAk WDjkmzXZMe KXR UA eRReAXZUr BmeRXZjA RZAze zjOWUAZeR XjkUJ OmRXUzzjOWrZRx OjDe IZXx weIeD WejWre uxe OjRX RmzzeRRwmr RXUDXmWR OmRX
In total, 1063 e-News documents for training, each labeled as one of the k = 5 topics, were given for the goal of classifying 456 documents.

Table 1. Information on the dataset size for the e-News classification task.
Label j | Topic
After a pre-processing step where all document words of length less than or equal to 3 were removed, a data dictionary of all unique words from the training and classification documents, consisting of m = 55822 words, was constructed. The 1063 labeled documents were converted to vectors of length m = 55822, according to the Vector Space Model. Then, 5-fold cross-validation on the training dataset was performed to test the performance of the classifier.

For comparison purposes, the Support Vector Machines (SVM) classifier [6], using the Gaussian Radial Basis (GRB) and linear kernels with cost 10, and the Random Forest classifier [3], using 50 trees, were also tested. The performance measures considered were classification accuracy and the F-Measure (F1) [1]. See Table 2, where the computation times for both the training and predicting stages are also indicated.

Table 2.
Classification performance of the Domain-Specific classifier (DSC) through 5-fold cross-validation on the training set, compared to SVM with the Gaussian Radial Basis (GRB) and linear kernels and Random Forest (RF).
(Table rows: DSC at several values of α, SVM with GRB and linear kernels, and RF; the numerical entries are not preserved in this copy.)

The choice α = 2 resulted in the best accuracy of 0.915 for the Domain-Specific classifier, and the optimal normalizing parameter was p = 1, corresponding to the choice of x_1, x_2, ..., x_k normalized as probability measures uniformly supported on the domain-specific words. Consequently, these two values were used for the classification of the 456 documents in the competition. Note that the accuracy score and the F-Measure for 4 out of the 5 categories, for α = 2, were higher than the respective scores obtained with SVM with the GRB and linear kernels and with Random Forest. Experiments had also shown that the Domain-Specific classifier was extremely fast and efficient, since distance calculations are based only on domain-specific words, not the entire data dictionary. As a result, no dimension reduction technique prior to classification was required.

The submissions for the three tasks of the 2012 Cybersecurity Data Mining Competition [19] were strictly evaluated based on the F-Measure with respect to each class label to determine the overall rankings. However, the classification accuracy scores for the three tasks were also sent to the participants. The team of authors finished 1st in pure accuracy and 2nd in the e-News text categorization task with the Domain-Specific classifier; see Table 3. Overall, the team received 3rd place in the entire competition.

Table 3.
Results of the Domain-Specific classifier for the e-News task.
            Label 1 | Label 2 | Label 3 | Label 4 | Label 5 | Accuracy | Task Ranking
F-Measure    0.847  |  0.943  |  0.991  |  0.805  |  0.947  |    -     |     2nd
Accuracy       -    |    -    |    -    |    -    |    -    |  0.912   |     1st
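The 5-fold cross-validation used above to select α (and the normalization p) can be sketched generically. The sketch below is ours; `train_and_eval` is a hypothetical stand-in for fitting the Domain-Specific classifier on the training folds and returning its accuracy on the held-out fold.

```python
import random

def kfold_indices(n, k=5, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def select_parameter(data, candidates, train_and_eval, k=5):
    """Return the candidate parameter value with the best mean
    held-out score over k cross-validation folds."""
    folds = kfold_indices(len(data), k)
    def cv_score(value):
        scores = []
        for i, test_idx in enumerate(folds):
            train = [data[j] for m, fold in enumerate(folds) if m != i for j in fold]
            test = [data[j] for j in test_idx]
            scores.append(train_and_eval(train, test, value))
        return sum(scores) / k
    return max(candidates, key=cv_score)
```

For the e-News task, this loop would range over candidate α values (with α = 2 emerging as the best choice, as reported around Table 2).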
The Reuters 21578 dataset, consisting of documents from the Reuters newswire in 1987, categorized by Reuters Ltd. and Carnegie Group, Inc., is a classical benchmark for text categorization classifiers [16]. To further test the effectiveness and efficiency of the Domain-Specific classifier, we considered single-category documents from the top eight most frequent classes (known as the R8 subset) of the Reuters 21578 dataset, divided according to the standard "modApté" training/testing split. These documents were downloaded from [4]. Table 4 provides the category sizes for this dataset. Standard pre-processing of the dataset consisted of removing all stop words and words of length two or less; afterwards, the size of the dictionary of all unique words from the training and testing documents was m = 22931 words.

Table 4.
Information on the Reuters 21578 dataset considered.
Label j | Topic
In this case, evaluation of the Domain-Specific classifier, based on accuracy, F-Measure, and computational time, has shown that α = 0.45 and p = ∞ (non-normalized measures on domain-specific words) were optimal. The SVM classifier, using the linear kernel and a class weight adjustment (2840 divided by the number of documents in each category) to address the varying sizes of the categories, and the Random Forest classifier, using 50 trees, were also tested to compare against our novel algorithm. Table 5 provides the classification results obtained by the Domain-Specific classifier at those values, by SVM, and by Random Forest.

The Domain-Specific classifier performed slightly better than SVM with the linear kernel, and better than Random Forest, in terms of accuracy. With respect to the F-Measure, our classifier performed better than SVM for categories with large sizes, and better than Random Forest in 6 of the 8 categories, while SVM had a higher F-Measure on two of the smaller categories, undoubtedly due to

Table 5. Classification performance of the Domain-Specific classifier (DSC) compared to SVM with the linear kernel and Random Forest (RF) on the Reuters 21578 dataset.
(Columns: Accuracy, F1 acq, F1 crude, F1 earn, F1 grain, F1 interest, F1 money-fx, F1 ship, F1 trade; rows: DSC (α = 0.45), SVM (linear), Random Forest. Only three Random Forest entries, 0.926, 0.917, and 0.911, are preserved in this copy.)
SVM’s class weight adjustment. Computationally, our classifier ran considerablyfaster than SVM and Random Forest.
5 Conclusion

In this paper, we have introduced a novel text categorization algorithm, the Domain-Specific classifier, based on similarity searches in the space of measures on the data dictionary. The classifier finds domain-specific words for each document category, which appear in this category relatively more often than in the rest of the categories combined, and associates to it a normalized uniform measure supported on the domain-specific words. For an unlabeled document, the classifier assigns to it the category whose associated measure is most similar to the document's vector of relative word frequencies, with respect to the inner product. The cosine similarity measure arises as a special case corresponding to the ℓ_2 normalization.

Our classifier involves a similarity search problem in a suitably interpreted domain. We believe that this is the right viewpoint with the aim of further improvements. It is worth noting that our algorithm is unrelated to previously used distance-based algorithms (e.g. the k-NN classifier [18]). The dataset in the similarity workload is completely different, and as a result, unlike most algorithms in text categorization, this classifier does not require any separate dimension reduction step beforehand.

The process of selecting domain-specific words in our algorithm is in fact an implicit feature selection method which is class-dependent, something we have not seen before from a classifier in text categorization.
For each class, instead of a centroid, we are choosing an outlier, a uniform measure supported on the domain-specific words, which is representative of this class and not of any other class. Not only does each uniform measure lead to a reduction in the dimension of the feature space (as most words are not domain-specific) for similarity calculations, it does so dependent on the class labels, since domain-specific words are chosen relative to all classes.

This algorithm was first developed for the 2012 Cybersecurity Data Mining Competition and brought the team of authors 2nd place in the text categorization challenge, and 1st place in accuracy. This is evidence that our algorithm outperformed many existing text categorization algorithms, as surveyed in Section 2. In addition, our algorithm was evaluated on a sub-collection of the Reuters 21578 dataset against two state-of-the-art classifiers, and shown to have a slightly higher classification accuracy than SVM, with a higher F-Measure for the larger categories, and to perform better overall than Random Forest. Computationally, our classifier ran significantly faster than either, especially in the training stage.

The normalizing parameter p plays a significant role: it accounts for class imbalance. When there are categories with very few documents, p = ∞ should be used to avoid over-emphasizing the smaller categories, and small values of p should be used when the categories have roughly the same number of documents.

For future work, we hope to test the Domain-Specific classifier on biological sequence databases. Other definitions of domain-specific words can be investigated, for instance the one proposed in [14]. We would like to experiment with assigning non-uniform measures on the domain-specific words, for instance, by putting weights based on their relative occurrences or on α.
Finally, we would like to extend the process of selecting domain-specific words to a general classification context, by defining class-specific features relative to the classes and performing classification on only these class-dependent features.

References
1. Aas, K., Eikvil, L.: Text Categorization: A Survey. Technical Report 941, Norwegian Computing Center (1999).
2. Bingham, E., Mannila, H.: Random projection in dimensionality reduction: Applications to image and text data. In: Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), San Francisco, USA, pp. 245-250 (2001).
3. Breiman, L.: Random Forests. Machine Learning 45(1), 5-32 (2001).
4. Cardoso-Cachopo, A.: Datasets for single-label text categorization, http://web.ist.utl.pt/acardoso/datasets.
5. Church, K.W., Hanks, P.: Word association norms, mutual information and lexicography. In: Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics (ACL 27), Vancouver, Canada, pp. 76-83 (1989).
6. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20, 273-297 (1995).
7. Cover, T., Hart, P.: Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13, 21-27 (1967).
8. Deerwester, S., Dumais, S.T., Harshman, R.: Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science 41(6), 391-407 (1990).
9. Dimitriadou, E., Hornik, K., Leisch, F., Meyer, D., Weingessel, A.: e1071: Misc functions of the Department of Statistics (e1071), TU Wien. R package version 1.6, available at http://CRAN.R-project.org/package=e1071 (2011).
10. Forman, G.: An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research 3, 1289-1305 (2003).
11. Ikonomakis, M., Kotsiantis, S., Tampakas, V.: Text Classification Using Machine Learning Techniques. WSEAS Transactions on Computers 4(8), 966-974 (2005).
12. Joachims, T.: Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In: Proceedings of the 10th European Conference on Machine Learning (ECML-98), Chemnitz, Germany, pp. 137-142 (1998).
13. Johnson, D.E., Oles, F.J., Zhang, T., Goetz, T.: A decision-tree-based symbolic rule induction system for text categorization.
IBM Systems Journal 41(3), 428-437 (2002).
14. Keim, D.A., Oelke, D., Rohrdantz, C.: Analyzing document collections via context-aware term extraction. In: Proceedings of the 14th International Conference on Applications of Natural Language to Information Systems (NLDB '09), Saarbrücken, Germany, pp. 154-168 (2009).
15. Kim, S.B., Rim, H.C., Yook, D.S., Lim, H.S.: Effective Methods for Improving Naive Bayes Text Classifiers. LNAI 2417, 414-423 (2002).
16. Lewis, D.D.: Test Collections, Reuters-21578.
17. Liaw, A., Wiener, M.: Classification and Regression by randomForest. R News 2(3), 18-22 (2002).
18. Lim, H.: Improving kNN Based Text Categorization with Well Estimated Parameters. LNCS 3316, 516-523 (2004).
19. Pang, P.S., Ban, T., Kadobayashi, Y., Song, J., Huang, K.: The 3rd Cybersecurity Data Mining Competition (2012).
20. Porter, M.F.: An algorithm for suffix stripping. Program 14(3), 130-137 (1980).
21. R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0 (2008).
22. Radovanovic, M., Ivanovic, M.: Text Mining: Approaches and Applications. Novi Sad J. Math 38(3), 227-234 (2008).
23. Salton, G., McGill, M.J.: An Introduction to Modern Information Retrieval. McGraw-Hill (1983).
24. Salton, G., Wong, A., Yang, C.S.: A Vector Space Model for Automatic Indexing. Communications of the ACM 18(11), 613-620 (1975).
25. Schölkopf, B., Smola, A.: A Short Introduction to Learning with Kernels. In: Mendelson, S., Smola, A.J. (eds.), Advanced Lectures on Machine Learning, LNAI 2600, pp. 41-64 (2003).
26. Schütze, H., Hull, D.A., Pedersen, J.O.: A Comparison of Classifiers and Document Representations for the Routing Problem. In: Proceedings of the 18th ACM International Conference on Research and Development in Information Retrieval (SIGIR '95), Seattle, USA, pp. 229-237 (1995).
27. Sebastiani, F.: Machine Learning in Automated Text Categorization.
ACM Computing Surveys 34(1), 1-47 (2002).
28. Torkkola, K.: Linear Discriminant Analysis in Document Classification. In: Proceedings of the 2001 IEEE ICDM Workshop on Text Mining (ICDM '01), San Jose, USA, pp. 800-806 (2001).
29. Weichold, M., Huang, T.W., Lorentz, R., Qaraqe, K.: The 19th International Conference on Neural Information Processing (ICONIP 2012).