Semantic classifier approach to document classification
Piotr Borkowski, Krzysztof Ciesielski, and Mieczysław A. Kłopotek
Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, 01-238 Warszawa, Poland
Tel.: (+48) 22 380-05-00, Fax: (+48) 22 380-05-10
piotrb, kciesiel, [email protected]
Abstract.
In this paper we propose a new document classification method that bridges discrepancies (the so-called semantic gap) between the training set and the application sets of textual data. We demonstrate its superiority over classical text classification approaches, including traditional classifier ensembles. The method consists in combining a document categorization technique with a single classifier or a classifier ensemble (the SemCom algorithm, Committee with Semantic Categorizer).
Text document classification methods are well established in the area of text mining. Predominantly they have been derived from corresponding data mining techniques that were designed to handle long input data records; let us mention here, for example, Naive Bayes, Balanced Winnow and LLDA (to be described later). While these methods are quite successful in data mining and have been appreciated within the text mining community, one important drawback arises in the specific area of text mining. Whereas in data mining the meaning and the value range of individual attributes of an object are relatively well defined, in text mining this is no longer the case. The same content may be expressed in different ways, using different words (synonyms, hyponyms), while the same word can express different things in different contexts. This would not be a big obstacle were it not for the fact that traditional techniques then require significantly larger bodies of training data, which makes an unbalanced sample much more likely, not only because of the size of the data sample but also because of the heterogeneity of the data sources that need to be combined. It is even worse when the trained classifiers need to be applied to unseen data stemming from a dataset that, from the human point of view, touches the same topic but, from the computer point of view, is written in a completely different style. This gives rise to the so-called semantic gap: though the training and application data sets are semantically similar, their syntactical and bag-of-words views differ. In such a case an understanding of the semantics of documents would be needed, which is unavailable to traditional data mining techniques.

In this paper we propose two new document classification methods, SemCla (Semantic Classifier) and SemCom (Committee with Semantic Categorizer), bridging the semantic gap between the training set and the application sets of textual data.
The methods consist in combining an unsupervised document categorization technique with a single classifier or a classifier ensemble. Via this component the traditional notion of document similarity (based on angles between vectors in term space) is amended to include the concept of semantic similarity. The notion of semantic similarity, as used in this paper, was described in [1]. Both methods introduced in the paper are based on our SemCat (Semantic Categorizer) algorithm, which was also introduced in [1].

In Section 2 we define the problem of document categorization and semantic classification and recall the work done on the subject by other researchers. In Section 3 we describe our categorization methodology, SemCat. Subsequently, in Section 4, we show how our categorization method can be used in various ways in the classical task of classification. In Section 5 we explain the setup of the experiments we performed to show the usefulness of the SemCla algorithm in classification tasks. In Section 6, presenting the results of these experiments, we demonstrate the superiority of the semantic classification methods (SemCom and SemCla) over classical text classification approaches, including traditional classifier ensembles for text classification tasks (Section 6.1), as well as in cases where the so-called semantic gap occurs (Section 6.2). Section 7 summarizes the achieved results and outlines future research directions.

Our contribution in this paper is:
– constructing a new supervised classifier based on an unsupervised semantic document categorizer,
– demonstrating the feasibility of the new classifier for bridging the semantic gap between the test and training data sets,
– designing a heterogeneous committee that combines classical classifiers and the semantic classifier.

The task of categorization is to assign one or more labels (categories) to a document or a group of documents (cluster labeling).
It finds multiple practical applications, especially for assisting in text retrieval tasks: in web page classification, e-mail and memo organization, expanding queries with new terms, expanding / improving ontologies, and many others.

The categorization task can be viewed formally as a special case of classification [2,3], but with a couple of differences. First of all, the number of categories significantly exceeds the number of classes in a typical classification task. Categories may be flat and disjoint, but they may also form a tree or even a hierarchy (an acyclic graph). And more than one category may be assigned to a single document. Therefore typical classification methods do not fit the task of categorization well. Diverse other methods have been proposed to attack the problem of categorization. Some of them are based on clustering. The most popular representatives of this brand of approaches are Nonnegative Matrix Factorization (NMF), Latent Semantic Analysis (LSA), Probabilistic LSA (PLSA), and Finite Mixture of Multidimensional Bernoulli Distributions, described in [4]. Other researchers map the document contents to some semantic resources, in particular to Wikipedia (W). This approach was exploited in the WikipediaMiner Project, developed at the University of Waikato in Hamilton, New Zealand [5,6,7]. It uses W topics as categories. The basic idea was key phrase indexing. For terms from W their “keyphraseness” [8], that is, the share of their occurrences in W links, is computed. Then these terms are searched for in a document to be categorized. Terms with multiple meanings are disambiguated (via some trained classifier) by choosing the meaning closest to the document topic. For training purposes, documents annotated with such keyphrases have to be assigned categories. Then a classifier is trained.

In this paper we exploit our new unsupervised categorization method, SemCat, introduced in [1]. Contrary to WikipediaMiner, no classifiers are used, hence no training corpora need to be prepared.
Also, it is not based on W links. Instead, the category graph of W is exploited. A novelty here is also the usage of the more challenging Polish language [9]. Furthermore, we develop a classification method, SemCla, suitable for application to data with a semantic gap.

The problem of the “semantic gap” is understood in the literature in many ways. We focus on the aspect encountered in text retrieval where data come from different domains. The next paragraphs give a brief overview of the approaches that have been proposed.

The article [10] presents a review of the cross-domain text categorization problem. Unlike the classical case, the training and the test data originate from different distributions or domains. This is very common in practical tasks because (especially for the Polish language) we often do not have a suitable data set of labeled documents. Often what we have is a corpus which is topically related, but presents the same (or semantically similar) information in a different way, e.g. using a different vocabulary. Many algorithms have been developed or adapted for cross-domain text classification. There are conventional algorithms: Rocchio’s Algorithm; Decision Trees like CART, ID3, C4.5; the Naive Bayes classifier; KNN; Support Vector Machines; and some novel cross-domain classification algorithms: the Expectation-Maximization Algorithm, Probabilistic Latent Semantic Analysis (PLSA), Latent Dirichlet Allocation (LDA), the CFC Algorithm, and the Co-cluster based Classification Algorithm [11].

Paper [12] gives a general overview of the problem of the semantic gap in information retrieval. The authors focus on two separate tasks: text mining and multimedia mining / image retrieval. The semantic gap in text retrieval is defined as the usage of different words (synonyms, hypernyms, hyponyms) to describe the same object. In the part about text retrieval the authors concentrate on reorganizing search results using a post-retrieval clustering system. They work on search results (“snippets”) and enhance them by adding so-called topics.
A topic is a set of words of similar meaning that is an outcome of Probabilistic Latent Semantic Analysis or Latent Dirichlet Allocation applied to some external data collection. After adding a topic to the snippet they carry out clustering or labeling.

In the paper [13] the authors propose a way to improve categorization by adding semantic knowledge from Wikitology (a knowledge repository based on Wikipedia). They used various text representation and text enrichment techniques and used a Support Vector Machine (SVM) to learn a classification model.

WikipediaMiner Project: http://wikipedia-miner.sourceforge.net/

Our taxonomy-based semantic categorization method
Our taxonomy-based categorization method SemCat was described in detail in [1], so below we present only a brief description of it.
Suppose we have a taxonomy of categories (a directed acyclic graph with one root category) like the Wikipedia (W) category graph or the Medical Subject Headings (MeSH) ontology. We assume there is a set of concepts connected with the taxonomy in the following way: every concept is linked to one or more categories. Every category and concept is tagged with a string label. Strings connected with categories are used as an outcome presented to the user, and those attached to concepts are used for mapping the text of a document into the set of concepts.

For the experimental design we used the W category graph with the concept set of W pages. Tags for W categories were their original string names. The set of string tags connected with a single W page consists of the lemmatized page name and all names of disambiguation pages that link to that page.

In the process of document categorization we remove stop words and very rare / frequent words, lemmatize, find phrases, and calculate normalized tf-idf weights for terms and phrases. Calculation of the standard term frequency / inverse document frequency is based on word frequencies from the collection of all W pages.

Then we map the document’s terms and phrases into a set of concepts. In the case of homonyms, we disambiguate the concept assignment: we select the concept that is nearest, by the similarity measure defined by Equations (1) and (2) (see Section 3.2), to the set of concepts that was mapped in an unambiguous way. We investigated other methods of disambiguation, e.g. taking all meanings of ambiguous terms and weighing them accordingly. The results for various disambiguation methods are described in Section 5.4.

When every term in the document is assigned to a proper concept (W page), all concepts are mapped to W categories. In this way one term usually maps to more than one category, so we transfer the weight associated with that term proportionally to all its categories. The sum of the weights assigned to the categories equals the sum of the tf-idf weights of the terms.
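The weight transfer described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the two mapping dictionaries are hypothetical stand-ins for the W page and category structure, and we assume an even proportional split of a term's weight over its concept's categories.

```python
from collections import defaultdict

def term_weights_to_categories(term_tfidf, term_to_concept, concept_to_cats):
    """Transfer each term's tf-idf weight to the categories of its concept
    (W page), split evenly, so that the total category weight equals the
    total tf-idf mass of the mapped terms."""
    cat_weight = defaultdict(float)
    for term, w in term_tfidf.items():
        concept = term_to_concept.get(term)
        if concept is None:
            continue  # term maps to no concept; its weight is dropped
        cats = concept_to_cats[concept]
        for cat in cats:
            cat_weight[cat] += w / len(cats)
    # ranked list of (category, weight), highest weight first
    return sorted(cat_weight.items(), key=lambda kv: -kv[1])
```

For instance, a term mapped to a page with two categories contributes half of its tf-idf weight to each, so the category weights always sum to the tf-idf mass of the mapped terms.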
The outcome of this procedure is a ranked list of categories with weights. In the last step we can transform the weighted ranking and/or choose the top-N categories out of it.

We use semantic measures for matching concepts (W pages) and objects of the taxonomy (W categories). We were inspired by the paper [14]. The semantic measures are based on the unary function IC (Information Content) and the binary function MSCA (Most Specific Common Abstraction). Their inputs are categories from a taxonomy. Though superficially similar, our IC definition differs essentially from that proposed for WordNet. WordNet computes the IC for concepts based on the number of subordinated concepts. We compute the IC for categories, based on the count of concepts that belong to subordinated categories. So the IC of a category is weighted by the frequency of its usage in the language rather than by its definitional complexity.

For a given category k we define IC(k) = 1 − log(1 + s_k) / log(1 + N), where s_k is the number of taxonomy concepts in the category k and all its subcategories, and N is the total number of taxonomy concepts. The main category has the lowest value, IC = 0. For categories k1 and k2 we define MSCA(k1, k2) as the category k* ∈ CA(k1, k2) (the set of common super-categories of both k1 and k2) that maximizes the value of the function {IC(k) : k ∈ CA(k1, k2)}. The properties of the IC(·) measure ensure that the category chosen is the most specific among the common super-categories.

In the literature dealing with WordNet many measures based on IC and MSCA have been proposed [14], including the LIN and PIRRO-SECO similarities:

sim_Lin(k1, k2) = 2 · IC(MSCA(k1, k2)) / (IC(k1) + IC(k2))   (1)

sim_PirroSeco(k1, k2) = (1/3) · (3 · IC(MSCA(k1, k2)) − IC(k1) − IC(k2) + 2)   (2)

Though analogous measures were defined for WordNet, our category similarity measures differ from those for WordNet because we defined IC and MSCA differently. Our definition is based on the Wikipedia structure, hence we do not need to refer to WordNet.

We used the above measures for categories to define a similarity measure for concepts (W pages). Similarity between pages p_i and p_j is computed by aggregation of the similarity between each pair of categories (k_i, k_j) such that p_i belongs to the category k_i and p_j to k_j:

sim_PAGE(p_i, p_j) = max { sim_CAT(k_i, k_j) : p_i ∈ k_i ∧ p_j ∈ k_j }   (3)

In order to demonstrate the value of semantic categorization, we exploited it as an ingredient (of a classifier ensemble) in the classical classification algorithms and their committees, SemCom, as well as a stand-alone classifier, SemCla. In this section we recall the commonly known classification algorithms we used in our experiments. These were Naive Bayes, Balanced Winnow and Labeled LDA, as well as committees of classifiers (bagging-type ensembles) built upon the Naive Bayes classifier and Balanced Winnow. We also describe our own semantic-categorization-based classifier SemCla and our heterogeneous committee SemCom (containing both the proprietary SemCat method and the above-listed supervised classification methods).

4.1 Naive Bayes
The Naive Bayes classification method (cf. [15]), on the basis of knowledge derived from the training data set, creates a probabilistic model assigning one of the predefined classes (i.e. labels) to a new observation (i.e. document). In this approach each document is treated as a bag of words, which does not take the order (syntax) into account. Additionally, a simplifying assumption is made that the individual words in the document are independent. The probability of a given class c being assigned to a document d is calculated as follows:

P(c | d) = P(c) · ∏_{w ∈ d} P(w | c)^{n_wd} / P(d),

where n_wd is the total number of occurrences of the word w in the document, and P(w | c) is the probability of occurrence of the word w in the class c. P(c) is the probability of the class c, estimated as the fraction of documents that belong to this class. The value of P(d) does not depend on the class, thus it is ignored for the purpose of document classification. Finally,

P(w | c) = (1 + ∑_{d ∈ D_c} n_wd) / (k + ∑_{w'} ∑_{d ∈ D_c} n_{w'd}),

where D_c is the set of all documents in the class c, and k is the size of the dictionary (i.e. the number of distinct words).

4.2 Balanced Winnow

Details of the Balanced Winnow algorithm can be found in [16] and [17]. Several versions of this classifier can be found in the literature. The main concept is based on the Perceptron algorithm (cf. [18]). For our purpose the Balanced Winnow version of the algorithm was selected because of its high observed efficacy. For each word the algorithm stores two weights, w+ and w−, on the basis of which it calculates document membership in each class (binary classification). Positive weights are in favor of a given class, negative weights against it. The difference between the weights (w+ − w−) is the overall weight associated with a given word. Assume that the classified document is a vector of words with the weights x = (x_1, ..., x_n). Then the classification rule is based on the inequality ∑_{i=1}^{n} (w_i+ − w_i−) x_i > θ, for a fixed value of the parameter θ. Training of the classifier is based on weight modification, performed only if a training document has been misclassified. Two parameters are introduced: a promotion level α > 1 and a demotion level 0 < β < 1. If the error consists in classifying the document into a class to which it does not belong (a negative document), then the weights of its words are modified as follows: w+ := β·w+, w− := α·w−. If an error is made on a positive document (by not classifying it into the positive class), the weight modification is: w+ := α·w+, w− := β·w−.

4.3 Labeled LDA

Labeled Latent Dirichlet Allocation (LLDA) is an extension of the Latent Dirichlet Allocation model, popular among practitioners and theorists, described in [19]. It is one of many probabilistic topic models useful in analyzing text documents; a review of this subject can be found in [20]. LDA is an unsupervised method in which any document is treated as a probabilistic mixture of various topics. The resulting generative model is characterized by the discrete probability distribution of words within a given topic. The model assumes the following way of generating each document. The length N of the document is selected (using the Poisson distribution). Then the proportion of topics making up the document is fixed (a Dirichlet distribution randomizing over the set of K topics). Subsequent words in the document are generated by random selection of a topic (with the multinomial distribution generated above), and then, within this topic (which determines a distribution of words), a particular word is generated. Assuming such a method of generating each document in a given collection, LDA tries to recreate the set of topics that generated the observed collection. The Labeled LDA method is a supervised variant which relates every document label to a fixed subset of topics. The LLDA algorithm is very similar to its unsupervised prototype, with the exception that the document topics are selected only from among those that correspond to the observed document labels; details can be found in [21]. There are other supervised variants of the LDA algorithm, such as Supervised LDA ([22]).
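The Naive Bayes estimate with add-one smoothing and the Balanced Winnow update described above can be sketched as follows. This is an illustrative sketch, not the authors' implementation; the θ, α, β values are arbitrary examples, and restricting the Winnow update to the features present in the document (x_i > 0) is a common convention that the text does not state explicitly.

```python
import math
from collections import Counter, defaultdict

# --- Naive Bayes with add-one smoothing, as in the formula for P(w|c) ---

def train_nb(labeled_docs):
    """labeled_docs: list of (class_label, list_of_words) pairs."""
    class_count = Counter()
    word_count = defaultdict(Counter)   # class -> word -> occurrences
    vocab = set()
    for c, words in labeled_docs:
        class_count[c] += 1
        word_count[c].update(words)
        vocab.update(words)
    priors = {c: n / len(labeled_docs) for c, n in class_count.items()}
    return priors, word_count, len(vocab)

def classify_nb(words, priors, word_count, k):
    def log_post(c):
        total = sum(word_count[c].values())
        # log P(c) + sum over words of log P(w|c), P(w|c) = (1 + n_wc) / (k + n_c)
        return math.log(priors[c]) + sum(
            math.log((1 + word_count[c][w]) / (k + total)) for w in words)
    return max(priors, key=log_post)

# --- One Balanced Winnow training step (promotion alpha > 1, demotion 0 < beta < 1) ---

def winnow_step(w_pos, w_neg, x, is_positive, theta=1.0, alpha=1.5, beta=0.5):
    score = sum((wp - wn) * xi for wp, wn, xi in zip(w_pos, w_neg, x))
    if (score > theta) == is_positive:
        return w_pos, w_neg             # correctly classified: no update
    a, b = (alpha, beta) if is_positive else (beta, alpha)
    # promote/demote only the weights of words active in the document
    w_pos = [wp * a if xi > 0 else wp for wp, xi in zip(w_pos, x)]
    w_neg = [wn * b if xi > 0 else wn for wn, xi in zip(w_neg, x)]
    return w_pos, w_neg
```

A missed positive document multiplies its active w+ weights by α and w− by β; a false positive does the opposite, exactly mirroring the update rules above.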
We selected LLDA over Supervised LDA since in our experimental settings LLDA gave significantly better results. As part of future work it is planned to also use semi-supervised methods such as Partially Labeled Dirichlet Allocation (cf. [23]).

4.4 SemCla

Below we present a description of a new semantic classifier, which we call SemCla. It is based on the category representation of a document produced by SemCat (see Section 3.1), used in combination with the semantic measures (see Section 3.2).
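The IC-based similarity measures referenced above (Section 3.2) can be sketched as follows. A minimal illustration under stated assumptions: subtree_size[k] stands for s_k (the number of concepts in category k and all its subcategories), and the MSCA is assumed to be supplied by the caller.

```python
import math

def ic(s_k, n_total):
    """IC(k) = 1 - log(1 + s_k) / log(1 + N); the root category
    (s_k = N) gets the lowest value, IC = 0."""
    return 1.0 - math.log(1 + s_k) / math.log(1 + n_total)

def sim_lin(k1, k2, msca, subtree_size, n_total):
    """Equation (1): Lin similarity of two categories, given their most
    specific common abstraction `msca` (the common super-category with
    maximal IC)."""
    ic1 = ic(subtree_size[k1], n_total)
    ic2 = ic(subtree_size[k2], n_total)
    ic_m = ic(subtree_size[msca], n_total)
    return 2.0 * ic_m / (ic1 + ic2) if ic1 + ic2 else 1.0

def sim_page(cats_i, cats_j, sim_cat):
    """Equation (3): page similarity as the maximum similarity over all
    pairs of categories the two pages belong to."""
    return max(sim_cat(ki, kj) for ki in cats_i for kj in cats_j)
```

Note that a category compared with itself (its own MSCA) gets similarity 1, and pairs whose only common abstraction is the root (IC = 0) get similarity 0, matching the intended range of the Lin measure.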
Outline of the algorithm. Recall that SemCat uses words and phrases from the document to produce a list of categories with weights. This representation of a document can be considered as a vector of weights over all categories from the W category structure; therefore we call it the vector of categories. We use it to calculate the cosine product. We found out that the algorithm performs better when, for each category from the vector of categories, we add its super-category (according to the W hierarchy) with weight equal to the initial weight multiplied by a constant α (we used the value α = 0.33; we explain below how we calibrated this parameter). Thus we obtain the extended category vector. This process is visualized in Figure 1.

The semantic classification is made in the way described below and illustrated in Figure 2:
1. documents from the training and test sets are categorized to obtain category vectors that represent their content,
2. category vectors for all documents are changed into extended category vectors (for a constant α),
3. we classify a new document (represented by its extended category vector) by finding the nearest group (in the sense of the cosine product) in the training set.

In the literature, a group to be compared with is usually represented by its centroid. Although the method with centroids works faster, it gives poorer results. Therefore the results presented in Tables 1 – 4 are for SemCla variants that find the nearest group using all documents from the group and taking the average similarity.

Fig. 1.
Single document category representation
(Diagram: the extended category vector of a new document is compared, via sim(), with the extended category vectors of the documents in each class group, Class 1 through Class N.)

Fig. 2.
Categorization as a classification (SemCla algorithm)
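The three steps above can be sketched as follows. A minimal illustration, not the authors' implementation: category vectors are sparse dicts, each category is assumed to have a single super-category, and α = 0.33 as in the text.

```python
import math

def extend(cat_vec, super_of, alpha=0.33):
    """Step 2: extended category vector. For every category add its
    super-category with the original weight multiplied by alpha."""
    ext = dict(cat_vec)
    for cat, w in cat_vec.items():
        sup = super_of.get(cat)
        if sup is not None:
            ext[sup] = ext.get(sup, 0.0) + alpha * w
    return ext

def cosine(u, v):
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def classify_semcla(doc_vec, class_groups):
    """Step 3: pick the class whose training documents have the highest
    *average* cosine similarity to the document (the non-centroid
    variant used for Tables 1 - 4)."""
    def avg_sim(c):
        vecs = class_groups[c]
        return sum(cosine(doc_vec, v) for v in vecs) / len(vecs)
    return max(class_groups, key=avg_sim)
```

Averaging the similarity over all documents of a group, instead of comparing with its centroid, is the slower but better-performing variant described above.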
Finding the optimal α parameter. The optimal value of α was found in a separate experiment, conducted for the SemCat algorithm before the experiment discussed in this paper. We took 4 groups of documents from kopalniawiedzy.pl (astronomy-physics, psychology, medicine, technology) and drew at random N = 100 documents from each of them. We did not use all document groups from this corpus; we chose the 4 groups that were most different from each other. All documents were categorized with various values of α (too large values of α resulted in a significant deterioration of the outcomes). Then we calculated the semantic similarity between the categorized documents (for the different values of α), sorted them and ranked them. We chose the value of the parameter α that maximizes the difference between the mean ranks of documents from the same group and those belonging to different groups. In other words, we found the value that separates these groups of documents best.

Ensembles of classifiers. The experimental setting was also based on ensembles of classifiers. For each document the classification process is carried out by every classifier in the ensemble (these may also be classifiers of the same type, but trained on different learning samples). Then the results of all classifiers are aggregated into the final ensemble classification. In the existing implementation this can be done in three ways: (a) each classifier has one vote, and the category with the highest number of votes is selected; (b) vote counting additionally takes into account the weights of the classification results (this option requires that all classifiers are of the same type); (c) ranks of the elements returned by the classifiers are aggregated instead of raw votes or weights. In the case when two (or more) categories receive exactly the same number of votes, the result is selected at random from among the winning categories.
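Aggregation variants (a) and (c) can be sketched as follows; variant (b) differs only in summing classifier-supplied weights instead of unit votes. An illustrative sketch that assumes, for variant (c), that every classifier ranks all categories.

```python
import random
from collections import Counter

def vote(predictions):
    """Variant (a): one vote per classifier; ties broken at random."""
    counts = Counter(predictions)
    best = max(counts.values())
    return random.choice([c for c, n in counts.items() if n == best])

def rank_aggregate(rankings):
    """Variant (c): sum the rank positions each classifier assigns to
    every category (best category first in each ranking); the category
    with the lowest total rank wins."""
    totals = Counter()
    for ranking in rankings:
        for pos, cat in enumerate(ranking):
            totals[cat] += pos
    return min(totals, key=totals.get)
```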
In our new approach we developed a heterogeneous committee of classifiers, SemCom, that contains the supervised methods Naive Bayes, Balanced Winnow and LLDA, and our proprietary unsupervised categorization method SemCat utilizing the taxonomy of W categories. The categorization method is unsupervised, and thus it cannot be trained on different samples in a manner similar to supervised classifiers (the categorization method utilizes data from the complete W taxonomy). For this reason the committee contained only one instance of the categorization algorithm. In order to increase the impact of SemCat on the final results of the committee as a whole, categorization votes were counted with a higher weight. In addition, one should take into account that the categorization algorithm returns a ranking of categories (not only a single category). Thus, in the experimental settings we included a variant of the committee in which the categorization method adds more than one top-ranked category from its list (with correspondingly decreasing weights).

The experimental setting exploited several variants of the ensembles, trained on different subsets of the training set (W pages for Tables 1, 2 and groups of news for Tables 3, 4). For the classical classification task (Tables 1 and 2) the training samples of size S were drawn from the W pages belonging to the W categories that represent the considered classes; we will call them W class categories. When we choose W documents for training, we can choose either documents whose W categories are identical with the W class categories or documents from their sub-categories. We say that we choose level 1 (L = 1) documents if for each document at least one of its categories is identical with a class category; larger values of L additionally admit documents whose categories lie up to L levels down the sub-category hierarchy. A weight triple such as (14, 10, 6) means that we put the top three categories from the semantic categorizer into the voting with weights 14, 10 and 6, respectively. For the semantic gap task we used S = 50 for Table 3 and S = 200 for Table 4. The experimental committees consisted of 25 classifiers based on the Naive Bayes method and 25 based on Balanced Winnow. The aggregation variant was the one in which each classifier votes for one category only. More information on ensemble methods can be found in [24].
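The way the single SemCat instance contributes several top-ranked categories with decreasing weights, alongside the unit votes of the supervised classifiers, can be sketched as follows. A hypothetical illustration: the default weight triple (14, 10, 6) is one of the settings mentioned in the text, and treating supervised votes as weight 1 is our simplifying assumption.

```python
from collections import Counter

def semcom_vote(classifier_preds, semcat_ranking, semcat_weights=(14, 10, 6)):
    """Combine unit votes of the supervised classifiers with weighted
    votes for the top categories of the SemCat ranking (weights are
    illustrative); the highest-scoring category wins."""
    scores = Counter(classifier_preds)            # one vote per classifier
    for cat, w in zip(semcat_ranking, semcat_weights):
        scores[cat] += w
    return max(scores, key=scores.get)
```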
We performed two types of experiments; their results are reported in Tables 1 – 4. The first experiment aimed at demonstrating that adding a semantic categorizer to a committee of traditional classifiers improves classification correctness in the classic classification task (Tables 1, 2). The second experiment was designed to show that a semantic categorizer is capable of bridging the semantic gap between the training data and the test data (Tables 3, 4).
For experimental purposes we used two different benchmark data sets. We needed different datasets because of the different nature of the investigated problems.
Benchmark used for classification comparison. This benchmark data set was based on the Polish subdirectory of the DMOZ taxonomy / Open Directory Project. It contains 1063 text files of Polish web pages with just the HTML tags removed. The selected documents belong to 15 directories that map into W categories: astronomy, biology, economics, philosophy, physics, graphics, history, linguistics, mathematics, education, politics, law, religious studies, sociology, technology. None of these categories is a subcategory of another one in the W taxonomy. We omitted a few cases of multi-labeled documents. For the benchmark documents the reader is referred to the benchmark web page. The various options of the categorization setting cause the number of categorized documents to differ; for calculating the results we chose the set of documents that was categorized by every algorithm.

Benchmark containing data with semantic gap. The second benchmark was made of documents downloaded from various news pages. It consists of a training part and an evaluation part; they come from different domains. We used separate collections to achieve different wordings in each of them. The training set consists of news from the popular science portal kopalniawiedzy.pl merged with documents from one directory of forsal.pl, a domain about finance and economy. Below we show a more detailed description of the training set:
– documents from kopalniawiedzy.pl : astronomy-physics N=283; medicine N=2979; life science N=3122; technology N=4861; psychology N=1733; humanities N=244,
– documents from forsal.pl from the directory Giełda (Stock exchange) N=1987.

For evaluation we downloaded directories from (containing medical news) and merged them with economical documents from and (market, finances, business). Datasets used for evaluation:
– directories from : Ginekologia (Gynecology) N=1034; Kardiologia (Cardiology) N=239; Onkologia (Oncology) N=1195,
– directories from : Waluty (Currencies) N=2161; Finanse (Finances) N=1991,
– documents from N=978.
To assess the efficiency of the studied algorithms we use two different measures. The first one is the commonly used standard precision measure; the second one is a modified precision based on the Lin similarity measure (Equation (1) in Section 3.2). The difference lies in using the Lin measure instead of the indicator function. For documents d_1, ..., d_n with real categories categ(d_i) and predictions pred(d_i), the Lin precision is defined as (1/n) · ∑_{i=1}^{n} Lin(categ(d_i), pred(d_i)). The motivation for using the latter measure is that standard precision does not take into account the dependence between categories: when we make a wrong prediction, we would like to know how different the predicted category is from the real one.

The first part of the experimental work concerned a comparison of various methods of text classification. We proceeded on documents from the DMOZ corpus with the fixed set of labels described in Section 5.2. Documents were divided into separate groups based on their text length measured by the number of characters (C): short (1000 ≤ C < 2000), medium (2000 ≤ C < 10,000), long (10,000 ≤ C). Files shorter than 1000 characters were not processed. Results for the various classification methods are presented in Tables 1, 2; they are divided by file size and efficiency measure. Methods based on the categorization algorithm return a list of weighted W categories. Therefore we transformed the outcome categories into the target set of 15 categories and took only the one category with the highest weight. Categorization was based on a selection of the 10 words (only nouns) / phrases with the highest tf-idf from the document. The experiments were performed for different values of the parameters, but other settings gave worse results.

In Table 1 the first four rows present various modifications of the categorization method. The difference between them lies in the method of disambiguation of an ambiguous page. The first row presents the standard disambiguation method (see Section 3).
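The modified precision defined above can be computed as follows; `sim` is any category similarity function, e.g. the Lin measure of Equation (1). A minimal sketch: with the 0/1 indicator as `sim` it reduces to standard precision.

```python
def lin_precision(true_cats, pred_cats, sim):
    """Average similarity between the true and the predicted category,
    instead of the 0/1 indicator used by standard precision."""
    assert len(true_cats) == len(pred_cats) and true_cats
    return sum(sim(t, p) for t, p in zip(true_cats, pred_cats)) / len(true_cats)
```

A wrong prediction that lands in a semantically close category is thus penalized less than one that lands in an unrelated category.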
The next two methods first find the set of pages that map unambiguously. Then, for every ambiguous page, we find all of its mappings to potential meanings, compute their distances to that set, and sort them in descending order. Subsequent possible meanings are given weights depending on their rank i (1/i, 1/i², or uniform). All of these options gave similar means, so we used a paired t-test to compare them, with the basic disambiguation method as a reference. The methods with the rank-based weightings 1/i and 1/i² do not differ significantly; the method with uniform weights does.

All of these methods took only nouns from the document. We developed two options of mapping words into titles of W pages: we either remove from the set of candidate pages those that do not match in an exact way, or we keep them. The option “exact matching” worked slightly better (although not significantly), so we present it. Then we present individual classifiers, followed by ensembles of classifiers. Subsequent results are for the heterogeneous committee.

The second experiment focuses on the problem of the semantic gap, which is observed in classification of data from different domains. For such data, two documents often express the same concepts, but as they use different wording (because of the existence of synonyms, hypernyms, hyponyms), conventional classification / clustering algorithms based on the standard bag-of-words approach do not work well. Such classifiers often do not recognize the different linguistic representations in the test and training sets. Some works relating to the problem were presented in Section 2. Our approach, thoroughly presented above, is different from them. There are other linguistic phenomena, such as ellipsis and paraphrase; we focus on synonyms, hypernyms and hyponyms because of the Wikipedia structure on which our algorithm is based. We deal with the hyper-/hyponym relation thanks to the W category graph structure we operate on; this graph is built on these kinds of relations. We cope with the synonym relation during the phase of mapping words / phrases from the text into W pages: the string set attached to a single W page contains the page title and all its synonyms, extracted from all names of disambiguation pages that point to this particular page.

For the experimental design (see Table 3) we used standard classification methods in different settings. As an input for them we used: 1. terms – terms from the document; 2. categories – categories for a given document produced by SemCat; 3. concepts – the set of disambiguated concepts (W page ids) produced during the SemCat algorithm. In Table 4 we present SemCla, the ensembles, and the heterogeneous committee with the semantic classifier.

Results
Results

As can be seen in Table 1, the best method among the considered SemCat algorithms is the one where, upon mapping terms/phrases to W pages, the ranking of pages corresponding to a term is computed and all of them are taken into account using appropriate weights. The version using only unambiguous terms and phrases has the poorest performance. Modifications of the base method (variants of fitting, shifting the stage of category projection) do not lead to significant changes in performance.

Though SemCla outperforms individual non-semantic classifiers, a classical classifier ensemble is able to outperform SemCla. Therefore we turned to the impact of including SemCat in an ensemble of classical classifiers. The size of the ensemble (25x Balanced Winnow + 25x Bayes) guarantees the stability of the results under various selections of the random training samples.

Experimental settings included various levels of the W category graph used to create training samples (Level = 1, 2, ∞) as well as various sample sizes per category (S = 50, 100, 200, ∞). S = ∞ led to noticeably worse performance, since the W documents selected in the random sample were only vaguely related to the desired topic (category). On the other hand, in every investigated case, results for Level = 1 were worse than for Level = 2, since the randomization of the sample for each instance of the classifier was too low (the number of W documents on level 1 was not sufficient to make a sample).

The ensemble of classical classifiers was extended with SemCat (Table 2) using various weights for the 1st, 2nd and 3rd category in the SemCat ranking. This setting requires further investigation, but usually the weights 14/10/6 led to the best classification results; higher weights caused worse results. The extended ensemble 25x Balanced Winnow + 25x Bayes + SemCat with Level = 2, S = 200 and weights 14/10/6 was usually the optimal setting (with an exception for the shortest documents). Further extension of the ensemble with the LLDA classifier did not improve the results, both for the base ensemble (25x Balanced Winnow + 25x Bayes) and for the semantic ensemble that included the SemCat algorithm.

The presented experiments lead to the following conclusions: the best results were achieved by ensembles that, beside standard classification methods (25x Bayes + 25x Balanced Winnow), included a semantic method (either the SemCat or the SemCla algorithm). Surprisingly, adding a more varied set of standard classification methods (Naive Bayes, Winnow and LLDA) did not improve the quality of the ensemble. An ensemble of 25x SemCla classifiers in most cases does not perform significantly better than a single SemCla, mainly due to the low variance of the individual voting methods within the ensemble.
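The extended committee's decision rule, as we read it from the text, can be sketched as weighted voting: each base classifier casts one vote for its predicted category, and SemCat adds weighted votes (e.g. 14/10/6) for the top-3 categories of its ranking. The exact aggregation used by the authors may differ; this sketch is an assumption.

```python
# Sketch of the extended committee: base classifiers (e.g. 25x Balanced Winnow
# + 25x Bayes) each cast one vote, and SemCat contributes weighted votes for
# the top-3 categories of its ranking.
from collections import defaultdict

def committee_predict(base_predictions, semcat_top3, semcat_weights=(14, 10, 6)):
    scores = defaultdict(float)
    for label in base_predictions:              # one vote per base classifier
        scores[label] += 1.0
    for label, w in zip(semcat_top3, semcat_weights):
        scores[label] += w                      # weighted top-3 SemCat votes
    return max(scores, key=scores.get)
```

For example, with 10 base votes for "sport", 12 for "economy" and the SemCat ranking ["sport", "politics", "economy"], "sport" wins (10 + 14 = 24 against 12 + 6 = 18), illustrating how the semantic categorizer can overturn a narrow majority of classical classifiers.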
Table 1. Average values of various precision measures (Lin precision and Precision; reported separately for short, medium and long documents) on the DMOZ small dataset, for the SemCat variants (with and without disambiguation, with different rank weightings), individual classifiers (Bayes, Balanced Winnow, SemCla) and ensembles such as 25x(B,W), 25x SemCla and 25x(B,W) + LLDA. Parameter L stands for the level of W documents used for the training sample; S is the sample size per group of documents; 25x(B,W) stands for an ensemble of 25 Bayes and 25 Balanced Winnow classifiers. The vector of numbers following SemCat represents the weights attached to the top-3 categories inserted into the committee. [The numeric body of this table is not recoverable from the source text.]
Table 2. Average values of various precision measures (Lin precision and Precision; short, medium and long documents) on the DMOZ small dataset for heterogeneous committees 25x(B,W) + SemCat with S = 50, 100, 200, SemCat weight vectors (7,5,3), (10.5,7.5,4.5), (14,10,6) and (17.5,12.5,7.5), and LLDA extensions with weights 10.0, 15.0 and 20.0. Parameter L stands for the level of W documents used for the training sample; S is the sample size per group of documents; 25x(B,W) stands for an ensemble of 25 Bayes and 25 Balanced Winnow classifiers. The vector of numbers following SemCat represents the weights attached to the top-3 categories inserted into the committee. [The numeric body of this table is not recoverable from the source text.]

As visible in Tables 3 and 4, in the case of the semantic gap problem, semantic methods and committees lead to much better results than traditional classifiers, even if the latter operate on the modified representation (bag of categories instead of bag of words). It can be seen that the usage of terms alone gives poor results when a semantic gap occurs. Classical methods are helped most if categories are provided for training purposes, while the usage of concepts is only half as good. This means that our SemCla algorithm uses a much deeper insight into the document content than a mere category label assignment.

It is also worth stressing that although SemCla (contrary to SemCat) is supervised, it can also be used in an unsupervised version. In such a setting, instead of using unobservable document labels as training classes (cf. Figure 2), one can use document clusters, where the clustering is also based on the semantic categorization (SemCat algorithm) and applies the semantic similarity measures defined in Section 3.2. We are going to investigate this direction more deeply in the future, since it has a big advantage in cases where document labels are unavailable and a training set cannot be created (e.g. collections of web pages).
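The unsupervised variant suggested above (clusters in place of labels) might look roughly like the following sketch. The representation of documents as category weight dictionaries and the one-pass seed-based grouping are our simplifying assumptions, not the paper's algorithm.

```python
# Toy sketch of the unsupervised setting: documents are represented by
# SemCat-style category weight vectors (dicts), grouped around seed vectors by
# cosine similarity, and the resulting cluster ids then serve as training
# classes in place of real labels.

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(u.get(k, 0.0) * v.get(k, 0.0) for k in set(u) | set(v))
    nu = sum(x * x for x in u.values()) ** 0.5
    nv = sum(x * x for x in v.values()) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def pseudo_labels(docs, seeds):
    """Assign each document to the most similar seed; the returned cluster ids
    can be used as training classes when document labels are unavailable."""
    return [max(range(len(seeds)), key=lambda j: cosine(d, seeds[j])) for d in docs]
```

A full version would of course iterate the clustering rather than make a single pass, but the point is that the training signal comes from semantic similarity alone.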
Conclusions

In this paper we demonstrated the value of a semantic approach to the task of document classification. In particular, we showed that an unsupervised approach to classification is possible when using the semantic approach, which may be considered an interesting result in itself. Admittedly, the semantic classifier we introduce does not perform as well as ensembles of traditional classifiers, but the inclusion of a semantic categorizer into such an ensemble is capable of significantly improving its performance in classic classification tasks.

Intuitively, one would expect a classifier incorporating semantic information to be superior to traditional classifiers that do not use such information. As our experiments show, this is not obvious: though the semantic classifier proved to be a competitor for individual classic classifiers, ensembles of classic classifiers can beat it. Exploiting the advantages of semantic information therefore requires some level of sophistication and cannot be taken for granted.
Table 3. Average values of the precision measure for the classical methods, Bayes (B) and Balanced Winnow (W), trained on terms, categories and concepts.

Classification                      terms   categories  concepts
Bankier: Business Biznes   Bayes    0.397   0.634       0.376
                           Winnow   0.367   0.546       0.323
Forsal: Currencies         Bayes    0.602   0.910       0.620
                           Winnow   0.720   0.870       0.498
Forsal: Finances           Bayes    0.847   0.952       0.814
                           Winnow   0.832   0.874       0.695
Gynecology                 Bayes    0.404   0.505       0.233
                           Winnow   0.074   0.205       0.219
Cardiology                 Bayes    0.782   0.746       0.502
                           Winnow   0.350   0.438       0.427
Oncology                   Bayes    0.758   0.824       0.526
                           Winnow   0.227   0.627       0.390
Table 4. Average values of the precision measure for the “semantic classification” (SemCla), an ensemble of SemCla, an ensemble of Bayes (B) and Balanced Winnow (W) classifiers, and the heterogeneous committee. [The last value of the Cardiology row and the Oncology row are not recoverable from the source text.]

                            SemCla  25x SemCla  25x(B,W)  Heterogen. committee
Bankier (Business Biznes):  0.752   0.830       0.789     0.855
Forsal (Currencies):        0.972   0.983       0.995     0.999
Forsal (Finances):          0.979   0.986       0.965     0.986
Gynecology                  0.842   0.844       0.732     0.833
Cardiology                  0.900   0.891       0.895

What is still more important, the semantic classifier turns out to be superior to classical approaches to classification in the case of a semantic gap between the training data and the data to which the classifier is to be applied. This fact opens up truly new horizons for the application of machine learning methods to document classification, e.g. after mergers between corporations, where the local culture usually leads to the development of specific languages that differ between the firms.

This research opens up a number of further interesting areas of research. The semantic approach (in its base, unsupervised setting) could also be tested on clustering tasks under the semantic gap scenario, as well as on mixtures of classification and clustering.