Semantic classifier approach to document classification
Piotr Borkowski, Krzysztof Ciesielski, and Mieczysław A. Kłopotek
Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, 01-238 Warszawa, Poland
Tel.: (+48) 22 380-05-00, Fax: (+48) 22 380-05-10
piotrb, kciesiel, [email protected]
Abstract.
In this paper we propose a new document classification method that bridges discrepancies (the so-called semantic gap) between the training set and the application sets of textual data. We demonstrate its superiority over classical text classification approaches, including traditional classifier ensembles. The method consists in combining a document categorization technique with a single classifier or a classifier ensemble (the SemCom algorithm, Committee with Semantic Categorizer).
Text document classification methods are well established in the area of text mining. Predominantly they have been derived from corresponding data mining techniques that were designed to handle long input data records; let us mention here, for example, Naive Bayes, Balanced Winnow and LLDA (to be described later). While these methods are quite successful in data mining and have been appreciated within the text mining community, one important drawback arises in the specific area of text mining. Whereas in data mining the meaning and the value range of individual attributes of an object are relatively well defined, in text mining this is no longer the case. The same content may be expressed in different ways, using different words (synonyms, hyponyms), while the same word can express different things in different contexts. This would not be a big obstacle were it not for the fact that traditional techniques then require significantly larger bodies of training data, which makes an unbalanced sample much more likely, not only because of the size of the data sample but also because of the heterogeneity of the data sources that need to be combined. It is even worse when the trained classifiers need to be applied to unseen data stemming from a dataset that, from the human point of view, touches the same topic but, from the computer point of view, is written in a completely different style. This gives rise to the so-called semantic gap: though the training and application data sets are semantically similar, their syntactical and bag-of-words views differ. In such a case an understanding of the semantics of documents would be needed, which is unavailable to traditional data mining techniques.

In this paper we propose two new document classification methods, SemCla (Semantic Classifier) and SemCom (Committee with Semantic Categorizer), bridging the semantic gap between the training set and the application sets of textual data.
The methods consist in combining an unsupervised document categorization technique with a single classifier or a classifier ensemble. Via this component the traditional notion of document similarity (based on angles between vectors in term space) is amended to include the concept of semantic similarity. The notion of semantic similarity, as used in this paper, was described in [1]. Both methods introduced in the paper are based on our SemCat (Semantic Categorizer) algorithm, which was also introduced in [1].

In Section 2 we define the problem of document categorization and semantic classification and recall the work done on the subject by other researchers. In Section 3 we describe our categorization methodology, SemCat. Subsequently, in Section 4, we show how our categorization method can be used in various ways in the classical task of classification. In Section 5 we explain the setup of the experiments we performed to show the usefulness of the SemCla algorithm in classification tasks. In Section 6, presenting the results of these experiments, we demonstrate the superiority of the semantic classification methods (SemCom and SemCla) over classical text classification approaches, including traditional classifier ensembles for text classification tasks (Section 6.1), as well as in cases where the so-called semantic gap occurs (Section 6.2). Section 7 summarizes the achieved results and outlines future research directions.

Our contribution in this paper is:
– constructing a new supervised classifier based on an unsupervised semantic document categorizer,
– demonstrating the feasibility of the new classifier for bridging the semantic gap between the test and training data sets,
– designing a heterogeneous committee that combines classical classifiers and the semantic classifier.

The task of categorization is to assign one or more labels (categories) to a document or a group of documents (cluster labeling).
It finds multiple practical applications, especially for assisting in text retrieval tasks: in web page classification, e-mail and memo organization, expanding queries with new terms, expanding / improving ontologies, and many others.

The categorization task can be viewed formally as a special case of classification [2,3], but with a couple of differences. First of all, the number of categories significantly exceeds the number of classes in a typical classification task. Categories may be flat and disjoint, but they may also form a tree or even a hierarchy (an acyclic graph). And more than one category may be assigned to a single document. Therefore typical classification methods do not fit the task of categorization well. Diverse other methods have been proposed to attack the problem of categorization. Some of them are based on clustering. The most popular representatives of this brand of approaches are Nonnegative Matrix Factorization (NMF), Latent Semantic Analysis (LSA), Probabilistic LSA (PLSA), and Finite Mixture of Multidimensional Bernoulli Distributions, described in [4]. Other researchers map the document contents to some semantic resources, in particular to Wikipedia (W). This approach was exploited in the WikipediaMiner Project, developed at the University of Waikato in Hamilton, New Zealand [5,6,7]. It uses W topics as categories. The basic idea was key phrase indexing. For terms from W their “keyphraseness” [8], that is, the share of their occurrences in W links, is computed. Then these terms are searched for in a document to be categorized. Terms with multiple meanings are disambiguated (via some trained classifier) by choosing the meaning closest to the document topic. For training purposes, documents annotated with such keyphrases have to be assigned categories. Then a classifier is trained.

In this paper we exploit our new unsupervised categorization method, SemCat, introduced in [1]. Contrary to WikipediaMiner, no classifiers are used, hence no training corpora need to be prepared.
Also, it is not based on W links. Instead, the category graph of W is exploited. A novelty here is also the usage of the more challenging Polish language [9]. Furthermore, we develop a classification method, SemCla, suitable for application to data with a semantic gap.

The problem of the “semantic gap” is understood in the literature in many ways. We focus on the aspect encountered in text retrieval where data come from different domains. The next paragraphs give a brief overview of the approaches that have been proposed.

The article [10] presents a review of the cross-domain text categorization problem. Unlike the classical case, the training and the test data originate from different distributions or domains. This is very common in practical tasks because (especially for the Polish language) we often do not have a suitable data set of labeled documents. Often what we have is a corpus which is topically related, but presents the same (or semantically similar) information in a different way, e.g. using a different vocabulary. Many algorithms have been developed or adapted for cross-domain text classification. There are conventional algorithms: Rocchio’s Algorithm; Decision Trees like CART, ID3, C4.5; the Naive Bayes classifier; KNN; Support Vector Machines; and some novel cross-domain classification algorithms: the Expectation-Maximization Algorithm, Probabilistic Latent Semantic Analysis (PLSA), Latent Dirichlet Allocation (LDA), the CFC Algorithm, and the Co-cluster based Classification Algorithm [11].

Paper [12] gives a general overview of the problem of the semantic gap in information retrieval. The authors focus on two separate tasks: text mining and multimedia mining / image retrieval. The semantic gap in text retrieval is defined as the usage of different words (synonyms, hypernyms, hyponyms) to describe the same object. In the part about text retrieval the authors concentrate on reorganizing search results using a post-retrieval clustering system. They work on search results (“snippets”) and enhance them by adding so-called topics.
A topic is a set of words of similar meaning that is an outcome of Probabilistic Latent Semantic Analysis or Latent Dirichlet Allocation applied to some external data collection. After adding a topic to the snippet they carry out clustering or labeling.

In the paper [13] the authors propose a way to improve categorization by adding semantic knowledge from Wikitology (a knowledge repository based on Wikipedia). They used various text representation and text enrichment techniques and used a Support Vector Machine (SVM) to learn a classification model.

WikipediaMiner Project: http://wikipedia-miner.sourceforge.net/

Our taxonomy-based semantic categorization method
Our taxonomy-based categorization method SemCat was described in detail in [1], so below we present only a brief description of it.
Suppose we have a taxonomy of categories (a directed acyclic graph with one root category) like the Wikipedia (W) category graph or the Medical Subject Headings (MeSH) ontology. We assume there is a set of concepts connected with the taxonomy in the following way: every concept is linked to one or more categories. Every category and concept is tagged with a string label. Strings connected with categories are used as an outcome presented to the user, and those attached to concepts are used for mapping the text of a document into the set of concepts.

For the experimental design we used the W category graph with the concept set of W pages. Tags for W categories were their original string names. The set of string tags connected with a single W page consists of the lemmatized page name and all names of disambiguation pages that link to that page.

In the process of document categorization we remove stop words and very rare / frequent words, lemmatize, find phrases, and calculate normalized tf-idf weights for terms and phrases. Calculation of the standard term frequency / inverse document frequency is based on word frequencies from the collection of all W pages.

Then we map the document’s terms and phrases into a set of concepts. In the case of homonyms, we disambiguate the concept assignment: we select the concept that is nearest, by the similarity measure defined by Equations (1) and (2) (see Section 3.2), to the set of concepts that was mapped in an unambiguous way. We investigated other methods of disambiguation, e.g. taking all meanings of ambiguous terms and weighing them accordingly. The results for various disambiguation methods are described in Section 5.4.

When every term in the document is assigned to a proper concept (W page), all concepts are mapped to W categories. In this way one term usually maps to more than one category, so we transfer the weight associated with that term proportionally to all its categories. The sum of the weights assigned to the categories equals the sum of the tf-idf weights of the terms.
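The weight transfer described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the two mapping dictionaries are hypothetical stand-ins for the W page and category structure, and we assume an even proportional split of a term's weight over its concept's categories.

```python
from collections import defaultdict

def term_weights_to_categories(term_tfidf, term_to_concept, concept_to_cats):
    """Transfer each term's tf-idf weight to the categories of its concept
    (W page), split evenly, so that the total category weight equals the
    total tf-idf mass of the mapped terms."""
    cat_weight = defaultdict(float)
    for term, w in term_tfidf.items():
        concept = term_to_concept.get(term)
        if concept is None:
            continue  # term maps to no concept; its weight is dropped
        cats = concept_to_cats[concept]
        for cat in cats:
            cat_weight[cat] += w / len(cats)
    # ranked list of (category, weight), highest weight first
    return sorted(cat_weight.items(), key=lambda kv: -kv[1])
```

For instance, a term mapped to a page with two categories contributes half of its tf-idf weight to each, so the category weights always sum to the tf-idf mass of the mapped terms.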
The outcome of this procedure is a ranked list of categories with weights. In the last step we can transform the weighted ranking and/or choose the top-N categories out of it.

We use semantic measures for matching concepts (W pages) and objects of the taxonomy (W categories). We were inspired by the paper [14]. The semantic measures are based on the unary function IC (Information Content) and the binary function MSCA (Most Specific Common Abstraction). Their inputs are categories from a taxonomy. Though superficially similar, our IC definition differs essentially from that proposed for WordNet. WordNet computes the IC for concepts based on the number of subordinated concepts. We compute the IC for categories, based on the count of concepts that belong to subordinated categories. So the IC of a category is weighted by the frequency of its usage in the language rather than by its definitional complexity.

For a given category k we define IC(k) = 1 − log(1 + s_k) / log(1 + N), where s_k is the number of taxonomy concepts in the category k and all its subcategories, and N is the total number of taxonomy concepts. The main category has the lowest value, IC = 0. For categories k1 and k2 we define MSCA(k1, k2) as the category k* ∈ CA(k1, k2) (the set of common super-categories of both k1 and k2) that maximizes the value of the function {IC(k) : k ∈ CA(k1, k2)}. The properties of the IC(·) measure ensure that the category chosen is the most specific among the common super-categories.

In the literature dealing with WordNet many measures based on IC and MSCA have been proposed [14], including the LIN and PIRRO-SECO similarities:

sim_Lin(k1, k2) = 2 · IC(MSCA(k1, k2)) / (IC(k1) + IC(k2))   (1)

sim_PirroSeco(k1, k2) = (1/3) · (3 · IC(MSCA(k1, k2)) − IC(k1) − IC(k2) + 2)   (2)

Though analogous measures were defined for WordNet, our category similarity measures differ from those for WordNet because we defined IC and MSCA differently. Our definition is based on the Wikipedia structure, hence we do not need to refer to WordNet.

We used the above measures for categories to define a similarity measure for concepts (W pages). Similarity between pages p_i and p_j is computed by aggregation of the similarity between each pair of categories (k_i, k_j) such that p_i belongs to the category k_i and p_j to k_j:

sim_PAGE(p_i, p_j) = max { sim_CAT(k_i, k_j) : p_i ∈ k_i ∧ p_j ∈ k_j }   (3)

In order to demonstrate the value of semantic categorization, we exploited it as an ingredient (of a classifier ensemble) in the classical classification algorithms and their committees, SemCom, as well as a stand-alone classifier, SemCla. In this section we recall the commonly known classification algorithms we used in our experiments. These were Naive Bayes, Balanced Winnow and Labeled LDA, as well as committees of classifiers (bagging-type ensembles) built upon the Naive Bayes classifier and Balanced Winnow. We also describe our own semantic-categorization-based classifier SemCla and our heterogeneous committee SemCom (containing both the proprietary SemCat method and the above-listed supervised classification methods).

4.1 Naive Bayes
The Naive Bayes classification method (cf. [15]), on the basis of knowledge derived from the training data set, creates a probabilistic model assigning one of the predefined classes (i.e. labels) to a new observation (i.e. document). In this approach each document is treated as a bag of words, which does not take the order (syntax) into account. Additionally, a simplifying assumption is made that the individual words in the document are independent. The probability of a given class c being assigned to a document d is calculated as follows:

P(c | d) = P(c) · ∏_{w ∈ d} P(w | c)^{n_wd} / P(d),

where n_wd is the total number of occurrences of the word w in the document, and P(w | c) is the probability of occurrence of the word w in the class c. P(c) is the probability of the class c, estimated as the fraction of documents that belong to this class. The value of P(d) does not depend on the class, thus it is ignored for the purpose of document classification. Finally,

P(w | c) = (1 + ∑_{d ∈ D_c} n_wd) / (k + ∑_{w'} ∑_{d ∈ D_c} n_{w'd}),

where D_c is the set of all documents in the class c, and k is the size of the dictionary (i.e. the number of distinct words).

4.2 Balanced Winnow

Details of the Balanced Winnow algorithm can be found in [16] and [17]. Several versions of this classifier can be found in the literature. The main concept is based on the Perceptron algorithm (cf. [18]). For our purpose the Balanced Winnow version of the algorithm was selected because of its high observed efficacy. For each word the algorithm stores two weights, w+ and w−, on the basis of which it calculates document membership in each class (binary classification). Positive weights are in favor of a given class, negative weights against it. The difference between the weights (w+ − w−) is the overall weight associated with a given word. Assume that the classified document is a vector of words with the weights x = (x_1, ..., x_n). Then the classification rule is based on the inequality ∑_{i=1}^{n} (w_i+ − w_i−) x_i > θ, for a fixed value of the parameter θ. Training of the classifier is based on weight modification, performed only if a training document has been misclassified. Two parameters are introduced: a promotion level α > 1 and a demotion level 0 < β < 1. If the error consists in classifying the document into a class to which it does not belong (a negative document), then the weights of its words are modified as follows: w+ := β·w+, w− := α·w−. If an error is made on a positive document (by not classifying it into the positive class), the weight modification is: w+ := α·w+, w− := β·w−.

4.3 Labeled LDA

Labeled Latent Dirichlet Allocation (LLDA) is an extension of the Latent Dirichlet Allocation model, popular among practitioners and theorists, described in [19]. It is one of many probabilistic topic models useful in analyzing text documents; a review of this subject can be found in [20]. LDA is an unsupervised method in which any document is treated as a probabilistic mixture of various topics. The resulting generative model is characterized by the discrete probability distribution of words within a given topic. The model assumes the following way of generating each document. The length N of the document is selected (using the Poisson distribution). Then the proportion of topics making up the document is fixed (a Dirichlet distribution randomizing over the set of K topics). Subsequent words in the document are generated by random selection of a topic (with the multinomial distribution generated above), and then, within this topic (which determines a distribution of words), a particular word is generated. Assuming such a method of generating each document in a given collection, LDA tries to recreate the set of topics that generated the observed collection. The Labeled LDA method is a supervised variant which relates every document label to a fixed subset of topics. The LLDA algorithm is very similar to its unsupervised prototype, with the exception that the document topics are selected only from among those that correspond to the observed document labels; details can be found in [21]. There are other supervised variants of the LDA algorithm, such as Supervised LDA ([22]).
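The Naive Bayes estimate with add-one smoothing and the Balanced Winnow update described above can be sketched as follows. This is an illustrative sketch, not the authors' implementation; the θ, α, β values are arbitrary examples, and restricting the Winnow update to the features present in the document (x_i > 0) is a common convention that the text does not state explicitly.

```python
import math
from collections import Counter, defaultdict

# --- Naive Bayes with add-one smoothing, as in the formula for P(w|c) ---

def train_nb(labeled_docs):
    """labeled_docs: list of (class_label, list_of_words) pairs."""
    class_count = Counter()
    word_count = defaultdict(Counter)   # class -> word -> occurrences
    vocab = set()
    for c, words in labeled_docs:
        class_count[c] += 1
        word_count[c].update(words)
        vocab.update(words)
    priors = {c: n / len(labeled_docs) for c, n in class_count.items()}
    return priors, word_count, len(vocab)

def classify_nb(words, priors, word_count, k):
    def log_post(c):
        total = sum(word_count[c].values())
        # log P(c) + sum over words of log P(w|c), P(w|c) = (1 + n_wc) / (k + n_c)
        return math.log(priors[c]) + sum(
            math.log((1 + word_count[c][w]) / (k + total)) for w in words)
    return max(priors, key=log_post)

# --- One Balanced Winnow training step (promotion alpha > 1, demotion 0 < beta < 1) ---

def winnow_step(w_pos, w_neg, x, is_positive, theta=1.0, alpha=1.5, beta=0.5):
    score = sum((wp - wn) * xi for wp, wn, xi in zip(w_pos, w_neg, x))
    if (score > theta) == is_positive:
        return w_pos, w_neg             # correctly classified: no update
    a, b = (alpha, beta) if is_positive else (beta, alpha)
    # promote/demote only the weights of words active in the document
    w_pos = [wp * a if xi > 0 else wp for wp, xi in zip(w_pos, x)]
    w_neg = [wn * b if xi > 0 else wn for wn, xi in zip(w_neg, x)]
    return w_pos, w_neg
```

A missed positive document multiplies its active w+ weights by α and w− by β; a false positive does the opposite, exactly mirroring the update rules above.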
We selected LLDA over Supervised LDA since in our experimental settings LLDA gave significantly better results. As part of future work it is planned to also use semi-supervised methods such as Partially Labeled Dirichlet Allocation (cf. [23]).

4.4 SemCla

Below we present a description of a new semantic classifier, which we call SemCla. It is based on the category representation of a document produced by SemCat (see Section 3.1), used in combination with the semantic measures (see Section 3.2).
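The IC-based similarity measures referenced above (Section 3.2) can be sketched as follows. A minimal illustration under stated assumptions: subtree_size[k] stands for s_k (the number of concepts in category k and all its subcategories), and the MSCA is assumed to be supplied by the caller.

```python
import math

def ic(s_k, n_total):
    """IC(k) = 1 - log(1 + s_k) / log(1 + N); the root category
    (s_k = N) gets the lowest value, IC = 0."""
    return 1.0 - math.log(1 + s_k) / math.log(1 + n_total)

def sim_lin(k1, k2, msca, subtree_size, n_total):
    """Equation (1): Lin similarity of two categories, given their most
    specific common abstraction `msca` (the common super-category with
    maximal IC)."""
    ic1 = ic(subtree_size[k1], n_total)
    ic2 = ic(subtree_size[k2], n_total)
    ic_m = ic(subtree_size[msca], n_total)
    return 2.0 * ic_m / (ic1 + ic2) if ic1 + ic2 else 1.0

def sim_page(cats_i, cats_j, sim_cat):
    """Equation (3): page similarity as the maximum similarity over all
    pairs of categories the two pages belong to."""
    return max(sim_cat(ki, kj) for ki in cats_i for kj in cats_j)
```

Note that a category compared with itself (its own MSCA) gets similarity 1, and pairs whose only common abstraction is the root (IC = 0) get similarity 0, matching the intended range of the Lin measure.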
Outline of the algorithm. Recall that SemCat uses words and phrases from the document to produce a list of categories with weights. This representation of a document can be considered as a vector of weights over all categories from the W category structure; therefore we call it the vector of categories. We use it to calculate the cosine product. We found out that the algorithm performs better when, for each category from the vector of categories, we add its super-category (according to the W hierarchy) with weight equal to the initial weight multiplied by a constant α (we used the value α = 0.33; we explain below how we calibrated this parameter). Thus we obtain the extended category vector. This process is visualized in Figure 1.

The semantic classification is made in the way described below and illustrated in Figure 2:
1. documents from the training and test sets are categorized to obtain category vectors that represent their content,
2. category vectors for all documents are changed into extended category vectors (for a constant α),
3. we classify a new document (represented by its extended category vector) by finding the nearest group (in the sense of the cosine product) in the training set.

In the literature, a group to be compared with is usually represented by its centroid. Although the method with centroids works faster, it gives poorer results. Therefore the results presented in Tables 1 – 4 are for SemCla variants that find the nearest group using all documents from the group and taking the average similarity.

Fig. 1.
Single document category representation
(Diagram: the extended category vector of a new document is compared, via sim(), with the extended category vectors of the documents in each class group, Class 1 through Class N.)

Fig. 2.
Categorization as a classification (SemCla algorithm)
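The three steps above can be sketched as follows. A minimal illustration, not the authors' implementation: category vectors are sparse dicts, each category is assumed to have a single super-category, and α = 0.33 as in the text.

```python
import math

def extend(cat_vec, super_of, alpha=0.33):
    """Step 2: extended category vector. For every category add its
    super-category with the original weight multiplied by alpha."""
    ext = dict(cat_vec)
    for cat, w in cat_vec.items():
        sup = super_of.get(cat)
        if sup is not None:
            ext[sup] = ext.get(sup, 0.0) + alpha * w
    return ext

def cosine(u, v):
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def classify_semcla(doc_vec, class_groups):
    """Step 3: pick the class whose training documents have the highest
    *average* cosine similarity to the document (the non-centroid
    variant used for Tables 1 - 4)."""
    def avg_sim(c):
        vecs = class_groups[c]
        return sum(cosine(doc_vec, v) for v in vecs) / len(vecs)
    return max(class_groups, key=avg_sim)
```

Averaging the similarity over all documents of a group, instead of comparing with its centroid, is the slower but better-performing variant described above.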
Finding the optimal α parameter. The optimal value of α was found in a separate experiment, conducted for the SemCat algorithm before the experiment discussed in this paper. We took 4 groups of documents from kopalniawiedzy.pl (astronomy-physics, psychology, medicine, technology) and drew at random N = 100 documents from each of them. We did not use all document groups from this corpus; we chose the 4 groups that were most different from each other. All documents were categorized with various values of α (too large values of α resulted in a significant deterioration of the outcomes). Then we calculated the semantic similarity between the categorized documents (for the different values of α), sorted them and ranked them. We chose the value of the parameter α that maximizes the difference between the mean ranks of documents from the same group and those belonging to different groups. In other words, we found the value that separates these groups of documents best.

Ensembles of classifiers. The experimental setting was also based on ensembles of classifiers. For each document the classification process is carried out by every classifier in the ensemble (these may also be classifiers of the same type, but trained on different learning samples). Then the results of all classifiers are aggregated into the final ensemble classification. In the existing implementation this can be done in three ways: (a) each classifier has one vote, and the category with the highest number of votes is selected; (b) vote counting additionally takes into account the weights of the classification results (this option requires that all classifiers are of the same type); (c) ranks of the elements returned by the classifiers are aggregated instead of raw votes or weights. In the case when two (or more) categories receive exactly the same number of votes, the result is selected at random from among the winning categories.
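Aggregation variants (a) and (c) can be sketched as follows; variant (b) differs only in summing classifier-supplied weights instead of unit votes. An illustrative sketch that assumes, for variant (c), that every classifier ranks all categories.

```python
import random
from collections import Counter

def vote(predictions):
    """Variant (a): one vote per classifier; ties broken at random."""
    counts = Counter(predictions)
    best = max(counts.values())
    return random.choice([c for c, n in counts.items() if n == best])

def rank_aggregate(rankings):
    """Variant (c): sum the rank positions each classifier assigns to
    every category (best category first in each ranking); the category
    with the lowest total rank wins."""
    totals = Counter()
    for ranking in rankings:
        for pos, cat in enumerate(ranking):
            totals[cat] += pos
    return min(totals, key=totals.get)
```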
In our new approach we developed a heterogeneous committee of classifiers, SemCom, that contains the supervised methods Naive Bayes, Balanced Winnow and LLDA, and our proprietary unsupervised categorization method SemCat utilizing the taxonomy of W categories. The categorization method is unsupervised, and thus it cannot be trained on different samples in a manner similar to supervised classifiers (the categorization method utilizes data from the complete W taxonomy). For this reason the committee contained only one instance of the categorization algorithm. In order to increase the impact of SemCat on the final results of the committee as a whole, categorization votes were counted with a higher weight. In addition, one should take into account that the categorization algorithm returns a ranking of categories (not only a single category). Thus, in the experimental settings we included a variant of the committee in which the categorization method adds more than one top-ranked category from its list (with correspondingly decreasing weights).

The experimental setting exploited several variants of the ensembles, trained on different subsets of the training set (W pages for Tables 1, 2 and groups of news for Tables 3, 4). For the classical classification task (Tables 1 and 2) the training samples of size S were drawn from the W pages belonging to the W categories that represent the considered classes; we will call them W class categories. When we choose W documents for training, we can choose either documents whose W categories are identical with the W class categories or documents from their sub-categories. We say that we choose level 1 (L = 1) documents if for each document at least one of its categories is identical with a class category; larger values of L additionally admit documents whose categories lie up to L levels down the sub-category hierarchy. A weight triple such as (14, 10, 6) means that we put the top three categories from the semantic categorizer into the voting with weights 14, 10 and 6, respectively. For the semantic gap task we used S = 50 for Table 3 and S = 200 for Table 4. The experimental committees consisted of 25 classifiers based on the Naive Bayes method and 25 based on Balanced Winnow. The aggregation variant was the one in which each classifier votes for one category only. More information on ensemble methods can be found in [24].
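The way the single SemCat instance contributes several top-ranked categories with decreasing weights, alongside the unit votes of the supervised classifiers, can be sketched as follows. A hypothetical illustration: the default weight triple (14, 10, 6) is one of the settings mentioned in the text, and treating supervised votes as weight 1 is our simplifying assumption.

```python
from collections import Counter

def semcom_vote(classifier_preds, semcat_ranking, semcat_weights=(14, 10, 6)):
    """Combine unit votes of the supervised classifiers with weighted
    votes for the top categories of the SemCat ranking (weights are
    illustrative); the highest-scoring category wins."""
    scores = Counter(classifier_preds)            # one vote per classifier
    for cat, w in zip(semcat_ranking, semcat_weights):
        scores[cat] += w
    return max(scores, key=scores.get)
```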
We performed two types of experiments; their results are reported in Tables 1 – 4. The first experiment aimed at demonstrating that adding a semantic categorizer to a committee of traditional classifiers improves classification correctness in the classic classification task (Tables 1, 2). The second experiment was designed to show that a semantic categorizer is capable of bridging the semantic gap between the training data and the test data (Tables 3, 4).
For experimental purposes we used two different benchmark data sets. We needed different datasets because of the different nature of the investigated problems.
Benchmark used for classification comparison. This benchmark data set was based on the Polish subdirectory of the DMOZ taxonomy / Open Directory Project. It contains 1063 text files of Polish web pages with just the HTML tags removed. The selected documents belong to 15 directories that map into W categories: astronomy, biology, economics, philosophy, physics, graphics, history, linguistics, mathematics, education, politics, law, religious studies, sociology, technology. None of these categories is a subcategory of another one in the W taxonomy. We omitted a few cases of multi-labeled documents. For the benchmark documents the reader is referred to the benchmark web page. The various options of the categorization setting cause the number of categorized documents to differ; for calculating the results we chose the set of documents that was categorized by every algorithm.

Benchmark containing data with semantic gap. The second benchmark was made of documents downloaded from various news pages. It consists of a training part and an evaluation part; they come from different domains. We used separate collections to achieve different wordings in each of them. The training set consists of news from the popular science portal kopalniawiedzy.pl merged with documents from one directory of forsal.pl, a domain about finance and economy. Below we show a more detailed description of the training set:
– documents from kopalniawiedzy.pl : astronomy-physics N=283; medicine N=2979; life science N=3122; technology N=4861; psychology N=1733; humanities N=244,
– documents from forsal.pl from the directory Giełda (Stock exchange) N=1987.

For evaluation we downloaded directories from (containing medical news) and merged them with economical documents from and (market, finances, business). Datasets used for evaluation:
– directories from : Ginekologia (Gynecology) N=1034; Kardiologia (Cardiology) N=239; Onkologia (Oncology) N=1195,
– directories from : Waluty (Currencies) N=2161; Finanse (Finances) N=1991,
– documents from N=978.
To assess the efficiency of the studied algorithms we use two different measures. The first one is the commonly used standard precision measure; the second one is a modified precision based on the Lin similarity measure (Equation (1) in Section 3.2). The difference lies in using the Lin measure instead of the indicator function. For documents d_1, ..., d_n with real categories categ(d_i) and predictions pred(d_i), the Lin precision is defined as (1/n) · ∑_{i=1}^{n} Lin(categ(d_i), pred(d_i)). The motivation for using the latter measure is that standard precision does not take into account the dependence between categories: when we make a wrong prediction, we would like to know how different the predicted category is from the real one.

The first part of the experimental work concerned a comparison of various methods of text classification. We proceeded on documents from the DMOZ corpus with the fixed set of labels described in Section 5.2. Documents were divided into separate groups based on their text length measured by the number of characters (C): short (1000 ≤ C < 2000), medium (2000 ≤ C < 10,000), long (10,000 ≤ C). Files shorter than 1000 characters were not processed. Results for the various classification methods are presented in Tables 1, 2; they are divided by file size and efficiency measure. Methods based on the categorization algorithm return a list of weighted W categories. Therefore we transformed the outcome categories into the target set of 15 categories and took only the one category with the highest weight. Categorization was based on a selection of the 10 words (only nouns) / phrases with the highest tf-idf from the document. The experiments were performed for different values of the parameters, but other settings gave worse results.

In Table 1 the first four rows present various modifications of the categorization method. The difference between them lies in the method of disambiguation of an ambiguous page. The first row presents the standard disambiguation method (see Section 3).
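The modified precision defined above can be computed as follows; `sim` is any category similarity function, e.g. the Lin measure of Equation (1). A minimal sketch: with the 0/1 indicator as `sim` it reduces to standard precision.

```python
def lin_precision(true_cats, pred_cats, sim):
    """Average similarity between the true and the predicted category,
    instead of the 0/1 indicator used by standard precision."""
    assert len(true_cats) == len(pred_cats) and true_cats
    return sum(sim(t, p) for t, p in zip(true_cats, pred_cats)) / len(true_cats)
```

A wrong prediction that lands in a semantically close category is thus penalized less than one that lands in an unrelated category.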
The next two methods first find the set of pages that map unambiguously. Then, for every ambiguous page, we find all of its mappings to potential meanings, compute their distances to that set, and sort them in descending order. Subsequent possible meanings are given weights depending on their rank i (1/i, 1/i², or uniform). All of these options gave similar means, so we used a paired t-test to compare them, with the basic disambiguation method as a reference. The methods with the rank-based weightings 1/i and 1/i² do not differ significantly; the method with uniform weights does.

All of these methods took only nouns from the document. We developed two options of mapping words into titles of W pages: we either remove from the set of candidate pages those that do not match in an exact way, or we keep them. The option “exact matching” worked slightly better (although not significantly), so we present it. Then we present individual classifiers, followed by ensembles of classifiers. Subsequent results are for the heterogeneous committee.

The second experiment focuses on the problem of the semantic gap, which is observed in classification of data from different domains. For such data, two documents often express the same concepts, but as they use different wording (because of the existence of synonyms, hypernyms, hyponyms), conventional classification / clustering algorithms based on the standard bag-of-words approach do not work well. Such classifiers often do not recognize the different linguistic representations in the test and training sets. Some works relating to the problem were presented in Section 2. Our approach, thoroughly presented above, is different from them. There are other linguistic phenomena, such as ellipsis and paraphrase; we focus on synonyms, hypernyms and hyponyms because of the Wikipedia structure on which our algorithm is based. We deal with the hyper-/hyponym relation thanks to the W category graph structure we operate on; this graph is built on these kinds of relations. We cope with the synonym relation during the phase of mapping words / phrases from the text into W pages: the string set attached to a single W page contains the page title and all its synonyms, extracted from all names of disambiguation pages that point to this particular page.

For the experimental design (see Table 3) we used standard classification methods in different settings. As an input for them we used: 1. terms – terms from the document; 2. categories – categories for a given document produced by SemCat; 3. concepts – the set of disambiguated concepts (W page ids) produced during the SemCat algorithm. In Table 4 we present SemCla, the ensembles, and the heterogeneous committee with the semantic classifier.

Results
Results

As can be seen in Table 1, the best method among the considered SemCat algorithms is the one where, upon mapping terms/phrases to W pages, the ranking of pages corresponding to a term is computed and all of them are taken into account using appropriate weights. The version using only unambiguous terms and phrases has the poorest performance. Modifications of the base method (variants of fitting, shifting the stage of category projection) do not lead to significant changes in performance.

Though SemCla outperforms individual non-semantic classifiers, a classical classifier ensemble is able to outperform SemCla. Therefore we turned to the impact of including SemCat in an ensemble of classical classifiers. The size of the ensemble (25x Balanced Winnow + 25x Bayes) guarantees the stability of the results under various selections of the random training samples.

Experimental settings included various levels of the W category graph used to create training samples (Level = 1, 2, ∞) as well as various sample sizes per category (S = 50, 100, 200, ∞). S = ∞ led to noticeably worse performance, since the W documents selected in the random sample were only vaguely related to the desired topic (category). On the other hand, in every investigated case, results for Level = 1 were worse than for Level = 2, since the randomization of the sample for each instance of the classifier was too low (the number of W documents on level 1 was not sufficient to make a sample).

The ensemble of classical classifiers was extended with SemCat (Table 2) using various weights for the 1st, 2nd and 3rd category in the SemCat ranking. This setting requires further investigation, but usually the weights 14/10/6 led to the best classification results; higher weights caused worse results. The extended ensemble 25x Balanced Winnow + 25x Bayes + SemCat with Level = 2, S = 200 and weights 14/10/6 was usually the optimal setting (with an exception for the shortest documents). Further extension of the ensemble with the LLDA classifier did not improve the results, both for the base ensemble (25x Balanced Winnow + 25x Bayes) and for the semantic ensemble that included the SemCat algorithm.

The presented experiments lead to the following conclusions: the best results were achieved by ensembles that, beside standard classification methods (25x Bayes + 25x Balanced Winnow), included a semantic method (either the SemCat or the SemCla algorithm). Surprisingly, adding a more varied set of standard classification methods (Naive Bayes, Winnow and LLDA) did not improve the quality of the ensemble. An ensemble of 25x SemCla classifiers in most cases does not perform significantly better than a single SemCla, mainly due to the low variance of the individual voting methods within the ensemble.
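The extended committee's decision rule, as we read it from the text, can be sketched as weighted voting: each base classifier casts one vote for its predicted category, and SemCat adds weighted votes (e.g. 14/10/6) for the top-3 categories of its ranking. The exact aggregation used by the authors may differ; this sketch is an assumption.

```python
# Sketch of the extended committee: base classifiers (e.g. 25x Balanced Winnow
# + 25x Bayes) each cast one vote, and SemCat contributes weighted votes for
# the top-3 categories of its ranking.
from collections import defaultdict

def committee_predict(base_predictions, semcat_top3, semcat_weights=(14, 10, 6)):
    scores = defaultdict(float)
    for label in base_predictions:              # one vote per base classifier
        scores[label] += 1.0
    for label, w in zip(semcat_top3, semcat_weights):
        scores[label] += w                      # weighted top-3 SemCat votes
    return max(scores, key=scores.get)
```

For example, with 10 base votes for "sport", 12 for "economy" and the SemCat ranking ["sport", "politics", "economy"], "sport" wins (10 + 14 = 24 against 12 + 6 = 18), illustrating how the semantic categorizer can overturn a narrow majority of classical classifiers.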
Table 1. Average values of various precision measures (Lin precision and Precision; reported separately for short, medium and long documents) on the DMOZ small dataset, for the SemCat variants (with and without disambiguation, with different rank weightings), individual classifiers (Bayes, Balanced Winnow, SemCla) and ensembles such as 25x(B,W), 25x SemCla and 25x(B,W) + LLDA. Parameter L stands for the level of W documents used for the training sample; S is the sample size per group of documents; 25x(B,W) stands for an ensemble of 25 Bayes and 25 Balanced Winnow classifiers. The vector of numbers following SemCat represents the weights attached to the top-3 categories inserted into the committee. [The numeric body of this table is not recoverable from the source text.]
Table 2. Average values of various precision measures (Lin precision and Precision; short, medium and long documents) on the DMOZ small dataset for heterogeneous committees 25x(B,W) + SemCat with S = 50, 100, 200, SemCat weight vectors (7,5,3), (10.5,7.5,4.5), (14,10,6) and (17.5,12.5,7.5), and LLDA extensions with weights 10.0, 15.0 and 20.0. Parameter L stands for the level of W documents used for the training sample; S is the sample size per group of documents; 25x(B,W) stands for an ensemble of 25 Bayes and 25 Balanced Winnow classifiers. The vector of numbers following SemCat represents the weights attached to the top-3 categories inserted into the committee. [The numeric body of this table is not recoverable from the source text.]

As visible in Tables 3 and 4, in the case of the semantic gap problem, semantic methods and committees lead to much better results than traditional classifiers, even if the latter operate on the modified representation (bag of categories instead of bag of words). It can be seen that the usage of terms alone gives poor results when a semantic gap occurs. Classical methods are helped most if categories are provided for training purposes, while the usage of concepts is only half as good. This means that our SemCla algorithm uses a much deeper insight into the document content than a mere category label assignment.

It is also worth stressing that although SemCla (contrary to SemCat) is supervised, it can also be used in an unsupervised version. In such a setting, instead of using unobservable document labels as training classes (cf. Figure 2), one can use document clusters, where the clustering is also based on the semantic categorization (SemCat algorithm) and applies the semantic similarity measures defined in Section 3.2. We are going to investigate this direction more deeply in the future, since it has a big advantage in cases where document labels are unavailable and a training set cannot be created (e.g. collections of web pages).
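The unsupervised variant suggested above (clusters in place of labels) might look roughly like the following sketch. The representation of documents as category weight dictionaries and the one-pass seed-based grouping are our simplifying assumptions, not the paper's algorithm.

```python
# Toy sketch of the unsupervised setting: documents are represented by
# SemCat-style category weight vectors (dicts), grouped around seed vectors by
# cosine similarity, and the resulting cluster ids then serve as training
# classes in place of real labels.

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(u.get(k, 0.0) * v.get(k, 0.0) for k in set(u) | set(v))
    nu = sum(x * x for x in u.values()) ** 0.5
    nv = sum(x * x for x in v.values()) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def pseudo_labels(docs, seeds):
    """Assign each document to the most similar seed; the returned cluster ids
    can be used as training classes when document labels are unavailable."""
    return [max(range(len(seeds)), key=lambda j: cosine(d, seeds[j])) for d in docs]
```

A full version would of course iterate the clustering rather than make a single pass, but the point is that the training signal comes from semantic similarity alone.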
Conclusions

In this paper we demonstrated the value of a semantic approach to the task of document classification. In particular, we showed that an unsupervised approach to classification is possible when using the semantic approach, which may be considered an interesting result in itself. Admittedly, the semantic classifier we introduce does not perform as well as ensembles of traditional classifiers, but the inclusion of a semantic categorizer into such an ensemble is capable of significantly improving its performance in classic classification tasks.

Intuitively, one would expect a classifier incorporating semantic information to be superior to traditional classifiers that do not use such information. As our experiments show, this is not obvious: though the semantic classifier proved to be a competitor for individual classic classifiers, ensembles of classic classifiers can beat it. Exploiting the advantages of semantic information therefore requires some level of sophistication and cannot be taken for granted.
Table 3. Average values of the precision measure for the classical methods, Bayes (B) and Balanced Winnow (W), trained on terms, categories and concepts.

Classification                      terms   categories  concepts
Bankier: Business Biznes   Bayes    0.397   0.634       0.376
                           Winnow   0.367   0.546       0.323
Forsal: Currencies         Bayes    0.602   0.910       0.620
                           Winnow   0.720   0.870       0.498
Forsal: Finances           Bayes    0.847   0.952       0.814
                           Winnow   0.832   0.874       0.695
Gynecology                 Bayes    0.404   0.505       0.233
                           Winnow   0.074   0.205       0.219
Cardiology                 Bayes    0.782   0.746       0.502
                           Winnow   0.350   0.438       0.427
Oncology                   Bayes    0.758   0.824       0.526
                           Winnow   0.227   0.627       0.390
Table 4. Average values of the precision measure for the “semantic classification” (SemCla), an ensemble of SemCla, an ensemble of Bayes (B) and Balanced Winnow (W) classifiers, and the heterogeneous committee. [The last value of the Cardiology row and the Oncology row are not recoverable from the source text.]

                            SemCla  25x SemCla  25x(B,W)  Heterogen. committee
Bankier (Business Biznes):  0.752   0.830       0.789     0.855
Forsal (Currencies):        0.972   0.983       0.995     0.999
Forsal (Finances):          0.979   0.986       0.965     0.986
Gynecology                  0.842   0.844       0.732     0.833
Cardiology                  0.900   0.891       0.895

What is still more important, the semantic classifier turns out to be superior to classical approaches to classification in the case of a semantic gap between the training data and the data to which the classifier is to be applied. This fact opens up truly new horizons for the application of machine learning methods to document classification, e.g. after mergers between corporations, where the local culture usually leads to the development of specific languages that differ between the firms.

This research opens up a number of further interesting areas of research. The semantic approach (in its base, unsupervised setting) could also be tested on clustering tasks under the semantic gap scenario, as well as on mixtures of classification and clustering.