Arabic Text Summarization Based on Latent Semantic Analysis to Enhance Arabic Documents Clustering
Hanane Froud, Abdelmonaime Lachkar and Said Alaoui Ouatik
L.S.I.S, E.N.S.A, University Sidi Mohamed Ben Abdellah (USMBA), Fez, Morocco
L.I.M, Faculty of Science Dhar EL Mahraz (FSDM), Fez, Morocco

ABSTRACT
Arabic document clustering is an important task for obtaining good results with traditional Information Retrieval (IR) systems, especially given the rapid growth in the number of online documents written in Arabic. Document clustering aims to automatically group similar documents into one cluster using different similarity/distance measures. This task is often affected by document length: useful information in a document is often accompanied by a large amount of noise, so it is necessary to eliminate this noise while keeping the useful information in order to boost clustering performance. In this paper, we propose to evaluate the impact of text summarization using the Latent Semantic Analysis model on Arabic document clustering in order to solve the problems cited above, using five similarity/distance measures: Euclidean Distance, Cosine Similarity, Jaccard Coefficient, Pearson Correlation Coefficient and Averaged Kullback-Leibler Divergence, in two settings: without and with stemming. Our experimental results indicate that the proposed approach effectively addresses the problems of noisy information and document length, and thus significantly improves clustering performance.

KEYWORDS
Information Retrieval Systems, Arabic Language, Arabic Text Clustering, Arabic Text Summarization, Similarity Measures, Latent Semantic Analysis, Root and Light Stemmers.

INTRODUCTION
Several research projects are investigating and exploring techniques in traditional Information Retrieval (IR) systems for English and other European languages such as French, German and Spanish, and for Asian languages such as Chinese and Japanese. For the Arabic language, however, there is little ongoing research on traditional IR systems. Moreover, traditional IR systems (without document clustering) are becoming more and more insufficient for handling huge volumes of relevant text documents, because to retrieve the documents of interest the user must formulate the query using the keywords that appear in the documents. This is a difficult task for ordinary people who are not familiar with the vocabulary of the data corpus. Document clustering may be useful as a complement to these traditional IR systems, by organizing the documents by topics (clusters) in the document feature space. Bellot & El-Bèze showed in [1] that document clustering increases the precision of IR systems for the French language.
For the Arabic language, on the other hand, Sameh H. Ghwanmeh in [2] presented a comparison study between a traditional Information Retrieval system and a clustered one. Clustering documents yielded a significant improvement in precision compared with traditional Information Retrieval systems without clustering. These results confirm those obtained by Bellot & El-Bèze [1] in their tests on the Amaryllis’99 corpora for the French language. Traditional document clustering algorithms use the full text of the documents to generate feature vectors. Such methods often produce unsatisfactory results because documents contain much noisy information. The varying length of the documents is also a significant negative factor affecting performance. In this paper, we propose to investigate the use of summarization techniques to tackle these issues when clustering documents [13].
The goal of a summary is to produce a short representation of a long document. This problem can be solved by building an abstract representation of the whole document and then generating a shorter text or by selecting a few relevant sentences of the original text.
With a large volume of text documents, presenting the user with a summary of each document greatly facilitates the task of finding the desired documents, so:
• Text Summarization can be used to save time.
• Text Summarization can speed up other information retrieval and text mining processes.
In this paper, we propose to use Latent Semantic Analysis to produce the Arabic summaries that we then use to represent the documents in the Vector Space Model (VSM) and cluster them, in order to enhance Arabic document clustering [14]. Latent Semantic Analysis (LSA) has been successfully applied to information retrieval [13] [15][16][17] as well as many other related domains. It is based on Singular Value Decomposition (SVD), a mathematical matrix decomposition technique closely akin to factor analysis that is applicable to text corpora. Recently, LSA has been introduced into generic text summarization by [18]. This paper is organized as follows. The next section describes Arabic summarization based on the Latent Semantic Analysis model. Sections 3 and 4 discuss, respectively, Arabic text preprocessing with the document representation used in the experiments, and the similarity measures. Section 5 explains the experiment settings, dataset, evaluation approaches, results and analysis. Section 6 concludes and discusses future work.

ARABIC TEXT SUMMARIZATION BASED ON LATENT SEMANTIC ANALYSIS MODEL
In this work, we propose to apply the Latent Semantic Analysis model to generic Arabic text summarization [13] [17][18][19]. The process starts with the creation of a terms-by-sentences matrix $A = [A_1\ A_2\ \dots\ A_n]$, with each column vector $A_i$ representing the weighted term-frequency vector of sentence $i$ in the document under consideration. The weighted term-frequency vector $A_i = [a_{1i}\ a_{2i}\ \dots\ a_{mi}]^T$ of sentence $i$ is defined as:

$$a_{ji} = L(t_{ji}) \cdot G(t_{ji})$$

where:
1. $L(t_{ji})$ is the local weighting for term $j$ in sentence $i$: $L(t_{ji}) = tf(t_{ji})$, where $tf(t_{ji})$ is the number of times term $j$ occurs in the sentence.
2. $G(t_{ji})$ is the global weighting for term $j$ in the whole document: $G(t_{ji}) = \log(N / n(t_{ji}))$, where $N$ is the total number of sentences in the document and $n(t_{ji})$ is the number of sentences that contain term $j$.

If there are a total of $m$ terms and $n$ sentences in the document, then we have an $m \times n$ matrix $A$ for the document. Given an $m \times n$ matrix $A$ (with $m \ge n$), the SVD of $A$ is defined as [20]:

$$A = U \Sigma V^T$$

where $U = [u_{ij}]$ is an $m \times n$ column-orthonormal matrix whose columns are called left singular vectors; $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \dots, \sigma_n)$ is an $n \times n$ diagonal matrix whose diagonal elements are non-negative singular values sorted in descending order; and $V = [v_{ij}]$ is an $n \times n$ orthonormal matrix whose columns are called right singular vectors. If $\mathrm{rank}(A) = r$, then [21] $\Sigma$ satisfies:

$$\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > \sigma_{r+1} = \dots = \sigma_n = 0$$

The interpretation of applying the SVD to the terms-by-sentences matrix $A$ can be made from two different viewpoints. From the transformation point of view, the SVD derives a mapping between the $m$-dimensional space spanned by the weighted term-frequency vectors and the $r$-dimensional singular vector space. From the semantic point of view, the SVD derives the latent semantic structure from the document represented by matrix $A$.
This operation reflects a breakdown of the original document into $r$ linearly independent base vectors or concepts. Each term and sentence from the document is jointly indexed by these base vectors/concepts. A unique feature of the SVD is that it is capable of capturing and modeling interrelationships among terms, so that it can semantically cluster terms and sentences. Furthermore, as demonstrated in [21], if a word combination pattern is salient and recurring in a document, this pattern will be captured and represented by one of the singular vectors. The magnitude of the corresponding singular value indicates the degree of importance of this pattern within the document. Any sentence containing this word combination pattern will be projected along this singular vector, and the sentence that best represents this pattern will have the largest index value with this vector. As each particular word combination pattern describes a certain topic/concept in the document, the facts described above naturally lead to the hypothesis that each singular vector represents a salient topic/concept of the document, and that the magnitude of its corresponding singular value represents the degree of importance of that topic/concept.

Based on the above discussion, the authors of [18] proposed a summarization method which uses the matrix $V^T$. This matrix describes the degree of importance of each topic in each sentence. The summarization process chooses the most informative sentence for each topic: the $k$'th sentence chosen is the one with the largest index value in the $k$'th right singular vector in matrix $V^T$. The method proposed in [18] is as follows:
1. Decompose the document $D$ into individual sentences, use these sentences to form the candidate sentence set $S$, and set $k = 1$.
2. Construct the terms-by-sentences matrix $A$ for the document $D$.
3. Perform the SVD on $A$ to obtain the singular value matrix $\Sigma$ and the right singular vector matrix $V^T$. In the singular vector space, each sentence $i$ is represented by the column vector $\psi_i = [u_{i1}\ u_{i2}\ \dots\ u_{ir}]^T$ of $V^T$.
4. Select the $k$'th right singular vector from matrix $V^T$.
5. Select the sentence which has the largest index value with the $k$'th right singular vector, and include it in the summary.
6. If $k$ reaches the predefined number, terminate the operation; otherwise, increment $k$ by one, and go to Step 4.
In Step 5 of the above operation, finding the sentence that has the largest index value with the $k$'th right singular vector is equivalent to finding the column vector $\psi_i$ whose $k$'th element $u_{ik}$ is the largest. In this paper we propose to use the above method to identify semantically important sentences for creating Arabic summaries (Figure 1), in order to enhance the Arabic document clustering task.

Figure 1. Arabic Text Summarization based on Latent Semantic Analysis Model
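The summarization steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `term_sentence_matrix` and `select_sentences` are hypothetical, the natural logarithm is assumed for the global weight, sentences are assumed to be already tokenized, and two choices are assumptions added here because the sign of a singular vector is arbitrary and the text does not say how repeats are handled: the absolute index value is used, and a sentence already in the summary is skipped.

```python
import math
import numpy as np

def term_sentence_matrix(sentences):
    """Build the m x n matrix A with a_ji = tf(t_ji) * log(N / n(t_j)).

    `sentences` is a list of token lists (one list per sentence).
    """
    terms = sorted({tok for sent in sentences for tok in sent})
    n_sentences = len(sentences)
    # n(t): number of sentences that contain the term
    containing = {t: sum(1 for s in sentences if t in s) for t in terms}
    A = np.zeros((len(terms), n_sentences))
    for j, term in enumerate(terms):
        idf = math.log(n_sentences / containing[term])  # natural log assumed
        for i, sent in enumerate(sentences):
            A[j, i] = sent.count(term) * idf
    return terms, A

def select_sentences(A, num_sentences):
    """Pick one sentence per leading right singular vector of A."""
    A = np.asarray(A, dtype=float)
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    selected = []
    for k in range(min(num_sentences, Vt.shape[0])):
        scores = np.abs(Vt[k])      # assumption: sign of the vector is ignored
        for j in selected:          # assumption: do not pick a sentence twice
            scores[j] = -1.0
        selected.append(int(np.argmax(scores)))
    return selected
```

Calling `select_sentences(A, k)` returns the indices of the chosen sentences in the order of decreasing singular value; the summary is then the corresponding original sentences.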
[Figure 1 stages: input document; decomposition into sentences and words using Table 1; weighted term-frequency vector $A_i$ of each sentence $i$; building the terms-by-sentences matrix $A = [A_1\ A_2\ \dots\ A_n]$; applying the LSA model; extracting the relevant sentences; document summary.]

After building the test corpus, we decompose each document into individual sentences. This decomposition is a source of ambiguity because punctuation is rarely used in Arabic texts and, when it exists, is not always a reliable guide for the decomposition. In addition, some words can mark the beginning of a new sentence (or proposition).
For text decomposition, [22] uses:
• A morphological decomposition based on punctuation,
• A decomposition based on the recognition of morphosyntactic markers or functional words such as: حتّى , لكن , و , أو (or, and, but, when).
However, these particles may play a role other than separating phrases. In our experiments, we use the morphosyntactic markers or functional words cited in [23] to decompose the document into individual sentences. The following table presents some examples of these markers or functional words:

Table 1. Samples of Arabic Morphosyntactic Markers and Functional Words

The Arabic Morphosyntactic Markers and Functional Words    English glosses
في , و , أو , أم , لكن , لكنْ , حتّى                          in, and, then, or, but, when
أيضا , بعد , حيث , ولهذا , وليس                              also, after, although, as before, but this, not

ARABIC TEXT PREPROCESSING
The Arabic language is the language of the Holy Quran. It is one of the six official languages of the United Nations and the mother tongue of approximately 300 million people. It is a Semitic language with an alphabet of 28 letters, written from right to left. It can be classified into three types: Classical Arabic (العربية الفصحى), Modern Standard Arabic (العربية الحديثة) and Colloquial Arabic dialects (العربية العامية). Classical Arabic is fully vowelized and is the language of the Holy Quran. Modern Standard Arabic is the official language throughout the Arab world. It is used in official documents, newspapers and magazines, in educational fields and for communication between Arabs of different nationalities. Colloquial Arabic dialects, on the other hand, are the languages spoken in the different Arab countries; the spoken forms of Arabic vary widely and each Arab country has its own dialect. Modern Standard Arabic has a rich morphology based on consonantal roots, which relies on vowel changes and, in some cases, consonantal insertions and deletions to create inflections and derivations; this makes morphological analysis a very complex task [24]. There is no capitalization in Arabic, which makes it hard to identify proper names, acronyms and abbreviations.
Arabic word stemming is a technique that aims to find the lexical root or stem (Figure 2) of words in natural language by removing the affixes attached to the root, because an Arabic word can take a more complicated form with those affixes. An Arabic word can represent a phrase in English; for example, the word meaning "to speak with them" is decomposed as follows (Table 2):

Table 2. Arabic Word Decomposition

Part       Letters   Meaning
Antefix    ل         Preposition meaning "to"
Prefix     ي         A letter indicating the tense and the person of conjugation
Root       حدث       speak
Suffix     ون        Termination of conjugation
Postfix              A pronoun meaning "them"
Figure 1.a : Stem Figure 1.b : Root Figure 1.c: Inheritance
Figure 2. An Example of Root/Stem Preprocessing.
Arabic stemming algorithms can be classified, according to the desired level of analysis, into the root-based approach (Khoja [4]) and the stem-based approach (Larkey [5]). In this section, a brief review of these two approaches to stemming Arabic text is presented.
Figure 3. Example of Preprocessing with Khoja Stemmer algorithm
[Figure 3 stages: input document; document processor and feature selection (stop-word removal, stemming with the root-based approach, term weighting); Naïve Bayesian classifier with training data; classified document; text mining application results.]

The root-based approach uses morphological analysis to extract the root of a given Arabic word. Many algorithms have been developed for this approach. The Al-Fedaghi and Al-Anzi algorithms try to find the root of a word by matching it against all possible patterns with all possible affixes attached [25]; these algorithms do not remove any prefixes or suffixes. The Al-Shalabi morphology system uses different algorithms to find roots and patterns [26]. This algorithm removes the longest possible prefix and then extracts the root by checking the first five letters of the word; it is based on the assumption that the root must appear in the first five letters of the word. Khoja has developed an algorithm that removes prefixes and suffixes, checking all the while that it is not removing part of the root, and then matches the remaining word against the patterns of the same length to extract the root [4].

The aim of the stem-based approach, or light stemmer approach, is not to produce the root of a given Arabic word but rather to remove the most frequent suffixes and prefixes. Light stemming is mentioned by several authors [27,28,5,29], but so far there is almost no standard algorithm for Arabic light stemming: all attempts in this field are sets of rules that strip off a small set of suffixes and prefixes, and there is no definitive list of these strippable affixes.

In our work, we believe that the preprocessing of Arabic documents is a challenging and crucial stage. It may impact positively or negatively on the accuracy of any text mining task; the choice of preprocessing approach therefore strongly affects the results of such tasks. To illustrate this, Figure 2 shows an example using the Khoja and Light stemmers, which produce different results: a root and a stem, respectively, related to the original word. On the other hand, the Khoja stemmer can produce wrong results; for example, the word (منظمات), which means (organizations), is stemmed to (ظمئ), which means (he was thirsty), instead of the correct root (نظم).

Prior to applying document clustering techniques to an Arabic document, the latter is typically preprocessed: it is parsed in order to remove stop words, and then words are stemmed using two well-known stemming algorithms: the Morphological Analyzer from Khoja and Garside [4], and the Light Stemmer developed by Larkey [5].
In addition, at this stage of the work, we computed the term-document matrix using the tfidf weighting scheme.
There are several ways to model a text document. For example, it can be represented as a bag of words, where words are assumed to appear independently and their order is immaterial. This model is widely used in information retrieval and text mining [6]. Each word corresponds to a dimension in the resulting data space, and each document then becomes a vector consisting of non-negative values on each dimension. Let $D = \{d_1, \dots, d_n\}$ be a set of documents and $T = \{t_1, \dots, t_m\}$ the set of distinct terms occurring in $D$. A document is then represented as an $m$-dimensional vector $\vec{t_d}$. Let $tf(d, t)$ denote the frequency of term $t \in T$ in document $d \in D$. Then the vector representation of a document $d$ is:

$$\vec{t_d} = (tf(d, t_1), \dots, tf(d, t_m))$$

Although more frequent words are assumed to be more important, this is not usually the case in practice (for example, Arabic words such as إلى, which means "to", and في, which means "in"). In fact, more complicated strategies such as the tfidf weighting scheme described below are normally used instead. We therefore chose in this work to compute the tfidf weight of each term for the document representation.
In practice, terms that appear frequently in a small number of documents but rarely in the other documents tend to be more relevant and specific for that particular group of documents, and therefore more useful for finding similar documents. In order to capture these terms and reflect their importance, we transform the basic term frequencies $tf(d, t)$ into the tfidf (term frequency and inverse document frequency) weighting scheme.
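As an illustration of this transformation, the sketch below builds tfidf vectors for a small tokenized collection using the weighting $tf(d,t) \cdot \log(|D| / df(t))$. The function name `tfidf_vectors` is hypothetical, the natural logarithm is an assumption (the base is not stated), and stop-word removal and stemming are assumed to have been applied beforehand.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (terms, vectors) where vectors[i][j] = tf * log(|D| / df).

    `docs` is a list of token lists, one per document.
    """
    terms = sorted({tok for doc in docs for tok in doc})
    # df(t): number of documents in which term t appears
    df = Counter(tok for doc in docs for tok in set(doc))
    n_docs = len(docs)
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequencies; missing terms count as 0
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in terms])
    return terms, vectors
```

A term that occurs in every document receives weight 0 under this scheme, which is how frequent function words are discounted.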
Tfidf weights the frequency of a term $t$ in a document $d$ with a factor that discounts its importance according to its appearances in the whole document collection. It is defined as:

$$tfidf(d, t) = tf(d, t) \times \log\left(\frac{|D|}{df(t)}\right)$$

Here $df(t)$ is the number of documents in which term $t$ appears, and $|D|$ is the number of documents in the dataset. We use $w_{t,d}$ to denote the weight of term $t$ in document $d$ in the following sections.

SIMILARITY MEASURES
In this section we discuss the five similarity measures that were tested in [3]; we use these five measures in our work to perform the Arabic text document clustering.
Not every distance measure is a metric. To qualify as a metric, a measure $d$ must satisfy the following four conditions. Let $x$ and $y$ be any two objects in a set and $d(x, y)$ be the distance between $x$ and $y$.
1. The distance between any two points must be non-negative, that is, $d(x, y) \ge 0$.
2. The distance between two objects must be zero if and only if the two objects are identical, that is, $d(x, y) = 0$ if and only if $x = y$.
3. Distance must be symmetric, that is, the distance from $x$ to $y$ is the same as the distance from $y$ to $x$, i.e. $d(x, y) = d(y, x)$.
4. The measure must satisfy the triangle inequality, which is $d(x, z) \le d(x, y) + d(y, z)$.

Euclidean distance is widely used in clustering problems, including clustering text. It satisfies all four of the above conditions and is therefore a true metric. It is also the default distance measure used with the K-means algorithm. For measuring the distance between text documents, given two documents $d_a$ and $d_b$ represented by their term vectors $\vec{t_a}$ and $\vec{t_b}$ respectively, the Euclidean distance of the two documents is defined as

$$D_E(\vec{t_a}, \vec{t_b}) = \left( \sum_{t=1}^{m} |w_{t,a} - w_{t,b}|^2 \right)^{1/2}$$

where the term set is $T = \{t_1, \dots, t_m\}$. As mentioned previously, we use the tfidf value as term weights, that is, $w_{t,a} = tfidf(d_a, t)$.

Cosine similarity is one of the most popular similarity measures applied to text documents, for example in numerous information retrieval applications [6] and in clustering [7]. Given two documents $\vec{t_a}$ and $\vec{t_b}$, their cosine similarity is:

$$SIM_C(\vec{t_a}, \vec{t_b}) = \frac{\vec{t_a} \cdot \vec{t_b}}{|\vec{t_a}| \times |\vec{t_b}|}$$

where $\vec{t_a}$ and $\vec{t_b}$ are $m$-dimensional vectors over the term set $T = \{t_1, \dots, t_m\}$. Each dimension represents a term with its weight in the document, which is non-negative. As a result, the cosine similarity is non-negative and bounded in $[0, 1]$.
An important property of the cosine similarity is its independence of document length. For example, if two identical copies of a document $d$ are combined to give a new pseudo-document $d'$, the cosine similarity between $d$ and $d'$ is 1, which means that these two documents are regarded as identical.

The Jaccard coefficient, which is sometimes referred to as the Tanimoto coefficient, measures similarity as the intersection divided by the union of the objects. For text documents, the Jaccard coefficient compares the sum weight of shared terms to the sum weight of terms that are present in either of the two documents but are not shared. The formal definition is:

$$SIM_J(\vec{t_a}, \vec{t_b}) = \frac{\vec{t_a} \cdot \vec{t_b}}{|\vec{t_a}|^2 + |\vec{t_b}|^2 - \vec{t_a} \cdot \vec{t_b}}$$
The Jaccard coefficient is a similarity measure and ranges between 0 and 1. It is 1 when $\vec{t_a} = \vec{t_b}$ and 0 when $\vec{t_a}$ and $\vec{t_b}$ are disjoint. The corresponding distance measure is $D_J = 1 - SIM_J$, and we will use $D_J$ instead in subsequent experiments.
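The three measures defined so far can be sketched directly from their formulas. The following is a minimal pure-Python illustration on plain lists of term weights; the function names are hypothetical, and returning a similarity of 0 when a vector is all zeros is an assumption made here to avoid division by zero.

```python
import math

def dot(a, b):
    """Inner product of two equal-length weight vectors."""
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    """D_E: square root of the summed squared weight differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """SIM_C: dot product divided by the product of the vector norms."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm > 0 else 0.0  # assumption for zero vectors

def jaccard_similarity(a, b):
    """SIM_J: dot product over |a|^2 + |b|^2 - dot product."""
    shared = dot(a, b)
    denom = dot(a, a) + dot(b, b) - shared
    return shared / denom if denom > 0 else 0.0   # assumption for zero vectors

def jaccard_distance(a, b):
    """D_J = 1 - SIM_J, the form used for clustering."""
    return 1.0 - jaccard_similarity(a, b)
```

Doubling every weight of a vector leaves its cosine similarity with the original at 1, which is the length-independence property noted above; the Euclidean distance between the same two vectors is not zero.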
Pearson’s correlation coefficient is another measure of the extent to which two vectors are related. There are different forms of the Pearson correlation coefficient formula. Given the term set $T = \{t_1, \dots, t_m\}$, a commonly used form is

$$SIM_P(\vec{t_a}, \vec{t_b}) = \frac{m \sum_{t=1}^{m} w_{t,a} \times w_{t,b} - TF_a \times TF_b}{\sqrt{\left[ m \sum_{t=1}^{m} w_{t,a}^2 - TF_a^2 \right] \left[ m \sum_{t=1}^{m} w_{t,b}^2 - TF_b^2 \right]}}$$

where $TF_a = \sum_{t=1}^{m} w_{t,a}$ and $TF_b = \sum_{t=1}^{m} w_{t,b}$.

This is also a similarity measure. However, unlike the other measures, it ranges from -1 to +1, and it is 1 when $\vec{t_a} = \vec{t_b}$. In subsequent experiments we use the corresponding distance measure, which is $D_P = 1 - SIM_P$ when $SIM_P \ge 0$ and $D_P = |SIM_P|$ when $SIM_P < 0$.

In information-theory-based clustering, a document is considered as a probability distribution of terms. The similarity of two documents is measured as the distance between the two corresponding probability distributions. The Kullback-Leibler divergence (KL divergence), also called the relative entropy, is a widely applied measure for evaluating the differences between two probability distributions. Given two distributions $P$ and $Q$, the KL divergence from distribution $P$ to distribution $Q$ is defined as

$$D_{KL}(P \| Q) = P \log\left(\frac{P}{Q}\right)$$

In the document scenario, the divergence between two distributions of words is:

$$D_{KL}(\vec{t_a} \| \vec{t_b}) = \sum_{t=1}^{m} w_{t,a} \times \log\left(\frac{w_{t,a}}{w_{t,b}}\right)$$

However, unlike the previous measures, the KL divergence is not symmetric, i.e. $D_{KL}(P \| Q) \ne D_{KL}(Q \| P)$. Therefore it is not a true metric. As a result, we use the averaged KL divergence instead, which is defined as:

$$D_{AvgKL}(P \| Q) = \pi_1 D_{KL}(P \| M) + \pi_2 D_{KL}(Q \| M)$$

where $\pi_1 = \frac{P}{P + Q}$, $\pi_2 = \frac{Q}{P + Q}$ and $M = \pi_1 P + \pi_2 Q$.
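The Pearson distance and the averaged KL divergence defined above can be sketched as follows, applying the averaged KL definition term by term to two weight vectors. This is an illustration only: the function names are hypothetical, the natural logarithm is assumed, and two conventions are assumptions added here, namely that a zero weight contributes nothing to a KL term (the usual $0 \log 0 = 0$ convention) and that the Pearson similarity is taken as 0 when a vector has zero variance.

```python
import math

def pearson_similarity(a, b):
    """SIM_P computed from the weight sums TF_a and TF_b."""
    m = len(a)
    tf_a, tf_b = sum(a), sum(b)
    num = m * sum(x * y for x, y in zip(a, b)) - tf_a * tf_b
    var_a = m * sum(x * x for x in a) - tf_a ** 2
    var_b = m * sum(y * y for y in b) - tf_b ** 2
    den = math.sqrt(max(0.0, var_a * var_b))
    return num / den if den > 0 else 0.0  # assumption for constant vectors

def pearson_distance(a, b):
    """D_P = 1 - SIM_P when SIM_P >= 0, and |SIM_P| otherwise."""
    sim = pearson_similarity(a, b)
    return 1.0 - sim if sim >= 0 else abs(sim)

def _kl_term(x, y):
    """x * log(x / y), with the 0 * log 0 = 0 convention (assumption)."""
    return x * math.log(x / y) if x > 0 else 0.0

def averaged_kl_divergence(a, b):
    """Sum over terms of pi1 * D(w_a || w) + pi2 * D(w_b || w)."""
    total = 0.0
    for w_a, w_b in zip(a, b):
        if w_a + w_b == 0:
            continue  # term absent from both documents
        pi1 = w_a / (w_a + w_b)
        pi2 = w_b / (w_a + w_b)
        w = pi1 * w_a + pi2 * w_b  # averaged weight for this term
        total += pi1 * _kl_term(w_a, w) + pi2 * _kl_term(w_b, w)
    return total
```

Swapping the two arguments of `averaged_kl_divergence` exchanges the roles of the two weights in every term and leaves each summand unchanged, which is the symmetry that motivates the averaged form.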
For documents, the averaged KL divergence can be computed with the following formula:

$$D_{AvgKL}(\vec{t_a} \| \vec{t_b}) = \sum_{t=1}^{m} \left( \pi_1 \times D(w_{t,a} \| w_t) + \pi_2 \times D(w_{t,b} \| w_t) \right)$$

where $\pi_1 = \frac{w_{t,a}}{w_{t,a} + w_{t,b}}$, $\pi_2 = \frac{w_{t,b}}{w_{t,a} + w_{t,b}}$ and $w_t = \pi_1 \times w_{t,a} + \pi_2 \times w_{t,b}$.

The average weighting between two vectors ensures symmetry, that is, the divergence from document $i$ to document $j$ is the same as the divergence from document $j$ to document $i$. The averaged KL divergence has recently been applied to clustering text documents, such as in the family of Information Bottleneck clustering algorithms [8], to good effect.

EXPERIMENTS AND RESULTS
In our experiments (Figure 4), we used the K-means algorithm as the document clustering method. It works with distance measures, which basically aim to minimize the within-cluster distances. Therefore, similarity measures do not directly fit into the algorithm, because smaller values indicate dissimilarity. The Euclidean distance and the averaged KL divergence are distance measures, while the cosine similarity, Jaccard coefficient and Pearson coefficient are similarity measures. [3] applies a simple transformation to convert the similarity measures to distance values. Because both cosine similarity and Jaccard coefficient are bounded in $[0, 1]$ and monotonic, we take $D = 1 - SIM$ as the corresponding distance value. For the Pearson coefficient, which ranges from −1 to +1, we take $D = 1 - SIM$ when $SIM \ge 0$ and $D = |SIM|$ when $SIM < 0$.

For the testing dataset, we experimented with the different similarity measures in three settings: without stemming, with stemming using the Morphological Analyzer from Khoja and Garside [4], and with stemming using the Light Stemmer [5]. Each setting was run in two cases: in the first, we apply the method proposed above to summarize all the documents in the dataset and then cluster them; in the second, we cluster the original documents without summarization. Moreover, each experiment was run 5 times with a different initial seed set, and the results are the averaged values over the 5 runs. The testing dataset [9] (Corpus of Contemporary Arabic (CCA)) is composed of 12 categories, each of which contains documents from websites and from Radio Qatar. A summary of the testing dataset is shown in Table 3.

Figure 4. Description of Our Experiments
[Figure 4 stages: input data (test corpus, heterogeneous dataset of 12 categories: Economics, Politics, etc.); with summarization using the LSA model or without summarization; without stemming, or with stemming (removing stop words such as في and إلى, then applying the stemming approaches: root-based approach with the Khoja stemmer, stem-based approach with the Light stemmer); Vector Space Model; compute similarity (Euclidean Distance, Cosine Similarity, Jaccard Coefficient, Pearson Correlation, KLD); K-means clustering; clustered documents.]
As mentioned previously, the baseline method is the full-text representation: for each document, we removed stop words and stemmed the remaining words using the Khoja stemmer and the Larkey stemmer. Then, to illustrate the benefits of our proposed approach, we used document summaries to cluster our dataset.

Table 3. Number of texts and number of terms in each category of the testing dataset
Text Categories        Number of Texts   Number of Terms
Economics              29                67 478
Education              10                25 574
Health and Medicine    32                40 480
Interviews             24                58 408
Politics               9                 46 291
Recipes                9                 4 973
Religion               19                111 199
Science                45                104 795
Sociology              30                85 688
Spoken                 7                 5 605
Sports                 3                 8 290
Tourist and Travel     61                46 093
The quality of the clustering result was evaluated using two evaluation measures, purity and entropy, which are widely used to evaluate the performance of unsupervised learning algorithms [10] [11]. The purity measure evaluates the coherence of a cluster, that is, the degree to which a cluster contains documents from a single category. Given a particular cluster $C_i$ of size $n_i$, the purity of $C_i$ is formally defined as:

$$P(C_i) = \frac{1}{n_i} \max_h (n_i^h)$$

where $\max_h (n_i^h)$ is the number of documents that are from the dominant category in cluster $C_i$, and $n_i^h$ represents the number of documents from cluster $C_i$ assigned to category $h$. In general, the higher the purity value, the better the quality of the cluster.

The entropy measure evaluates the distribution of categories in a given cluster. The entropy of a cluster $C_i$ with size $n_i$ is defined to be

$$E(C_i) = -\frac{1}{\log c} \sum_{h=1}^{k} \frac{n_i^h}{n_i} \log\left(\frac{n_i^h}{n_i}\right)$$

where $c$ is the total number of categories in the data set and $n_i^h$ is the number of documents from the $h$th class that were assigned to cluster $C_i$. The entropy measure is more comprehensive than purity because, rather than just considering the number of objects in and not in the dominant category, it considers the overall distribution of all the categories in a given cluster. Contrary to the purity measure, for an ideal cluster with documents from only a single category, the entropy of the cluster will be 0. In general, the smaller the entropy value, the better the quality of the cluster. Moreover, the averaged entropy of the overall solution is defined to be the weighted sum of the individual entropy values of each cluster, that is,

$$Entropy = \sum_{i=1}^{k} \frac{n_i}{n} E(C_i)$$

where $n$ is the number of documents in our dataset.

In the following, Table 4 and Table 5 show the average purity and entropy results for each similarity/distance measure with the Morphological Analyzer from Khoja and Garside [4], the Larkey stemmer [5], and without stemming, using the full-text representation. Table 6 and Table 7, on the other hand, illustrate the results using document summaries with the same stemmers and similarity/distance measures.
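The purity and entropy measures defined above can be sketched as follows. Each cluster is represented here as a list of the true category labels of its documents; the function names are hypothetical, and the overall purity computed as a size-weighted sum (mirroring the overall entropy) is an assumption, since only the overall entropy is written out explicitly in the text.

```python
import math
from collections import Counter

def cluster_purity(labels):
    """P(C_i): share of the cluster taken by its dominant category."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

def cluster_entropy(labels, num_categories):
    """E(C_i): category entropy of the cluster, normalized by log c."""
    n_i = len(labels)
    total = 0.0
    for count in Counter(labels).values():
        p = count / n_i
        total += p * math.log(p)
    return -total / math.log(num_categories)

def overall_entropy(clusters, num_categories):
    """Weighted sum of cluster entropies, with weights n_i / n."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c, num_categories)
               for c in clusters)

def overall_purity(clusters):
    """Size-weighted sum of cluster purities (assumed form)."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_purity(c) for c in clusters)
```

For example, with two categories, a cluster holding one document of each category has entropy 1 and purity 0.5, while a cluster holding documents of a single category has entropy 0 and purity 1.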
In Table 4, with Khoja’s stemmer, the overall purity values for the Euclidean Distance, the Cosine Similarity and the averaged KL Divergence are quite similar and are poor relative to the other measures. Meanwhile, the Jaccard measure is the best at generating coherent clusters, with a considerable purity score. With Larkey’s stemmer, the purity value of the averaged KL Divergence measure is the best, although it differs by only 1% from the other four measures.

Table 4. Purity and Entropy Results with Khoja’s Stemmer and Larkey’s Stemmer Using Full-Text Representation
Table 5 shows higher purity scores (0.77) than Table 4 for the Euclidean Distance, the Cosine Similarity and the Jaccard measures. The Pearson Correlation and the averaged KL Divergence, on the other hand, give purity values that are similar to each other but still better than their values in Table 4. The overall entropy value for each measure is shown in both tables. Again, the best results are in Table 5, which shows better and similar entropy values for the Euclidean Distance, the Cosine Similarity and the Jaccard measures. However, the averaged KL Divergence performs worse than the other measures, although still better than it does in Table 4.

[Table fragment surviving in the source: column headers Euclidean, Cosine, Jaccard, Pearson, KLD; row "Khoja’s stemmer, Entropy"; surviving values: 0.26, 0.286, 0.286, 0.286, 0.286]

Table 5. Purity and Entropy Results without Stemming Using Full-Text Representation
Table 6 presents the average purity and entropy results for each similarity/distance measure using document summaries instead of the full-text representation, with Khoja’s stemmer and Larkey’s stemmer.

As shown in Table 6, for both stemmers, the Euclidean Distance, Cosine Similarity and Jaccard measures are slightly better at generating coherent clusters, that is, clusters with higher purity and lower entropy scores. The Pearson and KLD measures, on the other hand, perform worse than the other measures. Comparing these results with those in Table 4, we can conclude that the scores improved, especially the overall entropy values.

Table 6. Purity and Entropy Results with Khoja’s Stemmer and Larkey’s Stemmer Using Documents Summaries
A closer look at Tables 5 and 7 shows that, in the latter, the overall entropy values of the Euclidean Distance, Cosine Similarity, Jaccard and Pearson measures are nearly identical, which demonstrates their ability to produce coherent clusters. On the one hand, the purity scores in Table 6 (Khoja’s stemmer, Larkey’s stemmer) are generally higher than those in Table 7 for all similarity/distance measures. On the other hand, the overall entropy values in Table 6 for the Euclidean Distance, the Cosine Similarity and the Jaccard measures with Khoja’s stemmer are worse than those in Table 7. With Larkey’s stemmer, however, the overall entropy values for each measure show the opposite behaviour relative to Table 7.

Table 7. Purity and Entropy Results without Stemming Using Documents Summaries
[Table residue surviving in the source: column headers Euclidean, Cosine, Jaccard, Pearson, KLD; row labels "Khoja’s stemmer" and "Entropy"; one surviving value: 0.154]

The above results lead us to two conclusions. First, Tables 4 and 5 show that stemming negatively affects the clustering. This is mainly due to the ambiguity created when stemming is applied (for example, we can obtain two roots that are made of the same letters but are semantically different). This observation broadly agrees with M. El Kourdi, A. Bensaid and T. Rachidi [12], and with our own work in [14][17]. Second, the overall entropy values shown in Tables 6 and 7 indicate that summarizing documents can make their topics salient and improve the clustering performance [13], both with and without stemming. However, the purity values obtained do not appear to improve the clustering task. This may be due to a poor choice of the number of sentences in the summaries, since this number has a great impact on the quality of the summaries and could therefore lead to different clustering results. Too few sentences result in a very sparse vector representation that is not enough to represent the document fully. Too many sentences may introduce noise and reduce the benefits of summarization.

CONCLUSION
In this paper, we have illustrated the benefits of summarization using the Latent Semantic Analysis model by comparing clustering results based on summaries with the full-text baseline on Arabic documents clustering, for five similarity/distance measures in three settings: without stemming, with stemming using Khoja’s stemmer, and with stemming using Larkey’s stemmer. We found that the Euclidean Distance, the Cosine Similarity and the Jaccard measures have comparable effectiveness on the partitional Arabic documents clustering task and find more coherent clusters when stemming is not used with the full-text representation. The Pearson Correlation and the averaged KL Divergence, on the other hand, give results that are similar to each other but no better than those of the other measures in the same setting. Instead of using the full text as the representation for document clustering, we use the LSA model as a summarization technique to eliminate noise from the documents and to select the most salient sentences to represent the original documents. Furthermore, summarization can help overcome the problem of varying document length. In our experiments using document summaries, we observe that the Euclidean Distance, the Cosine Similarity and the Jaccard measures again have comparable effectiveness and produce more coherent clusters than the Pearson Correlation and the averaged KL Divergence, both with and without stemming.

REFERENCES

[1] P. Bellot and M. El-Bèze, “Clustering by means of Unsupervised Decision Trees or Hierarchical and K-means-like Algorithm”, in Proc. of RIAO 2000, pp. 344-363.
[2] Sameh H. Ghwanmeh, “Applying Clustering of Hierarchical K-means-like Algorithm on Arabic Language”, International Journal of Information Technology (IJIT), Volume 3, Number 3, 2007, pp. 168-172.
[3] A. Huang, “Similarity Measures for Text Document Clustering”, NZCSRSC 2008, April 2008, Christchurch, New Zealand.
[4] [Entry text missing in the source; cited in the text as the Morphological Analyzer from Khoja and Garside.]
[5] Larkey, Leah S., Ballesteros, Lisa, and Connell, Margaret. (2002) “Improving Stemming for Arabic Information Retrieval: Light Stemming and Co-occurrence Analysis”. In Proceedings of the 25th Annual International Conference on Research and Development in Information Retrieval (SIGIR 2002), Tampere, Finland, August 11-15, 2002, pp. 275-282.
[6] R. B. Yates and B. R. Neto. “Modern Information Retrieval”. Addison-Wesley, New York, 1999.
[7] B. Larsen and C. Aone. “Fast and Effective Text Mining using Linear-time Document Clustering”. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999.
[8] N. Z. Tishby, F. Pereira, and W. Bialek. “The Information Bottleneck Method”. In Proceedings of the 37th Allerton Conference on Communication, Control and Computing, 1999.
[9] L. Al-Sulaiti and E. Atwell, “The Design of a Corpus of Contemporary Arabic”, University of Leeds.
[10] Y. Zhao and G. Karypis. “Evaluation of Hierarchical Clustering Algorithms for Document Datasets”. In Proceedings of the International Conference on Information and Knowledge Management, 2002.
[11] Y. Zhao and G. Karypis. “Empirical and Theoretical Comparisons of Selected Criterion Functions for Document Clustering”. Machine Learning, 55(3), 2004.
[12] M. El Kourdi, A. Bensaid, and T. Rachidi. “Automatic Arabic Document Categorization Based on the Naïve Bayes Algorithm”. School of Science & Engineering, Alakhawayn University.
[13] Xuanhui Wang, Dou Shen, Hua-Jun Zeng, Zheng Chen, and Wei-Ying Ma. “Web page clustering enhanced by summarization”. Proceedings of the 2004 ACM CIKM International Conference on Information and Knowledge Management, Washington, DC, USA, November 8-13, 2004.
[14] H. Froud, A. Lachkar, S. Ouatik, and R. Benslimane (2010). “Stemming and Similarity Measures for Arabic Documents Clustering”. 5th International Symposium on I/V Communications and Mobile Networks (ISIVC), IEEE Xplore.
[15] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman, “Indexing by latent semantic analysis”, Journal of the American Society for Information Science, vol. 41, pp. 391-407, 1990.
[16] H. Froud, A. Lachkar, S. Alaoui Ouatik, “Stemming Versus Light Stemming for Measuring the Similarity between Arabic Words with Latent Semantic Analysis Model”, 2nd International Colloquium in Information Science and Technology (CIST), 22-24 October, 2012. IEEE Xplore.
[17] H. Froud, A. Lachkar, S. Alaoui Ouatik, “A Comparative Study Of Root-Based And Stem-Based Approaches For Measuring The Similarity Between Arabic Words For Arabic Text Mining Applications”, Advanced Computing: An International Journal (ACIJ).
[18] Y.H. Gong and X. Liu, “Generic text summarization using relevance measure and latent semantic analysis”, Proc. of the 24th Annual International ACM SIGIR, pp. 19-25, 2001.
[19] J. Steinberger and K. Jezek, “Using Latent Semantic Analysis in text summarization and summary evaluation”. In Proceedings of ISIM ’04, 2004, pp. 93-100.
[20] W. Press et al., Numerical Recipes in C: The Art of Scientific Computing. Cambridge, England: Cambridge University Press, 2nd ed., 1992.
[21] M. W. Berry, S. T. Dumais, G. W. O’Brien, “Using Linear Algebra for Intelligent Information Retrieval”. SIAM Review, 1995.
[22] R. Ouersighni, 2001. “A major offshoot of the DIINAR-MBC project: AraParse, a morphosyntactic analyzer for unvowelled Arabic texts”, ACL/EACL 2001 Workshop on Arabic Language Processing, Toulouse, July 2001, pp. 9-16.
[23] [Arabic-language entry; text not recoverable from the source.]
[24] Abu-Hamdiyyah, Mohammad. 2000. “The Qur'An: An Introduction”.
[25] S. Al-Fedaghi and F. Al-Anzi. “A new algorithm to generate Arabic root-pattern forms”. In Proceedings of the 11th National Computer Conference and Exhibition, pp. 391-400. March 1989.
[26] R. Al-Shalabi and M. Evens. “A computational morphology system for Arabic”. In Workshop on Computational Approaches to Semitic Languages, COLING-ACL98. August 1998.
[27] M. Aljlayl and O. Frieder. “On Arabic search: improving the retrieval effectiveness via a light stemming approach”. In ACM CIKM 2002 International Conference on Information and Knowledge Management, McLean, VA, USA, pp. 340-347. 2002.
[28] L. Larkey and M. E. Connell. “Arabic information retrieval at UMass in TREC-10”. Proceedings of TREC 2001, Gaithersburg: NIST. 2001.
[29] A. Chen and F. Gey. “Building an Arabic Stemmer for Information Retrieval”. In Proceedings of the 11th Text Retrieval Conference (TREC 2002), National Institute of Standards and Technology. 2002.
Authors

Miss. Hanane Froud is a PhD student in the Laboratory of Information Science and Systems, Ecole Nationale des Sciences Appliquées, University Sidi Mohamed Ben Abdellah (USMBA), Fez, Morocco. She has also presented different papers at national and international conferences.
Pr. Abdelmonaime Lachkar received his PhD degree in computer science from the USMBA, Morocco, in 2004. He is working as a Professor and Head of Computer Science and Engineering (E.N.S.A) at University Sidi Mohamed Ben Abdellah (USMBA), Fez, Morocco. His current research interests include Arabic Text Mining Applications (Arabic Web Document Clustering and Categorization, Arabic Information Retrieval Systems, Arabic Text Summarization, etc.), Image Indexing and Retrieval, 3D Shape Indexing and Retrieval in large 3D Objects Databases, Color Image Segmentation, Unsupervised Clustering, Cluster Validity Index, on-line and off-line Arabic and Latin handwritten recognition, and Medical Image Applications.
Pr. Said Alaoui Ouatik is working as a Professor in the Department of Computer Science, Faculty of Science Dhar EL Mahraz (FSDM), Fez, Morocco. His research interests include high-dimensional indexing and content-based retrieval, Arabic Document Categorization, and 2D/3D Shapes Indexing and Retrieval in large 3D Objects Databases.