A Markov Random Field Topic Space Model for Document Retrieval
Scott Hand
Abstract
This paper proposes a novel statistical approach to intelligent document retrieval. It seeks to offer a more structured and extensible mathematical approach to the term generalization done in the popular Latent Semantic Analysis (LSA) approach to document indexing. A Markov Random Field (MRF) is presented that captures relationships between terms and documents as probabilistic dependence assumptions between random variables. From there, it uses the MRF-Gibbs equivalence to derive joint probabilities as well as local probabilities for document variables. A parameter learning method is proposed that utilizes rank reduction with singular value decomposition, in a manner similar to LSA, to reduce the dimensionality of document-term relationships to that of a latent topic space. Experimental results confirm the ability of this approach to effectively and efficiently retrieve documents from substantial data sets.
1 Introduction

Research in the field of information retrieval is becoming increasingly important as large sources of data become available and users become accustomed to powerful and flexible ways of processing this information. It is now accepted that simple data retrieval methods based on naive term matching fail to function effectively for large and varied bodies of data [1]. In particular, users are beginning to seek methods of retrieval that examine the meanings of queries rather than the queries themselves. One promising approach to this, Latent Semantic Analysis (LSA), was proposed by [2] as an attempt to generalize terms into latent topic concepts using linear algebra techniques. We seek to provide a more structured approach to accomplishing term generalization similar to LSA using a Markov Random Field model. We believe that this approach has a more solid foundation and provides researchers with a better understanding of the underlying mathematics and potential for extension.

1.1.1 Latent Semantic Analysis

Latent Semantic Analysis (LSA) is a method used in information retrieval for smoothing sets of document-term data. Documents in a large collection are subject to statistical over-specification, as each one only contains a small fraction of the terms despite being relevant with respect to many other terms. LSA expands upon a vector-space model [3] in which documents are represented as row vectors of terms. A co-occurrence matrix X representing a collection of documents can be defined as a matrix whose rows are term vectors T and columns are document vectors D.

X = \begin{bmatrix} x_{1,1} & \cdots & x_{1,n} \\ \vdots & \ddots & \vdots \\ x_{m,1} & \cdots & x_{m,n} \end{bmatrix}

The value x_{t,d} refers to the number of times term t appears in document d. This representation is convenient because it allows the similarity of any column vector d of matrix X and a query vector q to be calculated as the cosine of the angle between the two vectors using:

\cos(\theta) = \frac{d \cdot q}{\|d\| \, \|q\|}   (1)

One problem with this approach is that, since it treats terms as independent, it fails to capture the semantic relationship between synonyms and other examples of distinct but related terms. It also results in poor and uneven recall because it relies on the specific wording of the query, and, without any smoothing, many relevant documents could be missed due to lexical discrepancies.

LSA attempts to generalize terms into a latent topic space by reducing the dimensionality of the co-occurrence matrix. This is accomplished by first taking a Singular Value Decomposition of the co-occurrence matrix. This produces three new matrices, U, S, and V, such that X = USV^T. U and V contain orthogonal column vectors while S is a diagonal matrix. The diagonal of S forms a vector of singular values \sigma.
\begin{bmatrix} x_{1,1} & \cdots & x_{1,n} \\ \vdots & \ddots & \vdots \\ x_{m,1} & \cdots & x_{m,n} \end{bmatrix} = \begin{bmatrix} u_1 & \cdots & u_r \end{bmatrix} \begin{bmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_r \end{bmatrix} \begin{bmatrix} v_1 \\ \vdots \\ v_r \end{bmatrix}

To reduce the dimensionality of the matrix, a number k of the singular values are kept, and the rest are discarded. The number of singular values to keep is arbitrary, but implementations almost always keep large singular values and discard small ones. Intuitively, these larger values are important to the document collection, while smaller ones only serve to contribute to the over-specification.

The product of the resulting matrices U_k, S_k, and V_k^T produces a dimensionally reduced co-occurrence matrix X_k.

\begin{bmatrix} x_{1,1} & \cdots & x_{1,n} \\ \vdots & \ddots & \vdots \\ x_{m,1} & \cdots & x_{m,n} \end{bmatrix} = \begin{bmatrix} u_1 & \cdots & u_k \end{bmatrix} \begin{bmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_k \end{bmatrix} \begin{bmatrix} v_1 \\ \vdots \\ v_k \end{bmatrix}

Here, the vectors u_i = (u_{i,1}, \ldots, u_{i,n}) and v_i = (v_{i,1}, \ldots, v_{i,m}) are left and right singular row vectors for X.

To compare documents and terms in this new latent space, it must be shown that there exists an analog in LSA to the inner product used for finding the similarity in the original vector space model. The dot products between all documents in the collection are calculated with X^T X. The following manipulations [2] show that this is equivalent to the following latent space concept:

X^T X = (USV^T)^T USV^T = VSU^T USV^T = VSSV^T = (VS)(VS)^T   (2)

This means that document comparison is now possible by using the inner products of rows from the VS matrix from equation 2. A comparison among terms is done similarly, by first taking:

XX^T = USV^T (USV^T)^T = USV^T VSU^T = USSU^T = (US)(US)^T   (3)

The inner products of rows from equation 3's US matrix allow terms to be compared.

Finally, a query is represented as a new document vector q containing the term counts found in the query. This can be transformed into the latent space as q_k = q^T U_k S_k^{-1}. The previously mentioned method for comparing documents in latent space can now be utilized to rank documents.

LSA is a useful technique for improving the quality of query results, but it suffers from a weak mathematical foundation that does not provide a solid set of statistical assumptions about its operations. As it does not specify any kind of generative model, it produces no clear normalized probability distribution, and instead focuses on finding a rank k matrix that minimizes the Frobenius norm error with the co-occurrence matrix. While using Singular Value Decompositions with limited singular values has been shown [4] to always produce such a rank k matrix, there is not much room to expand the retrieval model to include concepts like query expansion and term dependence.
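As a concrete illustration of the procedure above, the following Python sketch (using NumPy, with toy data and illustrative variable names) performs the rank-k decomposition, maps documents and a query into the latent space, and ranks documents by cosine similarity. It is a minimal reading of equations 1 through 3, not the exact implementation used later in this paper.

```python
import numpy as np

# Toy co-occurrence matrix X: rows are terms, columns are documents.
X = np.array([[2., 0., 1.],
              [0., 3., 0.],
              [1., 1., 0.],
              [0., 0., 2.]])
k = 2  # number of singular values to keep

# X = U S V^T, truncated to rank k.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

# Documents in the latent space are the rows of V_k S_k (equation 2).
docs_latent = Vt_k.T @ S_k

# Fold a query (a term-count vector) into the latent space: q_k = q^T U_k S_k^{-1}.
q = np.array([1., 0., 1., 0.])
q_latent = q @ U_k @ np.linalg.inv(S_k)

def cosine(a, b):
    # Equation 1: cosine of the angle between two vectors.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

scores = [cosine(d, q_latent) for d in docs_latent]
print(np.argsort(scores)[::-1])  # document indices, best match first
```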
Probabilistic Latent Semantic Analysis (PLSA) is a way of providing a more structured approach to the problem of identifying latent concepts [5]. PLSA takes a stronger statistical approach by constructing a generative model for the data. PLSA represents documents and terms as vectors D and W, and uses an aspect model that associates an unobserved class variable z \in Z with observed documents. The joint distribution is represented as:

P(d, w) = P(d) \sum_{z \in Z} P(w \mid z) P(z \mid d)

The generative model is then fitted through maximum likelihood with the Expectation Maximization (EM) algorithm.

One improvement to PLSA called Latent Dirichlet Allocation (LDA) [6] was proposed which seeks to capture more of the document collection's dependence relationships. Specifically, LDA takes a Bayesian approach and performs inference with prior distributions for terms and documents. In particular, this method gives more generalization, as it constructs a true generative model that represents both seen and unseen documents.

Both LDA and PLSA reevaluate the mathematical underpinnings of LSA for Information Retrieval, but do so by discarding the linear algebra approach of LSA in favor of a different, more structurally sound statistical model.

1.1.3 Information Retrieval with Markov Random Fields

The task of expanding the basic vector space model was approached by [7] with a formal Markov Random Field framework. In this approach, three methods were offered for modeling term dependencies: independent, sequential, and fully dependent. The suggested approach was the sequential dependency graph, containing cliques representing documents, terms, ordered term sequences, and unordered term sequences. The criterion for ranking documents based on sequential dependencies was the ranking function:

P(D \mid Q) \propto \sum_{c \in T} \lambda_T f_T(c) + \sum_{c \in O} \lambda_O f_O(c) + \sum_{c \in O \cup U} \lambda_U f_U(c)   (4)

The functions f_T, f_O, and f_U are clique potential functions representing the compatibility of a clique in the given distribution. The set of weights (\lambda_T, \lambda_O, \lambda_U) is then learned by using a hill climbing search to optimize the mean average precision. Metzler showed [8] that the surface is concave, so finding a global maximum is likely. Clique functions utilize simple smoothing based on a Dirichlet prior to help generalize the term-document space.

This approach uses Markov Random Fields (MRFs) as a model for producing the weighted sum of functions relating terms and documents in equation 4. It is important to note that this is not by itself a major advance in information retrieval, since it is simply another way of stating common information retrieval formulas. Its real value lies instead in the firm foundation that it provides for applying those formulas, as it specifies both the conditional assumptions made by the equations themselves as well as the method for applying them together. Because it provides such a solid framework for MRF-based document retrieval, its authors successfully built upon this foundation with extensions describing implicit user preference [9], feature selection [10], and latent concept expansion [11].

In order to achieve the level of flexibility and extensibility achieved by [7] in that MRF model, we propose another MRF that seeks to capture the smoothing gained from the reduced-dimensionality co-occurrence matrix in LSA. A general method for defining an MRF will be outlined and applied to a term-document dependency graph. A learning strategy will then demonstrate that LSA's topic clustering can be achieved with the general term-document MRF approach.

2 Theory
MRFs provide a flexible framework for depicting conditional relationships between a set of random variables. Unlike similar models such as Markov Chains and Bayesian Networks, MRFs are not limited to specifying one-way (or causal) links between random variables.
MRFs represent a group of random variables with symmetric neighbor relations that satisfy a set [12] of conditions:

• The probability of any variable given the rest of the MRF is equal to the probability of that variable given its neighbors.
• The probability of any set of random variables in the MRF is greater than zero.

The first condition, the Markov property for the MRF, means that comparing probabilities is much simpler, since many of the random variables can be ignored when the one being considered does not depend on them. The second condition simply limits local probabilities to an open interval (0, 1).

To obtain a global distribution for random variables in an MRF, it is first necessary to demonstrate the equivalence between the MRF and the Gibbs distribution [12]. This can be shown with the Hammersley-Clifford theorem. This theorem states that, given a random vector x, a collection of graph dependencies G consisting of dependencies based on a symmetric neighbor relation \omega \subset x \times x, and a set of maximal cliques C on this graph, the random vector is an MRF if it is given a joint probability distribution:

P(x) = \frac{e^{-V(x)}}{Z}

where Z is a normalization constant that is generally infeasible to calculate, and V(x) refers to a family of potential functions that describe the compatibility of clique structures on x. This equivalence, known as the Hammersley-Clifford theorem, was never published by its authors but was proven in later publications [13].

Define a Graph Structure

The first step in constructing an MRF is to produce a graph G that contains a vector of random variables x that satisfies the positivity assumption. This assumption may be restated from its previous definition to say that each random vector may occur with a nonzero probability. In practice, this constraint is easily met with a well constructed graph.

Define Clique Structure
While finding the maximal cliques in a given graph has been shown to be NP-complete [14], a well-designed structure can lead to an easily obtainable and semantically meaningful set of cliques.
Write Clique Potential Functions
Once clique structures have been defined, it is necessary to define clique potential functions for them. These potential functions represent the compatibility of the clique for the particular distribution. The individual clique potential functions combine as:

V(x) = \sum_{c \in C} V_c(x)   (5)

where C is a family of clique configurations and V_c(x) refers to the potential function defined for clique configuration c.

Obtain Joint Distribution
The Hammersley-Clifford Theorem now allows the joint distribution over x to be defined as:

P(x) = \frac{e^{-V(x)}}{Z}   (6)

Applying function 5 to equation 6 produces:

P(x) = \frac{e^{-\sum_{c \in C} V_c(x)}}{Z}   (7)

Defining Z as Z = \sum_{y \in S} e^{-V(y)}, where S is the set of all MRF configurations for x, the joint distribution can be written as:

P(x) = \frac{e^{-\sum_{c \in C} V_c(x)}}{\sum_{y \in S} e^{-V(y)}}   (8)

Provide Learning Strategy
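The normalization constant Z in equation 8 can be made concrete with a small worked example. The Python sketch below, using made-up clique potentials for three binary variables, enumerates every configuration in S, computes Z, and verifies that the resulting Gibbs distribution sums to one; it only illustrates the definition and is not part of the retrieval model itself.

```python
import numpy as np
from itertools import product

def V(x):
    # Illustrative sum of clique potential functions (equation 5):
    # two singleton cliques and one pairwise clique over binary variables.
    x1, x2, x3 = x
    return 0.5 * x1 - 0.2 * x2 + 1.0 * x1 * x3

S = list(product([0, 1], repeat=3))          # all configurations of the MRF
Z = sum(np.exp(-V(x)) for x in S)            # normalization constant
P = {x: np.exp(-V(x)) / Z for x in S}        # Gibbs joint distribution (equation 8)
print(round(sum(P.values()), 6))             # prints 1.0
```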
The last step is to define a method for learning MRF parameters. An example of one learning strategy is the hill climbing approach taken by Metzler to optimize the weights given to the clique potential functions in equation 4.

Figure 1: Example Term Document Graph Structure
2.2 An MRF Model for Information Retrieval

With these steps defined, it is now possible to construct an MRF model for representing LSA in Information Retrieval.
2.2.1 Graph Structure

The random variables in the MRF will be binary-valued. This choice leads to the concise clique functions and probability calculations done in sections 2.2.5 and 2.2.6. For brevity, it is often convenient to represent the collection of term variables as a row vector T and the collection of document variables as a column vector D.

T = [t_1, \ldots, t_n]   (9)

D = [d_1, \ldots, d_m]^T   (10)

Now that the variables in the MRF have been defined, it is necessary to supply neighbor relations \omega on our graph G representing conditional dependence. For this graph structure, each document will be connected to every term, and each term will be connected to every document. In this design, the t nodes represent the pool of terms in our collection, while the d nodes represent the documents containing one or more of those terms.

Figure 1 gives an example of this MRF configuration to visually illustrate the dependence assumptions made in this design. Semantically, this can be viewed as making the same independence assumptions made in the vector space model that LSA utilizes. Specifically, we view each document as only being dependent on the terms it contains. In this way, it is equivalent to the vector-space (bag-of-words) model that stores term counts without any dependence information.

2.2.2 Clique Definition

One benefit of the structure we have defined is that it lends itself to easily factored cliques with semantic meaning. There are three types of cliques in this graph: C = {T, D, T × D}. Cliques over T and D are simple cliques consisting of individual documents and terms, while cliques over T × D are pairs representing term occurrences.

When producing clique functions, the singleton cliques (T and D) provide an opportunity to weight the importance of terms or documents to the joint distribution. The pairwise cliques (T × D) allow the "compatibility" of documents and terms to have an effect on the distribution.

2.2.3 Clique Potential Functions

The simplest clique potential function over the set of random variables X that may be expressed is the sum of the single and double member clique potential functions:

V(X) = \sum_i x_i v_i(x_i) + \sum_i \sum_j x_i x_j v_{ij}(x_i, x_j)   (11)

This is just a sum of the single and double member cliques. One benefit of giving our random variables binary values is that it allows this expression to be simplified greatly without losing any generality. Any clique whose potential function is V(x_i) = x_i v_i(x_i) can only take two values: 0 or v_i(x_i). Furthermore, if we declare that single clique functions evaluate to members of parameter vectors b and g such that v_i(t_i) = b_i and v_i(d_i) = g_i, then t_i v_i(t_i) = 0 or b_i, and d_i v_i(d_i) = 0 or g_i. Similarly, if potential functions for double member cliques (t_i, d_j) evaluate to members of a parameter matrix W such that v(t_i, d_j) = W_{ij}, the expression t_i d_j v_{ij}(t_i, d_j) = 0 or W_{ij}.

Given this flexible representation for individual clique potential functions, the sum of all clique potential functions in equation 11 required for the joint distribution may be written, without any loss of generality, as:

V(X) = \sum_i^n b_i t_i + \sum_j^m g_j d_j + \sum_i^n \sum_j^m W_{ij} t_i d_j   (12)

It will occasionally be convenient to notate this function in terms of the vectors T and D mentioned in equations 9 and 10. This can be done as:

V(X) = bT^T + gD + TWD   (13)
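For binary term and document variables, equations 12 and 13 reduce to a few vector operations. The sketch below evaluates the potential for one configuration using randomly generated parameters b, g, and W; the sizes and values are illustrative only.

```python
import numpy as np

n, m = 5, 3                                  # n terms, m documents (toy sizes)
rng = np.random.default_rng(0)
b, g, W = rng.normal(size=n), rng.normal(size=m), rng.normal(size=(n, m))

T = np.array([1, 0, 1, 0, 0])                # binary term variables that are "on"
D = np.array([0, 1, 0])                      # binary document variables that are "on"

# Equation 13: V(X) = b T^T + g D + T W D.
V = b @ T + g @ D + T @ W @ D
print(V)
```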
2.2.4 Joint Distribution

Now that families of cliques have been defined and given potential functions, an equation for the joint distribution of the MRF model X may be written, using equation 8, as:

P(x) = \frac{\exp\left(\sum_i^n b_i t_i + \sum_j^m g_j d_j + \sum_i^n \sum_j^m W_{ij} t_i d_j\right)}{\sum_{y \in S} \exp\left(\sum_i^n b_i t_i + \sum_j^m g_j d_j + \sum_i^n \sum_j^m W_{ij} t_i d_j\right)}   (14)

2.2.5 Local Probabilities

For information retrieval, local probabilities for individual random variables must be defined. In particular, this is necessary to find the probability of a particular document d_i given a set of query terms. For the manipulations required to demonstrate the derivation of this probability, some compact notations will be adopted for the sake of brevity and clarity.

• The expression P(X_i = 1) denotes the probability of some binary variable, either t_i or d_i, taking on the value 1.
• The expression P(X_{-i}) denotes the probability of every value in X except for X_i, or P(X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_d).
• The expression P(X_{i=k}) denotes the joint probability of X such that X_i = k, or P(X_1, \ldots, X_i = k, \ldots, X_d).

The desired probability may be stated as:

P(d_i = 1 \mid X_{-i})

More clearly, this is equivalent to:

P(d_i = 1 \mid t_1, \ldots, t_n, d_1, \ldots, d_{i-1}, d_{i+1}, \ldots, d_m)

To begin obtaining this probability, it must first be rewritten using a more general form with the compact notation provided above as:

P(X_i = 1 \mid X_{-i})

This can be manipulated with the following steps:

P(X_i = 1 \mid X_{-i}) = \frac{P(X_i = 1, X_{-i})}{P(X_{-i})} = \frac{P(X_{i=1})}{P(X_{-i})} = \frac{P(X_{i=1})}{P(X_{i=1}) + P(X_{i=0})} = \frac{1}{1 + \frac{P(X_{i=0})}{P(X_{i=1})}}

When the joint probability (equation 14) is plugged in, the Z normalization constants cancel to give:
\frac{1}{1 + \frac{P(X_{i=0})}{P(X_{i=1})}} = \frac{1}{1 + \frac{\exp(V(X_{i=0}))}{\exp(V(X_{i=1}))}} = \frac{1}{1 + \exp(-[V(X_{i=1}) - V(X_{i=0})])}

This takes the form of the sigmoid function, \varsigma(t) = \frac{1}{1 + e^{-t}}. It can thus be written as:

P(X_i = 1 \mid X_{-i}) = \varsigma(V(X_{i=1}) - V(X_{i=0}))   (15)

In order to write V(X_{i=1}) - V(X_{i=0}) in terms of individual random variables and parameters, it is necessary to make several observations about the potential functions. When X_i = 0, the X_i value and its associated parameters make no contribution to the sums in the clique potential function as written in equation 12. It can therefore be written, in the special case considered here in which X_i is a document variable:

V(X_{i=0}) = \sum_n b_n t_n + \sum_{m \neq i} g_m d_m + \sum_n \sum_{m \neq i} W_{nm} t_n d_m   (16)

Likewise, it is always the case when X_i = 1 and X_i is a document variable that the clique potential function for the MRF is:

V(X_{i=1}) = \sum_n b_n t_n + g_i + \sum_{m \neq i} g_m d_m + \sum_n W_{ni} t_n d_i + \sum_n \sum_{m \neq i} W_{nm} t_n d_m   (17)

When these are plugged into equation 15, the shared terms cancel to produce the desired probability in terms of variables and parameters:

P(D_i = 1 \mid X_{-i}) = \varsigma\left(g_i + \sum_{l=1}^n W_{li} t_l\right)   (18)

This can be represented more concisely using vectors as:

P(D_i = 1 \mid X_{-i}) = \varsigma(g_i + W_i^T T^T)   (19)

where W_i^T indicates the transpose of the i-th column vector of the parameter matrix W.
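Equation 19 is what retrieval ultimately uses: with the query terms switched on in T, every document can be scored by its local probability. The following sketch (toy parameters, illustrative names) computes those scores with the sigmoid form; it assumes the parameters W and g have already been learned as described in the next section.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def document_probabilities(T, W, g):
    # Equation 19: P(D_i = 1 | X_-i) = sigmoid(g_i + W[:, i] . T) for each document i.
    return sigmoid(g + T @ W)

n, m = 5, 3                                  # n terms, m documents (toy sizes)
rng = np.random.default_rng(1)
W, g = rng.normal(size=(n, m)), rng.normal(size=m)

T = np.zeros(n)                              # a query switches on its terms
T[[0, 2]] = 1.0

scores = document_probabilities(T, W, g)
print(np.argsort(scores)[::-1])              # documents ranked by probability
```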
2.2.6 Learning

The data that will be used to train the model's parameters is a set of observation vectors \hat{T}_1, \ldots, \hat{T}_n that represent occurrence vectors from the data collection. \hat{T}_{ij} may indicate the number of times that term j is present in document i, but normalized counts such as tf-idf vectors are frequently preferable.

Let us also define a matrix \hat{T} = \left[ \begin{pmatrix} \hat{T}_1 \\ 1 \end{pmatrix}, \cdots, \begin{pmatrix} \hat{T}_n \\ 1 \end{pmatrix} \right] that represents the co-occurrence matrix with a row of 1s appended to the bottom. This row can be viewed as a global term that is always on, and it will be used to estimate the parameter g.

The approach for learning parameters will be the minimization of the following sum squared error objective function:

\ell(W, g) = \left\| I - \begin{bmatrix} W & g \end{bmatrix} \hat{T} \right\|_F   (20)

where \|X\|_F indicates the Frobenius norm of a matrix X, and I is an n-dimensional identity matrix whose row vectors represent a configuration of the MRF such that the term variable T_i corresponding with observed occurrence vector \hat{T}_i is set to 1.

The method of minimizing this will be to solve the following equation:

I = \begin{bmatrix} W & g \end{bmatrix} \hat{T}

The solution is obtained as:

\begin{bmatrix} W & g \end{bmatrix} = \hat{T}^{\dagger}_k   (21)

The term \hat{T}^{\dagger} denotes the Moore-Penrose pseudoinverse of the matrix \hat{T}. The expression \hat{T}^{\dagger}_k can be calculated by taking a singular value decomposition of \hat{T}, keeping k singular values, to obtain matrices U_k, S_k, and V_k. The pseudoinverse may now be calculated as:

\hat{T}^{\dagger}_k = V_k S_k^{-1} U_k^T   (22)

It is at this point that the comparison to LSA's rank reduction can be drawn. In this context, the row vectors of the [W g] parameter matrix span a k-dimensional subspace, where k is the number of singular values that have not been set to zero by the SVD operation. It can be shown [4] that this procedure finds the [W g] that minimizes the sum squared objective function predicting I from \hat{T} via I = [W g]\hat{T}, subject to the constraint that [W g] has rank k. This means that the subspace spanned by the row vectors of [W g] is reduced in dimensionality in the same way the latent space used to compare documents in LSA is reduced.
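The learning step amounts to a truncated pseudoinverse. The sketch below builds \hat{T} by appending a row of ones to a toy count matrix, computes the rank-k pseudoinverse of equation 22, and splits the result into W and g following the [W g] stacking of equation 21. Matrix orientation and variable names here are illustrative assumptions rather than a fixed prescription.

```python
import numpy as np

def truncated_pinv(A, k):
    # Equation 22: the Moore-Penrose pseudoinverse of A computed from an SVD
    # in which only the k largest singular values are kept.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt[:k, :].T @ np.diag(1.0 / s[:k]) @ U[:, :k].T

rng = np.random.default_rng(2)
counts = rng.integers(0, 4, size=(6, 10)).astype(float)    # toy occurrence counts
That = np.vstack([counts, np.ones((1, counts.shape[1]))])  # append the always-on row of 1s

# Equation 21: the stacked parameter matrix [W g] is the rank-k pseudoinverse of T-hat.
params = truncated_pinv(That, k=3)
W_hat, g_hat = params[:, :-1], params[:, -1]
print(W_hat.shape, g_hat.shape)
```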
The goal of these experiments is to validate the novel approach we have described by comparing its performance to popular retrieval methods. In particular, we will look at various information retrieval metrics and compare them for varying numbers of singular values taken to reduce the LSA co-occurrence matrix or to solve equation 21 for the MRF approach. In addition, simple vector space term matching will be used as a baseline to evaluate the contribution of term generalization to the algorithms' performance. Since the most obvious algorithm with which to compare our MRF model is the popular Latent Semantic Analysis approach described in section 1.1.1, it will provide a good baseline for term generalization.
The text collections chosen for this paper are the four widely used collections that, together, comprise the Classic4 data set. The four collections comprising Classic4 are:

• CRAN - 3204 abstracts from the Cranfield Institute of Technology
• CACM - 1460 abstracts from the CACM Journal
• CISI - 1460 abstracts from the Institute for Scientific Information
• MED - 1033 abstracts from the National Library of Medicine

Each collection comes with a set of queries and relevance judgments. This data set was selected based on the quality of the text and query information given as well as its contents. Academic abstracts would seem to be excellent targets for topic generalization because effective topic generalization manages to resolve the differing jargon that is used in similar academic fields. This particular data set has also been extensively studied in the past for similar document retrieval approaches such as LSA [1], [5].
Document Collection
The document collection on which experiments were performed was a combined dataset of the four Classic4 document collections. Short terms (below 3 characters), as well as common terms (appearing in 95% or more of documents), were excluded. Stemming was done with the popular Porter stemming algorithm [15].
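A rough sketch of this preprocessing in Python is shown below, using NLTK's Porter stemmer. The tokenization rule and the 95% document-frequency cutoff mirror the description above, while the example documents are placeholders.

```python
import re
from collections import Counter
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def tokenize(text):
    # Keep alphabetic tokens of three or more characters, lower-cased and Porter-stemmed.
    return [stemmer.stem(t) for t in re.findall(r"[a-zA-Z]{3,}", text.lower())]

docs = ["An experimental investigation of the aerodynamics of a wing",
        "Recognition and retrieval of a given misspelled name"]
tokenized = [tokenize(d) for d in docs]

# Drop terms that appear in 95% or more of the documents.
df = Counter(t for doc in tokenized for t in set(doc))
vocab = {t for t, c in df.items() if c / len(docs) < 0.95}
print([[t for t in doc if t in vocab] for doc in tokenized])
```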
Vector Space Model
The simplest baseline for experimentation is simple tf-idf term matching using vector space methods. Documents are ranked based on their angular difference from queries in document-term vector space. The method used ranks documents by the highest cosine of the angle, using equation 1 given in the earlier description of this approach.
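A small sketch of this baseline is given below using scikit-learn's TfidfVectorizer and cosine similarity. The documents and query are placeholders, and parameter choices (such as the English stop-word list) are assumptions rather than a record of the actual experimental configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder collection; in the experiments this would be the Classic4 abstracts.
docs = ["boundary layer control on a wing in a slipstream",
        "retrieval of misspelled passenger names",
        "maternal and fetal plasma glucose levels"]
query = ["glucose levels in fetal plasma"]

vectorizer = TfidfVectorizer(stop_words="english")
D = vectorizer.fit_transform(docs)    # tf-idf document-term matrix
q = vectorizer.transform(query)       # query mapped into the same term space

# Equation 1: rank documents by the cosine of the angle with the query vector.
scores = cosine_similarity(q, D).ravel()
print(scores.argsort()[::-1])         # document indices, best match first
```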
Latent Semantic Analysis
Document ranking with LSA follows the procedure outlined in section 1.1.1. Specifically, the data collection was loaded as a term-document matrix with tf-idf adjustments. Then a singular value decomposition was performed, X = USV^T, where X is the co-occurrence matrix. Each query q_i was mapped into the latent space query L_i as L_i = q_i U S^{-1}. Comparisons with the document collection for query k were then done by finding the maximum cosine of the angle between latent document V_i and latent query L_k for each document i. This can be calculated as:

\cos(\theta) = \frac{V_i \cdot L_k}{\|V_i\| \cdot \|L_k\|}

The role of the number of values kept from the singular value decomposition is first tested by finding the ideal number of values to keep when decomposing the co-occurrence matrix. Since the style of queries for each collection differs somewhat (samples are given in Appendix B), it is necessary to view the different mean average precision values for each collection's queries. After this, precision-recall graphs are made using average precision over the set of all queries for each individual document collection.

Markov Random Field Model
Document ranking was done by loading the data collection as a term-document matrix and then applying the methods described in part 2 of this paper to obtain the parameters of the MRF. No weighting is done; the co-occurrence matrix simply records term counts. The formula in equation 19 is then used to obtain the probability of a certain document given the terms of the MRF, which are set to match the sample queries given with the collections.

The role of the singular value decomposition is first tested by finding the ideal number of singular values to keep when learning MRF parameters, using a method similar to the previous LSA experiment with mean average precision for each collection's set of queries. Once this is done, it is possible to select good singular value counts for each query collection and create precision-recall graphs based on the average precision values for each set of queries.
The results of the mean average precision versus singular values tests (for both LSA and MRF forms of rank reduction) are shown in Figures 2 through 9 for the four text collections. Due to the granularity of the mean average precision differences between the numbers of values kept, as well as the large difference between mean average precision values across document collections, each document collection's graph is shown independently.
Figure 2: MED Collection - Mean Precision for Varying Numbers of Singular Values Used (LSA)

Figure 3: CRAN Collection - Mean Precision for Varying Numbers of Singular Values Used (LSA)

Figure 4: CISI Collection - Mean Precision for Varying Numbers of Singular Values Used (LSA)

Figure 5: CACM Collection - Mean Precision for Varying Numbers of Singular Values Used (LSA)

Figure 6: MED Collection - Mean Precision for Varying Numbers of Singular Values Used (MRF)

Figure 7: CRAN Collection - Mean Precision for Varying Numbers of Singular Values Used (MRF)

Figure 8: CISI Collection - Mean Precision for Varying Numbers of Singular Values Used (MRF)

Figure 9: CACM Collection - Mean Precision for Varying Numbers of Singular Values Used (MRF)
Precision-recall graphs for the four collections' queries, each using the best number of singular values found in the previous step, are given in Figures 10 through 13. Each graph shows results for vector space indexing, LSA, and MRF retrieval.
Figure 10: CACM Collection - Precision-Recall

Figure 11: CISI Collection - Precision-Recall

Figure 12: MED Collection - Precision-Recall
Figure 13: CRAN Collection - Precision-Recall
A visual depiction of the mean average precision for each algorithm is shown in Figure 14.
Figure 14: Mean Average Precision Scores for the Three Approaches
The first experimental result concerned the selection of optimal numbers of singular values for use in the rank reduction in LSA and the pseudoinverse used in the MRF method.

For LSA, best singular value counts of 100, 600, 100, and 700 were found for the MED, CRAN, CISI, and CACM collections respectively. It was clear that some collections (MED and CISI) benefited from smaller counts, while it took much larger counts for CRAN and CACM. However, these are still fractions of the almost 6000 terms in the original data set.

For the MRF model, it seems that certain data sets were better suited to this method than others. MED and CISI had maximums at low (200) singular value counts. CRAN took 900 singular values before tapering off in performance. CACM did not seem suited to the reduced dimensionality, as it continued to increase in performance after reaching around a fifth of the possible singular values (1200 out of 5896).

The precision-recall graphs show promise for the MRF method. It succeeds remarkably in querying CISI, where LSA has been known to show significantly worse performance than simple vector space methods [1]. For the CACM and CRAN collections, it outperformed LSA and either matched or outperformed vector space methods. The only collection on which LSA was strictly superior was the MED collection. It is not entirely clear why this is the case, although the MED collection is the smallest of the collections and has a very small query collection, so it is possible that some aspect of this unusual collection produced such polarized results. However, even in this case, the MRF method still outperformed simple vector space methods.

Not only do these results suggest that our approach is sound for information retrieval, but they also give credibility to our previous assertion that the benefits from rank reduction in LSA can be matched by reducing the dimensionality of the MRF parameter matrix W.

In this paper, we have presented a methodical approach to defining a Markov Random Field (MRF) that captures the independence assumptions made in document indexing with Latent Semantic Analysis (LSA). A clearly defined graph structure produces a set of semantically meaningful clique potential functions describing the compatibility of documents, terms, and document-term pairs in the model.

After declaring these properties of our graph, we utilized the Hammersley-Clifford theorem to state that the joint distribution of the random variables in our graph is defined by a Gibbs distribution. Some manipulation of probabilities was done to find a concise expression for the probability of any particular document given a set of terms.

Finally, a method for learning parameters was proposed. This method minimizes a sum squared error using the Moore-Penrose pseudoinverse. Because this pseudoinverse relies on a singular value decomposition to produce the desired parameters, it is possible to limit the number of singular values kept and achieve the same benefits as the rank reduction in LSA.
Experiments were carried out on the medium-sized
Classic4 data set of scientific abstracts. The results showed that, like LSA, the number of singular values kept in the rank reduction affects performance. Once the most effective number of singular values for each collection was determined, queries for each collection were executed on an MRF formed by learning with that number of singular values. Average precision-recall graphs for the MRF approach as well as the LSA and vector space methods were constructed for each set of queries, and these showed effective retrieval by the MRF method.

The results of these queries were promising. Even though CISI was previously described as being difficult, precision scores remained above 0.2 for all recall values. Both the MED and CRAN collections produced excellent results, with 0.4755 and 0.3184 mean average precision scores respectively. CISI produced a mean average precision of 0.3817, a surprisingly high score for such a difficult collection. The most difficult collection with this method proved to be CACM with a score of 0.3119, but that is not significantly lower than the others. LSA was only able to outperform our approach on the small MED collection, but the MRF model outperformed LSA on the other three collections. The efficacy of our method as a document retrieval engine for difficult collections is suggested by these results.

The greatest benefit of our approach is its potential for future expansion. Now that a clear statistical model has been proposed that utilizes rank reduction in a similar manner to LSA, the next step will be to add new assumptions to the MRF model that produce more intelligent results. Term dependencies, hierarchical document structures, and query expansion are several ideas for future research with this approach.
A Sample Documents from Classic4 Data Set
A.1 CRAN

experimental investigation of the aerodynamics of a wing in a slipstream. an experimental study of a wing in a propeller slipstream was made in order to determine the spanwise distribution of the lift increase due to slipstream at different angles of attack of the wing and at different free stream to slipstream velocity ratios. the results were intended in part as an evaluation basis for different theoretical treatments of this problem. the comparative span loading curves, together with supporting evidence, showed that a substantial part of the lift increment produced by the slipstream was due to a /destalling/ or boundary-layer-control effect. the integrated remaining lift increment, after subtracting this destalling lift, was found to agree well with a potential flow theory. an empirical evaluation of the destalling effects was made for the specific configuration of the experiment.

A.2 CISI
The present study is a history of the DEWEY Decimal Classification. The first edition of the DDC was published in 1876, the eighteenth edition in 1971, and future editions will continue to appear as needed. In spite of the DDC's long and healthy life, however, its full story has never been told. There have been biographies of Dewey that briefly describe his system, but this is the first attempt to provide a detailed history of the work that more than any other has spurred the growth of librarianship in this country and abroad.
A.3 CACM
This paper discusses the limited problem of recognition and retrieval of a given misspelled name from among a roster of several hundred names, such as the reservation inventory for a given flight of a large jet airliner. A program has been developed and operated on the Telefile (a stored-program core and drum memory solid-state computer) which will retrieve passengers' records successfully, despite significant misspellings either at original entry time or at retrieval time. The procedure involves an automatic scoring technique which matches the names in a condensed form. Only those few names most closely resembling the requested name, with their phone numbers annexed, are presented for the agent's final manual selection. The program has successfully isolated and retrieved names which were subjected to a number of unusual (as well as usual) misspellings.

A.4 MED

correlation between maternal and fetal plasma levels of glucose and free fatty acids. correlation coefficients have been determined between the levels of glucose and ffa in maternal and fetal plasma collected at delivery. significant correlations were obtained between the maternal and fetal glucose levels and the maternal and fetal ffa levels. from the size of the correlation coefficients and the slopes of regression lines it appears that the fetal plasma glucose level at delivery is very strongly dependent upon the maternal level whereas the fetal ffa level at delivery is only slightly dependent upon the maternal level.
B Sample Queries from the Classic4 Data Set
B.1 CRAN
B.1.1 Query 1 of 365

what similarity laws must be obeyed when constructing aeroelastic models of heated high speed aircraft.
B.1.2 Query 2 of 365

what are the structural and aeroelastic problems associated with flight of high speed aircraft.

B.2 CISI

B.2.1 Query 1 of 112
What problems and concerns are there in making up descriptive titles? What difficulties are involved in automatically retrieving articles from approximate titles? What is the usual relevance of the content of articles to their titles?
B.2.2 Query 2 of 112
How can actually pertinent data, as opposed to references or entire articles themselves, be retrieved automatically in response to information requests?
B.3 CACM
B.3.1 Query 1 of 64
What articles exist which deal with TSS (Time Sharing System), an operating system for IBM computers?
B.3.2 Query 2 of 64
I am interested in articles written either by Prieve or Udo Pooch
B.4 MED
B.4.1 Query 1 of 30

the crystalline lens in vertebrates, including humans.

B.4.2 Query 2 of 30

the relationship of blood and cerebrospinal fluid oxygen concentrations or partial pressures. a method of interest is polarography.

References

[1] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.
[2] G. W. Furnas, S. Deerwester, S. T. Dumais, T. K. Landauer, R. A. Harshman, L. A. Streeter, and K. E. Lochbaum. Information retrieval using a singular value decomposition model of latent semantic structure. In Proceedings of the 11th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '88, pages 465–480, New York, NY, USA, 1988. ACM.

[3] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Commun. ACM, 18:613–620, November 1975.

[4] Richard Johnson. On a theorem stated by Eckart and Young. Psychometrika, 28:259–263, 1963.

[5] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '99, pages 50–57, New York, NY, USA, 1999. ACM.

[6] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, March 2003.

[7] Donald Metzler and W. Bruce Croft. A Markov random field model for term dependencies. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '05, pages 472–479, New York, NY, USA, 2005. ACM.

[8] Donald Metzler, W. Bruce Croft, and Andrew McCallum. Direct maximization of rank-based metrics for information retrieval. Technical report, 2005.

[9] Donald Metzler and W. Bruce Croft. Beyond bags of words: Modeling implicit user preferences in information retrieval, 2006.

[10] Donald Metzler. Automatic feature selection in the Markov random field model for information retrieval. In Proceedings of CIKM '07, 2007.

[11] Donald Metzler and W. Bruce Croft. Latent concept expansion using Markov random fields. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2007.

[12] Richard M. Golden. Mathematical Methods for Neural Network Analysis and Design. MIT Press, Cambridge, MA, USA, 1st edition, 1996.

[13] Julian Besag. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society, Series B (Methodological), 36(2):192–236, 1974.

[14] Richard M. Karp. Reducibility among combinatorial problems. In Michael Jünger, Thomas M. Liebling, Denis Naddef, George L. Nemhauser, William R. Pulleyblank, Gerhard Reinelt, Giovanni Rinaldi, and Laurence A. Wolsey, editors, 50 Years of Integer Programming 1958-2008, pages 219–241. Springer Berlin Heidelberg, 2010.

[15] M. F. Porter. An algorithm for suffix stripping, pages 313–316. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1997.