Finding Salient Context based on Semantic Matching for Relevance Ranking
Yuanyuan Qi
Beijing University of Posts and Telecommunications
[email protected]

Jiayue Zhang
Beijing University of Technology
[email protected]

Weiran Xu
Beijing University of Posts and Telecommunications
[email protected]

Jun Guo
Beijing University of Posts and Telecommunications
Abstract—In this paper, we propose a salient-context-based semantic matching method to improve relevance ranking in information retrieval. We first propose a new notion of salient context and then define how to measure it. Then we show how the most salient context can be located with a sliding-window technique. Finally, we use the semantic similarity between a query term and the most salient context terms in a corpus of documents to rank those documents. Experiments on various collections from TREC show the effectiveness of our model compared to state-of-the-art methods.
Index Terms—keyword matching, contextual salience, semantic matching.
I. INTRODUCTION
As the core of understanding multimedia, semantic matching plays the role of a bridge connecting different forms of content, such as text, image, video, and audio. Before semantic matching came into existence, conventional keyword matching methods were dominant for a long time, for example in Information Retrieval (IR) [1]. They fail, however, to capture a query term's fine-grained contextual information. The missing contextual information results in the term-mismatching problem caused by word ambiguity. To deal with this problem, a variety of neural IR models, often called semantic matching models, have been proposed to incorporate context information through embedded representations [2]. Some methods consider the whole document as a global context and embed it into one vector. The query term is embedded into a similar vector, and these vectors are used to calculate the relevance between term and document [3]. Other methods consider a certain scope around the keyword as the local context. Only this local context is encoded into embedding vectors and used to compute the relevance [4]. Both lines of work have made important contributions to semantic matching, but we believe that the retrieved documents can fit the query terms even better. The global-context methods fail to capture the individual interactions between the query and the document terms, since the whole document is encoded into one vector. The latter group does not have this problem, but it still leaves the mismatching problem unsolved.

To remedy the shortcomings of the previous methods, in this paper we propose a salient-context-based semantic matching model.

Fig. 1: Term relevance distribution. The vertical axis denotes the query term, and the horizontal axis denotes the term position index. In each box, the upper part shows the terms related to the query term "robotic", and the lower part shows those related to "technology". The thickness of a line indicates the relevance score of the term: the thicker, the higher. The first two documents are rated relevant by human judges, whereas the third one is irrelevant.

With this model, we improve relevance ranking in IR. Fig. 1 explains the concept of salient context with an example. We have the query terms "robot technology" and a corpus of three documents. The three boxes in the figure correspond to those three documents. The vertical lines indicate positions in the documents which are salient with respect to the two query terms, and thus give the locations of the salient context.

We can observe that the highly relevant terms are clustered in the first two boxes, while they are scattered in the third. As the two corresponding documents are labeled relevant to "robot technology" by a human, the clustering indicates that the closer together query-related terms are located, the more relevant the document is to the query. This behavior leads us to define the locations of these clusters as the salient context. Our goal is to find the most salient context and embed it into vectors that represent the document. In this way, we eliminate the risk of single-keyword mismatching, thus addressing the shortcomings of the models mentioned earlier.

To locate the most salient context, we define a measure of contextual salience. It is based on the semantic similarity between the query and the salient context, and it is designed such that it is neither influenced by weakly query-related terms nor dominated by a single term. In addition, we use the BM25 relevance score as a representation of the global context in the final relevance function.

The contribution of this paper is threefold. Firstly, we analyze and demonstrate the aggregation phenomenon of highly query-related terms in relevant documents, and we define our new concept of salient context. Secondly, we propose a way to measure contextual salience in order to locate the most salient context dynamically.
Thirdly, rather than using the context surrounding a keyword, we propose to use the most salient context as a representation of a document, thereby eliminating the mismatching problem.

II. METHODOLOGY
A. Term-level Semantic Matching
Fig. 2: Analysis of term importance for estimating the relevance of a document to the query "robot technology" by semantic relevance matching.

Generally, it is important that each keyword is exactly matched, particularly when the keywords are new or rare. However, traditional keyword matching may fail to capture fine-grained contextual information and semantically related terms. As illustrated by the example in Fig. 2, semantic relevance matching is able to highlight the terms with a high semantic relevance to the query "robotic technology", with dark green being most relevant. We can see that semantic matching emphasizes semantically related terms such as "robot", "industrial", and "application".

Distributed representations of text, i.e., word embeddings, encapsulate useful contextual information and effectively represent the semantic information of a word. Models that use pre-trained word embeddings [5]–[7] have shown better performance than those which use term co-occurrence counting between query and documents. Inspired by this, we utilize pre-trained word embeddings as the basis of our semantic representation to model the query-document matching interaction. From the embedded vectors, we apply cosine similarity to capture word-level semantic matching:

s_{ij} = \frac{w_i^T w_j}{\|w_i\| \cdot \|w_j\|}, \quad (1)

where w_i and w_j represent the vectors for the i-th query term and the j-th document term, respectively.

B. Contextual Salience
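As a concrete illustration of the term-level matching in Eq. (1), which the salience measure below builds on, the similarities for all query-document term pairs can be computed at once. This is a minimal sketch assuming pre-trained embeddings (e.g., GloVe) already loaded as NumPy arrays; the function name is illustrative, not from the released code.

```python
import numpy as np

def term_similarity_matrix(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> np.ndarray:
    """Cosine similarity s_ij (Eq. 1) between every query term i and document term j.

    query_vecs: (|Q|, d) matrix of query term embeddings.
    doc_vecs:   (|D|, d) matrix of document term embeddings.
    Returns a (|Q|, |D|) matrix of similarities in [-1, 1].
    """
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return q @ d.T
```

With real GloVe vectors, each row of the returned matrix corresponds to a query term such as "robot" or "technology", and each column to a document term position.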
According to the query-centric assumption proposed in [8], the local context surrounding the location of a found query term in a document is relevant when deciding whether the document matches the query. In Fig. 2, relevant terms cluster around the first two sentences, and in Fig. 1 we can see that these clusters can appear at the beginning, middle, or end of a document. Thus, the position of the salient context changes from document to document, and our salience measure must be able to handle that shift. We use a sliding window which moves over the document from start to end. For a given position of the window, terms which are highly related to the query are found, and thus that part of the document will stand out. The window context for the i-th query term is described as:

S_i = \{ s_{ij} \mid i \in Q, j \in T \}, \quad (2)

where s_{ij} is the cosine similarity between the i-th query term and the j-th document term in the window, Q is the set of query terms, T is the set of document terms in the window, and S_i represents the cosine relevance between the i-th query term and the document terms which fall inside the window.

This approach is different from the deep learning models. As stated above, the deep learning models combine all terms in a document into one single document representation. Our representation only takes the relevant parts of the document and embeds those into a document representation. Often, only a few terms with a high window relevance score contribute to the final document relevance. In order to filter away textual noise and counteract semantic drift, we only take the window contextual salience of the top-n semantic relevance matches into account. The n maximums of the set S_i are obtained as:

S_i^{(1)} = \max(S_i)
S_i^{(2)} = \max(S_i \setminus \{S_i^{(1)}\})
S_i^{(3)} = \max(S_i \setminus \{S_i^{(1)}, S_i^{(2)}\})
\cdots
S_i^{(n)} = \max(S_i \setminus \{S_i^{(1)}, S_i^{(2)}, \cdots, S_i^{(n-1)}\}). \quad (3)

The set of the n maximum members of S_i is then

S_i^n = \{S_i^{(1)}, S_i^{(2)}, \cdots, S_i^{(n)}\}. \quad (4)

The contextual salience of the i-th query term in the window is

S_i^{ct} = S_i^{(1)} + \alpha \cdot \frac{\sum_{n=2}^{K} S_i^{(n)}}{K}, \quad (5)

where K = \log(L) + 1 is decided by the window width L, and \alpha is an influence factor that balances the weighting of the semantic interactions in the window context.

Queries used in IR are short and lack complex grammatical structure. Consequently, we need to take term importance into account. The compositional relation between query terms is usually a simple "and" relation when searching. Take the query "arrested development" for example: a relevant document should refer to both "arrested" and "development", where the term "arrested" is more important than "development". Many previous studies on retrieval models have shown the importance of term discrimination [9]. In the proposed model, we introduce an aggregation weight for each query term which controls how much the relevance score for that query term contributes to the final relevance score:

S^{ct} = \sum_{i=1}^{ql} g_i S_i^{ct}, \quad (6)

g_i = \frac{\exp(v_i^T w_i)}{\sum_{m=1}^{ql} \exp(v_m^T w_m)}, \quad (7)

where v_i denotes the weight vector of the i-th query term vector w_i, and ql is the query length. In our model, we set the weight vectors equal to their respective query term vectors, i.e., v_i = w_i. Putting this into Eq. 7, we get:

g_i = \frac{\exp(w_i^T w_i)}{\sum_{m=1}^{ql} \exp(w_m^T w_m)}. \quad (8)

Here, w_i^T w_i squares each element of w_i before summing them together. As w_i \in [-1, 1]^d, with d being the dimension of the weight vector, the resulting scalar is positive and equal to the squared magnitude of w_i. Eq. 8 is the normalized exponential, or softmax, function, with g_i \in [0, 1]. It returns a scalar which is proportional to the normalized magnitude of the term vector, but with an emphasis on the vectors with the largest magnitudes. Thus, it regularizes the relevance score.

C. Relevance Aggregation
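Before turning to aggregation, the contextual-salience computation of Eqs. (2)-(8), which the aggregation below consumes, can be sketched in a few lines. This is an illustrative reading, not the authors' released implementation: the function names are ours, and we take K = \lfloor\log(L)\rfloor + 1 as the integer version of the window-width rule.

```python
import numpy as np

def term_salience(sim_row: np.ndarray, n: int, alpha: float) -> float:
    """Contextual salience of one query term for one window position (Eqs. 3-5).

    sim_row: cosine similarities s_ij between this query term and
             the L document terms inside the window.
    """
    top = np.sort(sim_row)[::-1][:n]           # the n-maximums S_i^(1..n) (Eqs. 3-4)
    K = int(np.log(len(sim_row))) + 1          # K = log(L) + 1, floored to an index
    return top[0] + alpha * top[1:K].sum() / K # Eq. 5

def contextual_salience(sim: np.ndarray, query_vecs: np.ndarray,
                        n: int = 5, alpha: float = 0.5) -> float:
    """Softmax-weighted aggregation over query terms (Eqs. 6-8).

    sim:        (|Q|, L) similarity matrix for one window position.
    query_vecs: (|Q|, d) query term embeddings (v_i = w_i).
    """
    mags = (query_vecs * query_vecs).sum(axis=1)   # w_i^T w_i
    g = np.exp(mags) / np.exp(mags).sum()          # Eq. 8: softmax term weights
    per_term = np.array([term_salience(row, n, alpha) for row in sim])
    return float(g @ per_term)                     # Eq. 6
```

The most salient context is then the window position that maximizes this score as the window slides over the document.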
Different from semantic matching based on distributional word embeddings, exact keyword matching avoids the risk of rare or new words in the query. Hence, we linearly combine exact keyword matching with semantic matching and use it as compensation. BM25 [10] is a classical weighting function employed by the Okapi system; as shown by previous TREC experimentation, BM25 usually provides very effective retrieval performance on the TREC collections. In BM25, the relevance score is based on the within-document term frequency and the query term frequency. We utilize BM25 to model document-level relevance matching with the query terms, and we aggregate the exact keyword matching interactions by integrating BM25 linearly via a parameter β. We also take into consideration the co-occurrence of query terms within the document in the weighting function for the contextual salience. The two formulas are defined as follows:

Linear function:
F(S^{ct}) = \max(S^{ct}) + \beta \cdot BM25, \quad (9)

Co-occurrence weighting function:
F(S^{ct}) = \log(co + C) \cdot \max(S^{ct}) + \beta \cdot BM25, \quad C \in \mathbb{R}, \quad (10)

where β is an influence factor that decides the effect of BM25 in relevance scoring. When β is 0, only contextual salience contributes to the relevance score; when β ∈ (0, 1), contextual salience and BM25 contribute together. co is the co-occurrence count of the query terms within the document, and C is a constant that balances the parameter co.

III. DATASETS AND EVALUATION
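For concreteness, the two aggregation functions of Eqs. (9) and (10), which produce the final scores evaluated below, reduce to a few lines. In this sketch `bm25` is assumed to be the document's BM25 score computed separately, and the default value of C is purely illustrative.

```python
import math

def linear_score(max_sct: float, bm25: float, beta: float) -> float:
    """Eq. 9: most-salient-context score compensated by BM25."""
    return max_sct + beta * bm25

def co_weighted_score(max_sct: float, bm25: float, beta: float,
                      co: int, C: float = 1.0) -> float:
    """Eq. 10: additionally weight the salience by query-term co-occurrence.

    co: number of co-occurrences of the query terms within the document.
    C:  balancing constant (the default here is an illustrative choice).
    """
    return math.log(co + C) * max_sct + beta * bm25
```

With beta = 0 only contextual salience contributes; with beta in (0, 1) the two signals are blended, exactly as described above.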
We evaluate the proposed approach on five standard TREC collections, which differ in size, content, and topic. The TREC tasks and topic numbers associated with each collection are summarized in Table I.

TABLE I: Overview of the TREC collections used

Collection  Topics            Topics Num.  Docs
AP8889      51-100            50           164,597
WT2G        401-450           50           247,491
Robust04    301-450, 601-700  250          528,155
WT10G       451-550           100          1,692,096
Blog06      851-950           100          3,215,171

For all the test collections used in our experiments, we apply pre-trained GloVe word vectors trained on a 6-billion-token collection (Wikipedia 2014 plus Gigaword 5); reliable term representations can be better acquired from large-scale unlabeled text collections than from the limited ground-truth data available for the IR task. We use the TREC retrieval evaluation script, focusing on MAP, RP (recall precision), P@5, P@20, NDCG@5, and NDCG@20 in our experiments. We provide the source code for the model as well as the trained word vectors.

[GloVe vectors: https://nlp.stanford.edu/projects/GloVe/]
[TREC evaluation script: https://trec.nist.gov/trec_eval/]
[Source code: https://github.com/YuanyuanQi/CSSM_IR/]

IV. EXPERIMENTS

Table II shows the performance comparison between the baseline model BM25 and the new model CSSM on five collections over MAP, RP, P@5, P@20, NDCG@5, and NDCG@20. The percentage by which our model outperforms BM25 is also listed. With regard to MAP and RP, the results indicate that, in general, our model performs better than the baseline BM25 on all five collections, especially on the WT2G, Robust04, and Blog06 collections. This demonstrates the importance of semantic relevance matching and shows that contextual salience helps to locate the most relevant local context through highly semantic relevance matching. Comparing the results of CSSM_lf (linear function) and CSSM_cw (co-occurrence weighting function), three datasets show improvements; the co-occurrence information of the query terms in a document can offer a positive connection with contextual salience in the model. The experimental results show that our model can encode the critical contextual semantic information in our relevance ranking function for IR.

TABLE II: Comparisons of CSSM and BM25, with MAP, RP, P@5, P@20, NDCG@5, and NDCG@20 over five TREC collections

Corpus    Methods   MAP    RP     P@5    P@20   NDCG@5  NDCG@20
AP8889    BM25      0.278  0.298  0.453  0.404  0.461   0.430
          CSSM_lf   +%     +%     +%     +%     +%      +%
          CSSM_cw   +%     +%     +%     +%     +%      +%
WT2G      BM25      0.313  0.340  0.532  0.391  0.542   0.470
          CSSM_lf   +%     +%     +%     +%     +%      +%
          CSSM_cw   +%     +%     +%     +%     +%      +%
Robust04  BM25      0.239  0.283  0.481  0.354  0.497   0.425
          CSSM_lf   +%     +%     +%     +%     +%      +%
          CSSM_cw   +%     +%     +%     +%     +%      +%
WT10G     BM25      0.211  0.244  0.382  0.274  0.418   0.362
          CSSM_lf   +%     +%     +%     +%     +%      +%
          CSSM_cw   +%     +%     +%     +%     +%      +%
Blog06    BM25      0.318  0.371  0.634  0.605  0.625   0.611
          CSSM_lf   +%     +%     +%     +%     +%      +%
          CSSM_cw   +%     +%     +%     +%     +%      +%

TABLE III: Comparisons of deep learning methods on the Robust04 collection

Methods           MAP    P@20   NDCG@20
BM25              0.239  0.354  0.425
DRMM              0.256  0.37   0.444
PACRR             0.258  0.372  0.443
DRMM+PACRR        0.259  0.372  0.444
ABEL-DRMM         0.263  0.380  0.456
ABEL-DRMM+MV      0.265  0.380  0.455
CSSM_lf
CSSM_cw

Table III shows the performance on the Robust04 collection in comparison with the deep-learning-based methods recently proposed in [5]–[7]. Our performance is better than DRMM, PACRR, and DRMM+PACRR, and slightly better than ABEL-DRMM and ABEL-DRMM+MV, with less extra model training data. Compared with POSIT-DRMM and POSIT-DRMM+MV, which encode multiple views (MV) of terms (context-sensitive term encodings, pre-trained term embeddings, and one-hot term encodings), our model utilizes pre-trained term embeddings alone, for two main reasons. First, given our scoring function, directly applying multiple views of terms makes it hard to balance the differences in input dimensions: a one-hot vector is high-dimensional and sparse compared to a term embedding. Second, explicitly tuning context-sensitive term encodings in the model requires training data and sacrifices efficiency. In addition, since our model requires no parameter tuning, its retrieval time cost is lower than that of all the supervised deep-learning-based models in the table; it works as efficiently as BM25.

V. CONCLUSION

In this paper, we propose a semantic-matching-based method to locate the most salient context for understanding a piece of multimedia content. We propose to prioritize locating the semantically salient context in the relevance calculation. On the basis of this prioritization, we define a measure of contextual salience to quantify the relevance of a document to a query. Furthermore, we apply the proposed method to IR, where it shows promising improvements over the strong BM25 baseline and several neural relevance matching models. Finally, extensive comparisons between several neural relevance matching models and our approach suggest that explicitly modelling the salient query-related context in a document helps improve the effectiveness of relevance ranking for IR.
Our idea of understanding content by locating the most salient context provides a new perspective on multimedia content analysis, and the proposed semantic-matching-based method offers an efficient and explainable relevance ranking solution that can be generalized to other forms of multimedia content.

VI. ACKNOWLEDGMENT

This work was supported by the Beijing Natural Science Foundation (4174098), the National Natural Science Foundation of China (61702047 and 61703234), and the Fundamental Research Funds for the Central Universities (2017RC02).
REFERENCES

[1] C. D. Manning, P. Raghavan, and H. Schütze, "Introduction to information retrieval," Natural Language Engineering, vol. 16, no. 1, pp. 100-103, 2010.
[2] O. K. Dilek and Zhang et al.