Graph-based Hierarchical Relevance Matching Signals for Ad-hoc Retrieval
Xueli Yu∗, Weizhi Xu∗, Zeyu Cui, Shu Wu†, Liang Wang
Center for Research on Intelligent Perception and Computing, Institute of Automation, Chinese Academy of Sciences
University of Chinese Academy of Sciences
{xueli.yu,weizhi.xu}@cripac.ia.ac.cn, {zeyu.cui,shu.wu,wangliang}@nlpr.ia.ac.cn
ABSTRACT
The ad-hoc retrieval task is to rank related documents given a query and a document collection. A series of deep learning based approaches have been proposed to solve this problem and have gained much attention. However, we argue that they are inherently based on local word sequences and ignore the subtle long-distance document-level word relationships. To address this problem, we explicitly model the document-level word relationships through a graph structure, capturing the subtle information via graph neural networks. In addition, due to the complexity and scale of document collections, it is valuable to explore the different grain-sized hierarchical matching signals at a more general level. We therefore propose a Graph-based Hierarchical Relevance Matching model (GHRM) for ad-hoc retrieval, with which we can capture the subtle and general hierarchical matching signals simultaneously. We validate the effectiveness of GHRM on two representative ad-hoc retrieval benchmarks; the comprehensive experiments and results demonstrate its superiority over state-of-the-art methods.

CCS CONCEPTS
• Information systems → Learning to rank.

KEYWORDS
Hierarchical graph neural networks, Ad-hoc retrieval, Relevance matching
ACM Reference Format:
Xueli Yu, Weizhi Xu, Zeyu Cui, Shu Wu, Liang Wang. 2021. Graph-based Hierarchical Relevance Matching Signals for Ad-hoc Retrieval. In Proceedings of the Web Conference 2021 (WWW '21), April 19–23, 2021, Ljubljana, Slovenia. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3442381.3450115
1 INTRODUCTION

During recent years, deep learning methods have gained much attention in the field of Information Retrieval (IR), where the task is to obtain a list of documents that are relevant to a given query (i.e., a short query and a collection of long documents in ad-hoc retrieval).

∗ The first two authors contributed equally to this work.
† Corresponding author.
This paper is published under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution.
WWW '21, April 19–23, 2021, Ljubljana, Slovenia
© 2021 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY 4.0 License.
ACM ISBN 978-1-4503-8312-7/21/04.
https://doi.org/10.1145/3442381.3450115
Figure 1: An example of the process of hierarchical query-document relevance matching. (a) The query and the candidate document (some words are omitted). (b) A graph with a part of the words in the document. (c) A hierarchical graph containing the more critical words and discarding the words unrelated to the query.

Compared to traditional methods, which utilize hand-crafted features to match the query and documents, deep learning based approaches can extract the matching patterns between them automatically; they have thus been widely applied and have achieved remarkable success.

Generally speaking, deep learning based query-document matching approaches can be roughly divided into two categories, i.e., semantic matching and relevance matching. The former focuses on semantic signals: it learns embeddings of the query and the document respectively and calculates similarity scores between them to make the final prediction, as in DSSM [9], CDSSM [25] and ARC-I [7]. The latter emphasizes relevance, like DRMM [5], KNRM [29], and PACRR [10, 11], which capture relevance signals directly from the word-level similarity matrix of the query and document. As mentioned in [5], the ad-hoc retrieval task needs exact matching signals more than semantic ones, so relevance matching methods are more applicable in this scenario; in this paper, we also focus on relevance matching.

While previous relevance matching approaches achieve satisfying results, we argue that the fine-grained long-distance word relationships in the document have not been explored so far. In other words, as terms in the query may not always appear exactly together in the candidate document, it is necessary to consider the subtler document-level word relationships. However, such characteristics are ignored in the existing works, which rely on local word sequences [10, 20, 21]. One solution is to explicitly model the document-level word relationships through a graph structure, as in [30, 36] in the field of text classification, where graph neural networks are utilized as a language model to capture long-distance dependencies. Take Figure 1(a) and 1(b) as an example, where (a) presents the query "long-closed EXPO train" and the candidate document, and (b) is a graph structure built from a part of the words in the document.
Although the words "long-closed" and "EXPO" are distributed non-consecutively in the document, their relationship can be clearly learned via the high-order connections in the graph structure of (b), which validates the advantage and necessity of utilizing graph-based methods to capture subtle document-level word relationships.

However, the above graph-based language models largely ignore the different grain-sized hierarchical signals at a more general level, which are also critical for the ad-hoc retrieval task due to the complexity and scale of document collections. Take Figure 1(b) and 1(c) as an example, where (c) is the query-aware hierarchical graph extracted from (b). The processing from (b) to (c) can be summarized in two aspects. One is that the words unrelated to the query are dropped, such as "marvel" and "sleek". The other is that words which may have a similar effect for matching the query are integrated into a more critical node. For instance, the words "Santa Monica" and "Los Angeles" in (b) are integrated into the node "Location" in (c), because of their similar location-based supplement to the query. Therefore, considering the different grain-sized interaction information of the hierarchical graphs makes the relevance matching signals more general and comprehensive. Accordingly, to capture these signals, inspired by hierarchical graph neural network methods [14, 31], we introduce a graph pooling mechanism to extract important matching signals hierarchically. Thus, both the subtle and the general hierarchical matching signals can be captured simultaneously.

In this work, we propose a Graph-based Hierarchical Relevance Matching model (GHRM) to explore different grain-sized query-document interaction information concurrently. Firstly, for each query-document pair, we transform the document into the graph-of-words form [24], in which nodes indicate words and edges indicate the co-occurrence frequencies between word pairs. We represent each node feature through the interaction between the word itself and the query terms, which provides critical interaction matching signals for relevance matching compared to traditional raw word features. Secondly, we utilize the architecture of GHRM to model the different grain-sized hierarchical matching signals on the document graph, where the subtle and general interaction information can be captured simultaneously. Finally, we combine the signals captured in each block of GHRM to obtain the final hierarchical matching signals, feeding them into a dense neural layer to estimate the relevance score.

We conduct empirical studies on two representative ad-hoc retrieval benchmarks, and the results demonstrate the effectiveness and rationality of our proposed GHRM. In summary, the contributions of this work are as follows:

• We model the long-distance document-level word relationships via graph-based methods to capture the subtle matching signals.
• We propose a novel hierarchical graph-based relevance matching model to learn different grain-sized hierarchical matching signals simultaneously.
• We conduct comprehensive experiments to examine the effectiveness of GHRM, where the results demonstrate its superiority over the state-of-the-art methods in the ad-hoc retrieval task.

Code and data are available at https://github.com/CRIPAC-DIG/GHRM.
2 RELATED WORK

In this section, we briefly review previous work in the fields of ad-hoc retrieval and graph neural networks.
2.1 Ad-hoc Retrieval

Ad-hoc retrieval is a task mainly about matching two pieces of text (i.e., a query and a document). Deep learning techniques have been widely utilized in this task, and previous methods can be roughly grouped into two categories: semantic matching and relevance matching approaches. The former embeds the representations of the query and document into two low-dimension spaces independently, and then calculates the relevance score based on these two representations. For instance, DSSM [9] learns the representations via two independent Multi-Layer Perceptrons (MLP) and computes the relevance score as the cosine similarity between the outputs of the last layers of the two networks. C-DSSM [25] and ARC-I [7] further capture positional information by utilizing a Convolutional Neural Network (CNN) instead of an MLP. However, this kind of method is mainly based on the semantic signal, which is less effective for the retrieval task [5]. In contrast, relevance matching approaches capture local interaction signals by modeling query-document pairs jointly. They all follow a general paradigm: first obtaining the interaction signal of the query-document pair via some similarity function, and then employing deep learning models to further explore this signal. Guo et al. [5] propose DRMM, which utilizes a histogram mapping function and an MLP to process the interaction matrix. Later, [10], [29], and [11] employ CNNs to capture higher-gram matching patterns in the document, considering phrases rather than single words when interacting with the given query. In addition, a series of multi-level methods [19, 22] have been proposed; [22] models a word-level CNN and a character-level CNN respectively to distinguish the hashtags and the body of the text in microblog and Twitter search, capturing multi-perspective information for relevance matching. However, such a method only models two different perspectives of the text, not the exact hierarchical matching signals which we explore in this paper.

Recently, a series of BERT-based methods [2, 17] have also been proposed in this field, where BERT's classification vector is combined with existing ad-hoc retrieval architectures (using BERT's token vectors) to obtain the benefits of both approaches.
2.2 Graph Neural Networks

Graph neural networks (GNNs) are a promising way to learn graph representations by aggregating information from neighborhoods [8]. They can be roughly divided into two lines (i.e., spectral methods [3, 13] and spatial methods [6, 27]) regarding different aggregation strategies. Due to their ability to capture the structural information of data, graph neural networks have been widely applied in various domains such as recommender systems [16, 28, 32, 34] and Natural Language Processing (NLP) [36]. These works model the long-distance relationships between items or words by utilizing graph neural networks to explore the graph structure. [35] introduces GNNs to the IR task, employing a multi-layer graph convolutional network (GCN) [13] to learn the representations of words in documents. Nevertheless, it learns the embeddings of the query and document separately, so it is based on semantic matching rather than relevance matching.

In recent years, hierarchical graph neural networks, which capture signals at different levels of the graph through pooling strategies, have attracted much research interest. Several hierarchical pooling approaches have been proposed so far. DiffPool [31] learns an assignment matrix that divides nodes into different clusters in each layer, hence reducing the scale of graphs. gPool [4] reduces the time complexity by utilizing a projection vector to calculate the importance score of each node. Furthermore, Lee et al. [14] propose SAGPool, in which three graph convolutional layers are each followed by a pooling layer; the outputs of each pooling layer are treated as attention scores and nodes are dropped according to these scores.

In our previous work [37], we utilized GNNs to model the query-document interaction for ad-hoc retrieval. In this paper, inspired by progress in the domain of hierarchical graph neural networks, we further design a graph-based hierarchical relevance matching architecture based on graph neural networks and a graph pooling mechanism.
3 METHOD

In this section, we first give the problem definition and describe how to construct the graph of the document according to its interaction signals with the query. Then, we demonstrate the graph-based hierarchical matching method in detail. Finally, the procedures of model training and matching score prediction are described. Figure 2 illustrates the overall process of the proposed architecture, including graph construction, graph-based hierarchical matching, and readout and matching score prediction.

3.1 Problem Definition

For each query and document, denoted as $q$ and $d$ respectively, we represent them as sequences of words $q = \langle w^{(q)}_1, \ldots, w^{(q)}_M \rangle$ and $d = \langle w^{(d)}_1, \ldots, w^{(d)}_N \rangle$, where $w^{(q)}_i$ denotes the $i$-th word in the query, $w^{(d)}_i$ denotes the $i$-th word in the document, and $M$ and $N$ denote the lengths of the query and the document respectively. The aim is to compute relevance scores $rel(q, d)$ with which a series of candidate documents can be ranked for a given query.

3.2 Graph Construction

To capture the document-level word relationships, we construct a document graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the set of vertices with node features and $\mathcal{E}$ is the set of edges. The construction procedures of the node feature matrix and the adjacency matrix are described as follows.

Node feature matrix. In the graph $\mathcal{G}$, each node represents a word in the document, so the word sequence reduces to the node set $\{w^{(d)}_1, \ldots, w^{(d)}_n\}$, where $n$ is the number of unique words in the document ($|\mathcal{V}| = n \le N$). In addition, to introduce the query-document interaction signals into the graph, we set the node features as the interaction signals between each word embedding and the query term embeddings, similar to [37]. A cosine similarity matrix $\mathbf{S} \in \mathbb{R}^{n \times M}$ represents this interaction, where the element $s_{ij}$ is the similarity score of the node $w^{(d)}_i$ and the query term $w^{(q)}_j$:

$$s_{ij} = \mathrm{cosine}\big(\mathbf{e}^{(d)}_i, \mathbf{e}^{(q)}_j\big) \qquad (1)$$

where $\mathbf{e}^{(d)}_i$ and $\mathbf{e}^{(q)}_j$ denote the word embedding vectors of $w^{(d)}_i$ and $w^{(q)}_j$ respectively. In this work, the word2vec [18] method is utilized to convert each word into a dense vector as the initial word embedding.
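The node-feature construction of Eq. (1) reduces to a normalized matrix product. Below is a minimal PyTorch sketch; the function name `build_interaction_matrix` and the random inputs are ours for illustration, not part of the paper's released code.

```python
import torch
import torch.nn.functional as F

def build_interaction_matrix(doc_emb, query_emb):
    """doc_emb: [n, dim] embeddings of the n unique document words;
    query_emb: [M, dim] embeddings of the M query terms.
    Returns S: [n, M] with S[i, j] = cosine(e_i^(d), e_j^(q))."""
    doc_norm = F.normalize(doc_emb, dim=-1)        # unit-length rows
    query_norm = F.normalize(query_emb, dim=-1)
    return doc_norm @ query_norm.t()               # all pairwise cosines at once

# Example: 6 unique document words, 3 query terms, 300-d word2vec vectors.
S = build_interaction_matrix(torch.randn(6, 300), torch.randn(3, 300))
```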
Adjacency matrix. After obtaining the nodes with features, we focus on the adjacency matrix, which describes the connections and relationships between the nodes. In detail, we apply a sliding window along the document word sequence $d$, building a bi-directional edge between a word pair if the two words co-occur within the sliding window. Restricting the size of the window guarantees that every word is connected to neighboring words with which it may share contextual information. It is worth mentioning that, compared to traditional local relevance matching methods [10, 11, 29], the graph construction of our GHRM model further obtains a document-level receptive field by bridging neighboring words in different hops together. In other words, it can capture the subtle document-level relationships that we are concerned with.

Formally, the adjacency matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ is defined as

$$\mathbf{A}_{ij} = \begin{cases} count(i, j) & \text{if } i \neq j \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

where $count(i, j)$ is the number of times the words $w^{(d)}_i$ and $w^{(d)}_j$ appear in the same sliding window. Furthermore, to alleviate the problem of gradient exploding and vanishing, following [13] the adjacency matrix is normalized as $\tilde{\mathbf{A}} = \mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}}$, where $\mathbf{D} \in \mathbb{R}^{n \times n}$ is the diagonal degree matrix with $\mathbf{D}_{ii} = \sum_j \mathbf{A}_{ij}$.
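The sliding-window adjacency of Eq. (2) and its symmetric normalization can be sketched as follows. The helper name and the dense-matrix representation are our assumptions; counting each co-occurrence once per shared window follows the definition of $count(i, j)$ above.

```python
import torch

def build_adjacency(doc_ids, window_size=5):
    """doc_ids: the document as a sequence of node indices (one per unique word).
    Returns the normalized adjacency D^(-1/2) A D^(-1/2) as a dense tensor."""
    n = max(doc_ids) + 1
    A = torch.zeros(n, n)
    for start in range(len(doc_ids)):              # slide the window over d
        window = doc_ids[start:start + window_size]
        for i in range(len(window)):
            for j in range(i + 1, len(window)):
                u, v = window[i], window[j]
                if u != v:                         # A_ii = 0, per Eq. (2)
                    A[u, v] += 1.0
                    A[v, u] += 1.0
    deg = A.sum(dim=1).clamp(min=1.0)              # degrees; clamp guards isolated nodes
    d_inv_sqrt = deg.pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * A * d_inv_sqrt.unsqueeze(0)

A_norm = build_adjacency([0, 1, 2, 1, 3, 4, 2, 5], window_size=3)
```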
Figure 2: The architecture of the GHRM model, which consists of the following parts: graph construction, graph-based hierarchical matching, readout and matching, and the final relevance score. (1) Graph Construction: the node feature matrix is constructed via the similarity scores between the query and the document, in which each node feature represents the interaction signals between its word embedding and the query term embeddings. (2) Graph-based Hierarchical Matching: through the different grain-sized hierarchical graphs, the words unrelated to the query are first deleted (grey parts of the graph), and several critical nodes which may represent a specific effect on the query (a specific color in the graph) are adaptively preserved via the RSAP mechanism. (3) Readout and Matching: the outputs of the readout layer in each block are combined to calculate the matching score.
3.3 Graph-based Hierarchical Matching

After the graph $\mathcal{G}$ is constructed, we exploit its node features and structural information with hierarchical graph neural networks. Specifically, both the subtle and the general query-document matching signals are captured following the hierarchical matching structure. As shown in Figure 2, the graph-based hierarchical matching architecture consists of multiple blocks, each of which contains a Graph Neural Network (GNN) layer, a Relevance Signal Attention Pooling (RSAP) layer and a readout layer. Through this module, different grain-sized hierarchical matching signals can be captured exhaustively, and the outputs of all blocks are finally combined as the hierarchical output. To be specific, let $t \in [1, T]$ index the $t$-th block of the hierarchical matching, where $T$ is the total number of blocks. We denote the adjacency matrix at the $t$-th block as $\mathbf{A}^t \in \mathbb{R}^{m \times m}$, and the node feature matrix at the $t$-th block as $\mathbf{H}^t \in \mathbb{R}^{m \times M}$, where $m$ is the number of nodes at block $t$ and $M$ is the feature dimension, which equals the number of query terms. As discussed in Section 3.2, we initialise $\mathbf{H}^0$ with the query-document interaction matrix:

$$\mathbf{H}^0 = \mathbf{S} \qquad (3)$$

where $\mathbf{H}^0_i$, the representation of the $i$-th node in the graph, equals $\mathbf{S}_i$, the $i$-th row of the interaction matrix $\mathbf{S}$.

It is crucial for a word to obtain information from its context, since the context is beneficial for understanding the center word. In a document graph, a word node aggregates contextual information from its 1-hop neighborhood, which is formulated as

$$\mathbf{a}^t_i = \sum_{(w_i, w_j) \in \mathcal{E}} \tilde{\mathbf{A}}^t_{ij} \mathbf{W}^t_a \mathbf{H}^t_j \qquad (4)$$

where $\mathbf{a}^t_i \in \mathbb{R}^M$ denotes the message aggregated from neighbors, $\tilde{\mathbf{A}}^t_{ij}$ is the normalized adjacency matrix and $\mathbf{W}^t_a$ is a trainable weight matrix which projects node features into a low-dimension space. The information is propagated to the $t$-hop neighborhood when this operation is repeated $t$ times. Since the node features are query-document interaction signals, the proposed model can capture the subtle signal interactions between nodes within the $t$-hop neighborhood of the document graph via this propagation.

To incorporate the neighborhood information into the word node while preserving its original features, we employ a GRU-like function [15], which adjusts the importance of the current embedding of a node $\mathbf{H}^t_i$ and the information propagated from its neighborhood $\mathbf{a}^t_i$; the updated representation is $\hat{\mathbf{H}}^t = \mathrm{GNN}(\mathbf{H}^t)$, where the GNN function is formulated as

$$\mathbf{z}^t_i = \sigma\big(\mathbf{W}^t_z \mathbf{a}^t_i + \mathbf{U}^t_z \mathbf{H}^t_i + \mathbf{b}^t_z\big) \qquad (5)$$
$$\mathbf{r}^t_i = \sigma\big(\mathbf{W}^t_r \mathbf{a}^t_i + \mathbf{U}^t_r \mathbf{H}^t_i + \mathbf{b}^t_r\big) \qquad (6)$$
$$\tilde{\mathbf{H}}^t_i = \tanh\big(\mathbf{W}^t_h \mathbf{a}^t_i + \mathbf{U}^t_h (\mathbf{r}^t_i \odot \mathbf{H}^t_i) + \mathbf{b}^t_h\big) \qquad (7)$$
$$\hat{\mathbf{H}}^t_i = \tilde{\mathbf{H}}^t_i \odot \mathbf{z}^t_i + \mathbf{H}^t_i \odot (1 - \mathbf{z}^t_i) \qquad (8)$$

where $\sigma(\cdot)$ is the sigmoid function, $\odot$ is the Hadamard product, $\tanh(\cdot)$ is the non-linear function, and all $\mathbf{W}^t_*$, $\mathbf{U}^t_*$ and $\mathbf{b}^t_*$ are trainable parameters. In particular, $\mathbf{r}^t_i$ is the reset gate vector, which is element-wise multiplied with the hidden state to decide which information to forget, while $\mathbf{z}^t_i$ determines which components of the current embedding are pushed into the next iteration. Notably, we also tried another message passing model, GCN [13], in our experiments but did not observe satisfying performance.
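Eqs. (4)–(8) amount to a linear neighborhood aggregation followed by GRU-style gating. A minimal PyTorch sketch of one such layer is given below; the class name is ours, and the bias terms of Eqs. (5)–(7) are carried by the `nn.Linear` modules.

```python
import torch
import torch.nn as nn

class GatedGNNLayer(nn.Module):
    """One message-passing block: Eq. (4) aggregation + Eqs. (5)-(8) gating."""
    def __init__(self, dim):                       # dim = M, the number of query terms
        super().__init__()
        self.W_a = nn.Linear(dim, dim, bias=False)     # projection in Eq. (4)
        self.W_z, self.U_z = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.W_r, self.U_r = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.W_h, self.U_h = nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, H, A_norm):
        a = A_norm @ self.W_a(H)                       # Eq. (4): aggregate neighbors
        z = torch.sigmoid(self.W_z(a) + self.U_z(H))   # Eq. (5): update gate
        r = torch.sigmoid(self.W_r(a) + self.U_r(H))   # Eq. (6): reset gate
        H_tilde = torch.tanh(self.W_h(a) + self.U_h(r * H))   # Eq. (7)
        return H_tilde * z + H * (1 - z)               # Eq. (8): gated fusion

layer = GatedGNNLayer(dim=3)                       # toy graph: 6 nodes, 3 query terms
H_hat = layer(torch.randn(6, 3), torch.softmax(torch.randn(6, 6), dim=1))
```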
Relevance Signal Attention Pooling. Attention mechanisms are widely used in deep learning [26, 27], as they make it possible to focus more on important features than on relatively unimportant ones. It is therefore natural to bring such a mechanism into the graph pooling layer, through which the important graph nodes and different grain-sized interaction matching signals can be explored. Inspired by previous hierarchical GNN methods [4, 14, 31], we introduce a Relevance Signal Attention Pooling (RSAP) mechanism into the pooling layer, obtaining the attention score of each node via a graph neural network. As shown in Figure 2, through the RSAP, the hierarchical graphs at $t = 1$ and $t = 2$ discard the words which are unrelated to the query (the grey nodes in the original graph), and adaptively preserve the critical nodes that represent a specific effect on the query. In detail, the attention scores $\mathbf{P}^t \in \mathbb{R}^m$ of the $m$ nodes in the $t$-th block are calculated as

$$\mathbf{P}^t = \mathrm{GNN}(\hat{\mathbf{H}}^t \cdot \mathbf{W}^t_p) \qquad (9)$$

where GNN is the same graph neural network function as above and $\mathbf{W}^t_p$ is a trainable attention matrix.

Once the attention scores are obtained, we select the important nodes via a hard-attention mechanism. Following [1, 4, 14], we retain a portion of the nodes in the document graph, which represent critical signals at a more general level; through this hard-attention mechanism, the words unrelated to the query are filtered out. The pooling ratio $rate \in (0, 1]$ is a hyperparameter which determines the number of nodes to keep in each RSAP layer. The top $\lceil m \cdot rate \rceil$ nodes are selected based on the values of $\mathbf{P}^t$:

$$idx = \mathrm{top\_rank}(\mathbf{P}^t, \lceil m \cdot rate \rceil) \qquad (10)$$
$$\mathbf{P}^t_{mask} = \mathbf{P}^t_{idx} \qquad (11)$$
$$\mathbf{A}^{t+1} = \mathbf{A}^t_{idx, idx}, \quad \mathbf{H}'^t = \hat{\mathbf{H}}^t_{idx} \qquad (12)$$

where $\mathrm{top\_rank}(\cdot)$ returns the indices of the top $\lceil m \cdot rate \rceil$ values ($\lceil\cdot\rceil$ denotes the round-up operation, e.g., $\lceil 2.3 \rceil = 3$), $\cdot_{idx}$ is an indexing operation, and $\mathbf{P}^t_{mask}$ is the attention mask whose elements are set to 0 if the corresponding nodes are discarded by the $\mathrm{top\_rank}(\cdot)$ operation. $\mathbf{A}^t_{idx, idx}$ is the row-wise and column-wise indexed adjacency matrix, and $\hat{\mathbf{H}}^t_{idx}$ is the row-wise (i.e., node-wise) indexed feature matrix of $\hat{\mathbf{H}}^t$.

Next, a soft-attention mechanism is applied in the pooling operation on the selected important nodes, and the new feature matrix $\mathbf{H}^{t+1}$, which is fed into the $(t+1)$-th block, is calculated as

$$\mathbf{H}^{t+1} = \mathbf{H}'^t \odot \mathbf{P}^t_{mask} \qquad (13)$$

where $\odot$ is the broadcasted element-wise product, realizing the soft-attention operation through which the critical query-document interaction matching signals are further emphasized.

Readout. To aggregate the node features into a fixed-size representation as the query-document relevance signal, we select a fixed-size set of features from $\mathbf{H}^t$ in each block through a $k$-max-pooling strategy along the query dimension:

$$\mathrm{signal}^t = \mathrm{topk}(\mathbf{H}^t) \qquad (14)$$

where $\mathrm{topk}(\cdot)$ operates on the columns of $\mathbf{H}^t$, keeping the top $k$ values for each query term. Hence $\mathrm{signal}^t$ is the output relevance matching signal of the $t$-th block.

After obtaining the fixed-size query-document relevance signal of each block, we combine the relevance matrices of all blocks as the hierarchical relevance matching signals:

$$\mathbf{SIGNAL} = \mathrm{signal}^0 \,\|\, \mathrm{signal}^1 \,\|\, \ldots \,\|\, \mathrm{signal}^t \,\|\, \ldots \,\|\, \mathrm{signal}^T \qquad (15)$$
where $\mathbf{SIGNAL} \in \mathbb{R}^{k(T+1) \times M}$ represents the overall hierarchical relevance signal, which captures both the subtle and the general query-document matching signals simultaneously. In particular, $\mathrm{signal}^0$ denotes $\mathrm{topk}(\mathbf{H}^0)$, computed from the initial similarity matrix of the query and document.
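A compact sketch of the RSAP layer (Eqs. 9–13) and the $k$-max readout (Eq. 14) follows. The `gnn` argument stands for the node-scoring GNN of Eq. (9) (e.g., the gated layer sketched above); the demo substitutes a one-step propagation for brevity, and all names are illustrative.

```python
import math
import torch

def rsap(H, A_norm, gnn, W_p, rate=0.8):
    """One RSAP step. H: [m, M] node features; A_norm: [m, m] normalized
    adjacency; gnn: callable (features, adjacency) -> scores; W_p: [M, 1]."""
    scores = gnn(H @ W_p, A_norm).squeeze(-1)      # Eq. (9): one score per node
    k = max(1, math.ceil(H.size(0) * rate))        # keep ceil(m * rate) nodes
    idx = torch.topk(scores, k).indices            # Eq. (10): top_rank
    mask = scores[idx].unsqueeze(-1)               # Eq. (11): attention mask
    A_next = A_norm[idx][:, idx]                   # Eq. (12): induced subgraph
    H_next = H[idx] * mask                         # Eq. (13): soft attention
    return H_next, A_next

def readout(H, k=40):
    """Eq. (14): per query term, keep the k largest matching signals.
    (The paper implies a fixed output size; min() is a shortcut for small graphs.)"""
    return torch.topk(H, k=min(k, H.size(0)), dim=0).values   # [k, M]

# Demo: a one-step propagation stands in for the gated scoring GNN of Eq. (9).
H, A = torch.randn(6, 3), torch.softmax(torch.randn(6, 6), dim=1)
H1, A1 = rsap(H, A, gnn=lambda X, adj: adj @ X, W_p=torch.randn(3, 1))
signal = readout(H1, k=2)
```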
3.4 Readout and Matching Score Prediction

To convert the hierarchical relevance signals into actual relevance scores for training and inference, we feed the relevance matrix $\mathbf{SIGNAL}$ into further deep neural networks. The elements in each column of $\mathbf{SIGNAL}$ are the relevance signals of one query word, and different query words may have different importance for retrieval, so we weight the relevance signals of each query word with a soft gating network [5]:

$$g_j = \frac{\exp(c \cdot idf_j)}{\sum_{i=1}^{M} \exp(c \cdot idf_i)} \qquad (16)$$

where $g_j$ is the corresponding term weight, $idf_j$ is the inverse document frequency of the $j$-th query term, and $c$ is a trainable parameter. Furthermore, we score each query term with a weight-shared MLP to reduce the number of parameters and avoid over-fitting, and sum the results as the final score:

$$rel(q, d) = \sum_{j=1}^{M} g_j \cdot f(\mathbf{SIGNAL}_j) \qquad (17)$$

where $f(\cdot)$ is an MLP in our model. Finally, the pairwise hinge loss, which is widely used in information retrieval, is adopted to train and optimize the model parameters:

$$\mathcal{L}(q, d^+, d^-) = \max\big(0, 1 - rel(q, d^+) + rel(q, d^-)\big) \qquad (18)$$

where $\mathcal{L}(q, d^+, d^-)$ is the pairwise loss over a triplet of the query $q$, a relevant (positive) document sample $d^+$, and an irrelevant (negative) document sample $d^-$.
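The scoring head of Eqs. (16)–(18) can be sketched as below; the hidden width of the weight-shared MLP is our choice, as the paper does not pin it down here.

```python
import torch
import torch.nn as nn

class Scorer(nn.Module):
    """Maps SIGNAL ([k*(T+1), M]) plus per-term idf values to rel(q, d)."""
    def __init__(self, signal_len, hidden=32):     # hidden width is our assumption
        super().__init__()
        self.mlp = nn.Sequential(                  # weight-shared across query terms
            nn.Linear(signal_len, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.c = nn.Parameter(torch.tensor(1.0))   # trainable scale c of Eq. (16)

    def forward(self, SIGNAL, idf):
        g = torch.softmax(self.c * idf, dim=0)             # Eq. (16): idf term gates
        term_scores = self.mlp(SIGNAL.t()).squeeze(-1)     # f(SIGNAL_j) per term
        return (g * term_scores).sum()                     # Eq. (17): rel(q, d)

def hinge_loss(rel_pos, rel_neg):
    return torch.clamp(1.0 - rel_pos + rel_neg, min=0.0)   # Eq. (18)

scorer = Scorer(signal_len=80)                     # e.g., k = 40, T = 1
loss = hinge_loss(scorer(torch.randn(80, 3), torch.rand(3)),
                  scorer(torch.randn(80, 3), torch.rand(3)))
```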
4 EXPERIMENTS

In this section, we conduct experiments on two ad-hoc retrieval datasets to answer the following research questions:

• RQ1: How does GHRM perform compared with previous relevance matching baselines?
• RQ2: How do the different grain-sized hierarchical signals affect the performance of the model?
• RQ3: How does GHRM perform under different hyperparameter settings?
4.1 Experimental Setup

Datasets. We briefly introduce the two datasets used in our experiments, Robust04 and ClueWeb09-B.

• Robust04 contains 250 queries and 0.47M documents from TREC Disks 4 and 5 (https://trec.nist.gov/data/cd45/index.html).
• ClueWeb09-B is a subset of the full ClueWeb09 collection (https://lemurproject.org/clueweb09/). It contains 50M documents collected from web pages and 200 queries, whose topics are taken from the TREC Web Tracks 2009–2012.

In both datasets, the training data consists of query-document pairs, where each query has a most-related document (i.e., the ground-truth label). Each query has two parts, a short keyword title and a longer text description; we only utilize the title in our experiments. Table 1 summarises the statistics of the two datasets.

Baselines. To evaluate the performance of the proposed GHRM, we compare it with a variety of baselines, including traditional language models (the query likelihood model and BM25), deep relevance matching models (MatchPyramid, DRMM, KNRM, PACRR and Co-PACRR) and a pre-trained BERT-based method (BERT-MaxP). Each baseline is briefly introduced as follows:

• QLM (query likelihood model) [33] is based on Dirichlet smoothing and achieved convincing results before deep learning techniques appeared.
• BM25 [23] is a well-known and effective bag-of-words model based on the probabilistic retrieval framework.
• Pyramid (MatchPyramid) [20] first builds the interaction matrix between a query and a document, then employs a CNN to process the matrix, extracting matching features of different orders.
Table 1: Statistics of the datasets.

Dataset      | Genre    | # Queries | # Documents | Avg. doc length
Robust04     | news     | 250       | 0.47M       | 460
ClueWeb09-B  | webpages | 200       | 50M         | 1506
• DRMM [5] is the pioneering work on relevance matching approaches. It performs histogram pooling over the local query-document interaction matrix to summarize the relevance features.
• KNRM [29] applies kernel pooling to explore the matching features in the interaction matrix.
• PACRR [10] redesigns CNNs in terms of kernel size and convolution direction to make them more suitable for the IR task, and finally utilizes an RNN to capture the long-term dependency over different signals.
• Co-PACRR [11] is a variant of PACRR which takes contextual matching signals into account and achieves better results than PACRR.
• BERT-MaxP [2] utilizes BERT to deeply understand the text for the relevance matching task, demonstrating that contextual text representations from BERT are more effective than traditional word embeddings.
Implementation details. For text preprocessing, we use the WordNet toolkit to whitespace-tokenise, lowercase, and lemmatise all words in the documents and queries of the two datasets. We then discard words that appear fewer than ten times in the corpus, a standard preprocessing step in NLP tasks. Following previous work [10, 17], we keep the first 300 and 500 words of each document, and the first 4 and 5 words of each query, for Robust04 and ClueWeb09-B respectively. The lengths differ between the two datasets simply because texts in ClueWeb09-B are generally longer than those in Robust04 and may contain more useful information. We apply zero-padding when the length of a document or query is below the truncation length. In addition, we initialize the word embeddings with the 300-dimension output vectors of the Continuous Bag-of-Words (CBOW) model [18] on both datasets. Except for the models that do not need word embeddings, we use the same initialized embeddings for all models to keep the comparison fair. All baseline models are implemented closely following the settings reported in their original papers.

Following the setting of previous work [17], we divide both datasets into three parts: 60% of the data for training, 20% for validation and the rest for testing. Based on this split, we randomly divide the datasets five times to generate five folds with different data splits (i.e., the training data is different in each fold), utilize the data in a round-robin fashion as MacAvaney et al. [17] do, and report the average result over the five folds as the final performance.

The hyperparameters of our model include the number of blocks $T$ in the graph-based hierarchical matching module, the pooling ratio $rate$ of $\mathrm{top\_rank}(\cdot)$ in the RSAP layer, the $k$ value of $\mathrm{topk}(\cdot)$ in the readout layer, the learning rate and the batch size. They are all tuned on the validation set using grid search. In the base GHRM model, we set the number of blocks $T$ to 2, the pooling ratio $rate$ to 0.8, and the number of readout values $k$ to 40. We train the model with a learning rate of 0.001 using the Adam optimizer [12] for 300 epochs. Each epoch has 32 batches, and each batch contains 16 positive sampled pairs and 16 negative pairs. At test time we rerank the top 150 candidates generated by the BM25 Anserini toolkit (https://github.com/castorini/anserini), which is a standard way to evaluate models in the IR task. All experiments are conducted using PyTorch 1.5.1 on a Linux server equipped with 4 NVIDIA Tesla V100S GPUs (with 32GB memory each) and 12 Intel Xeon Silver 4214 CPUs (@2.20GHz).

Evaluation metrics. We utilize two evaluation metrics in our experiments: the normalised discounted cumulative gain at rank 20 (nDCG@20) and the precision at rank 20 (P@20), both of which are commonly used in this kind of ranking task.
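For concreteness, the preprocessing pipeline described above might look as follows; the helper name, the `<pad>` token, and the exact filtering order are our assumptions (this sketch requires NLTK's WordNet data to be downloaded).

```python
from collections import Counter
from nltk.stem import WordNetLemmatizer   # needs nltk.download('wordnet') once

lemmatizer = WordNetLemmatizer()

def preprocess(texts, max_len=300, min_count=10):
    """Whitespace-tokenise, lowercase, lemmatise, drop rare words, then
    truncate to max_len words and zero-pad shorter texts."""
    tokenized = [[lemmatizer.lemmatize(w) for w in t.lower().split()]
                 for t in texts]
    counts = Counter(w for toks in tokenized for w in toks)
    kept = [[w for w in toks if counts[w] >= min_count] for toks in tokenized]
    return [toks[:max_len] + ["<pad>"] * max(0, max_len - len(toks))
            for toks in kept]
```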
4.2 Overall Performance (RQ1)

The performance of each model on the two datasets is shown in Table 2.

Table 2: The performance of our proposed model GHRM and the baselines. The highest performance on each dataset and metric is highlighted in boldface. Significant performance degradation with respect to GHRM is indicated (−) with p-value ≤ 0.05.

Model     | Robust04            | ClueWeb09-B
          | nDCG@20  | P@20     | nDCG@20  | P@20
QL        | 0.415−   |          |          |
BM25      | 0.418−   |          |          |
MP        | 0.318−   |          |          |
DRMM      | 0.406−   |          |          |
KNRM      | 0.415−   |          |          |
PACRR     | 0.415−   |          |          |
Co-PACRR  | 0.426−   |          |          |
BERT-MaxP |          |          | 0.293    |
GHRM      | 0.450    | 0.389    | 0.312    | 0.359

Based on these results, we make the following observations:

• First of all, GHRM outperforms both traditional language models and deep relevance matching models by a significant margin. To be specific, compared with the strong baseline Co-PACRR, GHRM improves nDCG@20 and P@20 by 5.6% and 2.9% respectively on Robust04. On ClueWeb09-B, it achieves an improvement of 15.1% on nDCG@20 and 10.8% on P@20 over another convincing model, DRMM. Two factors may contribute to the improvement. On the one hand, applying graph neural networks captures the subtle document-level word relationships by extracting all non-consecutively distributed relevant information, which previous CNN-based models cannot capture. On the other hand, the hierarchical architecture attentively discards useless information from noisy neighbors and preserves the most important information, thus capturing the different grain-sized hierarchical relevance matching signals. Owing to these two advantages, the relevance matching signals are obtained comprehensively and the performance of the model is enhanced.

• Compared with BERT-MaxP, the results show that even though GHRM does not depend on pre-trained contextual representations, it achieves comparable performance: GHRM performs better than BERT-MaxP on ClueWeb09-B and worse on Robust04. The reason may lie in several characteristic differences between the two datasets. On the one hand, the language style of Robust04 is more formal, which favors the pre-trained representations of BERT-MaxP. On the other hand, the documents in ClueWeb09-B are relatively long, which may weaken the performance of BERT-MaxP since it restricts the input sequence length to a maximum of only 512 tokens; meanwhile, GHRM's advantage in capturing subtle long-distance word relationships shows clearly on ClueWeb09-B.
• We also observe that the performance of local relevance matching models fluctuates only slightly around that of BM25, except for DRMM and KNRM on ClueWeb09-B. This is probably because DRMM and KNRM utilize global pooling strategies while the others only focus on local relationships. It further validates that considering only local interactions is insufficient for the ad-hoc retrieval task; the more exhaustive information contained in the different grain-sized hierarchical matching signals may also play a central role.

• Another observation is that the traditional approaches QL and BM25 still outperform some deep learning methods, which demonstrates that the exact matching signal is significant for ad-hoc retrieval, as pointed out by Guo et al. [5]. That is why we preserve the initial similarity matrix of the query and document as the first block of matching signals in GHRM. Besides, traditional models also avoid overfitting the training data.
4.3 Effect of Hierarchical Signals (RQ2)

To verify the effectiveness of hierarchical signals in the ad-hoc retrieval task, we conduct a comparison experiment to study what effect the different grain-sized hierarchical signals have on the model. In detail, we discard all pooling layers (i.e., RSAP layers) in GHRM, so that all words in the document are considered equally important along the multi-layer graph neural networks. In addition, we ensure that the graph structure is fixed during the whole training process, so no hierarchical signal is generated. We denote this model GHRM-nopool. For a fair comparison, we keep all other settings the same in GHRM and GHRM-nopool.

As illustrated in Table 3, GHRM outperforms GHRM-nopool by a significant margin on both datasets and evaluation metrics.
Table 3: The comparison of performance between GHRM and GHRM-nopool on Robust04 and ClueWeb09-B. The improvement in terms of percentage is shown in the last row.

Model        | Robust04            | ClueWeb09-B
             | nDCG@20  | P@20     | nDCG@20  | P@20
GHRM-nopool  | 0.437    | 0.378    | 0.300    | 0.351
GHRM         | 0.450    | 0.389    | 0.312    | 0.359
Improv.      | 2.97%    | 2.91%    | 4.00%    | 2.28%
Specifically, on Robust04, GHRM outperforms the non-hierarchical model (i.e., GHRM-nopool) by 2.97% and 2.91% on nDCG@20 and P@20 respectively. On ClueWeb09-B, GHRM improves the performance by 4.00% on nDCG@20 and 2.28% on P@20. This reveals that the different grain-sized hierarchical matching signals obtained via GHRM are critical for retrieving relevant documents. It is worth mentioning that even without the hierarchical signals, GHRM-nopool still outperforms the traditional language models and the deep relevance matching models substantially, which demonstrates the superiority of modeling document-level word relationships over local relevance matching methods. On top of this subtle information, the various grain-sized hierarchical signals act as a strong supplement at a more general level, further improving relevance matching.
4.4 Hyperparameter Analysis (RQ3)

In this section, we discuss the specific hyperparameter settings of GHRM, including the pooling ratio $rate$, the number of blocks $T$ and the value of $k$ in the readout layer.

Pooling ratio $rate$. This is an important hyperparameter in our proposed model since it controls the number of critical nodes selected in each RSAP layer. For example, if $rate = 0.4$, we select 40% of the nodes and discard the rest in the RSAP layer of each block. As shown in Figure 3, the performance grows continuously as $rate$ ranges from 0.4 to 0.8 and then decreases slightly when $rate$ exceeds 0.8 on both datasets. Our observations are as follows:

• GHRM with a pooling ratio of 0.8 achieves the best results on both datasets, which could be due to a suitable number of deleted nodes: at this rate, the RSAP layers yield suitable hierarchical signals, since the nodes that contribute relatively little to model training are deleted.

• A low pooling ratio does not improve the performance of the model. For example, the model with $rate = 0.4$ has the worst nDCG@20 performance of 0.297 and 0.381 on ClueWeb09-B and Robust04 respectively. This is probably because some valuable nodes are deleted and the graph topology becomes sparse, preventing the model from capturing the long-distance word relationships.
Figure 3: Influence of different pooling ratios. The model peaks at the best results on both datasets and evaluation metrics when $rate = 0.8$.
Figure 4: Influence of the number of blocks in the graph-based hierarchical matching module. The best performance is achieved with $T = 2$ on both datasets and evaluation metrics.

• When $rate$ equals 1.0, we denote the model GHRM-soft; it can be regarded as matching the query and document only with the soft-attention mechanism, since no nodes are discarded in each layer. Note that GHRM-soft is not the same as GHRM-nopool: the signals from all nodes are processed equally in GHRM-nopool, whereas soft-attention scores distinguish the nodes in GHRM-soft. The performance of the two models implies that different signals should be weighted attentively before being combined into the relevance score.

• The overall results illustrate that the hierarchical matching signals obtained with a proper pooling ratio are important for ad-hoc retrieval. With a proper pooling ratio, GHRM can mutually capture both subtle and general interaction information between the query and the document, making the matching signals more exhaustive.

Number of blocks $T$ in the graph-based hierarchical matching module. The number of blocks $T$ is also a critical hyperparameter in GHRM, as it decides the extent of the different grain sizes learned in the hierarchical matching signals. We evaluate GHRM with varying numbers of blocks, as shown in Figure 4, and observe the following:

• An improvement can be seen from $T = 0$ to $T = 1$ in Figure 4. When $T = 0$, the matching signal is only the one obtained from the initial similarity matrix of the query-document pair. This reveals that the long-distance information in document-level word relationships, captured via the graph neural network, is significant for query-document matching.
• The performance grows from $T = 1$ to $T = 2$, which further illustrates the positive effect of different grain-sized hierarchical matching signals on the model.

• We can also see that the performance decreases when $T$ is over 2. The reason could be that nodes receive noisy information from high-order neighbors, which deteriorates the performance of the model as the number of blocks continues to grow; the 2-hop neighborhood information is sufficient to capture the most significant word relationships.

Figure 5: Influence of the number of top $k$ values in the readout layer. The best performance is achieved with $k = 40$ on both datasets and evaluation metrics.

Number of top $k$ values in the readout layer. We also explore the effect of the size of the features output by the readout layer of each block in the hierarchical matching module. Figure 5 summarises the performance for different $k$ values of $\mathrm{topk}(\cdot)$ in Equation 14. From the figure, we have the following observations:

• There is a moderate growth of performance as $k$ ranges from 10 to 40, which implies that some important hierarchical matching signals are wrongly discarded when the $k$ value is small. As $k$ is enlarged, GHRM can distinguish more relevant hierarchical matching signals from relatively irrelevant ones.

• The performance begins to decline when $k$ continues to grow, which demonstrates that a large readout size may bring in noisy information, such as the bias of the document length.

• It is worth noting that almost all variants of GHRM with different $k$ values (except the smallest) exceed the baselines in Table 2. This implies that different grain-sized graph-based hierarchical signals are effective for correctly matching the query and document.

5 CONCLUSION

In this paper, we introduce a graph-based hierarchical relevance matching method for ad-hoc retrieval named GHRM. By utilizing hierarchical graph neural networks to model different grain-sized matching signals, we can capture the subtle and general hierarchical interaction matching signals mutually. Extensive experiments on two representative ad-hoc retrieval benchmarks demonstrate the effectiveness of GHRM over various baselines, which validates the advantage of applying graph-based hierarchical matching signals to ad-hoc retrieval.
ACKNOWLEDGMENTS

This work is supported by the National Key Research and Development Program (2018YFB1402605, 2018YFB1402600), the National Natural Science Foundation of China (U19B2038, 61772528), and the Beijing Natural Science Foundation (4182066).
REFERENCES
[1] Cătălina Cangea, Petar Veličković, Nikola Jovanović, Thomas Kipf, and Pietro Liò. 2018. Towards sparse hierarchical graph classifiers. arXiv preprint arXiv:1811.01287 (2018).
[2] Zhuyun Dai and Jamie Callan. 2019. Deeper text understanding for IR with contextual neural language modeling. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 985–988.
[3] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems. 3844–3852.
[4] Hongyang Gao and Shuiwang Ji. 2019. Graph U-Nets. In Proceedings of the 36th International Conference on Machine Learning.
[5] Jiafeng Guo, Yixing Fan, Qingyao Ai, and W Bruce Croft. 2016. A deep relevance matching model for ad-hoc retrieval. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. 55–64.
[6] Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems. 1024–1034.
[7] Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional neural network architectures for matching natural language sentences. In Advances in Neural Information Processing Systems. 2042–2050.
[8] Fenyu Hu, Yanqiao Zhu, Shu Wu, Weiran Huang, Liang Wang, and Tieniu Tan. 2020. GraphAIR: Graph representation learning with neighborhood aggregation and interaction. Pattern Recognition (2020), 107745.
[9] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. 2333–2338.
[10] Kai Hui, Andrew Yates, Klaus Berberich, and Gerard de Melo. 2017. PACRR: A Position-Aware Neural IR Model for Relevance Matching. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 1049–1058.
[11] Kai Hui, Andrew Yates, Klaus Berberich, and Gerard De Melo. 2018. Co-PACRR: A context-aware neural IR model for ad-hoc retrieval. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. 279–287.
[12] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations (ICLR).
[13] Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In International Conference on Learning Representations (ICLR).
[14] Junhyun Lee, Inyeop Lee, and Jaewoo Kang. 2019. Self-attention graph pooling. In Proceedings of the 36th International Conference on Machine Learning. 6661–6670.
[15] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. 2016. Gated graph sequence neural networks. In International Conference on Learning Representations (ICLR).
[16] Zekun Li, Zeyu Cui, Shu Wu, Xiaoyu Zhang, and Liang Wang. 2019. Fi-GNN: Modeling feature interactions via graph neural networks for CTR prediction. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 539–548.
[17] Sean MacAvaney, Andrew Yates, Arman Cohan, and Nazli Goharian. 2019. CEDR: Contextualized embeddings for document ranking. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 1101–1104.
[18] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111–3119.
[19] Yifan Nie, Yanling Li, and Jian-Yun Nie. 2018. Empirical study of multi-level convolution models for IR based on representations and interactions. In Proceedings of the 2018 ACM SIGIR International Conference on Theory of Information Retrieval. 59–66.
[20] Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, and Xueqi Cheng. 2016. Text matching as image recognition. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. 2793–2799.
[21] Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Jingfang Xu, and Xueqi Cheng. 2017. DeepRank: A new deep architecture for relevance ranking in information retrieval. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. 257–266.
[22] Jinfeng Rao, Wei Yang, Yuhao Zhang, Ferhan Ture, and Jimmy Lin. 2019. Multi-perspective relevance matching with hierarchical convnets for social media search. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 232–240.
[23] Stephen E Robertson and Steve Walker. 1994. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In SIGIR '94. Springer, 232–241.
[24] François Rousseau, Emmanouil Kiagias, and Michalis Vazirgiannis. 2015. Text categorization as a graph classification problem. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 1702–1712.
[25] Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014. A latent semantic model with convolutional-pooling structure for information retrieval. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management. 101–110.
[26] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
[27] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention Networks. In International Conference on Learning Representations (ICLR).
[28] Shu Wu, Yuyuan Tang, Yanqiao Zhu, Liang Wang, Xing Xie, and Tieniu Tan. 2019. Session-based recommendation with graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 346–353.
[29] Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. 2017. End-to-end neural ad-hoc ranking with kernel pooling. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 55–64.
[30] Liang Yao, Chengsheng Mao, and Yuan Luo. 2019. Graph convolutional networks for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 7370–7377.
[31] Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. 2018. Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems. 4800–4810.
[32] Feng Yu, Yanqiao Zhu, Qiang Liu, Shu Wu, Liang Wang, and Tieniu Tan. 2020. TAGNN: Target Attentive Graph Neural Networks for Session-based Recommendation. In SIGIR. 1921–1924.
[33] Chengxiang Zhai and John Lafferty. 2004. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems (TOIS) 22, 2 (2004), 179–214.
[34] Mengqi Zhang, Shu Wu, Meng Gao, Xin Jiang, Ke Xu, and Liang Wang. 2020. Personalized graph neural networks with attention mechanism for session-aware recommendation. IEEE Transactions on Knowledge and Data Engineering (2020).
[35] Ting Zhang, Bang Liu, Di Niu, Kunfeng Lai, and Yu Xu. 2018. Multiresolution graph attention networks for relevance matching. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 933–942.
[36] Yufeng Zhang, Xueli Yu, Zeyu Cui, Shu Wu, Zhongzhen Wen, and Liang Wang. 2020. Every Document Owns Its Structure: Inductive Text Classification via Graph Neural Networks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
[37] Yufeng Zhang, Jinghao Zhang, Zeyu Cui, Shu Wu, and Liang Wang. 2021. A Graph-based Relevance Matching Model for Ad-hoc Retrieval. arXiv preprint arXiv:2101.11873 (2021).