LET: Linguistic Knowledge Enhanced Graph Transformer for Chinese Short Text Matching
Boer Lyu, Lu Chen*, Su Zhu, Kai Yu*
State Key Laboratory of Media Convergence Production Technology and Systems
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
{boerlv, chenlusz, paul2204, kai.yu}@sjtu.edu.cn
* The corresponding authors are Lu Chen and Kai Yu.

Abstract
Chinese short text matching is a fundamental task in natural language processing. Existing approaches usually take Chinese characters or words as input tokens. They have two limitations: 1) some Chinese words are polysemous, and semantic information is not fully utilized; 2) some models suffer potential issues caused by word segmentation. Here we introduce HowNet as an external knowledge base and propose a Linguistic knowledge Enhanced graph Transformer (LET) to deal with word ambiguity. Additionally, we adopt the word lattice graph as input to maintain multi-granularity information. Our model is also complementary to pre-trained language models. Experimental results on two Chinese datasets show that our models outperform various typical text matching approaches. Ablation study also indicates that both semantic information and multi-granularity information are important for text matching modeling.
Introduction

Short text matching (STM) is generally regarded as a task of paraphrase identification or sentence semantic matching. Given a pair of sentences, the goal of matching models is to predict their semantic similarity. It is widely used in question answering systems (Liu, Rong, and Xiong 2018) and dialogue systems (Gao et al. 2019; Yu et al. 2014).

Recent years have seen great progress in deep learning methods for text matching (Mueller and Thyagarajan 2016; Gong, Luo, and Zhang 2017; Chen et al. 2017; Lan and Xu 2018). However, almost all of these models were initially proposed for English text matching. For Chinese language tasks, early work either uses Chinese characters as input to the model, or first segments each sentence into words and then takes these words as input tokens. Although character-based models can overcome the problem of data sparsity to some degree (Li et al. 2019a), their main drawback is that explicit word information is not fully utilized, even though it has been demonstrated to be useful for semantic matching (Li et al. 2019b).

However, a large number of Chinese words are polysemous, which brings great difficulties to semantic understanding (Xu et al. 2016). Word polysemy is more of an issue in short text than in long text, because short text usually has less contextual information, so it is extremely hard for models to capture the correct meaning.

Figure 1: An example of word segmentation and the potential word ambiguity (sentence-1: 他的话有些水分, "His words are a bit exaggerated"; sentence-2: 他在吹牛, "He is bragging"; the word 水分 has the senses 水汽 (moisture) and 夸张 (exaggeration), the word 吹牛 has the sense 夸耀 (brag), and related sememes include 信息 (information), 夸大 (boast), 物质 (physical) and 湿度 (dampness)).

As shown in Fig. 1, the highlighted word in sentence-1 actually has two meanings: one describes bragging (exaggeration) and the other is moisture. Intuitively, if other words in the context have similar or related meanings, the probability of the corresponding sense increases. To integrate semantic information of words, we introduce HowNet (Dong and Dong 2003) as an external knowledge base. In the view of HowNet, a word may have multiple senses/meanings, and each sense is represented by several sememes. For instance, the first sense, exaggeration, indicates some boast information in his words; therefore it has the sememes information and boast. Similarly, we can find the sememe boast describing the sense brag, which belongs to the word "ChuiNiu (bragging)" in sentence-2. In this way, the model can better determine the sense of words and perceive that the two sentences probably have the same meaning.

Furthermore, word-based models often encounter potential issues caused by word segmentation. If the word segmentation fails to output "ChuiNiu (bragging)" in sentence-2, we will lose useful sense information. In Chinese, "Chui (blowing)" "Niu (cattle)" is a bad segmentation, which deviates from the correct meaning of "ChuiNiu (bragging)". To tackle this problem, many researchers propose word lattice graphs (Lai et al. 2019; Li et al. 2020; Chen et al. 2020b), in which all words found in the word bank are retained so that various segmentation paths are kept. It has been shown that such multi-granularity information is important for text matching.

In this paper, we propose a Linguistic knowledge Enhanced graph Transformer (LET) to exploit both semantic information and multi-granularity information. LET takes a pair of word lattice graphs as input. Since keeping all possible words would introduce a lot of noise, we use several segmentation paths to form our lattice graph and construct a set of senses for each word. Based on HowNet, each sense is represented by several sememes. In the input module, starting from the pre-trained sememe embeddings provided by OpenHowNet (Qi et al. 2019), we obtain the initial sense representation using a multi-dimensional graph attention transformer (MD-GAT, see Sec. 3.1). We also obtain the initial word representation by aggregating features from the character-level transformer encoder using Att-Pooling (see Sec. 4.1). This is followed by SaGT layers (see Sec. 4.2), which fuse information between words and senses: in each layer, we first update the sense representations and then update the word representations using MD-GAT. In the sentence matching layer (see Sec. 4.3), we convert the word representations back to the character level and exchange messages between the two texts. Moreover, LET can be combined with pre-trained language models, e.g. BERT (Devlin et al. 2019); it can be regarded as a method to integrate word and sense information into pre-trained language models during the fine-tuning phase.

Our contributions are summarized as follows: a) We propose a novel enhanced graph transformer using linguistic knowledge to moderate word ambiguity.
b) An empirical study on two Chinese datasets shows that our model outperforms not only typical text matching models but also the pre-trained model BERT as well as some variants of BERT. c) We demonstrate that both semantic information and multi-granularity information are important for text matching modeling, especially on shorter texts.

Related Work

Deep Text Matching
Models based on deep learning have been widely adopted for short text matching. They fall into two categories: representation-based methods (He et al. 2016; Lai et al. 2019) and interaction-based methods (Wang, Hamza, and Florian 2017; Chen et al. 2017). Most representation-based methods are based on a Siamese architecture, which uses two symmetrical networks (e.g. LSTMs and CNNs) to extract high-level features from the two sentences. These features are then compared to predict text similarity. Interaction-based models incorporate interaction features between all word pairs in the two sentences, and they generally perform better than representation-based methods. Our proposed method belongs to the interaction-based methods.
Pre-trained Language Models
Pre-trained language models, e.g. BERT, have shown powerful performance on various natural language processing (NLP) tasks, including text matching. For Chinese text matching, BERT takes a pair of sentences as input and treats each Chinese character as a separate input token, so word information is ignored. To tackle this problem, some Chinese variants of the original BERT have been proposed, e.g. BERT-wwm (Cui et al. 2019) and ERNIE (Sun et al. 2019). They take word information into consideration through a whole-word masking mechanism during pre-training. However, pre-training such a word-aware BERT requires a lot of time and resources. Our model instead takes a pre-trained language model as initialization and utilizes word information to fine-tune it.
Background

In this section, we introduce graph attention networks (GATs) and HowNet, which are the basis of our proposed models in the next section.
Graph Attention Networks

Graph neural networks (GNNs) (Scarselli et al. 2008) are widely applied in various NLP tasks, such as text classification (Yao, Mao, and Luo 2019), text generation (Zhao et al. 2020), dialogue policy optimization (Chen et al. 2018c,b, 2019, 2020c) and dialogue state tracking (Chen et al. 2020a; Zhu et al. 2020). GAT is a special type of GNN that operates on graph-structured data with attention mechanisms. Given a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ and $\mathcal{E}$ are the set of nodes $x_i$ and the set of edges respectively, $\mathcal{N}^+(x_i)$ denotes the set including the node $x_i$ itself and the nodes directly connected to $x_i$.

Each node $x_i$ in the graph has an initial feature vector $h_i^0 \in \mathbb{R}^d$, where $d$ is the feature dimension. The representation of each node is iteratively updated by the graph attention operation. At the $l$-th step, each node $x_i$ aggregates context information by attending over its neighbors and itself. The updated representation $h_i^l$ is calculated as the weighted average of the connected nodes,

$$h_i^l = \sigma\Big(\sum_{x_j \in \mathcal{N}^+(x_i)} \alpha_{ij}^l \big(W^l h_j^{l-1}\big)\Big), \quad (1)$$

where $W^l \in \mathbb{R}^{d \times d}$ is a learnable parameter and $\sigma(\cdot)$ is a nonlinear activation function, e.g. ReLU. The attention coefficient $\alpha_{ij}^l$ is the normalized similarity of the embeddings of the two nodes $x_i$ and $x_j$ in a unified space, i.e.

$$\alpha_{ij}^l = \mathrm{softmax}_j\, f_{\mathrm{sim}}^l\big(h_i^{l-1}, h_j^{l-1}\big) = \mathrm{softmax}_j \big(W_q^l h_i^{l-1}\big)^{\top}\big(W_k^l h_j^{l-1}\big), \quad (2)$$

where $W_q^l$ and $W_k^l \in \mathbb{R}^{d \times d}$ are learnable projection parameters.

Note that, in Eq. (2), $\alpha_{ij}^l$ is a scalar, which means that all dimensions of $h_j^{l-1}$ are treated equally. This may limit the capacity to model complex dependencies. Following Shen et al. (2018), we replace the vanilla attention with multi-dimensional attention. Instead of computing a single scalar score, for each embedding $h_j^{l-1}$ it first computes a feature-wise score vector and then normalizes it with a feature-wise multi-dimensional softmax (MD-softmax),

$$\alpha_{ij}^l = \text{MD-softmax}_j\big(\hat{\alpha}_{ij}^l + f_m^l(h_j^{l-1})\big), \quad (3)$$

where $\hat{\alpha}_{ij}^l$ is the scalar calculated by the similarity function $f_{\mathrm{sim}}^l(\cdot)$ in Eq. (2), and $f_m^l(\cdot)$ outputs a vector. The addition in the above equation means the scalar is added to every element of the vector. $\hat{\alpha}_{ij}^l$ models the pair-wise dependency of the two nodes, while $f_m^l(\cdot)$ estimates the contribution of each feature dimension of $h_j^{l-1}$,

$$f_m^l(h_j^{l-1}) = W_2^l\, \sigma\big(W_1^l h_j^{l-1} + b_1^l\big) + b_2^l, \quad (4)$$

where $W_1^l$, $W_2^l$, $b_1^l$ and $b_2^l$ are learnable parameters. With the score vector $\alpha_{ij}^l$, Eq. (1) is accordingly revised as

$$h_i^l = \sigma\Big(\sum_{x_j \in \mathcal{N}^+(x_i)} \alpha_{ij}^l \odot \big(W^l h_j^{l-1}\big)\Big), \quad (5)$$

where $\odot$ represents the element-wise product of two vectors. For brevity, we use MD-GAT$(\cdot)$ to denote this updating process with the multi-dimensional attention mechanism, and rewrite Eq. (5) as

$$h_i^l = \text{MD-GAT}\big(h_i^{l-1}, \{h_j^{l-1} \mid x_j \in \mathcal{N}^+(x_i)\}\big). \quad (6)$$

After $L$ steps of updating, each node finally has a context-aware representation $h_i^L$. To achieve a stable training process, we also employ a residual connection followed by layer normalization between two graph attention layers.
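To make the update above concrete, the sketch below implements one multi-dimensional graph attention step (Eqs. (2)-(5)) in PyTorch. It is an illustrative reading of the equations rather than the authors' released code: the dense adjacency mask, the ReLU activation, and the layout of $f_m$ are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDGAT(nn.Module):
    """One multi-dimensional graph attention step (a sketch of Eqs. 2-5)."""
    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)   # W_q
        self.w_k = nn.Linear(dim, dim, bias=False)   # W_k
        self.w_v = nn.Linear(dim, dim, bias=False)   # W
        self.f_m = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, h, adj):
        # h:   [n, d] node features
        # adj: [n, n] with adj[i, j] = 1 if x_j is in N+(x_i) (including i itself)
        scalar = self.w_q(h) @ self.w_k(h).t()               # hat-alpha_ij, [n, n]
        feat = self.f_m(h).unsqueeze(0)                      # f_m(h_j), [1, n, d]
        scores = scalar.unsqueeze(-1) + feat                 # [n, n, d]
        scores = scores.masked_fill(adj.unsqueeze(-1) == 0, float("-inf"))
        alpha = torch.softmax(scores, dim=1)                 # feature-wise softmax over j
        out = (alpha * self.w_v(h).unsqueeze(0)).sum(dim=1)  # sum_j alpha_ij * (W h_j)
        return F.relu(out)                                   # [n, d]
```

The residual connection and layer normalization mentioned above would wrap this module when several steps are stacked.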
HowNet

Figure 2: An example of the HowNet structure: the word 苹果 (Apple brand / apple) has two senses, Apple brand, described by sememes such as computer, PatternValue, able, bring and SpecificBrand, and apple, described by sememes such as fruit.

HowNet (Dong and Dong 2003) is an external knowledge base that manually annotates each Chinese word sense with one or more relevant sememes. The philosophy of HowNet regards the sememe as an atomic semantic unit. Different from WordNet (Miller 1995), it emphasizes that the parts and attributes of a concept can be well represented by sememes. HowNet has been widely utilized in many NLP tasks such as word similarity computation (Liu 2002), sentiment analysis (Fu et al. 2013), word representation learning (Niu et al. 2017) and language modeling (Gu et al. 2018).

An example is illustrated in Fig. 2. The word "Apple" has two senses, Apple brand and apple. The sense Apple brand has five sememes, computer, PatternValue, able, bring and SpecificBrand, which describe the exact meaning of the sense.
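As a small illustration of the word-sense-sememe hierarchy described above, the toy mapping below mirrors the Apple example in Fig. 2; it is a plain dictionary stand-in for illustration, not the OpenHowNet API.

```python
# Toy word -> sense -> sememe mapping, mirroring the "Apple" example in Fig. 2.
HOWNET_TOY = {
    "苹果": {                                # Apple brand / apple (the fruit)
        "Apple brand": ["computer", "PatternValue", "able", "bring", "SpecificBrand"],
        "apple": ["fruit"],
    },
}

def senses(word):
    """Return (sense, sememes) pairs for a word; empty list if the word is not covered."""
    return list(HOWNET_TOY.get(word, {}).items())

print(senses("苹果"))
```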
Model

First, we define the Chinese short text matching task in a formal way. Given two Chinese sentences $C^a = \{c_1^a, c_2^a, \cdots, c_{T_a}^a\}$ and $C^b = \{c_1^b, c_2^b, \cdots, c_{T_b}^b\}$, the goal of a text matching model $f(C^a, C^b)$ is to predict whether the semantic meanings of $C^a$ and $C^b$ are equal. Here, $c_t^a$ and $c_{t'}^b$ represent the $t$-th and $t'$-th Chinese character in the two sentences respectively, and $T_a$ and $T_b$ denote the numbers of characters in the sentences.

In this paper, we propose a linguistic knowledge enhanced matching model. Instead of segmenting each sentence into a single word sequence, we use three segmentation tools and keep all their segmentation paths to form a word lattice graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ (see Fig. 4 (a)). $\mathcal{V}$ is the set of nodes and $\mathcal{E}$ is the set of edges. Each node $x_i \in \mathcal{V}$ corresponds to a word $w_i$, which is a character subsequence starting from the $t_1$-th character and ending at the $t_2$-th character of the sentence. As introduced in Sec. 1, we can obtain all senses of a word $w_i$ by retrieving HowNet. For two nodes $x_i \in \mathcal{V}$ and $x_j \in \mathcal{V}$, if $x_i$ is adjacent to $x_j$ in the original sentence, there is an edge between them. $\mathcal{N}^+_{fw}(x_i)$ is the set including $x_i$ itself and all nodes reachable from $x_i$ in the forward direction, while $\mathcal{N}^+_{bw}(x_i)$ is the set including $x_i$ itself and all nodes reachable from $x_i$ in the backward direction.

Figure 3: The framework of our proposed LET model: BERT encodes the concatenated character sequence ([CLS] 他 的 话 有 些 水 分 [SEP] 他 在 吹 牛 [SEP]) into character embeddings; the input module produces word and sense embeddings; stacked SaGT blocks alternate sense updating and word updating; and a sentence matching layer, together with the [CLS] vector, feeds the final prediction (example pair: "His words are a bit exaggerated" / "He is bragging").

Thus, for each sample we have two graphs $\mathcal{G}^a = (\mathcal{V}^a, \mathcal{E}^a)$ and $\mathcal{G}^b = (\mathcal{V}^b, \mathcal{E}^b)$, and our graph matching model predicts their similarity. As shown in Fig. 3, LET consists of four components: an input module, a semantic-aware graph transformer (SaGT), a sentence matching layer and a relation classifier. The input module outputs the initial contextual representation for each word $w_i$ and the initial semantic representation for each sense. The semantic-aware graph transformer iteratively updates the word and sense representations and fuses useful information between them. The sentence matching layer first incorporates the word representations into the character level and then matches the two character sequences with a bilateral multi-perspective matching mechanism. The relation classifier takes the sentence vectors as input and predicts the relation of the two sentences.
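Before turning to the individual modules, here is a minimal sketch of how a lattice graph of this kind can be assembled from several segmentation paths. The segmenter outputs are passed in as plain lists (in practice they would come from jieba, pkuseg and thulac), and only adjacency edges are built; the forward/backward sets $\mathcal{N}^+_{fw}$ and $\mathcal{N}^+_{bw}$ are their transitive closures.

```python
def build_lattice(sentence, segmentations):
    """Merge several segmentation paths of `sentence` into one word lattice.

    Each node is a (start, end, word) character span; two nodes are connected
    when one ends exactly where the other begins.
    """
    nodes = set()
    for path in segmentations:
        pos = 0
        for word in path:
            nodes.add((pos, pos + len(word), word))
            pos += len(word)
    nodes = sorted(nodes)
    # forward edges: x_i -> x_j if x_j starts where x_i ends
    fw_edges = [(i, j) for i, (_, end, _) in enumerate(nodes)
                for j, (start, _, _) in enumerate(nodes) if start == end]
    return nodes, fw_edges

# Illustrative paths for the Fig. 4 example sentence.
nodes, edges = build_lattice(
    "北京人和宾馆",
    [["北京", "人和宾馆"], ["北京人", "和", "宾馆"]],
)
print(nodes)
print(edges)
```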
Input Module

Contextual Word Embedding
For each node $x_i$ in the graphs, the initial representation of the word $w_i$ is the attentive pooling of contextual character representations. Concretely, we first concatenate the original character-level sentences to form a new sequence $C = \{[\mathrm{CLS}], c_1^a, \cdots, c_{T_a}^a, [\mathrm{SEP}], c_1^b, \cdots, c_{T_b}^b, [\mathrm{SEP}]\}$, and then feed it into the BERT model to obtain a contextual representation for each character, $\{\mathbf{c}_{\mathrm{CLS}}, \mathbf{c}_1^a, \cdots, \mathbf{c}_{T_a}^a, \mathbf{c}_{\mathrm{SEP}}, \mathbf{c}_1^b, \cdots, \mathbf{c}_{T_b}^b, \mathbf{c}_{\mathrm{SEP}}\}$. Assuming that the word $w_i$ consists of the consecutive character tokens $\{c_{t_1}, c_{t_1+1}, \cdots, c_{t_2}\}$, a feature-wise score vector is calculated with a two-layer feed-forward network (FFN) for each character $c_k$ ($t_1 \leq k \leq t_2$) and then normalized with a feature-wise multi-dimensional softmax (MD-softmax),

$$u_k = \text{MD-softmax}_k\big(\mathrm{FFN}(\mathbf{c}_k)\big). \quad (7)$$

The corresponding character embedding $\mathbf{c}_k$ is weighted with the normalized score $u_k$ to obtain the contextual word embedding,

$$v_i = \sum_{k=t_1}^{t_2} u_k \odot \mathbf{c}_k. \quad (8)$$

For brevity, we use Att-Pooling$(\cdot)$ to abbreviate Eqs. (7) and (8), i.e.

$$v_i = \text{Att-Pooling}\big(\{\mathbf{c}_k \mid t_1 \leq k \leq t_2\}\big). \quad (9)$$

(For brevity, the sentence superscripts of $c_k$ ($t_1 \leq k \leq t_2$) are omitted.)
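A compact sketch of Att-Pooling (Eqs. (7)-(8)): a two-layer FFN scores each character feature-wise, a feature-wise softmax normalizes the scores over the characters of the word, and the weighted sum gives the word embedding. The hidden size of the FFN is an illustrative choice, not a value from the paper.

```python
import torch
import torch.nn as nn

class AttPooling(nn.Module):
    """Attentive pooling over a set of vectors (a sketch of Eqs. 7-8)."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, c):
        # c: [m, d], e.g. the contextual embeddings of the m characters of one word
        u = torch.softmax(self.ffn(c), dim=0)   # feature-wise MD-softmax over the m items
        return (u * c).sum(dim=0)               # v_i = sum_k u_k ⊙ c_k, shape [d]
```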
Sense Embedding
The word embedding $v_i$ described above contains only contextual character information, which may suffer from the issue of polysemy in Chinese. We therefore incorporate HowNet as an external knowledge base to express the semantic information of words. For each word $w_i$, we denote its set of senses as $S(w_i) = \{s_{i,1}, s_{i,2}, \cdots, s_{i,K}\}$, where $s_{i,k}$ is the $k$-th sense of $w_i$, and we denote its corresponding sememes as $O(s_{i,k}) = \{o_{i,k}^1, o_{i,k}^2, \cdots, o_{i,k}^M\}$. To get the embedding $\mathbf{s}_{i,k}$ of each sense $s_{i,k}$, we first obtain a representation $\mathbf{o}_{i,k}^m$ for each sememe $o_{i,k}^m$ with the multi-dimensional attention function,

$$\mathbf{o}_{i,k}^m = \text{MD-GAT}\big(e_{i,k}^m, \{e_{i,k}^{m'} \mid o_{i,k}^{m'} \in O(s_{i,k})\}\big), \quad (10)$$

where $e_{i,k}^m$ is the embedding vector of sememe $o_{i,k}^m$ produced by the Sememe Attention over Target model (SAT) (Niu et al. 2017). Then, for each sense $s_{i,k}$, its embedding $\mathbf{s}_{i,k}$ is obtained with attentive pooling of all sememe representations,

$$\mathbf{s}_{i,k} = \text{Att-Pooling}\big(\{\mathbf{o}_{i,k}^m \mid o_{i,k}^m \in O(s_{i,k})\}\big). \quad (11)$$

Semantic-aware Graph Transformer

For each node $x_i$ in the graph, the word embedding $v_i$ only contains contextual information, while the sense embedding $\mathbf{s}_{i,k}$ only contains linguistic knowledge. In order to harvest useful information from each other, we propose a semantic-aware graph transformer (SaGT). It takes $v_i$ and $\mathbf{s}_{i,k}$ as the initial word representation $h_i^0$ of word $w_i$ and the initial sense representation $g_{i,k}^0$ of sense $s_{i,k}$ respectively, and then iteratively updates them with two sub-steps.

Figure 4: (a) An example of lattice graph for the sentence 北京人和宾馆, with nodes such as 北京 (Beijing), 北京人 (Beijinger), 和 (and), 人和宾馆 (Renhe Hotel) and 宾馆 (hotel). (b) The process of sense updating, where fw and bw refer to the words in the forward and backward directions of w respectively, and uw denotes the words that w cannot reach. (c) Word updating; the word representation is not updated if the word is not in HowNet.

Updating Sense Representation
At the $l$-th iteration, the first sub-step is to update the sense representation from $g_{i,k}^{l-1}$ to $g_{i,k}^l$. For a word with multiple senses, which sense should be used is usually determined by the context in the sentence. Therefore, when updating the representation, each sense first aggregates useful information from the words in the forward and backward directions of $x_i$,

$$m_{i,k}^{l,fw} = \text{MD-GAT}\big(g_{i,k}^{l-1}, \{h_j^{l-1} \mid x_j \in \mathcal{N}^+_{fw}(x_i)\}\big),$$
$$m_{i,k}^{l,bw} = \text{MD-GAT}\big(g_{i,k}^{l-1}, \{h_j^{l-1} \mid x_j \in \mathcal{N}^+_{bw}(x_i)\}\big), \quad (12)$$

where the two multi-dimensional attention functions MD-GAT$(\cdot)$ have different parameters. Based on $m_{i,k}^l = [m_{i,k}^{l,fw}, m_{i,k}^{l,bw}]$, where $[\cdot, \cdot]$ denotes the concatenation of vectors, each sense updates its representation with a gated recurrent unit (GRU) (Cho et al. 2014),

$$g_{i,k}^l = \mathrm{GRU}\big(g_{i,k}^{l-1}, m_{i,k}^l\big). \quad (13)$$

Note that we do not directly use $m_{i,k}^l$ as the new representation $g_{i,k}^l$ of sense $s_{i,k}$. The reason is that $m_{i,k}^l$ only contains contextual information, and we need a gate, e.g. a GRU, to control the fusion of contextual information and semantic information.
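The sense-update sub-step (Eqs. (12)-(13)) can be sketched as below. Because Eq. (12) attends a single sense query over a set of word vectors, the sketch defines a small cross-attention variant of MD-GAT; the module layout, dimensions and mask-free neighbor lists are our assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class MDCrossAtt(nn.Module):
    """Multi-dimensional attention of one query vector over a set of key vectors,
    i.e. MD-GAT(q, {k_j}) as it is used in Eqs. (12) and (14)."""
    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        self.f_m = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, q, keys):
        # q: [d], keys: [m, d]
        scalar = self.w_k(keys) @ self.w_q(q)           # pair-wise scores, [m]
        scores = scalar.unsqueeze(-1) + self.f_m(keys)  # [m, d]
        alpha = torch.softmax(scores, dim=0)            # feature-wise softmax over the m keys
        return torch.relu((alpha * self.w_v(keys)).sum(dim=0))  # [d]

class SenseUpdate(nn.Module):
    """One SaGT sense-update sub-step (Eqs. 12-13): gather context from the words in the
    forward and backward directions, then fuse it into the sense state with a GRU gate."""
    def __init__(self, dim):
        super().__init__()
        self.att_fw = MDCrossAtt(dim)
        self.att_bw = MDCrossAtt(dim)
        self.gru = nn.GRUCell(2 * dim, dim)

    def forward(self, g_prev, h_fw, h_bw):
        # g_prev: [d] previous sense state; h_fw / h_bw: [m, d] word states in N+_fw / N+_bw
        m = torch.cat([self.att_fw(g_prev, h_fw), self.att_bw(g_prev, h_bw)])   # [2d]
        return self.gru(m.unsqueeze(0), g_prev.unsqueeze(0)).squeeze(0)         # g^l, [d]
```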
Updating Word Representation
The second sub-step is to update the word representation from $h_i^{l-1}$ to $h_i^l$ based on the updated sense representations $g_{i,k}^l$ ($1 \leq k \leq K$). The word $w_i$ first obtains semantic information from its sense representations with the multi-dimensional attention,

$$q_i^l = \text{MD-GAT}\big(h_i^{l-1}, \{g_{i,k}^l \mid s_{i,k} \in S(w_i)\}\big), \quad (14)$$

and then updates its representation with a GRU,

$$h_i^l = \mathrm{GRU}\big(h_i^{l-1}, q_i^l\big). \quad (15)$$

This GRU and the GRU in Eq. (13) have different parameters. After multiple iterations, the final word representation $h_i^L$ contains not only contextual word information but also semantic knowledge. For the two sentences, we use $h_i^a$ and $h_i^b$ to denote the final word representations respectively.

Sentence Matching Layer

After obtaining the semantic knowledge enhanced word representations $h_i^a$ and $h_i^b$ for each sentence, we incorporate this word information into the characters. Without loss of generality, we use the characters in sentence $C^a$ to introduce the process. For each character $c_t^a$, we obtain $\hat{\mathbf{c}}_t^a$ by pooling the useful word information,

$$\hat{\mathbf{c}}_t^a = \text{Att-Pooling}\big(\{h_i^a \mid w_i^a \in W(c_t^a)\}\big), \quad (16)$$

where $W(c_t^a)$ is the set of words that contain the character $c_t^a$. The semantic knowledge enhanced character representation $y_t^a$ is then obtained by

$$y_t^a = \mathrm{LayerNorm}\big(\mathbf{c}_t^a + \hat{\mathbf{c}}_t^a\big), \quad (17)$$

where LayerNorm$(\cdot)$ denotes layer normalization and $\mathbf{c}_t^a$ is the contextual character representation obtained with BERT as described in Sec. 4.1.

Each character $c_t^a$ then aggregates information from sentence $C^a$ and sentence $C^b$ respectively using multi-dimensional attention,

$$m_t^{self} = \text{MD-GAT}\big(y_t^a, \{y_{t'}^a \mid c_{t'}^a \in C^a\}\big),$$
$$m_t^{cross} = \text{MD-GAT}\big(y_t^a, \{y_{t'}^b \mid c_{t'}^b \in C^b\}\big). \quad (18)$$

These two multi-dimensional attention functions MD-GAT$(\cdot)$ share the same parameters. With this sharing mechanism, the model has the nice property that, when the two sentences are perfectly matched, $m_t^{self} \approx m_t^{cross}$.

We utilize the multi-perspective cosine distance (Wang, Hamza, and Florian 2017) to compare $m_t^{self}$ and $m_t^{cross}$,

$$d_k = \mathrm{cosine}\big(w_k^{cos} \odot m_t^{self},\; w_k^{cos} \odot m_t^{cross}\big), \quad (19)$$

where $k \in \{1, 2, \cdots, P\}$ ($P$ is the number of perspectives) and $w_k^{cos}$ is a parameter vector which assigns different weights to different dimensions of the messages. With the $P$ distances $d_1, d_2, \cdots, d_P$, we obtain the final character representation,

$$\hat{y}_t^a = \mathrm{FFN}\big([m_t^{self}, d_t]\big), \quad (20)$$

where $d_t \triangleq [d_1, d_2, \cdots, d_P]$ and FFN$(\cdot)$ is a two-layer feed-forward network.

Similarly, we obtain the final character representation $\hat{y}_t^b$ for each character $c_t^b$ in sentence $C^b$. Note that the final character representation contains three kinds of information: contextual information, word and sense knowledge, and character-level similarity. For each sentence $C^a$ or $C^b$, the sentence representation vector $r^a$ or $r^b$ is obtained with attentive pooling of all the final character representations of the sentence.
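The multi-perspective comparison of Eq. (19) can be written in a few lines of PyTorch. Here $m^{self}_t$ and $m^{cross}_t$ are assumed to be already computed by the shared MD-GAT of Eq. (18), and the default number of perspectives follows the value reported in the experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiPerspectiveCosine(nn.Module):
    """Multi-perspective cosine distances between two message vectors (Eq. 19)."""
    def __init__(self, dim, num_perspectives=20):
        super().__init__()
        # one weighting vector w^cos_k per perspective
        self.w = nn.Parameter(torch.randn(num_perspectives, dim))

    def forward(self, m_self, m_cross):
        # m_self, m_cross: [d]; returns d_1..d_P as a [P] vector
        a = self.w * m_self     # [P, d]
        b = self.w * m_cross    # [P, d]
        return F.cosine_similarity(a, b, dim=-1)
```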
Relation Classifier

With the two sentence vectors $r^a$ and $r^b$, and the vector $\mathbf{c}_{\mathrm{CLS}}$ obtained with BERT, our model predicts the similarity of the two sentences,

$$p = \mathrm{FFN}\big(\big[\mathbf{c}_{\mathrm{CLS}},\, r^a,\, r^b,\, r^a \odot r^b,\, |r^a - r^b|\big]\big), \quad (21)$$

where FFN$(\cdot)$ is a feed-forward network with two hidden layers and a sigmoid activation after the output layer. With $N$ training samples $\{C_i^a, C_i^b, y_i\}_{i=1}^N$, the training objective is to minimize the binary cross-entropy loss,

$$\mathcal{L} = -\sum_{i=1}^{N}\big(y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\big), \quad (22)$$

where $y_i \in \{0, 1\}$ is the label of the $i$-th training sample and $p_i \in [0, 1]$ is the prediction of our model for the sentence pair $\{C_i^a, C_i^b\}$.
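The relation classifier of Eq. (21) and the loss of Eq. (22) then reduce to the following sketch; the hidden size of the FFN is an illustrative choice.

```python
import torch
import torch.nn as nn

class RelationClassifier(nn.Module):
    """Predicts pair similarity from [c_CLS, r_a, r_b, r_a*r_b, |r_a - r_b|] (Eq. 21)."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(5 * dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, c_cls, r_a, r_b):
        x = torch.cat([c_cls, r_a, r_b, r_a * r_b, (r_a - r_b).abs()], dim=-1)
        return torch.sigmoid(self.ffn(x)).squeeze(-1)   # p in (0, 1)

# binary cross-entropy summed over the N training pairs (Eq. 22)
loss_fn = nn.BCELoss(reduction="sum")
```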
Experiments

Dataset
We conduct experiments on two Chinese short text matching datasets: LCQMC (Liu et al. 2018) and BQ (Chen et al. 2018a).

LCQMC is a large-scale open-domain question matching corpus. It consists of 260,068 Chinese sentence pairs, including 238,766 training samples, 8,802 development samples and 12,500 test samples. Each pair is associated with a binary label indicating whether the two sentences have the same meaning or share the same intention. There are about 30% more positive samples than negative samples.

BQ is a domain-specific large-scale corpus for bank question matching. It consists of 120,000 Chinese sentence pairs, including 100,000 training samples, 10,000 development samples and 10,000 test samples. Each pair is also associated with a binary label indicating whether the two sentences have the same meaning. The numbers of positive and negative samples are equal.
Evaluation metrics
For each dataset, the accuracy (ACC.) and F1 score are used as the evaluation metrics. ACC. is the percentage of correctly classified examples. The F1 score of matching is the harmonic mean of precision and recall.
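For concreteness, both metrics can be computed from binary labels and predictions as below; this is the standard formulation, not code from the paper.

```python
def accuracy_and_f1(y_true, y_pred):
    """ACC. and F1 of the 'matched' class for binary labels/predictions in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    acc = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return acc, f1
```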
Hyper-parameters
The input word lattice graphs are produced by combining three segmentation tools: jieba (Sun 2012), pkuseg (Luo et al. 2019) and thulac (Li and Sun 2009). We use the pre-trained sememe embeddings provided by OpenHowNet (Qi et al. 2019) with 200 dimensions. The number of graph updating steps/layers L is 2 on both datasets, and the number of perspectives P is 20. The dimensions of both word and sense representations are 128, and the hidden size is also 128. The dropout rate for all hidden layers is 0.2. The model is trained by RMSProp with an initial learning rate of 0.0005 and a warmup rate of 0.1. The learning rate of the BERT layers is multiplied by an additional factor of 0.1. As for batch size, we use 32 for LCQMC and 64 for BQ. Our code is available at https://github.com/lbe0613/LET.
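For reference, the reported hyper-parameters can be gathered into a single configuration; the values are those listed above, while the dictionary layout and key names are ours.

```python
CONFIG = {
    "segmenters": ["jieba", "pkuseg", "thulac"],
    "sememe_embedding_dim": 200,      # pre-trained OpenHowNet sememe embeddings
    "num_sagt_layers": 2,             # graph updating steps/layers L
    "num_perspectives": 20,           # P in the sentence matching layer
    "word_dim": 128,
    "sense_dim": 128,
    "hidden_size": 128,
    "dropout": 0.2,
    "optimizer": "RMSProp",
    "learning_rate": 5e-4,
    "warmup_rate": 0.1,
    "bert_lr_factor": 0.1,            # BERT layers use 0.1x the base learning rate
    "batch_size": {"LCQMC": 32, "BQ": 64},
}
```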
Table 1: Performance of various models on the LCQMC and BQ test datasets (for each model: whether pre-training and cross-sentence interaction are used, and ACC. and F1 on BQ and LCQMC). The compared models include Text-CNN (He et al. 2016), BiLSTM, Lattice-CNN, BiMPM, ESIM and LET (ours), as well as BERT, BERT-wwm (Cui et al. 2019), BERT-wwm-ext, ERNIE and LET-BERT (ours). The results are average scores over 5 different seeds. All the improvements over baselines are statistically significant (p < 0.05).

We compare our models with three types of baselines: representation-based models, interaction-based models and BERT-based models. The results are summarized in Table 1. All the experiments in Table 1 and Table 2 are run five times with different seeds, and we report the average scores to ensure the reliability of the results. For the baselines, we run them ourselves using the parameters mentioned in Cui et al. (2019).
Representation-based models include three baselines: Text-CNN, BiLSTM and Lattice-CNN. Text-CNN (He et al. 2016) is a Siamese architecture with convolutional neural networks (CNNs) used for encoding each sentence. BiLSTM (Mueller and Thyagarajan 2016) is another Siamese architecture, with a bi-directional long short-term memory network (BiLSTM) used for encoding each sentence. Lattice-CNN (Lai et al. 2019) is also proposed to deal with the potential issues of Chinese word segmentation: it takes a word lattice as input, and pooling mechanisms are utilized to merge the feature vectors produced by multiple CNN kernels over different n-gram contexts of each node in the lattice graph.

Interaction-based models include two baselines: BiMPM and ESIM. BiMPM (Wang, Hamza, and Florian 2017) is a bilateral multi-perspective matching model. It encodes each sentence with a BiLSTM and matches the two sentences from multiple perspectives; BiMPM performs very well on some natural language inference (NLI) tasks. ESIM (Chen et al. 2017) contains two BiLSTMs: the first one encodes the sentences, and the other fuses the word alignment information between the two sentences. ESIM achieves state-of-the-art results on various matching tasks. To be comparable with the above models, we also employ a model in which the BERT in Fig. 3 is replaced by a traditional character-level transformer encoder; this model is denoted as LET.

The results of the above models are shown in the first part of Table 1. Our model LET outperforms all baselines on both datasets. More specifically, the performance of LET is better than that of Lattice-CNN: although both utilize word lattices, Lattice-CNN only focuses on local information, while our model can utilize global information. Besides, our model incorporates semantic messages between sentences, which significantly improves performance. As for the interaction-based models, although they also use the multi-perspective matching mechanism, LET outperforms BiMPM and ESIM, which shows that combining the word lattice with our graph neural networks is powerful.
BERT-based models include four baselines: BERT, BERT-wwm, BERT-wwm-ext and ERNIE. We compare them with our model LET-BERT. BERT is the official Chinese BERT model released by Google. BERT-wwm is a Chinese BERT with the whole word masking mechanism used during pre-training. BERT-wwm-ext is a variant of BERT-wwm with more training data and more training steps. ERNIE is designed to learn language representations enhanced by knowledge masking strategies, which include entity-level masking and phrase-level masking. LET-BERT is our proposed LET model with BERT used as the character-level encoder.

The results are shown in the second part of Table 1. The three variants of BERT (BERT-wwm, BERT-wwm-ext, ERNIE) all surpass the original BERT, which suggests that using word-level information during pre-training is important for Chinese matching tasks. Our model LET-BERT performs better than all of these BERT-based models. Compared with the baseline BERT, which has the same initialization parameters, the ACC. of LET-BERT on BQ and LCQMC is increased by 0.8% and 2.65%, respectively. This shows that utilizing sense information during the fine-tuning phase with LET is an effective way to boost the performance of BERT for Chinese semantic matching.

We also compare our results with K-BERT (Liu et al. 2020), which regards information in HowNet as triples {word, contain, sememes} to enhance BERT, introducing soft positions and a visible matrix during the fine-tuning and inference phases. The reported ACC. of K-BERT on the LCQMC test set is 86.9%; our LET-BERT is 1.48% better than that. Different from K-BERT, we focus on fusing useful information between words and senses.

Ablation Study

Table 2: Results of LET-BERT on the LCQMC test set with different segmentation inputs (Seg.: jieba, pkuseg, thulac or lattice; with or without Sense; ACC. and F1).

In our view, both multi-granularity information and semantic information are important for LET. If the segmentation does not contain the correct word, our semantic information cannot exert its full advantage.

Firstly, to explore the impact of using different segmentation inputs, we carry out experiments with LET-BERT on the LCQMC test set. As shown in Table 2, when sense information is incorporated, an improvement can be observed for the lattice-based model (the fourth row) over the word-based models jieba, pkuseg and thulac. The improvements of lattice with sense over the other models in Table 2 are all statistically significant (p < 0.05). The likely reason is that lattice-based models reduce word segmentation errors and thus make predictions more accurate.

Secondly, we design an experiment to demonstrate the effectiveness of incorporating HowNet to express the semantic information of words. In the comparative model without HowNet knowledge, the sense updating module in SaGT is removed, and we update the word representations only by multi-dimensional self-attention. The last two rows in Table 2 list the results of the combined segmentation (lattice) with and without sense information. Integrating sense information performs better than using only word representations. More specifically, the average absolute improvements in ACC. and F1 are 0.7% and 0.45%, respectively, which indicates that LET is able to obtain semantic information from HowNet to improve performance. Besides, compared with using a single word segmentation tool, the semantic information is more beneficial for the lattice-based model. The probable reason is that the lattice-based model incorporates more possible words, so it can perceive more meanings.
ofremoving GRU in lattice-based model is 87.82% on aver-age, demonstrating that GRU can control historical messagesand combine them with current information. Through experi-ments, we find that the model with 2 layers of SaGT achievesthe best. It indicates multiple information fusion will refinethe message and make the model more robust. Influences of text length on performance
As listed in Table 3, text length also has a great impact on text matching prediction. The experimental results show that the shorter the text, the more obvious the improvement from utilizing sense information. The reason is, on the one hand, that concise texts usually carry little contextual information, which makes them difficult for the model to understand; HowNet brings a lot of useful external information to these weak-context short texts, so it becomes easier to perceive the similarity between the texts, and the gain is large. On the other hand, longer texts may contain more wrong words caused by insufficient segmentation, leading to incorrect sense information. Too much incorrect sense information may confuse the model and prevent it from recovering the original semantics.

Table 3: Influences of text length on the LCQMC test dataset (number of samples and ACC. with and without sense for each text-length bucket). Relative error reduction (RER) is calculated as (ACC_sense − ACC_w/o sense) / (100 − ACC_w/o sense) × 100%.
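The relative error reduction used in the Table 3 caption is a direct transcription of the formula above:

```python
def relative_error_reduction(acc_sense, acc_wo_sense):
    """RER(%) = (ACC_sense - ACC_w/o sense) / (100 - ACC_w/o sense) * 100, ACC in percent."""
    return (acc_sense - acc_wo_sense) / (100.0 - acc_wo_sense) * 100.0

# e.g. going from 90.0% to 92.0% accuracy removes 20% of the remaining errors
print(relative_error_reduction(92.0, 90.0))
```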
Case study
We compare LET-BERT with and without sense information on a concrete example (see Fig. 5). The model without sense fails to judge the relationship between two sentences that actually have the same intention, while LET-BERT succeeds. Both sentences contain the word "yuba", which has only one sense, described by the sememe food. The sense of "cook" has a similar sememe, edible, narrowing the distance between the texts. Moreover, the third sense of "fry" shares the sememe cook with the word "cook". This provides a powerful signal that makes "fry" attend more to its third sense.
Figure 5: An example of using sense information to get the correct answer. Sentence A: 腐竹和什么煮好吃? (What is delicious to cook with yuba?); sentence B: 腐竹和什么炒好吃 (What is delicious to fry with yuba?). The model without sense predicts 0 (mismatch), while the model with sense predicts 1 (match). Relevant sememes include 食品 (food) for 腐竹 (yuba); 食物 (edible) and 烹调 (cook) for 煮 (cook); and 烹调 (cook) for the third sense of 炒 (fry), whose other senses carry sememes such as 冒险 (venture) and 开除 (discharge).
Conclusion

In this work, we proposed a novel linguistic knowledge enhanced graph transformer for Chinese short text matching. Our model takes two word lattice graphs as input and integrates sense information from HowNet to moderate word ambiguity. The proposed method is evaluated on two Chinese benchmark datasets and obtains the best performance. The ablation studies also demonstrate that both semantic information and multi-granularity information are important for text matching modeling.
Acknowledgments
We thank the anonymous reviewers for their thoughtful comments. This work has been supported by the No. SKLMCPTS2020003 Project.
References
Chen, J.; Chen, Q.; Liu, X.; Yang, H.; Lu, D.; and Tang, B. 2018a. The BQ corpus: A large-scale domain-specific Chinese corpus for sentence semantic equivalence identification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), 4946–4951.

Chen, L.; Chang, C.; Chen, Z.; Tan, B.; Gašić, M.; and Yu, K. 2018b. Policy adaptation for deep reinforcement learning-based dialogue management. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6074–6078. IEEE.

Chen, L.; Chen, Z.; Tan, B.; Long, S.; Gašić, M.; and Yu, K. 2019. AgentGraph: Toward universal dialogue management with structured deep reinforcement learning. IEEE/ACM Transactions on Audio, Speech, and Language Processing.

Chen, L.; Lv, B.; Wang, C.; Zhu, S.; Tan, B.; and Yu, K. 2020a. Schema-guided multi-domain dialogue state tracking with graph attention neural networks. In AAAI, 7521–7528.

Chen, L.; Tan, B.; Long, S.; and Yu, K. 2018c. Structured Dialogue Policy with Graph Neural Networks. In Proceedings of the 27th International Conference on Computational Linguistics (COLING), 1257–1268.

Chen, L.; Zhao, Y.; Lyu, B.; Jin, L.; Chen, Z.; Zhu, S.; and Yu, K. 2020b. Neural Graph Matching Networks for Chinese Short Text Matching. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 6152–6158.

Chen, Q.; Zhu, X.; Ling, Z.-H.; Wei, S.; Jiang, H.; and Inkpen, D. 2017. Enhanced LSTM for Natural Language Inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1657–1668.

Chen, Z.; Chen, L.; Liu, X.; and Yu, K. 2020c. Distributed Structured Actor-Critic Reinforcement Learning for Universal Dialogue Management. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28: 2400–2411.

Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Cui, Y.; Che, W.; Liu, T.; Qin, B.; Yang, Z.; Wang, S.; and Hu, G. 2019. Pre-Training with Whole Word Masking for Chinese BERT. arXiv preprint arXiv:1906.08101.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186.

Dong, Z.; and Dong, Q. 2003. HowNet: a hybrid language and knowledge resource. In International Conference on Natural Language Processing and Knowledge Engineering, 2003. Proceedings, 820–824. IEEE.

Fu, X.; Liu, G.; Guo, Y.; and Wang, Z. 2013. Multi-aspect sentiment analysis for Chinese online social reviews based on topic modeling and HowNet lexicon. Knowledge-Based Systems 37: 186–195.

Gao, J.; Galley, M.; Li, L.; et al. 2019. Neural approaches to conversational AI. Foundations and Trends® in Information Retrieval.

Gong, Y.; Luo, H.; and Zhang, J. 2017. Natural language inference over interaction space. arXiv preprint arXiv:1709.04348.

Gu, Y.; Yan, J.; Zhu, H.; Liu, Z.; Xie, R.; Sun, M.; Lin, F.; and Lin, L. 2018. Language modeling with sparse product of sememe experts. arXiv preprint arXiv:1810.12387.

He, T.; Huang, W.; Qiao, Y.; and Yao, J. 2016. Text-attentional convolutional neural network for scene text detection. IEEE Transactions on Image Processing.

Lai, Y.; Feng, Y.; Yu, X.; Wang, Z.; Xu, K.; and Zhao, D. 2019. Lattice CNNs for matching based Chinese question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 6634–6641.

Lan, W.; and Xu, W. 2018. Neural network models for paraphrase identification, semantic textual similarity, natural language inference, and question answering. In Proceedings of the 27th International Conference on Computational Linguistics, 3890–3902.

Li, X.; Meng, Y.; Sun, X.; Han, Q.; Yuan, A.; and Li, J. 2019a. Is Word Segmentation Necessary for Deep Learning of Chinese Representations? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 3242–3252.

Li, X.; Yan, H.; Qiu, X.; and Huang, X. 2020. FLAT: Chinese NER Using Flat-Lattice Transformer. arXiv preprint arXiv:2004.11795.

Li, Y.; Yu, B.; Xue, M.; and Liu, T. 2019b. Enhancing Pre-trained Chinese Character Representation with Word-aligned Attention. arXiv preprint arXiv:1911.02821.

Li, Z.; and Sun, M. 2009. Punctuation as implicit annotations for Chinese word segmentation. Computational Linguistics.

Liu, Q. 2002. Word similarity computing based on HowNet. Computational Linguistics and Chinese Language Processing.

Liu, W.; Zhou, P.; Zhao, Z.; Wang, Z.; Ju, Q.; Deng, H.; and Wang, P. 2020. K-BERT: Enabling language representation with knowledge graph. In AAAI, 2901–2908.

Liu, X.; Chen, Q.; Deng, C.; Zeng, H.; Chen, J.; Li, D.; and Tang, B. 2018. LCQMC: A large-scale Chinese question matching corpus. In Proceedings of the 27th International Conference on Computational Linguistics, 1952–1962.

Liu, Y.; Rong, W.; and Xiong, Z. 2018. Improved text matching by enhancing mutual information. In AAAI.

Luo, R.; Xu, J.; Zhang, Y.; Ren, X.; and Sun, X. 2019. PKUSEG: A Toolkit for Multi-Domain Chinese Word Segmentation. CoRR abs/1906.11455.

Miller, G. A. 1995. WordNet: a lexical database for English. Communications of the ACM.

Mueller, J.; and Thyagarajan, A. 2016. Siamese recurrent architectures for learning sentence similarity. In Thirtieth AAAI Conference on Artificial Intelligence.

Niu, Y.; Xie, R.; Liu, Z.; and Sun, M. 2017. Improved word representation learning with sememes. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2049–2058.

Qi, F.; Yang, C.; Liu, Z.; Dong, Q.; Sun, M.; and Dong, Z. 2019. OpenHowNet: An open sememe-based lexical knowledge base. arXiv preprint arXiv:1901.09957.

Scarselli, F.; Gori, M.; Tsoi, A. C.; Hagenbuchner, M.; and Monfardini, G. 2008. The graph neural network model. IEEE Transactions on Neural Networks.

Shen, T.; Zhou, T.; Long, G.; Jiang, J.; Pan, S.; and Zhang, C. 2018. DiSAN: Directional self-attention network for RNN/CNN-free language understanding. In Proceedings of AAAI.

Sun, J. 2012. Jieba Chinese word segmentation tool. Accessed: Jun. 25, 2018.

Sun, Y.; Wang, S.; Li, Y.; Feng, S.; Chen, X.; Zhang, H.; Tian, X.; Zhu, D.; Tian, H.; and Wu, H. 2019. ERNIE: Enhanced Representation through Knowledge Integration. arXiv preprint arXiv:1904.09223.

Wang, Z.; Hamza, W.; and Florian, R. 2017. Bilateral multi-perspective matching for natural language sentences. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, 4144–4150.

Xu, J.; Liu, J.; Zhang, L.; Li, Z.; and Chen, H. 2016. Improve Chinese word embeddings by exploiting internal structure. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1041–1050.

Yao, L.; Mao, C.; and Luo, Y. 2019. Graph convolutional networks for text classification. In Proceedings of AAAI, 7370–7377.

Yu, K.; Chen, L.; Chen, B.; Sun, K.; and Zhu, S. 2014. Cognitive Technology in Task-Oriented Dialogue Systems: Concepts, Advances and Future. Chinese Journal of Computers.

Zhao, Y.; Chen, L.; Chen, Z.; Cao, R.; Zhu, S.; and Yu, K. 2020. Line Graph Enhanced AMR-to-Text Generation with Mix-Order Graph Attention Networks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 732–741.

Zhu, S.; Li, J.; Chen, L.; and Yu, K. 2020. Efficient Context and Schema Fusion Networks for Multi-Domain Dialogue State Tracking. arXiv preprint arXiv:2004.03386.