CorDEL: A Contrastive Deep Learning Approach for Entity Linkage
Zhengyang Wang, Bunyamin Sisman, Hao Wei, Xin Luna Dong, Shuiwang Ji
Zhengyang Wang
Texas A&M University, College Station, TX, USA
[email protected]

Bunyamin Sisman, Hao Wei, Xin Luna Dong
Amazon.com, Seattle, WA, USA
{bunyamis, wehao, lunadong}@amazon.com

Shuiwang Ji
Texas A&M University, College Station, TX, USA
[email protected]
Abstract—Entity linkage (EL) is a critical problem in data cleaning and integration. In the past several decades, EL has typically been done by rule-based systems or traditional machine learning models with hand-curated features, both of which heavily depend on manual human inputs. With the ever-increasing growth of new data, deep learning (DL) based approaches have been proposed to alleviate the high cost of EL associated with the traditional models. Existing exploration of DL models for EL strictly follows the well-known twin-network architecture. However, we argue that the twin-network architecture is sub-optimal for EL, leading to inherent drawbacks of existing models. In order to address the drawbacks, we propose a novel and generic contrastive DL framework for EL. The proposed framework is able to capture both syntactic and semantic matching signals and pays attention to subtle but critical differences. Based on the framework, we develop a contrastive DL approach for EL, called CorDEL, with three powerful variants. We evaluate CorDEL with extensive experiments conducted on both public benchmark datasets and a real-world dataset. CorDEL outperforms previous state-of-the-art models by 5.2% on public benchmark datasets. Moreover, CorDEL yields a 2.4% improvement over the current best DL model on the real-world dataset, while reducing the number of training parameters by 97.6%.

Index Terms—Entity linkage, twin network, deep learning
I. INTRODUCTION
Entity linkage (EL), also known as entity matching, record linkage, entity resolution, and duplicate detection, refers to the task of determining whether two data records represent the same real-world entity. For example, in a product database, a black ink tank for printers produced by Canon can be represented as (Black ink tank, Canon) with attributes (Product title, Brand). However, there exist many other ways to build records for the same product, such as (Ink tank [black], Canon) and (Black ink tank, Canon® Ink). As a result, there might be many data records referring to the same real-world entity, needing to be cleaned and integrated.

EL has been a fundamental problem in data cleaning and integration in many domains such as e-commerce [1], [2] and data warehouses [3]. Because of its importance, it has been extensively studied for several decades [2], [4]–[12]. Models for EL have evolved with the development of machine learning [13]–[17], incorporating rule-based methods [18]–[21] and crowd-sourcing [1], [22]–[24]. However, because of the explosion in the volume and diversity of data, we are still far away from solving EL. Newly generated data may have different data distributions, requiring new models and thus a lot of human resources. For example, traditional machine learning models, such as support vector machines and random forests, usually require humans to hand-craft features for different data to maximize model accuracy [25].

The success of DL approaches in various areas in recent years, such as natural language processing (NLP), computer vision, robotics, and databases [26], [27], has drawn the attention of the EL research community to a promising direction. Compared with traditional machine learning methods, DL is known to be capable of extracting task-specific features from raw data automatically through the learning process. In addition, the development of distributed representations enables DL models to process textual data directly [28]–[31]. These properties of DL are highly desirable for EL frameworks.

Our work is not the first DL approach for EL. Existing DL methods for EL [2], [11] employ the twin-network architecture in Figure 1(a), which is commonly used for other matching tasks in NLP in the literature. In NLP, the twin-network architecture is usually employed for semantic matching tasks, such as question answering, that require matching abstract text representations. However, semantic matching is not effective on many EL tasks. For example, in product EL tasks, the record pair (Black ink tank, Canon) and (Cyan ink tank, Canon), where the attributes are (Product title, Brand), is a non-match since they have different colors. However, the words representing different colors are semantically close to each other, making it difficult to distinguish this pair based on semantic matching. Another example is the record pair (Coca-Cola 12 fl oz 8 pack, Coca-Cola) and (Coca-Cola 12 fl oz 6 pack, Coca-Cola), where the only difference lies in the number of bottles in a pack. It is a non-match as well, even though words representing numbers have similar semantic meanings. In addition to these non-match cases, semantic matching could also fail on matches. For instance, the beer product record pair (Amber ale, Third Base Sports Bar & Brewery) and (American red ale, Third Base Sports Bar & Brewery) is a match. But the word 'American' in one record is not semantically similar to any word in the other record, which may confuse semantic matching models. Besides these examples, recent studies have also shown that deep neural networks work like low-pass filters and have the effect of smoothing out small differences [32], [33]. Since the comparison in the twin-network architecture is made after the records are projected onto the embedding space, small but crucial differences may be ignored, resulting in failures on EL tasks.

Because of these limitations of the twin-network architecture, existing DL models for EL do NOT show consistently improved performance over current non-DL machine learning models on various EL tasks. The fact that DL models may cause decreased performance in some cases hinders the use of these models for EL in practice.

In order to develop more effective and practical DL models for EL, we propose to jump out of the existing DL framework based on the twin-network architecture. Instead, we propose a new contrastive DL framework for EL, as shown in Figure 1(b). In contrast to the twin-network architecture, our framework is able to capture both syntactic and semantic signals. More importantly, our framework avoids the smoothing effect of deep neural networks and pays attention to subtle but critical differences. As an instantiation of this contrastive DL framework, we build a powerful DL model called CorDEL (Contrastive Deep Entity Linkage). Our contributions can be summarized in three aspects:

• We propose a novel and generic contrastive DL framework for EL, as shown in Figure 1(b). Our contrastive framework addresses the limitations of the twin-network architecture in Figure 1(a) by capturing both syntactic and semantic signals and paying attention to subtle but critical differences between entities.

• We propose a powerful DL model called CorDEL (Contrastive Deep Entity Linkage) as an instantiation of our proposed contrastive DL framework, as illustrated in Figure 2. Concretely, we develop three variants of CorDEL, named CorDEL-Sum, CorDEL-Attention, and CorDEL-Context Attention.

• We perform extensive experiments on both public benchmark datasets and a large real-world dataset. CorDEL is able to outperform previous state-of-the-art models by 5.2% on public benchmark datasets. CorDEL also yields a 2.4% improvement over the current best DL model on the real-world dataset, while reducing the number of training parameters by 97.6%. In addition, CorDEL shows great stability over different runs. These results indicate that CorDEL is a reliable, efficient, and effective DL approach for EL.

II. RELATED WORK
In this section, we discuss the twin-network architecture and review previous DL models for EL.

The twin-network architecture in Figure 1(a) has been widely applied to matching tasks in natural language processing (NLP), such as paraphrase identification, question answering, automatic dialogue, and textual entailment [34]–[37]. A notable property shared by these matching tasks is that they focus on semantic matching, i.e., the matching prediction is mainly determined by the semantic relations between two textual inputs. (Note that our contrastive DL framework does not correspond to the contrastive learning in the fields of deep metric learning and self-supervised learning. The "contrastive" here refers to contrasting one input to the other at the raw string level, as explained in Sections III-B and III-C.)

Fig. 1. Two types of DL architectures for matching tasks. (a) The twin-network architecture employed by existing DL models for EL. The problems of applying the twin-network architecture to EL are analyzed in Section I. (b) Our proposed contrastive DL framework for EL, followed by CorDEL. Details are described in Sections III-B and III-C. The advantages of our framework are discussed in Section III-E.

Employing twin networks with pre-trained distributed representations [28]–[31] suits these tasks well, because distributed representations are able to model semantic meanings. For example, words with similar semantic meanings have distributed representations with small distances. On the other hand, there are other matching tasks in NLP that do not fall into this category, such as relevance matching [38], paper citation matching [39], etc. In these tasks, signals from syntactic matching are as important as those from semantic matching. In order to address these tasks, several DL matching models have been developed [38]–[40]. Typically, these models perform local interaction between two inputs and construct a matching histogram or comparison matrix. Deep neural networks are then applied on the matching histogram or comparison matrix to make predictions.

In the literature, most existing DL models for EL follow the twin-network architecture [2], [11], [41], [42]. DEEPMATCHER [2] proposed a general twin-network template of DL models for EL, with four different instantiations: SIF, RNN, Attention, and Hybrid. DEEPER [11] shared high similarities with the SIF and RNN versions of DEEPMATCHER in terms of both the network architectures and performance. Seq2SeqMatcher [41] augmented the twin-network architecture by proposing a sequence-to-sequence alignment layer, which shared certain similarities with DEEPMATCHER-Attention. AutoEM [42] explored transfer learning settings while still employing twin-network based DL models.

We pointed out in Section I that the twin-network architecture is not suitable for EL tasks. In addition, because of the differences between EL and other tasks, existing non-twin DL matching models in NLP [38]–[40] cannot be directly applied to EL tasks. In this work, we propose a novel and generic contrastive DL framework for EL, as shown in Figure 1(b). We propose a simple yet effective instantiation of this framework, named CorDEL.

III. METHOD
In this section, we first formally define the problem of EL in Section III-A. Then we propose our contrastive DL framework for EL in Section III-B. As an instantiation of the framework, we introduce CorDEL, a novel DL model for EL, in Section III-C. We provide different powerful variants of CorDEL in Section III-D. Finally, we analyze the advantages of CorDEL in Section III-E.
A. Problem Definition
We focus on EL that refers to the matching task between two data records. In detail, data records are saved by following a certain schema. That is, given an ordered set of pre-defined attributes, data are stored by putting each value under its corresponding attribute. For example, the product record (Black ink tank, Canon) is saved with pre-defined attributes (Product title, Brand).

Formally, given pre-defined attributes $A_1, A_2, \ldots, A_m$, a data record $t$ can be represented as a tuple $(t[A_1], t[A_2], \ldots, t[A_m])$, where $t[A_i]$, $i = 1, 2, \ldots, m$, refers to the value of the attribute $A_i$ in the record $t$. In an EL dataset, all the records should have the same schema, that is, the same set of attributes in the same order. The EL task is to determine whether a pair of records $t_1$ and $t_2$, where $t_1 \neq t_2$, refer to the same real-world entity. In particular, it is formulated as a binary classification problem:

$$y = F(t_1, t_2) \in \{0, 1\}, \quad (1)$$

where $F$ represents a model for EL that outputs a binary prediction $y$. In practice, it is common to let $F$ first output a continuous number $y \in [0, 1]$ and set a threshold to translate it into the binary classification result. The continuous output is called the matching score and can be interpreted as the likelihood of $t_1$ and $t_2$ being a match.
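To make the formulation concrete, the following is a minimal sketch (not from the paper) of how records and the thresholding step around Eqn. (1) can be expressed in Python. The record values and the `match_score` stub are hypothetical placeholders for a trained model $F$.

```python
from typing import Dict

# A record is a mapping from the pre-defined attributes to string values.
Record = Dict[str, str]

t1: Record = {"Product title": "Black ink tank", "Brand": "Canon"}
t2: Record = {"Product title": "Ink tank [black]", "Brand": "Canon"}

def match_score(t1: Record, t2: Record) -> float:
    """Stand-in for a trained EL model F: returns a matching score in [0, 1]."""
    # Trivial placeholder: fraction of attributes whose values agree exactly.
    agree = sum(t1[a] == t2[a] for a in t1)
    return agree / len(t1)

# Translate the continuous matching score into the binary prediction of Eqn. (1).
THRESHOLD = 0.5
y = int(match_score(t1, t2) >= THRESHOLD)
```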
B. Contrastive DL Framework

We first propose a novel and generic contrastive DL framework specially designed for EL, upon which we develop CorDEL. The framework is illustrated in Figure 1(b). We describe it component by component in this section.
Local interaction module (LIM): In order to allow syntactic signals to be captured, our contrastive DL framework avoids projecting inputs into the embedding space at the beginning. Instead, it first employs a LIM to enable the two input records to interact with each other at the raw string level. The LIM compares and contrasts the input records in terms of string tokens, where the tokens can be characters, words, and phrases. After the LIM, all the string tokens from the two input records are re-grouped, where each group captures specific syntactic signals. As a result, the outputs of the LIM are simply several groups of string tokens. Our instantiation, CorDEL, explores a simple LIM that separates the different words from the shared words appearing in both records, as introduced in Section III-C.

Fig. 2. An illustration of our CorDEL described in Section III-C. It follows the proposed contrastive DL framework introduced in Section III-B. Θ is the local interaction module, Ψ_j and Φ_j are sequence processing modules, and Ω is the classification module. We provide different options for Ψ_j, Φ_j, and Ω in Section III-D, leading to three variants of CorDEL.
Embedding: With syntactic signals captured by the LIM through grouping, distributed embeddings [28]–[31] of string tokens allow semantic signals to be taken into consideration by the following deep neural network. Therefore, our framework has an embedding layer after the LIM, which transforms each string token into a numeric vector embedding through distributed representations. The outputs of the embedding layer are thus sequences of vector embeddings corresponding to the groups of string tokens. Note that, as the syntactic signals are encoded by the grouping, they will not be lost through the embedding layer. In other words, both syntactic and semantic signals are captured in the outputs of the embedding layer.
Deep neural network: Finally, a deep neural network is applied on top of the embedding layer to process both syntactic and semantic signals and make the prediction. As the inputs are sequences of vector embeddings, the deep neural network can be decomposed into three parts: sequence processing, information aggregation, and classification. First, for each group of vector embeddings, a sequence processing module is employed to summarize the information into a fixed-size vector representation. Next, the information from different groups is aggregated, and then serves as input to a classification module.

The proposed contrastive DL framework is the first DL framework for EL that considers both syntactic and semantic signals. In the next section, we propose a powerful DL model as an instantiation of this framework, called CorDEL.
C. An Instantiation: CorDEL

An illustration of the proposed CorDEL (Contrastive Deep Entity Linkage) is provided in Figure 2. Specifically, under our proposed contrastive DL framework for EL, we develop a simple yet effective LIM followed by a carefully designed deep neural network.

Local interaction module (LIM): The LIM of CorDEL is designed based on human intuition: given an input record pair, we tend to treat the differences between two records as signals for a non-match, and regard the common part as signals for a match. Therefore, our LIM simply separates the different words from the shared words appearing in both records. This results in re-clustering the tokens into three groups: two groups of unique words in either record, and one group of shared words. Specifically, the proposed LIM is achieved through simple set operations, as described below.

Formally, let $t_1$ and $t_2$ denote the input record pair, where $t_i = (t_i[A_1], t_i[A_2], \ldots, t_i[A_m])$, $i = 1, 2$, and each attribute value $t_i[A_j]$, $i = 1, 2$, $j = 1, 2, \ldots, m$, is a sequence of words. Our LIM $\Theta$ of CorDEL contrasts attribute-wise local tokens. For each attribute $A_j$, $j = 1, 2, \ldots, m$, the two sequences of words $t_1[A_j]$ and $t_2[A_j]$ are compared in terms of exact matching between token sets. After $\Theta$, all the tokens in $t_1[A_j]$ and $t_2[A_j]$ are distributed into three groups:

$$(s[A_j], u_1[A_j], u_2[A_j]) = \Theta(t_1[A_j], t_2[A_j]), \quad (2)$$

where $s[A_j]$ contains the shared words appearing in both $t_1[A_j]$ and $t_2[A_j]$, and $u_i[A_j]$, $i = 1, 2$, includes the unique words that are only in $t_i[A_j]$. In other words, the comparison step $\Theta$ can be written as

$$s[A_j] = t_1[A_j] \cap t_2[A_j], \quad u_1[A_j] = t_1[A_j] \setminus s[A_j], \quad u_2[A_j] = t_2[A_j] \setminus s[A_j]. \quad (3)$$

Embedding: Accordingly, CorDEL employs pre-trained word embeddings to transform the outputs of $\Theta$ into word embeddings. Without loss of clarity, the same notations $(s[A_j], u_1[A_j], u_2[A_j])$ are used to denote the corresponding three sequences of word embeddings.
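As a concrete illustration, here is a minimal sketch of the LIM $\Theta$ in Eqns. (2)–(3). It is not the authors' released code: whitespace tokenization and lowercasing are assumptions, and following the set formulation literally means repeated words collapse.

```python
def theta(t1_attr: str, t2_attr: str):
    """LIM of CorDEL for one attribute A_j (Eqns. (2)-(3)).

    Splits the two attribute values into token sets and returns
    (shared tokens s, unique tokens u1, unique tokens u2).
    """
    tokens1 = set(t1_attr.lower().split())
    tokens2 = set(t2_attr.lower().split())
    s = tokens1 & tokens2   # shared words: evidence for a match
    u1 = tokens1 - s        # words only in t1: evidence for a non-match
    u2 = tokens2 - s        # words only in t2: evidence for a non-match
    return s, u1, u2

# Example from Section I: a subtle but critical difference survives Theta.
s, u1, u2 = theta("Coca-Cola 12 fl oz 8 pack", "Coca-Cola 12 fl oz 6 pack")
# s  == {'coca-cola', '12', 'fl', 'oz', 'pack'}
# u1 == {'8'},  u2 == {'6'}
```

Each returned group is then mapped token-by-token through the pre-trained word embeddings described above.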
Deep neural network: We introduce the corresponding deep neural network in the order of sequence processing, information aggregation, and classification.

(1) Sequence processing: For each attribute $A_j$, two sequence processing modules, $\Psi_j$ and $\Phi_j$, are used to generate an attribute similarity representation vector $sim[A_j]$ and an attribute difference representation vector $dif[A_j]$ from $(s[A_j], u_1[A_j], u_2[A_j])$, respectively:

$$sim[A_j] = \Psi_j(s[A_j]), \quad (4)$$

$$dif[A_j] = \Phi_j(u_1[A_j], u_2[A_j]). \quad (5)$$

Note that we use one sequence processing module $\Phi_j$ to process the two groups $u_1[A_j]$ and $u_2[A_j]$ instead of two distinct ones. This is because both groups include difference words, and can thus be viewed as one group as well. Here, the attribute similarity representation vector $sim[A_j]$ encodes information from shared words under the attribute $A_j$ in both records, serving as evidence that supports predicting the input record pair as a match. On the contrary, the attribute difference representation vector $dif[A_j]$ encodes information from different words under the attribute $A_j$ in either record, supporting the opposite prediction.

(2) Information aggregation: In order to aggregate information, CorDEL concatenates $sim[A_j]$ and $dif[A_j]$ as the attribute representation vector $r[A_j]$:

$$r[A_j] = \mathrm{Concat}(sim[A_j], dif[A_j]). \quad (6)$$

(3) Classification: Finally, a classification module $\Omega$ takes all $m$ attribute representation vectors as inputs and performs a binary classification task:

$$y = \Omega(r[A_1], r[A_2], \ldots, r[A_m]) \in [0, 1], \quad (7)$$

where $y$ is the predicted matching score. A threshold can be set to translate the matching scores into binary classification results. The classification module $\Omega$ has to merge the $m$ vectors first and then make the prediction. In DL, it is common to let $\Omega$ output two numbers, use the Softmax function to normalize them, and treat one of them as the $y$ in Eqn. (7) [25]. With the true label $y^*$ from the training dataset, CorDEL can be trained with the cross-entropy loss through back-propagation [43].
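A minimal training-step sketch under this formulation, assuming the two-logit Softmax convention above and the Adam settings reported in Section IV-A. The `model` below is a hypothetical stand-in (its input size of 128 is arbitrary), not the actual CorDEL network.

```python
import torch
import torch.nn as nn

# Stand-in for a CorDEL network: here, Omega as a two-layer MLP over the
# concatenated attribute representations r[A_1..m] (input size hypothetical).
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 2))

criterion = nn.CrossEntropyLoss()  # Softmax + cross-entropy on the two outputs
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(r_concat: torch.Tensor, y_true: torch.Tensor) -> float:
    """One back-propagation step; y_true holds the true labels y* in {0, 1}."""
    optimizer.zero_grad()
    logits = model(r_concat)          # (batch, 2): Eqn. (7) before Softmax
    loss = criterion(logits, y_true)  # cross-entropy against y*
    loss.backward()
    optimizer.step()
    return loss.item()

# Matching score y in [0, 1]: probability of the "match" class.
with torch.no_grad():
    y = torch.softmax(model(torch.randn(1, 128)), dim=-1)[0, 1]
```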
D. Variants of CorDEL

In this section, we provide variants of CorDEL by specifying $\Psi_j$ in Eqn. (4), $\Phi_j$ in Eqn. (5), and $\Omega$ in Eqn. (7). In particular, $\Psi_j$ and $\Phi_j$ are required to take in one and two variable-length sequences of word embeddings, respectively, and produce a fixed-size vector. And $\Omega$ has $m$ fixed-size vectors as inputs and performs a two-way classification.

CorDEL-Sum:
In order to demonstrate the effectiveness of our proposed CorDEL, we build CorDEL-Sum, an extremely simple variant of CorDEL. CorDEL-Sum employs summation followed by a one-layer multilayer perceptron (MLP) for both $\Psi_j$ and $\Phi_j$. Summation, although without any training parameters, is a powerful process in DL models for classification tasks [31], [44]. The one-layer MLP is used to perform dimension reduction, which avoids having an excessive number of parameters in the following classification module $\Omega$. Specifically, we have

$$sim[A_j] = \Psi_j(s[A_j]) = \sigma\Big(W_{\Psi_j} \cdot \sum_{s \in s[A_j]} s\Big),$$

$$dif[A_j] = \Phi_j(u_1[A_j], u_2[A_j]) = \sigma\Big(W_{\Phi_j} \cdot \sum_{u \in u_1[A_j] \cup u_2[A_j]} u\Big),$$

where $W_{\Psi_j}$ and $W_{\Phi_j}$ represent the corresponding one-layer MLPs, and $\sigma$ refers to an activation function. The bias terms are omitted. In particular, $\Phi_j$ sums all the input word embeddings from both sequences of difference words. It is worth noting that the one-layer MLPs are independent for each attribute $A_j$, leading to $m$ one-layer MLPs in total.

Afterwards, $\Omega$ of CorDEL-Sum is simply implemented as a concatenation of the $m$ input vectors followed by a two-layer MLP with two output units:

$$y = \mathrm{MLP}(\mathrm{Concat}(r[A_1], r[A_2], \ldots, r[A_m])). \quad (8)$$

CorDEL-Sum is extremely light-weight yet powerful. The training parameters only lie in the $m$ one-layer MLPs plus a two-layer MLP. As shown in Section IV, CorDEL-Sum achieves significantly improved performance over current non-DL and DL models. The success of CorDEL-Sum demonstrates the power of our proposed CorDEL.
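The following is a minimal PyTorch sketch of CorDEL-Sum, following the equations above with the dimensions from Section IV-A (300-d embeddings, 64-d attribute MLPs, 256-d hidden layer). It is not the authors' released code; the activation $\sigma$ is unspecified in the text, so ReLU is assumed, and batching is omitted for clarity.

```python
import torch
import torch.nn as nn

class CorDELSum(nn.Module):
    """Sketch of CorDEL-Sum for m attributes (assumptions noted above)."""

    def __init__(self, m: int, emb_dim: int = 300, attr_dim: int = 64, hidden: int = 256):
        super().__init__()
        # Independent one-layer MLPs W_{Psi_j} and W_{Phi_j} for each attribute A_j.
        self.psi = nn.ModuleList([nn.Linear(emb_dim, attr_dim) for _ in range(m)])
        self.phi = nn.ModuleList([nn.Linear(emb_dim, attr_dim) for _ in range(m)])
        # Omega: concatenate the m attribute vectors r[A_j], then a two-layer MLP.
        self.omega = nn.Sequential(
            nn.Linear(2 * attr_dim * m, hidden), nn.ReLU(), nn.Linear(hidden, 2)
        )

    def forward(self, shared, unique):
        """shared[j]: (n_j, emb_dim) embeddings of s[A_j];
        unique[j]: (k_j, emb_dim) embeddings of u_1[A_j] and u_2[A_j] together."""
        attrs = []
        for psi_j, phi_j, s, u in zip(self.psi, self.phi, shared, unique):
            sim = torch.relu(psi_j(s.sum(dim=0)))  # summation, then W_{Psi_j}, then sigma
            dif = torch.relu(phi_j(u.sum(dim=0)))  # summation over u_1 union u_2
            attrs.append(torch.cat([sim, dif]))    # r[A_j], Eqn. (6)
        return self.omega(torch.cat(attrs))        # two logits; Softmax yields y of Eqn. (8)
```

Training then follows the cross-entropy setup sketched in Section III-C, with the softmax over the two logits giving the matching score.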
Attention-based CorDEL: Despite the effectiveness of CorDEL-Sum, using summation to perform sequence processing may limit the performance in some cases, as summation gives equal importance to each word in the sequence. This contradicts the intuition that words in $s[A_j]$ and $(u_1[A_j], u_2[A_j])$ should contribute differently to $sim[A_j]$ and $dif[A_j]$, respectively. Therefore, we explore attention-based modules for $\Psi_j$, $\Phi_j$, and $\Omega$ to further enhance our CorDEL. The attention mechanism is able to perform a weighted summation over word embeddings, giving larger weights to more important words.

The attention mechanism has been widely used in DL models for various computer vision and NLP tasks [45]–[51]. In general, the attention mechanism has three parts of inputs: a query vector $q \in \mathbb{R}^{d_k}$, $n$ key vectors that form a matrix $K = [k_1, k_2, \ldots, k_n] \in \mathbb{R}^{d_k \times n}$, and $n$ value vectors that form a matrix $V = [v_1, v_2, \ldots, v_n] \in \mathbb{R}^{d_v \times n}$. Notably, the dimensions of the query vector and the key vectors are the same, and key vectors and value vectors have a one-to-one correspondence. The attention mechanism [46] is defined as

$$o = V \cdot \mathrm{Softmax}\Big(\frac{K^T \cdot q}{\sqrt{d_k}}\Big) \in \mathbb{R}^{d_v}. \quad (9)$$

In order to use the attention mechanism, we need to specify where $q$, $K$, and $V$ come from. With different choices, we develop two attention-based variants of CorDEL, named CorDEL-Attention and CorDEL-Context Attention. They differ in $\Phi_j$, while having the same $\Psi_j$ and $\Omega$.

We describe the shared $\Psi_j$ first. To simplify the notations, let $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{d \times n}$ denote the inputs to $\Psi_j$, i.e., the sequence of word embeddings $s[A_j]$, where $d$ is the word-embedding dimension. Note that $n$ can be any number, so that $s[A_j]$ is a variable-length sequence of embeddings. The $K$ and $V$ in the attention mechanism are computed from the inputs through $K = W_k X$ and $V = W_v X$, where $W_k \in \mathbb{R}^{d_k \times d}$ and $W_v \in \mathbb{R}^{d_v \times d}$ are training parameters. Meanwhile, the query vector $q$ is simply randomly initialized and tuned during training [45].

In terms of $\Phi_j$, both CorDEL-Attention and CorDEL-Context Attention follow a sub-twin architecture; that is, two attention mechanisms with shared training parameters are applied on $u_1[A_j]$ and $u_2[A_j]$, respectively. The output of $\Phi_j$ is the summation of the outputs from the two attention mechanisms. Like the attention mechanism in $\Psi_j$, the attention mechanisms on $u_i[A_j]$ compute $K$ and $V$ from the inputs $u_i[A_j]$. However, CorDEL-Attention and CorDEL-Context Attention make different choices of $q$. CorDEL-Attention employs the attention with a trainable $q$, as in $\Psi_j$, while CorDEL-Context Attention uses the output of $\Psi_j$ as $q$, i.e., $q = sim[A_j]$.

The motivation of CorDEL-Attention is straightforward. As the attention mechanism may be more powerful than summation in some cases, CorDEL-Attention uses attention mechanisms with trainable $q$ to replace the summations in CorDEL-Sum. On the other hand, CorDEL-Context Attention uses $sim[A_j]$ to guide the attention mechanisms that generate $dif[A_j]$. The motivation is that $sim[A_j]$ may contain contextual information, which can be useful in determining the importance of words in $u_1[A_j]$ and $u_2[A_j]$. For example, the model can figure out that the domain of the input records is music. Within this context, words indicating the versions of the music records, such as 'live' and 'remix', should be paid more attention to.
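A minimal sketch of the attention module in Eqn. (9), again not the authors' code. The default dimensions follow Section IV-A ($d = 300$, $d_k = 4$ for the trainable query, $d_v = 64$); for CorDEL-Context Attention one would pass $q = sim[A_j]$ instead of the trainable query, which requires $d_k = 64$.

```python
import torch
import torch.nn as nn

class Attention(nn.Module):
    """Attention of Eqn. (9): K and V are projected from the input embeddings;
    the query is either a trainable vector or an external context vector."""

    def __init__(self, d: int = 300, d_k: int = 4, d_v: int = 64):
        super().__init__()
        self.w_k = nn.Linear(d, d_k, bias=False)  # W_k in R^{d_k x d}
        self.w_v = nn.Linear(d, d_v, bias=False)  # W_v in R^{d_v x d}
        self.q = nn.Parameter(torch.randn(d_k))   # trainable query, tuned in training
        self.d_k = d_k

    def forward(self, x: torch.Tensor, q: torch.Tensor = None) -> torch.Tensor:
        """x: (n, d) sequence of word embeddings; q: optional (d_k,) context query."""
        if q is None:
            q = self.q
        keys = self.w_k(x)     # (n, d_k), i.e., the columns of K
        values = self.w_v(x)   # (n, d_v), i.e., the columns of V
        weights = torch.softmax(keys @ q / self.d_k ** 0.5, dim=0)  # Softmax(K^T q / sqrt(d_k))
        return weights @ values  # o in R^{d_v}: weighted summation of values
```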
Both CorDEL-Attention and CorDEL-Context Attention exploit self-attention [46], [48] in $\Omega$. By having $m$ query vectors, the attention mechanism is able to transform a sequence of embeddings into another sequence of embeddings with the same length [46], [48]. In particular, let $R = [r[A_1], r[A_2], \ldots, r[A_m]]$; we have $Q = W_q R$, $K = W_k R$, and $V = W_v R$, where $W_q$, $W_k$, and $W_v$ are training parameters. Using self-attention to replace the concatenation in Eqn. (8) allows explicit cross-attribute interaction, leading to improved performance in some cases, as shown in Section IV.

E. Analysis of CorDEL

We now analyze CorDEL and demonstrate its advantages. In particular, we demonstrate how it appropriately addresses the problems of existing DL models for EL.

Through the LIM $\Theta$, CorDEL takes syntactic signals from raw strings into consideration. Meanwhile, semantic signals are still captured through word embeddings. On one hand, $\Theta$ helps CorDEL avoid mistakes caused by the fact that some semantically similar words are the key evidence for the prediction of a non-match. Taking the example of (Coca-Cola 12 fl oz 8 pack, Coca-Cola) and (Coca-Cola 12 fl oz 6 pack, Coca-Cola), the words '8' and '6' will be put into the groups of unique words in either record, and encoded by the attribute difference representation vector $dif[A_j]$. Even if '8' and '6' have similar word embeddings, as they are semantically close, CorDEL is still able to know that there is a numeric difference between the two input records, while twin networks are not sensitive to such a difference. On the other hand, CorDEL is also effective in the case where semantically different but unimportant words make the model fail to identify a true match. As the final classifier takes both the attribute similarity representation vector $sim[A_j]$ and the attribute difference representation vector $dif[A_j]$ into consideration, CorDEL is able to determine whether the captured differences serve as important evidence for the prediction.

In addition, CorDEL is unaffected by the smoothing effect of deep neural networks. The differences are isolated from the common parts of the input record pair and processed separately. Therefore, no matter how small the differences are, CorDEL is capable of capturing them.

To summarize, unlike existing DL models for EL, CorDEL is able to identify subtle but critical differences between input records, which is a fundamental requirement for solving EL.

IV. EXPERIMENTAL STUDIES
In this section, we conduct thorough experiments to evaluate our proposed CorDEL and show its superiority in the following aspects:

• On public benchmark datasets, CorDEL outperforms existing non-DL and DL models on all types of EL tasks. In particular, CorDEL is the first DL approach with consistent and significant improvements over the non-DL approach on all three types of EL tasks.

• On a real-world dataset, CorDEL achieves better performance than existing DL models in terms of two practical evaluation metrics. In addition, CorDEL demonstrates significantly improved stability over independent training runs, which is highly desired in practice.

• CorDEL is a much more efficient DL approach in terms of required computational resources.
A. Experimental Setup
We describe the models and configurations used in our experiments.
Baselines:
We select non-DL and DL baselines for comparison.

• The non-DL baseline is Magellan [17], the state-of-the-art machine learning based approach for EL. In particular, Magellan selects the best classifier from decision tree, random forest, Naive Bayes, support vector machine, and logistic regression. The features used in Magellan are designed by experts.

• The DL baseline is DEEPMATCHER [2], which represents a wide range of twin-network based DL models for EL. DEEPMATCHER has four versions, named SIF, RNN, Attention, and Hybrid, with increasing complexity. DEEPER [11] and Seq2SeqMatcher [41] can be regarded as extensions of DEEPMATCHER. DEEPMATCHER has been made publicly available as a Python package.
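For reference, the DEEPMATCHER baseline can be run through its public Python package; the snippet below follows the package's documented interface, with placeholder file names.

```python
import deepmatcher as dm

# Load and preprocess an EL dataset split into train/validation/test CSV files
# (placeholder paths; each CSV holds aligned attribute pairs plus a label column).
train, validation, test = dm.data.process(
    path='data_directory',
    train='train.csv', validation='validation.csv', test='test.csv')

# 'hybrid' is the strongest of the four variants (SIF, RNN, Attention, Hybrid).
model = dm.MatchingModel(attr_summarizer='hybrid')
model.run_train(train, validation, best_save_path='best_model.pth')
model.run_eval(test)  # reports F1 on the evaluation split
```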
CorDEL: We evaluate CorDEL-Sum, CorDEL-Attention, and CorDEL-Context Attention in our experiments. The details are provided below.
Word Embeddings: For fair comparison, the distributed representations used to transform words into word embeddings are 300-dimensional pretrained FastText embeddings [31], the same as in DEEPMATCHER [2]. The embeddings are not fine-tuned during training.
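As a minimal sketch (not the authors' setup), such vectors can be obtained with the official fasttext package; the model name below assumes the public 300-d English release, since the paper does not specify which FastText distribution it used.

```python
import fasttext
import fasttext.util

# Download and load the public 300-d English FastText model (an assumption;
# the paper only states that 300-d pretrained FastText embeddings are used).
fasttext.util.download_model('en', if_exists='ignore')  # writes cc.en.300.bin
model = fasttext.load_model('cc.en.300.bin')

vec = model.get_word_vector('canon')  # numpy array of shape (300,)
```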
Training: CorDEL is trained with the Adam optimizer [52] with a learning rate of 0.0001. The training batch size is set to 64 for the public datasets and 256 for the real-world dataset.

CorDEL-Sum: As described in Section III-D, the training parameters of CorDEL-Sum only lie in $m$ one-layer MLPs plus a two-layer MLP, where $m$ is the number of attributes in the dataset. The output dimension is set to 64 for the $m$ one-layer MLPs. The dimension of the hidden layer in the two-layer MLP is set to 256.

CorDEL-Attention & CorDEL-Context Attention: As introduced in Section III-D, we only need to specify the dimensions of the training parameters in the attention mechanisms, i.e., $d$, $d_k$, and $d_v$. In particular, $d$ depends on the dimension of the word embeddings and is thus 300, as indicated above. In the attention mechanism with a trainable query vector $q$, we set $d_k$ to 4, a small number to prevent over-fitting. In the context-attention and self-attention modules, $d_k$ is set to 64. In all cases, $d_v$ is set to 64.

TABLE I: Statistics of public benchmark datasets provided by [2] and our real-world music dataset. The rightmost three columns, i.e., #Pairs, #Matches, and #Attrs, correspond to the number of record pairs, matches, and attributes in the dataset, respectively.

Type       | Dataset           | Domain      | #Pairs | #Matches | #Attrs
Structured | BeerAdvo-RateBeer | beer        | 450    | 68       | 4
Structured | iTunes-Amazon     | music       | 539    | 132      | 8
Structured | Fodors-Zagats     | restaurant  | 946    | 110      | 6
Structured | DBLP-ACM          | citation    | 12,363 | 2,220    | 4
Structured | DBLP-Scholar      | citation    | 28,707 | 5,347    | 4
Structured | Amazon-Google     | software    | 11,460 | 1,167    | 3
Structured | Walmart-Amazon    | electronics | 10,242 | 962      | 5
Textual    | Abt-Buy           | product     | 9,575  | 1,028    | 3
Dirty      | iTunes-Amazon     | music       | 539    | 132      | 8
Dirty      | DBLP-ACM          | citation    | 12,363 | 2,220    | 4
Dirty      | DBLP-Scholar      | citation    | 28,707 | 5,347    | 4
Dirty      | Walmart-Amazon    | electronics | 10,242 | 962      | 5
Real-World | Amazon-Wikipedia  | music       | ∼0.4M  | ∼        | 10

B. Datasets
Experiments are performed on public benchmark datasets and a real-world dataset. Various evaluation metrics are used.
Public Benchmark Datasets:
We conduct experiments on the public datasets provided by [2]. These public datasets cover a wide range of EL tasks in different domains. In particular, they represent three types of EL tasks.

• Structured EL: In a structured EL dataset, the records in a pair have relatively clean and aligned attribute values. In addition, the number of tokens in an attribute value is usually limited.

• Textual EL: As indicated by the name, a textual EL dataset has long textual data as attribute values.

• Dirty EL: A dirty EL dataset differs from a structured EL dataset in that attribute values may be misplaced: the value of one attribute could appear as part of the value of another attribute.

In total, there are 7 structured, 1 textual, and 4 dirty EL datasets. The statistics of these datasets are provided in Table I. Following [2], we divide each dataset into training, validation, and evaluation splits with the ratio of 3:1:1.

In the experiments on these public datasets, we follow [2] to employ the F1 score as the evaluation metric, which allows a direct comparison between our proposed CorDEL and the baselines. Note that, according to Eqn. (7), the output of CorDEL is a matching score $y \in [0, 1]$. A threshold has to be set to transform the matching score into a binary classification result. As with [2], we set the threshold to 0.5 to compute F1.

It is easy and beneficial for research purposes to classify the current public benchmark datasets for EL tasks [2] according to such a categorization. However, real-world EL datasets may be a mixture of the three types. Therefore, a general approach for EL that is able to achieve good performance consistently on any type of EL task is highly desired in practice.

TABLE II: Comparisons between CorDEL and baselines on structured EL datasets from [2] in terms of the F1 score. "C Attention" is short for "Context Attention". The best performance is highlighted with boldface. If CorDEL achieves the best performance, we mark the best results obtained by baselines with underlines, and vice versa. In particular, when CorDEL sets the new state-of-the-art record, the relative improvement rate against the previous best performance is computed. (Columns: Dataset; Magellan [17]; DEEPMATCHER [2]: SIF, RNN, Attention, Hybrid; CorDEL (Ours): Sum, Attention, C Attention.)

TABLE III: Comparisons between CorDEL and baselines on textual EL datasets from [2] in terms of the F1 score.

TABLE IV: Comparisons between CorDEL and baselines on dirty EL datasets from [2] in terms of the F1 score.

Real-world Dataset:
We collect a real-world EL dataset in the music domain. Specifically, music records are crawled and sampled from Amazon and Wikipedia [53]. That is, in a record pair $t_1$ and $t_2$ from this dataset, $t_1$ is from Amazon and $t_2$ is from Wikipedia. We have 10 attributes describing basic information about the music track records. In order to obtain the training dataset, we sample 0.4 million record pairs involving 822,276 distinct entities and employ a noisy strong key to label them. Meanwhile, the testing dataset contains record pairs that are manually labelled by human annotators, ensuring that the evaluation is accurate.

We adopt more comprehensive and practical evaluation metrics for the experiments on this real-world dataset: Area Under the Precision-Recall Curve (PRAUC) and Recall when Precision=95% (R@P=95%). The F1 score evaluates the model at a chosen threshold. In contrast, PRAUC summarizes the model performance over all thresholds. In addition, as most EL datasets are imbalanced, PRAUC is known to be more suitable for evaluating binary classifiers on imbalanced datasets [54]. R@P=95% is a practical evaluation metric for EL, because data integration typically has a requirement for high precision: a low-precision approach for EL would result in wrongly merged records, causing unrecoverable data loss.
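A minimal sketch of both metrics using scikit-learn (an assumption; the paper does not name its tooling). The labels and scores below are hypothetical placeholders.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

def prauc_and_recall_at_precision(y_true, y_score, min_precision=0.95):
    """PRAUC and R@P=min_precision for binary labels and matching scores."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    prauc = auc(recall, precision)
    # Best recall among operating points whose precision is at least 95%.
    feasible = recall[precision >= min_precision]
    r_at_p = feasible.max() if feasible.size > 0 else 0.0
    return prauc, r_at_p

# Hypothetical usage with model matching scores on a labelled test set:
y_true = np.array([1, 0, 1, 1, 0, 0])
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.35, 0.1])
prauc, r95 = prauc_and_recall_at_precision(y_true, y_score)
```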
C. Results on Public Datasets

We compare CorDEL with the baselines on the three types of public EL datasets separately. The results of the baselines are provided by [2].
Structured EL: Results on the 7 structured EL datasets are reported in Table II. All versions of CorDEL improve the performance by a large margin in terms of the average F1 score. Notably, CorDEL-Sum achieves the state-of-the-art performance on 5 out of 7 datasets. On DBLP-Scholar, CorDEL-Sum is the second best model, while the best model, DEEPMATCHER-Hybrid, has 32x more parameters, as shown in Section IV-D1. On Walmart-Amazon, CorDEL-Sum outperforms all versions of the DL baseline. In addition, CorDEL-Attention achieves the best result on Walmart-Amazon, being the only DL model that beats the non-DL baseline. Overall, CorDEL-Sum yields a 5.2% improvement over the previous state-of-the-art model in terms of the average F1 score.

Fig. 3. Case studies on public benchmark datasets. The top example is from Walmart-Amazon; the bottom example is from BeerAdvo-RateBeer. Both of them are non-matches, with subtle but critical differences. CorDEL makes the correct prediction in both cases, while DEEPMATCHER fails.

While existing DL models can only achieve competitive performance with non-DL models, CorDEL is the first DL approach that demonstrates the advantages of DL on structured EL tasks.

Textual EL: Table III shows the results on the textual EL dataset Abt-Buy. It is a valid concern that the local string comparison step $\Theta$ breaks long textual attribute values, such as sentences and paragraphs, which might harm the performance of CorDEL on textual EL tasks. However, the experimental results indicate that our proposed CorDEL remains powerful. Moreover, CorDEL-Attention sets the new state-of-the-art record, increasing the best F1 score by 3.3%.

Dirty EL: Table IV provides the results on the 4 dirty EL datasets. The advantage of using DL models for dirty EL tasks is inherited by CorDEL. While obtaining the best results on only 1 out of 4 datasets (by CorDEL-Context Attention), CorDEL achieves the best average F1 score. Particularly, CorDEL-Context Attention improves the best average F1 score by 2.6%. This indicates that CorDEL is more robust across different datasets.

To conclude, CorDEL is the first DL approach that yields consistently and significantly improved performance on various datasets for different types of EL tasks, serving as a general DL approach for EL.
1) Case Studies:
We perform case studies to show why CorDEL achieves better performance. Specifically, we examine examples in the testing dataset where CorDEL makes the correct prediction but DEEPMATCHER fails. Figure 3 provides two representative examples from Walmart-Amazon and BeerAdvo-RateBeer, respectively. Both of them are non-matches, with subtle but critical differences. However, DEEPMATCHER identifies them as matches, indicating its inability to capture those subtle but critical differences between input records. On the contrary, as discussed in Section III-E, CorDEL has an outstanding ability to handle such cases.
D. Results on the Real-world Dataset
To further demonstrate the advantages of CorDEL over the DL baseline, we perform experiments on a real-world EL dataset, which poses more challenges compared to the public benchmark datasets.
Fig. 4. The precision-recall curves for (a) the twin network (DEEPMATCHER-Hybrid) and (b) CorDEL-Sum, with 10 independent runs for each of them. CorDEL-Sum is more stable, with better performance especially in the high-precision band.

It is hard to classify a real-world dataset into one of the three types of EL tasks, since it is usually a mixture of them. In addition, a practical DL approach for EL needs to be stable, i.e., different training runs should lead to similar inference performance. This stability is crucial to make DL models reliable.

In order to evaluate the stability, we repeat each experiment 10 times independently and report the mean and standard deviation over the 10 runs. For the baseline DEEPMATCHER, we choose two versions, the simplest DEEPMATCHER-SIF and the most powerful DEEPMATCHER-Hybrid.

The comparisons between CorDEL and DEEPMATCHER are summarized in Table V. CorDEL has better and more stable performance in terms of both PRAUC and R@P=95%. Formally, we conduct an unequal variance t-test on the PRAUC results between CorDEL-Attention and DEEPMATCHER-Hybrid. The p-value is 0.0069, indicating that the improvement is statistically significant.

In order to show the superiority of CorDEL more directly, Figure 4 plots the precision-recall curves for DEEPMATCHER-Hybrid and CorDEL-Sum, with 10 independent runs for each of them. The instability of DEEPMATCHER can be easily observed. In addition, it is worth noting that CorDEL has much better and more stable performance in the high-precision area.
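The unequal variance (Welch's) t-test above can be reproduced with SciPy; the PRAUC arrays below are hypothetical stand-ins for the 10 per-run results, which are not reproduced here.

```python
from scipy import stats

# Hypothetical per-run PRAUC values (10 independent runs per model).
cordel_attention_prauc = [92.8, 92.5, 92.9, 92.6, 92.7, 92.8, 92.4, 92.9, 92.6, 92.7]
deepmatcher_hybrid_prauc = [90.6, 89.8, 91.2, 90.1, 90.9, 90.3, 89.5, 91.0, 90.5, 90.8]

# equal_var=False selects Welch's t-test, which does not assume equal variances.
t_stat, p_value = stats.ttest_ind(
    cordel_attention_prauc, deepmatcher_hybrid_prauc, equal_var=False
)
print(f"p-value: {p_value:.4f}")  # the improvement is significant if p < 0.05
```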
1) Efficiency Analysis:
Another practical challenge in applying DL models to real-world EL tasks is the concern of efficiency. In particular, DL models tend to have a considerably large number of training parameters, requiring large computational resources to train and deploy.

TABLE V: Comparisons between CorDEL and baselines on a real-world dataset in terms of Area Under the Precision-Recall Curve (PRAUC), Recall when Precision=95% (R@P=95%), and the total number of training parameters (#Params). The relative improvement rates against the previous best model, DEEPMATCHER-Hybrid, are computed. (Mean PRAUC: DEEPMATCHER-SIF 88.1; DEEPMATCHER-Hybrid 90.5; CorDEL-Sum 91.6.)

We compare the number of training parameters between CorDEL and DEEPMATCHER in the last column of Table V. We can see that even the simplest DEEPMATCHER-SIF has more parameters than CorDEL, while CorDEL yields much better performance, as shown in the experiments above. In addition, the existing state-of-the-art DL approach, DEEPMATCHER-Hybrid, has millions of training parameters, preventing it from being applied on large-scale datasets. On the contrary, CorDEL is a light-weight and efficient DL approach.

V. CONCLUSIONS
In this work, we propose a novel contrastive DL approach for EL, called CorDEL. We point out the limitations of current twin-network DL models and use them to motivate our work. We perform extensive experiments on both public benchmark datasets and a large real-world dataset for rigorous evaluation. The experimental results show the effectiveness of CorDEL, with significant and consistent improvements in performance. Moreover, CorDEL is more efficient as a light-weight DL approach, and more reliable with stable performance.

ACKNOWLEDGMENT

The authors would like to thank Christos Faloutsos, Andrew Borthwick, Yifan Ethan Xu, and Jialong Han for valuable suggestions.

REFERENCES
[1] C. Gokhale, S. Das, A. Doan, J. F. Naughton, N. Rampalli, J. Shavlik, and X. Zhu, "Corleone: hands-off crowdsourcing for entity matching," in Proceedings of the 2014 ACM International Conference on Management of Data, 2014, pp. 601–612.
[2] S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep, E. Arcaute, and V. Raghavendra, "Deep learning for entity matching: A design space exploration," in Proceedings of the 2018 ACM International Conference on Management of Data, 2018, pp. 19–34.
[3] W. E. Winkler, "Data quality in data warehouses," in Encyclopedia of Data Warehousing and Mining, Second Edition. IGI Global, 2009, pp. 550–555.
[4] H. L. Dunn, "Record linkage," American Journal of Public Health and the Nations Health, vol. 36, no. 12, pp. 1412–1416, 1946.
[5] I. P. Fellegi and A. B. Sunter, "A theory for record linkage," Journal of the American Statistical Association, vol. 64, no. 328, pp. 1183–1210, 1969.
[6] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios, "Duplicate record detection: A survey," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, pp. 1–16, 2006.
[7] F. Naumann and M. Herschel, "An introduction to duplicate detection," Synthesis Lectures on Data Management, vol. 2, no. 1, pp. 1–87, 2010.
[8] P. Christen, Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer Science & Business Media, 2012.
[9] L. Getoor and A. Machanavajjhala, "Entity resolution: theory, practice & open challenges," Proceedings of the VLDB Endowment, vol. 5, no. 12, pp. 2018–2019, 2012.
[10] Z. Sehili, L. Kolb, C. Borgs, R. Schnell, and E. Rahm, "Privacy preserving record linkage with ppjoin," Datenbanksysteme für Business, Technologie und Web (BTW 2015), 2015.
[11] M. Ebraheem, S. Thirumuruganathan, S. Joty, M. Ouzzani, and N. Tang, "Distributed representations of tuples for entity resolution," Proceedings of the VLDB Endowment, vol. 11, no. 11, pp. 1454–1467, 2018.
[12] R. Trivedi, B. Sisman, X. L. Dong, C. Faloutsos, J. Ma, and H. Zha, "Linknbed: Multi-graph representation learning with entity linkage," in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 252–262.
[13] W. W. Cohen and J. Richman, "Learning to match and cluster large high-dimensional data sets for data integration," in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2002, pp. 475–480.
[14] S. Sarawagi and A. Bhamidipaty, "Interactive deduplication using active learning," in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2002, pp. 269–278.
[15] M. Bilenko and R. J. Mooney, "Adaptive duplicate detection using learnable string similarity measures," in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2003, pp. 39–48.
[16] P. Singla and P. Domingos, "Entity resolution with markov logic," in Proceedings of the 6th International Conference on Data Mining. IEEE, 2006, pp. 572–582.
[17] P. Konda, S. Das, P. Suganthan GC, A. Doan, A. Ardalan, J. R. Ballard, H. Li, F. Panahi, H. Zhang, J. Naughton et al., "Magellan: Toward building entity matching management systems," Proceedings of the VLDB Endowment, vol. 9, no. 12, pp. 1197–1208, 2016.
[18] J. Wang, G. Li, J. X. Yu, and J. Feng, "Entity matching: How similar is similar," Proceedings of the VLDB Endowment, vol. 4, no. 10, pp. 622–633, 2011.
[19] W. Fan, X. Jia, J. Li, and S. Ma, "Reasoning about record matching rules," Proceedings of the VLDB Endowment, vol. 2, no. 1, pp. 407–418, 2009.
[20] R. Singh, V. V. Meduri, A. Elmagarmid, S. Madden, P. Papotti, J.-A. Quiané-Ruiz, A. Solar-Lezama, and N. Tang, "Synthesizing entity matching rules by examples," Proceedings of the VLDB Endowment, vol. 11, no. 2, pp. 189–202, 2017.
[21] R. Singh, V. Meduri, A. Elmagarmid, S. Madden, P. Papotti, J.-A. Quiané-Ruiz, A. Solar-Lezama, and N. Tang, "Generating concise entity matching rules," in Proceedings of the 2017 ACM International Conference on Management of Data, 2017, pp. 1635–1638.
[22] J. Wang, T. Kraska, M. J. Franklin, and J. Feng, "Crowder: crowdsourcing entity resolution," Proceedings of the VLDB Endowment, vol. 5, no. 11, pp. 1483–1494, 2012.
[23] M. Stonebraker, D. Bruckner, I. F. Ilyas, G. Beskales, M. Cherniack, S. B. Zdonik, A. Pagan, and S. Xu, "Data curation at scale: The data tamer system," in Cidr, 2013.
[24] N. Vesdapunt, K. Bellare, and N. Dalvi, "Crowdsourcing algorithms for entity resolution," Proceedings of the VLDB Endowment, vol. 7, no. 12, pp. 1071–1082, 2014.
[25] C. M. Bishop, Pattern recognition and machine learning. Springer, 2006.
[26] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[27] W. Wang, M. Zhang, G. Chen, H. Jagadish, B. C. Ooi, and K.-L. Tan, "Database meets deep learning: Challenges and opportunities," ACM SIGMOD Record, vol. 45, no. 2, pp. 17–22, 2016.
[28] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
[29] J. Pennington, R. Socher, and C. D. Manning, "Glove: Global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[30] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017.
[31] A. Joulin, É. Grave, P. Bojanowski, and T. Mikolov, "Bag of tricks for efficient text classification," in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), 2017, pp. 427–431.
[32] S. J. Hamilton and A. Hauptmann, "Deep d-bar: Real-time electrical impedance tomography imaging with deep neural networks," IEEE Transactions on Medical Imaging, vol. 37, no. 10, pp. 2367–2377, 2018.
[33] H. NT and T. Maehara, "Revisiting graph neural networks: All we have is low-pass filters," arXiv preprint arXiv:1905.09550, 2019.
[34] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck, "Learning deep structured semantic models for web search using clickthrough data," in Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, 2013, pp. 2333–2338.
[35] J. Gao, P. Pantel, M. Gamon, X. He, and L. Deng, "Modeling interestingness with deep neural networks," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 2–13.
[36] Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil, "Learning semantic representations using convolutional neural networks for web search," in Proceedings of the 23rd International Conference on World Wide Web, 2014, pp. 373–374.
[37] B. Hu, Z. Lu, H. Li, and Q. Chen, "Convolutional neural network architectures for matching natural language sentences," in Advances in Neural Information Processing Systems, 2014, pp. 2042–2050.
[38] J. Guo, Y. Fan, Q. Ai, and W. B. Croft, "A deep relevance matching model for ad-hoc retrieval," in Proceedings of the 25th ACM International Conference on Information and Knowledge Management, 2016, pp. 55–64.
[39] L. Pang, Y. Lan, J. Guo, J. Xu, S. Wan, and X. Cheng, "Text matching as image recognition," in Proceedings of the 30th AAAI Conference on Artificial Intelligence. Association for the Advancement of Artificial Intelligence (AAAI), 2016.
[40] Z. Lu and H. Li, "A deep architecture for matching short texts," in Advances in Neural Information Processing Systems, 2013, pp. 1367–1375.
[41] H. Nie, X. Han, B. He, L. Sun, B. Chen, W. Zhang, S. Wu, and H. Kong, "Deep sequence-to-sequence entity matching for heterogeneous entity resolution," in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019, pp. 629–638.
[42] C. Zhao and Y. He, "Auto-em: End-to-end fuzzy entity-matching using pre-trained deep models and transfer learning," in The World Wide Web Conference, 2019, pp. 2413–2424.
[43] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[44] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, "How powerful are graph neural networks?" in Proceedings of the International Conference on Learning Representations, 2019.
[45] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, "Hierarchical attention networks for document classification," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 1480–1489.
[46] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[47] X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794–7803.
[48] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.
[49] Z. Wang, N. Zou, D. Shen, and S. Ji, "Non-local U-Nets for biomedical image segmentation," in Proceedings of the 34th AAAI Conference on Artificial Intelligence. Association for the Advancement of Artificial Intelligence (AAAI), 2020.
[50] Z. Wang, X. Hu, and S. Ji, "icapsnets: Towards interpretable capsule networks for text classification," arXiv preprint arXiv:2006.00075, 2020.
[51] Y. Liu, H. Yuan, Z. Wang, and S. Ji, "Global pixel transformers for virtual staining of microscopy images," IEEE Transactions on Medical Imaging, vol. 39, no. 6, pp. 2256–2266, 2020.
[52] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proceedings of the International Conference on Learning Representations, 2015.
[53] Q. Zhu, H. Wei, B. Sisman, D. Zheng, C. Faloutsos, X. L. Dong, and J. Han, "Collective multi-type entity alignment between knowledge graphs," in The World Wide Web Conference. Association for Computing Machinery, 2020.
[54] T. Saito and M. Rehmsmeier, "The precision-recall plot is more informative than the roc plot when evaluating binary classifiers on imbalanced datasets," PLoS ONE, 2015.