Deep Entity Matching with Pre-Trained Language Models
Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, Wang-Chiew Tan
[Scalable Data Science]
Megagon Labs; University of Wisconsin-Madison
{yuliang, jinfeng, yoshi, wangchiew}@megagon.ai, [email protected]

ABSTRACT
We present Ditto, a novel entity matching system based on pre-trained Transformer-based language models. We fine-tune and cast EM as a sequence-pair classification problem to leverage such models with a simple architecture. Our experiments show that a straightforward application of language models such as BERT, DistilBERT, or ALBERT pre-trained on large text corpora already significantly improves the matching quality and outperforms previous state-of-the-art (SOTA), by up to 19% of F1 score on benchmark datasets. We also developed three optimization techniques to further improve Ditto's matching capability. Ditto allows domain knowledge to be injected by highlighting important pieces of input information that may be of interest when making matching decisions. Ditto also summarizes strings that are too long so that only the essential information is retained and used for EM. Finally, Ditto adapts a SOTA technique on data augmentation for text to EM to augment the training data with (difficult) examples. This way, Ditto is forced to learn "harder" to improve the model's matching capability. The optimizations we developed further boost the performance of Ditto by up to 8.5%. Perhaps more surprisingly, we establish that Ditto can achieve the previous SOTA results with at most half the number of labeled data. Finally, we demonstrate Ditto's effectiveness on a real-world large-scale EM task. On matching two company datasets consisting of 789K and 412K records, Ditto achieves a high F1 score of 96.5%.
1. INTRODUCTION
Entity Matching (EM) refers to the problem of determining whether two data entries refer to the same real-world entity. Consider the two datasets about products in Figure 1. The goal is to determine the set of pairs of data entries, one entry from each table, such that each pair of entries refers to the same product.
If the datasets are large, it can be expensive to determine the pairs of matching entries. For this reason, EM is typically accompanied by a pre-processing step, called blocking, to prune pairs of entries that are unlikely matches and thus reduce the number of candidate pairs to consider. As we will illustrate, correctly matching the candidate pairs requires substantial language understanding and domain-specific knowledge. Hence, entity matching remains a challenging task even for the most advanced EM solutions.

We present Ditto, a novel EM solution based on pre-trained Transformer-based language models (or pre-trained language models in short). We cast EM as a sequence-pair classification problem to leverage such models, which have been shown to generate highly contextualized embeddings that capture better language understanding compared to traditional word embeddings. Ditto further improves its matching capability through three optimizations: (1) It allows domain knowledge to be added by highlighting important pieces of the input that may be useful for matching decisions. (2) It summarizes long strings so that only the most essential information is retained and used for EM. (3) It augments training data with (difficult) examples, which challenges Ditto to learn "harder" and also reduces the amount of training data required. Figure 2 depicts Ditto in the overall architecture of a complete EM workflow.

There are 9 candidate pairs of entries to consider for matching in total in Figure 1. The blocking heuristic that matching entries must have one word in common in the title reduces the number of pairs to only 3: the first entry on the left with the first entry on the right, and so on. Perhaps surprisingly, even though the 3 pairs are highly similar and look like matches, only the first and last pairs of entries are true matches. Our system, Ditto, is able to discern the nuances in the 3 pairs and make the correct conclusion for every pair, while some state-of-the-art systems are unable to do so.

The example illustrates the power of language understanding given by Ditto's pre-trained language model. It understands that instant immersion spanish deluxe 2.0 is the same as instant immers spanish dlux 2 in the context of software products even though they are syntactically different. Furthermore, one can explicitly emphasize that certain parts of a value are more useful for matching decisions. For books, the domain knowledge that the grade level or edition is important for matching books can be made explicit to Ditto, simply by placing tags around the grade/edition values. Hence, for the second candidate pair, even though the titles are highly similar (i.e., they overlap in many words), Ditto is able to focus on the grade/edition information when making the matching decision. The third candidate pair shows the power of language understanding for the opposite situation. Even though the entries look dissimilar, Ditto is able to attend to the right parts of a value (i.e., the manf./modelno under different attributes) and also understand the semantics of the model number to make the right decision.

Figure 1: Entity Matching: determine the matching entries from two datasets.

  Table A:
  title                                        | manf./modelno        | price
  instant immersion spanish deluxe 2.0         | topics entertainment | 49.99
  adventure workshop 4th-6th grade 7th edition | encore software      | 19.99
  sharp printing calculator                    | sharp el1192bl       | 37.63

  Table B:
  title                                                                       | manf./modelno | price
  instant immers spanish dlux 2                                               | NULL          | 36.11
  encore inc adventure workshop 4th-6th grade 8th edition                     | NULL          | 17.1
  new-sharp shr-el1192bl two-color printing calculator 12-digit lcd black red | NULL          | 56.0

  (Pairs 1 and 3 match; pair 2 does not.)
Contributions
In summary, the following are our contributions:

• We present Ditto, a novel EM solution based on pre-trained language models (LMs) such as BERT, DistilBERT, and ALBERT. We fine-tune and cast EM as a sequence-pair classification problem to leverage such models with a simple architecture. To the best of our knowledge, Ditto is the first EM solution that leverages pre-trained Transformer-based LMs, which are powerful LMs that have been shown to provide deeper language understanding.

• We also developed three optimization techniques to further improve Ditto's matching capability through injecting domain knowledge, summarizing long strings, and augmenting training data with (difficult) examples. The first two techniques help Ditto focus on the right information for making matching decisions. The last technique, data augmentation, is adapted from [24] for EM to help Ditto learn "harder" to understand the data invariance properties that may exist but are beyond the provided labeled examples, and also to reduce the amount of training data required.

• We evaluated the effectiveness of Ditto on three benchmark datasets: the Entity Resolution benchmark [21], the Magellan dataset [20], and the WDC product matching dataset [31] of various sizes and domains. Our experimental results show that Ditto consistently outperforms the previous SOTA EM solutions in all datasets and by up to 25% in F1 scores. Furthermore, Ditto consistently performs better on dirty data and is more label efficient: it achieves the same or higher previous SOTA accuracy using less than half the labeled data.

• We applied Ditto to a real-world large-scale matching task on two company datasets, containing 789K and 412K entries respectively. To deploy an end-to-end EM pipeline efficiently, we developed an advanced blocking technique to help reduce the number of pairs to consider for Ditto. Ditto obtains high accuracy, 96.5% F1 on a holdout dataset. The blocking phase also helped speed up the end-to-end EM deployment significantly, by up to 3.8 times, compared to naive blocking techniques.

• Finally, we will open-source Ditto in the future.
Outline
Section 2 overviews Ditto and pre-trained LMs. Section 3 describes how we optimize Ditto with domain knowledge, summarization, and data augmentation. Our experimental results are described in Section 4 and the case study is presented in Section 5. We discuss related work in Section 6 and conclude in Section 7.
Figure 2: An EM system architecture with Ditto as the matcher. In addition to the training data, the user of Ditto can specify (1) a method for injecting domain knowledge (DK), (2) a summarization module for keeping the essential information, and (3) a data augmentation (DA) operator to strengthen the training set.
2. BACKGROUND AND ARCHITECTURE
We present the main concepts behind EM and provide some background on pre-trained LMs before we describe how we fine-tune the LMs on EM datasets to train EM models. We also present a simple method for reducing EM to a sequence-pair classification problem so that pre-trained LMs can be used for solving the EM problem.
Notations
Ditto's EM pipeline takes as input two collections D and D′ of data entries (e.g., rows of relational tables, XML documents, JSON files, text paragraphs) and outputs a set M ⊆ D × D′ of pairs, where each pair (e, e′) ∈ M is thought to represent the same real-world entity (e.g., person, company, laptop, etc.). A data entry e is a set of key-value pairs e = {(attr_i, val_i)}_{1 ≤ i ≤ k}, where attr_i is the attribute name and val_i is the attribute's value represented as text. Note that our definition of data entries is general enough to capture both structured and semi-structured data such as JSON files.

As described earlier, an end-to-end EM system consists of a blocker and a matcher. The goal of the blocking phase is to quickly identify a small subset of D × D′ of candidate pairs with high recall (i.e., a high proportion of the actual matching pairs are in that subset). The goal of a matcher (i.e., Ditto) is to accurately predict, given a pair of entries, whether they refer to the same real-world entity.

Unlike prior learning-based EM solutions that rely on word embeddings and customized RNN architectures to train the matching model (see Section 6 for a detailed summary), Ditto trains the matching models by fine-tuning pre-trained LMs in a simpler architecture.

Pre-trained LMs such as BERT [11], ALBERT [22], and GPT-2 [32] have demonstrated good performance on a wide range of NLP tasks. They are typically deep neural networks with multiple Transformer layers [42], typically 12 or 24 layers, pre-trained on large text corpora such as Wikipedia articles in an unsupervised manner. During pre-training, the model is self-trained to perform auxiliary tasks such as missing token and next-sentence prediction. Studies [7, 41] have shown that, after pre-training, the shallow layers capture lexical meaning while the deeper layers capture syntactic and semantic meanings of the input sequence.

A specific strength of pre-trained LMs is that they learn the semantics of words better than conventional word embedding techniques such as word2vec, GloVe, or FastText. This is largely because the Transformer architecture calculates token embeddings from all the tokens in the input sequence; the embeddings it generates are thus highly contextualized and capture the semantic and contextual understanding of the words. Consequently, such embeddings can capture polysemy, i.e., discern that the same word may have different meanings in different phrases. For example, the word
Sharp has different meanings in "Sharp resolution" versus "Sharp TV". Pre-trained LMs will embed "Sharp" differently depending on the context, while traditional word embedding techniques such as FastText always produce the same vector independent of the context. Such models can also understand the opposite, i.e., that different words may have the same meaning. For example, the words immersion and immers (respectively, (deluxe, dlux) and (2.0, 2)) are likely the same given their respective contexts. Thus, such language understanding capability of pre-trained LMs can improve EM performance.
Figure 3: Ditto's model architecture. Ditto serializes the two entries as one sequence and feeds it to the model as input. The model consists of (1) token embeddings and Transformer layers [49] from a pre-trained language model (e.g., BERT) and (2) task-specific layers (linear followed by softmax). Conceptually, the [CLS] token "summarizes" all the contextual information needed for matching as a contextualized embedding vector E′_[CLS], which the task-specific layers take as input for classification.

A pre-trained LM can be fine-tuned with task-specific training data so that it becomes better at performing that task. Here, we fine-tune a pre-trained LM for the EM task with a labeled training dataset consisting of positive and negative pairs of matching and non-matching entries as follows:

1. Add task-specific layers after the final layer of the LM. For EM, we add a simple fully connected layer and a softmax output layer for binary classification.
2. Initialize the modified network with parameters from the pre-trained LM.
3. Train the modified network on the training set until it converges.

The result is a model fine-tuned for the EM task. In Ditto, we fine-tune the popular base 12-layer BERT model [11] and its distilled variant DistilBERT [36], which is smaller but more efficient. However, our proposed techniques are independent of the choice of pre-trained LMs, and our experimental results (Table 6) indicate that Ditto can potentially perform even better with larger pre-trained LMs. We illustrate the model architecture in Figure 3. The pair of data entries is serialized (see next section) as input to the LM and the output is a match or no-match decision. Ditto's architecture is much simpler when compared to many state-of-the-art EM solutions today [27, 12]. Even though the bulk of the "work" is simply off-loaded to pre-trained LMs, we show that this simple scheme works surprisingly well in our experiments.
Since LMs take token sequences (i.e., text) as input, a key challenge is to convert the candidate pairs into token sequences so that they can be meaningfully ingested by Ditto. Ditto serializes data entries as follows: for each data entry e = {(attr_i, val_i)}_{1 ≤ i ≤ k}, we let

  serialize(e) ::= [COL] attr_1 [VAL] val_1 ... [COL] attr_k [VAL] val_k,

where [COL] and [VAL] are special tokens indicating the start of attribute names and values respectively. For example, the first entry of the second table is serialized as:

  [COL] title [VAL] instant immers spanish dlux 2 [COL] manf./modelno [VAL] NULL [COL] price [VAL] 36.11

To serialize a candidate pair (e, e′), we let

  serialize(e, e′) ::= [CLS] serialize(e) [SEP] serialize(e′) [SEP],

where [SEP] is the special token separating the two sequences and [CLS] is the special token necessary for BERT to encode the sequence pair into a 768-dimensional vector, which is fed into the fully connected layers for classification. A code sketch of this scheme follows.
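Below is a minimal sketch of the serialization scheme, assuming entries are represented as Python dicts of attribute-value strings; the function names are illustrative.

    def serialize(entry):
        """Serialize one entry {attr: val} as '[COL] attr [VAL] val ...'."""
        return " ".join(f"[COL] {a} [VAL] {v}" for a, v in entry.items())

    def serialize_pair(e1, e2):
        """Serialize a candidate pair; the tokenizer later adds the [CLS]
        and surrounding [SEP] tokens as in serialize(e, e')."""
        return serialize(e1) + " [SEP] " + serialize(e2)

    entry = {"title": "instant immers spanish dlux 2",
             "manf./modelno": "NULL",
             "price": "36.11"}
    print(serialize(entry))
    # -> [COL] title [VAL] instant immers spanish dlux 2
    #    [COL] manf./modelno [VAL] NULL [COL] price [VAL] 36.11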
Other serialization schemes
There are different ways to serialize data entries so that LMs can treat the input as a sequence classification problem. For example, one can omit the special tokens [COL] and/or [VAL], or exclude the attribute names attr_i during serialization. We found that including the special tokens to retain the structure of the input does not hurt performance in general, and that excluding the attribute names tends to help only when the attribute names do not contain useful information (e.g., names such as attr1, attr2, ...) or when the entries contain only one column. A more rigorous study of this matter is left for future work.

Heterogeneous schemas
As shown, the serialization method of Ditto does not require data entries to adhere to the same schema. It also does not require the attributes of data entries to be matched prior to executing the matcher, which is in sharp contrast to other EM systems such as DeepER [12] or DeepMatcher [27]. (In DeepMatcher, the requirement that both entries have the same schema can be removed by treating the values in all columns as one value under one attribute.) Furthermore, Ditto can also ingest and match hierarchically structured data entries by serializing nested attribute-value pairs with special start and end tokens (much like Lisp or XML-style parenthesis structure).
3. OPTIMIZATIONS IN DITTO
As we will describe in Section 4, the basic version of Ditto, which leverages only the pre-trained LM, already outperforms the SOTA on average. Here, we describe three further optimization techniques that facilitate and challenge Ditto to learn "harder", and consequently make better matching decisions.
Our first optimization allows domain knowledge to be injected into Ditto by pre-processing the input sequences (i.e., serialized data entries) to emphasize which pieces of information are potentially important. This follows the intuition that when human workers make a matching/non-matching decision on two data entries, they typically look for spans of text that contain key information before making the final decision. Even though we can also train deep learning EM solutions to learn such knowledge, we would require a significant amount of training data to do so. As we will describe, this pre-processing step on the input sequences is lightweight and yet can yield significant improvements. Our experiment results show that with less than 5% of additional training time, we can improve the model's performance by up to 8%. There are two main types of domain knowledge that we can provide to Ditto.
Span Typing
The type of a span of tokens is one kind of domain knowledge that can be provided to Ditto. Product id, street number, and publisher are examples of span types. Span types help Ditto avoid mismatches: with span types, for example, Ditto is likelier to avoid matching a street number with a year or a product id. Table 1 summarizes the main span types that human workers would focus on when matching three types of entities in our benchmark datasets.

Table 1: Main span types for matching entities in our benchmark datasets.

  Entity Type                  | Types of Important Spans
  Publications, Movies, Music  | Persons (e.g., Authors), Year, Publisher
  Organizations, Employers     | Last 4 digits of phone, Street number
  Products                     | Product ID, Brand, Configurations (num.)
The developer specifies a recognizer to type spans of tokens from attribute values. The recognizer takes a text string v as input and returns a list recognizer(v) = {(s_i, t_i, type_i)}_{i ≥ 1} of start/end positions (s_i, t_i) of each span in v and the corresponding type of the span. Ditto's current implementation leverages an open-source Named-Entity Recognition (NER) model [39] to identify known types such as persons, dates, or organizations, and uses regular expressions to identify specific types such as product IDs, the last 4 digits of phone numbers, etc.

After the types are recognized, the original text v is replaced by a new text where special tokens are inserted to reflect the types of the spans. For example, a phone number "(866) 246-6453" may be replaced with "(866) 246- [LAST] 6453 [/LAST]", where [LAST]/[/LAST] indicate the start/end of the last 4 digits, and additional spaces are also added because of tokenization. In our implementation, when we are sure that the span type has only one token or the NER model is inaccurate in determining the end position, we drop the end indicator and keep only the start indicator token.

Intuitively, these newly added special tokens are additional signals to the self-attention mechanism that already exists in pre-trained LMs such as BERT. If two spans have the same type, then Ditto picks up the signal that they are likelier to be the same and, hence, they are aligned together for matching. In the above example, when the model sees two encoded sequences with the [LAST] special tokens (e.g., "... [LAST] 6453 [/LAST] ..." and "... [LAST] 0000 [/LAST] ..."), it is likely to take the hint to align "6453" with "0000" without relying on other patterns elsewhere in the sequence that may be harder to learn.
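Below is a minimal sketch of a regular-expression recognizer for one span type, the last 4 digits of a phone number. It is a toy stand-in for Ditto's recognizer interface (which also wraps an NER model); the pattern and function name are ours.

    import re

    def tag_last4(text):
        """Tag the last 4 digits of a US-style phone number with
        [LAST]...[/LAST], inserting spaces for tokenization."""
        pattern = re.compile(r"(\(\d{3}\)\s*\d{3}-)(\d{4})")
        return pattern.sub(r"\1 [LAST] \2 [/LAST]", text)

    print(tag_last4("(866) 246-6453"))
    # -> (866) 246- [LAST] 6453 [/LAST]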
Span Normalization
The second kind of domain knowledge that can be passed to Ditto rewrites syntactically different but equivalent spans into the same string. This way, they will have identical embeddings and it becomes easier for Ditto to detect that the two spans are identical. For example, we can enforce that "VLDB journal" and "VLDBJ" are the same by rewriting both as VLDBJ. Similarly, we can enforce the general knowledge that two differently formatted but equal numbers are the same by rewriting them into one canonical string. The developer specifies a set of rewriting rules to rewrite spans. The specification consists of a function that first identifies the spans of interest before it replaces them with the rewritten spans. Ditto contains a number of rewriting rules for numbers, including rules that round all floating-point numbers to 2 decimal places and drop all commas from integers (e.g., "2,020" → "2020"). For abbreviations, we allow the developers to specify a dictionary of synonym pairs to normalize all synonym spans to the same string. A sketch of such rules follows.
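The following is a hedged sketch of such rewriting rules; the synonym dictionary and the rule set are illustrative examples, not Ditto's shipped rules.

    import re

    SYNONYMS = {"vldb journal": "VLDBJ"}   # illustrative developer-provided pairs

    def normalize(text):
        """Rewrite equivalent spans into one canonical string."""
        # Drop commas from integers: "2,020" -> "2020".
        text = re.sub(r"\d{1,3}(?:,\d{3})+",
                      lambda m: m.group().replace(",", ""), text)
        # Round floating-point numbers to 2 decimal places: "3.14159" -> "3.14".
        text = re.sub(r"\d+\.\d{3,}",
                      lambda m: f"{float(m.group()):.2f}", text)
        # Normalize synonym spans to the same string.
        for span, canon in SYNONYMS.items():
            text = re.sub(re.escape(span), canon, text, flags=re.IGNORECASE)
        return text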
Summarization
When the value is an extremely long string, it becomes harder for the LM to understand what to pay attention to when matching. In addition, one limiting factor of Transformer-based pre-trained LMs is that there is a limit on the sequence length of the input. For example, the input to BERT can have at most 512 sub-word tokens. It is thus important to summarize the serialized entries down to the maximum allowed length while retaining the key information. A common practice is to truncate the sequences so that they fit within the maximum length. However, the truncation strategy does not work well for EM in general because the important information for matching is usually not at the beginning of the sequences.

There are many ways to perform summarization [25, 33, 35]. In Ditto's current implementation, we use a TF-IDF-based summarization technique that retains the non-stopword tokens with the highest TF-IDF scores. We ignore the start and end tags generated by span typing in this process and use the list of stop words from the scikit-learn library. By doing so, Ditto feeds only the most informative tokens to the LM. We found that this technique works well in practice. Our experiment results show that it improves the F1 score of Ditto on a text-heavy dataset from ~40% to over 92%, and we plan to add more summarization techniques to Ditto's library in the future.
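The sketch below shows one way to implement this TF-IDF summarizer, assuming per-token IDF scores have been precomputed over the corpus; it is our simplified reading of the technique, not Ditto's exact code.

    from collections import Counter
    from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

    def summarize(tokens, idf, max_len=256):
        """Keep the non-stopword tokens with the highest TF-IDF scores.

        `idf` maps token -> inverse document frequency precomputed on the
        corpus. Span-typing tags would be set aside before this step.
        """
        tf = Counter(tokens)
        scores = {t: tf[t] * idf.get(t, 0.0)
                  for t in tf if t not in ENGLISH_STOP_WORDS}
        keep = set(sorted(scores, key=scores.get, reverse=True)[:max_len])
        return [t for t in tokens if t in keep][:max_len]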
Data Augmentation
We now describe how we apply data augmentation to augment the training data for entity matching. Data augmentation (DA) is a commonly used technique in computer vision for generating additional training data from existing examples by simple transformations such as cropping, flipping, rotation, padding, etc. The DA operators not only add more training data; the augmented data also allows the model to learn to make predictions invariant of these transformations.

Similarly, DA can add training data that helps EM models learn "harder". Although labeled examples for EM are arguably not hard to obtain, invariance properties are very important for making the solution more robust to dirty data, such as missing values (NULLs), values placed under the wrong attributes, or missing tokens. Next, we introduce a set of DA operators for EM that help train more robust models.
Augmentation operators for EM
The proposed DA operators are summarized in Table 2. If s is a serialized pair of data entries with a match or no-match label l, then an augmented example is a pair (s′, l), where s′ is obtained by applying an operator o on s and s′ has the same label l as before.

Table 2: Data augmentation operators in Ditto. The operators are at 3 different levels: span-level, attribute-level, and entry-level. All samplings are done uniformly at random.
  Operator     | Explanation
  span_del     | Delete a randomly sampled span of tokens
  span_shuffle | Randomly sample a span and shuffle the tokens' order
  attr_del     | Delete a randomly chosen attribute and its value
  attr_shuffle | Randomly shuffle the order of all attributes
  entry_swap   | Swap the order of the two data entries e and e′

The operators are divided into 3 categories. The first category consists of span-level operators, such as span_del and span_shuffle. These two operators are used in NLP tasks [48, 24] and were shown to be effective for text classification. For span_del, we randomly delete from s a span of tokens of length at most 4 without special tokens (e.g., [SEP], [COL], [VAL]). For span_shuffle, we sample a span of length at most 4 and randomly shuffle the order of its tokens. These two operators are motivated by the observation that making a match/no-match decision can sometimes be "too easy" when the candidate pair of data entries contains multiple spans of text supporting the decision. For example, suppose our negative examples for matching company data in the existing training data are similar to what is shown below:

  [CLS] ... [VAL] Google LLC ... [VAL] (866) 246-6453 [SEP] ... [VAL] Alphabet inc ... [VAL] (650) 253-0000 [SEP]

The model may learn to predict "no-match" based on the phone number alone, which is insufficient in general. On the other hand, by corrupting parts of the input sequence (e.g., dropping phone numbers), DA forces the model to learn beyond that, by leveraging the remaining signals, such as the company name, to predict "no-match".

The second category consists of attribute-level operators: attr_del and attr_shuffle. The operator attr_del randomly deletes an attribute (both name and value) and attr_shuffle randomly shuffles the order of the attributes of both data entries. The motivation for attr_del is similar to span_del and span_shuffle, but it gets rid of an attribute entirely. The attr_shuffle operator allows the model to learn the property that the matching decision should be independent of the ordering of attributes in the sequence.

The last operator, entry_swap, swaps the order of the pair (e, e′) with probability 1/2. This teaches the model to make symmetric decisions (i.e., F(e, e′) = F(e′, e)) and helps double the size of the training set if both input tables are from the same data source. A sketch of these operators appears below.
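The following is a simplified sketch of three of the operators. Ditto applies them to serialized sequences; here we operate on token lists and attribute dicts for readability, and all function names are illustrative.

    import random

    def span_del(tokens, max_len=4):
        """Delete a randomly sampled span of up to `max_len` ordinary tokens."""
        special = {"[CLS]", "[SEP]", "[COL]", "[VAL]"}
        positions = [i for i, t in enumerate(tokens) if t not in special]
        if not positions:
            return tokens
        start = random.choice(positions)
        length = random.randint(1, max_len)
        return [t for i, t in enumerate(tokens)
                if not (start <= i < start + length) or t in special]

    def attr_del(entry):
        """Delete a randomly chosen attribute (both name and value)."""
        attr = random.choice(list(entry))
        return {a: v for a, v in entry.items() if a != attr}

    def entry_swap(e1, e2):
        """Swap the order of the two entries with probability 1/2."""
        return (e2, e1) if random.random() < 0.5 else (e1, e2)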
MixDA: interpolating the augmented data
Unlike DA operators for images, which almost always preserve the image labels, the operators for EM can distort the input sequence so much that the label becomes incorrect. For example, the attr_del operator may drop the company name entirely, and the remaining attributes may contain no useful signals to distinguish the two entries.

To address this issue, Ditto applies MixDA, a recently proposed data augmentation technique for NLP tasks [24], illustrated in Figure 4. Instead of using the augmented example directly, MixDA computes a convex interpolation of the original example with the augmented example. Hence, the interpolated example is somewhere in between, i.e., it is a "partial" augmentation of the original example, and this interpolated example is expected to be less distorted than the augmented one.

The idea of interpolating two examples was originally proposed for computer vision tasks [53]. For EM or text data, since we cannot directly interpolate sequences, MixDA interpolates their representations by the language model instead. We omit the technical details and refer the interested readers to [24]. In practice, augmentation with MixDA slows training because the LM is called twice. However, the prediction time is not affected since the DA operators are only applied to training data.
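Conceptually, the interpolation fits in a few lines of PyTorch: if h and h′ are the LM's [CLS] representations of the original and the augmented sequences, MixDA trains on λh + (1 − λ)h′. The function below is our sketch of this idea, not the implementation from [24].

    def mixda_cls(model, orig, aug, lam):
        """Interpolate the LM's [CLS] representations of the original and the
        augmented sequence; `lam` is sampled from a Beta distribution per [24].
        `orig` and `aug` are tokenizer outputs (dicts of input tensors)."""
        h_orig = model(**orig).last_hidden_state[:, 0, :]
        h_aug = model(**aug).last_hidden_state[:, 0, :]
        return lam * h_orig + (1.0 - lam) * h_aug  # fed to linear + softmax head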
Figure 4: Data augmentation with MixDA. To apply MixDA, we first transform the example with a DA operator and pass it to the LM. We then interpolate the representations of the original and the augmented examples. Finally, we feed the interpolation to the rest of the NN and back-propagate.
4. EXPERIMENTS
We present the experiment results on benchmark datasets for EM: the ER Benchmark datasets [21], the Magellan datasets [20], and the WDC product data corpus [31]. Ditto achieves new SOTA results on all these datasets and outperforms the previous best results by up to 25% in F1 score. The results show that Ditto is more robust to dirty data and performs well when the training set is small. Ditto is also more label-efficient, as it achieves the previous SOTA results using only 1/2 of the training data across multiple subsets of the WDC corpus. Our ablation analysis shows that (1) using pre-trained LMs contributes to over 50% of Ditto's performance gain and (2) all 3 optimizations, domain knowledge (DK), summarization (SU), and data augmentation (DA), are effective. For example, SU improves the performance on a text-heavy dataset by 41%, DK leads to 1.98% average improvement on the ER-Magellan datasets, and DA improves on the WDC datasets by 2.53% on average.

Table 3: The 13 datasets divided into 4 categories of domains. The datasets marked with † are text-heavy (Textual). Each dataset with * has an additional dirty version to test the models' robustness against noisy data.

  Datasets                                  | Domains
  Amazon-Google, Walmart-Amazon*            | software / electronics
  Abt-Buy†, Beer                            | product
  DBLP-ACM*, DBLP-Scholar*, iTunes-Amazon*  | citation / music
  Company†, Fodors-Zagats                   | company / restaurant

We experimented with all the 13 publicly available datasets used for evaluating DeepMatcher [27]. These datasets are from the ER Benchmark datasets [21] and the Magellan data repository [10]. We summarize the datasets in Table 3 and refer to them as ER-Magellan. These datasets are for training and evaluating matching models for various domains including products, publications, and businesses. Each dataset consists of candidate pairs from two structured tables of entity records of the same schema. The pairs are sampled from the results of blocking and manually labeled. The positive rate (i.e., the ratio of matched pairs) ranges from 9.4% (Walmart-Amazon) to 25% (Company). The number of attributes ranges from 1 to 8.

Among the datasets, the Abt-Buy and Company datasets are text-heavy, meaning that at least one attribute contains long text. Also, following [27], we use the dirty version of the DBLP-ACM, DBLP-Scholar, iTunes-Amazon, and Walmart-Amazon datasets to measure the robustness of the models against noise. These datasets are generated from the clean version by randomly emptying attributes and appending their values to another randomly selected attribute. Each dataset is split into training, validation, and test sets. We list the size of each dataset in Table 5.

The WDC product data corpus [31] contains 26 million product offers and descriptions collected from e-commerce websites [47]. The goal is to find product offer pairs that refer to the same product. To evaluate the accuracy of product matchers, the dataset provides 4,400 manually created golden labels of offer pairs from 4 categories: computers, cameras, watches, and shoes. Each category has a fixed number of 300 positive and 800 negative pairs. For training, the dataset provides for each category pairs that share the same product ID, such as GTINs or MPNs mined from the product's webpage. The negative examples are created by selecting pairs that have high textual similarity but different IDs. These labels are further reduced to different sizes to test the models' label efficiency. We summarize the different subsets in Table 4 and refer to these subsets as the WDC datasets.
Table 4: Different subsets of the WDC product data corpus. Each subset (except Test) is split into a training set and a validation set. The last column shows the positive rate (%POS) of each category in the xLarge set. The positive rate on the test set is 27.27% for all the categories.

  Categories | Test  | Small | Medium | Large   | xLarge  | %POS
  Computers  | 1,100 | 2,834 | 8,094  | 33,359  | 68,461  | 14.15%
  Cameras    | 1,100 | 1,886 | 5,255  | 20,036  | 42,277  | 16.98%
  Watches    | 1,100 | 2,255 | 6,413  | 27,027  | 61,569  | 15.05%
  Shoes      | 1,100 | 2,063 | 5,805  | 22,989  | 42,429  | 9.76%
  All        | 4,400 | 9,038 | 25,567 | 103,411 | 214,736 | 14.10%
Each entry in this dataset has 5 attributes. Ditto uses only the title attribute because it contains rich product information such as brands, IDs, and price, making the rest of the attributes redundant. Meanwhile, DeepMatcher is allowed to use any subset of attributes to determine the best attribute set, as in [31].
We implemented Ditto in PyTorch [29] and the Transformers library [49]. The default setting uses the uncased 6-layer DistilBERT [36] pre-trained model and half-precision floating point (fp16) to accelerate the training and prediction speed. In all the experiments, we fix the learning rate to be 3e-5 and the max sequence length to be 256. The batch size is 32 if MixDA is used and 64 otherwise. The training process runs a fixed number of epochs (10, 15, or 40 depending on the dataset size) and returns the checkpoint with the highest F1 score on the validation set. We conducted all experiments on a p3.8xlarge AWS EC2 machine with 4 V100 GPUs (one GPU per run).
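For concreteness, the sketch below wires up the hyperparameters above with the Transformers tokenizer; it is a minimal reconstruction of the setup, not the exact training script.

    import torch
    from transformers import AutoTokenizer

    LR = 3e-5        # fixed learning rate
    MAX_LEN = 256    # max sequence length
    BATCH = 64       # 32 when MixDA is enabled

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

    def encode_pair(s1, s2):
        """Tokenize two serialized entries into one [CLS] s1 [SEP] s2 [SEP] input."""
        return tokenizer(s1, s2, truncation=True, max_length=MAX_LEN,
                         padding="max_length", return_tensors="pt")

    def make_optimizer(model):
        return torch.optim.AdamW(model.parameters(), lr=LR)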
Compared methods.
We compare Ditto with the SOTA EM solution DeepMatcher and its variants. We also compare with variants of Ditto without the data augmentation (DA) and/or domain knowledge (DK) optimization to evaluate the effectiveness of each component. We summarize these methods below. We report the average F1 of 5 repeated runs in all the settings.

• DeepMatcher: DeepMatcher [27] is the SOTA matching solution. Compared to Ditto, DeepMatcher customizes the RNN architecture to aggregate the attribute values, then compares/aligns the aggregated representations of the attributes. DeepMatcher leverages FastText [4] to train the word embeddings. When reporting DeepMatcher's F1 scores, we use the numbers in [27] for the ER-Magellan datasets and the numbers in [31] for the WDC datasets. We also reproduced those results using the open-sourced implementation and report the training time.

• DeepMatcher+: Follow-up work [19] slightly outperforms DeepMatcher in the DBLP-ACM dataset and [15] achieves better F1 in the Walmart-Amazon and Amazon-Google datasets. According to [27], the Magellan system ([20], based on classical ML models) outperforms DeepMatcher in the Beer and iTunes-Amazon datasets. For these cases, we denote by DeepMatcher+ the best F1 scores among DeepMatcher and the aforementioned works.

• Ditto: This is the full version of our system with all 3 optimizations, domain knowledge (DK), TF-IDF summarization (SU), and data augmentation (DA), turned on. See the details below.

• Ditto (DA): This version only turns on DA (with MixDA) and SU but does not have the DK optimization. We apply one of the span-level or attribute-level DA operators listed in Table 2 together with the entry_swap operator. We compare the different combinations and report the best one. Following [24], we apply MixDA with the interpolation parameter λ sampled from a Beta distribution.

• Ditto (DK): With only the DK and SU optimizations on, this version of Ditto is expected to have lower F1 scores but train much faster. We apply the span typing to datasets of each domain according to Table 1 and apply the span normalization on the number spans.

• Baseline: This base form of Ditto corresponds simply to fine-tuning a pre-trained LM (DistilBERT) on the EM task. We did not apply any optimizations to the baseline. We pick DistilBERT instead of larger models such as BERT or ALBERT because DistilBERT is faster to train and it also makes for a tougher comparison for Ditto, since larger models are generally perceived to have more powerful language understanding capabilities [52, 23, 22].
Table 5 shows the results on the ER-Magellan datasets. Overall, Ditto (with optimizations) achieves significantly higher F1 scores than the SOTA results (DeepMatcher+). Ditto without optimizations (i.e., the baseline) achieves results comparable to DeepMatcher+. Ditto outperforms DeepMatcher+ in 10/13 cases and by up to 25% (Dirty, Walmart-Amazon), while the baseline outperforms DeepMatcher+ in 8/13 cases and by up to 16% (Dirty, Walmart-Amazon). On the 3 cases where Ditto performs slightly worse than DeepMatcher+, it turns out that using a larger pre-trained LM such as BERT or ALBERT helps fill the gaps (see Table 6). These initial results led us to believe that larger pre-trained language models will further improve Ditto's results, and we leave verifying this hypothesis as future work.

In addition, we found that Ditto performs better on datasets with small training sets. In particular, the average improvement on the 7 smallest datasets is 9.96% vs. 0.32% on average on the rest of the datasets. Ditto is also more robust against data noise than DeepMatcher+. On the 4 dirty datasets, the performance degradation of Ditto is only 0.68 on average, while the performance of DeepMatcher+ degrades by 8.21. These two properties make Ditto more attractive in practical EM settings.

Ditto also achieves promising results on the WDC datasets (Table 7). Ditto achieves the highest F1 score of 94.08 when using all the 215k training data, outperforming the previous best result by 3.92. Similar to what we found on the ER-Magellan datasets, the improvements are higher in settings with fewer training examples (to the right of Table 7). The results also show that Ditto is more label-efficient than DeepMatcher. For example, when using only 1/2 of the data (Large), Ditto already outperforms DeepMatcher with all the training data (xLarge) by 2.89 in All. When using only 1/8 of the data (Medium), the performance is within 1% of DeepMatcher's F1 with 1/2 of the data (Large). The only exception is the shoes category. This may be caused by the large gap in positive label ratios between the training set and the test set (9.76% vs. 27.27% according to Table 4).
Table 5: F1 scores on the ER-Magellan EM datasets. The numbers of DeepMatcher+ (DM+) are the highest available found in [15, 19, 27].

  Datasets        | DM+   | Ditto          | Ditto(DA) | Ditto(DK) | Baseline | Size
  Structured
  Amazon-Google   | 70.7  | 71.42 (+0.72)  | 72.10     | 71.53     | 70.04    | 11,460
  Beer            | 78.8  | 82.12 (+3.32)  | 80.72     | 83.11     | 73.44    | 450
  DBLP-ACM        | 98.45 | 98.65 (+0.2)   | 98.83     | 98.54     | 98.65    | 12,363
  DBLP-Google     | 94.7  | 94.57 (-0.13)  | 94.53     | 94.53     | 94.67    | 28,707
  Fodors-Zagats   | 100   | 97.76 (-2.24)  | 98.18     | 98.16     | 96.76    | 946
  iTunes-Amazon   | 91.2  | 94.19 (+2.99)  | 90.23     | 92.47     | 91.38    | 539
  Walmart-Amazon  | 73.6  | 80.66 (+7.06)  | 81.30     | 79.07     | 76.95    | 10,242
  Dirty
  DBLP-ACM        | 98.1  | 98.63 (+0.53)  | 98.43     | 98.52     | 98.60    | 12,363
  DBLP-Google     | 93.8  | 94.68 (+0.88)  | 94.48     | 94.62     | 94.66    | 28,707
  iTunes-Amazon   | 79.4  | 93.16 (+13.76) | 92.30     | 91.72     | 89.88    | 539
  Walmart-Amazon  | 53.8  | 78.87 (+25.07) | 78.80     | 76.79     | 70.40    | 10,242
  Textual
  Abt-Buy         | 62.8  | 82.60 (+19.80) | 81.66     | 81.90     | 81.74    | 9,575
  Company         | 92.7  | 92.43 (-0.27)  | 92.29     | 92.63     | 41.00    | 112,632
Table 6: F1 scores of Ditto with the base BERT and ALBERT models on the 3 datasets where Ditto with DistilBERT does not outperform DeepMatcher+ (DM+), the SOTA matching models.

  Datasets       | DM+    | Ditto (BERT)   | Ditto (ALBERT)
  DBLP-Google    | 94.70  | 94.80 (+0.10)  | 94.73 (+0.03)
  Fodors-Zagats  | 100.00 | 100.00 (0.00)  | 100.00 (0.00)
  Company        | 92.70  | 93.15 (+0.45)  | 92.89 (+0.19)
Training time.
We plot the training time required by DeepMatcher and Ditto in Figure 6. We do not plot the time for Ditto(DA) because the DK optimization only pre-processes the data and adds no more than 5% of training time. The running time ranges from 69 seconds (450 examples) to 5.2 hours (113k examples). Ditto has a training time similar to DeepMatcher's, although DistilBERT, which is used by Ditto, has a Transformer-based architecture that is deeper and more complex. The speed-up is due to DistilBERT and the fp16 optimization. Ditto with MixDA is about 2-3x slower than Ditto(DK) without MixDA. This is because MixDA requires additional time for generating the augmented pairs and computing with the LM twice. However, this overhead only affects offline training and does not affect online prediction.
Figure 5: F1 scores on the WDC datasets of different versions of Ditto. DM: DeepMatcher. (Panels: all, computers, cameras, watches, shoes; F1 score vs. train+valid size.)
Table 7: F1 scores on the WDC product matching datasets. The numbers for DeepMatcher (DM) are taken from [31].

  Size      | xLarge (1/1)    | Large (1/2)     | Medium (1/8)    | Small (1/20)
  Methods   | DM     Ditto    | DM     Ditto    | DM     Ditto    | DM     Ditto
  Computers | 90.80  95.45    | 89.55  91.70    | 77.82  88.62    | 70.55  80.76
            |        (+4.65)  |        (+2.15)  |        (+10.80) |        (+10.21)
  Cameras   | 89.21  93.78    | 87.19  91.23    | 76.53  88.09    | 68.59  80.89
            |        (+4.57)  |        (+4.04)  |        (+11.56) |        (+12.30)
  Watches   | 93.45  96.53    | 91.28  95.69    | 79.31  91.12    | 66.32  85.12
            |        (+3.08)  |        (+4.41)  |        (+11.81) |        (+18.80)
  Shoes     | 92.61  90.11    | 90.39  88.07    | 79.48  82.66    | 73.86  75.89
            |        (-2.50)  |        (-2.32)  |        (+3.18)  |        (+2.03)
  All       | 90.16  94.08    | 89.24  93.05    | 79.94  88.61    | 76.34  84.36
            |        (+3.92)  |        (+3.81)  |        (+8.67)  |        (+8.02)
Figure 6: Training time vs. dataset size for the ER-Magellan datasets (left) and the WDC datasets (right). Each point corresponds to the training time needed for a dataset using different methods. Ditto(DK) does not use MixDA and is thus faster than the full Ditto. DeepMatcher (DM) ran out of memory on the Company dataset, so the data point is not reported. We omit Ditto(DA) in the figures because its running time is very close to Ditto's.
Next, we analyze the effectiveness of each component (i.e., LM, SU, DK, and DA) by comparing Ditto with its variants without these optimizations. The results are shown in Table 5 and Figure 5.

The use of a pre-trained LM contributes a large portion of the performance gain. On the ER-Magellan datasets (excluding Company), the average improvement of the baseline compared to DeepMatcher+ is 3.49, which accounts for 58% of the improvement of the full Ditto (6.0). While DeepMatcher+ and the baseline Ditto (essentially fine-tuning DistilBERT) are comparable on the Structured datasets, the baseline performs much better on all the Dirty datasets and the Abt-Buy dataset. This confirms our intuition that the language understanding capability is a key advantage of Ditto over existing EM solutions. The Company dataset is a special case because the length of the company articles (3,123 words on average) is much greater than the max sequence length of 256. The SU optimization increases the F1 score of this dataset from 41% to over 92%. On the WDC datasets, across the 20 settings, the LM contributes 3.41 F1 improvement on average, which explains 55.3% of the improvement of the full Ditto (6.16).

The DK optimization is more effective on the ER-Magellan datasets. Compared to the baseline, the improvement of Ditto(DK) is 1.98 on average and is up to 9.67 on the Beer dataset, while the improvement is only 0.22 on average on the WDC datasets. We inspected the span-typing output and found that only 66.2% of entry pairs have spans of the same type. This is caused by the current NER module not extracting product-related spans with the correct types. We expect DK to be more effective with an NER model trained on the product domain.

DA is effective on both benchmarks and more significantly on the WDC datasets. The average F1 score of the full Ditto improves upon Ditto(DK) (without DA) by 0.53 and 2.53, respectively, on the two benchmarks. On the WDC datasets, we found that the span_del operator always performs the best, while the best operators are diverse on the ER-Magellan datasets. We list the best operator for each dataset in Table 8. We note that there is a large space for tuning these operators (e.g., the MixDA interpolation parameter, maximal span length, etc.) and for new operators to further improve the performance. Finding the best DA operators for EM is future work beyond the scope of this paper.
Table 8: Datasets on which each DA operator achieves the best performance. The suffixes (S)/(D) and (Both) denote the clean/dirty version of the dataset or both of them. All operators are applied with the entry_swap operator.

  Operator     | Datasets
  span_shuffle | DBLP-ACM (Both), DBLP-Google (Both), Abt-Buy
  span_del     | Walmart-Amazon(D), Company, all of WDC
  attr_del     | Beer, iTunes-Amazon(S), Walmart-Amazon(S)
  attr_shuffle | Fodors-Zagats, iTunes-Amazon(D)
5. CASE STUDY: EMPLOYER MATCHING
We present a case of applying Ditto to a real-world EM task. An online recruiting platform would like to join its internal employer records with newly collected public records to enable downstream aggregation tasks. Formally, given two tables A and B (internal and public) of employer records, the goal of the task is to find, for every record in table B, a record in table A that represents the same employer. Both tables have 6 attributes: name, addr, city, state, zipcode, and phone. Our goal is to find matching record pairs with both high precision and recall.

Basic blocking.
Our first challenge is the size of the datasets. As shown in Table 9, both tables are of nontrivial sizes even after deduplication, so a naive pairwise comparison is not feasible. The first blocking method we designed is to only match companies with the same zipcode. However, since 60% of the records in Table A do not have the zipcode attribute and some large employers have multiple sites, we use a second blocking method that returns, for each record in Table B, the top-20 most similar records in A ranked by the TF-IDF cosine similarity of the name and addr attributes. We use the union of these two methods as our blocker, which produces 10 million candidate pairs.

Table 9: Sizes of the two employer datasets to be matched.
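A minimal sketch of the second blocking method (top-20 by TF-IDF cosine similarity) is shown below using scikit-learn; at the scale of these tables the similarity computation would be chunked, and the function name is ours.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def topk_blocking(texts_a, texts_b, k=20):
        """For each record in table B (name + addr as one string), return the
        indices of the top-k most similar records in table A by TF-IDF cosine."""
        vec = TfidfVectorizer().fit(texts_a + texts_b)
        A, B = vec.transform(texts_a), vec.transform(texts_b)
        sims = cosine_similarity(B, A)          # |B| x |A|; chunk rows at scale
        return np.argsort(-sims, axis=1)[:, :k]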
Data labeling.
We labeled 10,000 pairs sampled from the results of each blocking method (20,000 labels in total). We sampled pairs of high similarity with higher probability to increase the difficulty of the dataset and train more robust models. The positive rate of all the labeled pairs is 39%. We split the labeled pairs into training, validation, and test sets.

Applying Ditto.
The user of Ditto does not need to extensively tune the hyperparameters but only needs to specify the domain knowledge and choose a data augmentation operator. We observe that the street number and the phone number are both useful signals for matching. Thus, we implemented a simple recognizer that tags the first number string in the addr attribute and the last 4 digits of the phone attribute. Since we would like the trained model to be robust against the large number of missing values, we choose the attr_del operator for data augmentation.

We plot the model's performance in Figure 7. Ditto achieves the highest F1 score of 96.53 when using all the training data. Ditto outperforms DeepMatcher (DM) in F1 and trains faster (even when using MixDA) than DeepMatcher across different training set sizes.
Figure 7: F1 scores and training time for the employer matching models.
Advanced blocking.
Optionally, before applying the trained model to all the candidate pairs, we can use the labeled data to improve the basic blocking method. We leverage Sentence-BERT [34], a variant of the BERT model that trains sentence embeddings for sentence similarity search. The trained model generates a high-dimensional (e.g., 768 for BERT) vector for each record. Although this model has a relatively low F1 (only 92%) and thus cannot replace Ditto, we can use it with vector similarity search to quickly find record pairs that are likely to match. We can greatly reduce the matching time by only testing those pairs of high cosine similarity. We list the running time for each module in Table 10. With this technique, the overall EM process is accelerated by 3.8x (1.69 hours vs. 6.49 hours with/without advanced blocking).
Table 10: Running time for blocking and matching with Ditto. Advanced blocking consists of two steps: computing the representation of each record with Sentence-BERT [34] (Encoding) and similarity search by blocked matrix multiplication [1] (Search). With advanced blocking, we only match each record with the top-10 most similar records according to the model.

           | Basic Blocking | Encoding (GPU) | Search (CPU) | Matching (top-10) | Matching (ALL)
  Time (s) | 537.26         | 2,229.26       | 1,981.97     | 1,339.36          | 22,823.43
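The advanced blocker can be sketched as follows with the sentence-transformers library; the model name is an example SBERT checkpoint, and a real deployment would use blocked matrix multiplication [1] for the search step.

    import numpy as np
    from sentence_transformers import SentenceTransformer

    def advanced_blocking(records_a, records_b, k=10):
        """Encode serialized records with Sentence-BERT and keep, for each
        record in B, only the k most similar records in A by cosine similarity."""
        model = SentenceTransformer("bert-base-nli-stsb-mean-tokens")  # example
        A = model.encode(records_a, normalize_embeddings=True)
        B = model.encode(records_b, normalize_embeddings=True)
        sims = B @ A.T             # cosine, since the vectors are normalized
        return np.argsort(-sims, axis=1)[:, :k]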
6. RELATED WORK
EM solutions have tackled the blocking problem [2, 6, 14, 28, 45] and the matching problem with rules [9, 13, 38, 44], crowdsourcing [16, 18, 43], or machine learning [37, 8, 3, 16, 20]. Recently, EM solutions have used deep learning and achieved promising results [12, 15, 19, 27, 55]. DeepER [12] trains EM models based on the LSTM [17] neural network architecture with word embeddings such as word2vec [26] or GloVe [30]. DeepER also proposed a blocking technique that represents each entry by the LSTM's output. Our advanced blocking technique based on Sentence-BERT [34], described in Section 5, is inspired by this. Auto-EM [55] improves deep learning-based EM models by pre-training the EM model on an auxiliary task of entity type detection. Ditto also leverages transfer learning by fine-tuning pre-trained LMs, which are more powerful models in language understanding. We did not compare Ditto with Auto-EM in experiments because the entity types required by Auto-EM are not available in our benchmarks. However, we expect that pre-training Ditto with EM-specific data/tasks can further improve Ditto's performance, and this is part of our future work. DeepMatcher introduced a design space for applying deep learning methods to EM. Following their template architecture, one can think of Ditto as replacing both the attribute embedding and similarity representation components in the architecture with a single pre-trained LM such as BERT, thus providing a much simpler overall architecture.

All four systems, Auto-EM, DeepER, DeepMatcher, and Ditto, formulate matching as a binary classification problem. The first three take a pair of data entries of the same arity as input and align the attributes before passing them to the system for matching. In contrast, Ditto serializes both data entries as one input with structural tags intact. This way, data entries of different schemas can be uniformly ingested, including hierarchically formatted data such as those in JSON. Our serialization scheme is not only applicable to Ditto, but also to other systems such as Auto-EM, DeepMatcher, and DeepER. In fact, we serialized data entries to DeepMatcher under one attribute using our scheme and observed that DeepMatcher improved by as much as 1.94% on some datasets.

External knowledge is known to be effective in improving neural network models on NLP tasks [5, 40]. Instead of directly modifying the network architecture [46, 51] or the loss function [54] to incorporate domain knowledge, Ditto modularizes the way domain knowledge is incorporated by allowing users to specify and customize rules for preprocessing input entries. Data augmentation has been extensively studied in computer vision and has recently received more attention in NLP [24, 48, 50]. We designed a set of data augmentation operators suitable for EM and apply them with MixDA [24], a recently proposed DA strategy based on convex interpolation. To the best of our knowledge, this is the first time data augmentation has been applied to EM.
7. CONCLUSION
We present Ditto, the first EM system based on fine-tuned, pre-trained Transformer-based language models. Ditto uses a simple architecture to leverage pre-trained LMs and is further optimized by injecting domain knowledge, text summarization, and data augmentation. Our results show that it outperforms existing EM solutions on all three benchmark datasets with significantly less training data. Ditto's good performance can be attributed to the improved language understanding capability, mainly through pre-trained LMs, the more accurate text alignment guided by the injected knowledge, and the data invariance properties learned from the augmented data. We plan to further explore our design choices for injecting domain knowledge, text summarization, and data augmentation. In addition, we plan to extend Ditto to other data integration tasks beyond EM, such as entity type detection and schema matching, with the ultimate goal of building a BERT-like model for tables.

8. REFERENCES

[1] F. Abuzaid, G. Sethi, P. Bailis, and M. Zaharia. To index or not to index: Optimizing exact maximum inner product search. In Proc. ICDE '19, pages 1250–1261, 2019.
[2] L. R. Baxter, R. Baxter, P. Christen, et al. A comparison of fast blocking methods for record linkage. 2003.
[3] M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In Proc. KDD '03, pages 39–48, 2003.
[4] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword information. TACL, 5:135–146, 2017.
[5] Q. Chen, X. Zhu, Z.-H. Ling, D. Inkpen, and S. Wei. Neural natural language inference models enhanced with external knowledge. In Proc. ACL '18, pages 2406–2417, 2018.
[6] P. Christen. A survey of indexing techniques for scalable record linkage and deduplication. TKDE, 24(9):1537–1555, 2011.
[7] K. Clark, U. Khandelwal, O. Levy, and C. D. Manning. What does BERT look at? An analysis of BERT's attention. In Proc. BlackBoxNLP '19, pages 276–286, 2019.
[8] W. W. Cohen and J. Richman. Learning to match and cluster large high-dimensional data sets for data integration. In Proc. KDD '02, pages 475–480, 2002.
[9] N. Dalvi, V. Rastogi, A. Dasgupta, A. Das Sarma, and T. Sarlos. Optimal hashing schemes for entity matching. In Proc. WWW '13, pages 295–306, 2013.
[10] S. Das, A. Doan, P. S. G. C., C. Gokhale, P. Konda, Y. Govind, and D. Paulsen. The Magellan data repository. https://sites.google.com/site/anhaidgroup/projects/data.
[11] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. NAACL-HLT '19, pages 4171–4186, 2019.
[12] M. Ebraheem, S. Thirumuruganathan, S. Joty, M. Ouzzani, and N. Tang. Distributed representations of tuples for entity resolution. PVLDB, 11(11):1454–1467, 2018.
[13] A. Elmagarmid, I. F. Ilyas, M. Ouzzani, J.-A. Quiané-Ruiz, N. Tang, and S. Yin. NADEEF/ER: Generic and interactive entity resolution. In Proc. SIGMOD '14, pages 1071–1074, 2014.
[14] J. Fisher, P. Christen, Q. Wang, and E. Rahm. A clustering-based framework to control block sizes for entity resolution. In Proc. KDD '15, pages 279–288, 2015.
[15] C. Fu, X. Han, L. Sun, B. Chen, W. Zhang, S. Wu, and H. Kong. End-to-end multi-perspective matching for entity resolution. In Proc. IJCAI '19, pages 4961–4967, 2019.
[16] C. Gokhale, S. Das, A. Doan, J. F. Naughton, N. Rampalli, J. Shavlik, and X. Zhu. Corleone: Hands-off crowdsourcing for entity matching. In Proc. SIGMOD '14, pages 601–612, 2014.
[17] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[18] A. Marcus, E. Wu, D. Karger, S. Madden, and R. Miller. Human-powered sorts and joins. PVLDB, 5(1), 2011.
[19] J. Kasai, K. Qian, S. Gurajada, Y. Li, and L. Popa. Low-resource deep entity resolution with transfer and active learning. In Proc. ACL '19, pages 5851–5861, 2019.
[20] P. Konda, S. Das, P. S. G. C., A. Doan, A. Ardalan, J. R. Ballard, H. Li, F. Panahi, H. Zhang, J. F. Naughton, S. Prasad, G. Krishnan, R. Deep, and V. Raghavendra. Magellan: Toward building entity matching management systems. PVLDB, 9(12):1197–1208, 2016.
[21] H. Köpcke, A. Thor, and E. Rahm. Evaluation of entity resolution approaches on real-world match problems. PVLDB, 3(1-2):484–493, 2010.
[22] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. In Proc. ICLR '20, 2020.
[23] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[24] Z. Miao, Y. Li, X. Wang, and W.-C. Tan. Snippext: Semi-supervised opinion mining with augmented data. In Proc. WWW '20, 2020.
[25] R. Mihalcea and P. Tarau. TextRank: Bringing order into text. In Proc. EMNLP '04, pages 404–411, 2004.
[26] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Proc. NIPS '13, pages 3111–3119, 2013.
[27] S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep, E. Arcaute, and V. Raghavendra. Deep learning for entity matching: A design space exploration. In Proc. SIGMOD '18, pages 19–34, 2018.
[28] G. Papadakis, D. Skoutas, E. Thanos, and T. Palpanas. Blocking and filtering techniques for entity resolution: A survey. arXiv preprint arXiv:1905.06167, 2019.
[29] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Proc. NeurIPS '19, pages 8024–8035, 2019.
[30] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In Proc. EMNLP '14, pages 1532–1543, 2014.
[31] A. Primpeli, R. Peeters, and C. Bizer. The WDC training dataset and gold standard for large-scale product matching. In Companion Proc. WWW '19, pages 381–386, 2019.
[32] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. 2019.
[33] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.
[34] N. Reimers and I. Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proc. EMNLP-IJCNLP '19, pages 3982–3992, 2019.
[35] A. M. Rush, S. Chopra, and J. Weston. A neural attention model for abstractive sentence summarization. In Proc. EMNLP '15, 2015.
[36] V. Sanh, L. Debut, J. Chaumond, and T. Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In Proc. EMC '19, 2019.
[37] S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In Proc. KDD '02, pages 269–278, 2002.
[38] R. Singh, V. V. Meduri, A. Elmagarmid, S. Madden, P. Papotti, J.-A. Quiané-Ruiz, A. Solar-Lezama, and N. Tang. Synthesizing entity matching rules by examples. PVLDB, 11(2):189–202, 2017.
[39] Spacy. https://spacy.io/api/entityrecognizer.
[40] Y. Sun, S. Wang, Y. Li, S. Feng, X. Chen, H. Zhang, X. Tian, D. Zhu, H. Tian, and H. Wu. ERNIE: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223, 2019.
[41] I. Tenney, D. Das, and E. Pavlick. BERT rediscovers the classical NLP pipeline. In Proc. ACL '19, pages 4593–4601, 2019.
[42] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Proc. NIPS '17, pages 5998–6008, 2017.
[43] J. Wang, T. Kraska, M. J. Franklin, and J. Feng. CrowdER: Crowdsourcing entity resolution. PVLDB, 5(11):1483–1494, 2012.
[44] J. Wang, G. Li, J. X. Yu, and J. Feng. Entity matching: How similar is similar. PVLDB, 4(10):622–633, 2011.
[45] Q. Wang, M. Cui, and H. Liang. Semantic-aware blocking for entity resolution. TKDE, 28(1):166–180, 2015.
[46] X. Wang, X. He, Y. Cao, M. Liu, and T.-S. Chua. KGAT: Knowledge graph attention network for recommendation. In Proc. KDD '19, pages 950–958, 2019.
[47] WDC Product Data Corpus. http://webdatacommons.org/largescaleproductcorpus/v2.
[48] J. Wei and K. Zou. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In Proc. EMNLP-IJCNLP '19, pages 6382–6388, 2019.
[49] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.
[50] Q. Xie, Z. Dai, E. Hovy, M.-T. Luong, and Q. V. Le. Unsupervised data augmentation. arXiv preprint arXiv:1904.12848, 2019.
[51] B. Yang and T. Mitchell. Leveraging knowledge bases in LSTMs for improving machine reading. In Proc. ACL '17, pages 1436–1446, 2017.
[52] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le. XLNet: Generalized autoregressive pretraining for language understanding. In Proc. NeurIPS '19, pages 5754–5764, 2019.
[53] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. In Proc. ICLR '18, 2018.
[54] Z. Zhang, X. Han, Z. Liu, X. Jiang, M. Sun, and Q. Liu. ERNIE: Enhanced language representation with informative entities. In Proc. ACL '19, pages 1441–1451, 2019.
[55] C. Zhao and Y. He. Auto-EM: End-to-end fuzzy entity-matching using pre-trained deep models and transfer learning. In Proc. WWW '19, 2019.