Alaska: A Flexible Benchmark for Data Integration Tasks
Valter Crescenzi, Andrea De Angelis, Donatella Firmani, Maurizio Mazzei, Paolo Merialdo, Federico Piai, Divesh Srivastava
Roma Tre University and AT&T Chief Data Office
Abstract
Data integration is a long-standing interest of the data management community and has many disparate applications, including business, science and government. We have recently witnessed impressive results in specific data integration tasks, such as Entity Resolution, thanks to the increasing availability of benchmarks. A limitation of such benchmarks is that they typically come with their own task definition and it can be difficult to leverage them for complex integration pipelines. As a result, evaluating end-to-end pipelines for the entire data integration process is still an elusive goal. In this work, we present Alaska, the first benchmark based on a real-world dataset to support seamlessly multiple tasks (and their variants) of the data integration pipeline. The dataset consists of 70k heterogeneous product specifications from 71 e-commerce websites with thousands of different product attributes. Our benchmark comes with profiling meta-data, a set of pre-defined use cases with diverse characteristics, and an extensive manually curated ground truth. We demonstrate the flexibility of our benchmark by focusing on several variants of two crucial data integration tasks, Schema Matching and Entity Resolution. Our experiments show that our benchmark enables the evaluation of a variety of methods that previously were difficult to compare, and can foster the design of more holistic data integration solutions.
Introduction

Data integration is the fundamental problem of providing a unified view over multiple data sources. The sheer number of ways humans represent and misrepresent information makes data integration a challenging problem. Data integration is typically thought of as a pipeline of multiple tasks rather than a single problem. Different works identify different tasks as constituting the data integration pipeline, but some tasks are widely recognized as being at its core, namely, Schema Matching (SM) [35, 4] and Entity Resolution (ER) [17]. A simple illustrative example is provided in Figure 1, involving two sources A and B, with source A providing two camera specifications with two attributes each, and source B providing two camera specifications with three attributes each. SM results in an integrated schema with four attributes, and ER results in an integrated set of three entities, as shown in the unified view in Figure 1.

Source A
ID  | Model       | Resolution
A.1 | Canon 4000D | 18.0Mp
A.2 | Canon 250D  | 24.1Mp

Source B
ID  | Model       | Sensor | MP
B.1 | Canon 4000D | CMOS   | 17
B.2 | Kodak SP1   | CMOS   | 18

Unified view
ID       | Model (A, B)       | Resolution (A) | MP (B) | Sensor (B)
A.1, B.1 | Canon 4000D (A, B) | 18.0Mp (A)     | 17 (B) | CMOS (B)
A.2      | Canon 250D (A)     | 24.1Mp (A)     | (null) | (null)
B.2      | Kodak SP1 (B)      | (null)         | 18 (B) | CMOS (B)

Figure 1: Illustrative example of Schema Matching and Entity Resolution tasks. Boldface column names represent attributes matched by Schema Matching, boldface row IDs represent rows matched by Entity Resolution.

Although fully automated pipelines are still far from being available [13], we have recently witnessed impressive results in specific tasks (notably, Entity Resolution [5, 6, 16, 26, 28]), fueled by recent advances in the fields of machine learning [19] and natural language processing [11] and by the unprecedented availability of real-world benchmarks for training and testing (such as the Magellan data repository [9]). The same trend – i.e., benchmarks influencing the design of innovative solutions – was previously observed in other areas. For instance, the Transaction Processing Performance Council proposed the family of TPC benchmarks for OLTP and OLAP [29, 31], NIST promoted several initiatives with the purpose of supporting research in Information Retrieval tasks, such as TREC (https://trec.nist.gov), and the ImageNet benchmark [10] supports research on visual object recognition.

In the data integration landscape, the Transaction Processing Performance Council has developed TPC-DI [32], a data integration benchmark that focuses on the typical ETL tasks, i.e., extraction and transformation of data from a variety of sources and source formats (e.g., CSV, XML, etc.). Unfortunately, for the ER and SM tasks, only separate task-specific benchmarks are available. For instance, the aforementioned Magellan data repository provides datasets for the ER task where the SM problem has already been solved. Thus, the techniques influenced by such task-specific benchmarks can be difficult to integrate into a complex data integration pipeline. As a result, the techniques evaluated with such benchmarks are limited by choices made in the construction of the datasets and cannot take into account all the challenges that might arise when tackling the integration pipeline using the original source data.
Our approach. Since it is infeasible to get a benchmark for the entire pipeline by just composing isolated benchmarks for different tasks, we provide a benchmark that can support by design the development of end-to-end data integration methods. Our intuition is to provide a single collection of real-world sources, together with profiling meta-data and a manually curated ground truth, on top of which multiple tasks and task variants can be defined.

The Alaska benchmark.
The Alaska dataset consists of almost 70k product specifications extracted from 71 different data sources available on the Web, over 3 domains, or verticals: camera, monitor and notebook. Figure 2a shows a sample specification/record from a data source of the camera domain. Each record is extracted from a different web page and refers to a single product. Records contain attributes with the product properties (e.g., the camera resolution), and are represented in a flat JSON format.

The main features of the Alaska benchmark are summarized below.

• Flexible. Alaska supports a variety of data integration tasks. The version described in this work focuses on Schema Matching and Entity Resolution, but our benchmark can be easily extended to support other tasks (e.g., Data Extraction). For this reason, Alaska is suitable for holistically assessing complex data integration pipelines, solving more than one individual task.

• Heterogeneous. Data sources included in Alaska cover a large spectrum of data characteristics, from small and clean sources to large and dirty ones. Dirty sources may present a variety of records, each providing a different set of attributes, with different representations, while clean sources have more homogeneous records, both in the set of the attributes and in the representation of their values. To this end, Alaska comes with a set of profiling metrics that can be used to select subsets of the data, yielding different use cases with tunable properties.

• Manually curated. Alaska comes with a ground truth that has been manually curated by domain experts. Such a ground truth is large both in terms of number of sources and in the number of records and attributes. For this reason, Alaska is suitable for evaluating methods with high accuracy, without overlooking tail knowledge.
Paper outline. Section 2 discusses related work. Section 3 describes the Alaska data model and the supported tasks. Section 4 provides a profiling of the datasets and of the ground truth. A pre-defined set of use cases and experiments demonstrating the benefits of using Alaska is illustrated in Section 5. Section 6 describes the data collection process behind the Alaska benchmark. Section 7 summarizes the experiences with the Alaska benchmark in the DI2KG Challenges and in the 2020 SIGMOD Programming Contest. Section 8 concludes the paper and briefly presents future work.
Related Work

We now review recent benchmarks for data integration tasks – namely Schema Matching and Entity Resolution – either comprising real-world datasets or providing utilities for generating synthetic ones.
Real-world datasets for SM. In the context of SM, the closest works to ours are XBenchMatch and T2D. XBenchMatch [14, 15] is a popular benchmark for XML data, comprising data from a variety of domains, such as finance, biology, business and travel. Each domain includes two or a few more schemata. T2D [36] is a more recent benchmark for SM over HTML tables, including hundreds of thousands of tables from the Web Data Commons archive (http://webdatacommons.org/), with DBPedia serving as the target schema. Both XBenchMatch and T2D focus on dense datasets, where most of the attributes have non-null values. In contrast, we provide both dense and sparse sources, allowing each of the sources in our benchmark to be selected as the target schema, yielding a variety of SM problems with varying difficulty. Other benchmarks provide smaller datasets and are suitable either for a pairwise SM task [1, 20], like in XBenchMatch, or for a many-to-one SM task [12], like in T2D. Finally, we mention the Ontology Alignment Evaluation Initiative (OAEI) [2] and SemTab (Tabular Data to Knowledge Graph Matching) [22] for the sake of completeness. Such datasets are related to the problem of matching ontologies (OAEI) and HTML tables (SemTab) to relations of a Knowledge Graph, and can potentially be used for building more complex datasets for the SM task.
Real-world datasets for ER. There is a wide variety of real-world datasets for the ER task in the literature. The Cora Citation Matching Dataset [27] is a popular collection; other repositories, such as the Magellan data repository [9], the Leipzig DB group datasets [24, 25] and the WDC product corpus [33], come with a limited amount of manually verified data (approximately 2K matching pairs per dataset). In contrast, the entire ground truth available in our Alaska benchmark has been manually verified by domain experts. Note also that relying on a product identifier, such as the UPC, can produce incomplete results since there are many standards, and hence the same product can be associated with different codes in different sources.
Data generators. While real-world data have fixed size and characteristics, data generators can be flexibly used in a controlled setting. MatchBench [20], for instance, can generate various SM tasks by injecting different types of synthetic noise on real schemata [23]. STBenchmark [1] provides tools for generating synthetic schemata and instances with complex correspondences. The Febrl system [8] includes a data generator for the ER task, which can create fictitious hospital patient data with a variety of cluster size distributions and error probabilities. More recently, EMBench++ [21] provides a system to generate data for benchmarking ER techniques, consisting of a collection of matching scenarios, capturing basic real-world situations (e.g., syntactic variations and structural differences) as well as more advanced situations (e.g., evolving information over time). Finally, the iBench [3] metadata generator can be used to evaluate a wide range of integration tasks, including SM and ER, allowing control of the size and characteristics of the data, such as schemata, constraints, and mappings. In our benchmark, we aim to get the best of both worlds: considering real-world datasets and providing more flexibility in terms of task definition and selection of data characteristics.
Benchmark Tasks
Let S be a set of data sources providing information about a set of entities, where each entity is representative of a real-world product (e.g., Canon EOS 1100D). Each source S ∈ S consists of a set of records, where each record r ∈ S refers to an entity, and the same entity can be referred to by different records. Each record consists of a title and a set of attribute-value pairs. Each attribute refers to one or more properties, and the same property can be referred to by different attributes. We describe below a toy example that will be used in the rest of this section.

Example 1 (Toy example). Consider the sample records in Figure 2. Records r1 and r2 refer to entity "Canon EOS 1100D", while r3 refers to "Sony A7". Attribute brand of r1 and manufacturer of r2 refer to the same underlying property. Attribute battery of r1 refers to the same properties as battery_model and battery_chemistry of r2.

We identify each attribute with the combination of the source name and the attribute name, such as S1.battery. We refer to the set of attributes specified for a given record r as schema(r), and to the set of attributes specified for a given source S ∈ S, that is, the source schema, as schema(S) = ∪_{r∈S} schema(r). Note that in each record we only specify the non-null attributes.

Example 2. In our toy example, schema(S1) = schema(r1) = {S1.brand, S1.resolution, S1.battery, S1.digital_screen, S1.size}, while schema(S2) is the union of schema(r2) and schema(r3).

Similarly, we refer to the set of text tokens appearing in a record r as tokens(r), and to the set of tokens of a source S ∈ S as tokens(S) = ∪_{r∈S} tokens(r).
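To make these definitions concrete, the following is a minimal Python sketch of schema(·) and tokens(·) computed over flat JSON records. The file layout (one JSON object per record) and the simple word-level tokenization are our own assumptions for illustration, not part of the benchmark specification.

```python
import json
import re
from pathlib import Path

def load_source(source_dir):
    """Load the records of a source, assuming one flat JSON object per file."""
    return [json.loads(p.read_text()) for p in Path(source_dir).glob("*.json")]

def schema_of_record(record):
    """schema(r): the non-null attributes of a record."""
    return {k for k, v in record.items() if v not in (None, "")}

def schema_of_source(records):
    """schema(S): the union of the record schemas of a source."""
    return set().union(*(schema_of_record(r) for r in records))

def tokens_of_record(record):
    """tokens(r): the text tokens appearing in a record (title and attribute values)."""
    text = " ".join(str(v) for v in record.values() if v is not None)
    return set(re.findall(r"\w+", text))

def tokens_of_source(records):
    """tokens(S): the union of the record token sets of a source."""
    return set().union(*(tokens_of_record(r) for r in records))
```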
Example 3. In our toy example, tokens(S1) = tokens(r1) = {Cannon, EOS, Buy, Canon, mp, NP-400, ...}, while tokens(S2) is the union of tokens(r2) and tokens(r3).

Based on the presented data model, we now define the tasks and their variants considered in this work to illustrate the flexibility of our benchmark. For SM we consider two variants, namely catalog-based and mediated schema. For ER we consider three variants, namely self-join, similarity-join and a schema-agnostic variant. Finally, we discuss the evaluation metrics used in our experiments. Table 1 summarizes the notation used in this section and in the rest of the paper.

Schema Matching. Given a set of sources S, let A = ∪_{S∈S} schema(S) and let T be a set of properties of interest. T is referred to as the target schema and can be either defined manually or set equal to the schema of a given source in S. Schema Matching is the problem of finding correspondences between elements of two schemata, such as A and T [4, 35]. A formal definition is given below.

Definition 1 (Schema Matching).
Let M+ ⊆ A × T be a set s.t. (a, t) ∈ M+ iff the attribute a refers to the property t. SM can be defined as the problem of finding the pairs in M+.

Example 4. Let S = {S1, S2} and A = schema(S1) ∪ schema(S2) = {a1, ..., a10}, with a1 = S1.battery, a2 = S1.brand, a3 = S1.resolution, a4 = S1.digital_screen, a5 = S1.size, a6 = S2.battery_model, a7 = S2.battery_chemistry, a8 = S2.manufacturer, a9 = S2.score and a10 = S2.size. Let T = {t1, t2, t3}, with t1 = "battery model", t2 = "battery chemistry" and t3 = "brand". Then M+ = {(a1, t1), (a1, t2), (a6, t1), (a7, t2), (a2, t3), (a8, t3)}.

We consider two popular variants of the SM problem.
1. Catalog SM. Given a set of sources S = {S1, S2, ...} and a Catalog source S* ∈ S s.t. T = schema(S*), find the correspondences in M+. It is worth mentioning that, in a popular instance of this variant, the Catalog source may correspond to a Knowledge Graph.

2. Mediated SM. Given a set of sources S = {S1, S2, ...} and a mediated schema T, defined manually with a selection of different real-world properties, find the correspondences in M+.

In both the variants above, S can consist of one or several sources. State-of-the-art algorithms (e.g., FlexMatcher [7]) typically consider one or two sources at a time, rather than the entire set of sources available.

Symbol       | Definition
S            | a set of sources
S ∈ S        | a source
r ∈ S        | a record
schema(r)    | non-null attributes in a record r ∈ S
schema(S)    | non-null attributes in a source S ∈ S
tokens(r)    | text tokens in a record r ∈ S
tokens(S)    | text tokens in a source S ∈ S
A            | non-null attributes in a set of sources S
T            | target schema
V            | records in a set of sources S
M+ ⊆ A × T   | matching attributes
E+ ⊆ V × V   | matching record pairs

Table 1: Notation table.

Challenges.
The main challenges provided by our dataset for the SM problem are detailed below.

• Synonyms. Our dataset contains attributes with different names but referring to the same property, such as S1.brand and S2.manufacturer;

• Homonyms. Our dataset contains attributes with the same names but referring to different properties, such as S1.size (the size of the digital screen) and S2.size (dimension of the camera body);

• Granularity. In addition to one-to-one correspondences (e.g., S1.brand with S2.manufacturer), our dataset contains one-to-many and many-to-many correspondences, due to different attribute granularities. S1.battery, for instance, corresponds to both S2.battery_model and S2.battery_chemistry.

Entity Resolution. Given a set of sources S, let V = ∪_{S∈S} S be the set of records in S. Entity Resolution (ER) is the problem of finding records referring to the same entity. A formal definition is given below.

Definition 2 (Entity Resolution).
Let E+ ⊆ V × V be a set s.t. (u, v) ∈ E+ iff u and v refer to the same entity. ER can be defined as the problem of finding the pairs in E+.

We note that E+ is transitively closed, i.e., if (u, v) ∈ E+ and (u, w) ∈ E+, then (v, w) ∈ E+, and each connected component {C1, ..., Ck} is a clique representing a distinct entity. We call each clique Ci a cluster of V (a sketch of deriving E+ from such clusters is given at the end of this subsection).

Example 5. Let S = {S1, S2} and V = {r1, r2, r3}. Then E+ = {(r1, r2)} and there are two clusters, namely C1 = {r1, r2} representing a "Canon EOS 1100D" and C2 = {r3} representing a "Sony A7".

We consider three variants of the ER problem.

1. Similarity-join ER. Given two sources S = {S_left, S_right}, find the record pairs in E+_sim ⊆ E+, with E+_sim ⊆ S_left × S_right.

2. Self-join ER. Given a set of sources S = {S1, S2, ...}, find the record pairs in E+.

3. Schema-agnostic ER. In both the variants above, we assume that a solution for the SM problem is available, that is, attributes of every input record for ER are previously aligned to a manually specified mediated schema T. This variant is the same as self-join ER but has no SM information available.

Challenges.
The main challenges provided by our dataset for the ER problem are detailed below.

• Variety. Different sources may use different format and naming conventions. For instance, records r1 and r2 conceptually have the same value for the resolution attribute, but use different formats. In addition, records can contain textual descriptions, such as the battery in r1, and data sources of different countries can use different conventions (e.g., inches/cm and "EOS"/"Rebel"). As a result, records referring to different entities can have a more similar representation than records referring to the same entity.

• Noise. Records can contain noisy or erroneous values, due to entity misrepresentation in the original web page and in the data extraction process. An example of noise in the web page is "Cannon" in place of "Canon" in the page title of r1. An example of noise due to data extraction is the rating (score) attribute, which was extracted from a different portion of the web page than the product's specification.

• Skew. Finally, the cluster size distribution over the entire set of sources is skewed, meaning that some entities are over-represented and others are under-represented.
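Recall from Definition 2 that E+ is transitively closed, so the pair-level ground truth can be derived mechanically from the entity clusters. A minimal sketch, using the cluster and record identifiers of the toy example:

```python
from itertools import combinations

def pairs_from_clusters(clusters):
    """Expand entity clusters into the transitively closed set of matching pairs E+.
    `clusters` maps an entity identifier to the record ids referring to it."""
    positive_pairs = set()
    for records in clusters.values():
        # every pair of records inside a cluster is a match (clusters are cliques)
        positive_pairs.update(combinations(sorted(records), 2))
    return positive_pairs

# Toy example: C1 = {r1, r2} ("Canon EOS 1100D"), C2 = {r3} ("Sony A7")
clusters = {"Canon EOS 1100D": ["r1", "r2"], "Sony A7": ["r3"]}
print(pairs_from_clusters(clusters))  # {('r1', 'r2')}
```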
Evaluation metrics. We use the standard performance measures of precision, recall and F1-measure for both the SM and ER tasks. More specifically, let M− = (A × T) \ M+ and E− = (V × V) \ E+. For each of the tasks defined in this paper, we consider manually verified subsets of M+, M−, E+ and E− that we refer to as ground truth, and then we define precision, recall and F1-measure with respect to the manually verified ground truth.

Let such ground truth subsets be denoted as M+_gt, M−_gt, E+_gt and E−_gt. In our benchmark we ensure that M+_gt ∪ M−_gt yields a complete bipartite sub-graph of A × T and that E+_gt ∪ E−_gt yields a complete sub-graph of V × V. More details on the ground truth can be found in Section 4. Finally, in case the SM or ER problem variant at hand has prior restrictions (e.g., SM restricted to one-to-one correspondences or similarity-join ER restricted to inter-source pairs), we only consider the portion of the ground truth satisfying those restrictions.
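A minimal sketch of this evaluation protocol, where predictions and ground truth are represented as sets of pairs (the function name is ours):

```python
def precision_recall_f1(predicted, gt_positive, gt_negative):
    """Precision, recall and F1 computed with respect to the manually
    verified ground truth: only pairs labelled in the ground truth
    (positively or negatively) contribute to the counts."""
    evaluated = predicted & (gt_positive | gt_negative)
    tp = len(evaluated & gt_positive)
    fp = len(evaluated & gt_negative)
    fn = len(gt_positive - predicted)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```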
Dataset Profiling

Alaska contains data from three different e-commerce domains: camera, monitor, and notebook. We refer to such domains as verticals. Table 2 reports, for each vertical v, the number of sources |S_v|, the number of records |V|, the number of records in the largest source |S_max|, with S_max = argmax_{S∈S_v} |S|, the total number of attributes |A|, the average number of record attributes a_avg = avg_{r∈V} |schema(r)|, and the average number of record tokens t_avg = avg_{r∈V} |tokens(r)| (i.e., the record length). Table 2 also shows details about our manually curated ground truth: the number of attributes |T*| in the target schema for the SM task, the number of entities k* and the size of the largest cluster |C*_max| for the ER task. Note that for each source we provide the complete set of attributes and record matches with respect to the mentioned target schema and entities.

In the rest of this section, we provide profiling metrics for our benchmark and show their distribution over Alaska sources. Profiling metrics include: (i) traditional size metrics, such as the number of records and attributes, and (ii) three new metrics, dubbed Attribute Sparsity, Source Similarity and Vocabulary Size, that quantify different dimensions of heterogeneity for one or several sources. The goal of the considered profiling metrics is to allow users of our benchmark to pick and choose different subsets of sources for each vertical so as to match their desired use case. Possible scenarios with different difficulty levels include, for instance, integration tasks over head sources, tail sources, sparse sources, dense sources, clean sources, dirty sources and so on.
Size. Figures 3a and 3b plot the distribution of Alaska sources with respect to the number of records (|S|) and the average number of attributes per record (a_avg = avg_{r∈S} |schema(r)|), respectively. Figure 3a shows that the camera and notebook verticals have a few big sources and a long tail of small sources, while the monitor vertical has more medium-sized sources. As for the number of attributes, Figure 3b shows that camera and notebook follow the same distribution, while monitor has notably more sources with a higher average number of attributes.

Figure 4 reports the distribution of target attributes (4a) and entities (4b) in our manually curated ground truth, with respect to the number of sources in which they appear. It is worth noticing that a significant fraction of attributes in our ground truth are present in most sources, but there are also tail attributes that are present in just a few sources. Regarding entities, observe that in notebook most entities span a small fraction of the sources, while in monitor and camera there are popular entities that are present in a large fraction of the sources. Figure 5 shows the cluster size distribution of entities in the ground truth. Observe that the three Alaska verticals are significantly different: camera has the largest cluster (184 records), monitor has both small and medium-size clusters, and notebook has the most skewed cluster size distribution, with many small clusters.

Table 2: Verticals in the Alaska dataset (the last three columns refer to the ground truth).
v        | |S_v| | |V|    | |S_max| | |A|   | a_avg | t_avg | |T*| | k*  | |C*_max|
camera   | 24    | 29,787 | 14,274  | 4,660 | 16.74 | 8.53  | 56   | 103 | 184
monitor  | 26    | 16,662 | 4,282   | 1,687 | 25.64 | 4.05  | 87   | 232 | 33
notebook | 27    | 23,167 | 8,257   | 3,099 | 29.42 | 9.85  | 44   | 208 | 91

Figure 3: (a) Number of records and (b) average number of attributes per source. Note that the x-axis is in log scale, where the value range on the x-axis is divided into 10 buckets and each bucket is a constant multiple of the previous bucket.

Figure 4: (a) Target attributes and (b) entities by number of Alaska sources in which they are mentioned.

Figure 5: Cluster size distribution.
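The size statistics above can be computed directly from the raw records. A minimal sketch, assuming a per-file JSON layout and computing the statistics per source rather than per vertical as in Table 2:

```python
import json
import re
from pathlib import Path
from statistics import mean

def source_size_stats(source_dir):
    """Number of records, average number of non-null attributes per record
    (a_avg) and average number of tokens per record (t_avg) for one source."""
    records = [json.loads(p.read_text()) for p in Path(source_dir).glob("*.json")]
    n_attrs = [sum(1 for v in r.values() if v not in (None, "")) for r in records]
    n_tokens = [len(re.findall(r"\w+", " ".join(str(v) for v in r.values() if v is not None)))
                for r in records]
    return {
        "records": len(records),
        "a_avg": mean(n_attrs) if n_attrs else 0.0,
        "t_avg": mean(n_tokens) if n_tokens else 0.0,
    }
```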
Attribute Sparsity. Attribute sparsity aims at measuring how often the attributes of a source are actually used by its records. Attribute Sparsity, AS : S → [0, 1], is defined as follows:

AS(S) = 1 − (Σ_{r∈S} |schema(r)|) / (|schema(S)| · |S|)    (1)

In a dense source (low AS), most records have the same set of attributes. In contrast, in a sparse source (high AS), most attributes are null for most of the records. Denser sources represent easier instances for both SM and ER, as they provide many non-null values for each attribute and more complete information for each record.

Figure 6 shows the distribution of Alaska sources according to their AS value: notebook has the largest proportion of dense sources (low AS), while monitor has the largest proportion of sparse sources (high AS).

Figure 6: Source attribute sparsity, from dense to sparse.

Source Similarity. Source Similarity has the objective of measuring the similarity of two sources in terms of attribute values. Source Similarity, SS : S × S → [0, 1], is defined as the Jaccard index of the two source vocabularies, as follows:

SS(S1, S2) = |tokens(S1) ∩ tokens(S2)| / |tokens(S1) ∪ tokens(S2)|    (2)

Source sets with high pair-wise SS values typically have similar representations of entities and properties, and are easy instances for SM.

Figure 7 shows the distribution of source pairs according to their SS value. In all three Alaska verticals the majority of source pairs have medium-small overlap. In camera there are notable cases with SS close to zero, providing extremely challenging instances for SM. Finally, in notebook there are pairs of sources that use approximately the same vocabulary (SS close to one), which can be used as a frame of comparison when testing vocabulary-sensitive approaches.

Figure 7: Pair-wise Source Similarity, from less similar to more similar pairs. Note that the x-axis is in log scale, namely, each bucket is a constant multiple of the previous bucket.
Vocabulary Size. Vocabulary Size, VS : S → ℕ, measures the number of distinct tokens that are present in a source:

VS(S) = |tokens(S)|    (3)

Sources can have a high VS for multiple reasons, including values with long textual descriptions, or attributes with large ranges of categorical or numerical values. In addition, large sources with low sparsity typically have a higher VS than small ones. Sources with high VS usually represent easier instances for the schema-agnostic ER task.

Figure 8 shows the distribution of Alaska sources according to their VS value. The camera vertical has the most verbose sources, while notebook has the largest proportion of sources with a smaller vocabulary. Therefore, camera sources are more suitable on average for testing NLP-based approaches for ER, such as [28].

Figure 8: Source vocabulary size, from concise to verbose. Note that the x-axis is in log scale.
Usage Scenarios and Experiments

We now describe a collection of significant usage scenarios for our Alaska benchmark. In each scenario, we consider one of the tasks in Section 3 and show the performance of a representative method for that task over different source selections, with different profiling properties.

Our experiments were performed on a server machine equipped with an Intel Xeon E5-2699 CPU with 22 cores running 44 threads at speeds up to 3.6 GHz, 512 GB of DDR4 RAM and 4 NVIDIA Tesla P100-SXM2 GPUs, each with 16 GB of available VRAM. The operating system was Ubuntu 17.10, kernel version 4.13.0, with Python 3.8.1 (64-bit) and Java 8.

Let us illustrate the representative methods that we have considered for demonstrating the usage scenarios of our benchmark, and the settings used in our experiments.
Schema Matching Methods. We selected two methods for the Schema Matching tasks, namely Nguyen [30] and FlexMatcher [7].

• Nguyen [30] is a system designed to enrich product catalogs with external specifications retrieved from the Web. This system addresses several issues and includes a catalog SM algorithm. We re-implemented this SM algorithm and used it in our catalog SM experiments. The SM algorithm in Nguyen computes a similarity score between the attributes of an external source and the attributes of a catalog; then the algorithm matches attribute pairs with the highest score. The similarity score is computed by means of a classifier, which is trained with pairs of attributes with the same name, based on the assumption that they represent positive correspondences. It is important to observe that some of these pairs could actually be false positives due to the presence of homonyms. In our experiments, each run of the Nguyen algorithm required less than 10 minutes.

• FlexMatcher [7] is one of the tools included in BigGorilla (https://biggorilla.org), a broad project that gathers solutions for data integration and data preparation. We use the original Python implementation (https://pypi.org/project/flexmatcher, version 1.0.4) in our mediated SM experiments. As suggested by the FlexMatcher documentation (https://flexmatcher.readthedocs.io), we use two sources for training and one source for testing, and keep the default configuration. FlexMatcher uses the training sources to train multiple classifiers by considering several features, including attribute names and attribute values. A meta-classifier eventually combines the predictions made by each base classifier to make a final prediction. It detects the most important similarity features for each attribute of the mediated schema. In our experiments, the training step required from 2 to 5 minutes, and the test step from 10 seconds to 25 minutes, depending on the size of the input sources.

Entity Resolution Methods.
In principle, our benchmark can be used for evaluating a variety of ER techniques. We have selected two recent (and popular) Deep Learning solutions, which have proved very effective, especially when dealing with textual data, namely DeepMatcher [28] and DeepER [16].

• DeepMatcher [28] is a suite of Deep Learning models specifically tailored for ER. We used the Hybrid model of DeepMatcher for our self-join ER and similarity-join ER experiments. This model, which performed best in the original paper, uses a bidirectional RNN with decomposable attention. We used the original DeepMatcher Python implementation (https://github.com/anhaidgroup/deepmatcher).

• DeepER [16] is a schema-agnostic ER system based on a distributed representation of tuples. We re-implemented the DeepER version with RNN and LSTM, which performed best in the original paper, and used it for our schema-agnostic ER experiments.

We trained and tested both DeepMatcher and DeepER on the same selection of the data. For each experiment, we split the ground truth into training (60%), validation (15%) and test (25%). The results, expressed through the classic precision, recall and F1-measure metrics, are calculated on the test set. Since our ER ground truths are transitively closed, so are the training, validation and test sets. For the similarity-join ER and self-join ER tasks we use the SM ground truth. For the self-join ER tasks (schema-based and schema-agnostic) we down-sample the non-matching pairs in the training set to reduce the class imbalance: for each matching pair (u, v), we sampled two non-matching pairs (u, v') and (u', v) with high similarity (i.e., high Jaccard index of token sets); see the sketch below. DeepMatcher training required 30 minutes on average on the similarity-join ER tasks, and 1 hour on average on the self-join ER tasks. DeepER training required 25 minutes on average on the schema-agnostic ER tasks.
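A minimal sketch of the down-sampling step just described; instead of random sampling, this simplified version keeps, for each side of a matching pair, the single most similar non-matching counterpart (the helper names are ours):

```python
import re

def _tokens(record):
    text = " ".join(str(v) for v in record.values() if v is not None)
    return set(re.findall(r"\w+", text.lower()))

def _jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

def downsample_negatives(positive_pairs, records):
    """For each matching pair (u, v), keep two hard non-matching pairs
    (u, v') and (u', v) with high token-set Jaccard similarity.
    `records` maps record ids to flat JSON-like dictionaries."""
    matches = {tuple(sorted(p)) for p in positive_pairs}
    token_sets = {rid: _tokens(r) for rid, r in records.items()}
    negatives = set()
    for u, v in positive_pairs:
        for anchor in (u, v):
            candidates = [w for w in records
                          if w not in (u, v)
                          and tuple(sorted((anchor, w))) not in matches]
            if candidates:
                best = max(candidates,
                           key=lambda w: _jaccard(token_sets[anchor], token_sets[w]))
                negatives.add(tuple(sorted((anchor, best))))
    return negatives
```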
In the following we show the results of our representative selection of state-of-the-art methods over different usage scenarios. In each scenario, we consider one of the ER and SM variants described in Section 3 and show the performance of a method for the same task over three (or more) different source selections. Sources are selected according to the profiling metrics introduced in Section 4, so as to yield task instances with different levels of difficulty (e.g., from easy to hard). For the sake of completeness, we use different verticals of the benchmark for different experiments. Results over other verticals are analogous and therefore are omitted. The usage scenarios discussed in this section are summarized in Table 3.

Table 3: Usage scenarios and main source selection criteria.
Task               | Method      | Vertical | Main source selection criteria
Catalog SM         | Nguyen      | monitor  | catalog sources with decreasing number of records
Mediated SM        | FlexMatcher | camera   | sources with different attribute sparsity and source similarity
Similarity-join ER | DeepMatcher | camera   | sources with increasing attribute sparsity
Self-join ER       | DeepMatcher | notebook | sources with increasing attribute sparsity and skewed cluster size distribution
Schema-agnostic ER | DeepER      | notebook | sources with increasing vocabulary size and skewed cluster size distribution

Catalog SM. Catalog SM approaches are conceived to enrich a target clean source, referred to as catalog, with many other, possibly dirty, sources. Larger catalogs (i.e., with a larger number of records) typically yield better performance, as more redundancy can be exploited. For this task, we consider the Nguyen algorithm [30].

Figure 9 shows Nguyen performance over three different choices of the catalog source S* in the monitor vertical. We select catalog sources with decreasing size and reasonably low attribute sparsity, as reported in Table 4. For each run, we set S = S_monitor \ S*.

Table 4: Sources considered for our catalog SM experiments.
S*            | |S*|  | AS(S*) | Difficulty
jrlinton      | 1,089 | 0.50   | easy
vology        | 528   | 0.46   | medium
nexus-t-co-uk | 136   | 0.30   | hard

Figure 9: Nguyen results for the catalog SM task on different instances from the monitor vertical: (a) S* = jrlinton, (b) S* = vology, (c) S* = nexus-t-co-uk.

As expected, the overall F-measure is lower for smaller catalog sources. We also observe that, even in the same catalog, some attributes can be much easier to match than others. For this reason, in addition to the overall F-measure, we also report the partial F-measure of Nguyen when considering only attributes of the following types.

• Measure indicates an attribute providing a numerical value (like cm, inch, ...);

• Count indicates a quantity (number of USB ports, ...);

• Single-value indicates categorical attributes providing a single value for a given property (e.g., the color of a product);

• Multivalued are categorical attributes with multiple possible values (e.g., the set of ports of a laptop).

Finally, we note that, thanks to the redundancy in larger catalogs, Nguyen can be robust to moderate and even to high attribute sparsity. However, when sparsity becomes too high, performance can quickly decrease. For instance, we ran PSE using eBay as the catalog source (99% sparsity) and obtained an F-measure of 0.33, despite the large size of the source.

Figure 10: FlexMatcher results for the mediated SM task on different instances from the camera vertical.
Mediated SM. Given a mediated schema T, recent mediated SM approaches can use a set of sources with known attribute correspondences with T – i.e., the training sources – to learn features of mediated attribute names and values, and then use such features to predict correspondences for unseen sources – i.e., the test sources. It is worth observing that test sources with low attribute sparsity are easier to process. It is also important that there is enough source similarity between the test source and the training sources. For this task, we consider the FlexMatcher approach [7].

Figure 10 reports the performance of the FlexMatcher approach for four different choices of the training sources S_T1, S_T2 and test source S_t in the camera vertical. Given our manually curated mediated schema, we set S = {S_t}. Each selected test source t has a different sparsity AS(S_t), as reported in Table 5. For each test source t we select two training sources T1, T2 with a different average source similarity SS = (SS(S_T1, S_t) + SS(S_T2, S_t)) / 2.

Table 5: Sources considered for our mediated SM experiment.
S_T1, S_T2             | S_t          | AS(S_t) | SS   | Difficulty
price-hunt, pricedekho | mypriceindia | 0.42    | 0.35 | easy
cambuy, pcconnection   | buzzillions  | 0.87    | 0.09 | medium
cambuy, flipkart       | buzzillions  | 0.87    | 0.05 | hard
cambuy, flipkart       | ebay         | 0.99    | 0.05 | hard

Results in Figure 10 show a lower F-measure over test sources with higher attribute sparsity and lower source similarity with the training sources. Note that both the attribute sparsity of the test source and its source similarity with the training sources matter: solving SM over the same test source (e.g., t = buzzillions) can become more difficult when selecting training sources with less source similarity.

Similarity-join ER. The similarity-join ER task takes as input two sources and returns the pairs of records that refer to the same entity. Similarity-join ER is typically easier when the sources provide as few null values as possible, which can be quantified by our attribute sparsity metric. For this task, we consider the DeepMatcher approach [28].

Figure 11 shows the performance of DeepMatcher for different choices of the input sources S1 and S2 from the camera domain. Since similarity-join ER tasks traditionally assume that the schemata of the two sources are aligned, we consider a subset of the attributes of the selected sources and manually align them by using our SM ground truth. The considered attributes for this experiment are brand, image_resolution and screen_size. We select pairs of sources with increasing attribute sparsity AS(S1 ∪ S2), as reported in Table 6. In the table, for the sake of completeness, we also report the total number of records |S1| + |S2|. It is worth noting that we can select easy and hard instances from both sides of the source size spectrum.

Table 6: Sources considered for our similarity-join ER experiments.
S1, S2            | AS(S1 ∪ S2) | |S1| + |S2| | Difficulty
cambuy, shopmania | 0.08        | 748         | easy
ebay, shopmania   | 0.11        | 14,904      | medium
flipkart, henrys  | 0.41        | 338         | hard

Figure 11: DeepMatcher results for the similarity-join ER task on different instances from the camera vertical.

Results in Figure 11 show that a lower F-measure is obtained for source pairs with higher attribute sparsity. We observe that precision is reasonably high even for the harder instances: the main challenge in this scenario is that of recognizing record pairs that refer to the same entity even when they have only a few non-null attributes that can be used for the purpose. Those pairs contribute to the recall.

Self-join ER. Analogously to similarity-join ER approaches, self-join approaches also tend to work better when the attribute sparsity is low (note that the AS metric, for this experiment, is computed on the aligned schemata). In addition, a skewed cluster size distribution can make the problem harder, as there are more chances that larger clusters, with many different representations of the same entity, are erroneously split or merged with smaller clusters. For this task, we consider again the DeepMatcher approach [28].

Figure 12 shows the performance of DeepMatcher on different subsets of 10 sources from the notebook collection. We choose the notebook vertical for this experiment as it has the largest skew in its cluster size distribution (see Figure 5). Analogously to the similarity-join ER task, the schemata of the selected sources are manually aligned by using our SM ground truth. The considered attributes for this experiment are brand, display_size, cpu_type, cpu_brand, hdd_capacity, ram_capacity and cpu_frequency. We select source sets with increasing attribute sparsity AS(∪_{S∈S} S), as reported in Table 7. In the table, for the sake of completeness, we also report the total number of records in each source set.

Figure 12: DeepMatcher results for the self-join ER task on different instances from the notebook vertical.

Results in Figure 12 show a lower F-measure for source groups with higher attribute sparsity, with a similar decrease in both precision and recall.

Schema-agnostic ER.
Our last usage scenario consists of the same self-join ER task just discussed, but without the alignment of the schemata. The knowledge of which attribute pairs refer to the same property, which is available in the similarity-join and self-join variants above, is here missing. The decision about which records ought to be matched is made solely on the basis of attribute values, concatenating all the text tokens of a record into a long descriptive paragraph. For this reason, the difficulty of schema-agnostic ER instances depends on the variety of the text tokens available, which can be quantified by our vocabulary size metric. For this task, we use the DeepER [16] approach.

Figure 13: DeepER results for the schema-agnostic ER task on different instances from the notebook vertical.

Figure 13 shows the performance of DeepER on different subsets of the sources in the notebook vertical. For each record, we concatenate all its text tokens into a single descriptive paragraph, as sketched below.
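A minimal sketch of this schema-agnostic serialization (the sample record and the separator are illustrative, not taken from the benchmark):

```python
import re

def serialize(record):
    """Concatenate all text tokens of a record into one descriptive string,
    ignoring which attribute each token comes from."""
    text = " ".join(str(v) for v in record.values() if v is not None)
    return " ".join(re.findall(r"\w+", text.lower()))

# Hypothetical record for illustration.
record = {"<page title>": "Buy Canon EOS 1100D 18 mp", "brand": "Canon", "battery": "NP-400"}
print(serialize(record))  # buy canon eos 1100d 18 mp canon np 400
```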
We select source sets with decreasing average vocabulary size, as reported in Table 8. Specifically, for each source S ∈ S we compute our VS metric separately for the tokens that come from the page title (VS(S_title)) and for all the other tokens (VS(S_–title)). Then we report the average VS(S_title) and VS(S_–title) values over all the sources in the selected set S. In the table, for the sake of completeness, we also report the total number of records in each source set.

Table 7: Sources considered for our self-join ER experiment. All sets S have cardinality 10.

Table 8: Sources considered for our schema-agnostic ER experiment.
Sources S | avg VS(S_title) | avg VS(S_–title) | Σ|S| | Difficulty
tigerdirect, bidorbuy, gosale, ebay, staples, softwarecity, mygofer, buy, thenerds, flexshopper | 1,579.3 | 19,451.0 | 11,561 | easy
tigerdirect, bidorbuy, gosale, wallmartphoto, staples, livehotdeals, pricequebec, softwarecity, mygofer, thenerds | 733.8 | 8,931.0 | 2,453 | medium
tigerdirect, bidorbuy, gosale, wallmartphoto, staples, ni, livehotdeals, bhphotovideo, topendelectronic, overstock | 458.3 | 5,135.0 | 1,621 | hard

Results in Figure 13 show a lower F-measure for source groups that have, on average, a smaller vocabulary size.

Data Collection

In this section, we describe the data collection process behind the Alaska benchmark. Such process consists of the following main steps: (i) source discovery, (ii) data extraction, (iii) data filtering, and (iv) source filtering.

Source Discovery.
The first step has the goal of finding websites publishing pages with specifications for products of a vertical of interest. To this end, we used the focused crawler of the DEXTER project [34] (https://github.com/disheng/DEXTER). It discovers the target websites and downloads all the product pages, i.e., pages providing detailed information about a main product.

Data Extraction. The pages collected by the DEXTER focused crawler are then processed by an ad-hoc data extraction tool – that we call Carbonara – specifically developed for the Alaska benchmark. From each input page, Carbonara extracts a product specification, that is, a set of key-value pairs. To this end, Carbonara uses a classifier which is trained to select HTML tables and lists that are likely to contain the product specification. The features used by the classifier are related to DOM properties (e.g., number of URLs in the table element, number of bold tags, etc.) and to the presence of domain keywords, i.e., terms that depend on the specific vertical. The training set, as well as the domain keywords, were manually created. The output of Carbonara is a JSON object with all the successfully extracted key-value pairs, plus a special key-value pair containing the page title.

Data Filtering. Since the extraction process is error prone, the JSON objects produced by Carbonara are filtered based on a heuristic rule, with the objective of eliminating noisy attributes due to an imprecise extraction. The rule works as follows. First, key-value pairs with a key that is not present in at least a minimum number of pages of the same source are filtered out. Then, JSON objects with fewer than a minimum number of key-value pairs are discarded. Finally, only sources with more than a minimum number of JSON objects (i.e., product specifications) are gathered.
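A sketch of this filtering rule; the three thresholds are placeholders of our own choosing, since the concrete values are not reported here:

```python
from collections import Counter

# Placeholder thresholds: the actual values used for Alaska are not reported here.
MIN_PAGES_PER_KEY = 10
MIN_PAIRS_PER_RECORD = 3
MIN_RECORDS_PER_SOURCE = 50

def filter_source(records):
    """Apply the data-filtering heuristic to the records of one source:
    drop rare keys, then overly sparse records, then the whole source
    if too few records survive."""
    key_counts = Counter(k for r in records for k in r)
    # 1. drop key-value pairs whose key appears in too few pages of the source
    cleaned = [{k: v for k, v in r.items() if key_counts[k] >= MIN_PAGES_PER_KEY}
               for r in records]
    # 2. drop JSON objects with too few key-value pairs
    cleaned = [r for r in cleaned if len(r) >= MIN_PAIRS_PER_RECORD]
    # 3. keep the source only if enough product specifications remain
    return cleaned if len(cleaned) >= MIN_RECORDS_PER_SOURCE else None
```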
Source Filtering. Some of the sources discovered by DEXTER can contain only copies of specifications from other sources. The last step of the data collection process aims at eliminating these sources. To this end, we eliminate sources that are either known aggregators or country-specific versions of a larger source.

The DI2KG Challenges and the SIGMOD Programming Contest

Preliminary versions of the Alaska benchmark have recently been used for the 2020 SIGMOD Programming Contest and for two editions of the DI2KG challenge (http://di2kg.inf.uniroma3.it). In these experiences, we asked the participants to solve one or multiple tasks on one or more verticals. We provided participants with a subset of our manually curated ground truth for training purposes. The rest of the ground truth was kept secret and used for computing the F-measure of submitted solutions. More details follow.

DI2KG 2019.
The DI2KG 2019 challenge was co-located with the 1st International Workshop on Challenges and Experiences from Data Integration to Knowledge Graphs (1st DI2KG workshop), held in conjunction with KDD 2019. Tasks included in this challenge were mediated SM and schema-agnostic ER. The only available vertical was camera. We released to participants 99 labelled records and 288 labelled attribute/mediated attribute pairs.

SIGMOD 2020. The SIGMOD 2020 Programming Contest was co-located with the 2020 International ACM Conference on Management of Data. The task included in the contest was schema-agnostic ER. The only available vertical was camera. We released to participants 297,651 labelled record pairs (44,039 matching and 253,612 non-matching). With respect to the previous challenge, we had the chance of (i) augmenting our ground truth by labelling approximately 800 new records, and (ii) designing a more intuitive data format for both the source data and the ground truth data.

DI2KG 2020. The DI2KG 2020 challenge was co-located with the 2nd DI2KG Workshop, held in conjunction with VLDB 2020. Tasks included in this challenge were mediated SM and schema-agnostic ER. Available verticals were monitor and notebook. For monitor we released to the participants 111,156 labelled record pairs (1,073 matching and 110,083 non-matching) and 135 labelled attribute/mediated attribute pairs. For notebook, we released to the participants 70,125 labelled record pairs (895 matching and 69,230 non-matching) and 189 labelled attribute/mediated attribute pairs. With respect to the previous challenges, we organized tracks accounting for different task solution techniques, including supervised machine learning, unsupervised methods and methods leveraging domain knowledge, such as product catalogs publicly available on the Web.
Conclusions and Future Work

Learning from our past experiences, we plan to extend the Alaska benchmark along the directions below.

• Improve the usability of the benchmark, to let it evolve into a full-fledged benchmarking tool;

• Include verticals with fundamentally different content, such as biological data;

• Include variants of the current verticals by removing or anonymizing targeted information in the records, such as model names of popular products;

• Add meta-data supporting variants of the traditional F-measure, such as taking into account the difference between synonyms and homonyms in the SM tasks, or between head entities (with many records) and tail entities in the ER tasks;

• Collect ground truth data for other tasks in the data integration pipeline, such as Data Extraction and Data Fusion.

Finally, especially for the latter direction, we plan to reconsider the ground truth construction process in order to reduce the manual effort by re-engineering and partly automating the curation process.
We presented Alaska, a flexible benchmark for evaluating several variants of Schema Matching and Entity Resolution systems. For this purpose, we collected real-world datasets from the Web, including sources with different characteristics that provide specifications about three categories of products. We manually curated a ground truth both for Schema Matching and Entity Resolution tasks. We presented a profiling of the dataset under several dimensions, to show the heterogeneity of data in our collection and to allow users to select a subset of sources that best reflects the target setting of the systems to evaluate. We finally illustrated possible usage scenarios of our benchmark, by running some representative state-of-the-art systems on different selections of sources from our dataset.
Acknowledgements
We thank Alessandro Micarelli and Fabio Gasparetti for providing the computational resources employed in this research. Special thanks to Vincenzo di Cicco who contributed to the implementation of the Carbonara extraction system.

References

[1] B. Alexe, W.-C. Tan, and Y. Velegrakis. STBenchmark: towards a benchmark for mapping systems. PVLDB, 1(1):230–244, 2008.

[2] A. Algergawy, D. Faria, A. Ferrara, I. Fundulaki, I. Harrow, S. Hertling, E. Jiménez-Ruiz, N. Karam, A. Khiat, P. Lambrix, et al. Results of the Ontology Alignment Evaluation Initiative 2019. In CEUR Workshop Proceedings, volume 2536, pages 46–85, 2019.

[3] P. C. Arocena, B. Glavic, R. Ciucanu, and R. J. Miller. The iBench integration metadata generator. PVLDB, 9(3):108–119, 2015.

[4] P. A. Bernstein, J. Madhavan, and E. Rahm. Generic schema matching, ten years later. PVLDB, 4(11):695–701, 2011.

[5] U. Brunner and K. Stockinger. Entity matching with transformer architectures - a step forward in data integration. In International Conference on Extending Database Technology, Copenhagen, 30 March-2 April 2020, 2020.

[6] R. Cappuzzo, P. Papotti, and S. Thirumuruganathan. Creating embeddings of heterogeneous relational datasets for data integration tasks. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pages 1335–1349, 2020.

[7] C. Chen, B. Golshan, A. Y. Halevy, W.-C. Tan, and A. Doan. BigGorilla: An open-source ecosystem for data preparation and integration. IEEE Data Eng. Bull., 41(2):10–22, 2018.

[8] P. Christen. Febrl - an open source data cleaning, deduplication and record linkage system with a graphical user interface. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1065–1068, 2008.

[9] S. Das, A. Doan, P. S. G. C., C. Gokhale, P. Konda, Y. Govind, and D. Paulsen. The Magellan data repository. https://sites.google.com/site/anhaidgroup/useful-stuff/data.

[10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.

[11] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[12] H.-H. Do, S. Melnik, and E. Rahm. Comparison of schema matching evaluations. In Net.ObjectDays: International Conference on Object-Oriented and Internet-Based Technologies, Concepts, and Applications for a Networked World, pages 221–237. Springer, 2002.

[13] A. Doan, A. Halevy, and Z. Ives. Principles of Data Integration. Elsevier, 2012.

[14] F. Duchateau and Z. Bellahsene. Designing a benchmark for the assessment of schema matching tools. Open Journal of Databases, 1(1):3–25, 2014.

[15] F. Duchateau, Z. Bellahsene, and E. Hunt. XBenchMatch: a benchmark for XML schema matching tools. In The VLDB Journal, volume 1, pages 1318–1321. Springer Verlag, 2007.

[16] M. Ebraheem, S. Thirumuruganathan, S. Joty, M. Ouzzani, and N. Tang. Distributed representations of tuples for entity resolution. PVLDB, 11(11):1454–1467, 2018.

[17] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 19(1):1–16, 2006.

[18] D. Firmani, B. Saha, and D. Srivastava. Online entity resolution using an oracle. PVLDB, 9(5):384–395, 2016.

[19] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning, volume 1. MIT Press, Cambridge, 2016.

[20] C. Guo, C. Hedeler, N. W. Paton, and A. A. Fernandes. MatchBench: benchmarking schema matching algorithms for schematic correspondences. In British National Conference on Databases, pages 92–106. Springer, 2013.

[21] E. Ioannou and Y. Velegrakis. EMBench: Generating entity-related benchmark data. ISWC, pages 113–116, 2014.

[22] E. Jiménez-Ruiz, O. Hassanzadeh, V. Efthymiou, J. Chen, and K. Srinivas. SemTab 2019: Resources to benchmark tabular data to knowledge graph matching systems. In European Semantic Web Conference, pages 514–530. Springer, 2020.

[23] W. Kim and J. Seo. Classifying schematic and data heterogeneity in multidatabase systems. Computer, 24(12):12–18, 1991.

[24] H. Köpcke, A. Thor, and E. Rahm. The Leipzig DB group datasets. https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolution.

[25] H. Köpcke, A. Thor, and E. Rahm. Evaluation of entity resolution approaches on real-world match problems. PVLDB, 3(1-2):484–493, 2010.

[26] Y. Li, J. Li, Y. Suhara, A. Doan, and W.-C. Tan. Deep entity matching with pre-trained language models. PVLDB, 14(1):50–60, 2020.

[27] A. McCallum. Cora citation matching dataset, 2004.

[28] S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep, E. Arcaute, and V. Raghavendra. Deep learning for entity matching: A design space exploration. In Proceedings of the 2018 International Conference on Management of Data, pages 19–34, 2018.

[29] R. O. Nambiar and M. Poess. The making of TPC-DS. In VLDB, volume 6, pages 1049–1058, 2006.

[30] H. Nguyen, A. Fuxman, S. Paparizos, J. Freire, and R. Agrawal. Synthesizing products for online catalogs. arXiv preprint arXiv:1105.4251, 2011.

[31] M. Poess and C. Floyd. New TPC benchmarks for decision support and web commerce. ACM SIGMOD Record, 29(4):64–71, 2000.

[32] M. Poess, T. Rabl, H.-A. Jacobsen, and B. Caufield. TPC-DI: the first industry benchmark for data integration. PVLDB, 7(13):1367–1378, 2014.

[33] A. Primpeli, R. Peeters, and C. Bizer. The WDC training dataset and gold standard for large-scale product matching. In Companion of The 2019 World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May 13-17, 2019, pages 381–386. ACM, 2019.

[34] D. Qiu, L. Barbosa, X. L. Dong, Y. Shen, and D. Srivastava. DEXTER: large-scale discovery and extraction of product specifications on the web. PVLDB, 8(13):2194–2205, 2015.

[35] E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. The VLDB Journal, 10(4):334–350, 2001.

[36] D. Ritze, O. Lehmberg, and C. Bizer. Matching HTML tables to DBpedia. In Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics (WIMS), 2015.