Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks
Riccardo Cappuzzo [email protected]
Paolo Papotti [email protected]
Saravanan Thirumuruganathan [email protected] QCRI, HBKU
ABSTRACT
Deep learning based techniques have been recently used with promising results for data integration problems. Some methods directly use pre-trained embeddings that were trained on a large corpus such as Wikipedia. However, they may not always be an appropriate choice for enterprise datasets with custom vocabulary. Other methods adapt techniques from natural language processing to obtain embeddings for the enterprise's relational data. However, this approach blindly treats a tuple as a sentence, thus losing a large amount of contextual information present in the tuple.
We propose algorithms for obtaining local embeddings that are effective for data integration tasks on relational databases. We make four major contributions. First, we describe a compact graph-based representation that allows the specification of a rich set of relationships inherent in the relational world. Second, we propose how to derive sentences from such a graph that effectively "describe" the similarity across elements (tokens, attributes, rows) in the two datasets. The embeddings are learned based on such sentences. Third, we propose effective optimizations to improve the quality of the learned embeddings and the performance of integration tasks. Finally, we propose a diverse collection of criteria to evaluate relational embeddings and perform an extensive set of experiments validating them against multiple baseline methods. Our experiments show that our framework, EmbDI, produces meaningful results for data integration tasks such as schema matching and entity resolution both in supervised and unsupervised settings.
CCS CONCEPTS
• Theory of computation → Data integration

KEYWORDS
data integration; embeddings; deep learning; schema matching; entity resolution
ACM Reference Format:
Riccardo Cappuzzo, Paolo Papotti, and Saravanan Thirumuruganathan. 2020. Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD'20), June 14–19, 2020, Portland, OR, USA. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3318464.3389742
Data in an enterprise is often scattered across information silos. The problem of data integration concerns the combination of information from heterogeneous relational data sources [19]. It is a challenging first step before data analytics can be performed to extract value from data. Unfortunately, it is also an expensive task for humans [33]. An often cited statistic is that data scientists spend 80% of their time integrating and curating their data [17]. Due to its importance, the problem of data integration has been studied extensively by the database community. Traditional approaches require substantial effort from domain scientists to generate features and labeled data or domain specific rules [19]. There has been increasing interest in achieving accurate data integration with dramatically less human effort.
Embeddings have been successfully used for data integration tasks such as entity resolution [8, 14, 25, 30, 35, 38], schema matching [16, 26, 29], identification of related concepts [15], and data curation in general [24, 36]. Typically, these works fall into two dominant paradigms based on how they obtain word embeddings. The first is to reuse pre-trained word embeddings computed for a given task. The second is to build local word embeddings that are specific to the dataset. These methods treat each tuple as a sentence by reusing the same techniques for learning word embeddings employed in natural language processing.
However, both approaches fall short in some circumstances. Enterprise datasets tend to contain custom vocabulary. For example, consider the small datasets reported in the left-hand side of Figure 1. The pre-trained embeddings do not capture the semantics expressed by these datasets and do not contain embeddings for the word "Rick". Approaches that treat a tuple as a sentence miss a number of signals such as attribute boundaries, integrity constraints, and so on. Moreover, existing approaches do not consider the generation of embeddings from heterogeneous datasets, with different attributes and alternative value formats. These observations motivate the generation of local embeddings for the relational datasets at hand. We advocate for the design of local embeddings that leverage both the relational nature of the data and the downstream task of data integration.
Tuples are not sentences.
Simply adapting embedding techniques originally developed for textual data ignores the richer set of semantics inherent in relational data. Consider a cell value t[A_i] of an attribute A_i in tuple t, e.g., "Mike" in the first relation from the top. Conceptually, it has semantic connections with both other attributes of tuple t (such as "iPad 4th") and other values from the domain of attribute A_i (such as "Paul"). Existing embedding techniques cannot capture such semantic connections.

Embedding generation must span different datasets.
Embeddings must be trained using heterogeneous datasets, so that they can meaningfully leverage and surface similarity across data sources. A notion of similarity between different types of entities, such as tuples and attributes, must be developed. Tuple-tuple and attribute-attribute similarity are important features for entity resolution and schema matching.
There are multiple challenges to overcome. First, it is not clear how to encode the semantics of the relational datasets into the embedding learning process. Second, datasets may share a very limited amount of information, have radically different schemas, and contain a different number of tuples. Finally, datasets are often incomplete and noisy. The learning process is affected by low information quality, generating embeddings that do not correctly represent the semantics of the data.
We introduce EmbDI, a framework for building relational, local embeddings for data integration that introduces a number of innovations to overcome the challenges above. We identify crucial components and propose effective algorithms for instantiating each of them. EmbDI is designed to be modular so that anyone can customize it by plugging in other algorithms and benefit from the continuing improvements from the deep learning and the database communities. The right-hand side of Figure 1 shows the main steps in our solution.
1. Graph Construction.
We leverage a compact tripartite graph-based representation of relational datasets that can effectively represent a rich set of syntactic and semantic relationships between cell values. Specifically, we use a heterogeneous graph with three types of nodes.
Token nodes correspond to the unique values found in the dataset.
Record Id nodes (RIDs) represent a unique token for each tuple.
Column Id nodes (CIDs) represent a unique token for each column/attribute. These nodes are connected by edges based on the structural relationships in the schema. This graph is a compact representation of the original datasets that highlights overlap and explicitly represents the primitives for data integration tasks, i.e., records and attributes.
2. Embedding Construction.
We formulate the problem of obtaining local embeddings for relational data as a graph embeddings generation problem. We use random walks to quantify the similarity between neighboring nodes and to exploit metadata such as tuple and attribute IDs. This method ensures that nodes that share similar neighborhoods will be in close proximity in the final embeddings space. The corpus that is used to train our local embeddings is generated by materializing these random walks.
3. Optimizations.
Learning embeddings can be a difficult task in the presence of noisy and incomplete heterogeneous datasets. For this reason, we introduce an array of optimization techniques that handle difficult cases and enable refinement of the generated embeddings. The flexibility of the graph enables us to naturally represent external information, such as data dictionaries, to merge values in different formats, and data dependencies, to impute values and identify errors. We propose optimizations to handle imbalance in the datasets' size and the presence of numerical values (usually ignored in textual word embeddings).
Experimental Results.
We propose an extensive set of desiderata for evaluating relational embeddings for data integration. Specifically, our evaluation focuses on three major dimensions that measure how well the embeddings (a) learn the tuple-, attribute- and constraint-based relationships in the data, (b) learn integration specific information such as tuple-tuple and attribute-attribute similarities, and (c) improve the behavior of DL-based data integration algorithms. As we shall show in the experiments, our proposed algorithms perform well on each of these dimensions.
Outline.
Section 2 introduces background about embeddings and data integration. Section 3 shows a motivating example that highlights the limitations of prior approaches and identifies a set of desiderata for relational embeddings. Section 4 details the major components of the framework. Section 5 presents our optimizations to handle data imbalance, missing values, and external information. Section 6 describes how we use embeddings for data integration tasks. Section 7 reports extensive experiments validating our approach. We conclude in Section 8 with some promising next steps.

Figure 1: Illustration of a simplified vector space learned from text (prior approaches) and from data (EmbDI).

Embeddings.
Embeddings map an entity such as a word to a high dimensional real valued vector. The mapping is performed in such a way that the geometric relation between the vectors of two entities represents the co-occurrence/semantic relationship between them. Algorithms used to learn embeddings rely on the notion of "neighborhood": intuitively, if two entities are similar, they frequently belong to the same contextually defined neighborhood. When this occurs, the embeddings generation algorithm will try to force the vectors that represent these two entities to be close to each other in the resulting vector space.
Word Embeddings [3, 37] are trained on a large corpus of text and produce as output a vector space where each word in the corpus is represented by a real valued vector. Usually, the generated vector space has either 100 or 300 dimensions. The vectors for words that occur in similar contexts – such as SIGMOD and VLDB – are in close proximity to each other. Popular architectures for learning embeddings include continuous bag-of-words (CBOW) or skip-gram (SG). Recent approaches rely on using the context of a word to obtain a contextual word embedding [13, 32].
Node Embeddings.
Intuitively, node embeddings [20] map nodes to a high dimensional vector space so that the likelihood of preserving node neighborhoods is maximized. One way to achieve this is by performing random walks starting from each node to define an appropriate neighborhood. Popular node embeddings are often based on the skip-gram model, since it maximizes the probability of observing a node's neighborhood given its embedding. By varying the type of random walks used, one can obtain diverse types of embeddings [9].
Embeddings for Relational Datasets.
The pioneering work of [6] was the first to apply embedding techniques for extracting latent information from relational data. Recent extensions [5, 7] leverage the learned embeddings to develop a "cognitive" database system with sophisticated functionality for answering complex semantic, reasoning and predictive queries. Termite [15] seeks to project tokens from structured and unstructured data into a common representational space that could then be used for identifying related concepts through its Termite-Join approach. Freddy [21] and RetroLive [22] produce relational embeddings that combine relational and semantic information through a retrofitting strategy. There has been prior work that learns embeddings for specific tasks like entity matching (such as DeepER [14] and DeepMatcher [30]) and schema matching (Rema [26]). Our goal is to learn relational embeddings that are tailored for data integration and can be used for multiple tasks.
In this section, we discuss an illustrative example that highlights the weaknesses of current approaches and motivates us to design a new approach for relational, local embeddings. Consider the scenario where one utilizes popular pre-trained embeddings such as word2vec, GloVe, or fastText. Figure 1 shows a hypothetical filtered vector space for the tokens in an example with two small customer datasets. We observe that the pre-trained embeddings suffer from a number of issues when we use them to model the two relations. (1) A number of words, such as "Rick", in the dataset are not in the pre-trained embedding. This is especially problematic for enterprise datasets where tokens are often unique and not found in pre-trained embeddings. (2) Embeddings might contain geometric relationships that exist in the corpus they were trained on, but that are missing in the relational data. For example, the embedding for token "Steve" is closer to tokens "iPad" and "Apple" even though this is not implied in the data. (3) Relationships that do occur in the data, such as between tokens "Paul" and "Mike", are not observed in the pre-trained vector space.
Naturally, learning local embeddings from the relational data often produces better results. However, computing embeddings for non-integrated data sources is a non-trivial task. This becomes especially challenging in settings where data is scattered over different datasets with heterogeneous structures, different formats, and only partially overlapping content. Prior approaches express such datasets as sentences that can be consumed by existing word embedding methods. However, we find that these solutions are still sub-optimal for downstream data integration tasks.
Technical Challenges.
We enumerate four challenges that must be overcome to obtain effective embeddings.
1. Incorporating Relational Semantics.
Relational data exhibits a rich set of semantics. Relational data also follows set semantics where there is no natural ordering of attributes. Representing the tuple as a single sentence is simplistic and often not expressive enough for these signals.
2. Handling Lack of Redundancy.
A key reason for the success of word embeddings is that they are trained on large corpora where there are adequate redundancies and co-occurrence to learn relationships. However, databases are often normalized to remove redundant information. This has an especially deleterious impact on the quality of learned embeddings. Rare words, which are very common in relational data, are typically ignored by word embedding methods.
3. Handling Multiple Datasets.
We cannot assume that each of the datasets has the same set of attributes, or that there are sufficient overlapping values in the tuples, or even that there is a common dictionary for the same attribute.
4. Handling Hierarchical Data.
Databases are inherently hierarchical, with entities such as cell values, tuples, attributes, datasets, and so on. Incorporating these hierarchical units as first class citizens in embedding training is a major challenge.
In this section, we provide a description of our approach and how these design choices address the aforementioned technical challenges. Our framework, EmbDI, consists of three major components, as depicted in the right-hand side of Figure 1. (1) In the Graph Construction stage, we process the relational dataset and transform it to a compact tripartite graph that encodes various relationships inherent in it. Tuple and attribute ids are treated as first class citizens. (2) Given this graph, the next step is Sentence Construction through the use of biased random walks. These walks are carefully constructed to avoid common issues such as rare words and imbalance in vocabulary sizes. This produces as output a series of sentences. (3) In Embedding Construction, the corpus of sentences is passed to an algorithm for learning word embeddings. Depending on available external information, we perform optimizations to the graph and the workflow to improve the embeddings' quality.
Why construct a Graph?
Prior approaches for local embeddings seek to directly apply an existing word embedding algorithm on the relational dataset. Intuitively, all tuples in a relation are modeled as sentences by breaking the attribute boundaries. The collection of sentences for each tuple in the relation then makes up the corpus, which is then used to train the embedding. This approach produces embeddings that are customized to that dataset, but it also ignores signals that are inherent in relational data. We represent the relational data as a graph, thus enabling a more expressive representation with a number of advantages. First, it elegantly handles many of the various relationships between entities that are common in relational datasets. Second, it provides a straightforward way to incorporate external information such as "two tokens are synonyms of each other". Finally, when multiple relations are involved, a graph representation enables a unified view over the different datasets that is invaluable for learning embeddings for data integration.
Simple Approaches.
Consider a relation R with attributes {A_1, A_2, . . . , A_m}. Let t be an arbitrary tuple and t[A_i] the value of attribute A_i for tuple t. A naive approach is to create a chain graph where tokens corresponding to adjacent attributes such as t[A_i] and t[A_{i+1}] are connected. This will result in m − 1 edges for each tuple. Of course, if two different tuples share the same token, then they will reuse the same node. However, relational algebra is based on set semantics, where the attributes do not have an inherent order. So, simplistically connecting adjacent attributes is doomed to fail. Another extreme is to create a complete subgraph, where an edge exists between all possible pairs t[A_i] and t[A_j]. Clearly, this will result in m(m − 1)/2 edges per tuple. This approach results in a number of edges that is quadratic in the number of attributes and ignores other token relationships such as "token t_1 and token t_2 belong to the same attribute".

Relational Data as Heterogeneous Graph. We propose a heterogeneous graph with three types of nodes.
Token nodes correspond to information found in the dataset (i.e., the content of each cell in the relation). Multi-word tokens may be represented as a single entity, get split over multiple nodes, or use a mix of the two strategies. We describe the effect of each strategy in more depth in Section 7.
Record Id nodes (RIDs) represent each tuple in the dataset, and
Column Id nodes (CIDs) represent each column/attribute. These nodes are connected by edges according to the structural relationships in the schema. This representation can produce a vector for all RIDs (CIDs) rather than representing them by combining the vectors of the values in each tuple (column).

Figure 2: The graph for the two tables in Figure 1.
Consider a tuple t with RID r_t. Then, nodes for tokens corresponding to t[A_1], . . . , t[A_m] are connected to the node r_t. Similarly, all the tokens belonging to a specific attribute A_i are connected to the corresponding CID, say c_i. This construction is generic enough to be augmented with other types of relationships. Also, if we know that two tokens are synonyms (e.g., via wordnet), this information could be incorporated by reusing the same node for both tokens. Note that a token could belong to different record ids and column ids when two different tuples/attributes share the same token. Numerical values are rounded to a number of significant figures decided by the user, then they are assigned a node like regular categorical values; null values are not represented in the graph. We discuss more sophisticated approaches for handling numeric, noisy, and null values in Section 5.
Algorithm 1 shows the operations performed during the graph creation with hybrid representation of multi-word tokens. Figure 2 shows a graph constructed for the datasets in Figure 1. Note that this could be considered as a variant of a tripartite graph. A key advantage of this choice is that it has the same expressive power as the complete sub-graph approach, while requiring orders of magnitude fewer edges.
Algorithm 1 GenerateTripartiteGraph
Input: relational dataset D
  let G = empty graph
  for all c_i in columns(D) do
    G.addNode(c_i)
  for all r_i in rows(D) do
    G.addNode(R_i)  // R_i is the record id of r_i
    for all values v_k in r_i do
      if v_k is multi-word then
        for all word in tokenize(v_k) do
          G.addNode(word)
          G.addEdge(word, R_i), G.addEdge(word, c_k)
      else if v_k is single-word then
        G.addNode(v_k)
        G.addEdge(v_k, R_i), G.addEdge(v_k, c_k)
Output: graph G
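A minimal sketch of this construction in Python with networkx and pandas; the node-name prefixes and the whitespace tokenization are illustrative choices, not the exact EmbDI implementation:

```python
import networkx as nx
import pandas as pd

def build_tripartite_graph(df: pd.DataFrame) -> nx.Graph:
    """Heterogeneous graph with token, RID, and CID nodes (flattened multi-word cells)."""
    g = nx.Graph()
    for col in df.columns:
        g.add_node(f"cid__{col}", kind="cid")          # one node per attribute
    for idx, row in df.iterrows():
        rid = f"rid__{idx}"
        g.add_node(rid, kind="rid")                    # one node per tuple
        for col, value in row.items():
            if pd.isna(value):
                continue                               # null values are not represented
            for token in str(value).split():           # split multi-word cells
                g.add_node(token, kind="token")
                g.add_edge(token, rid)
                g.add_edge(token, f"cid__{col}")
    return g
```

Shared token nodes across rows (and across datasets, if their tables are processed into the same graph) arise automatically, because adding an already existing node is a no-op in networkx.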
Graph Traversal by Random Walks.
To generate the distributed representation of each node in the graph, we produce a large number of random walks and gather them in a training corpus where each random walk will correspond to a sentence. Using graphs and random walks allows us to have a richer and more diverse set of neighborhoods than what would be possible by encoding a tuple as a single sentence. For example, a walk starting from node 'Paul' could go to an attribute node such as A_1, and then to node 'Rick'. This walk implicitly defines the neighborhood based on attribute co-occurrence. Similarly, the walk from 'Paul' could have gone to 'r_1' and then to 'Apple', incorporating the row level relationships. Our approach is agnostic to the specific type of random walk used, with different choices yielding different embeddings. For example, one could design random walks that are biased towards other nodes belonging to the same tuple, or towards rare nodes. To better represent all nodes, we assign a "budget" of random walks to each of them and guarantee that all nodes will be the starting point of at least as many random walks as their budget. After choosing the starting point T_i, the random walk is generated by choosing a neighboring RID of T_i, say R_j. The next step in the random walk will then be chosen at random among all neighbors of node R_j, for example by moving on to C_a. Then, a new neighbor of C_a will be chosen and the process will continue until the random walk has reached the target length. We use uniform random walks in most of our experiments to guarantee good execution times on large datasets, while providing high quality results. We compare alternative random walks in the experiments.
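A minimal sketch of uniform random-walk generation over such a graph; unlike Algorithm 2 below, this simplified version does not first jump to a neighboring RID, and the per-node budget is a plain constant:

```python
import random
import networkx as nx

def random_walk(g: nx.Graph, start: str, length: int) -> list:
    """Uniform random walk of fixed length starting from `start`."""
    walk = [start]
    current = start
    while len(walk) < length:
        neighbors = list(g.neighbors(current))
        if not neighbors:            # isolated node: stop early
            break
        current = random.choice(neighbors)
        walk.append(current)
    return walk

def build_corpus(g: nx.Graph, walks_per_node: int = 10, length: int = 60) -> list:
    """Each walk is one 'sentence' of the training corpus."""
    return [random_walk(g, n, length)
            for n in g.nodes()
            for _ in range(walks_per_node)]
```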
From Walks to Sentences.
It is important to note that the path on the graph represented by the random walk does not necessarily reflect the sentence that will be inserted in the training corpus.

Algorithm 2 GenerateRandomWalk
Input: starting node n_j, random walk length l
  r_j = findNeighboringRID(n_j)
  W = seq(r_j, n_j)
  currentNode = n_j
  while length(W) < l do
    nextNode = findRandomNeighbor(currentNode)
    W.add(nextNode)
    currentNode = nextNode
Output: walk W

For example, a possible random walk could be the following: R_a T_b R_c T_d C_e T_f C_g T_h, where T_*, R_*, C_* correspond to nodes of type tokens, record ids, and column ids, respectively. We note that the random walks include nodes corresponding to RIDs and CIDs. We noticed that the presence (or absence) of CIDs and RIDs in the sentences that build the training corpus has large effects on the data integration performance of the algorithm. Indeed, we observe that by treating these as first order citizens, we can represent them as points in the vector space in the same way as any other token. For example, two nodes corresponding to different attributes might co-occur in many random walks, resulting in embeddings that are closer to each other: this may imply that these two attributes represent similar information. A similar phenomenon could also be obtained for tuple embeddings. A number of prior approaches such as DeepER [14] or DeepMatcher [30] only learn embeddings for tokens and then obtain embeddings for tuples by averaging them or combining them by using a RNN. The use of our random walks as sentences provides additional information about the neighborhood of each node, which would not be so easily obtained by using only the structured data format.
The generated sentences are then pooled together to build a corpus that is used to train the embeddings algorithm. Our approach is agnostic to the actual word embedding algorithm used. We piggyback on the plethora of effective embeddings algorithms such as word2vec, GloVe, fastText, and so on. Every year, improved embedding training algorithms are released, and this has a transitive effect on our approach. Broadly, these techniques can be categorized as word-based (such as word2vec) or character-based (such as fastText). We discuss the hyperparameters for embedding algorithms such as learning method (either CBOW or Skip-Gram), dimensionality of the embeddings, and size of context window in Section 7.
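For illustration, such a corpus can be handed to an off-the-shelf implementation; a minimal sketch with gensim's word2vec, where the parameter values mirror the defaults discussed in Section 7 but are otherwise assumptions:

```python
from gensim.models import Word2Vec

# Toy corpus: in EmbDI this is the list of random walks produced above.
corpus = [
    ["rid__0", "Paul", "cid__name", "Mike", "rid__1"],
    ["rid__1", "Mike", "cid__device", "iPad_4th", "rid__0"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=300,  # dimensionality of the embedding space
    window=3,         # size of the context window
    sg=1,             # 1 = Skip-Gram, 0 = CBOW
    min_count=1,      # keep rare tokens, which are common in relational data
    workers=4,
)
model.wv.save_word2vec_format("embdi_embeddings.emb")
```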
Algorithm 3 provides the pseudocode for learning the local and relational embeddings based on our discussion. In the next section, we discuss a number of practical improvements to this basic algorithm.
Algorithm 3 Meta Algorithm for EmbDI
Input: relational datasets D, number of random walks n_walks, number of nodes n_nodes
  W = []
  G = GenerateTripartiteGraph(D)
  for all n_j ∈ nodes(G) do
    for i = 1 to (n_walks / n_nodes) do
      w_i = GenerateRandomWalk(n_j)
      W.add(w_i)
  E = GenerateEmbeddings(W)
Output: local relational embeddings E

In this section, we discuss a number of challenging issues that occur when applying EmbDI in practice.
In a real-world scenario, there often are multiple relations and local embeddings must be learned for each of them. For a single relation, one can simply perform multiple random walks from each token node. This approach directly ameliorates the issue of infrequent words that plagues word embedding approaches, by guaranteeing that even rare words will appear frequently enough to be properly represented. A further complication arises when one relation contains many more nodes than the other. If we perform an equal amount of random walks starting from each node, the signals from the larger dataset might overwhelm those coming from the smaller dataset. We found that an effective heuristic is to start random walks only from nodes that co-occur in both datasets. This approach often produces sentences where the proportion of larger and smaller datasets is comparable. Furthermore, these nodes also happen to be the most informative ones, as they connect the two relations and are often quite useful for integrating them. Even with datasets with a minimum amount of overlap (less than 2%), this approach ensures adequate coverage of all nodes and minimizes the issues due to relation imbalance.
The overlapping tokens are the bridge between the two datasets to be integrated. To maximize their impact in the embedding creation, one could start every sentence with a RID or CID, randomly picked from those connected to the token at hand. This small change in the random walk creation affects the results by creating evidence of similarity for the corresponding rows and columns.
Example.
Assume that a token node T_a appears in two rows R_a and R_b over two large datasets. Since the token is rare, it will most likely appear only once as the first node in the walk, therefore the embedding algorithm will only see it in few patterns, such as T_a R_b T_c or T_a C_d T_e. To improve the modeling of T_a, we start the sentence with a RID or CID connected to T_a, such as C_d T_a C_c and R_a T_a R_b. This way, even if the token is rare, it gives strong signals that the attributes and the row that contain it are related.

Many real-world datasets contain a large amount of missing data, so any effective approach for learning embeddings must have a cogent strategy for this scenario. The ideal approach employs imputation techniques to minimize the number of missing values. Unfortunately, this might not always be possible, since algorithms for imputation and data repair often do not provide good results in a relational setting. Prior approaches for learning relational embeddings skip missing values when computing embeddings. However, this approach is often counter-productive as missing data can be an indication of systemic error. Approaches where all missing values are treated as if they were the same entity (so one node for all nulls), or as unique entities (individual nodes for each null), are not appropriate. The first approach creates a super node to store all NULL values, which has multiple negative effects on the result and produces no benefit. The second approach creates a unique node for each NULL: this does not cause any issues, but does not provide any additional information either. Moreover, if the number of NULLs is large, this approach increases the processing time without any commensurate benefit.
We propose a simple mechanism to use classical database techniques such as Skolemization [23] to handle missing data. Approaches for data repairs [10] are very accurate in identifying the errors, but struggle to identify the correct updated value [1, 2]. When there is no certain update to make, most methods put a placeholder, like a variable or the output of a function that is related to Skolemization. Our model is able to naturally consume and model these placeholders to obtain better embeddings. Hence, the data repairing task could be used to address both missing and noisy values.
Consider the scenario with two relations R_1 and R_2. Without loss of generality, let us assume that they both have attributes A_1, A_2, A_3, A_4. Suppose there are two tuples: R_1(a_1, N_1, c, N_2) and R_2(a_1, b, c', N_3). Here N_1, N_2, N_3 denote the null values. If A_1 is the key attribute, we can derive three important updates in the data, including the creation of two placeholders, and rewrite the two tuples as follows: R_1(a_1, b, X_1, X_2) and R_2(a_1, b, X_1, X_2), where X_1 models the conflict between c and c' and X_2 merges the two nulls. This reduces the heterogeneity of the data and improves the quality of the embeddings. Consider also that all occurrences of c and c' are merged in the graph, even in tuples that do not satisfy the pattern of this functional dependency. A single placeholder may end up merging a large number of token occurrences in the original dataset.
Our graph representation allows one to incorporate external information such as wordnet or other domain specific dictionaries in a seamless manner. This is an optional step to improve the quality of embeddings. For example, consider two attributes from different relations – one stores country codes while the other contains complete country names. If some mapping between these two exists, then we can merge the nodes corresponding to, say, Netherlands and NL. The same reasoning applies to tuples (attributes): if trustable information about possible token matches is available, we merge different RIDs (CIDs) in the same node. Merging of nodes could be achieved by using external functions, such as matchers based on syntactic similarity, pre-trained embeddings, or clustering. This often increases the number of overlapping tokens across datasets and produces better embeddings for data integration.
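A minimal sketch of this merging step, applied to the data before graph construction; the dictionary and column name are hypothetical:

```python
import pandas as pd

# Hypothetical external mapping: country codes to full country names.
COUNTRY_MAP = {"NL": "Netherlands", "DK": "Denmark"}

def merge_synonym_tokens(df: pd.DataFrame, column: str, mapping: dict) -> pd.DataFrame:
    """Rewrite synonymous values so that they end up sharing a single token node."""
    out = df.copy()
    out[column] = out[column].replace(mapping)
    return out
```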
Node Replacement in Random Walks.
Merging of nodes is only viable if we are confident that the two tokens refer to the same underlying entity. In practice, the mapping between two entities is imperfect. For example, one could have a machine learning algorithm that says that tokens T_i and T_j are similar with confidence of 0.8. The extreme approaches of merging the two nodes (such as by applying a fixed threshold) or ignoring this strong information are both sub-optimal. We propose the use of a replacement strategy where, during the construction of the sentence corpus, token T_i is replaced by T_j (and vice versa) with a probability proportionate to their closeness. Note that this only affects the sentence construction. The random walk by itself is not affected. Specifically, if the random walk is at node T_i, it might output T_j in the sentence instead of T_i. However, when choosing the next node, it will only pick the neighbors of node T_i.
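A minimal sketch of this replacement during sentence construction; `similar` is an assumed dictionary produced by an external matcher, mapping a token to a (counterpart, confidence) pair:

```python
import random

def emit_token(node: str, similar: dict) -> str:
    """Return the token to write in the sentence for `node`.
    With probability equal to the matcher's confidence, the counterpart is
    emitted instead; the walk itself still continues from `node`."""
    if node in similar:
        counterpart, confidence = similar[node]
        if random.random() < confidence:
            return counterpart
    return node

# Example: "NL" and "Netherlands" judged similar with confidence 0.8.
similar = {"NL": ("Netherlands", 0.8), "Netherlands": ("NL", 0.8)}
```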
Handling Numeric Data.
Integer and real-valued attributes are very common in relational data. A straightforward approach is to treat them as strings, so that each distinct value is assigned to a node in the graph. However, this simplistic approach does not always work well, as it ignores geometric relationships between numbers such as the Euclidean distance. One way to use this distance information is to replace two numbers if they are within a threshold distance. Unfortunately, identifying an effective threshold is quite challenging in general. Consider two sets of tokens, one with consecutive integers such as {1, 2, 3, . . .} and one with fine-grained decimals such as {1, 1.1, 1.2, . . . , 2}. In the former, we can plausibly replace 1 with 2, while it would not be appropriate in the latter scenario. We apply an effective heuristic that combines node replacement with a data distribution-aware distance between two numbers. Typically, most numeric attributes can be approximated by a small number of distributions, such as Gaussian or Zipfian. As an example, if a particular attribute is Gaussian, we can efficiently estimate its parameters – mean and variance. Then, given a number i, we generate a random number r around i in accordance with the learned parameters. If the new random number is part of the domain of the attribute, then we replace i with r.
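A minimal sketch of the distribution-aware replacement for a (roughly Gaussian) numeric attribute; the scale of the perturbation and the domain check are illustrative assumptions:

```python
import random
import statistics

def numeric_replacement(value: float, column_values: list) -> float:
    """Perturb `value` using the attribute's estimated spread; keep the
    replacement only if it stays within the observed domain."""
    sigma = statistics.stdev(column_values)
    candidate = random.gauss(value, 0.1 * sigma)   # small jitter around the value
    lo, hi = min(column_values), max(column_values)
    return candidate if lo <= candidate <= hi else value
```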
Algorithm 4 AlignEmbeddings
Input: relations R_1, R_2
  E = EmbDI(concat(R_1, R_2))
  let U_i be the set of unique words in R_i, ∀ i ∈ {1, 2}
  let A = U_1 ∩ U_2
  A = E(w_i) ∀ w_i ∈ R_1
  B = E(w_j) ∀ w_j ∈ R_2
  W* = argmin_W ‖W A − B‖
  A' = W* A
  for all w_i ∈ R_1 ∪ R_2 do
    if w_i ∈ R_1 ∩ R_2 then
      E'(w_i) = average(A'(w_i), B(w_i))
    else if w_i ∈ R_1 then
      E'(w_i) = A'(w_i)
    else
      E'(w_i) = B(w_i)
Output: aligned embeddings E'

Typically, embeddings for multiple relations are trained using two extreme approaches – either by training embeddings one relation at a time or by pooling all the relations and training a common space. The individual approach is more scalable, but misses out on patterns that could be inferred by pooling the data. The pooled approach must ensure that signals from larger relations do not overpower those from smaller ones. We advocate for a novel embedding alignment approach, adapted from multilingual translation [11].
We begin by training embeddings for each relation individually. This may cause RID and CID vectors that represent different instances of the same entity to differ from each other when the datasets share a small number of common tokens. To mitigate this problem, we align the embeddings of the values contained by the two datasets that were trained in the initial execution by pivoting on the new information, basically changing the vector space that represents one dataset to better match the vector space of the other. This allows us to better materialize relationships between tokens, even if they do not co-occur in a single relation. Furthermore, this approach ensures that the geometric relationships between tokens within each individual dataset are retained.
Assume that we have two relations R_1 and R_2 with adequate overlap, and that A and B represent the embeddings of words in R_1 and R_2, respectively. It is possible to formulate an orthogonal Procrustes problem [11] by seeking a translation matrix W* = argmin_W ‖W A − B‖, with A = U_1 ∩ U_2 being the intersection of unique values (the anchors) in common between the two starting relations. Applying the translation matrix W* to A yields a translated matrix A', which minimizes the distance between anchor points. To employ this technique in the ER and SM tasks, we use matching CIDs and RIDs in the original embeddings as anchors to perform the rotation. We then match again on the rotated embeddings. Algorithm 4 describes the embedding alignment.

Multi-word tokens are common in relational datasets (such as "Adobe Photoshop CS3"). There are a number of ways in which multi-word cells could be tokenized. One simple option is to treat the entire word sequence as a single token. The other option is to tokenize the word sequence, compute the word embeddings for each of the tokens, and then aggregate these token embeddings to get the embedding for the multi-word cell. There are two key problems: how to tokenize a multi-word cell and how to aggregate the token embeddings to get the cell embeddings. There are no simple answers to this problem. In some cases, these multi-word tokens contain substrings that would yield additional information if they were represented as stand-alone nodes (in the example above, "Adobe" and "Photoshop" are likely candidates).
Unfortunately, in the general case it is very hard to pinpoint cases where performing the expansion would improve the results; consider a counterexample such as "Saving Private Ryan": in this case, we would rather have a single node to represent the movie title, as it likely is a "primary key" in the dataset and as such would help when performing integration tasks.
To mitigate both issues, we found a simple yet effective heuristic that allows us to handle both multi-word tokens and rare tokens at the same time. Instead of representing all unique values in both datasets as nodes, we make a distinction between values that are present in both datasets, which are kept as they appear, and values that appear only in one dataset, which are tokenized and expanded into their words. This effectively allows us to extract the information present within multi-word tokens and, possibly, introduce connections that would be missed otherwise. Moreover, representing the common values as unique tokens introduces "bridges" between the datasets, which can be exploited during the step of random walk generation to introduce semantic connections that would not be identified otherwise.
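A minimal sketch of this heuristic; `shared_values` is assumed to be the set of cell values that appear in both datasets:

```python
def tokenize_cell(value: str, shared_values: set) -> list:
    """Keep values shared across datasets as single 'bridge' tokens;
    expand dataset-specific multi-word values into their words."""
    if value in shared_values:
        return [value.replace(" ", "_")]   # one node for the whole cell value
    return value.split()                    # one node per word
```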
Once the embeddings are trained, they can be used for common data integration tasks. We now describe unsupervised algorithms that employ the embeddings produced by EmbDI to perform two tasks widely studied in data integration, Schema Matching and Entity Resolution.
Schema Matching (SM).
Traditional approaches rely on grouping attributes based on the value distributions or use other similarity measures. Recently, [16] used embeddings to identify relationships between attributes using both syntactic and semantic similarities. However, they use embeddings only on attribute/relation names and do not consider the instances – i.e., the values taken by the attribute.
Algorithm 5 describes the steps taken to perform schema matching between two attributes by exploiting their cosine distance in the vector space. Consider that, to prevent false positives in the column alignment, we terminate the algorithm after two iterations have been completed, even if some candidate pools may still contain values.
Algorithm 5 Schema Matching
  let C_1 be the set of CIDs of dataset D_1 and C_2 be the set of CIDs of dataset D_2
  let d(c_i) be the list of distances between column c_i ∈ C_1 and all other columns c_k ∈ C_2, sorted in ascending order of distance (and vice versa)
  let T = C_1 ∪ C_2 be the set of columns to be matched
  while T ≠ ∅ do
    for all c_k ∈ T do
      if d(c_k) ≠ ∅ then
        c'_k = findClosest(d(c_k))
        c''_k = findClosest(d(c'_k))
        if c''_k == c_k then
          c_k and c'_k are matched
          remove c_k, c'_k from T
        else
          removeCandidate(d(c_k), c'_k)
          removeCandidate(d(c'_k), c_k)
      else
        remove c_k from T
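A minimal sketch of the mutual nearest-neighbor test at the core of Algorithm 5, using cosine distance over the learned CID vectors; it is a single-pass simplification that omits the iterative candidate-removal loop:

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_columns(cids1: dict, cids2: dict) -> list:
    """cids1, cids2 map CID names to their vectors; a pair is matched
    when each column is the closest column of the other."""
    def closest(vec, others):
        return min(others, key=lambda name: cosine_distance(vec, others[name]))
    matches = []
    for c1, v1 in cids1.items():
        c2 = closest(v1, cids2)
        if closest(cids2[c2], cids1) == c1:
            matches.append((c1, c2))
    return matches
```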
Entity Resolution (ER).
Recent works used pre-existing embeddings to represent tuples [14, 30]. In contrast, our approach relies on the use of RIDs as nodes in the heterogeneous graph. This allows EmbDI to learn better embeddings for the entire record from the data itself, rather than relying on combination methods such as averaging or concatenating the embeddings of the terms in the tuple. This information is then used to perform unsupervised ER by computing the distance between RIDs. We will also discuss in the experiments how one can piggyback on prior supervised approaches by passing the trained embeddings as features to [14, 30]. Algorithm 6 describes the steps taken to identify the matches in the Entity Resolution task. We assume that no matches for a RID r_i of dataset D_i are present in D_i itself.
Algorithm 6 Entity Resolution
  let R_1 be the set of RIDs ∈ D_1
  let R_2 be the set of RIDs ∈ D_2
  let d(r_i) be the list of distances between RID r_i ∈ R_i and the closest n_top RIDs ∈ D_j, with i ≠ j
  for all r_i ∈ D_1 ∪ D_2 do
    d(r_i) = findClosest(r_i, n_top)
  for all r_k ∈ D_1 do
    r'_k = findClosest(d(r_k))
    r''_k = findClosest(d(r'_k))
    if r''_k == r_k then
      r_k and r'_k are matched

Verifying the symmetry of the relationship has the advantage of increasing the precision by reducing the False Positive Rate, without penalizing the recall. The effect of n_top is described in Table 5. In both algorithms, many elements (either RIDs or CIDs) will have no matches in the other dataset. If appropriate embeddings were learned for the RIDs, then this approach will produce good matches, which is indeed what we observe in our experiments.
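A minimal sketch of the symmetric check in Algorithm 6, retrieving the n_top closest RIDs with gensim's KeyedVectors; the RID naming convention is illustrative:

```python
from gensim.models import KeyedVectors

def er_matches(wv: KeyedVectors, rids1: list, rids2: list, n_top: int = 10) -> list:
    """Match r1 in D1 with r2 in D2 when each is the other's closest RID
    among its n_top nearest neighbors."""
    set1, set2 = set(rids1), set(rids2)

    def closest(rid, pool):
        for neighbor, _score in wv.most_similar(rid, topn=n_top):
            if neighbor in pool:
                return neighbor
        return None

    matches = []
    for r1 in rids1:
        r2 = closest(r1, set2)
        if r2 is not None and closest(r2, set1) == r1:
            matches.append((r1, r2))
    return matches
```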
Token Matching (TM).
We also consider the problem of matching tokens that are conceptual synonyms of each other, a task that is also known as string matching [34, 39]. For example, one relation could encode a language as "English" while the other could encode it as "EN". Note that this is different from schema matching, where the objective is to identify attributes that represent the same information. Instead, we are interested in finding pairs of tokens from different relations that are related conceptually. Given two aligned attributes A_i and A_j, we seek to identify if two tokens t_k ∈ Dom(A_i) and t_l ∈ Dom(A_j) are related. Given the token t_k, we identify the set of top-n token ids that are closest to t_k. We announce that the first token t_l ∈ Dom(A_j) that occurs in the ranked list is the conceptual synonym of t_k.

Table 1: Dataset properties.
In this section we first demonstrate that our proposed embeddings learn the major relationships inherent in structured data (Section 7.1). We then show the positive impact of our embeddings for multiple data integration tasks in supervised and unsupervised settings (Section 7.2). Finally, we analyze the contributions of our design choices (Section 7.3).
Datasets.
Pre-trained Embeddings.
In the following, pre-trained word embeddings have been obtained from fastText [4]. We also tested GloVe [31] and obtained results of comparable quality. We relied on state of the art methods to combine words in tuples and to obtain embeddings for words that are not in the pre-trained vocabulary [8, 14].
Embedding Generation Algorithms.
We test four algorithms for the generation of local embeddings from relational datasets. All local methods make use of our tripartite graph and exploit record and column IDs in the integration tasks. The first method is Basic, which creates embeddings from permutations of row tokens and sentences with samples of attribute tokens. As the method is aware of the structure of the database, it can learn representations for tuples and attributes. We fixed the size of the sentence corpus for Basic to contain the same number of tokens as EmbDI's corpus.
The second method is Node2Vec [20], a widely used algorithm for learning node representations on graphs. Given our graph as input, it learns vectors for all nodes. We used the implementation from the paper with default parameters. The third method is Harp [9], a state of the art algorithm that learns embeddings for graph nodes by preserving higher-order structural features. This method represents general meta-strategies that build on top of existing neural algorithms to improve performance. We used the implementation from the paper with default parameters.
The fourth method is the one presented in Section 4; we refer to it as EmbDI in the following (https://gitlab.eurecom.fr/cappuzzo/embdi). The default configuration uses our tripartite graph, walks (sentences) of size 60, 300 dimensions for the embeddings space, the Skip-Gram model in word2vec with a window size of 3, and different tokenization strategies to convert cell values into nodes. We report the numbers of generated sentences for each dataset in Table 1. The number of sentences depends on the desired number of tokens in the corpus; we discuss a rule-of-thumb to obtain reasonable sizes in the ablation analysis.
By default, EmbDI uses optimizations in data integration tasks. However, to be fair to pre-trained embeddings, our default configuration does not exploit external information, therefore the techniques in Sections 5.2, 5.3, and 5.4 are not used – we show their impact in the ablation study. Experiments have been conducted on a laptop with a CPU Intel i7-8550U, 8x1.8GHz cores and 32GB RAM.
We introduce three kinds of tests to measure how well embeddings learn the relationships inherent in the relational data. Each test consists of a set of tokens taken from the dataset as input, while the goal is to identify which token does not belong to the set (function doesnt_match in the Python library gensim). For the MatchAttribute (MA) tests, we randomly sample four values from an attribute and a fifth value from a different attribute at random in the same dataset, e.g., given (Rambo III, The matrix, E.T., A star is born, M. Douglas), the test is passed if M. Douglas is identified. In MatchRow (MR), we pick all tokens from a row and replace one of them at random with a value from a different row, also selected at random from the same dataset, e.g., (S. Stallone, Rambo III, , P. MacDonald). Finally, in MatchConcept (MC), we model more subtle relationships. We manually identify two attributes A_1 and A_2 that are in a one to many relationship. For a random token x in A_1, we identify all tuples T such that (A_1 = x), we take three distinct A_2 values in T and we finally add a random value y (not in T) from A_2. The test is passed if y is identified as unrelated to the other tokens, e.g., (Q. Tarantino, Pulp fiction, Kill Bill, Jackie Brown, Titanic). This test observes whether the relationship between co-occurring elements (such as directors and their movies) is stronger than the relationship between elements that belong to the same attribute. We took the union of the (aligned) datasets for each scenario and created between 1000 and 11000 tests, depending on its size in terms of rows and attributes.

Table 2: Quality results for local embeddings generation. For each method (Basic, Node2Vec, Harp, EmbDI), columns report MA, MR, MC, AVG.
BB:  .99 .33 .32 .55 | .97 .66 .92 .85 | .96 .65 .95 .85 | .92 .50 .77 .73
WA:  .19 .27 .12 .19 | mem mem mem mem | .16 .32 .13 .20 | .94 1.00 .99 .98
AG:  .10 .51 .39 .99 .37 .79 .38 .79
FZ:  .08 .30 .00 .13 | .84 .88 .62 .78 | .80 .86 .89 .85 | .94 .99 .94 .95
IA:  .09 .11 .09 .09 | mem mem mem mem | .81 .59 .96 .78 | .89 .85 .98 .90
DA:  .08 .29 .02 .13 | .79 .77 .18 .58 | .51 .74 .49 .58 | .79 .91 .66 .79
DS:  .58 .69 .76 | mem mem mem mem | .12 .06 .06 .08 | .90 .99 .99 .96
IM:  .99 .34 .64 .66 | mem mem mem mem | .07 .29 .10 .16 | .74 .42 .78 .65
MSD: .31 .37 .51 .39 | mem mem mem mem | t.o. t.o. t.o. t.o. | .60 .95 .83 .79

We report the quality results in Table 2, where each number represents the fraction of passed tests. With large datasets, some methods either failed the execution or have been stopped after a cut-off time of 10 hours. While on average the local embeddings generated by EmbDI are superior to all other methods, our solution is beaten in a few cases. By increasing the percentage of row permutations in Basic, results improve for MR but decrease for MA, without significant benefit for MC. This shows that complex relationships are not modelled by row and attribute co-occurrence. Node2Vec fails on our configuration for the larger scenarios with memory errors (mem), while Harp has been stopped after 10 hours for MSD (t.o.). We do not report results for pre-trained embeddings as they are not aware of the relationships in the dataset and perform very poorly for this task. For example, they obtain .33 on average for dataset BB (MA: .49, MR: .27, MC: .24) and 0.16 on average for dataset AG (MA: .03, MR: .22, MC: .22).
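For reference, the three tests above can be run directly with gensim's doesnt_match; a minimal sketch, where the embedding file name and the exact token spellings (which depend on the tokenization strategy) are illustrative:

```python
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("embdi_embeddings.emb")

# MatchAttribute (MA): four values of one attribute plus one intruder.
ma_test = ["Rambo_III", "The_matrix", "E.T.", "A_star_is_born", "M._Douglas"]
passed = wv.doesnt_match(ma_test) == "M._Douglas"
```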
Take-away: our graph preserves the structure of the dataset and EmbDI generates local embeddings that model column, row, and inter-tuple relationships better than other embedding generation methods.

Table 3: F-Measure results for Schema Matching (SM), unsupervised. Columns: Base, EmbDI, Node2Vec, Harp, Seep_P, Seep_L.
BB: .75 .75
WA: mem .60 .60 .80
AG:
FZ:
IA: mem .50 .75
DA: mem .50 .75 .81
DS: .50 mem .60 .73
IM: .60 .78 mem .78 .68 .75
We test schema matching and entity resolution in every integration scenario with two datasets and report preliminary results on token matching. In the following, we measure the quality of the results w.r.t. hand crafted ground truth for each task with precision, recall, and their combination (F-measure). Execution times are reported in seconds.
Schema Matching.
We test an unsupervised setting using Algorithm 5 with the overlap of columns treated as bag-of-words (Base) and with local embeddings. We also report results for an existing system with both pre-trained embeddings (Seep_P), as in the original paper [16], and EmbDI local embeddings (Seep_L), as they are the ones with competitive performance that we could generate in all cases.
Table 3 reports the results w.r.t. manually defined attribute matches. All methods are unsupervised, but we distinguish two groups. In the first group, local embeddings are generated and then used with Algorithm 5 from Section 6. Basic local embeddings lead to 0 attribute matches in this experiment and we do not report them in the table. While EmbDI embeddings lead to the best results in most cases, for DS Harp gets better results. While we can get comparable results with optimizations (Section 5), this shows that our graph enables other, more complex embedding schemes to get good results. Base performs very well across most datasets and it is outperformed by local embeddings in one case.
In the second group, we compare pre-trained and EmbDI embeddings with an existing matching system. We have two main remarks. First, the simple unsupervised method with EmbDI embeddings outperforms the Seep_P baseline by at least an absolute 10% in terms of F-measure in all scenarios. Second, the baseline method improves by an average absolute 6% in F-measure when it is executed with EmbDI embeddings, showing their superior quality for SM w.r.t. pre-trained ones. We observe that results for Seep_P depend on the quality of the original attribute labels. If we replace the original (expressive and correct) labels with synthetic ones, Seep_P obtains F-measure values between .30 and .38. Local embeddings from EmbDI do not depend on the presence of the attribute labels. Finally, we tested a traditional instance-based schema matcher that does not use embeddings [27, 28], whose results are lower than the ones obtained by EmbDI in all scenarios.
Take-away: EmbDI local embeddings are more effective than pre-trained ones for the schema matching task when tested with two different unsupervised algorithms.

Table 4: F-Measure results for Entity Resolution (ER). Unsupervised columns: fastText (pre-trained), EmbDI-S, EmbDI-F, EmbDI-O, Node2Vec, Harp (local); supervised (5% labelled): DeepER_P, DeepER_L; task specific (5% labelled): DeepER_P, DeepER_L.
BB: .59 .50 .82 .86 .86 .86 .81 mem .78 0.58 0.62 0.62 0.63
AG: .18 .14 .57 .59 .70 .71
IA: .10 .09 .09 .11 mem .14 .76 .81 .77
DA: .72 .95 .94 .95 .87 .97 .84 .89 .86 .90
DS: .80 .85 .75 .92 mem .81 .80 .87 .82 .91
IM: .31 .90 .64 .94 mem .95 .82 .88 .84 .91
Entity Resolution.
For ER, we study both unsupervised and supervised settings. To enable baselines to execute this scenario, we aligned the attributes with the ground truth. EmbDI can handle the original scenario where the schemas have not been aligned, with a limited decrease in ER quality. As a baseline for the unsupervised case, we use Algorithm 6 with pre-trained embeddings (fastText). We report results for our integration algorithm with EmbDI embeddings in three variants of the way in which we tokenize the cell values in the dataset. EmbDI-S (Simple) uses the original value as a token node in the graph (e.g., "iPad 4th 2012"), while EmbDI-F (Flatten) models it as single words (e.g., nodes "iPad", "4th", "2012" connected to the same RID and to the same CID). The first strategy is more accurate in the modeling of tokens with more than one word, as each token gets its own embedding; this is more precise than the one derived from combining the embeddings of the single words. However, a finer granularity is mandatory for heterogeneous datasets with long texts in the cell values for two reasons. First, accurate node merging is challenging with long sequences of words. Second, in different datasets the same entities can be split across attributes or grouped in one attribute. As an example, the BB datasets contain attributes "beer name" and "brewing company", but in one dataset oftentimes the name of the brewing company appears in the beer name ("brewing_company_A beer_name_1"), while in the other dataset only beer_name_1 appears in the name column. As we do not assume any user-defined pre-processing of the attribute values, modeling the words individually is beneficial in these cases. The third tokenization strategy, EmbDI-O (Overlap), is a trade off between the two that preserves as token nodes the cell values that are overlapping across the two datasets and models as single words the others.
We also test our local embeddings in the supervised setting with a state of the art ER system (DeepER_L), comparing its results to the ones obtained with pre-trained embeddings (DeepER_P).
Results in Table 4 for unsupervised settings show that EmbDI-O embeddings obtain the best quality results in three scenarios and second to the best in four cases. In every case, local embeddings obtained from our graph outperform pre-trained ones. For supervised settings, as in the SM experiments, using local embeddings instead of pre-trained ones increases the quality of an existing system. In this case, supervised DeepER shows an average 5% absolute improvement in F-measure with 5% of the ground truth passed as training data. The improvement decreases to 4% with more training data (10%). Similarly to SM, in the ER case local embeddings obtained with the Basic method lead to 0 rows matched.

Table 5: Effects of n_top on ER quality. For each group (P, R, F), columns are AG, BB, DA, IA, IM, WA.
n_top=1:   P .803 .929 .991 .278 .973 .925 | R .407 .765 .884 .039 .862 .634 | F .540 .839 .935 .068 .914 .752
n_top=5:   P .716 .885 .986 .132 .963 .853 | R .494 .794 .917 .055 .911 .748 | F .585 .837 .950 .077 .936 .797
n_top=10:  P .715 .885 .986 .137 .963 .841 | R .496 .794 .917 .078 .912 .757 | F .586 .837 .950 .100 .936 .797
n_top=100: P .714 .885 .986 .125 .962 .834 | R .496 .794 .917 .078 .912 .764 | F .585 .837 .950 .096 .936 .797

Finally, we investigated if our task agnostic embeddings can be fine-tuned for a specific task. This process of pre-training followed by fine-tuning is a common workflow in NLP. Specifically, we start with the relational embeddings learned by EmbDI and allow them to be fine-tuned for each individual tuple pair if it improves performance. We achieve this by modifying the embedding lookup layer of DeepER. By default, this layer does a "lookup" of a given token from the embedding dictionary. We allow DeepER to learn an additional weight matrix W such that the original EmbDI embeddings can be tuned for ER. The final two columns of Table 4 show the results.
Take-away: EmbDI embeddings are more effective than pre-trained ones for entity resolution in both the unsupervised and the supervised settings.
Token Matching.
Differently from the previous experiments, we do not claim an unsupervised solution for this integration task. In fact, we argue that our embeddings should be used as an additional signal to be combined with the other similarity measures used for this task, e.g., edit distance, Jaccard, TF/IDF. We evaluated the accuracy of this approach on the IM scenario by matching tokens across the two datasets in two (aligned) pairs of columns. We picked this dataset and these columns because it was possible to manually craft the ground truth for their matches. The two columns in a pair contain information about the same entities, but expressed in different formats, such as “DK” for “Denmark”, “UK” for “Great Britain”, and so on. We used the unsupervised nearest-neighbor matching also used for ER. For the column expressing information about countries, pre-trained embeddings and Jaccard similarity obtain matches with 0.13 and 0.19 F-measure, respectively, while EmbDI embeddings get 0.31. For the column about languages, the baselines obtain 0.17 and 0.20, while EmbDI obtains 0.30. These results suggest that local embeddings can bring a stronger signal than pre-trained embeddings and Jaccard distance in string matching systems.
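As an illustration of how the embedding signal could be combined with a classical string similarity, the sketch below averages cosine similarity between local embeddings with character-trigram Jaccard similarity and matches each token to its best-scoring candidate. The weighting scheme and the helper names are our own assumptions, not part of EmbDI.

```python
import numpy as np

def jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity over character n-grams (one possible string signal)."""
    grams = lambda s: {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def match_tokens(left, right, emb, alpha=0.5):
    """Match every token in `left` to its best candidate in `right`.

    `emb` maps tokens to their local embedding vectors; `alpha` weights the
    embedding signal against the string signal (an illustrative choice).
    """
    matches = {}
    for t in left:
        score, best = max(
            (alpha * cosine(emb[t], emb[c]) + (1 - alpha) * jaccard(t, c), c)
            for c in right
        )
        matches[t] = best
    return matches
```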
We now show the effect of the different parameters, design choices, and optimizations in our framework.
Parameters.
Several parameters in EmbDI affect the quality of the local embeddings. All the results reported above have been obtained using a single configuration, but the quality of the results for the different tasks increases significantly by tuning the parameters for the specific tasks. The default setting uses walks of size 60, 300 dimensions for the embedding space, and the Skip-Gram model in word2vec with a window size of 3. We noticed that CBOW performs better than Skip-Gram on the ER task, while having worse results in EQ and SM. For example, executing the ER task with CBOW increases F-measure by at least 2 absolute points for IM and DS. Similarly, decreasing the size of the walks to 5 for the SM task raises the F-measure for DS to 1. This is because embeddings from shorter walks better model the value overlap across columns. As this signal drives the matching task, a lower value increases the quality of the SM matches, but reduces the quality for EQ and ER. We also observe that an even lower value (3) decreases the results also for SM, demonstrating that a semantic characterization is also needed. A larger window for word2vec (5) has a negative effect on all tests and all datasets. Reducing the number of dimensions has limited, mixed effects on average, thus showing that our method is robust to this parameter.
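For reference, the default configuration corresponds roughly to the following gensim call, assuming the random walks generated from the graph are already available as lists of string tokens. The toy corpus and the token prefixes are illustrative; parameter names follow gensim 4.x, where the dimensionality argument is vector_size (size in older releases).

```python
from gensim.models import Word2Vec

# Toy stand-in for the sentence corpus generated from the graph: each walk
# mixes row identifiers (RIDs), column identifiers (CIDs), and cell values.
walks = [
    ["idx__0", "iPad", "cid__name", "idx__7", "2012"],
    ["idx__7", "2012", "cid__year", "idx__0", "iPad"],
]

# Default setting described in the text: 300-dimensional embeddings,
# Skip-Gram (sg=1), and a window of 3; walks of length 60 are built upstream.
model = Word2Vec(
    sentences=walks,
    vector_size=300,   # embedding dimensions
    window=3,          # context window
    sg=1,              # 1 = Skip-Gram, 0 = CBOW
    min_count=1,
    workers=4,
)

vector = model.wv["iPad"]  # local embedding of a value node
```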
A larger corpus leads to better results in general, but we empirically observed diminishing returns after a certain size. As a rule of thumb, we fix the total number of tokens in the corpus with the following formula:

We set n_top = 10 in our ER experiments; by varying n_top we observe the expected trade-offs between P and R, as reported in Table 5 for six datasets. Results for the FZ scenario do not change with different n_top values, and results for DS are close in values and trend to those reported for DA.

Optimizations.
We tested optimizations of the original default configuration for EmbDI. For replacement (Section 5.3), we used an external dictionary for one column in each dataset, e.g., different formats of country codes. The biggest improvement is in ER, with an absolute 3% on average, while the quality is stable for SM and EQ. For alignment (Section 5.4), we fed the optimization step with the outcome of the default model, i.e., we got candidate RIDs and CIDs from a first execution and then refined the embeddings with this information. This leads to an absolute 2% increase in F-measure for ER, with the larger contribution coming from the better recall.
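As a small illustration of the replacement optimization, the snippet below maps equivalent value formats to a canonical token before graph construction, so that they collapse into a single value node; the dictionary content is a made-up example, not the one used in the experiments.

```python
# Illustrative external dictionary for country formats (not the actual one
# used in the experiments).
country_dictionary = {
    "DK": "denmark", "Denmark": "denmark",
    "UK": "united_kingdom", "Great Britain": "united_kingdom",
}

def replace_with_dictionary(values, dictionary):
    """Normalize cell values before graph construction so that equivalent
    formats end up on the same value node."""
    return [dictionary.get(v, v) for v in values]

print(replace_with_dictionary(["DK", "Great Britain", "France"], country_dictionary))
# ['denmark', 'united_kingdom', 'France']
```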
Figure 3: EmbDI ER F-measure for IM with an increasing amount of missing values in the data (x-axis: % null Year values; curves: Skip and FD).
Figure 3 shows the impact on ER of inserted missing values in the IM dataset. We defined the FD Title, Director → Year and inserted an increasing amount of noise at random in the column Year. As the number of records in common across the two datasets is very low, most of the NULLs modify records that appear in only one dataset. Surprisingly, this has a visible effect on the results in terms of F-measure. While the default Skip solution (ignoring NULL values in the graph creation) stays stable until a large number of NULLs is introduced, the results improve for the optimization that enforces the FD in the graph construction. This improvement is driven by the increasing precision. In fact, there are non-duplicate movies that have a large number of attribute values in common, including the year, and that are identified as duplicates by our unsupervised method (based on neighbor RIDs). However, the FD enforces that any missing value is treated as a new value, distinct from the others, and this information moves the embedding of the RID with the NULL away from the similar tuple that is not a duplicate.
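A minimal sketch of this FD-based treatment of NULLs, written by us for illustration (it is not the EmbDI implementation): every missing value on the right-hand side of the FD is replaced with a fresh, unique placeholder before graph construction, so tuples with a missing Year are not pulled together by a shared null node.

```python
import pandas as pd

def enforce_fd_nulls(df: pd.DataFrame, rhs: str) -> pd.DataFrame:
    """Replace NULLs in the FD's right-hand side column with unique tokens.

    Each missing value becomes a distinct placeholder, so the graph gets a
    separate value node per NULL instead of skipping it (the default Skip
    policy) or sharing one node across all NULLs.
    """
    df = df.copy()
    df[rhs] = df[rhs].astype(object)
    null_mask = df[rhs].isna()
    df.loc[null_mask, rhs] = [f"__null_{rhs}_{i}" for i in df.index[null_mask]]
    return df

# Example with the FD Title, Director -> Year discussed above.
movies = pd.DataFrame({
    "Title": ["Heat", "Alien", "Gravity"],
    "Director": ["Mann", "Scott", "Cuaron"],
    "Year": [1995, None, 2013],
})
print(enforce_fd_nulls(movies, rhs="Year"))
```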
Execution times.
Compared to Node2Vec and Harp, the execution of EmbDI is much faster: it is able to compute local embeddings for all medium-size datasets in minutes on a commodity laptop. As reported in Table 6 for experiments with the default configuration (using word2vec and Skip-Gram), the embedding creation (E) takes on average about 80% of the total execution time, while graph generation (G) takes less than 1% and sentence creation (W) the remaining 19%. The execution time for the embedding creation from the sentences depends drastically on the algorithm used and its configuration; e.g., CBOW is much faster than Skip-Gram.

Dataset    G      W      E       W+E     N2V     HARP
BB         2.47   66.7   133     200     1663    732
WA         13.4   329    1113    1442    mem     2394
AG         1.19   34.4   122     156     953     135
FZ         0.3    12.0   40.7    52.6    178     27.0
IA         32.0   533    1360    1893    mem     9122
DA         2.08   43.6   130     173     920     128
DS         33.9   919    3027    3947    mem     21659
IM         31.6   768    2772    3540    mem     8001
MSD        146    6377   27050   33427   mem     t.o.
Table 6: Execution times (in seconds) for embedding generation for EmbDI, Node2Vec (N2V), and Harp.

As the graph generation is common to all methods, we compare our solution with Node2Vec (N2V) and Harp in terms of time to generate walks and produce embeddings (W+E). EmbDI is faster in most cases, up to 7x in two datasets, and, in contrast with Harp, never hits the time-out (t.o.) of 10 hours. With larger datasets, Node2Vec raised a memory error on our 32GB reference machine. EmbDI does not suffer from this problem: even on a laptop with 16GB of main memory, we were able to run all tests, including the ones for the biggest dataset of 1M tuples (139MB).
In this paper, we proposed a novel framework, EmbDI, for automatically learning local relational embeddings of high quality from the data. The learned embeddings provide promising results for a number of challenging and well-studied data integration tasks such as entity resolution and schema matching. Our embeddings are generic to data integration and could also be tuned in a task-specific manner to obtain better results.

There are a number of intriguing research questions that we plan to tackle next. One of our key focus areas is seamlessly combining pre-trained and local embeddings. While blindly using pre-trained embeddings provides sub-optimal results, they could be intelligently combined with the local embeddings provided by EmbDI to obtain a hybrid embedding that is more effective. Recently, there has been increasing interest in incorporating contextual information into word embeddings and language modeling. Approaches such as BERT [13] achieve state-of-the-art results in NLP due to this. An important open question is to formally define an appropriate notion of context for relational data integration so that DL models could be built for learning contextualized word embeddings.
Acknowledgement.
This work has been partially supported by the ANR grant ANR-18-CE23-0019 and by the IMT Futur & Ruptures program “AutoClean”.
REFERENCES
[1] Ziawasch Abedjan, Xu Chu, Dong Deng, Raul Castro Fernandez, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker, and Nan Tang. 2016. Detecting Data Errors: Where are we and what needs to be done? PVLDB 9, 12 (2016), 993–1004.
[2] Patricia C. Arocena, Boris Glavic, Giansalvatore Mecca, Renée J. Miller, Paolo Papotti, and Donatello Santoro. 2015. Messing Up with BART: Error Generation for Evaluating Data-Cleaning Algorithms. PVLDB.
[3] CoRR abs/1607.04606 (2016). arXiv:1607.04606 http://arxiv.org/abs/1607.04606
[4] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. TACL.
[5] arXiv preprint arXiv:1712.07199 (2017).
[6] Rajesh Bordawekar and Oded Shmueli. 2017. Using word embedding to enable semantic queries in relational databases. In DEEM Workshop. ACM, 5.
[7] Rajesh Bordawekar and Oded Shmueli. 2019. Exploiting Latent Information in Relational Databases via Word Embedding and Application to Degrees of Disclosure. In CIDR.
[8] Öykü Özlem Çakal, Mohammad Mahdavi, and Ziawasch Abedjan. 2019. CLRL: Feature Engineering for Cross-Language Record Linkage. In EDBT. 678–681.
[9] Haochen Chen, Bryan Perozzi, Yifan Hu, and Steven Skiena. 2017. HARP: Hierarchical Representation Learning for Networks. CoRR abs/1706.07845 (2017). arXiv:1706.07845 http://arxiv.org/abs/1706.07845
[10] Xu Chu and Ihab F. Ilyas. 2019. Data Cleaning. ACM.
[11] Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017. Word translation without parallel data. arXiv preprint arXiv:1710.04087 (2017).
[12] Sanjib Das, Paul Suganthan G. C., AnHai Doan, Jeffrey F. Naughton, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, Vijay Raghavendra, and Youngchoon Park. 2017. Falcon: Scaling Up Hands-Off Crowdsourced Entity Matching to Build Cloud Services. In SIGMOD.
[13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[14] Muhammad Ebraheem, Saravanan Thirumuruganathan, Shafiq Joty, Mourad Ouzzani, and Nan Tang. 2018. Distributed representations of tuples for entity resolution. PVLDB 11, 11 (2018), 1454–1467.
[15] Raul Castro Fernandez and Samuel Madden. 2019. Termite: a system for tunneling through heterogeneous data. arXiv preprint arXiv:1903.05008 (2019).
[16] Raul Castro Fernandez, Essam Mansour, Abdulhakim A. Qahtan, Ahmed Elmagarmid, Ihab Ilyas, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, and Nan Tang. 2018. Seeping semantics: Linking datasets using word embeddings for data discovery. In ICDE.
[17] FigureEight. 2016. Data Science Report. https://visit.figure-eight.com/data-science-report.html. (2016).
[18] Chaitanya Gokhale, Sanjib Das, AnHai Doan, Jeffrey F. Naughton, Narasimhan Rampalli, Jude W. Shavlik, and Xiaojin Zhu. 2014. Corleone: hands-off crowdsourcing for entity matching. In SIGMOD.
[19] Behzad Golshan, Alon Y. Halevy, George A. Mihaila, and Wang-Chiew Tan. 2017. Data Integration: After the Teenage Years. In PODS. 101–106.
[20] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In SIGKDD. ACM, 855–864.
[21] Michael Günther. 2018. FREDDY: Fast Word Embeddings in Database Systems. In SIGMOD. ACM, 1817–1819.
[22] Michael Günther, Maik Thiele, Erik Nikulski, and Wolfgang Lehner. 2020. RetroLive: Analysis of Relational Retrofitted Word Embeddings. EDBT (2020).
[23] Richard Hull and Masatoshi Yoshikawa. 1990. ILOG: Declarative Creation and Manipulation of Object Identifiers. In VLDB. 455–468.
[24] Madelon Hulsebos, Kevin Hu, Michiel Bakker, Emanuel Zgraggen, Arvind Satyanarayan, Tim Kraska, Çagatay Demiralp, and César Hidalgo. 2019. Sherlock: A Deep Learning Approach to Semantic Data Type Detection. In SIGKDD. ACM.
[25] Jungo Kasai, Kun Qian, Sairam Gurajada, Yunyao Li, and Lucian Popa. 2019. Low-resource Deep Entity Resolution with Transfer and Active Learning. arXiv preprint arXiv:1906.08042 (2019).
[26] Christos Koutras, Marios Fragkoulis, Asterios Katsifodimos, and Christoph Lofi. 2020. REMA: Graph Embeddings-based Relational Schema Matching. SEA Data Workshop (2020).
[27] Bruno Marnette, Giansalvatore Mecca, Paolo Papotti, Salvatore Raunich, and Donatello Santoro. 2011. ++Spicy: an OpenSource Tool for Second-Generation Schema Mapping and Data Exchange. PVLDB.
[28] In International Workshop on Ontology Matching.
[29] Renée J. Miller, Fatemeh Nargesian, Erkang Zhu, Christina Christodoulakis, Ken Q. Pu, and Periklis Andritsos. 2018. Making Open Data Transparent: Data Discovery on Open Data. IEEE Data Eng. Bull. 41, 2 (2018), 59–70.
[30] Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and Vijay Raghavendra. 2018. Deep learning for entity matching: A design space exploration. In SIGMOD.
[31] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In EMNLP. 1532–1543.
[32] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. CoRR abs/1802.05365 (2018).
[33] Tye Rattenbury, Joseph M. Hellerstein, Jeffrey Heer, Sean Kandel, and Connor Carreras. 2017. Principles of data wrangling: Practical techniques for data preparation. O'Reilly Media, Inc.
[34] Paul Suganthan, Adel Ardalan, AnHai Doan, and Aditya Akella. 2018. Smurf: Self-Service String Matching Using Random Forests. PVLDB 12, 3 (2018), 278–291.
[35] Saravanan Thirumuruganathan, Shameem A. Puthiya Parambath, Mourad Ouzzani, Nan Tang, and Shafiq Joty. 2018. Reuse and adaptation for entity resolution through transfer learning. arXiv preprint arXiv:1809.11084 (2018).
[36] Saravanan Thirumuruganathan, Nan Tang, Mourad Ouzzani, and AnHai Doan. 2020. Data curation with Deep Learning. EDBT (2020).
[37] Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In ACL. 384–394.
[38] Chen Zhao and Yeye He. 2019. Auto-EM: End-to-end Fuzzy Entity-Matching using Pre-trained Deep Models and Transfer Learning. In WWW. 2413–2424.
[39] Erkang Zhu, Yeye He, and Surajit Chaudhuri. 2017. Auto-Join: Joining Tables by Leveraging Transformations.