Toward Data Cleaning with a Target Accuracy: A Case Study for Value Normalization
Adel Ardalan, Derek Paulsen, Amanpreet Singh Saini, Walter Cai, AnHai Doan
ABSTRACT
Many applications need to clean data with a target accuracy. As far as we know, this problem has not been studied in depth. In this paper we take the first step toward solving it. We focus on value normalization (VN), the problem of replacing all strings that refer to the same entity with a unique string. VN is ubiquitous, and we often want to do VN with 100% accuracy. This is typically done today in industry by automatically clustering the strings, then asking a user to verify and clean the clusters until reaching 100% accuracy. This solution has significant limitations. It does not tell the users how to verify and clean the clusters. This part also often takes a lot of time, e.g., days. Further, there is no effective way for multiple users to collaboratively verify and clean. In this paper we address these challenges. Overall, our work advances the state of the art in data cleaning by introducing a novel cleaning problem and describing a promising solution template.
1 INTRODUCTION

Data cleaning (DC) has been a long-standing challenge in the database community. Many DC problems have been studied, such as cleaning with a budget, cleaning to satisfy constraints while minimizing changes to the data, etc. Recently, however, we have seen another novel DC problem in industry: cleaning with a target accuracy, e.g., with at least 95% precision and 90% recall. While pervasive, this problem appears to have received little attention in the research community.

In this work, we take the first step toward solving this problem. We focus on value normalization (VN), the problem of replacing all strings (in a given set) that refer to the same real-world entity with a unique string. VN is ubiquitous, and industrial users often want to do VN with 100% accuracy.
Example 1.1. To enable product browsing by brand on walmart.com, the business group at WalmartLabs asks the IT group to normalize the brands, e.g., converting those in Figure 1.a into those in Figure 1.b. If some brands are not normalized correctly, then customers may not find those products, resulting in revenue losses. So the business group asks the IT group to ensure that the brands are normalized with 100% accuracy.
Many enterprise customers of Informatica (which sells data integration software) also face this problem, in building business glossaries, master data management, and knowledge graph construction. In general, if even a small amount of inaccuracy in VN can cause significant problems for the target application, then the business group will typically ask the IT group to help perform VN with 100% accuracy.
In response, the IT group typically employs an algorithm to cluster the strings, then asks a user to verify and clean the clusters.
Figure 1: An example of normalizing product brands
Consider the five brands in Figure 2.a. The IT group applies an algorithm to produce two clusters c_1 and c_2 (Figure 2.b). A user U manually verifies and cleans the clusters, by moving "Vizio Corp" from cluster c_1 to c_2, producing the two clusters c_3 and c_4 in Figure 2.c. Finally, U replaces each string in a cluster with a canonical string, producing the VN result in Figure 2.d.

Typically, a data scientist performs the "machine" part that clusters the strings, then a data analyst performs the "human" part that verifies and cleans the clusters. The IT group assures the business group that the resulting output is 100% accurate because a data analyst has examined it (assuming that he/she does not make mistakes).

While popular, the above solution has significant limitations. First, there is no precise procedure that a user can follow to execute the "human" part. So users often verify and clean in an ad-hoc, suboptimal, and often incorrect fashion. This also makes it impossible to understand the assumptions under which the solution reaches 100% accuracy and to formally prove it.

Second, the "human" part often incurs a huge amount of time, e.g., days. In contrast, the "machine" part often takes mere minutes. (In most cases that we have seen, users verified and cleaned using Excel, in a slow and tedious process.) So it is critical to develop a better solution and GUI tool to minimize the time of the "human" part.

Finally, it is difficult for multiple users to collaboratively verify and clean, even though this setting commonly occurs in practice.

In this paper we develop Winston, a solution for the above challenges. (In the movie "Pulp Fiction", Winston Wolfe is the fixer who cleans up messes made by other gangsters.) We first
Figure 2: A popular solution in industry to perform VN with 100% accuracy.

define a set of basic operations on a GUI for users, e.g., selecting a value, verifying if a cluster is clean, merging two clusters, etc. Then we provide precise procedures involving these actions that users can execute to verify/clean clusters. We prove that if users execute these actions correctly, then the output has 100% accuracy.

To minimize the time of the "human" part, we adopt an RDBMS-style solution. Specifically, we compose the GUI operations with clustering algorithms to form multiple "machine-human" plans, each of which executes the VN pipeline end-to-end. Next, we estimate the total time a user must spend per plan, select the plan with the least estimated time, execute its machine part, then show the output of that part to the user so that he/she can verify and clean it using a GUI (following the sequence of user operations that this plan specifies).

Finally, we show how to extend our solution to effectively divide the verification and cleaning work among multiple users.

Our solution appears highly effective. Section 8 shows that using the existing solution, a single user needs 29 days, 4.4 years, and 11.5 years to verify/clean 100K, 500K, and 1M strings, respectively. Winston drastically reduces these times to just 13 days, 9.6 months, and 1.3 years using 1 user, and to 4.25 days, 2.2 months, and 3.5 months using 3 users.

In summary, we make the following contributions:

• We formally define the novel data cleaning problem of VN with 100% accuracy. As far as we know, this paper is the first to study this problem in depth.

• We propose Winston, a novel RDBMS-style solution. Winston defines complex human operations and optimizes the human time of a plan. This is in contrast to traditional RDBMSs, which define machine operations and optimize machine time.

• We describe extensive experiments (comparing Winston to a tool in a company, to the popular open-source tool OpenRefine, and to state-of-the-art string matching and entity matching solutions) that show the promise of our approach.

Overall, our work advances the state of the art in data cleaning by introducing a novel cleaning problem and describing a promising solution template. It also advances the state of the art in human-in-the-loop data analytics (HILDA) by showing that it is possible to develop an RDBMS-style solution to HILDA, by defining complex human operations, combining them to form plans, and selecting the plan with the lowest estimated human effort.
In this section we define the problem of VN with 100% accuracy, and examine under what conditions we can reach this accuracy. We first define
Definition 2.1 (Value normalization).
Let V be a set of strings {v_1, ..., v_n}. Replace each v ∈ V with a string f(v) such that f(v_i) = f(v_j) if and only if v_i and v_j refer to the same real-world entity, for all v_i, v_j in V.

This problem is often solved in two steps: (1) partition V into a set of disjoint clusters P = {c_1, ..., c_k}, such that two strings refer to the same real-world entity if and only if they belong to the same cluster; (2) replace all strings in each cluster c_i ∈ P with a canonical string s_i.

In this paper we will consider only Step 1, which tries to find the correct partitioning of V. Step 2 is typically application dependent (e.g., a common method is to select the longest string in a cluster c_i to be its canonical string, because this string tends to be the most informative one).

Figure 3: Actions and their verification sets

Gold Partition & Accuracy of Partitions:
Let U be a user who will verify and clean the clusters. To do so, U must be capable of creating a "gold", i.e., correct, partition P* = {c*_1, ..., c*_m}, such that two strings refer to the same real-world entity if and only if they are in the same cluster. For example, the two clusters c_3 and c_4 in Figure 2.c form the gold partition for the set of strings in Figure 2.a.

Our goal is to find the gold partition P*. But the partition that we find may not be as accurate. We now describe how to compute the accuracy of any partition. First, we define

Definition 2.2 (Match and non-match).
A match v_i = v_j means "v_i and v_j refer to the same real-world entity", and is correct if this is indeed true. v_i = v_j and v_j = v_i are considered the same match. We define a non-match v_i ≠ v_j similarly.

Definition 2.3 (Set of matches specified by a partition).
A cluster c_i specifies the set of matches M(c_i) = {v_p = v_q | v_p ∈ c_i, v_q ∈ c_i, p ≠ q}. Partition P = {c_1, ..., c_k} specifies the set of matches M(P) = ∪_{i=1..k} M(c_i).

For example, cluster c_1 in Figure 2.b specifies three matches: M(c_1) = {Sony = Sony Corp, Sony = Vizio Corp, Sony Corp = Vizio Corp}. Cluster c_2 specifies one match: M(c_2) = {Vizio = Vizio Inc}. These two clusters form a partition P, which specifies the set of matches M(c_1) ∪ M(c_2). The accuracy of a partition is then measured as follows:

Definition 2.4 (Precision and recall of a partition).
Let P* be the gold partition of a set of strings V. The precision of a partition P is the fraction of matches in M(P) that are correct, i.e., appearing in M(P*), and the recall of P is the fraction of matches in M(P*) that appear in M(P).

Given the gold partition P* = {c_3, c_4} in Figure 2.c, the precision of partition P = {c_1, c_2} in Figure 2.b is 2/4 = 50%, and the recall is 2/4 = 50%.
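To make Definitions 2.3 and 2.4 concrete, the computation above can be sketched in Python. This is an illustrative sketch, not the paper's code; the cluster contents are our reading of the brands in Figure 2, and the function names are ours.

```python
from itertools import combinations

def matches(partition):
    """The set of unordered matches M(P) specified by a partition (Definition 2.3)."""
    return {frozenset(pair) for cluster in partition for pair in combinations(cluster, 2)}

def precision_recall(partition, gold):
    """Precision and recall of a partition against the gold partition (Definition 2.4)."""
    found, correct = matches(partition), matches(gold)
    hits = found & correct
    return len(hits) / len(found), len(hits) / len(correct)

# The partition of Figure 2.b versus the gold partition of Figure 2.c:
found = [{"Sony", "Sony Corp", "Vizio Corp"}, {"Vizio", "Vizio Inc"}]
gold = [{"Sony", "Sony Corp"}, {"Vizio", "Vizio Corp", "Vizio Inc"}]
p, r = precision_recall(found, gold)
print(p, r)  # 0.5 0.5
```

Both partitions specify 4 matches each, of which 2 (Sony = Sony Corp and Vizio = Vizio Inc) coincide, giving the 2/4 = 50% precision and recall computed above.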
Actions & Their Verification Sets:

Henceforth, we use "action" and "operation" interchangeably. When a user U performs an action, U has implicitly verified a set of matches and non-matches, called a verification set. Formally,

Definition 2.5 (User action and verification set).
We assume a GUI on which user U can perform a set of actions A = {a_1, ..., a_n}. Each action a_i inputs data I_i and outputs data O_i, both of which involve sets of strings in V. After correctly executing an action a_i on input I_i, as a side effect, user U has implicitly verified a set S(a_i, I_i) of matches and non-matches to be correct. We refer to S(a_i, I_i) as a verification set.

To illustrate, suppose U has employed an algorithm to produce the partition {d_1, d_2} in Figure 3. Next, U uses a GUI to verify and clean these clusters. Call a cluster "pure" if all strings in it refer to the same real-world entity. Suppose the GUI supports only two actions: a_s splits a cluster into pure clusters, and a_m merges two pure clusters into one.

Suppose U starts by using a_s to split cluster d_1 into two pure clusters d_3 = {Sony, Sony Corp} and d_4 = {Vizio Corp} (see Figure 3). Cluster d_1 specifies three matches: Sony = Sony Corp, Sony = Vizio Corp, and Sony Corp = Vizio Corp. With the above splitting, intuitively, U has verified that the first match Sony = Sony Corp is indeed correct, but the remaining two matches are not. Thus, the resulting verification set S(a_s, d_1) is the set of 1 match and 2 non-matches: {Sony = Sony Corp, Sony ≠ Vizio Corp, Sony Corp ≠ Vizio Corp}, as shown in Figure 3.

Next, U uses a_s to split cluster d_2. U determines that d_2 is already pure, so no new clusters are created. Implicitly, U has verified that the sole match specified by d_2 is correct. So S(a_s, d_2) = {Vizio = Vizio Inc}. Finally, U uses action a_m to merge the two pure clusters d_4 and d_2 into cluster d_5 (see Figure 3). Implicitly, U has verified that the two matches {Vizio = Vizio Corp, Vizio Inc = Vizio Corp} are correct. These form the verification set S(a_m, {d_4, d_2}).

In a similar fashion, we can define the verification set for the sequence a_s(d_1), a_s(d_2), a_m({d_4, d_2}) in Figure 3 to be S(a_s, d_1) ∪ S(a_s, d_2) ∪ S(a_m, {d_4, d_2}). Formally,

Definition 2.6 (Verification set of action sequence).
If user U has performed an action sequence G = a_{i1}, ..., a_{ik}, then the verification set of G is S(G) = ∪_{j=1..k} S(a_{ij}, I_{ij}).

Match Transitivity:
Recall that we assume user U can create a gold partition P*. This implies that transitivity holds for matches, i.e., if v_i = v_j and v_j = v_l are correct, then v_i = v_l is also correct (because all three must be in the same gold cluster). Similarly, if v_i = v_j and v_j ≠ v_l are correct, then v_i ≠ v_l is also correct.

Definition 2.7 (Inferring matches).
We say that match v_i = v_j can be inferred from a verification set S(G) if and only if there exists a sequence of strings v_{m1}, ..., v_{mk} such that the matches v_i = v_{m1}, v_{m1} = v_{m2}, ..., v_{mk} = v_j are in S(G). We say that these matches form a transitivity path from v_i to v_j. Similarly, we say that non-match v_i ≠ v_j can be inferred from S(G) if and only if there exists such a path, except that exactly one of the edges of the path is a non-match.
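Definition 2.7's inference rules can be sketched with a union-find structure (our choice of data structure; the class and helper names are ours): strings verified to match collapse into components, and a verified non-match between two components makes every cross-component pair an inferable non-match, since the connecting path then contains exactly one non-match edge.

```python
class UnionFind:
    """Minimal union-find over arbitrary hashable items."""
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:          # path-halving lookup
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x
    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

def infer(matches, nonmatches):
    """Close a verification set under the transitivity rules of Definition 2.7."""
    uf = UnionFind()
    for a, b in matches:
        uf.union(a, b)                      # verified matches merge components
    # A verified non-match between two components blocks all cross pairs:
    blocked = {frozenset((uf.find(a), uf.find(b))) for a, b in nonmatches}
    is_match = lambda a, b: uf.find(a) == uf.find(b)
    is_nonmatch = lambda a, b: frozenset((uf.find(a), uf.find(b))) in blocked
    return is_match, is_nonmatch

m, n = infer([("Sony", "Sony Corp"), ("Vizio", "Vizio Inc")],
             [("Sony Corp", "Vizio Inc")])
print(m("Sony", "Sony Corp"))  # True
print(n("Sony", "Vizio"))      # True: path with exactly one non-match edge
```

Here Sony ≠ Vizio is inferred from the path Sony = Sony Corp, Sony Corp ≠ Vizio Inc, Vizio Inc = Vizio.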
VN with 100% Accuracy:

Recall that the gold partition P* specifies a set of correct matches M(P*). We say that it also specifies a set of correct non-matches N(P*), which consists of all non-matches v_i ≠ v_j such that v_i = v_j is not a match in M(P*). We define

Definition 2.8 (Gold sequence of actions).
A sequence G of actions of user U is "gold" if and only if any match in M(P*) or non-match in N(P*) either already exists in the verification set S(G) or can be inferred from S(G).

It is not difficult to prove that executing a gold action sequence G will produce the gold partition P*. We can now define our problem as follows:

Definition 2.9 (VN with 100% accuracy).
Let V be a set of strings. Let (m, h) be a pair of machine/human algorithms, such that the machine part m can be executed on V to produce a partition P, then the human part h can be executed by a user U on P to produce a new partition P+. Find m and h such that (a) the action sequence executed by user U in part h is a gold sequence, and (b) the total time spent by user U is minimized. Return the resulting partition P+.

Figure 4: An illustration of the split and merge stages.
Thus, we reach 100% accuracy if the user executes a gold sequence G of actions. Then all correct matches and non-matches will have already been in the verification set of G, or can be inferred from this verification set via match transitivity.

As discussed, each VN plan (m, h) consists of a machine part m and a human part h. In part m we apply an algorithm to the input strings to obtain a set of clusters C, then in part h we employ a user U to verify and clean C. We now design part h; the next section designs part m.

The key challenge in designing the human part h is to ensure that it is easy for users to understand and execute, minimizes their effort, and is amenable to cost analysis. Toward these goals, we discuss the user setting, describe a solution called split and merge, then define a set of user operations that can be used to implement this solution.

We assume user U will work with a graphical user interface (GUI), using mouse and keyboard. U has a short-term memory (or STM for short). According to [36], each individual can remember 7 ± 2 objects in STM. U can use paper and pen for those cases where U needs to keep track of more objects than can fit into STM.

User U can clean the clusters output by the machine part in many different ways. In this paper, based on what we have seen users do in industry, we propose that U clean in two stages. The first stage splits the clusters recursively until all resulting clusters are "pure", i.e., each containing only the values of a single real-world entity (though often not all such values). The second stage then merges clusters that refer to the same entity.

Example 3.1.
Suppose the machine part produces clusters 1-2 in Figure 4. The split stage splits cluster 1 into clusters 3-4, cluster 2 into clusters 5-6, cluster 6 into clusters 7-8, then cluster 8 into clusters 9-10 (see the solid arrows). The output of the split stage is the set of pure clusters 3, 4, 5, 7, 9, 10. The merge stage then merges clusters 3 and 5 into cluster 11, and clusters 4 and 9 into cluster 12 (see the dotted arrows). The end result is the set of clean clusters 11, 12, 7, 10.
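The two-stage clean of Example 3.1 can be simulated in Python. This is a sketch of ours, not the paper's procedure: the `entity` oracle stands in for the user's judgment of which entity a value refers to (here a toy first-token rule), and the split stage is collapsed into direct grouping rather than the recursive GUI procedure described next.

```python
def split_stage(clusters, entity):
    """Split every cluster until each result is pure (one entity per cluster)."""
    pure = []
    for c in clusters:
        groups = {}
        for v in c:                        # per-value judgment, as the user makes
            groups.setdefault(entity(v), []).append(v)
        pure.extend(groups.values())       # each group is a pure cluster
    return pure

def merge_stage(pure_clusters, entity):
    """Merge pure clusters whose values refer to the same entity."""
    merged = {}
    for c in pure_clusters:
        merged.setdefault(entity(c[0]), []).extend(c)
    return list(merged.values())

entity = lambda v: v.split()[0].lower()    # toy oracle: first token, lowercased
clusters = [["Sony", "Sony Corp", "LG"], ["Lg", "Sony Inc"]]
out = merge_stage(split_stage(clusters, entity), entity)
print(sorted(map(sorted, out)))  # [['LG', 'Lg'], ['Sony', 'Sony Corp', 'Sony Inc']]
```

With a correct oracle, the output is the gold partition regardless of the input clustering, which is the intuition behind the correctness guarantee of the human part.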
We now describe the split stage (Section 3.2 describes the merge stage). First, we define the dominating entity of a cluster c to be the one with the most values in c (henceforth we use "value" and "string" interchangeably). Formally,

Algorithm 1: Split Phase

Procedure Split(C)
  Input: a set of clusters C = {c_1, ..., c_n}, output by machine
  Output: a set of clean clusters D = {d_1, ..., d_m} s.t. ∪_i d_i = ∪_i c_i
  D ← ∅
  for each cluster c ∈ C do
    D ← D ∪ SplitCluster(c)
  return D

Procedure SplitCluster(c)
  Input: a cluster c
  Output: a set of pure clusters G = {g_1, ..., g_k} s.t. ∪_i g_i = c
  if |c| = 1 then return {c}
  isPure(c)                  // at the end, user selects yes/no button
  if yes button is selected then return {c}
  findDom(c)                 // at the end, user knows e* and α,
                             // or selects "mark values" button
  if "clean mixed cluster" button is selected then
    return Merge(c)          // α < 0.1 in this case
  MarkValues(c, e*, α)       // at the end, user selects "create/clean new cluster"
                             // or "create new cluster / clean old cluster" button
  Move all marked values in c into a new cluster c'
  if "create/clean new cluster" button is selected then
    return {c} ∪ SplitCluster(c')    // α ≥ 0.5
  else
    return SplitCluster(c) ∪ {c'}    // α < 0.5

Procedure MarkValues(c, e*, α)
  Input: a cluster c, its dominating entity e* and purity α
  Output: a set of values in c, selected by the user
  Let L be the list of values in c, displayed on the GUI
  if α ≥ 0.5 then
    for i ← 1, ..., |c| do
      focus(L[i]); if not match(L[i], e*) then select(L[i])
  else
    for i ← 1, ..., |c| do
      focus(L[i]); if match(L[i], e*) then select(L[i])
Let G be a partition of cluster c into groups of values G_1, ..., G_m, such that all values in each group refer to the same real-world entity and different groups refer to different entities. Then the dominating entity of c is the entity of the group with the largest size: G_d = argmax_{G_i ∈ G} |G_i|. Henceforth, we will use dom(c) (or e* when there is no ambiguity) to denote the dominating entity of c.

In Figure 4, dom(cluster 1) and dom(cluster 2) are Sony Corporation. Cluster 6 has three candidates; we break the tie by randomly selecting one to be the dominating entity.

Let C be the set of clusters output by the machine part. Our key idea for the split stage is that if the machine part has been reasonably accurate, then any cluster c ∈ C is likely to be dominated by dom(c). If so, user U can clean c by moving all the values in c that do not refer to dom(c) into a new cluster c', then cleaning c', and so on.

Specifically, for each cluster c ∈ C, user U should (1) check if c is pure; if yes, stop; (2) otherwise find the dominating entity dom(c); (3) move all values in c that do not refer to dom(c) into a new cluster c'; then (4) apply Steps 1-3 to cluster c' (cluster c has become pure, so it needs no further splitting). This recursive procedure will split the original cluster c into a set of pure clusters. It is relatively easy for human users to understand and follow, and as we will see in Section 5, it is also highly amenable to cost analysis.

Example 3.3.
Given cluster 1 in Figure 4, user U splits it into the pure cluster 3, which contains only the values of the dominating entity Sony Corporation, and cluster 4, which contains all remaining values of cluster 1. A similar recursive splitting process applies to cluster 2. (Note that cluster 6 has three dominating-entity candidates, so we break the tie randomly and select Dell to be the dominating entity.)

We now optimize the above procedure in three ways. First, if c is a singleton cluster, then we do not invoke the above splitting procedure, because c is already pure. Second, there are cases where the number of values referring to dom(c) is less than 50% of |c|. Formally, we define

Definition 3.4 (Cluster purity).
The purity of a cluster c is the fraction of the values in c that refer to dom(c). Henceforth we will use p(c) (or α when there is no ambiguity) to denote the purity of c.

For example, in Figure 4, the purity of cluster 2 is 2/5 = 0.4 < 0.5. In such cases, instead of moving the values of c that do not refer to dom(c), as discussed so far, it is less work for the user to move the values that do refer to dom(c) into a new cluster c' (e.g., for cluster 2, U should move "Sonny" and "SONY Corp", instead of the other three values).

Finally, if p(c) is below a threshold, currently set to 0.1, then c is very "mixed", with each entity having less than 10% of the values. In this case, we have found that instead of splitting c, it is often more effective to apply the Merge procedure described in Section 3.2 to c. This produces a set of pure clusters that are then fed to the merge stage.

Basic User Operations:
To implement the above solution, we define the following five basic user operations:

• focus(a): User U moves his or her focus to a particular object a on the GUI or on the paper, such as a cluster, a value within a cluster, a GUI button, a number on the paper, etc. Intuitively, user U will shift his or her attention from one object to another on the GUI or the paper, and that incurs a certain amount of time. This operation is designed to capture this physical action (and its cost).

• select(a): User U selects an object a on the GUI (e.g., a cluster, a value, a GUI button, etc.) by moving the mouse pointer to that object and clicking on it, or by pressing a keyboard button (e.g., Page Up, Page Down). This operation is designed to capture this physical action (and its cost).

• match(x, y): Given two values, or a value and a real-world entity (in U's short-term memory), U determines if they refer to the same real-world entity.

• isPure(c): U examines cluster c to see if it is pure (i.e., if it is clean). Specifically, we assume the values in c are listed (e.g., on the GUI) as a list of values L. User U reads the first value of L, maps it to an entity e, then scans the values in the rest of L. As soon as U sees a value that does not refer to e, the cluster is not pure, so U stops and returns false. Otherwise U exhausts L and returns true.

• findDom(c): finds the dominating entity dom(c) and the purity p(c) of a cluster c. If |c| ≤ 7, the size of the short-term memory (STM), then U does this entirely in STM. Specifically, U scans the list of values in c, maps each value into an entity, and keeps track of the number of times U has encountered a particular entity. Then U returns the entity with the highest count n as the dominating one, and n/|c| as the purity of cluster c. If |c| > 7, U proceeds as above, but uses paper and pen to keep track of the counts of the encountered entities.
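The counting that findDom performs can be sketched in Python. This sketch is ours, not the paper's tool; the `entity` oracle below is a hypothetical stand-in for the user's judgment of which entity a value refers to.

```python
from collections import Counter

def find_dom(cluster, entity):
    """Return (dominating entity dom(c), purity p(c)) for a cluster."""
    counts = Counter(entity(v) for v in cluster)  # tally entities, as the user does
    dom, n = counts.most_common(1)[0]             # highest-count entity dominates
    return dom, n / len(cluster)

entity = lambda v: v.split()[0].lower()           # toy oracle: first token, lowercased
print(find_dom(["Sony", "Sony Corp", "Vizio Corp"], entity))
```

On this cluster the sketch returns Sony as the dominating entity with purity 2/3, since two of the three values refer to it.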
The Split Procedure:

Algorithm 1 describes Split, a procedure that uses the above five operations to implement the split stage. Split takes the set of clusters output by the machine part, then applies the SplitCluster procedure to each cluster. We distinguish two kinds of procedures: GUI-driven and human-driven. Split and SplitCluster are GUI-driven, i.e., executed by the computer. A GUI-driven procedure, e.g., SplitCluster, may call human-driven procedures, e.g., isPure and findDom, then pass control to user U to execute those procedures. To distinguish between the two, we underline the names of human-driven procedures.

Figure 5: An example of local merging.

Algorithm 1 shows that SplitCluster handles the corner case of singleton clusters (Step 1), then calls isPure and asks user U to take over (Step 2). At the end of this procedure, U would have selected either the "yes" or "no" button, indicating whether the cluster is pure or not. In the former case, SplitCluster terminates, returning the pure cluster (Step 3). Otherwise, it calls findDom (Step 4), and so on. Note that at the end of findDom, user U knows the dominating entity e* and the purity α, but the computer does not know these. Hence these quantities (and all quantities that only U knows) are shown as underlined, e.g., e*, α.

Given a set of pure clusters output by the split stage, the merge stage merges clusters that refer to the same entity. Clearly, from each cluster we can select just a single representative value (say the longest string), then merge those (if we know how to merge those, we can easily merge the original clusters). For example, in Figure 4 the split stage produces clusters 3, 4, 5, 7, 9, and 10. To merge them, it is sufficient to consider merging the values "Sony Corp", "Lg", "SONY Corp", "Dell", "LG", and "Apple". Henceforth we will consider this simpler problem of merging n values v_1, ..., v_n.

Naively merging by considering all pairs takes quadratic time. To address this problem, we propose a two-step process. First, U does one pass through the list of values to do a "local merging" that merges matching values that are near one another. This reduces n. Then U does a "global merging" that considers all pairs (of the remaining values). Both steps will exploit the parallel processing capability of short-term memory (STM).
We now describethese two steps.
Local Merging:
This step uses STM to merge matching values that are near one another. Specifically, first the set of values is sorted. Currently we use alphabetical sorting, because matching values often share the first few characters (e.g., IBM, IBM Corp). Figure 5.a shows such a sorted list L (ignoring the arrows for now).

Next, user U processes the values in L top down. For each value, U stores it and the associated entity in STM. For the sake of this example, assume STM can only store three such pairs. Figure 5.b shows a full STM after U has processed the first three values of the list. Then, when processing the 4th value, "Garmin", U needs to evict the oldest pair from STM to make space for "Garmin" (see Figure 5.c).

Then, when processing the 5th value, "Ge", U realizes that its entity, GE, is already in STM, associated with a previous value "GE". So U links "Ge" with "GE", and replaces the value "GE" in STM with the new value "Ge" (see Figure 5.d). Next, "IBM" will be stored in STM, displacing "Gamevice" (Figure 5.e), and so on. At the end, U has linked together certain matching values (see the arrows in Figure 5.a).

Algorithm 2: Local Merging

Procedure LocalMerge(L)
  Input: a list of values L sorted alphabetically
  Output: links among certain values in L that match
  for i ← 1, ..., |L| do
    (e, t) ← memorize(L[i])
    if e is not null then
      // L[i] maps to e; e is already in STM and associated with t
      select(L[i])
      focus(t); select(t)                  // t is the previous value
      focus(link button); select(link button)
  // at the end, the user selects the "done local merging" button

Algorithm 2 describes local merging, which uses the previously defined user operations focus(a) and select(a) (see Section 3.1), as well as the following new user operation:

• memorize(v): U maps the input value v to an entity e, then memorizes, i.e., stores, the pair (e, v) in STM. Specifically, if a pair (e, t) is already in STM, U replaces it with (e, v), then exits, returning (e, t) (see Line 2 in Algorithm 2). Otherwise, U adds the pair (e, v) to STM, "kicking" the oldest pair out of STM to make space if necessary.
Global Merging:

After local merging, the original list of values is consolidated, i.e., from each set of linked values we again select just a single representative value (e.g., the longest one). This produces a new, shorter list; e.g., consolidating the list in Figure 5.a produces the shorter list in Figure 6.a (ignoring the arrow). Let L = [v_1, ..., v_n] be this new shorter list. Naively, user U can compare v_1 with v_2, ..., v_n, then v_2 with v_3, ..., v_n, etc. A better solution, however, is to exploit the parallel processing capability of STM: read multiple values, say v_1, ..., v_k, into STM all at once, then compare them all in parallel with v_{k+1}, ..., v_n, etc.

Example 3.5.
Consider again the list in Figure 6.a. User U can read the first two values, "Big Blue" and "GE", into STM, then scan the rest of the values and match them with these two in parallel (using a GUI, see Figure 6.b). If there is a match, e.g., "IBM Corp" and "Big Blue", then U checks off the appropriate box (see Figure 6.b). At the end of the list, U pushes a button to link the matching values. Next, U reads into STM the next two values, "Gamevice" and "Garmin", then matches "IBM Corp" with these two. U detects no more matches, thus wrapping up the global merge. The system uses the results of both local and global merges to produce the final clusters shown in Figure 6.c.

In practice, even though STM can hold 7 objects [36], we found that users prefer to read only 3 values at a time into STM. First, 7 values often take up too much horizontal space on the GUI (especially if the strings are long), making it hard for users to comprehend. Second, users want to reserve some STM capacity to read and remember the values in the rows. As a result, we currently use k = 3.

Figure 6: An example of global merging.

Figure 7: How HAC and HAC with a limit on cluster size work on the same dataset.

• recall(v): U maps the input value v to an entity e, then checks to see if e is already in STM, returning e and the associated value if yes, and null otherwise.

The Merge procedure (Appendix A) implements the entire merge stage. It calls LocalMerge on the output of the split stage, then GlobalMerge on the output of LocalMerge. The following theorem (whose proof involves the verification sets of actions) shows the correctness of the human part:

Theorem 3.6. Let V be a set of strings to be normalized. Applying the Split followed by Merge procedures to any partition P of V produces a set of clusters with 100% precision and recall, assuming that the user correctly executes the operations, per their instructions.

We now discuss the machine part of VN plans, which applies an algorithm to cluster the input strings. Many algorithms can be used, e.g., string clustering, string matching (SM), and entity matching (EM). We first discuss using string clustering algorithms, specifically HAC (hierarchical agglomerative clustering). Then we discuss why existing SM/EM algorithms do not work well for our purposes.
We consider using a generic clustering algorithm in the machine part. Many such algorithms exist [27, 48]. For now, we consider hierarchical agglomerative clustering (HAC), because it is easy to understand and debug, can achieve good accuracy, and is commonly used in practice. To cluster a set of values, HAC initializes each value as a singleton cluster. Next, it finds the two clusters with the highest similarity score (using a pre-specified similarity measure), merges them, then repeats, until reaching a stopping criterion, e.g., the highest similarity score falling below a pre-specified threshold.
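A minimal single-link HAC of this kind can be sketched in Python. This is an illustrative sketch of ours, not the paper's implementation: the character-based similarity measure and the 0.8 threshold are arbitrary stand-ins for the pre-specified measure and stopping criterion.

```python
from difflib import SequenceMatcher

def sim(a, b):
    """A stand-in string similarity; any pre-specified measure would do."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def hac(values, threshold=0.8):
    """Single-link agglomerative clustering with a similarity-threshold stop."""
    clusters = [[v] for v in values]              # start from singletons
    while True:
        best, pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = max(sim(a, b) for a in clusters[i] for b in clusters[j])
                if s >= best:
                    best, pair = s, (i, j)
        if pair is None:                          # highest score below threshold
            return clusters
        i, j = pair
        clusters[i] += clusters.pop(j)            # merge the best pair, repeat

print(hac(["LG", "Lg", "Sony", "Sonny"]))  # [['LG', 'Lg'], ['Sony', 'Sonny']]
```

On this toy input the algorithm first merges "LG" with "Lg", then "Sony" with "Sonny", and stops once no remaining pair clears the threshold.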
Example 4.1.
Consider clustering the seven values in Figure 7.a. HAC may first cluster "LG" and "Lg" into a cluster c_1, then "Sony" and "Sonny" into c_2, then c_2 and "Sony Corp" into c_3, etc. The final result is the two clusters shown in Figure 7.a.

Problems with Large Mixed Clusters Produced by HAC:
Using HAC βas isβ however does not work well, because it oftenproduces large mixed clusters that are time consuming for user π to clean. Specifically, as HAC iterates, it grows bigger clusters.Initially, when these clusters are small, their quality is often quitegood, because they often group together syntactically similarvalues that refer to the same entity (e.g., βLGβ and βLgβ, or βSonyβand βSonnyβ, see Figure 7.a).As the clusters grow, however, they start attracting βjunkβ,e.g., cluster π attracts βIBM Corpβ (Figure 7.a). If the similaritymeasure used by HAC happens to be βliberalβ for the data setat hand, HAC often grows large clusters that are βmixedβ, i.e.,containing the values of multiple entities. It is very expensive for Figure 8: Cleaning (a) is less work than cleaning (b). user π to clean such clusters, using the Split and
Merge proceduredescribed in the previous section.
Proposed Solution:
Ideally, HAC should stop before its clusters become too "mixed". If HAC's clusters are smaller but pure (e.g., Figure 8.a), then U mostly just has to merge these clusters using a few mouse clicks. However, if HAC's clusters are larger but "mixed" (e.g., Figure 8.b), then U would need to split them up into pure clusters before merging them. This incurs far more mouse clicks and thus far more work.

Of course, we do not know when to stop HAC. To address this problem, we introduce multiple HAC variations, each stopping at a different time, then try to select a good one. Specifically, to cluster n values, we consider n HAC variations, where the i-th variation, denoted HAC(i), limits the cluster size to at most i. In each iteration, HAC(i) finds the two clusters c and d with the highest similarity score, then merges them if |c ∪ d| ≤ i. Otherwise, HAC(i) finds the two clusters with the next highest score, and merges them if the resulting size is at most i, and so on. HAC(i) terminates when it cannot find any more clusters to merge.

Example 4.2. Consider applying HAC(2) to the values in Figure 7.a. HAC(2) first forms the clusters {"LG", "Lg"} and {"Sony", "Sonny"}, exactly as the normal HAC. Then the normal HAC goes on to form the larger cluster in Figure 7.a, but HAC(2) cannot, because the resulting size would exceed 2. Instead, HAC(2) finds the next two clusters with the highest similarity score. Suppose these are the singleton clusters for "Sony Corp" and "Sony Inc". Then HAC(2) merges them to form a new cluster in Figure 7.b. At this point HAC(2) cannot form any more clusters, because any resulting cluster size would exceed 2. So it stops, returning the clusters in Figure 7.b as the output.

HAC(1) produces the smallest but cleanest clusters (as they are singletons). As we increase i, HAC(i) tends to produce bigger but less clean clusters. Typically, there exists an i* such that HAC(i*)'s clusters are still clean enough to help user U, but HAC(i*+1)'s clusters are already "too dirty" to help (e.g., U would need to split them extensively before he or she can merge). This roughly corresponds to the point where we want HAC to stop. HAC(i*) thus is the "best" HAC variation for the current data set.

To find HAC(i*), we pair HAC(1), ..., HAC(n) with Split and
Merge to form n end-to-end plans. Sections 5 and 6 show how to estimate the costs of these plans and find the one with the least estimated cost.

We are now in a position to explain why existing string matching (SM) and entity matching (EM) solutions do not work well in our context. (Section 8 shows experimentally that Winston with HAC outperforms these solutions.)

At its core, VN is an SM problem. So SM solutions can be used in the machine part. EM solutions can also be used, by limiting each entity to be a string. Many such solutions have been developed, e.g., TransER [47], Magellan [32], Falcon [13], Waldo [44] (see Section 9).

These SM/EM solutions (e.g., Magellan, Falcon) typically output a set of matches. One way to use them is to ask user U to verify certain matches, then infer even more matches using match transitivity. For example, given 5 strings a, b, c, d, e, suppose a solution outputs a = b and b = c as matches. If U has verified these matches, then we can infer a = c as another match. A recent work, TransER [47], exemplifies this approach.
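To make match transitivity concrete, here is a small sketch (our illustration, not TransER's code) using union-find: the verified matches a = b and b = c let us infer a = c, but no amount of transitivity over those outputs ever yields d = e:

```python
class UnionFind:
    # Minimal union-find that closes a set of verified matches under transitivity.
    def __init__(self, items):
        self.parent = {x: x for x in items}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

    def matched(self, x, y):
        return self.find(x) == self.find(y)

uf = UnionFind(["a", "b", "c", "d", "e"])
for x, y in [("a", "b"), ("b", "c")]:  # matches verified by the user
    uf.union(x, y)

print(uf.matched("a", "c"))  # prints True: inferred by transitivity
print(uf.matched("d", "e"))  # prints False: never inferable from these outputs
```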
A serious problem, however, is that we cannot guarantee 100% recall, as shown experimentally in Section 8. For example, no user verification and match transitivity on the outputs a = b, b = c can help us infer d = e (assuming this is also a correct match). Thus, these solutions are not appropriate for Winston.

Another way to use existing SM/EM solutions is to cluster the input strings in a way that respects the output matches. The work [43] describes multiple ways to do this. Continuing with the above example, given the output matches a = b, b = c, we can cluster the five input strings into, say, 3 clusters {a, b, c}, {d}, {e}. User U can verify/clean these clusters, as discussed in the human part. A serious problem here, however, is that this approach often produces large mixed clusters, which are very time consuming for U to clean, as shown experimentally in Section 8. With HAC, we solve this problem by modifying HAC to stop early to produce clean clusters (see Section 4.1). But there is no obvious way to modify the clustering algorithms in [43] to stop early such that they produce relatively clean clusters whose quality can be estimated (e.g., see Section 5).

The above works provide no GUI, or only very basic, inefficient GUIs for user feedback, e.g.,
Falcon and
TransER ask users to label string pairs as match/non-match. A recent work, Waldo [44], considers a far more efficient GUI, which displays 6 strings so that a user can cluster all of them in one shot. As such, its "human" part is more similar to ours. But its "machine" part considers a very different optimization problem: minimizing crowdsourcing cost (e.g., clustering 6 strings incurs the same monetary cost, regardless of which human user does it). Thus, it cannot be used in Winston, which focuses on minimizing the effort of human users.
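Returning to the size-capped variant HAC(i) of Section 4.1, a minimal sketch (our illustration, reusing an assumed bigram-Jaccard, single-linkage similarity): the only change to plain HAC is that candidate merges are considered in decreasing similarity order and a merge is skipped whenever the resulting cluster would exceed the cap i.

```python
from itertools import combinations

def _bigrams(s):
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)} or {s}

def _sim(c1, c2):
    # Illustrative single-linkage bigram-Jaccard similarity between clusters.
    def jac(a, b):
        return len(a & b) / len(a | b)
    return max(jac(_bigrams(x), _bigrams(y)) for x in c1 for y in c2)

def hac_capped(values, cap, threshold=0.3):
    # HAC(i): consider candidate merges in decreasing similarity order and
    # perform the best one whose combined cluster size is at most `cap`.
    clusters = [[v] for v in values]
    while True:
        pairs = sorted(
            (((a, b), _sim(clusters[a], clusters[b]))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda p: -p[1])
        merged = False
        for (a, b), score in pairs:
            if score < threshold:
                break  # no remaining pair is similar enough
            if len(clusters[a]) + len(clusters[b]) <= cap:
                clusters[a] = clusters[a] + clusters[b]
                del clusters[b]
                merged = True
                break
        if not merged:
            return clusters
```

On the Example 4.2-style input, `hac_capped(["LG", "Lg", "Sony", "Sonny", "Sony Corp"], 2)` forms {"LG", "Lg"} and {"Sony", "Sonny"} but leaves "Sony Corp" alone, since merging it anywhere would exceed the cap of 2.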
We now discuss estimating the cost of a plan, which is the total time user U spends in the human part to clean the clusters output by the machine part. As we will see below, the key idea is to estimate the quality of these clusters, then use that to estimate the time needed to clean them.

Specifically, let V = {v_1, ..., v_n} be the set of input values, and P_1, ..., P_n be the plans that we will consider, where each plan P_i applies HAC(i) to V to obtain a set of clusters C_i, then employs a user U to clean C_i, using Split and Merge. Let C_i = {c_1, ..., c_{m_i}}. Then the cost of P_i (i.e., the time for U to clean C_i) can be expressed as

cost(P_i) = Σ_{j=1}^{m_i} time(SplitCluster(c_j)) + time(LocalMerge(L)) + time(GlobalMerge(M)),

where L is a list of values summarizing the output of SplitCluster, and M is a list summarizing the output of LocalMerge. We now estimate these quantities.
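One building block used repeatedly in these estimates is the number of splits β_j a cluster of size n_j and purity α_i is expected to need, min(n_j − 1, ⌈−log n_j / log(1 − α_i)⌉), derived below under the geometric-shrinkage assumption. A small sketch (ours, for illustration):

```python
import math

def estimated_splits(n, alpha):
    # Under the assumption that each split peels off the dominating alpha
    # fraction, the remainder shrinks geometrically: (1 - alpha)^beta * n = 1.
    # A cluster of size n can also be split at most n - 1 times.
    if n <= 1:
        return 0
    return min(n - 1, math.ceil(-math.log(n) / math.log(1 - alpha)))
```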
Estimating the Cost of SplitCluster: We need to estimate time(SplitCluster(c_j)) for each cluster c_j ∈ C_i. To do this, we make two assumptions:

(1) All clusters c_1, ..., c_{m_i} produced by P_i have the same cluster purity α_i (which is defined in Definition 3.4).

(2) When we use SplitCluster to split a cluster c_j (produced by P_i) into a pure cluster containing all values of the dominating entity and a "mixed" cluster containing all the remaining values, the "mixed" cluster also has purity α_i. When we split this "mixed" cluster, the resulting "mixed" cluster also has purity α_i, and so on.

Table 1: Parameters for our cost models.

Parameter | Meaning | Model | Estimation Method
α_i | Cluster purity for HAC(i) | α_i = a·i^b | User feedback
t_f | Cost of focus(a) | A constant | Set to 0.5
t_s | Cost of select(a) | A constant | Set to 0.5
t_m | Cost of match(x,y) | A constant | User feedback
t_p | Cost of isPure(c) | t_p(n, α) = γ₁·α·n + γ₂ | User feedback
t_d | Cost of findDom(c) | t_d(n) = δ₁·n if n ≤ |STM|, δ₂·n + δ₃ otherwise | User feedback
t_z | Cost of memorize(v) | A constant | Set to 0.4
t_r | Cost of recall(v) | A constant | Set to t_z
s | Shrinkage factor for local merge | A constant | Set to 0.98
h | Hit factor for global merge | A constant | Set to 0.1

These are obviously simplifying assumptions. However, they reflect the intuition that each plan HAC(i) produces clusters of a certain quality level, and that this quality level can be captured by a single number, α_i, the purity of all the clusters. Further, they allow us to efficiently estimate plan costs. Finally, Section 8 empirically shows that with these assumptions we can already find good plans.

Next, we use the above assumptions to estimate time(SplitCluster(c_j)). Suppose that we already know α_i (we show how to estimate α_i later in this section), and that α_i ≥ 0.5. In the first split, using SplitCluster, user U creates two clusters: a pure dominating cluster c_{j,w} of size α_i·n_j, where n_j is the size of c_j, and a remainder cluster c_{j,u} of size (1 − α_i)·n_j. If |c_{j,u}| > 1, we assume that its purity is also α_i. U then splits c_{j,u}, etc. After β_j splits, U has created β_j + 1 clusters of sizes α_i·n_j, α_i(1 − α_i)·n_j, ..., α_i(1 − α_i)^{β_j − 1}·n_j, (1 − α_i)^{β_j}·n_j, such that the last cluster has a single element. Solving (1 − α_i)^{β_j}·n_j = 1 lets us estimate β_j as ⌈−log n_j / log(1 − α_i)⌉. Since we can split a cluster of size n_j at most n_j − 1 times, β_j = min(n_j − 1, ⌈−log n_j / log(1 − α_i)⌉).

Recall that Section 3 defines seven user operations: focus(a), select(a), match(x,y), isPure(c), findDom(c), memorize(v), and recall(v) (see Table 1). Let t_f, t_s, t_m, t_p, t_d, t_z, and t_r be their costs (i.e., times), respectively. As we will see later, the cost t_p of isPure(c) is a function of n, the size of c, and α, the purity of c. Hence, abusing notation, we will denote this cost as t_p(n_j, α_i) for cluster c_j. Similarly, the cost t_d of findDom(c) is a function of the size of c, and will be denoted as t_d(n_j) for cluster c_j. The remaining five costs (e.g., t_f, t_s, etc.) will be constants. Now we can estimate time(SplitCluster(c_j)) where α_i ≥ 0.5:

cost_SplitCluster(c_j; α_i ≥ 0.5) = Σ_{k=1}^{β_j} [ t_p((1 − α_i)^{k−1}·n_j, α_i) + t_f + t_s + t_d((1 − α_i)^{k−1}·n_j) + t_f + t_s + ((1 − α_i)^{k−1}·n_j)·(t_f + t_m + (1 − α_i)·t_s) + t_f + t_s ].

Appendix B discusses deriving the above formula, and computing the cost for the remaining cases of smaller α_i.

Estimating the Cost of LocalMerge:
Recall that HAC(i) produces the set of clusters C_i = {c_1, ..., c_{m_i}}, and that β_j is the total number of splits user U performs in SplitCluster for each cluster c_j. Then at the end of the split phase, U has produced a set of q_i pure clusters, where q_i = Σ_{j=1}^{m_i} (β_j + 1). Assuming that executing LocalMerge on any list will shrink its size by a factor of s (currently set to 0.98), we can estimate the time of executing LocalMerge on the output of the split phase as q_i·t_z + q_i·(1 − s)·(t_f + t_s) + t_f + t_s (see Appendix B for an explanation).

Estimating the Cost of GlobalMerge:
LocalMerge produces q'_i = s·q_i pure clusters to which user U will apply GlobalMerge. Recall that GlobalMerge takes the first three values in the input list L, displays them in three columns, then asks U to go through the rest of the values of L and check a box if any value matches the values of the columns (see Figure 6), and so on. We assume that in each such iteration, for each column, a fraction h of the values will match, resulting in h·q'_i checkboxes being marked per column. Then we can estimate the cost of GlobalMerge as

Σ_{k=1}^{⌈1/(3h)⌉} [ 3·t_z + (q'_i − 3(k−1)·h·q'_i − 3k)·t_r + (3h·q'_i)·(t_f + t_s) + t_f + t_s ]

(see Appendix B).

Estimating the Cluster Purity α_i: Recall that we assume all clusters c_1, ..., c_{m_i} produced by HAC(i) have the same cluster purity α_i. Using set-aside datasets, we found that α_i could be estimated reasonably well using a power-law function a·i^b (where b is negative, see Table 1). To estimate a and b, we compute α_10 and α_20. To compute α_10, we apply HAC(10) to the set of input values to obtain a set C of clusters. Next, we randomly sample 3 clusters of size 10 from C (if there are fewer than 3 such clusters, we select the three largest). Next, we show each cluster to user U, ask him/her to identify all values referring to the dominating entity, then use those to compute the cluster purity. Finally, we take the average purity of these clusters to be α_10. We proceed similarly to compute α_20.

We now have three data points: (1, 1), (10, α_10), and (20, α_20), which we can use to estimate a and b in the function a·i^b, using the ordinary least-squares method.

Estimating the Costs of User Operations:
Finally, we estimate the costs of the seven user operations (see Table 1). The costs of focus(a) and select(a) measure the times user U focuses on an object a then selects it (e.g., by clicking a mouse button). After a number of timings with various users, we found that these times are roughly the same for most users, and we set them to t_f = t_s = 0.5 and t_z = t_r = 0.4.

The cost t_m of match(x,y), however, while largely not dependent on x and y, does vary depending on user U. Further, estimating the time t_p of isPure(c) and the time t_d of findDom(c) is significantly more involved. To determine whether a cluster c is pure, user U needs to examine at most α·n values in c (where n is the size of c) before he/she sees the first value not referring to dom(c). Hence, we model the time of isPure(c) as t_p(n, α) = γ₁·α·n + γ₂. To find the dominating entity of cluster c, we distinguish two cases. If n ≤ |STM|, then user U can execute findDom(c) entirely in U's short-term memory. In this case the time is proportional to n. Otherwise U needs to use paper and pen, and we found that the time still roughly correlates with n. Thus, we model the time t_d(n) of findDom(c) as δ₁·n if n ≤ |STM| and as δ₂·n + δ₃ otherwise.

All that is left is to estimate the cost t_m of match(x,y), and the parameters γ₁, γ₂, δ₁, δ₂, δ₃ of the cost models of isPure(c) and findDom(c). To do so, when running HAC(20) (to estimate the cluster purity), we also ask user U to perform a few match, isPure, and findDom operations, then use the recorded times to estimate the above quantities (see Appendix B). Altogether, the time it takes for users to calibrate the cluster purity α_i and the cost models of user operations was mere minutes in our experiments (and was included in the total time of our solution).

Recall that to cluster the values V = {v_1, ..., v_n}, we consider n plans P_1, ..., P_n, where each plan P_i applies HAC(i) to V to obtain a set of clusters C_i, then employs a user U to clean C_i.
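The power-law purity calibration described above, fitting a·i^b through (1, 1), (10, α_10), (20, α_20) by ordinary least squares, can be sketched in log-log space; the sample purities below are made up for illustration:

```python
import math

def fit_power_law(points):
    # Ordinary least squares for alpha = a * i**b, done in log-log space:
    # log(alpha) = log(a) + b * log(i).
    xs = [math.log(i) for i, _ in points]
    ys = [math.log(alpha) for _, alpha in points]
    n = len(points)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - b * mx)
    return a, b

# Hypothetical calibration: purity 1.0 at i = 1, 0.8 at i = 10, 0.7 at i = 20.
a, b = fit_power_law([(1, 1.0), (10, 0.8), (20, 0.7)])
```

With purities decreasing in i, the fitted exponent b comes out negative, matching the model in Table 1.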
We now discuss how to efficiently find the plan P_{i*} with the least estimated cost. Naively, we can (1) execute HAC(i) for each plan P_i to obtain C_i, (2) apply the cost estimation procedures in the previous section to C_i to compute the cost of P_i, then (3) return the plan with the lowest cost. Steps 2-3 take negligible time. Step 1, however, applies HAC(1), ..., HAC(n) separately to V, which altogether can take a long time, e.g., 7.3 minutes for |V| = 480 and 1.1 hours for |V| = 960 in our experiments.

To address this problem, we have developed a solution to jointly execute HAC(1), ..., HAC(n), such that executing a plan can reuse the intermediate results of executing a previous plan. Specifically, we first execute HAC(n), i.e., the regular HAC. Recall that each iteration Iter_t of HAC(n) merges two clusters. Let S(t) be the size of the largest cluster at the end of Iter_t. Suppose there is a t such that S(t) ≤ i but S(t+1) > i. Then we know that HAC(i) can reuse everything HAC(n) has produced up to Iter_t, but cannot proceed to Iter_{t+1}. So at the end of Iter_t we save certain information for HAC(i) (e.g., the merge commands so far, the value i), then continue with HAC(n). Once HAC(n) is done, we go back to each saved point and resume HAC(i) from there. This strategy enables great reuse, especially for high values of i, e.g., slashing the time for 960 values from 1.1 hours to 18 secs.

Putting It All Together:
We can now describe the entire Winston system, as used by a single user U. Given a set of values V to normalize, (1) Winston first calibrates the cluster purity α_i and the cost models. To do so, it runs HAC(10) and HAC(20), asks user U to perform a few basic tasks on sample clusters from these algorithms, then uses U's results to calibrate (see Section 5). (2) Winston runs the above search procedure to find a plan P_{i*} with the least estimated cost. (3) Finally, Winston sends the output clusters of P_{i*} to user U to clean, using the procedures Split and Merge.

So far we have discussed how
Winston works with a single user. In practice, however, multiple users (e.g., people on the same team) are often willing to jointly perform VN. We now discuss how to extend Winston to divide the work among such users, to speed up VN.

Consider the case of 3 users. Naively, we can divide the set of input strings into 3 equal parts, ask each user to apply Winston to perform VN for a part, then combine the three outputs to form a set of clusters. We can obtain a canonical string from each cluster, producing a new list of strings. Then we can divide this new list among the 3 users, repeat the process, and so on. This naive solution however does not work well, because it often spreads matching strings, i.e., those belonging to a golden cluster, among all 3 users, causing much additional work in matching across the individual lists in later steps.

Intuitively, strings within a golden cluster should be assigned to a single user, as much as possible. We have extended Winston to realize this intuition. In the extension, Winston first briefly interacts with each user to learn his/her profile. Next, it uses these profiles to search a plan space to find a good VN plan. Next, it executes the machine part of this plan to produce a set C of clusters. It then divides C among the users, such that each will have roughly the same workload. The intuition here is that a cluster in C captures many strings that belong to the same golden cluster, and is assigned to a single user. Next, Winston asks each user to use Split and
Merge to clean the assigned clusters.

Table 2: Datasets for our experiments.

Name | Size | Description | Sample Values
Nickname | 5132 | Nicknames and some of their typos | "Cissy", "Fanny", "Frannie"
Citation | 3000 | Article citations from Google Scholar and DBLP | "caching technologies for web applications c mohan vldb 2001"
Life Stage | 199 | Target life stage(s) of products | "Maternity", "Mothers", "Youth|Young Professionals"
Big Ten | 74 | Names of Big Ten Conference colleges | "University of Iowa", "UIowa", "UM Twin Cities"

Finally, it obtains the set of (cleaned) clusters from all users, then repeatedly performs a distributed version of
GlobalMerge until all clusters have been verified and cleaned. Appendix C describes the algorithm in detail, provides the pseudo code, and discusses cost estimation procedures for this version of Winston.

We now evaluate Winston. Among others, we show that Winston can significantly outperform existing solutions, that it can leverage multiple users to drastically cut VN time, and that it can scale to large datasets.
We first compare Winston with state-of-the-art manual and clustering solutions (Section 8.3 considers string/entity matching solutions). We use the four datasets in Table 2, obtained online and from VN tasks at a company. For each dataset we manually created all correct clusters, to serve as the ground truth. (We consider larger datasets later in Section 8.4.)
The Existing Solutions: We consider four solutions: Manual, Merge, Quack, and OpenRefine. Manual is the typical manual method that we have observed in industry. It can be viewed as performing the GlobalMerge method (Section 3.2). Merge is our own manual VN method, which performs LocalMerge then GlobalMerge.

Quack is a string clustering tool used extensively for VN at a company. It also uses HAC, like Winston, but does not place a limit on the cluster size. We extended Quack by asking the user to clean the clusters using Split and Merge. Merge and Quack can be viewed as the two plans HAC(1) and HAC(n) in the plan space explored by Winston (where n is the number of values to be normalized).

OpenRefine is a popular open-source tool to wrangle data [1]. It uses several string clustering algorithms to perform VN [2]. Among these, the most effective one appears to be KNN-based clustering [2]. We extend this algorithm to work with Split and Merge (because the GUI provided by OpenRefine is very limited).
Results: Table 3 shows the times of Winston vs. the above four methods (in minutes), using a single user. For each method we measure the total time the user spends cleaning the clusters (for Winston this includes the calibration time). It is difficult to recruit a large number of real users for these experiments, because cleaning some datasets (e.g., Nickname) would take a few working days. So we use synthetic users, and each data point here is averaged over 100 such users; see Appendix D. (We use real users to "sanity check" these results in Section 8.4.)

The table shows that Manual performs worst, incurring 3-6800 minutes. Merge performs much better, especially on the two large datasets, incurring 4-1961 minutes, suggesting that performing a local merge before a global merge is important. Merge is clearly the manual method to beat.

Quack is a bit faster than Merge on Nickname (1808 vs. 1961), but slower on the remaining three datasets. OpenRefine's performance is very uneven. It is a bit faster than Merge on Citation, but far slower on the other three datasets.

Table 3: Winston vs. four existing solutions.

Dataset | Manual | Merge | Quack | OpenRefine | Winston | Savings
Nickname | … | … | … | … | … | …
Citation | … | … | … | … | … | …
Life Stage | 13 | 9 | 15 | 12 | 9 | 0 hrs
Big Ten | … | … | … | … | … | …
Table 4: The times of Winston with multiple users.

Dataset | 1 User | 3 Users | 5 Users | 7 Users | 9 Users
Nickname | 1512 | 890 | 610 | 460 | 412
Citation | 1112 | 428 | 278 | 212 | 177
Life Stage | 9 | 6 | 6 | 6 | 6
Big Ten | 7 | 4 | 4 | 4 | 4
In contrast, Winston performs much better than Merge. On Nickname it saves 7.5 hours of user time (see the last column). On Citation it saves 3.34 hours of user time. On Life Stage it is comparable to Merge, and on Big Ten it is only 3 mins worse (due to the overhead of user calibration time).

Winston also outperforms both Quack and OpenRefine. Importantly, in all cases where Quack or OpenRefine performs worse than Merge, Winston is able to select a good plan which allows it to outperform Merge.

We have shown that Winston outperforms existing manual and clustering methods. We now examine how Winston can leverage multiple users to reduce VN time. Table 4 shows that Winston can leverage multiple users to drastically cut the VN time, e.g., from 1512 minutes with 1 user to 412 with 9 users for Nickname, and from 1112 to 177 for Citation. The most significant reduction is achieved early, e.g., going from 1 to 3-5 users. After that, adding more users still helps reduce the VN time, but only in a "diminishing-returns" fashion.
We now compare Winston to existing string matching (SM) and entity matching (EM) solutions, specifically TransER [47], Falcon [13], and Magellan [32].

Comparing with TransER: As discussed in Section 4.2, there are two main ways to use SM/EM solutions in our context. First, a solution can produce a set of matches M, employ a user U to verify certain matches in M, then use match transitivity to infer even more matches. The work [47] describes such a solution, which we call TransER.

The main problem, as discussed in Section 4.2, is that such solutions cannot guarantee 100% recall. Consider TransER, which declares two strings matched if their Jaccard similarity score is at least a threshold α. Assuming a perfect user U who does not make mistakes when verifying matches, Figure 9 shows the recall of TransER on our four datasets as we vary α. It shows that to reach 100% recall, α must be set to less than 0.08. But that would produce a huge number of matches (almost the entire Cartesian product), which would require a huge amount of effort from the user to verify. In such cases, it is not difficult to show that TransER would perform worse than
Merge.

Comparing with Falcon and Magellan:
The second way to use current SM/EM solutions is to produce the matches, then group them into clusters. To examine this approach, we use Falcon [13] and Magellan [32]. A recent work (name withheld for anonymous reviewing) has adapted Falcon to SM, and shown that it outperforms existing SM solutions. Thus, Falcon can be viewed as a state-of-the-art SM solution. Magellan, on the other hand, can be viewed as a state-of-the-art EM solution. To learn a matcher, both
Falcon and
Magellan require the user to label a set of pairs as match/non-match. In Magellan the user can also debug the matcher to improve its accuracy.

Figure 9: Recall of TransER for varying threshold α.

Table 5: The human times of Falcon.

Dataset | 1 User | 3 Users | 5 Users | 7 Users | 9 Users | Labeling
Nickname | 1930 (418) | 1519 (629) | 1210 (600) | 958 (498) | 807 (395) | 14
Citation | 1114 (2) | 1302 (874) | 779 (501) | 562 (350) | 459 (282) | 19
Life Stage | 22 (13) | 21 (15) | 21 (15) | 22 (16) | 22 (16) | 20
Big Ten | 21 (14) | 21 (17) | 21 (17) | 21 (17) | 21 (17) | 20

Table 6: The human times of Magellan.

Dataset | 1 User | 3 Users | 5 Users | 7 Users | 9 Users | Labeling & Debugging
Nickname | 1482 (-30) | 1062 (172) | 788 (178) | 646 (186) | 563 (151) | 89
Citation | 1150 (38) | 900 (472) | 599 (321) | 481 (269) | 393 (216) | 109
Life Stage | 85 (76) | 85 (79) | 85 (79) | 85 (79) | 85 (79) | 84
Big Ten | 85 (78) | 85 (81) | 85 (81) | 85 (81) | 85 (81) | 84

Once
Falcon and
Magellan have produced the matches, we use the Markov clustering in [43] to partition the input strings into clusters that are consistent with the matches. Finally, we ask one or more users to clean the clusters using Split and Merge.

Table 5 shows the human time for Falcon on the four datasets. For example, the first cell "1930 (418)" means that for 1 user, Falcon incurs 1930 mins of human time, 418 mins more than Winston. This time includes the labeling time (14 mins, shown in the last column). The table shows that Winston outperforms Falcon in all cases, reducing human time by 2-874 mins. The larger the dataset, the larger the gain, e.g., more than 14.5 hours on Citation, using 3 users.

Table 6 shows the human time for Magellan on the four datasets. The meaning of the table cells here is similar to those for Falcon. The table shows that Winston outperforms Magellan in all cases, reducing human time by 38-472 mins, except in the case of 1 user for Nickname, where it is slower by 30 mins (the -30 entry in Table 6).

The above time includes labeling and debugging (the last column). Interestingly, even if we ignore the labeling and debugging time, Winston still outperforms Magellan by a large margin in all cases requiring 3, 5, 7, and 9 users, for Nickname and Citation. It is slower only in the case of 1 user, by 119 mins for Nickname and 71 mins for Citation. Thus, overall Winston outperforms Magellan. In addition, Winston is suitable for lay users, whereas Magellan requires the user to have expertise in EM and machine learning.

A major reason for the worse performance of Falcon and Magellan is that they often produce large mixed clusters. For example, Magellan produces clusters of up to 314 strings coming from 137 real-world entities on Nickname, and clusters of up to 98 strings coming from 62 real-world entities on Citation. Clearly, it is very time consuming for the user to clean such clusters. In contrast, Winston selects VN plans that produce clusters of only up to 20 strings, which are much easier for the user to understand and clean.

"Sanity Check" with Real Users:
We want to "sanity check" our results so far using real users. Extensive checking is very difficult because it is hard to recruit real users for these time-consuming experiments. As a result, we carried out a limited checking. Specifically, we performed stratified sampling to obtain a Nickname sample of 316 values and a Citation sample of 343 values. On each sample we recruited multiple real users and asked them to perform Merge, Winston, and Winston with 3 users, taking care to minimize user bias.

Figure 10: "Sanity check" with real users.

The right side of Figure 10.a shows the results for Nickname. For comparison purposes, the left side of the figure shows the times with synthetic users. Figure 10.b shows similar results for Citation. The figures show that "Simulation" approximates "Real User" quite well. In both cases, the ordering of the methods is the same. Further, the results show that Winston can do much better than Merge, and the 3-user version in turn can do much better than single-user Winston. While limited, this result with real users does provide some anecdotal support for our simulation findings.
Finding Good Plans: Table 7 shows that Winston finds good plans. Consider Nickname. Recall that we ran 100 synthetic users for this dataset. For each user, Winston estimated the costs of all plans, then selected the plan with the least estimated cost. Knowing the gold clusters, however, we can simulate how that user executes each plan and thus compute the plan's exact cost. This allows us to find the rank of the selected plan on the list of all plans sorted by increasing cost, as well as the time difference between the selected plan and the best plan.

The first row of Table 7 shows this information. Here, Winston considered a space of 100 plans. For all 100 users, it selected the plan ranked 2nd. The difference between this plan and the best plan, however, is just 3-4 mins (over 100 users). The next two cells show the average/min/max times of the best plan, and the average/min/max difference in percentage. The remaining rows are similar. Thus, Winston did a good job. In many cases, it selected top-ranked plans, and most importantly, all the selected plans differ in time from the best plans by only 0-14% (see the last column).

Table 7: The quality of the plans found by Winston.

Dataset | Picked Plan Rank (Freq) | Size of Plan Space | Time Diff to Best Plan | Time of Best Plan | Diff in %
Nickname | … | … | … | … | …
Citation | … | … | … | … | …
Life Stage | … | … | … | … | …
Big Ten | … | … | … | … | …

Scaling to Large Datasets: Finally, we examine how Winston scales to large datasets. Table 8 shows the estimated cleaning time of Merge, Quack, Winston, and Winston with 3 users, for synthetic datasets of various sizes. The table shows that Merge is not practical, taking 29 days, 4.4 years, and 11.5 years for 100K, 500K, and 1M strings, respectively. Quack is better, but still incurs huge times.

Table 8: Cleaning times vs. dataset sizes.

Dataset Size | Merge | Quack | Winston | Winston (3 users)
10K | 22 | 22 | 13 | 4
100K | 231 (29d) | 199 | 107 (13d) | 34 (4.2d)
500K | 9415 (4.4y) | 3231 | 1688 (9.6m) | 387 (2.2m)
1M | 24449 (11.5y) | 19971 | 2710 (1.3y) | 618 (3.5m)

Winston, in contrast, can reduce these times drastically, to just 13 days, 9.6 months, and 1.3 years, respectively. As discussed in Section 1, this is because Winston provides a better UI, so the user can do more with less effort. Further, the machine part of Winston outputs clusters that are "user friendly", i.e., requiring little effort for the user to clean. Finally, Winston searches a large space of plans to find one with minimal estimated human effort. Winston with 3 users does even better, cutting the times to clean 500K and 1M strings to just 2.2 and 3.5 months, respectively. These suggest that cleaning large datasets with Winston indeed can be practical, especially by dividing the work among multiple users.
Data Cleaning:
Data cleaning has received enormous attention(e.g., [3, 5, 9, 11, 14, 16, 17, 20β24, 26, 30, 31, 33, 34, 37β39, 47]). See[10, 12, 15, 40] for recent tutorials, surveys, and books. However,as far as we can tell, no published work has examined the problemof cleaning with 100% accuracy, as we do for VN in this paper.
Our work here shows that the problem of cleaning to reach a desiredlevel of accuracy raises many novel challenges for data cleaning.
Value Normalization:
Much work has addressed VN, typically under the name "synonym discovery". Most solutions use string/contextual similarities to measure the relatedness of values [8, 49], and employ various techniques, e.g., clustering, regular expressions, learning, etc. [35, 49], to match values. However, no work has examined verifying and cleaning VN results to reach 100% accuracy, as we do here.
Clustering:
Our work is related to clustering (which we use in VN). Numerous clustering algorithms exist [18, 27, 48], but we are not aware of any work that has developed a human-driven procedure to clean up clustering output and tried to minimize the human effort of this procedure. Much work has also tuned clustering (e.g., [6, 7]), but for accuracy. In contrast, our work can be viewed as tuning clustering to minimize the post-clustering cleaning effort.
String/Entity Matching for the βMachineβ Part:
At the core, VN is a matching problem, and hence string matching (SM) and entity matching (EM) solutions can be used in the "machine" part. Numerous such solutions have been developed (e.g., TransER, Falcon, Magellan, Waldo, and more [13, 32, 44, 47]). We have discussed in Section 4.2 and experimentally validated in Section 8.3 that these methods do not work well in our context. The main reason is that they generate large mixed clusters that are very time consuming for users to clean. This result suggests that when we combine a machine part with a human part, it is important to develop the machine part such that it generates results that are "user friendly" for the user in the human part to work with.
User Interaction Techniques for the βHumanβ Part:
Many recent works on string/entity matching and crowdsourcingsolicit user feedback/action via GUIs to verify and further clean(e.g.,
CrowdDB, CrowdER, and more [19, 20, 44-47]). These works, however, allow only a limited range of user actions (e.g., asking users if two tuples match). A recent work, Waldo [44], considers more expressive user actions, such as showing six values on a single screen and allowing the user to cluster all six in "one shot". The above works differ from Winston in two important ways. First, the range of user actions that they allow is still quite limited. In contrast, Winston considers far more expressive user actions, such as splitting a cluster, merging two clusters, etc. Second, the above works do not explicitly model the human effort of the user actions and do not seek to minimize the total human effort, as Winston does. For example, they model the cost of labeling a value pair or clustering six values as a fixed value (e.g., 3 cents paid to a crowd worker), regardless of how much effort a user puts into doing it. As such, our work can be viewed as advancing the recent human-in-the-loop (HILDA) line of research, by considering more expressive user actions and studying how to optimize their human-effort cost using RDBMS-style techniques.
RDBMS-Style Cleaning Systems:
Many cleaning works have also adopted an RDBMS-style operator framework, e.g., AJAX [22], Wisteria [23], Arnold [28], QuERy [4]. They, however, do not consider expressive human operations, modeling human actions at a coarse level, e.g., labeling a tuple or converting a dirty tuple into a clean one. In contrast, we model and estimate the cost of complex human operations, e.g., removing a value from a cluster, verifying whether a cluster is clean, etc. Finally, current work typically optimizes for the accuracy and time of cleaning algorithms (while assuming a ceiling on the human effort). In contrast, we minimize the human effort, which can be a major bottleneck in practice.
Interactive Cleaning Systems:
Another prominent body of work develops interactive cleaning systems (e.g., AJAX [22], Potter's Wheel [41], Wrangler [29], Trifacta [26], ALIAS [42], and [25]). Such systems often try to maximize cleaning accuracy, or efficiently build data transformation/cleaning scripts, while minimizing the user effort. To the best of our knowledge, however, they have not examined the problem of VN with 100% accuracy. For example, active learning-based approaches such as [42] do not tell the user what to do (to reach 100% accuracy) if, after using them, the accuracy of the cleaned dataset is still below 100%.
10 CONCLUSIONS & FUTURE WORK
We have examined the problem of value normalization with 100% accuracy. We have described Winston, an RDBMS-style solution that defines human operations, combines them with clustering algorithms to form hybrid plans, estimates plan costs (in terms of human verification and cleaning effort), then selects the best plan. Overall, our work here shows that it is indeed possible to apply an RDBMS-style solution approach to the problem of 100%-accurate cleaning. Going forward, we plan to open source our current VN solution, explore other clustering algorithms for VN, and explore applying the solutions here to other cleaning tasks, such as deduplication, outlier removal, extraction, and data repair.
REFERENCES
[1] 2018. OpenRefine open-source tool. openrefine.org.
[2] 2018. The value normalization capabilities of OpenRefine. https://github.com/OpenRefine/OpenRefine/wiki/Clustering.
[3] Z. Abedjan et al. 2016. Detecting Data Errors: Where are we and what needs to be done? PVLDB 9, 12 (2016), 993-1004.
[4] H. Altwaijry et al. 2015. QuERy: A Framework for Integrating Entity Resolution with Query Processing. PVLDB 9, 3 (2015), 120-131.
[5] A. Arasu et al. 2011. Towards a Domain Independent Platform for Data Cleaning. IEEE Data Eng. Bull. 34, 3 (2011), 43-50.
[6] S. Basu et al. 2002. Semi-supervised Clustering by Seeding. In ICML.
[7] M. Bilenko et al. 2004. Integrating constraints and metric learning in semi-supervised clustering. In ICML.
[8] K. Chakrabarti et al. 2012. A Framework for Robust Discovery of Entity Synonyms. In SIGKDD.
[9] S. Chaudhuri et al. 2006. Data Debugger: An Operator-Centric Approach for Data Quality Solutions. IEEE Data Eng. Bull. 29, 2 (2006), 60-66.
[10] X. Chu et al. 2016. Data Cleaning: Overview and Emerging Challenges. In SIGMOD.
[11] X. Chu et al. 2016. Distributed Data Deduplication. In VLDB.
[12] X. Chu and I. F. Ilyas. 2016. Qualitative Data Cleaning. PVLDB 9, 13 (2016).
[13] S. Das et al. 2017. Falcon: Scaling Up Hands-Off Crowdsourced Entity Matching to Build Cloud Services. In SIGMOD.
[14] A. Das Sarma et al. 2012. An automatic blocking mechanism for large-scale de-duplication tasks. In CIKM.
[15] T. Dasu and T. Johnson. 2003. Exploratory Data Mining and Data Cleaning. John Wiley.
[16] X. Dong et al. 2010. Global Detection of Complex Copying Relationships Between Sources. PVLDB 3, 1 (2010), 1358-1369.
[17] V. Efthymiou et al. 2015. Parallel Meta-blocking: Realizing Scalable Entity Resolution over Large, Heterogeneous Data. In Big Data.
[18] A. Fahad et al. 2014. A Survey of Clustering Algorithms for Big Data: Taxonomy and Empirical Analysis. IEEE Trans. Emerging Topics in Computing 2, 3 (2014), 267-279.
[19] D. Firmani et al. 2016. Online Entity Resolution Using an Oracle. PVLDB 9, 5 (2016), 384-395.
[20] M. J. Franklin et al. 2011. CrowdDB: answering queries with crowdsourcing. In SIGMOD.
[21] J. Freire et al. 2016. Exploring What not to Clean in Urban Data: A Study Using New York City Taxi Trips. IEEE Data Eng. Bull. 39, 2 (2016), 63-77.
[22] H. Galhardas et al. 2001. Declarative Data Cleaning: Language, Model, and Algorithms. In VLDB.
[23] D. Haas et al. 2015. Wisteria: Nurturing Scalable Data Cleaning Infrastructure. In VLDB.
[24] D. Haas et al. 2016. CLAMShell: Speeding up Crowds for Low-latency Data Labeling. In VLDB.
[25] J. He et al. 2016. Interactive and Deterministic Data Cleaning. In SIGMOD.
[26] J. Heer et al. 2015. Predictive Interaction for Data Transformation. In CIDR.
[27] A. K. Jain et al. 1999. Data Clustering: A Review. ACM Comput. Surv. 31, 3 (1999), 264-323.
[28] S. R. Jeffery et al. 2013. Arnold: Declarative Crowd-Machine Data Integration. In CIDR.
[29] S. Kandel et al. 2011. Wrangler: Interactive Visual Specification of Data Transformation Scripts. In SIGCHI. 3363-3372.
[30] Z. Khayyat et al. 2015. BigDansing: A System for Big Data Cleansing. In SIGMOD.
[31] L. Kolb et al. 2011. Parallel Sorted Neighborhood Blocking with MapReduce. In BTW.
[32] P. Konda et al. 2016. Magellan: Toward Building Entity Matching Management Systems. PVLDB 9, 12 (2016), 1197-1208.
[33] S. Krishnan et al. 2016. ActiveClean: Interactive Data Cleaning For Statistical Modeling. PVLDB 9, 12 (2016).
[34] A. Marcus et al. 2011. Crowdsourced databases: Query processing with people. In CIDR.
[35] J. McCrae and N. Collier. 2008. Synonym set extraction from the biomedical literature by lexical pattern discovery. BMC Bioinformatics.
[36] Psychological Review 63, 2 (1956).
[37] B. Mozafari et al. 2014. Scaling Up Crowd-Sourcing to Very Large Datasets: A Case for Active Learning. In VLDB.
[38] A. G. Parameswaran and N. Polyzotis. 2011. Answering Queries using Humans, Algorithms and Databases. In CIDR.
[39] H. Park and J. Widom. 2013. Query Optimization over Crowdsourced Data. In VLDB.
[40] E. Rahm and H. H. Do. 2000. Data Cleaning: Problems and Current Approaches. IEEE Data Eng. Bull. 23, 4 (2000).
[41] V. Raman and J. M. Hellerstein. 2001. Potter's Wheel: An Interactive Data Cleaning System. In VLDB.
[42] S. Sarawagi et al. 2002. ALIAS: An Active Learning led Interactive Deduplication System. In VLDB.
[43] S. Van Dongen. 2008. Graph Clustering Via a Discrete Uncoupling Process. SIAM J. Matrix Anal. Appl. 30, 1 (2008).
[44] V. Verroios et al. 2017. Waldo: An Adaptive Human Interface for Crowd Entity Resolution. In SIGMOD.
[45] V. Verroios and H. Garcia-Molina. 2015. Entity Resolution with crowd errors. In ICDE.
[46] J. Wang et al. 2012. CrowdER: Crowdsourcing Entity Resolution. PVLDB 5, 11 (2012), 1483-1494.
[47] J. Wang et al. 2013. Leveraging Transitive Relations for Crowdsourced Joins. In SIGMOD.
[48] R. Xu and D. Wunsch, II. 2005. Survey of Clustering Algorithms. IEEE Trans. Neural Netw. 16, 3 (2005), 645-678.
[49] A. Yates and O. Etzioni. 2009. Unsupervised Methods for Determining Object and Relation Synonyms on the Web. J. Artif. Int. Res. 34, 1 (2009), 255-296.
Algorithm 3 Merge Phase

Procedure Merge(L)
Input: a list L of values representing the output clusters of the Split phase
Output: a set C of clean clusters of the values in L
1: LocalMerge(L)
2: L <- consolidated list of values from the LocalMerge step
3: return GlobalMerge(L)

Procedure GlobalMerge(L)
Input: a list L of values sorted alphabetically
Output: a set S of clean clusters of the values in L
1: while |L| > 0 do
2:   if |L| < 3 then B <- [L[1], L[2]] else B <- [L[1], L[2], L[3]]  // B is the list of values to be displayed on the columns
3:   MarkValuesForGlobalMerge(B, L)  // at the end, the user selects the "global merge" button
4:   for i <- 1, ..., |B| do
5:     Merge B[i] and the values marked to match it into cluster s_i
6:     Remove the values in s_i from L; S <- S U {s_i}
7: return S

Procedure MarkValuesForGlobalMerge(B, D)
Input: a list B of values on the columns, a list D of values on the rows
Output: links among the values in B and D that match
1: for each i <- 1, ..., |B| do
2:   v <- B[i]; (r, t) <- memorize(v)
3:   if r is not null then focus(h_{r,t}); select(h_{r,t})  // h_{x,y} is a checkbox to be selected if x and y match
4: for each j <- 1, ..., |D| do
5:   w <- D[j]; (r, t) <- recall(w)
6:   if r is not null then focus(h_{r,t}); select(h_{r,t})

A DEFINING THE HUMAN PART
Algorithm 3 describes the Merge and GlobalMerge procedures (LocalMerge has been described in Section 3.2).
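As a rough illustration, the GlobalMerge loop above can be sketched in Python as follows. The `user_matches` oracle stands in for the human marking matching values on the GUI, and all names here are ours, chosen for illustration:

```python
def global_merge(values, user_matches):
    """Sketch of the GlobalMerge loop: repeatedly display up to three
    'column' values, let the user mark which of the displayed values match
    each column, and merge the marked values into clusters.

    values       : list of representative values (e.g., sorted alphabetically)
    user_matches : oracle standing in for the human; given a column value and
                   the candidate values on screen, returns those that match.
    """
    values = list(values)
    clusters = []
    while values:
        cols, rest = values[:3], values[3:]
        consumed = set()
        for col in cols:
            if col in consumed:
                continue
            candidates = [v for v in cols + rest
                          if v != col and v not in consumed]
            matched = user_matches(col, candidates)
            cluster = {col, *matched}     # merge col with its marked values
            clusters.append(cluster)
            consumed |= cluster
        values = [v for v in rest if v not in consumed]
    return clusters

# Toy "user" that matches values case-insensitively.
oracle = lambda col, rows: [r for r in rows if r.lower() == col.lower()]
out = global_merge(["IBM", "ibm", "intel", "Intel Corp."], oracle)
# out contains the clusters {"IBM", "ibm"}, {"intel"}, {"Intel Corp."}
```

This sketch resolves one batch of up to three column values per pass, mirroring the three-column GUI described in the paper.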
B ESTIMATING PLAN COSTS
In this section we describe the cost estimation formulas for the various procedures used in the human part of value normalization plans, and how we derived them.
SplitCluster Procedure:
To estimate the cost of applying SplitCluster to a cluster c_i during the execution of a plan p_i, we consider the following three cases, depending on the purity alpha_i of c_i.

Case 1 (alpha_i high): We estimate the cost of applying SplitCluster to c_i as

cost_SplitCluster(c_i; alpha_i high) = Sum_{j=1}^{beta_i} [
    cost_isPure((1 - alpha_i)^{j-1} n_i, alpha_i) + c_f + c_s        (f_1)
  + cost_findDom((1 - alpha_i)^{j-1} n_i) + c_f + c_s                (f_2)
  + ((1 - alpha_i)^{j-1} n_i)(c_f + c_m + (1 - alpha_i) c_s)         (f_{3,1})
  + c_f + c_s ].                                                     (f_{3,2})

Recall that we go through beta_i iterations of splitting c_i, and at iteration j in {1, ..., beta_i} we split an impure cluster of approximate size (1 - alpha_i)^{j-1} n_i; e.g., at iteration 1 we split the whole cluster of size (1 - alpha_i)^0 n_i = n_i. Each iteration corresponds to a (recursive) call of SplitCluster. In each execution of SplitCluster, there are three lines (numbered 2, 4, and 7 in Algorithm 1) involving user operations, and thus only these lines contribute to the cost of the procedure.

The cost of line 2 is captured by part f_1 of the above formula: it consists of the cost of isPure (executed on a cluster of size (1 - alpha_i)^{j-1} n_i) and then focusing on and selecting the "no" button. Part f_2 captures the cost of line 4: it consists of the cost of findDom and then focusing on and selecting the "mark values" button. Finally, parts f_{3,1} and f_{3,2} capture the cost of line 7: f_{3,1} is the cost of MarkValues and f_{3,2} is the cost of focusing on and selecting the "create/clean new cluster" button. Part f_{3,1} in turn consists of going through the cluster values (line 3 in the MarkValues pseudocode), focusing on each value, matching it with the dominating entity of the cluster, and selecting the value if they match (i.e., for a 1 - alpha_i fraction of the values).

Case 2 (alpha_i medium): We estimate the cost of applying SplitCluster to c_i as

cost_SplitCluster(c_i; alpha_i medium) = Sum_{j=1}^{beta_i} [ cost_isPure((1 - alpha_i)^{j-1} n_i, alpha_i) + c_f + c_s + cost_findDom((1 - alpha_i)^{j-1} n_i) + c_f + c_s + ((1 - alpha_i)^{j-1} n_i)(c_f + c_m + alpha_i c_s) + c_f + c_s ].

The derivation is very similar to the previous case. The only difference is the fraction of matching values at each execution of MarkValues, which is alpha_i instead of 1 - alpha_i.

Case 3 (alpha_i low): We estimate the cost of applying SplitCluster to c_i as

cost_SplitCluster(c_i; alpha_i low) =
    cost_isPure(n_i, alpha_i) + c_f + c_s      (f'_1)
  + cost_findDom(n_i) + c_f + c_s              (f'_2)
  + cost_LocalMerge(c_i)                       (f'_3)
  + cost_GlobalMerge(c_i).

Here f'_1 is the cost of executing isPure on c_i and then focusing on and selecting the "no" button, and f'_2 is the cost of executing findDom and then focusing on and selecting the "clean mixed cluster" button. f'_3 is the cost of executing LocalMerge on c_i, and the rest of the formula is the cost of executing GlobalMerge on the result of the previous step. We describe the costs of LocalMerge and GlobalMerge in the following sections.
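The Case 1 / Case 2 estimate can be sketched as follows. The cost-model callables and all parameter names are ours, chosen for illustration:

```python
def split_cluster_cost(n, alpha, beta, cost_is_pure, cost_find_dom,
                       c_f, c_s, c_m, high_purity=True):
    """Sketch of the Case 1 / Case 2 SplitCluster cost estimate.

    n, alpha, beta : cluster size n_i, purity alpha_i, and number of split
                     iterations beta_i
    cost_is_pure(m, alpha), cost_find_dom(m) : calibrated per-user cost models
    c_f, c_s, c_m  : focus, select, and match operation costs
    high_purity    : True for Case 1 (a 1-alpha fraction of values is
                     selected in MarkValues), False for Case 2 (alpha)
    """
    select_frac = (1 - alpha) if high_purity else alpha
    total = 0.0
    for j in range(1, beta + 1):
        m = (1 - alpha) ** (j - 1) * n  # size of the cluster split at iter j
        total += cost_is_pure(m, alpha) + c_f + c_s   # line 2: isPure + "no"
        total += cost_find_dom(m) + c_f + c_s         # line 4: findDom + "mark values"
        total += m * (c_f + c_m + select_frac * c_s)  # line 7: MarkValues
        total += c_f + c_s                            # "create/clean new cluster"
    return total

# With zero-cost isPure/findDom models and unit operation costs:
est = split_cluster_cost(8, 0.5, 1, lambda m, a: 0.0, lambda m: 0.0,
                         c_f=1.0, c_s=1.0, c_m=1.0)
# est == 26.0, i.e., 2 + 2 + 8 * (1 + 1 + 0.5) + 2
```

The per-iteration terms map one-to-one onto parts f_1, f_2, f_{3,1}, and f_{3,2} of the formula above.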
LocalMerge Procedure:
Recall that for a particular plan p_i, the result of the Split phase consists of approximately e_i pure clusters of input values. Thus the size of the input list L to LocalMerge is e_i. User u goes through the values in L and first memorizes each value. For a 1 - p fraction of the values in L, u finds a matching value in his or her short-term memory (STM), in which case he or she (1) selects the current value, then (2) focuses on and selects the value retrieved from STM, and finally (3) focuses on and selects the link button. Finally, the user focuses on and selects the "done local merging" button to proceed to global merging. Adding up these costs gives the cost formula e_i c_z + e_i (1 - p)(2 c_f + 3 c_s) + c_f + c_s. GlobalMerge Procedure:
GlobalMerge takes as input a list of values L with approximate size e'_i. GlobalMerge consists of possibly several iterations, and in each iteration we assume that each of the three values displayed on the columns of the GUI matches approximately q e'_i of the values, so each iteration resolves about 3 q e'_i values. Thus the number of iterations of GlobalMerge is approximately Gamma_i = floor(e'_i / (3 q e'_i)) = floor(1 / (3 q)). At iteration k in {1, ..., Gamma_i} the user sees e'_i - 3(k - 1) q e'_i values remaining to be matched, three of which are displayed on the columns and the rest on the rows of the GUI. The user first memorizes the three values on the columns (with total cost 3 c_z), then focuses on each of the remaining row values, additionally selects the roughly 3 q e'_i - 3 of them that match a column value, and finally focuses on and selects the "global merge" button. Adding up these costs over all iterations gives

Sum_{k=1}^{floor(1/(3q))} [ 3 c_z + (e'_i - 3(k - 1) q e'_i - 3) c_f + (3 q e'_i - 3)(c_f + c_s) + c_f + c_s ].

Estimating the Costs of User Operations:
We now describe how we estimate the cost c_m of the match operation, the parameters gamma_1 and gamma_2 of the isPure operation cost function, and the parameters d_1, d_2, and d_3 of the findDom operation cost function during the calibration stage.

To estimate c_m, we first pick three pairs of random values of V. We then ask the user u to match each pair and, depending on whether they match or not, to select a "yes" or "no" button. For the k-th pair we measure the time t_{m,k} it takes from when we show the screen containing the pair of values and the buttons to u until he or she selects one of the buttons. During this time the user matches the values shown on the screen, then focuses on one of the buttons and selects it. Hence the time we measure equals c_{m,k} + c_f + c_s, where c_{m,k} is our estimated cost of the k-th match operation. We then calculate c_{m,k} = t_{m,k} - (c_f + c_s) and estimate c_m as the average of the c_{m,k}'s, i.e., c_m = Sum_{k=1}^{3} c_{m,k} / 3.

To estimate gamma_1 and gamma_2, we first pick three random non-singleton clusters C_1, C_2, and C_3 (of different sizes if possible) from C. Then we show each C_k and ask u to select a "yes" button if C_k is pure and a "no" button otherwise. We record the time t_{p,k} it takes from when we show C_k to u until one of the buttons is selected. We also record which button is selected. Using the same timing analysis we described for c_m, we form three equations of the form gamma_1 alpha(C_k) m_k + gamma_2 = t_{p,k} - (c_f + c_s), where k in {1, 2, 3}, alpha(C_k) is the purity of C_k, and m_k is the size of C_k. However, since we do not know the purity of the C_k's, we use the button u has selected to guess the purity of C_k: if u has selected the "yes" button, we set alpha(C_k) = 1; otherwise we set alpha(C_k) to the value predicted by the purity function (whose parameters we have already estimated during the calibration of the purity function). Finally, we use the ordinary least-squares method to solve the system of three equations above and estimate the parameters gamma_1 and gamma_2.

To estimate d_1, we first pick three clusters C'_1, C'_2, and C'_3 from C whose sizes m'_k = |C'_k| are at most the cluster-size threshold. We then show each C'_k to u and ask him or her to find the dominating entity of C'_k, and then select a value in C'_k which refers to dom(C'_k). For each C'_k we measure the time t_{d1,k} it takes from when it is shown to u until he or she selects the value referring to dom(C'_k). Using the same timing analysis as above, we obtain three equations of the form t_{d1,k} = d_1 m'_k + c_f + c_s. We then solve each equation for d_1 and average the three resulting numbers, estimating d_1 as (1/3) Sum_{k=1}^{3} (t_{d1,k} - (c_f + c_s)) / m'_k.

To estimate d_2 and d_3, we follow a similar process: we first pick three clusters C''_1, C''_2, and C''_3 from C whose sizes m''_k = |C''_k| exceed the threshold. We then show each C''_k to u and ask him or her to find the dominating entity of C''_k and then select a value in C''_k which refers to dom(C''_k). For each C''_k we measure the time t_{d2,k} it takes from when it is shown to u until he or she selects the value referring to dom(C''_k). Using the same timing analysis as above, we obtain three equations of the form t_{d2,k} = d_2 m''_k + d_3 + c_f + c_s. Finally, we use the ordinary least-squares method to solve the system of three equations above and estimate the parameters d_2 and d_3.

C WORKING WITH MULTIPLE USERS
We now describe cWinston, which extends Winston to work with multiple users. Assuming g users want to collaborate to normalize a set V of input values, cWinston goes through four main stages. In the first stage, it shows each user a few clusters of values in V and asks them to perform some basic operations on them. cWinston then uses the results of these operations to tune the purity function parameters and user operation cost models for each user (in the same way as Winston). Next, it averages the purity function and cost model parameters to create a single purity function and a single cost model for each user operation.

In the second stage, cWinston uses the above purity function and user operation cost models to find the best VN plan. To do so, it uses the same plan-space search procedure as Winston (see Section 6). It then executes the machine part of the best plan to obtain a set C of clusters.

In the third stage, cWinston partitions C into g subsets with roughly the same number of values. It then assigns each subset to one of the users and asks them to clean their respective subsets of clusters using the Split and Merge algorithms.

In the last stage, cWinston starts by collecting the results of
Split + Merge from all the users; for each user, it creates a list of representative values of the clean clusters he or she has produced. It then picks the longest list and divides it into g chunks of roughly the same size. Next, cWinston asks each user to merge one of these chunks with the rest of the lists using the GlobalMerge procedure (see Section 6). It then collects the results from all the users and repeats this stage (i.e., takes representative values from the merged clusters, divides the largest list into g chunks, and so on) until all of the lists are verified/merged. Algorithm 4 shows the pseudocode of cWinston.

Estimating the Cost of cWinston: Next, we describe how we estimate the cost of cWinston. To do so, we traverse Algorithm 4, using the same assumptions as we discussed in Section 5, to arrive at the following cost formula:

cost_cWinston = max_{1 <= k <= g} ( cost_{Split,u_k}(C'_k) + cost_{LocalMerge,u_k}(E_k) + cost_{GlobalMerge,u_k}(L_k) ) + cost_{MultiUserMerge,U}(S),

where g is the number of users, the max term finds the longest it takes any of the users u_k to perform Split + Merge on their respective partitions, and the last term is the cost of the multi-user merge, calculated using the following formula:

cost_{MultiUserMerge,U}(S) = Sum_{t=1}^{g-1} max_{u in U} [ (|D_t| (1 - r_{t-1} q) / (3 g)) Sum_{j=1}^{g-t} ( 3 c_{z,u} + (3 q + 1)(c_{f,u} + c_{s,u}) + rho |D_j| (1 - (r_{t-1} + 1) q) c_{f,u} ) ].

In the above formula, D_t is the largest list of representative values at iteration t, r_t is the number of entities that are completely merged and removed from the current list D after iteration t, the c_{.,u}'s are the costs of the human operations for user u, rho is the proportion of the rows which, on average, must be examined before all columns are matched, and q is described in Section 5.

Here is how we derive this formula. The outer summation corresponds to choosing the longest list of representative values and dividing it among the users. The max operation chooses the longest it takes any of the users to perform each one of the above iterations, i.e., the longest path, which determines the time it takes the users to collaboratively perform MultiUserMerge from start to finish.

The middle sum corresponds to the scans of the remaining lists of representative values that each user u has to perform at each iteration. To determine the number of such scans per iteration, we need the number of column values that u has to read and memorize for merging. This number equals the portion of the current list D_t still remaining to be merged by u, which is |D_t| (1 - r_{t-1} q) / g. Since we show u three column values at a time, we divide the above number by three to arrive at the number of scans |D_t| (1 - r_{t-1} q) / (3 g). During each of these scans, u memorizes three column values, hence the 3 c_{z,u} term.

Algorithm 4 cWinston
Procedure cWinston(V, U)
Input: a set V of input values, a set U = {u_1, ..., u_g} of users
Output: a set S of clean clusters
1: for each u_k in U do
2:   A_k <- tune the purity function and cost model parameters for u_k
3: p* <- search the plan space to find the best plan using the A_k's
4: C <- run HAC(p*) on V
5: Divide C into C' = {C'_1, C'_2, ..., C'_g} s.t. |C'_k| ~ |C| / g; S <- empty set
6: for each u_k in U do  // each user u_k executes Winston on C'_k
7:   D_k <- Split(C'_k); E_k <- list of representative values of the clusters in D_k
8:   LocalMerge(E_k); L_k <- consolidated list of values from the LocalMerge step
9:   S <- S U GlobalMerge(L_k)
10: if |U| = 1 then return S  // effectively, (single-user) Winston
11: else return MultiUserMerge(S, U)

Procedure MultiUserMerge(S, U)
Input: a set S of clean clusters, a set U = {u_1, ..., u_g} of users
Output: a set S' of clean clusters
1: D <- empty set  // representative values for the sets of clusters in S
2: S' <- empty set  // flattened S
3: for each S_k in S do
4:   D_k <- a set of values representing the clusters in S_k; S' <- S' U S_k; D <- D U {D_k}
5: while |D| > 1 do
6:   D* <- argmax_{D_k in D} |D_k|; D <- D \ {D*}; M <- empty set  // matches
7:   Divide D* into L = {L_1, L_2, ..., L_g} s.t. |L_k| ~ |D*| / g
8:   for each L_k in L do  // all users perform merging with their respective column values in parallel
9:     M <- M U GroupedMerge(L_k, copy(D), u_k)
10:  for each D_k in D do  // resolve matches
11:    for each (v, w) in M do
12:      D_k <- D_k \ {v, w}
13:      Merge the clusters in S' which v and w refer to, and set v to refer to the new cluster
14: return S'

Procedure GroupedMerge(L, D, u)
Input: a list L of column values, a set D of representative value sets, a user u
Output: a set M of matches
1: M <- empty set
2: while |L| > 0 do  // while there are still column values left
3:   k <- min(3, |L|); B <- k values from L
4:   for each b in B do memorize(b)
5:   for each D_k in D do
6:     M <- M U SetMerge(B, D_k, u)
7:   L <- L \ B
8: return M

Procedure SetMerge(B, D_k, u)
Input: a set B of column values, a set D_k of representative values, a user u
Output: a set M' of matches
1: Sort D_k according to similarity to the values in B; M' <- empty set
2: for each v in D_k do
3:   if recall(v) then
4:     M' <- M' U {(b, v)} s.t. b in B and b matches v
5:     if |M'| = |B| then break
6:   D_k <- D_k \ {v}
7: return M'

The inner-most sum corresponds to the number of rows examined per each set of three representative values during each scan. At iteration t, there are g - t lists of representative values left to appear on the rows. Each time u scans one of these lists, he or she matches on average 3q rows, each of which requires a button click; additionally, u has to click the merge button, hence the term (3q + 1)(c_{f,u} + c_{s,u}). To account for the number of rows examined in each list before finding the matches, we calculate the number of rows left in the list by finding the number of entities removed from the list by the end of this iteration, i.e., |D_j| (r_{t-1} + 1) q, and then subtracting this value from |D_j|. We also assume that only a rho proportion of these rows needs to be investigated before finding the matches, hence the term rho |D_j| (1 - (r_{t-1} + 1) q).

D EMPIRICAL EVALUATION
Generating Synthetic Users:
We use a deterministic model of a user, i.e., we use constant values for the operation costs c_m, c_f, c_s, c_z, the isPure parameters gamma_1 and gamma_2, and the findDom parameters d_1, d_2, and d_3. To generate a synthetic user, we first assume a constant value for c_f and c_s, i.e., c_f = c_s = 0.5. We then assume a range of values for each of c_m, d_1, gamma_1, gamma_2, and p. Next, we generate a random simulated user by uniformly randomly sampling a number from each of the above ranges and assigning these values to the corresponding parameters of the cost model. Finally, we assign the remaining parameters, setting c_z = c_m and deriving d_2 and d_3 from d_1 and the cluster-size threshold.
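The sampling procedure above can be sketched as follows. The concrete sampling ranges are illustrative placeholders (they did not survive extraction from the source), and the fixed value c_f = c_s = 0.5 is as described in the text:

```python
import random

# Placeholder sampling ranges for the cost-model parameters; these ranges
# are assumptions made for this sketch, not the paper's actual values.
PARAM_RANGES = {
    "c_m":    (0.5, 2.0),
    "d1":     (0.1, 1.0),
    "gamma1": (0.1, 1.0),
    "gamma2": (0.5, 2.0),
    "p":      (0.1, 0.5),
}

def generate_synthetic_user(rng=random):
    """Generate one deterministic simulated user: fix c_f = c_s = 0.5,
    uniformly sample the remaining parameters from the ranges above, then
    derive the tied parameters (here only c_z = c_m)."""
    user = {"c_f": 0.5, "c_s": 0.5}
    for name, (lo, hi) in PARAM_RANGES.items():
        user[name] = rng.uniform(lo, hi)
    user["c_z"] = user["c_m"]  # remaining parameter derived, per the text
    return user

u = generate_synthetic_user()
```

Passing a seeded `random.Random` instance as `rng` makes the simulated users reproducible across experiment runs.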