Toward Data Cleaning with a Target Accuracy: A Case Study for Value Normalization
Adel Ardalan, Derek Paulsen, Amanpreet Singh Saini, Walter Cai, AnHai Doan
ABSTRACT
Many applications need to clean data with a target accuracy. As far as we know, this problem has not been studied in depth. In this paper we take the first step toward solving it. We focus on value normalization (VN), the problem of replacing all strings that refer to the same entity with a unique string. VN is ubiquitous, and we often want to do VN with 100% accuracy. This is typically done today in industry by automatically clustering the strings, then asking a user to verify and clean the clusters until reaching 100% accuracy. This solution has significant limitations. It does not tell the users how to verify and clean the clusters. This part also often takes a lot of time, e.g., days. Further, there is no effective way for multiple users to collaboratively verify and clean. In this paper we address these challenges. Overall, our work advances the state of the art in data cleaning by introducing a novel cleaning problem and describing a promising solution template.
1 INTRODUCTION

Data cleaning (DC) has been a long-standing challenge in the database community. Many DC problems have been studied, such as cleaning with a budget, cleaning to satisfy constraints while minimizing changes to the data, etc. Recently, however, we have seen another novel DC problem in industry: cleaning with a target accuracy, e.g., with at least 95% precision and 90% recall. While pervasive, this problem appears to have received little attention in the research community.

In this work, we take the first step toward solving this problem. We focus on value normalization (VN), the problem of replacing all strings (in a given set) that refer to the same real-world entity with a unique string. VN is ubiquitous, and industrial users often want to do VN with 100% accuracy.
Example 1.1. To enable product browsing by brand on walmart.com, the business group at WalmartLabs asks the IT group to normalize the brands, e.g., converting those in Figure 1.a into those in Figure 1.b. If some brands are not normalized correctly, then customers may not find those products, resulting in revenue losses. So the business group asks the IT group to ensure that the brands are normalized with 100% accuracy.
Many enterprise customers of Informatica (which sells data integration software) also face this problem, in building business glossaries, master data management, and knowledge graph construction. In general, if even a small amount of inaccuracy in VN can cause significant problems for the target application, then the business group will typically ask the IT group to help perform VN with 100% accuracy.
In response, the IT group typically employs an algorithm to cluster the strings, then asks a user to verify and clean the clusters.
Figure 1: An example of normalizing product brands
Consider the five brands in Figure 2.a. The IT group applies an algorithm to produce two clusters c_1 and c_2 (Figure 2.b). A user U manually verifies and cleans the clusters, by moving "Vizio Corp" from cluster c_1 to c_2, producing the two clusters c_3 and c_4 in Figure 2.c. Finally, U replaces each string in a cluster with a canonical string, producing the VN result in Figure 2.d.

Typically, a data scientist performs the "machine" part that clusters the strings, then a data analyst performs the "human" part that verifies and cleans the clusters. The IT group assures the business group that the resulting output is 100% accurate because a data analyst has examined it (assuming that he/she does not make mistakes).

While popular, the above solution has significant limitations. First, there is no precise procedure that a user can follow to execute the "human" part. So users often verify and clean in an ad-hoc, suboptimal, and often incorrect fashion. This also makes it impossible to understand the assumptions under which the solution reaches 100% accuracy and to formally prove it.

Second, the "human" part often incurs a huge amount of time, e.g., days. In contrast, the "machine" part often takes mere minutes. (In most cases that we have seen, users verified and cleaned using Excel, in a slow and tedious process.) So it is critical to develop a better solution and GUI tool to minimize the time of the "human" part.

Finally, it is difficult for multiple users to collaboratively verify and clean, even though this setting commonly occurs in practice.

In this paper we develop Winston, a solution for the above challenges. (In the movie "Pulp Fiction", Winston Wolfe is the fixer who cleans up messes made by other gangsters.) We first
Figure 2: A popular solution in industry to perform VN with 100% accuracy.

define a set of basic operations on a GUI for users, e.g., selecting a value, verifying if a cluster is clean, merging two clusters, etc. Then we provide precise procedures involving these actions that users can execute to verify/clean clusters. We prove that if users execute these actions correctly, then the output has 100% accuracy.

To minimize the time of the "human" part, we adopt an RDBMS-style solution. Specifically, we compose the GUI operations with clustering algorithms to form multiple "machine-human" plans, each of which executes the VN pipeline end-to-end. Next, we estimate the total time a user must spend per plan, select the plan with the least estimated time, execute its machine part, then show the output of that part to the user so that he/she can verify and clean it using a GUI (following the sequence of user operations that this plan specifies).

Finally, we show how to extend our solution to effectively divide the verification and cleaning work among multiple users.

Our solution appears highly effective. Section 8 shows that using the existing solution, a single user needs 29 days, 4.4 years, and 11.5 years to verify/clean 100K, 500K, and 1M strings, respectively. Winston drastically reduces these times to just 13 days, 9.6 months, and 1.3 years using 1 user, and to 4.25 days, 2.2 months, and 3.5 months using 3 users.

In summary, we make the following contributions:

• We formally define the novel data cleaning problem of VN with 100% accuracy. As far as we know, this paper is the first to study this problem in depth.

• We propose Winston, a novel RDBMS-style solution. Winston defines complex human operations and optimizes the human time of a plan. This is in contrast to traditional RDBMSs, which define machine operations and optimize machine time.

• We describe extensive experiments (comparing Winston to a tool in a company, to the popular open-source tool OpenRefine, and to state-of-the-art string matching and entity matching solutions) that show the promise of our approach.

Overall, our work advances the state of the art in data cleaning by introducing a novel cleaning problem and describing a promising solution template. It also advances the state of the art in human-in-the-loop data analytics (HILDA) by showing that it is possible to develop an RDBMS-style solution to HILDA, by defining complex human operations, combining them to form plans, and selecting the plan with the lowest estimated human effort.
In this section we define the problem of VN with 100% accuracy, and examine under what conditions we can reach this accuracy. We first define
Definition 2.1 (Value normalization).
Let V be a set of strings {v_1, ..., v_n}. Replace each v ∈ V with a string f(v) such that f(v_i) = f(v_j) if and only if v_i and v_j refer to the same real-world entity, for all v_i, v_j in V.

This problem is often solved in two steps: (1) partition V into a set of disjoint clusters P = {c_1, ..., c_k}, such that two strings refer to the same real-world entity if and only if they belong to the same cluster; (2) replace all strings in each cluster c_i ∈ P with a canonical string s_i.

In this paper we will consider only Step 1, which tries to find the correct partitioning of V. Step 2 is typically application dependent (e.g., a common method is to select the longest string in a cluster c_i to be its canonical string, because this string tends to be the most informative one).

Figure 3: Actions and their verification sets

Gold Partition & Accuracy of Partitions:
Let U be a user who will verify and clean the clusters. To do so, U must be capable of creating a "gold", i.e., correct, partition P* = {c*_1, ..., c*_m}, such that two strings refer to the same real-world entity if and only if they are in the same cluster. For example, the two clusters c_3 and c_4 in Figure 2.c form the gold partition for the set of strings in Figure 2.a.

Our goal is to find the gold partition P*. But the partition that we find may not be as accurate. We now describe how to compute the accuracy of any partition. First, we define

Definition 2.2 (Match and non-match).
A match v_i = v_j means "v_i and v_j refer to the same real-world entity", and is correct if this is indeed true. v_i = v_j and v_j = v_i are considered the same match. We define a non-match v_i ≠ v_j similarly.

Definition 2.3 (Set of matches specified by a partition).
A cluster c_i specifies the set of matches M(c_i) = {v_p = v_q | v_p ∈ c_i, v_q ∈ c_i, p ≠ q}. Partition P = {c_1, ..., c_k} specifies the set of matches M(P) = ∪_{i=1..k} M(c_i).

For example, cluster c_1 in Figure 2.b specifies three matches: M(c_1) = {Sony = Sony Corp, Sony = Vizio Corp, Sony Corp = Vizio Corp}. Cluster c_2 specifies one match: M(c_2) = {Vizio = Vizio Inc}. These two clusters form a partition P, which specifies the set of matches M(c_1) ∪ M(c_2). The accuracy of a partition is then measured as follows:

Definition 2.4 (Precision and recall of a partition).
Let P* be the gold partition of a set of strings V. The precision of a partition P is the fraction of matches in M(P) that are correct, i.e., appearing in M(P*), and the recall of P is the fraction of matches in M(P*) that appear in M(P).

Given the gold partition P* = {c_3, c_4} in Figure 2.c, the precision of partition P = {c_1, c_2} in Figure 2.b is 2/4 = 50%, and the recall is 2/4 = 50%.
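To make Definitions 2.3 and 2.4 concrete, the computation above can be sketched in Python. This is an illustrative sketch, not the paper's code; the cluster contents are our reading of the brands in Figure 2, and the function names are ours.

```python
from itertools import combinations

def matches(partition):
    """The set of unordered matches M(P) specified by a partition (Definition 2.3)."""
    return {frozenset(pair) for cluster in partition for pair in combinations(cluster, 2)}

def precision_recall(partition, gold):
    """Precision and recall of a partition against the gold partition (Definition 2.4)."""
    found, correct = matches(partition), matches(gold)
    hits = found & correct
    return len(hits) / len(found), len(hits) / len(correct)

# The partition of Figure 2.b versus the gold partition of Figure 2.c:
found = [{"Sony", "Sony Corp", "Vizio Corp"}, {"Vizio", "Vizio Inc"}]
gold = [{"Sony", "Sony Corp"}, {"Vizio", "Vizio Corp", "Vizio Inc"}]
p, r = precision_recall(found, gold)
print(p, r)  # 0.5 0.5
```

Both partitions specify 4 matches each, of which 2 (Sony = Sony Corp and Vizio = Vizio Inc) coincide, giving the 2/4 = 50% precision and recall computed above.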
Actions & Their Verification Sets:

Henceforth, we use "action" and "operation" interchangeably. When a user U performs an action, U has implicitly verified a set of matches and non-matches, called a verification set. Formally,

Definition 2.5 (User action and verification set).
We assume a GUI on which user U can perform a set of actions A = {a_1, ..., a_n}. Each action a_i inputs data I_i and outputs data O_i, both of which involve sets of strings in V. After correctly executing an action a_i on input I_i, as a side effect, user U has implicitly verified a set S(a_i, I_i) of matches and non-matches to be correct. We refer to S(a_i, I_i) as a verification set.

To illustrate, suppose U has employed an algorithm to produce the partition {d_1, d_2} in Figure 3. Next, U uses a GUI to verify and clean these clusters. Call a cluster "pure" if all strings in it refer to the same real-world entity. Suppose the GUI supports only two actions: a_s splits a cluster into pure clusters, and a_m merges two pure clusters into one.

Suppose U starts by using a_s to split cluster d_1 into two pure clusters d_3 = {Sony, Sony Corp} and d_4 = {Vizio Corp} (see Figure 3). Cluster d_1 specifies three matches: Sony = Sony Corp, Sony = Vizio Corp, and Sony Corp = Vizio Corp. With the above splitting, intuitively, U has verified that the first match Sony = Sony Corp is indeed correct, but the remaining two matches are not. Thus, the resulting verification set S(a_s, d_1) is the set of 1 match and 2 non-matches: {Sony = Sony Corp, Sony ≠ Vizio Corp, Sony Corp ≠ Vizio Corp}, as shown in Figure 3.

Next, U uses a_s to split cluster d_2. U determines that d_2 is already pure, so no new clusters are created. Implicitly, U has verified that the sole match specified by d_2 is correct. So S(a_s, d_2) = {Vizio = Vizio Inc}. Finally, U uses action a_m to merge the two pure clusters d_4 and d_2 into cluster d_5 (see Figure 3). Implicitly, U has verified that the two matches {Vizio = Vizio Corp, Vizio Inc = Vizio Corp} are correct. These form the verification set S(a_m, {d_4, d_2}).

In a similar fashion, we can define the verification set for the sequence a_s(d_1), a_s(d_2), a_m({d_4, d_2}) in Figure 3 to be S(a_s, d_1) ∪ S(a_s, d_2) ∪ S(a_m, {d_4, d_2}). Formally,

Definition 2.6 (Verification set of action sequence).
If user U has performed an action sequence G = a_{i1}, ..., a_{ik}, then the verification set of G is S(G) = ∪_{j=1..k} S(a_{ij}, I_{ij}).

Match Transitivity:
Recall that we assume user U can create a gold partition P*. This implies that transitivity holds for matches, i.e., if v_i = v_j and v_j = v_l are correct, then v_i = v_l is also correct (because all three must be in the same gold cluster). Similarly, if v_i = v_j and v_j ≠ v_l are correct, then v_i ≠ v_l is also correct.

Definition 2.7 (Inferring matches).
We say that match v_i = v_j can be inferred from a verification set S(G) if and only if there exists a sequence of strings v_{m1}, ..., v_{mk} such that the matches v_i = v_{m1}, v_{m1} = v_{m2}, ..., v_{mk} = v_j are in S(G). We say that these matches form a transitivity path from v_i to v_j. Similarly, we say that non-match v_i ≠ v_j can be inferred from S(G) if and only if there exists such a path, except that exactly one of the edges of the path is a non-match.
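Definition 2.7's inference rules can be sketched with a union-find structure (our choice of data structure; the class and helper names are ours): strings verified to match collapse into components, and a verified non-match between two components makes every cross-component pair an inferable non-match, since the connecting path then contains exactly one non-match edge.

```python
class UnionFind:
    """Minimal union-find over arbitrary hashable items."""
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:          # path-halving lookup
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x
    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

def infer(matches, nonmatches):
    """Close a verification set under the transitivity rules of Definition 2.7."""
    uf = UnionFind()
    for a, b in matches:
        uf.union(a, b)                      # verified matches merge components
    # A verified non-match between two components blocks all cross pairs:
    blocked = {frozenset((uf.find(a), uf.find(b))) for a, b in nonmatches}
    is_match = lambda a, b: uf.find(a) == uf.find(b)
    is_nonmatch = lambda a, b: frozenset((uf.find(a), uf.find(b))) in blocked
    return is_match, is_nonmatch

m, n = infer([("Sony", "Sony Corp"), ("Vizio", "Vizio Inc")],
             [("Sony Corp", "Vizio Inc")])
print(m("Sony", "Sony Corp"))  # True
print(n("Sony", "Vizio"))      # True: path with exactly one non-match edge
```

Here Sony ≠ Vizio is inferred from the path Sony = Sony Corp, Sony Corp ≠ Vizio Inc, Vizio Inc = Vizio.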
VN with 100% Accuracy:

Recall that the gold partition P* specifies a set of correct matches M(P*). We say that it also specifies a set of correct non-matches N(P*), which consists of all non-matches v_i ≠ v_j such that v_i = v_j is not a match in M(P*). We define

Definition 2.8 (Gold sequence of actions).
A sequence G of actions of user U is "gold" if and only if any match in M(P*) or non-match in N(P*) either already exists in the verification set S(G) or can be inferred from S(G).

It is not difficult to prove that executing a gold action sequence G will produce the gold partition P*. We can now define our problem as follows:

Definition 2.9 (VN with 100% accuracy).
Let V be a set of strings. Let (m, h) be a pair of machine/human algorithms, such that the machine part m can be executed on V to produce a partition P, then the human part h can be executed by a user U on P to produce a new partition P+. Find m and h such that (a) the action sequence executed by user U in part h is a gold sequence, and (b) the total time spent by user U is minimized. Return the resulting partition P+.

Figure 4: An illustration of the split and merge stages.
Thus, we reach 100% accuracy if the user executes a gold sequence G of actions. Then all correct matches and non-matches will have already been in the verification set of G, or can be inferred from this verification set via match transitivity.

As discussed, each VN plan (m, h) consists of a machine part m and a human part h. In part m we apply an algorithm to the input strings to obtain a set of clusters C, then in part h we employ a user U to verify and clean C. We now design part h; the next section designs part m.

The key challenge in designing the human part h is to ensure that it is easy for users to understand and execute, minimizes their effort, and is amenable to cost analysis. Toward these goals, we discuss the user setting, describe a solution called split and merge, then define a set of user operations that can be used to implement this solution.

We assume user U will work with a graphical user interface (GUI), using mouse and keyboard. U has a short-term memory (or STM for short). According to [36], each individual can remember 7 ± 2 objects in STM. U can use paper and pen for those cases where U needs to keep track of more objects than can fit into STM.

User U can clean the clusters output by the machine part in many different ways. In this paper, based on what we have seen users do in industry, we propose that U clean in two stages. The first stage splits the clusters recursively until all resulting clusters are "pure", i.e., each containing only the values of a single real-world entity (though often not all such values). The second stage then merges clusters that refer to the same entity.

Example 3.1.
Suppose the machine part produces clusters 1-2 in Figure 4. The split stage splits cluster 1 into clusters 3-4, cluster 2 into clusters 5-6, cluster 6 into clusters 7-8, then cluster 8 into clusters 9-10 (see the solid arrows). The output of the split stage is the set of pure clusters 3, 4, 5, 7, 9, 10. The merge stage then merges clusters 3 and 5 into cluster 11, and clusters 4 and 9 into cluster 12 (see the dotted arrows). The end result is the set of clean clusters 11, 12, 7, 10.
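The two-stage clean of Example 3.1 can be simulated in Python. This is a sketch of ours, not the paper's procedure: the `entity` oracle stands in for the user's judgment of which entity a value refers to (here a toy first-token rule), and the split stage is collapsed into direct grouping rather than the recursive GUI procedure described next.

```python
def split_stage(clusters, entity):
    """Split every cluster until each result is pure (one entity per cluster)."""
    pure = []
    for c in clusters:
        groups = {}
        for v in c:                        # per-value judgment, as the user makes
            groups.setdefault(entity(v), []).append(v)
        pure.extend(groups.values())       # each group is a pure cluster
    return pure

def merge_stage(pure_clusters, entity):
    """Merge pure clusters whose values refer to the same entity."""
    merged = {}
    for c in pure_clusters:
        merged.setdefault(entity(c[0]), []).extend(c)
    return list(merged.values())

entity = lambda v: v.split()[0].lower()    # toy oracle: first token, lowercased
clusters = [["Sony", "Sony Corp", "LG"], ["Lg", "Sony Inc"]]
out = merge_stage(split_stage(clusters, entity), entity)
print(sorted(map(sorted, out)))  # [['LG', 'Lg'], ['Sony', 'Sony Corp', 'Sony Inc']]
```

With a correct oracle, the output is the gold partition regardless of the input clustering, which is the intuition behind the correctness guarantee of the human part.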
We now describe the split stage (Section 3.2 describes the merge stage). First, we define the dominating entity of a cluster c to be the one with the most values in c (henceforth we use "value" and "string" interchangeably). Formally,

Algorithm 1: Split Phase

Procedure Split(C)
  Input: a set of clusters C = {c_1, ..., c_n}, output by machine
  Output: a set of clean clusters D = {d_1, ..., d_m} s.t. ∪_i d_i = ∪_i c_i
  D ← ∅
  for each cluster c ∈ C do
    D ← D ∪ SplitCluster(c)
  return D

Procedure SplitCluster(c)
  Input: a cluster c
  Output: a set of pure clusters G = {g_1, ..., g_k} s.t. ∪_i g_i = c
  if |c| = 1 then return {c}
  isPure(c)                  // at the end, user selects yes/no button
  if yes button is selected then return {c}
  findDom(c)                 // at the end, user knows e* and α,
                             // or selects "mark values" button
  if "clean mixed cluster" button is selected then
    return Merge(c)          // α < 0.1 in this case
  MarkValues(c, e*, α)       // at the end, user selects "create/clean new cluster"
                             // or "create new cluster / clean old cluster" button
  Move all marked values in c into a new cluster c'
  if "create/clean new cluster" button is selected then
    return {c} ∪ SplitCluster(c')    // α ≥ 0.5
  else
    return SplitCluster(c) ∪ {c'}    // α < 0.5

Procedure MarkValues(c, e*, α)
  Input: a cluster c, its dominating entity e* and purity α
  Output: a set of values in c, selected by the user
  Let L be the list of values in c, displayed on the GUI
  if α ≥ 0.5 then
    for i ← 1, ..., |c| do
      focus(L[i]); if not match(L[i], e*) then select(L[i])
  else
    for i ← 1, ..., |c| do
      focus(L[i]); if match(L[i], e*) then select(L[i])
Let G be a partition of cluster c into groups of values G_1, ..., G_m, such that all values in each group refer to the same real-world entity and different groups refer to different entities. Then the dominating entity of c is the entity of the group with the largest size: G_d = argmax_{G_i ∈ G} |G_i|. Henceforth, we will use dom(c) (or e* when there is no ambiguity) to denote the dominating entity of c.

In Figure 4, dom(cluster 1) and dom(cluster 2) are Sony Corporation. Cluster 6 has three candidates; we break the tie by randomly selecting one to be the dominating entity.

Let C be the set of clusters output by the machine part. Our key idea for the split stage is that if the machine part has been reasonably accurate, then any cluster c ∈ C is likely to be dominated by dom(c). If so, user U can clean c by moving all the values in c that do not refer to dom(c) into a new cluster c', then cleaning c', and so on.

Specifically, for each cluster c ∈ C, user U should (1) check if c is pure; if yes, stop; (2) otherwise find the dominating entity dom(c); (3) move all values in c that do not refer to dom(c) into a new cluster c'; then (4) apply Steps 1-3 to cluster c' (cluster c has become pure, so it needs no further splitting). This recursive procedure will split the original cluster c into a set of pure clusters. It is relatively easy for human users to understand and follow, and as we will see in Section 5, it is also highly amenable to cost analysis.

Example 3.3.
Given cluster 1 in Figure 4, user U splits it into the pure cluster 3, which contains only the values of the dominating entity Sony Corporation, and cluster 4, which contains all remaining values of cluster 1. A similar recursive splitting process applies to cluster 2. (Note that cluster 6 has three dominating-entity candidates, so we break the tie randomly and select Dell to be the dominating entity.)

We now optimize the above procedure in three ways. First, if c is a singleton cluster, then we do not invoke the above splitting procedure, because c is already pure. Second, there are cases where the number of values referring to dom(c) is less than 50% of |c|. Formally, we define

Definition 3.4 (Cluster purity).
The purity of a cluster c is the fraction of the values in c that refer to dom(c). Henceforth we will use p(c) (or α when there is no ambiguity) to denote the purity of c.

For example, in Figure 4, the purity of cluster 2 is 2/5 = 0.4 < 0.5. In such cases, instead of moving the values of c that do not refer to dom(c), as discussed so far, it is less work for the user to move the values that do refer to dom(c) into a new cluster c' (e.g., for cluster 2, U should move "Sonny" and "SONY Corp", instead of the other three values).

Finally, if p(c) is below a threshold, currently set to 0.1, then c is very "mixed", with each entity having less than 10% of the values. In this case, we have found that instead of splitting c, it is often more effective to apply the Merge procedure described in Section 3.2 to c. This produces a set of pure clusters that are then fed to the merge stage.

Basic User Operations:
To implement the above solution, we define the following five basic user operations:

• focus(a): User U moves his or her focus to a particular object a on the GUI or on the paper, such as a cluster, a value within a cluster, a GUI button, a number on the paper, etc. Intuitively, user U will shift his or her attention from one object to another on the GUI or the paper, and that incurs a certain amount of time. This operation is designed to capture this physical action (and its cost).

• select(a): User U selects an object a on the GUI (e.g., a cluster, a value, a GUI button, etc.) by moving the mouse pointer to that object and clicking on it, or by pressing a keyboard button (e.g., Page Up, Page Down). This operation is designed to capture this physical action (and its cost).

• match(x, y): Given two values, or a value and a real-world entity (in U's short-term memory), U determines if they refer to the same real-world entity.

• isPure(c): U examines cluster c to see if it is pure (i.e., if it is clean). Specifically, we assume the values in c are listed (e.g., on the GUI) as a list of values L. User U reads the first value of L, maps it to an entity e, then scans the values in the rest of L. As soon as U sees a value that does not refer to e, the cluster is not pure, so U stops and returns false. Otherwise U exhausts L and returns true.

• findDom(c): finds the dominating entity dom(c) and the purity p(c) of a cluster c. If |c| ≤ 7, the size of the short-term memory (STM), then U does this entirely in STM. Specifically, U scans the list of values in c, maps each value into an entity, and keeps track of the number of times U has encountered a particular entity. Then U returns the entity with the highest count n as the dominating one, and n/|c| as the purity of cluster c. If |c| > 7, U proceeds as above, but uses paper and pen to keep track of the counts of the encountered entities.
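The counting that findDom performs can be sketched in Python. This sketch is ours, not the paper's tool; the `entity` oracle below is a hypothetical stand-in for the user's judgment of which entity a value refers to.

```python
from collections import Counter

def find_dom(cluster, entity):
    """Return (dominating entity dom(c), purity p(c)) for a cluster."""
    counts = Counter(entity(v) for v in cluster)  # tally entities, as the user does
    dom, n = counts.most_common(1)[0]             # highest-count entity dominates
    return dom, n / len(cluster)

entity = lambda v: v.split()[0].lower()           # toy oracle: first token, lowercased
print(find_dom(["Sony", "Sony Corp", "Vizio Corp"], entity))
```

On this cluster the sketch returns Sony as the dominating entity with purity 2/3, since two of the three values refer to it.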
The Split Procedure:

Algorithm 1 describes Split, a procedure that uses the above five operations to implement the split stage. Split takes the set of clusters output by the machine part, then applies the SplitCluster procedure to each cluster. We distinguish two kinds of procedures: GUI-driven and human-driven. Split and SplitCluster are GUI-driven, i.e., executed by the computer. A GUI-driven procedure, e.g., SplitCluster, may call human-driven procedures, e.g., isPure and findDom, then pass control to user U to execute those procedures. To distinguish between the two, we underline the names of human-driven procedures.

Figure 5: An example of local merging.

Algorithm 1 shows that SplitCluster handles the corner case of singleton clusters (Step 1), then calls isPure and asks user U to take over (Step 2). At the end of this procedure, U would have selected either the "yes" or "no" button, indicating whether the cluster is pure or not. In the former case, SplitCluster terminates, returning the pure cluster (Step 3). Otherwise, it calls findDom (Step 4), and so on. Note that at the end of findDom, user U knows the dominating entity e* and the purity α, but the computer does not know these. Hence these quantities (and all quantities that only U knows) are shown as underlined, e.g., e*, α.

Given a set of pure clusters output by the split stage, the merge stage merges clusters that refer to the same entity. Clearly, from each cluster we can select just a single representative value (say the longest string), then merge those (if we know how to merge those, we can easily merge the original clusters). For example, in Figure 4 the split stage produces clusters 3, 4, 5, 7, 9, and 10. To merge them, it is sufficient to consider merging the values "Sony Corp", "Lg", "SONY Corp", "Dell", "LG", and "Apple". Henceforth we will consider this simpler problem of merging n values v_1, ..., v_n.

Naively merging by considering all pairs takes quadratic time. To address this problem, we propose a two-step process. First, U does one pass through the list of values to do a "local merging" that merges matching values that are near one another. This reduces n. Then U does a "global merging" that considers all pairs (of the remaining values). Both steps will exploit the parallel processing capability of short-term memory (STM).
We now describethese two steps.
Local Merging:
This step uses STM to merge matching values that are near one another. Specifically, first the set of values is sorted. Currently we use alphabetical sorting, because matching values often share the first few characters (e.g., IBM, IBM Corp). Figure 5.a shows such a sorted list L (ignoring the arrows for now).

Next, user U processes the values in L top down. For each value, U stores it and the associated entity in STM. For the sake of this example, assume STM can only store three such pairs. Figure 5.b shows a full STM after U has processed the first three values of the list. Then, when processing the 4th value, "Garmin", U needs to evict the oldest pair from STM to make space for "Garmin" (see Figure 5.c).

Then, when processing the 5th value, "Ge", U realizes that its entity, GE, is already in STM, associated with a previous value "GE". So U links "Ge" with "GE", and replaces the value "GE" in STM with the new value "Ge" (see Figure 5.d). Next, "IBM" will be stored in STM, displacing "Gamevice" (Figure 5.e), and so on. At the end, U has linked together certain matching values (see the arrows in Figure 5.a).

Algorithm 2: Local Merging

Procedure LocalMerge(L)
  Input: a list of values L sorted alphabetically
  Output: links among certain values in L that match
  for i ← 1, ..., |L| do
    (e, t) ← memorize(L[i])
    if e is not null then
      // L[i] maps to e; e is already in STM and associated with t
      select(L[i])
      focus(t); select(t)                  // t is the previous value
      focus(link button); select(link button)
  // at the end, the user selects the "done local merging" button

Algorithm 2 describes local merging, which uses the previously defined user operations focus(a) and select(a) (see Section 3.1), as well as the following new user operation:

• memorize(v): U maps the input value v to an entity e, then memorizes, i.e., stores, the pair (e, v) in STM. Specifically, if a pair (e, t) is already in STM, U replaces it with (e, v), then exits, returning (e, t) (see Line 2 in Algorithm 2). Otherwise, U adds the pair (e, v) to STM, "kicking" the oldest pair out of STM to make space if necessary.
Global Merging:

After local merging, the original list of values is consolidated, i.e., from each set of linked values we again select just a single representative value (e.g., the longest one). This produces a new, shorter list; e.g., consolidating the list in Figure 5.a produces the shorter list in Figure 6.a (ignoring the arrow). Let L = [v_1, ..., v_n] be this new shorter list. Naively, user U can compare v_1 with v_2, ..., v_n, then v_2 with v_3, ..., v_n, etc. A better solution, however, is to exploit the parallel processing capability of STM: read multiple values, say v_1, ..., v_k, into STM all at once, then compare them all in parallel with v_{k+1}, ..., v_n, etc.

Example 3.5.
Consider again the list in Figure 6.a. User U can read the first two values, "Big Blue" and "GE", into STM, then scan the rest of the values and match them with these two in parallel (using a GUI, see Figure 6.b). If there is a match, e.g., "IBM Corp" and "Big Blue", then U checks off the appropriate box (see Figure 6.b). At the end of the list, U pushes a button to link the matching values. Next, U reads into STM the next two values, "Gamevice" and "Garmin", then matches "IBM Corp" with these two. U detects no more matches, thus wrapping up the global merge. The system uses the results of both local and global merges to produce the final clusters shown in Figure 6.c.

In practice, even though STM can hold 7 objects [36], we found that users prefer to read only 3 values at a time into STM. First, 7 values often take up too much horizontal space on the GUI (especially if the strings are long), making it hard for users to comprehend. Second, users want to reserve some STM capacity to read and remember the values in the rows. As a result, we currently use k = 3.

Figure 6: An example of global merging.

Figure 7: How HAC and HAC with a limit on cluster size work on the same dataset.

• recall(v): U maps the input value v to an entity e, then checks to see if e is already in STM, returning e and the associated value if yes, and null otherwise.

The Merge procedure (Appendix A) implements the entire merge stage. It calls LocalMerge on the output of the split stage, then GlobalMerge on the output of LocalMerge. The following theorem (whose proof involves the verification sets of actions) shows the correctness of the human part:

Theorem 3.6. Let V be a set of strings to be normalized. Applying the Split followed by Merge procedures to any partition P of V produces a set of clusters with 100% precision and recall, assuming that the user correctly executes the operations, per their instructions.

We now discuss the machine part of VN plans, which applies an algorithm to cluster the input strings. Many algorithms can be used, e.g., string clustering, string matching (SM), and entity matching (EM). We first discuss using string clustering algorithms, specifically HAC (hierarchical agglomerative clustering). Then we discuss why existing SM/EM algorithms do not work well for our purposes.
We consider using a generic clustering algorithm in the machine part. Many such algorithms exist [27, 48]. For now, we consider hierarchical agglomerative clustering (HAC), because it is easy to understand and debug, can achieve good accuracy, and is commonly used in practice. To cluster a set of values, HAC initializes each value as a singleton cluster. Next, it finds the two clusters with the highest similarity score (using a pre-specified similarity measure), merges them, then repeats, until reaching a stopping criterion, e.g., the highest similarity score falling below a pre-specified threshold.
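A minimal single-link HAC of this kind can be sketched in Python. This is an illustrative sketch of ours, not the paper's implementation: the character-based similarity measure and the 0.8 threshold are arbitrary stand-ins for the pre-specified measure and stopping criterion.

```python
from difflib import SequenceMatcher

def sim(a, b):
    """A stand-in string similarity; any pre-specified measure would do."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def hac(values, threshold=0.8):
    """Single-link agglomerative clustering with a similarity-threshold stop."""
    clusters = [[v] for v in values]              # start from singletons
    while True:
        best, pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = max(sim(a, b) for a in clusters[i] for b in clusters[j])
                if s >= best:
                    best, pair = s, (i, j)
        if pair is None:                          # highest score below threshold
            return clusters
        i, j = pair
        clusters[i] += clusters.pop(j)            # merge the best pair, repeat

print(hac(["LG", "Lg", "Sony", "Sonny"]))  # [['LG', 'Lg'], ['Sony', 'Sonny']]
```

On this toy input the algorithm first merges "LG" with "Lg", then "Sony" with "Sonny", and stops once no remaining pair clears the threshold.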
Example 4.1.
Consider clustering the seven values in Figure 7.a. HAC may first cluster "LG" and "Lg" into a cluster c_1, then "Sony" and "Sonny" into c_2, then c_2 and "Sony Corp" into c_3, etc. The final result is the two clusters shown in Figure 7.a.

Problems with Large Mixed Clusters Produced by HAC:
Using HAC βas isβ however does not work well, because it oftenproduces large mixed clusters that are time consuming for user π to clean. Specifically, as HAC iterates, it grows bigger clusters.Initially, when these clusters are small, their quality is often quitegood, because they often group together syntactically similarvalues that refer to the same entity (e.g., βLGβ and βLgβ, or βSonyβand βSonnyβ, see Figure 7.a).As the clusters grow, however, they start attracting βjunkβ,e.g., cluster π attracts βIBM Corpβ (Figure 7.a). If the similaritymeasure used by HAC happens to be βliberalβ for the data setat hand, HAC often grows large clusters that are βmixedβ, i.e.,containing the values of multiple entities. It is very expensive for Figure 8: Cleaning (a) is less work than cleaning (b). user π to clean such clusters, using the Split and
Merge proceduredescribed in the previous section.
Proposed Solution:
Ideally, HAC should stop before its clusters become too "mixed". If HAC's clusters are smaller but pure (e.g., Figure 8.a), then U mostly just has to merge these clusters using a few mouse clicks. However, if HAC's clusters are larger but "mixed" (e.g., Figure 8.b), then U would need to split them up into pure clusters before merging them. This incurs far more mouse clicks and thus far more work.

Of course, we do not know when to stop HAC. To address this problem, we introduce multiple HAC variations, each stopping at a different time, then try to select a good one. Specifically, to cluster n values, we consider n HAC variations, where the i-th variation, denoted HAC(i), limits the cluster size to at most i. In each iteration, HAC(i) finds the two clusters c and d with the highest similarity score, then merges them if |c ∪ d| ≤ i. Otherwise, HAC(i) finds the two clusters with the next highest score, and merges them if the resulting size is at most i, and so on. HAC(i) terminates when it cannot find any more clusters to merge.

Example 4.2. Consider applying HAC(2) to the values in Figure 7.a. HAC(2) first forms the clusters {"LG", "Lg"} and {"Sony", "Sonny"}, exactly as the normal HAC. Then the normal HAC goes on to form the larger cluster in Figure 7.a, but HAC(2) cannot, because the resulting size would exceed 2. Instead, HAC(2) finds the next two clusters with the highest similarity score. Suppose these are the singleton clusters for "Sony Corp" and "Sony Inc". Then HAC(2) merges them to form a new cluster in Figure 7.b. At this point HAC(2) cannot form any more clusters, because any resulting cluster size would exceed 2. So it stops, returning the clusters in Figure 7.b as the output.

HAC(1) produces the smallest but cleanest clusters (as they are singletons). As we increase i, HAC(i) tends to produce bigger but less clean clusters. Typically, there exists an i* such that HAC(i*)'s clusters are still clean enough to help user U, but HAC(i*+1)'s clusters are already "too dirty" to help (e.g., U would need to split them extensively before he or she can merge). This roughly corresponds to the point where we want HAC to stop. HAC(i*) thus is the "best" HAC variation for the current data set.

To find HAC(i*), we pair HAC(1), ..., HAC(n) with Split and
Merge to form n end-to-end plans. Sections 5 and 6 show how to estimate the costs of these plans and find the one with the least estimated cost.

We are now in a position to explain why existing string matching (SM) and entity matching (EM) solutions do not work well in our context. (Section 8 shows experimentally that Winston with HAC outperforms these solutions.)

At its core, VN is an SM problem. So SM solutions can be used in the machine part. EM solutions can also be used, by limiting each entity to be a string. Many such solutions have been developed, e.g., TransER [47], Magellan [32], Falcon [13], Waldo [44] (see Section 9).

These SM/EM solutions (e.g., Magellan, Falcon) typically output a set of matches. One way to use them is to ask user U to verify certain matches, then infer even more matches using match transitivity. For example, given 5 strings a, b, c, d, e, suppose a solution outputs a = b and b = c as matches. If U has verified these matches, then we can infer a = c as another match. A recent work, TransER [47], exemplifies this approach.
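To make match transitivity concrete, here is a small sketch (our illustration, not TransER's code) using union-find: the verified matches a = b and b = c let us infer a = c, but no amount of transitivity over those outputs ever yields d = e:

```python
class UnionFind:
    # Minimal union-find that closes a set of verified matches under transitivity.
    def __init__(self, items):
        self.parent = {x: x for x in items}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

    def matched(self, x, y):
        return self.find(x) == self.find(y)

uf = UnionFind(["a", "b", "c", "d", "e"])
for x, y in [("a", "b"), ("b", "c")]:  # matches verified by the user
    uf.union(x, y)

print(uf.matched("a", "c"))  # prints True: inferred by transitivity
print(uf.matched("d", "e"))  # prints False: never inferable from these outputs
```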
A serious problem, however, is that we cannot guarantee 100% recall, as shown experimentally in Section 8. For example, no user verification and match transitivity on the outputs a = b, b = c can help us infer d = e (assuming this is also a correct match). Thus, these solutions are not appropriate for Winston.

Another way to use existing SM/EM solutions is to cluster the input strings in a way that respects the output matches. The work [43] describes multiple ways to do this. Continuing with the above example, given the output matches a = b, b = c, we can cluster the five input strings into, say, 3 clusters {a, b, c}, {d}, {e}. User U can verify/clean these clusters, as discussed in the human part. A serious problem here, however, is that this approach often produces large mixed clusters, which are very time consuming for U to clean, as shown experimentally in Section 8. With HAC, we solve this problem by modifying HAC to stop early to produce clean clusters (see Section 4.1). But there is no obvious way to modify the clustering algorithms in [43] to stop early such that they produce relatively clean clusters whose quality can be estimated (e.g., see Section 5).

The above works provide no GUI, or only very basic, inefficient GUIs for user feedback, e.g.,
Falcon and
TransER ask users to label string pairs as match/non-match. A recent work, Waldo [44], considers a far more efficient GUI, which displays 6 strings so that a user can cluster all of them in one shot. As such, its "human" part is more similar to ours. But its "machine" part considers a very different optimization problem: minimizing crowdsourcing cost (e.g., clustering 6 strings incurs the same monetary cost, regardless of which human user does it). Thus, it cannot be used in Winston, which focuses on minimizing the effort of human users.
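Returning to the size-capped variant HAC(i) of Section 4.1, a minimal sketch (our illustration, reusing an assumed bigram-Jaccard, single-linkage similarity): the only change to plain HAC is that candidate merges are considered in decreasing similarity order and a merge is skipped whenever the resulting cluster would exceed the cap i.

```python
from itertools import combinations

def _bigrams(s):
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)} or {s}

def _sim(c1, c2):
    # Illustrative single-linkage bigram-Jaccard similarity between clusters.
    def jac(a, b):
        return len(a & b) / len(a | b)
    return max(jac(_bigrams(x), _bigrams(y)) for x in c1 for y in c2)

def hac_capped(values, cap, threshold=0.3):
    # HAC(i): consider candidate merges in decreasing similarity order and
    # perform the best one whose combined cluster size is at most `cap`.
    clusters = [[v] for v in values]
    while True:
        pairs = sorted(
            (((a, b), _sim(clusters[a], clusters[b]))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda p: -p[1])
        merged = False
        for (a, b), score in pairs:
            if score < threshold:
                break  # no remaining pair is similar enough
            if len(clusters[a]) + len(clusters[b]) <= cap:
                clusters[a] = clusters[a] + clusters[b]
                del clusters[b]
                merged = True
                break
        if not merged:
            return clusters
```

On the Example 4.2-style input, `hac_capped(["LG", "Lg", "Sony", "Sonny", "Sony Corp"], 2)` forms {"LG", "Lg"} and {"Sony", "Sonny"} but leaves "Sony Corp" alone, since merging it anywhere would exceed the cap of 2.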
We now discuss estimating the cost of a plan, which is the total time user U spends in the human part to clean the clusters output by the machine part. As we will see below, the key idea is to estimate the quality of these clusters, then use that to estimate the time needed to clean them.

Specifically, let V = {v_1, ..., v_n} be the set of input values, and P_1, ..., P_n be the plans that we will consider, where each plan P_i applies HAC(i) to V to obtain a set of clusters C_i, then employs a user U to clean C_i, using Split and Merge. Let C_i = {c_1, ..., c_{m_i}}. Then the cost of P_i (i.e., the time for U to clean C_i) can be expressed as

cost(P_i) = Σ_{j=1}^{m_i} time(SplitCluster(c_j)) + time(LocalMerge(L)) + time(GlobalMerge(M)),

where L is a list of values summarizing the output of SplitCluster, and M is a list summarizing the output of LocalMerge. We now estimate these quantities.
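One building block used repeatedly in these estimates is the number of splits β_j a cluster of size n_j and purity α_i is expected to need, min(n_j − 1, ⌈−log n_j / log(1 − α_i)⌉), derived below under the geometric-shrinkage assumption. A small sketch (ours, for illustration):

```python
import math

def estimated_splits(n, alpha):
    # Under the assumption that each split peels off the dominating alpha
    # fraction, the remainder shrinks geometrically: (1 - alpha)^beta * n = 1.
    # A cluster of size n can also be split at most n - 1 times.
    if n <= 1:
        return 0
    return min(n - 1, math.ceil(-math.log(n) / math.log(1 - alpha)))
```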
Estimating the Cost of SplitCluster: We need to estimate time(SplitCluster(c_j)) for each cluster c_j ∈ C_i. To do this, we make two assumptions:

(1) All clusters c_1, ..., c_{m_i} produced by P_i have the same cluster purity α_i (which is defined in Definition 3.4).

(2) When we use SplitCluster to split a cluster c_j (produced by P_i) into a pure cluster containing all values of the dominating entity and a "mixed" cluster containing all the remaining values, the "mixed" cluster also has purity α_i. When we split this "mixed" cluster, the resulting "mixed" cluster also has purity α_i, and so on.

Table 1: Parameters for our cost models.

Parameter | Meaning | Model | Estimation Method
α_i | Cluster purity for HAC(i) | α_i = a·i^b | User feedback
t_f | Cost of focus(a) | A constant | Set to 0.5
t_s | Cost of select(a) | A constant | Set to 0.5
t_m | Cost of match(x,y) | A constant | User feedback
t_p | Cost of isPure(c) | t_p(n, α) = γ₁·α·n + γ₂ | User feedback
t_d | Cost of findDom(c) | t_d(n) = δ₁·n if n ≤ |STM|, δ₂·n + δ₃ otherwise | User feedback
t_z | Cost of memorize(v) | A constant | Set to 0.4
t_r | Cost of recall(v) | A constant | Set to t_z
s | Shrinkage factor for local merge | A constant | Set to 0.98
h | Hit factor for global merge | A constant | Set to 0.1

These are obviously simplifying assumptions. However, they reflect the intuition that each plan HAC(i) produces clusters of a certain quality level, and that this quality level can be captured by a single number, α_i, the purity of all the clusters. Further, they allow us to efficiently estimate plan costs. Finally, Section 8 empirically shows that with these assumptions we can already find good plans.

Next, we use the above assumptions to estimate time(SplitCluster(c_j)). Suppose that we already know α_i (we show how to estimate α_i later in this section), and that α_i ≥ 0.5. In the first split, using SplitCluster, user U creates two clusters: a pure dominating cluster c_{j,w} of size α_i·n_j, where n_j is the size of c_j, and a remainder cluster c_{j,u} of size (1 − α_i)·n_j. If |c_{j,u}| > 1, we assume that its purity is also α_i. U then splits c_{j,u}, etc. After β_j splits, U has created β_j + 1 clusters of sizes α_i·n_j, α_i(1 − α_i)·n_j, ..., α_i(1 − α_i)^{β_j − 1}·n_j, (1 − α_i)^{β_j}·n_j, such that the last cluster has a single element. Solving (1 − α_i)^{β_j}·n_j = 1 lets us estimate β_j as ⌈−log n_j / log(1 − α_i)⌉. Since we can split a cluster of size n_j at most n_j − 1 times, β_j = min(n_j − 1, ⌈−log n_j / log(1 − α_i)⌉).

Recall that Section 3 defines seven user operations: focus(a), select(a), match(x,y), isPure(c), findDom(c), memorize(v), and recall(v) (see Table 1). Let t_f, t_s, t_m, t_p, t_d, t_z, and t_r be their costs (i.e., times), respectively. As we will see later, the cost t_p of isPure(c) is a function of n, the size of c, and α, the purity of c. Hence, abusing notation, we will denote this cost as t_p(n_j, α_i) for cluster c_j. Similarly, the cost t_d of findDom(c) is a function of the size of c, and will be denoted as t_d(n_j) for cluster c_j. The remaining five costs (e.g., t_f, t_s, etc.) will be constants. Now we can estimate time(SplitCluster(c_j)) where α_i ≥ 0.5:

cost_SplitCluster(c_j; α_i ≥ 0.5) = Σ_{k=1}^{β_j} [ t_p((1 − α_i)^{k−1}·n_j, α_i) + t_f + t_s + t_d((1 − α_i)^{k−1}·n_j) + t_f + t_s + ((1 − α_i)^{k−1}·n_j)·(t_f + t_m + (1 − α_i)·t_s) + t_f + t_s ].

Appendix B discusses deriving the above formula, and computing the cost for the remaining cases of smaller α_i.

Estimating the Cost of LocalMerge:
Recall that HAC(i) produces the set of clusters C_i = {c_1, ..., c_{m_i}}, and that β_j is the total number of splits user U performs in SplitCluster for each cluster c_j. Then at the end of the split phase, U has produced a set of q_i pure clusters, where q_i = Σ_{j=1}^{m_i} (β_j + 1). Assuming that executing LocalMerge on any list will shrink its size by a factor of s (currently set to 0.98), we can estimate the time of executing LocalMerge on the output of the split phase as q_i·t_z + q_i·(1 − s)·(t_f + t_s) + t_f + t_s (see Appendix B for an explanation).

Estimating the Cost of GlobalMerge:
LocalMerge produces q'_i = s·q_i pure clusters to which user U will apply GlobalMerge. Recall that GlobalMerge takes the first three values in the input list L, displays them in three columns, then asks U to go through the rest of the values of L and check a box if any value matches the values of the columns (see Figure 6), and so on. We assume that in each such iteration, for each column, a fraction h of the values will match, resulting in h·q'_i checkboxes being marked per column. Then we can estimate the cost of GlobalMerge as

Σ_{k=1}^{⌈1/(3h)⌉} [ 3·t_z + (q'_i − 3(k−1)·h·q'_i − 3k)·t_r + (3h·q'_i)·(t_f + t_s) + t_f + t_s ]

(see Appendix B).

Estimating the Cluster Purity α_i: Recall that we assume all clusters c_1, ..., c_{m_i} produced by HAC(i) have the same cluster purity α_i. Using set-aside datasets, we found that α_i could be estimated reasonably well using a power-law function a·i^b (where b is negative, see Table 1). To estimate a and b, we compute α_10 and α_20. To compute α_10, we apply HAC(10) to the set of input values to obtain a set C of clusters. Next, we randomly sample 3 clusters of size 10 from C (if there are fewer than 3 such clusters, we select the three largest). Next, we show each cluster to user U, ask him/her to identify all values referring to the dominating entity, then use those to compute the cluster purity. Finally, we take the average purity of these clusters to be α_10. We proceed similarly to compute α_20.

We now have three data points: (1, 1), (10, α_10), and (20, α_20), which we can use to estimate a and b in the function a·i^b, using the ordinary least-squares method.

Estimating the Costs of User Operations:
Finally, we estimate the costs of the seven user operations (see Table 1). The costs of focus(a) and select(a) measure the times user U focuses on an object a then selects it (e.g., by clicking a mouse button). After a number of timings with various users, we found that these times are roughly the same for most users, and we set them to t_f = t_s = 0.5 and t_z = t_r = 0.4.

The cost t_m of match(x,y), however, while largely not dependent on x and y, does vary depending on user U. Further, estimating the time t_p of isPure(c) and the time t_d of findDom(c) is significantly more involved. To determine whether a cluster c is pure, user U needs to examine at most α·n values in c (where n is the size of c) before he/she sees the first value not referring to dom(c). Hence, we model the time of isPure(c) as t_p(n, α) = γ₁·α·n + γ₂. To find the dominating entity of cluster c, we distinguish two cases. If n ≤ |STM|, then user U can execute findDom(c) entirely in U's short-term memory. In this case the time is proportional to n. Otherwise U needs to use paper and pen, and we found that the time still roughly correlates with n. Thus, we model the time t_d(n) of findDom(c) as δ₁·n if n ≤ |STM| and as δ₂·n + δ₃ otherwise.

All that is left is to estimate the cost t_m of match(x,y), and the parameters γ₁, γ₂, δ₁, δ₂, δ₃ of the cost models of isPure(c) and findDom(c). To do so, when running HAC(20) (to estimate the cluster purity), we also ask user U to perform a few match, isPure, and findDom operations, then use the recorded times to estimate the above quantities (see Appendix B). Altogether, the time it takes for users to calibrate the cluster purity α_i and the cost models of user operations was mere minutes in our experiments (and was included in the total time of our solution).

Recall that to cluster the values V = {v_1, ..., v_n}, we consider n plans P_1, ..., P_n, where each plan P_i applies HAC(i) to V to obtain a set of clusters C_i, then employs a user U to clean C_i.
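The power-law purity calibration described above, fitting a·i^b through (1, 1), (10, α_10), (20, α_20) by ordinary least squares, can be sketched in log-log space; the sample purities below are made up for illustration:

```python
import math

def fit_power_law(points):
    # Ordinary least squares for alpha = a * i**b, done in log-log space:
    # log(alpha) = log(a) + b * log(i).
    xs = [math.log(i) for i, _ in points]
    ys = [math.log(alpha) for _, alpha in points]
    n = len(points)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - b * mx)
    return a, b

# Hypothetical calibration: purity 1.0 at i = 1, 0.8 at i = 10, 0.7 at i = 20.
a, b = fit_power_law([(1, 1.0), (10, 0.8), (20, 0.7)])
```

With purities decreasing in i, the fitted exponent b comes out negative, matching the model in Table 1.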
We now discuss how to efficiently find the plan P_{i*} with the least estimated cost. Naively, we can (1) execute HAC(i) for each plan P_i to obtain C_i, (2) apply the cost estimation procedures in the previous section to C_i to compute the cost of P_i, then (3) return the plan with the lowest cost. Steps 2-3 take negligible time. Step 1, however, applies HAC(1), ..., HAC(n) separately to V, which altogether can take a long time, e.g., 7.3 minutes for |V| = 480 and 1.1 hours for |V| = 960 in our experiments.

To address this problem, we have developed a solution to jointly execute HAC(1), ..., HAC(n), such that executing a plan can reuse the intermediate results of executing a previous plan. Specifically, we first execute HAC(n), i.e., the regular HAC. Recall that each iteration Iter_t of HAC(n) merges two clusters. Let S(t) be the size of the largest cluster at the end of Iter_t. Suppose there is a t such that S(t) ≤ i but S(t+1) > i. Then we know that HAC(i) can reuse everything HAC(n) has produced up to Iter_t, but cannot proceed to Iter_{t+1}. So at the end of Iter_t we save certain information for HAC(i) (e.g., the merge commands so far, the value i), then continue with HAC(n). Once HAC(n) is done, we go back to each saved point and resume HAC(i) from there. This strategy enables great reuse, especially for high values of i, e.g., slashing the time for 960 values from 1.1 hours to 18 secs.

Putting It All Together:
We can now describe the entire Winston system, as used by a single user U. Given a set of values V to normalize, (1) Winston first calibrates the cluster purity α_i and the cost models. To do so, it runs HAC(10) and HAC(20), asks user U to perform a few basic tasks on sample clusters from these algorithms, then uses U's results to calibrate (see Section 5). (2) Winston runs the above search procedure to find a plan P_{i*} with the least estimated cost. (3) Finally, Winston sends the output clusters of P_{i*} to user U to clean, using the procedures Split and Merge.

So far we have discussed how
Winston works with a single user. In practice, however, multiple users (e.g., people on the same team) are often willing to jointly perform VN. We now discuss how to extend Winston to divide the work among such users, to speed up VN.

Consider the case of 3 users. Naively, we can divide the set of input strings into 3 equal parts, ask each user to apply Winston to perform VN for a part, then combine the three outputs to form a set of clusters. We can obtain a canonical string from each cluster, producing a new list of strings. Then we can divide this new list among the 3 users, repeat the process, and so on. This naive solution however does not work well, because it often spreads matching strings, i.e., those belonging to a golden cluster, among all 3 users, causing much additional work in matching across the individual lists in later steps.

Intuitively, strings within a golden cluster should be assigned to a single user, as much as possible. We have extended Winston to realize this intuition. In the extension, Winston first briefly interacts with each user to learn his/her profile. Next, it uses these profiles to search a plan space to find a good VN plan. Next, it executes the machine part of this plan to produce a set C of clusters. It then divides C among the users, such that each will have roughly the same workload. The intuition here is that a cluster in C captures many strings that belong to the same golden cluster, and is assigned to a single user. Next, Winston asks each user to use Split and
Merge to clean the assigned clusters.

Table 2: Datasets for our experiments.

Name | Size | Description | Sample Values
Nickname | 5132 | Nicknames and some of their typos | "Cissy", "Fanny", "Frannie"
Citation | 3000 | Article citations from Google Scholar and DBLP | "caching technologies for web applications c mohan vldb 2001"
Life Stage | 199 | Target life stage(s) of products | "Maternity", "Mothers", "Youth|Young Professionals"
Big Ten | 74 | Names of Big Ten Conference colleges | "University of Iowa", "UIowa", "UM Twin Cities"

Finally, it obtains the set of (cleaned) clusters from all users, then repeatedly performs a distributed version of
GlobalMerge until all clusters have been verified and cleaned. Appendix C describes the algorithm in detail, provides the pseudo code, and discusses cost estimation procedures for this version of Winston.

We now evaluate Winston. Among others, we show that Winston can significantly outperform existing solutions, that it can leverage multiple users to drastically cut VN time, and that it can scale to large datasets.
We first compare Winston with state-of-the-art manual and clustering solutions (Section 8.3 considers string/entity matching solutions). We use the four datasets in Table 2, obtained online and from VN tasks at a company. For each dataset we manually created all correct clusters, to serve as the ground truth. (We consider larger datasets later in Section 8.4.)
The Existing Solutions: We consider four solutions: Manual, Merge, Quack, and OpenRefine. Manual is the typical manual method that we have observed in industry. It can be viewed as performing the GlobalMerge method (Section 3.2). Merge is our own manual VN method, which performs LocalMerge then GlobalMerge.

Quack is a string clustering tool used extensively for VN at a company. It also uses HAC, like Winston, but does not place a limit on the cluster size. We extended Quack by asking the user to clean the clusters using Split and Merge. Merge and Quack can be viewed as the two plans HAC(1) and HAC(n) in the plan space explored by Winston (where n is the number of values to be normalized).

OpenRefine is a popular open-source tool to wrangle data [1]. It uses several string clustering algorithms to perform VN [2]. Among these, the most effective one appears to be KNN-based clustering [2]. We extend this algorithm to work with Split and Merge (because the GUI provided by OpenRefine is very limited).
Results: Table 3 shows the times of Winston vs. the above four methods (in minutes), using a single user. For each method we measure the total time the user spends cleaning the clusters (for Winston this includes the calibration time). It is difficult to recruit a large number of real users for these experiments, because cleaning some datasets (e.g., Nickname) would take a few working days. So we use synthetic users, and each data point here is averaged over 100 such users; see Appendix D. (We use real users to "sanity check" these results in Section 8.4.)

The table shows that Manual performs worst, incurring 3-6800 minutes. Merge performs much better, especially on the two large datasets, incurring 4-1961 minutes, suggesting that performing a local merge before a global merge is important. Merge is clearly the manual method to beat.

Quack is a bit faster than Merge on Nickname (1808 vs. 1961), but slower on the remaining three datasets. OpenRefine's performance is very uneven. It is a bit faster than Merge on Citation, but far slower on the other three datasets.

Table 3: Winston vs. four existing solutions.

Dataset | Manual | Merge | Quack | OpenRefine | Winston | Savings
Nickname | … | … | … | … | … | …
Citation | … | … | … | … | … | …
Life Stage | 13 | 9 | 15 | 12 | 9 | 0 hrs
Big Ten | … | … | … | … | … | …
Table 4: The times of Winston with multiple users.

Dataset | 1 User | 3 Users | 5 Users | 7 Users | 9 Users
Nickname | 1512 | 890 | 610 | 460 | 412
Citation | 1112 | 428 | 278 | 212 | 177
Life Stage | 9 | 6 | 6 | 6 | 6
Big Ten | 7 | 4 | 4 | 4 | 4
In contrast, Winston performs much better than Merge. On Nickname it saves 7.5 hours of user time (see the last column). On Citation it saves 3.34 hours of user time. On Life Stage it is comparable to Merge, and on Big Ten it is only 3 mins worse (due to the overhead of user calibration time).

Winston also outperforms both Quack and OpenRefine. Importantly, in all cases where Quack or OpenRefine performs worse than Merge, Winston is able to select a good plan which allows it to outperform Merge.

We have shown that Winston outperforms existing manual and clustering methods. We now examine how Winston can leverage multiple users to reduce VN time. Table 4 shows that Winston can leverage multiple users to drastically cut the VN time, e.g., from 1512 minutes with 1 user to 412 with 9 users for Nickname, and from 1112 to 177 for Citation. The most significant reduction is achieved early, e.g., going from 1 to 3-5 users. After that, adding more users still helps reduce the VN time, but only in a "diminishing-returns" fashion.
We now compare Winston to existing string matching (SM) and entity matching (EM) solutions, specifically TransER [47], Falcon [13], and Magellan [32].

Comparing with TransER: As discussed in Section 4.2, there are two main ways to use SM/EM solutions in our context. First, a solution can produce a set of matches M, employ a user U to verify certain matches in M, then use match transitivity to infer even more matches. The work [47] describes such a solution, which we call TransER.

The main problem, as discussed in Section 4.2, is that such solutions cannot guarantee 100% recall. Consider TransER, which declares two strings matched if their Jaccard similarity score is at least a threshold α. Assuming a perfect user U who does not make mistakes when verifying matches, Figure 9 shows the recall of TransER on our four datasets as we vary α. It shows that to reach 100% recall, α must be set to less than 0.08. But that would produce a huge number of matches (almost the entire Cartesian product), which would require a huge amount of effort from the user to verify. In such cases, it is not difficult to show that TransER would perform worse than
Merge.

Comparing with Falcon and Magellan:
The second way to use current SM/EM solutions is to produce the matches, then group them into clusters. To examine this approach, we use Falcon [13] and Magellan [32]. A recent work (name withheld for anonymous reviewing) has adapted Falcon to SM, and shown that it outperforms existing SM solutions. Thus, Falcon can be viewed as a state-of-the-art SM solution. Magellan, on the other hand, can be viewed as a state-of-the-art EM solution. To learn a matcher, both
Falcon and
Magellan require the user to label a set of pairs as match/non-match. In Magellan the user can also debug the matcher to improve its accuracy.

Figure 9: Recall of TransER for varying threshold α.

Table 5: The human times of Falcon.

Dataset | 1 User | 3 Users | 5 Users | 7 Users | 9 Users | Labeling
Nickname | 1930 (418) | 1519 (629) | 1210 (600) | 958 (498) | 807 (395) | 14
Citation | 1114 (2) | 1302 (874) | 779 (501) | 562 (350) | 459 (282) | 19
Life Stage | 22 (13) | 21 (15) | 21 (15) | 22 (16) | 22 (16) | 20
Big Ten | 21 (14) | 21 (17) | 21 (17) | 21 (17) | 21 (17) | 20

Table 6: The human times of Magellan.

Dataset | 1 User | 3 Users | 5 Users | 7 Users | 9 Users | Labeling & Debugging
Nickname | 1482 (-30) | 1062 (172) | 788 (178) | 646 (186) | 563 (151) | 89
Citation | 1150 (38) | 900 (472) | 599 (321) | 481 (269) | 393 (216) | 109
Life Stage | 85 (76) | 85 (79) | 85 (79) | 85 (79) | 85 (79) | 84
Big Ten | 85 (78) | 85 (81) | 85 (81) | 85 (81) | 85 (81) | 84

Once
Falcon and
Magellan have produced the matches, we use the Markov clustering in [43] to partition the input strings into clusters that are consistent with the matches. Finally, we ask one or more users to clean the clusters using Split and Merge.

Table 5 shows the human time for Falcon on the four datasets. For example, the first cell "1930 (418)" means that for 1 user, Falcon incurs 1930 mins of human time, 418 mins more than Winston. This time includes the labeling time (14 mins, shown in the last column). The table shows that Winston outperforms Falcon in all cases, reducing human time by 2-874 mins. The larger the dataset, the larger the gain, e.g., more than 14.5 hours on Citation, using 3 users.

Table 6 shows the human time for Magellan on the four datasets. The meaning of the table cells here is similar to those for Falcon. The table shows that Winston outperforms Magellan in all cases, reducing human time by 38-472 mins, except in the case of 1 user for Nickname, where it is slower by 30 mins (the -30 entry in Table 6).

The above time includes labeling and debugging (the last column). Interestingly, even if we ignore the labeling and debugging time, Winston still outperforms Magellan by a large margin in all cases requiring 3, 5, 7, and 9 users, for Nickname and Citation. It is slower only in the case of 1 user, by 119 mins for Nickname and 71 mins for Citation. Thus, overall Winston outperforms Magellan. In addition, Winston is suitable for lay users, whereas Magellan requires the user to have expertise in EM and machine learning.

A major reason for the worse performance of Falcon and Magellan is that they often produce large mixed clusters. For example, Magellan produces clusters of up to 314 strings coming from 137 real-world entities on Nickname, and clusters of up to 98 strings coming from 62 real-world entities on Citation. Clearly, it is very time consuming for the user to clean such clusters. In contrast, Winston selects VN plans that produce clusters of only up to 20 strings, which are much easier for the user to understand and clean.

"Sanity Check" with Real Users:
We want to "sanity check" our results so far using real users. Extensive checking is very difficult because it is hard to recruit real users for these time-consuming experiments. As a result, we carried out a limited checking. Specifically, we performed stratified sampling to obtain a Nickname sample of 316 values and a Citation sample of 343 values. On each sample we recruited multiple real users and asked them to perform Merge, Winston, and Winston with 3 users, taking care to minimize user bias.

Figure 10: "Sanity check" with real users.

The right side of Figure 10.a shows the results for Nickname. For comparison purposes, the left side of the figure shows the times with synthetic users. Figure 10.b shows similar results for Citation. The figures show that "Simulation" approximates "Real User" quite well. In both cases, the ordering of the methods is the same. Further, the results show that Winston can do much better than Merge, and the 3-user version in turn can do much better than single-user Winston. While limited, this result with real users does provide some anecdotal support for our simulation findings.
Finding Good Plans: Table 7 shows that Winston finds good plans. Consider Nickname. Recall that we ran 100 synthetic users for this dataset. For each user, Winston estimated the costs of all plans, then selected the plan with the least estimated cost. Knowing the gold clusters, however, we can simulate how that user executes each plan and thus compute the plan's exact cost. This allows us to find the rank of the selected plan on the list of all plans sorted by increasing cost, as well as the time difference between the selected plan and the best plan.

The first row of Table 7 shows this information. Here, Winston considered a space of 100 plans. For all 100 users, it selected the plan ranked 2nd. The difference between this plan and the best plan, however, is just 3-4 mins (over 100 users). The next two cells show the average/min/max times of the best plan, and the average/min/max difference in percentage. The remaining rows are similar. Thus, Winston did a good job. In many cases, it selected top-ranked plans, and most importantly, all the selected plans differ in time from the best plans by only 0-14% (see the last column).

Table 7: The quality of the plans found by Winston.

Dataset | Picked Plan Rank (Freq) | Size of Plan Space | Time Diff to Best Plan | Time of Best Plan | Diff in %
Nickname | … | … | … | … | …
Citation | … | … | … | … | …
Life Stage | … | … | … | … | …
Big Ten | … | … | … | … | …

Scaling to Large Datasets: Finally, we examine how Winston scales to large datasets. Table 8 shows the estimated cleaning time of Merge, Quack, Winston, and Winston with 3 users, for synthetic datasets of various sizes. The table shows that Merge is not practical, taking 29 days, 4.4 years, and 11.5 years for 100K, 500K, and 1M strings, respectively. Quack is better, but still incurs huge times.

Table 8: Cleaning times vs. dataset sizes.

Dataset Size | Merge | Quack | Winston | Winston (3 users)
10K | 22 | 22 | 13 | 4
100K | 231 (29d) | 199 | 107 (13d) | 34 (4.2d)
500K | 9415 (4.4y) | 3231 | 1688 (9.6m) | 387 (2.2m)
1M | 24449 (11.5y) | 19971 | 2710 (1.3y) | 618 (3.5m)

Winston, in contrast, can reduce these times drastically, to just 13 days, 9.6 months, and 1.3 years, respectively. As discussed in Section 1, this is because Winston provides a better UI, so the user can do more with less effort. Further, the machine part of Winston outputs clusters that are "user friendly", i.e., requiring little effort for the user to clean. Finally, Winston searches a large space of plans to find one with minimal estimated human effort. Winston with 3 users does even better, cutting the times to clean 500K and 1M strings to just 2.2 and 3.5 months, respectively. These suggest that cleaning large datasets with Winston indeed can be practical, especially by dividing the work among multiple users.
Data Cleaning:
Data cleaning has received enormous attention(e.g., [3, 5, 9, 11, 14, 16, 17, 20β24, 26, 30, 31, 33, 34, 37β39, 47]). See[10, 12, 15, 40] for recent tutorials, surveys, and books. However,as far as we can tell, no published work has examined the problemof cleaning with 100% accuracy, as we do for VN in this paper.
Our work here shows that the problem of cleaning to reach a desiredlevel of accuracy raises many novel challenges for data cleaning.
Value Normalization:
Much work has addressed VN, typically under the name "synonym discovery". Most solutions use string/contextual similarities to measure the relatedness of values [8, 49], and employ various techniques, e.g., clustering, regular expressions, learning, etc. [35, 49], to match values. However, no work has examined verifying and cleaning VN results to reach 100% accuracy, as we do here.
Clustering:
Our work is related to clustering (which we use in VN). Numerous clustering algorithms exist [18, 27, 48], but we are not aware of any work that has developed a human-driven procedure to clean up clustering output and tried to minimize the human effort of this procedure. Much work has also tuned clustering (e.g., [6, 7]), but for accuracy. In contrast, our work can be viewed as tuning clustering to minimize the post-clustering cleaning effort.
String/Entity Matching for the βMachineβ Part:
At the core, VN is a matching problem, and hence string matching (SM) and entity matching (EM) solutions can be used in the "machine" part. Numerous such solutions have been developed (e.g., TransER, Falcon, Magellan, Waldo, and more [13, 32, 44, 47]). We have discussed in Section 4.2 and experimentally validated in Section 8.3 that these methods do not work well in our context. The main reason is that they generate large mixed clusters that are very time consuming for users to clean. This result suggests that when we combine a machine part with a human part, it is important to develop the machine part such that it generates results that are "user friendly" for the user in the human part to work with.
User Interaction Techniques for the βHumanβ Part:
Many recent works on string/entity matching and crowdsourcingsolicit user feedback/action via GUIs to verify and further clean(e.g.,
CrowdDB, CrowdER, and more [19, 20, 44-47]). These works, however, allow only a limited range of user actions (e.g., asking users if two tuples match). A recent work, Waldo [44], considers more expressive user actions, such as showing six values on a single screen and allowing the user to cluster all six in "one shot". The above works differ from Winston in two important ways. First, the range of user actions that they allow is still quite limited. In contrast, Winston considers far more expressive user actions, such as splitting a cluster, merging two clusters, etc. Second, the above works do not explicitly model the human effort of the user actions and do not seek to minimize the total human effort, as Winston does. For example, they model the cost of labeling a value pair or clustering six values as a fixed value (e.g., 3 cents paid to a crowd worker), regardless of how much effort a user puts into doing it. As such, our work can be viewed as advancing the recent human-in-the-loop (HILDA) line of research, by considering more expressive user actions and studying how to optimize their human-effort cost using RDBMS-style techniques.
RDBMS-Style Cleaning Systems:
Many cleaning works have also adopted an RDBMS-style operator framework, e.g., AJAX [22], Wisteria [23], Arnold [28], QuERy [4]. They, however, do not consider expressive human operations, modeling human actions at a coarse level, e.g., labeling a tuple or converting a dirty tuple into a clean one. In contrast, we model and estimate the cost of complex human operations, e.g., removing a value from a cluster, verifying whether a cluster is clean, etc. Finally, current work typically optimizes for the accuracy and time of cleaning algorithms (while assuming a ceiling on the human effort). In contrast, we minimize the human effort, which can be a major bottleneck in practice.
Interactive Cleaning Systems:
Another prominent body of work develops interactive cleaning systems (e.g., AJAX [22], Potter's Wheel [41], Wrangler [29], Trifacta [26], ALIAS [42], and [25]). Such systems often try to maximize cleaning accuracy, or efficiently build data transformation/cleaning scripts, while minimizing the user effort. To the best of our knowledge, however, they have not examined the problem of VN with 100% accuracy. For example, active learning-based approaches such as [42] do not tell the user what to do (to reach 100% accuracy) if, after using them, the accuracy of the cleaned dataset is still below 100%.
10 CONCLUSIONS & FUTURE WORK
We have examined the problem of value normalization with 100% accuracy. We have described Winston, an RDBMS-style solution that defines human operations, combines them with clustering algorithms to form hybrid plans, estimates plan costs (in terms of human verification and cleaning effort), then selects the best plan. Overall, our work here shows that it is indeed possible to apply an RDBMS-style solution approach to the problem of 100%-accurate cleaning. Going forward, we plan to open source our current VN solution, explore other clustering algorithms for VN, and explore applying the solutions here to other cleaning tasks, such as deduplication, outlier removal, extraction, and data repair.
REFERENCES
[1] 2018. OpenRefine open-source tool. openrefine.org.
[2] 2018. The value normalization capabilities of OpenRefine. https://github.com/OpenRefine/OpenRefine/wiki/Clustering.
[3] Z. Abedjan et al. 2016. Detecting Data Errors: Where are we and what needs to be done? PVLDB 9, 12 (2016), 993-1004.
[4] H. Altwaijry et al. 2015. QuERy: A Framework for Integrating Entity Resolution with Query Processing. PVLDB 9, 3 (2015), 120-131.
[5] A. Arasu et al. 2011. Towards a Domain Independent Platform for Data Cleaning. IEEE Data Eng. Bull. 34, 3 (2011), 43-50.
[6] S. Basu et al. 2002. Semi-supervised Clustering by Seeding. In ICML.
[7] M. Bilenko et al. 2004. Integrating constraints and metric learning in semi-supervised clustering. In ICML.
[8] K. Chakrabarti et al. 2012. A Framework for Robust Discovery of Entity Synonyms. In SIGKDD.
[9] S. Chaudhuri et al. 2006. Data Debugger: An Operator-Centric Approach for Data Quality Solutions. IEEE Data Eng. Bull. 29, 2 (2006), 60-66.
[10] X. Chu et al. 2016. Data Cleaning: Overview and Emerging Challenges. In SIGMOD.
[11] X. Chu et al. 2016. Distributed Data Deduplication. In VLDB.
[12] X. Chu and I. F. Ilyas. 2016. Qualitative Data Cleaning. PVLDB 9, 13 (2016).
[13] S. Das et al. 2017. Falcon: Scaling Up Hands-Off Crowdsourced Entity Matching to Build Cloud Services. In SIGMOD.
[14] A. Das Sarma et al. 2012. An automatic blocking mechanism for large-scale de-duplication tasks. In CIKM.
[15] T. Dasu and T. Johnson. 2003. Exploratory Data Mining and Data Cleaning. John Wiley.
[16] X. Dong et al. 2010. Global Detection of Complex Copying Relationships Between Sources. PVLDB 3, 1 (2010), 1358-1369.
[17] V. Efthymiou et al. 2015. Parallel Meta-blocking: Realizing Scalable Entity Resolution over Large, Heterogeneous Data. In Big Data.
[18] A. Fahad et al. 2014. A Survey of Clustering Algorithms for Big Data: Taxonomy and Empirical Analysis. IEEE Trans. Emerging Topics in Computing 2, 3 (2014), 267-279.
[19] D. Firmani et al. 2016. Online Entity Resolution Using an Oracle. PVLDB 9, 5 (2016), 384-395.
[20] M. J. Franklin et al. 2011. CrowdDB: answering queries with crowdsourcing. In SIGMOD.
[21] J. Freire et al. 2016. Exploring What not to Clean in Urban Data: A Study Using New York City Taxi Trips. IEEE Data Eng. Bull. 39, 2 (2016), 63-77.
[22] H. Galhardas et al. 2001. Declarative Data Cleaning: Language, Model, and Algorithms. In VLDB.
[23] D. Haas et al. 2015. Wisteria: Nurturing Scalable Data Cleaning Infrastructure. In VLDB.
[24] D. Haas et al. 2016. CLAMShell: Speeding up Crowds for Low-latency Data Labeling. In VLDB.
[25] J. He et al. 2016. Interactive and Deterministic Data Cleaning. In SIGMOD.
[26] J. Heer et al. 2015. Predictive Interaction for Data Transformation. In CIDR.
[27] A. K. Jain et al. 1999. Data Clustering: A Review. ACM Comput. Surv. 31, 3 (1999), 264-323.
[28] S. R. Jeffery et al. 2013. Arnold: Declarative Crowd-Machine Data Integration. In CIDR.
[29] S. Kandel et al. 2011. Wrangler: Interactive Visual Specification of Data Transformation Scripts. In SIGCHI. 3363-3372.
[30] Z. Khayyat et al. 2015. BigDansing: A System for Big Data Cleansing. In SIGMOD.
[31] L. Kolb et al. 2011. Parallel Sorted Neighborhood Blocking with MapReduce. In BTW.
[32] P. Konda et al. 2016. Magellan: Toward Building Entity Matching Management Systems. PVLDB 9, 12 (2016), 1197-1208.
[33] S. Krishnan et al. 2016. ActiveClean: Interactive Data Cleaning For Statistical Modeling. PVLDB 9, 12 (2016).
[34] A. Marcus et al. 2011. Crowdsourced databases: Query processing with people. In CIDR.
[35] J. McCrae and N. Collier. 2008. Synonym set extraction from the biomedical literature by lexical pattern discovery. BMC Bioinformatics.
[36] Psychological Review 63, 2 (1956).
[37] B. Mozafari et al. 2014. Scaling Up Crowd-Sourcing to Very Large Datasets: A Case for Active Learning. In VLDB.
[38] A. G. Parameswaran and N. Polyzotis. 2011. Answering Queries using Humans, Algorithms and Databases. In CIDR.
[39] H. Park and J. Widom. 2013. Query Optimization over Crowdsourced Data. In VLDB.
[40] E. Rahm and H. H. Do. 2000. Data Cleaning: Problems and Current Approaches. IEEE Data Eng. Bull. 23, 4 (2000).
[41] V. Raman and J. M. Hellerstein. 2001. Potter's Wheel: An Interactive Data Cleaning System. In VLDB.
[42] S. Sarawagi et al. 2002. ALIAS: An Active Learning led Interactive Deduplication System. In VLDB.
[43] S. Van Dongen. 2008. Graph Clustering Via a Discrete Uncoupling Process. SIAM J. Matrix Anal. Appl. 30, 1 (2008).
[44] V. Verroios et al. 2017. Waldo: An Adaptive Human Interface for Crowd Entity Resolution. In SIGMOD.
[45] V. Verroios and H. Garcia-Molina. 2015. Entity Resolution with crowd errors. In ICDE.
[46] J. Wang et al. 2012. CrowdER: Crowdsourcing Entity Resolution. PVLDB 5, 11 (2012), 1483-1494.
[47] J. Wang et al. 2013. Leveraging Transitive Relations for Crowdsourced Joins. In SIGMOD.
[48] R. Xu and D. Wunsch, II. 2005. Survey of Clustering Algorithms. IEEE Trans. Neural Netw. 16, 3 (2005), 645-678.
[49] A. Yates and O. Etzioni. 2009. Unsupervised Methods for Determining Object and Relation Synonyms on the Web. J. Artif. Int. Res. 34, 1 (2009), 255-296.
Algorithm 3 Merge Phase

Procedure Merge(L)
Input: a list L of values representing the output clusters of the Split phase
Output: a set C of clean clusters of the values in L
1: LocalMerge(L)
2: L <- consolidated list of values from the LocalMerge step
3: return GlobalMerge(L)

Procedure GlobalMerge(L)
Input: a list L of values sorted alphabetically
Output: a set S of clean clusters of the values in L
1: while |L| > 0 do
2:   if |L| < 3 then B <- [L[1], L[2]] else B <- [L[1], L[2], L[3]]  // B is the list of values to be displayed on the columns
3:   MarkValuesForGlobalMerge(B, L)  // at the end, the user selects the "global merge" button
4:   for i <- 1, ..., |B| do
5:     Merge B[i] and the values marked to match it into cluster s_i
6:     Remove the values in s_i from L; S <- S U {s_i}
7: return S

Procedure MarkValuesForGlobalMerge(B, D)
Input: a list B of values on the columns, a list D of values on the rows
Output: links among the values in B and D that match
1: for each i <- 1, ..., |B| do
2:   v <- B[i]; (r, t) <- memorize(v)
3:   if r is not null then focus(h_{r,t}); select(h_{r,t})  // h_{x,y} is a checkbox to be selected if x and y match
4: for each j <- 1, ..., |D| do
5:   w <- D[j]; (r, t) <- recall(w)
6:   if r is not null then focus(h_{r,t}); select(h_{r,t})

A DEFINING THE HUMAN PART
Algorithm 3 describes the Merge and GlobalMerge procedures (LocalMerge has been described in Section 3.2).
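As a rough illustration, the GlobalMerge loop above can be sketched in Python as follows. The `user_matches` oracle stands in for the human marking matching values on the GUI, and all names here are ours, chosen for illustration:

```python
def global_merge(values, user_matches):
    """Sketch of the GlobalMerge loop: repeatedly display up to three
    'column' values, let the user mark which of the displayed values match
    each column, and merge the marked values into clusters.

    values       : list of representative values (e.g., sorted alphabetically)
    user_matches : oracle standing in for the human; given a column value and
                   the candidate values on screen, returns those that match.
    """
    values = list(values)
    clusters = []
    while values:
        cols, rest = values[:3], values[3:]
        consumed = set()
        for col in cols:
            if col in consumed:
                continue
            candidates = [v for v in cols + rest
                          if v != col and v not in consumed]
            matched = user_matches(col, candidates)
            cluster = {col, *matched}     # merge col with its marked values
            clusters.append(cluster)
            consumed |= cluster
        values = [v for v in rest if v not in consumed]
    return clusters

# Toy "user" that matches values case-insensitively.
oracle = lambda col, rows: [r for r in rows if r.lower() == col.lower()]
out = global_merge(["IBM", "ibm", "intel", "Intel Corp."], oracle)
# out contains the clusters {"IBM", "ibm"}, {"intel"}, {"Intel Corp."}
```

This sketch resolves one batch of up to three column values per pass, mirroring the three-column GUI described in the paper.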
B ESTIMATING PLAN COSTS
In this section we describe the cost estimation formulas for the various procedures used in the human part of value normalization plans, and how we derived them.
SplitCluster Procedure:
To estimate the cost of applying SplitCluster to a cluster c_i during the execution of a plan p_i, we consider the following three cases, depending on the purity alpha_i of c_i.

Case 1 (alpha_i high): We estimate the cost of applying SplitCluster to c_i as

cost_SplitCluster(c_i; alpha_i high) = Sum_{j=1}^{beta_i} [
    cost_isPure((1 - alpha_i)^{j-1} n_i, alpha_i) + c_f + c_s        (f_1)
  + cost_findDom((1 - alpha_i)^{j-1} n_i) + c_f + c_s                (f_2)
  + ((1 - alpha_i)^{j-1} n_i)(c_f + c_m + (1 - alpha_i) c_s)         (f_{3,1})
  + c_f + c_s ].                                                     (f_{3,2})

Recall that we go through beta_i iterations of splitting c_i, and at iteration j in {1, ..., beta_i} we split an impure cluster of approximate size (1 - alpha_i)^{j-1} n_i; e.g., at iteration 1 we split the whole cluster of size (1 - alpha_i)^0 n_i = n_i. Each iteration corresponds to a (recursive) call of SplitCluster. In each execution of SplitCluster, there are three lines (numbered 2, 4, and 7 in Algorithm 1) involving user operations, and thus only these lines contribute to the cost of the procedure.

The cost of line 2 is captured by part f_1 of the above formula: it consists of the cost of isPure (executed on a cluster of size (1 - alpha_i)^{j-1} n_i) and then focusing on and selecting the "no" button. Part f_2 captures the cost of line 4: it consists of the cost of findDom and then focusing on and selecting the "mark values" button. Finally, parts f_{3,1} and f_{3,2} capture the cost of line 7: f_{3,1} is the cost of MarkValues and f_{3,2} is the cost of focusing on and selecting the "create/clean new cluster" button. Part f_{3,1} in turn consists of going through the cluster values (line 3 in the MarkValues pseudocode), focusing on each value, matching it with the dominating entity of the cluster, and selecting the value if they match (i.e., for a 1 - alpha_i fraction of the values).

Case 2 (alpha_i medium): We estimate the cost of applying SplitCluster to c_i as

cost_SplitCluster(c_i; alpha_i medium) = Sum_{j=1}^{beta_i} [ cost_isPure((1 - alpha_i)^{j-1} n_i, alpha_i) + c_f + c_s + cost_findDom((1 - alpha_i)^{j-1} n_i) + c_f + c_s + ((1 - alpha_i)^{j-1} n_i)(c_f + c_m + alpha_i c_s) + c_f + c_s ].

The derivation is very similar to the previous case. The only difference is the fraction of matching values at each execution of MarkValues, which is alpha_i instead of 1 - alpha_i.

Case 3 (alpha_i low): We estimate the cost of applying SplitCluster to c_i as

cost_SplitCluster(c_i; alpha_i low) =
    cost_isPure(n_i, alpha_i) + c_f + c_s      (f'_1)
  + cost_findDom(n_i) + c_f + c_s              (f'_2)
  + cost_LocalMerge(c_i)                       (f'_3)
  + cost_GlobalMerge(c_i).

Here f'_1 is the cost of executing isPure on c_i and then focusing on and selecting the "no" button, and f'_2 is the cost of executing findDom and then focusing on and selecting the "clean mixed cluster" button. f'_3 is the cost of executing LocalMerge on c_i, and the rest of the formula is the cost of executing GlobalMerge on the result of the previous step. We describe the costs of LocalMerge and GlobalMerge in the following sections.
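The Case 1 / Case 2 estimate can be sketched as follows. The cost-model callables and all parameter names are ours, chosen for illustration:

```python
def split_cluster_cost(n, alpha, beta, cost_is_pure, cost_find_dom,
                       c_f, c_s, c_m, high_purity=True):
    """Sketch of the Case 1 / Case 2 SplitCluster cost estimate.

    n, alpha, beta : cluster size n_i, purity alpha_i, and number of split
                     iterations beta_i
    cost_is_pure(m, alpha), cost_find_dom(m) : calibrated per-user cost models
    c_f, c_s, c_m  : focus, select, and match operation costs
    high_purity    : True for Case 1 (a 1-alpha fraction of values is
                     selected in MarkValues), False for Case 2 (alpha)
    """
    select_frac = (1 - alpha) if high_purity else alpha
    total = 0.0
    for j in range(1, beta + 1):
        m = (1 - alpha) ** (j - 1) * n  # size of the cluster split at iter j
        total += cost_is_pure(m, alpha) + c_f + c_s   # line 2: isPure + "no"
        total += cost_find_dom(m) + c_f + c_s         # line 4: findDom + "mark values"
        total += m * (c_f + c_m + select_frac * c_s)  # line 7: MarkValues
        total += c_f + c_s                            # "create/clean new cluster"
    return total

# With zero-cost isPure/findDom models and unit operation costs:
est = split_cluster_cost(8, 0.5, 1, lambda m, a: 0.0, lambda m: 0.0,
                         c_f=1.0, c_s=1.0, c_m=1.0)
# est == 26.0, i.e., 2 + 2 + 8 * (1 + 1 + 0.5) + 2
```

The per-iteration terms map one-to-one onto parts f_1, f_2, f_{3,1}, and f_{3,2} of the formula above.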
LocalMerge Procedure:
Recall that for a particular plan p_i, the result of the Split phase consists of approximately e_i pure clusters of input values. Thus the size of the input list L to LocalMerge is e_i. User u goes through the values in L and first memorizes each value. For a 1 - p fraction of the values in L, u finds a matching value in his or her short-term memory (STM), in which case he or she (1) selects the current value, then (2) focuses on and selects the value retrieved from STM, and finally (3) focuses on and selects the link button. Finally, the user focuses on and selects the "done local merging" button to proceed to global merging. Adding up these costs gives the cost formula e_i c_z + e_i (1 - p)(2 c_f + 3 c_s) + c_f + c_s. GlobalMerge Procedure:
GlobalMerge takes as input a list of values L with approximate size e'_i. GlobalMerge consists of possibly several iterations, and in each iteration we assume that each of the three values displayed on the columns of the GUI matches approximately q e'_i of the values, so each iteration resolves about 3 q e'_i values. Thus the number of iterations of GlobalMerge is approximately Gamma_i = floor(e'_i / (3 q e'_i)) = floor(1 / (3 q)). At iteration k in {1, ..., Gamma_i} the user sees e'_i - 3(k - 1) q e'_i values remaining to be matched, three of which are displayed on the columns and the rest on the rows of the GUI. The user first memorizes the three values on the columns (with total cost 3 c_z), then focuses on each of the remaining row values, additionally selects the roughly 3 q e'_i - 3 of them that match a column value, and finally focuses on and selects the "global merge" button. Adding up these costs over all iterations gives

Sum_{k=1}^{floor(1/(3q))} [ 3 c_z + (e'_i - 3(k - 1) q e'_i - 3) c_f + (3 q e'_i - 3)(c_f + c_s) + c_f + c_s ].

Estimating the Costs of User Operations:
We now describe how we estimate the cost c_m of the match operation, the parameters gamma_1 and gamma_2 of the isPure operation cost function, and the parameters d_1, d_2, and d_3 of the findDom operation cost function during the calibration stage.

To estimate c_m, we first pick three pairs of random values of V. We then ask the user u to match each pair and, depending on whether they match or not, to select a "yes" or "no" button. For the k-th pair we measure the time t_{m,k} it takes from when we show the screen containing the pair of values and the buttons to u until he or she selects one of the buttons. During this time the user matches the values shown on the screen, then focuses on one of the buttons and selects it. Hence the time we measure equals c_{m,k} + c_f + c_s, where c_{m,k} is our estimated cost of the k-th match operation. We then calculate c_{m,k} = t_{m,k} - (c_f + c_s) and estimate c_m as the average of the c_{m,k}'s, i.e., c_m = Sum_{k=1}^{3} c_{m,k} / 3.

To estimate gamma_1 and gamma_2, we first pick three random non-singleton clusters C_1, C_2, and C_3 (of different sizes if possible) from C. Then we show each C_k and ask u to select a "yes" button if C_k is pure and a "no" button otherwise. We record the time t_{p,k} it takes from when we show C_k to u until one of the buttons is selected. We also record which button is selected. Using the same timing analysis we described for c_m, we form three equations of the form gamma_1 alpha(C_k) m_k + gamma_2 = t_{p,k} - (c_f + c_s), where k in {1, 2, 3}, alpha(C_k) is the purity of C_k, and m_k is the size of C_k. However, since we do not know the purity of the C_k's, we use the button u has selected to guess the purity of C_k: if u has selected the "yes" button, we set alpha(C_k) = 1; otherwise we set alpha(C_k) to the value predicted by the purity function (whose parameters we have already estimated during the calibration of the purity function). Finally, we use the ordinary least-squares method to solve the system of three equations above and estimate the parameters gamma_1 and gamma_2.

To estimate d_1, we first pick three clusters C'_1, C'_2, and C'_3 from C whose sizes m'_k = |C'_k| are at most the cluster-size threshold. We then show each C'_k to u and ask him or her to find the dominating entity of C'_k, and then select a value in C'_k which refers to dom(C'_k). For each C'_k we measure the time t_{d1,k} it takes from when it is shown to u until he or she selects the value referring to dom(C'_k). Using the same timing analysis as above, we obtain three equations of the form t_{d1,k} = d_1 m'_k + c_f + c_s. We then solve each equation for d_1 and average the three resulting numbers, estimating d_1 as (1/3) Sum_{k=1}^{3} (t_{d1,k} - (c_f + c_s)) / m'_k.

To estimate d_2 and d_3, we follow a similar process: we first pick three clusters C''_1, C''_2, and C''_3 from C whose sizes m''_k = |C''_k| exceed the threshold. We then show each C''_k to u and ask him or her to find the dominating entity of C''_k and then select a value in C''_k which refers to dom(C''_k). For each C''_k we measure the time t_{d2,k} it takes from when it is shown to u until he or she selects the value referring to dom(C''_k). Using the same timing analysis as above, we obtain three equations of the form t_{d2,k} = d_2 m''_k + d_3 + c_f + c_s. Finally, we use the ordinary least-squares method to solve the system of three equations above and estimate the parameters d_2 and d_3.

C WORKING WITH MULTIPLE USERS
We now describe cWinston, which extends Winston to work with multiple users. Assuming g users want to collaborate to normalize a set V of input values, cWinston goes through four main stages. In the first stage, it shows each user a few clusters of values in V and asks them to perform some basic operations on them. cWinston then uses the results of these operations to tune the purity function parameters and user operation cost models for each user (in the same way as Winston). Next, it averages the purity function and cost model parameters to create a single purity function and a single cost model for each user operation.

In the second stage, cWinston uses the above purity function and user operation cost models to find the best VN plan. To do so, it uses the same plan-space search procedure as Winston (see Section 6). It then executes the machine part of the best plan to obtain a set C of clusters.

In the third stage, cWinston partitions C into g subsets with roughly the same number of values. It then assigns each subset to one of the users and asks them to clean their respective subsets of clusters using the Split and Merge algorithms.

In the last stage, cWinston starts by collecting the results of
Split + Merge from all the users; for each user, it creates a list of representative values of the clean clusters he or she has produced. It then picks the longest list and divides it into g chunks of roughly the same size. Next, cWinston asks each user to merge one of these chunks with the rest of the lists using the GlobalMerge procedure (see Section 6). It then collects the results from all the users and repeats this stage (i.e., takes representative values from the merged clusters, divides the largest list into g chunks, and so on) until all of the lists are verified/merged. Algorithm 4 shows the pseudocode of cWinston.

Estimating the Cost of cWinston: Next, we describe how we estimate the cost of cWinston. To do so, we traverse Algorithm 4, using the same assumptions as we discussed in Section 5, to arrive at the following cost formula:

cost_cWinston = max_{1 <= k <= g} ( cost_{Split,u_k}(C'_k) + cost_{LocalMerge,u_k}(E_k) + cost_{GlobalMerge,u_k}(L_k) ) + cost_{MultiUserMerge,U}(S),

where g is the number of users, the max term finds the longest it takes any of the users u_k to perform Split + Merge on their respective partitions, and the last term is the cost of the multi-user merge, calculated using the following formula:

cost_{MultiUserMerge,U}(S) = Sum_{t=1}^{g-1} max_{u in U} [ (|D_t| (1 - r_{t-1} q) / (3 g)) Sum_{j=1}^{g-t} ( 3 c_{z,u} + (3 q + 1)(c_{f,u} + c_{s,u}) + rho |D_j| (1 - (r_{t-1} + 1) q) c_{f,u} ) ].

In the above formula, D_t is the largest list of representative values at iteration t, r_t is the number of entities that are completely merged and removed from the current list D after iteration t, the c_{.,u}'s are the costs of the human operations for user u, rho is the proportion of the rows which, on average, must be examined before all columns are matched, and q is described in Section 5.

Here is how we derive this formula. The outer summation corresponds to choosing the longest list of representative values and dividing it among the users. The max operation chooses the longest it takes any of the users to perform each one of the above iterations, i.e., the longest path, which determines the time it takes the users to collaboratively perform MultiUserMerge from start to finish.

The middle sum corresponds to the scans of the remaining lists of representative values that each user u has to perform at each iteration. To determine the number of such scans per iteration, we need the number of column values that u has to read and memorize for merging. This number equals the portion of the current list D_t still remaining to be merged by u, which is |D_t| (1 - r_{t-1} q) / g. Since we show u three column values at a time, we divide the above number by three to arrive at the number of scans |D_t| (1 - r_{t-1} q) / (3 g). During each of these scans, u memorizes three column values, hence the 3 c_{z,u} term.

Algorithm 4 cWinston
Procedure cWinston(V, U)
Input: a set V of input values, a set U = {u_1, ..., u_g} of users
Output: a set S of clean clusters
1: for each u_k in U do
2:   A_k <- tune the purity function and cost model parameters for u_k
3: p* <- search the plan space to find the best plan using the A_k's
4: C <- run HAC(p*) on V
5: Divide C into C' = {C'_1, C'_2, ..., C'_g} s.t. |C'_k| ~ |C| / g; S <- empty set
6: for each u_k in U do  // each user u_k executes Winston on C'_k
7:   D_k <- Split(C'_k); E_k <- list of representative values of the clusters in D_k
8:   LocalMerge(E_k); L_k <- consolidated list of values from the LocalMerge step
9:   S <- S U GlobalMerge(L_k)
10: if |U| = 1 then return S  // effectively, (single-user) Winston
11: else return MultiUserMerge(S, U)

Procedure MultiUserMerge(S, U)
Input: a set S of clean clusters, a set U = {u_1, ..., u_g} of users
Output: a set S' of clean clusters
1: D <- empty set  // representative values for the sets of clusters in S
2: S' <- empty set  // flattened S
3: for each S_k in S do
4:   D_k <- a set of values representing the clusters in S_k; S' <- S' U S_k; D <- D U {D_k}
5: while |D| > 1 do
6:   D* <- argmax_{D_k in D} |D_k|; D <- D \ {D*}; M <- empty set  // matches
7:   Divide D* into L = {L_1, L_2, ..., L_g} s.t. |L_k| ~ |D*| / g
8:   for each L_k in L do  // all users perform merging with their respective column values in parallel
9:     M <- M U GroupedMerge(L_k, copy(D), u_k)
10:  for each D_k in D do  // resolve matches
11:    for each (v, w) in M do
12:      D_k <- D_k \ {v, w}
13:      Merge the clusters in S' which v and w refer to, and set v to refer to the new cluster
14: return S'

Procedure GroupedMerge(L, D, u)
Input: a list L of column values, a set D of representative value sets, a user u
Output: a set M of matches
1: M <- empty set
2: while |L| > 0 do  // while there are still column values left
3:   k <- min(3, |L|); B <- k values from L
4:   for each b in B do memorize(b)
5:   for each D_k in D do
6:     M <- M U SetMerge(B, D_k, u)
7:   L <- L \ B
8: return M

Procedure SetMerge(B, D_k, u)
Input: a set B of column values, a set D_k of representative values, a user u
Output: a set M' of matches
1: Sort D_k according to similarity to the values in B; M' <- empty set
2: for each v in D_k do
3:   if recall(v) then
4:     M' <- M' U {(b, v)} s.t. b in B and b matches v
5:     if |M'| = |B| then break
6:   D_k <- D_k \ {v}
7: return M'

The inner-most sum corresponds to the number of rows examined per each set of three representative values during each scan. At iteration t, there are g - t lists of representative values left to appear on the rows. Each time u scans one of these lists, he or she matches on average 3q rows, each of which requires a button click; additionally, u has to click the merge button, hence the term (3q + 1)(c_{f,u} + c_{s,u}). To account for the number of rows examined in each list before finding the matches, we calculate the number of rows left in the list by finding the number of entities removed from the list by the end of this iteration, i.e., |D_j| (r_{t-1} + 1) q, and then subtracting this value from |D_j|. We also assume that only a rho proportion of these rows needs to be investigated before finding the matches, hence the term rho |D_j| (1 - (r_{t-1} + 1) q).

D EMPIRICAL EVALUATION
Generating Synthetic Users:
We use a deterministic model of a user, i.e., we use constant values for the operation costs c_m, c_f, c_s, c_z, the isPure parameters gamma_1 and gamma_2, and the findDom parameters d_1, d_2, and d_3. To generate a synthetic user, we first assume a constant value for c_f and c_s, i.e., c_f = c_s = 0.5. We then assume a range of values for each of c_m, d_1, gamma_1, gamma_2, and p. Next, we generate a random simulated user by uniformly randomly sampling a number from each of the above ranges and assigning these values to the corresponding parameters of the cost model. Finally, we assign the remaining parameters, setting c_z = c_m and deriving d_2 and d_3 from d_1 and the cluster-size threshold.
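The sampling procedure above can be sketched as follows. The concrete sampling ranges are illustrative placeholders (they did not survive extraction from the source), and the fixed value c_f = c_s = 0.5 is as described in the text:

```python
import random

# Placeholder sampling ranges for the cost-model parameters; these ranges
# are assumptions made for this sketch, not the paper's actual values.
PARAM_RANGES = {
    "c_m":    (0.5, 2.0),
    "d1":     (0.1, 1.0),
    "gamma1": (0.1, 1.0),
    "gamma2": (0.5, 2.0),
    "p":      (0.1, 0.5),
}

def generate_synthetic_user(rng=random):
    """Generate one deterministic simulated user: fix c_f = c_s = 0.5,
    uniformly sample the remaining parameters from the ranges above, then
    derive the tied parameters (here only c_z = c_m)."""
    user = {"c_f": 0.5, "c_s": 0.5}
    for name, (lo, hi) in PARAM_RANGES.items():
        user[name] = rng.uniform(lo, hi)
    user["c_z"] = user["c_m"]  # remaining parameter derived, per the text
    return user

u = generate_synthetic_user()
```

Passing a seeded `random.Random` instance as `rng` makes the simulated users reproducible across experiment runs.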